Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick

CSAIL, MIT, e-mail: [email protected]; [email protected]; [email protected]
Department of Statistics & Mathematics, Boston University, e-mail: [email protected]
Microsoft Research, e-mail: [email protected]
Abstract:
Bayesian nonparametrics based on completely random measures (CRMs) offers a flexible modeling approach when the number of clusters or latent components in a dataset is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which (1) makes IFAs well-suited for parallel and distributed computation and (2) facilitates more convenient inference schemes. While IFAs have been developed in certain special cases in the past, there has not yet been a general template for their construction or a systematic comparison to TFAs. We show how to construct IFAs for approximating distributions in a large family of CRMs, encompassing all those typically used in practice. We quantify the approximation error between IFAs and the target nonparametric prior, and prove that, in the worst case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.
1. Introduction
Many data analysis problems can be seen as discovering a latent set of traits in a population. For instance, we might recover topics or themes from scientific papers, ancestral populations from genetic data, interest groups from social network data, or unique speakers across audio recordings of many meetings (Palla, Knowles and Ghahramani, 2012; Blei, Griffiths and Jordan, 2010; Fox et al., 2010). In all of these cases, we might reasonably expect the number of latent traits present in a data set to grow with the size of the data. One modeling option is to choose a different prior for different data set sizes, but this approach is unwieldy and inconvenient. A simpler option is to choose a single prior that naturally yields different expected numbers of traits for different numbers of data points. In theory,
Bayesian nonparametrics provides a rich set of priors with exactly this desirable property thanks to a countable infinity of traits, so that there are always more traits to reveal through the accumulation of more data. This latent, infinite-dimensional parameter presents a major practical challenge, though. In what follows, we propose a simple approximation that applies across a wide range of BNP models and can be seen as a generalization of certain existing special cases. Furthermore, it is amenable to modern, efficient inference schemes and black-box code; fits easily within complex, potentially deep generative models; and admits straightforward parallelization.

Background
A particular challenge of the infinite-dimensional parameter is that it is impossible to store an infinity of random variables in memory or learn the distribution over an infinite number of variables in finite time. Some authors have developed conjugate priors and likelihoods (Orbanz, 2010) to circumvent the infinite representation via marginalization and thereby perform exact Bayesian posterior inference (Broderick, Wilson and Jordan, 2018; James, 2017). However, these priors and likelihoods are often just a single piece within a more complex generative model, which is no longer fully conjugate and therefore requires an approximate posterior inference scheme such as Markov chain Monte Carlo (MCMC) or variational Bayes (VB). Some local steps in, e.g., an MCMC sampler can still take advantage of conditional conjugacy via special marginal forms such as the Chinese restaurant process (Teh et al., 2006) or the Indian buffet process (Griffiths and Ghahramani, 2005); see Broderick, Wilson and Jordan (2018) and James (2017) for general treatments. But using these marginal distributions rather than a full and explicit representation of the latent variables typically necessitates a Gibbs sampler, which can be slow to mix and may require special-purpose, model-specific sampling moves. To take advantage of black-box variational inference methods (Ranganath, Gerrish and Blei, 2014; Kucukelbir et al., 2015), modern MCMC methods such as the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996) or Hamiltonian Monte Carlo (HMC) (Neal, 2011; Betancourt, 2017), or modern probabilistic programming systems such as Stan (Carpenter et al., 2017), a full trait representation is generally required. An alternative approach that still allows use of these convenient inference methods is to approximate the infinite-dimensional prior with a finite-dimensional prior that essentially replaces the infinite collection of random traits by a finite subset of "likely" traits. Unlike a fixed finite-dimensional prior across all data set sizes, this finite-dimensional prior is seen as an approximation to the BNP prior, and thereby its cardinality is informed directly by the BNP prior. Note that any moderately complex model will necessitate approximate inference, so as long as the approximation error from using the finite-dimensional prior is on the order of the approximation error from MCMC or VB, no inferential quality has been lost.

Much of the previous work on finite approximations developed and analyzed truncations of the random measures underlying the nonparametric prior (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Roychowdhury and Kulis, 2015; Campbell et al., 2019); we call these truncated finite approximations (TFAs) and refer to Campbell et al. (2019) for a thorough study of constructions for TFAs. In the present work, we instead consider a finite approximation consisting of independent and identically distributed (i.i.d.) representations of the traits together with their rates within the population; we call these independent finite approximations (IFAs). The IFA approach has the potential to be simpler to incorporate in a complex hierarchical model, to exhibit improved mixing, and to be amenable to parallelizing computation during inference. There are not many known finite approximations using i.i.d. random variables, and we are unaware of any general-purpose results on constructing them.
Our Contributions
We propose a construction for IFAs that subsumes a number of special cases which have already been successfully used in applications, with practitioners reporting similar performance to the truncation approach but with faster mixing (Kurihara, Welling and Teh, 2007; Saria, Koller and Penn, 2010; Fox et al., 2010; Johnson and Willsky, 2013). On the other hand, our construction is distinct from that presented in Lee, James and Choi (2016), which has an arguably smaller scope of application. We propose a broad mechanism for our i.i.d. finite approximations and relate them to existing work. We then quantify the effect of replacing the infinite-dimensional priors with an IFA in probabilistic models, providing interpretable error bounds with explicit dependence on the size of the approximation and the data cardinality. The error bounds reveal that, in the worst case, approximating the target to a given accuracy requires a large IFA model where a small TFA model would suffice. However, such differences have not been observed in practice, and we confirm through experiments with image denoising and topic modeling that IFAs and TFAs perform similarly on applied problems, with IFAs benefiting from conceptual ease of use.
2. Background
We start by summarizing relevant background on nonparametric priors constructed from completely random measures, and on how truncated and independent finite approximations for these priors are constructed. Let $\psi_i$ represent the $i$-th trait of interest and let $\theta_i$ represent the rate, or frequency, of this trait in the population. We can collect the pairs of traits with their frequencies $(\psi_i, \theta_i)$ in a measure that places non-negative mass $\theta_i$ at location $\psi_i$: $\Theta := \sum_{i=1}^{I} \theta_i \delta_{\psi_i}$. The total number of traits $I$ may be finite or, as in the nonparametric setting, countably infinite. To perform Bayesian inference, we need to choose a prior distribution on $\Theta$ and a likelihood for the observed data $Y_{1:N} := \{Y_n\}_{n=1}^{N}$ given $\Theta$, and then we must apply Bayes' theorem to obtain the posterior on $\Theta$ given the observed data.
Completely random measures

Most common BNP priors can be conveniently formulated as (normalizations of) completely random measures (CRMs). CRMs are constructed from Poisson point processes, which are straightforward to manipulate both analytically and algorithmically. Consider a Poisson point process on $\mathbb{R}_+ := [0, \infty)$ with rate measure $\nu(\mathrm{d}\theta)$ such that $\nu(\mathbb{R}_+) = \infty$ and $\int \min(1, \theta)\, \nu(\mathrm{d}\theta) < \infty$. Such a process generates an infinite number of rates $(\theta_i)_{i=1}^{\infty}$, $\theta_i \in \mathbb{R}_+$, having an almost surely finite sum $\sum_{i=1}^{\infty} \theta_i < \infty$. We assume throughout that $\psi_i \in \Psi$ for some space $\Psi$ and $\psi_i \overset{\text{i.i.d.}}{\sim} H$ for some diffuse distribution $H$. $H$ serves as a prior on the trait values: in topic modeling, each topic is a probability vector in the simplex of vocabulary words, and it is typical to use $H = \mathrm{Dir}$. The resulting measure $\Theta$ in this case is a completely random measure (CRM) (Kingman, 1967). As shorthand, we will write $\mathrm{CRM}(H, \nu)$ for the completely random measure generated as just described: $\Theta := \sum_i \theta_i \delta_{\psi_i} \sim \mathrm{CRM}(H, \nu)$. The corresponding normalized CRM (NCRM) is $\Xi := \Theta / \Theta(\Psi)$, which is a discrete probability measure. The set of atom locations of $\Xi$ is the same as that of $\Theta$, while the atom sizes are normalized: $\Xi = \sum_i \xi_i \delta_{\psi_i}$, where $\xi_i = \theta_i / (\sum_j \theta_j)$.
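In code, a draw with finitely many atoms can be represented by parallel arrays of weights and locations. The following minimal sketch (ours, for illustration only; the Beta weights and uniform base measure $H$ are placeholder assumptions) shows the normalization step that turns $\Theta$ into $\Xi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder finitely supported measure: I atoms with weight theta_i at psi_i.
# Beta weights and a uniform base measure H are illustrative assumptions only.
I = 20
theta = rng.beta(0.1, 1.0, size=I)   # non-negative atom weights theta_i
psi = rng.uniform(0.0, 1.0, size=I)  # atom locations psi_i ~ H

total_mass = theta.sum()             # Theta(Psi), the total mass
xi = theta / total_mass              # normalized weights of Xi = Theta/Theta(Psi)
assert np.isclose(xi.sum(), 1.0)     # Xi is a probability measure
```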
Finite approximations

Since the sequence $(\theta_i)_{i=1}^{\infty}$ is countably infinite, it may be difficult to simulate or perform posterior inference in the full model. One approximation scheme is to define the finite approximation $\Theta_K := \sum_{i=1}^{K} \theta_i \delta_{\psi_i}$. Since it involves a finite number of parameters, $\Theta_K$ can be used for efficient posterior inference, including with black-box MCMC and VB algorithms, but some approximation error is introduced by not using the full CRM $\Theta$. (The possible fixed-location and deterministic components of an (N)CRM (Kingman, 1967) are not considered here for brevity; these components can be added, assuming they are purely atomic, and our analysis modified without undue effort.)

A truncated finite approximation (TFA) (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Roychowdhury and Kulis, 2015) requires constructing an ordering on the sequence $(\theta_i)_{i=1}^{\infty}$ such that $\theta_i$ is a function of some auxiliary random variables $\xi_1, \dots, \xi_i$; hence, $\theta_{i+1}$ reuses the same auxiliary randomness as $\theta_i$, plus an additional random variable $\xi_{i+1}$. Thus, the value of $\theta_{i+1}$ implicitly depends on the values of $\theta_1, \dots, \theta_i$. Truncated finite approximations are attractive because the approximations are nested in $K$: in general, the approximation quality increases with $K$, and to refine an existing truncation, it suffices to generate the next terms in the sequence. On the other hand, the complex dependences between the atoms $\theta_1, \theta_2, \dots$ potentially make inference more challenging.

We here instead pursue what we call an independent finite approximation (IFA), which involves choosing a sequence of probability measures $\nu_1, \nu_2, \dots$ such that for any approximation level $K$, we choose $\theta_1, \dots, \theta_K \overset{\text{i.i.d.}}{\sim} \nu_K$. The $\nu_K$ are chosen in such a way that $\Theta_K \overset{D}{\Longrightarrow} \Theta$ as $K \to \infty$; that is, the IFAs converge in distribution to the CRM. The pros and cons of the IFA invert those of the TFA: the atoms are now i.i.d., potentially making inference easier, but a completely new approximation must be constructed if $K$ changes. Existing work (Paisley and Carin, 2009; Broderick et al., 2015; Acharya, Ghosh and Zhou, 2015; Lee, James and Choi, 2016; Lee, Miscouridou and Caron, 2019) has only developed i.i.d. finite approximations on a case-by-case basis, whereas our focus is a general-purpose mechanism.

For the normalized atom sizes $\xi_i = \theta_i / \sum_j \theta_j$, finite approximations also involve random measures with finite support, $\Xi_K = \sum_{i=1}^{K} \xi_i \delta_{\psi_i}$. TFAs can be defined in one of two ways. In the first approach, the TFA corresponding to the CRM is normalized to form the approximation of the NCRM (Campbell et al., 2019). The second approach instead directly constructs an ordering over the sequence $(\xi_i)_{i=1}^{\infty}$ and truncates this representation (Ishwaran and James, 2001; Blei and Jordan, 2006). Regarding the independent approach, we will only normalize the IFAs that target a given CRM to form the approximation of the corresponding NCRM.
The beta process

For concreteness, we consider the beta process (Teh and Görür, 2009; Broderick, Jordan and Pitman, 2012) as a running example of a CRM. We denote its distribution as $\mathrm{BP}(\gamma, \alpha, d)$, with discount parameter $d \in [0, 1)$, concentration parameter $\alpha > -d$, mass parameter $\gamma > 0$, and rate measure
$$\nu(\mathrm{d}\theta) = \gamma \frac{\Gamma(\alpha + 1)}{\Gamma(1 - d)\Gamma(\alpha + d)}\, \mathbb{1}[\theta \le 1]\, \theta^{-d-1} (1 - \theta)^{\alpha + d - 1}\, \mathrm{d}\theta.$$
The case in which $d = 0$ is the standard beta process (Hjort, 1990; Thibaux and Jordan, 2007). The beta process is typically paired with the Bernoulli likelihood process $l(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$; the combination has been used for factor analysis (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012) and dictionary learning (Zhou et al., 2009).
3. Constructing independent finite approximations
We first show how to easily construct independent finite approximations to a completely random measure. Specifically, our first main result shows how to construct IFAs that converge in distribution to CRMs with rate measures of a particular form. As an important special case, if the CRM is an exponential family CRM (Broderick, Wilson and Jordan, 2018) and the "discount" parameter $d = 0$, then the IFA is constructed from random variables in the same exponential family, a connection which is useful not only for approximate inference algorithms but also for the theoretical analysis of the approximation itself. Finally, we show how normalized IFAs converge to the corresponding NCRM, in the sense that the partition induced by the IFA converges to that induced by the NCRM.

Formally, IFAs take the following form. For probability measures $H$ and $\nu_K$, write $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$ if
$$\Theta_K = \sum_{i=1}^{K} \theta_{K,i} \delta_{\psi_{K,i}}, \qquad \theta_{K,i} \overset{\text{i.i.d.}}{\sim} \nu_K, \qquad \psi_{K,i} \overset{\text{i.i.d.}}{\sim} H.$$
We consider CRMs with rate measures $\nu$ whose densities, near zero, are (essentially) proportional to $\theta^{-1-d}$, where $d \in [0, 1)$ is the "discount" parameter. The explicit assumptions on $\nu$ are given in Assumption 1.

Assumption 1.
For $d \in [0, 1)$ and $\eta \in E$, where $E$ is a subset of a finite-dimensional Euclidean space, let $\Theta \sim \mathrm{CRM}(H, \nu(\cdot\,; d, \eta))$, where
$$\nu(\mathrm{d}\theta; d, \eta) := \gamma\, \frac{\theta^{-1-d}\, g(\theta)^{-d}\, h(\theta; \eta)}{Z(1 - d, \eta)}\, \mathrm{d}\theta.$$
Assume that:
1. for $\xi > 0$ and $\eta \in E$, $Z(\xi, \eta) = \int \theta^{\xi - 1} g(\theta)^{\xi} h(\theta; \eta)\, \mathrm{d}\theta < \infty$;
2. $g$ is continuous, $g(0) = 1$, and there exist $0 < c_* \le c^* < \infty$ such that $c_* \le g(\theta)^{-1} \le c^* (1 + \theta)$; and
3. there exists $\epsilon > 0$ such that, for all $\eta \in E$, $\theta \mapsto h(\theta; \eta)$ is continuous and bounded on $[0, \epsilon]$.

Other than the discount $d$ and mass $\gamma$, the rate measure $\nu$ potentially has additional hyperparameters, which are encapsulated by $\eta$. The finiteness of the normalizer $Z$ is necessary for defining finite-dimensional distributions whose densities are very similar in form to $\nu$. The conditions on the behavior of $g(\theta)$ and $h(\theta; \eta)$ imply that the overall rate measure's behavior near $\theta = 0$ is dominated by the $\theta^{-1-d}$ term. These are mild regularity conditions: most popular BNP priors can be cast in this form, and the functions $g(\theta)$ and $h(\theta; \eta)$ are such that all three assumptions can be easily verified. Appendix A shows how common processes such as the beta, gamma (Ferguson and Klass, 1972; Kingman, 1975; Brix, 1999; Titsias, 2008; James, 2013), beta prime (Broderick, Wilson and Jordan, 2018) and generalized gamma processes satisfy Assumption 1.

We will now define a sequence of IFAs that converge in distribution to such a CRM. Our IFA construction requires the following definition.

Definition 3.1.
The parameterized function family $\{S_b\}_{b \in \mathbb{R}_+}$ are approximate indicators if, for any $b \in \mathbb{R}_+$, $S_b(\theta)$ is a real increasing function such that $S_b(\theta) = 0$ for $\theta \le 0$ and $S_b(\theta) = 1$ for $\theta \ge b$.

Valid examples of approximate indicators are the indicator function $S_b(\theta) = \mathbb{1}[\theta > 0]$ and the smoothed indicator function
$$S_b(\theta) = \begin{cases} \exp\left( \dfrac{-1}{1 - (\theta - b)^2/b^2} + 1 \right) & \text{if } \theta \in (0, b), \\ \mathbb{1}[\theta > 0] & \text{otherwise.} \end{cases}$$
Our first result now shows how to construct IFAs that provably converge to our family of CRMs.
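Before stating that result, here is a minimal sketch (our illustration, not from the paper) of the smoothed indicator above; the vectorized form is an implementation convenience:

```python
import numpy as np

def smoothed_indicator(theta, b):
    """Approximate indicator S_b: 0 for theta <= 0, 1 for theta >= b, and the
    smooth increasing ramp exp(-1/(1 - (theta-b)^2/b^2) + 1) on (0, b)."""
    theta = np.asarray(theta, dtype=float)
    out = (theta > 0).astype(float)      # 1[theta > 0] outside the ramp
    ramp = (theta > 0) & (theta < b)
    z = (theta[ramp] - b) / b            # z in (-1, 0) on the ramp
    out[ramp] = np.exp(-1.0 / (1.0 - z ** 2) + 1.0)
    return out

# S_b(0) = 0, S_b(b) = 1, and S_b increases in between:
print(smoothed_indicator([0.0, 0.25, 0.5, 1.0, 2.0], b=1.0))
```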
Theorem 3.2.
Suppose Assumption 1 holds. Let $\{S_b\}_{b \in \mathbb{R}_+}$ be a family of approximate indicators. Fix $a > 0$ and a decreasing sequence $(b_K)_{K \in \mathbb{N}}$ such that $b_K \to 0$. For $c := \gamma\, h(0; \eta) / Z(1 - d, \eta)$ and $\kappa := \min(1, \epsilon)$, let
$$\nu_K(\mathrm{d}\theta) := Z_K^{-1}\, \theta^{cK^{-1} - 1 - d\, S_{b_K}(\theta - aK^{-1})}\, g(\theta)^{cK^{-1} - d}\, h(\theta; \eta)\, \mathrm{d}\theta$$
be a family of probability densities, where $Z_K$ is chosen such that $\int \nu_K(\mathrm{d}\theta) = 1$. If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\Longrightarrow} \Theta$ as $K \to \infty$.

The proof can be found in Appendix B.1. The scope of Theorem 3.2 is broader than that of known i.i.d. finite approximations. Namely, Lee, James and Choi (2016, Theorem 2) design i.i.d. finite approximations that converge in distribution to either a beta process with $d > 0$ or a generalized gamma process with $d > 0$; those constructions do not cover the case $d = 0$, whereas our construction naturally incorporates this situation.

An important corollary of Theorem 3.2 applies to exponential family CRMs with $d = 0$. In common BNP models, the relationship between the likelihood $l(\cdot \mid \theta)$ and the CRM prior is closely related to the well-known conjugacy in exponential families (Broderick, Wilson and Jordan, 2018, Section 4). In particular, the likelihood has an exponential family form
$$l(x \mid \theta) := \kappa(x)\, \theta^{\phi(x)} \exp\left( \langle \mu(\theta), t(x) \rangle - A(\theta) \right). \tag{1}$$
Here $x \in \mathbb{N} \cup \{0\}$, $\kappa(x)$ is the base density, $[t(x), \phi(x)]^T$ is the vector of sufficient statistics, $A(\theta)$ is the log partition function, $[\mu(\theta), \log \theta]^T$ is the vector of natural parameters, and $\langle \mu(\theta), t(x) \rangle$ is an inner product. As for the rate measure, we will analyze those that behave like $\theta^{-1}$ near 0:
$$\nu(\theta) := \gamma' \theta^{-1} \exp\left\{ \left\langle \begin{pmatrix} \psi \\ \lambda \end{pmatrix}, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathbb{1}\{\theta \in U\}, \tag{2}$$
where $\gamma' > 0$, $\lambda > 0$, and $U \subset \mathbb{R}_+$ is the support of $\nu$. Eq. (2) leads to the suggestive terminology of exponential
CRMs. The $\theta^{-1}$ dependence near 0 means that these models lack power-law behavior (e.g., in the beta process; see Teh and Görür (2009)). Models that can be cast in this form include the beta process with Bernoulli likelihood, the beta process with negative binomial likelihood (Broderick et al., 2015; Zhou et al., 2012) and the gamma process with Poisson likelihood (Acharya, Ghosh and Zhou, 2015; Roychowdhury and Kulis, 2015). For shorthand, we refer to these models as beta–Bernoulli, beta–negative binomial and gamma–Poisson, respectively. The normalizer
$$S(\xi, \eta) := \int_U \theta^{\xi} \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathrm{d}\theta \tag{3}$$
of the exponential family distribution plays an important role in the sequel. Note that $S$ is equal to the normalization quantity $Z$ appearing in Assumption 1, but specialized to the exponential family rate measure.

We now state the simple form taken by $\mathrm{IFA}_K$ for exponential family CRMs. The assumptions are the natural analogues of Assumption 1, specialized to exponential family rate measures.

Corollary 3.3.
Let $\nu$ be of the form Eq. (2), and assume that:
1. $S(\xi, \eta) < \infty$ for $\xi > -1$;
2. there exists $\epsilon > 0$ such that, for any $\psi, \lambda$, $\theta \mapsto \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathbb{1}\{\theta \in U\}$ is a continuous and bounded function of $\theta$ on $[0, \epsilon]$.

For $c := \gamma' \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(0) \\ -A(0) \end{pmatrix} \right\rangle \right\}$, let
$$\nu_K(\theta) := \frac{\mathbb{1}\{\theta \in U\}}{S(c/K - 1, \eta)}\, \theta^{c/K - 1} \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\}. \tag{4}$$
If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\Longrightarrow} \Theta$.

Corollary 3.3 is sufficient to recover known IFA results for $\mathrm{BP}(\gamma, \alpha, 0)$ (Doshi-Velez et al., 2009; Paisley and Carin, 2009; Griffiths and Ghahramani, 2011). Appendix A uses Corollary 3.3 to construct IFAs for more example CRMs.
Example 3.1 (Beta process). When $d = 0$, the rate measure of the beta process is $\nu(\theta) = \gamma\alpha\, \theta^{-1} \exp((\alpha - 1) \log(1 - \theta))\, \mathbb{1}\{0 \le \theta \le 1\}$. The normalizer depends only on $\xi$ and $\alpha - 1$: $S_{\mathrm{BP}} = \int_0^1 \theta^{\xi} (1 - \theta)^{\alpha - 1} \exp(0)\, \mathrm{d}\theta = B(\xi + 1, \alpha)$. The assumptions in Corollary 3.3 can be quickly verified: $S_{\mathrm{BP}} < \infty$ for $\xi > -1$, since $B(\xi + 1, \alpha) < \infty$ for $\xi + 1 > 0$, $\alpha > 0$, and $\theta \mapsto (1 - \theta)^{\alpha - 1}$ is clearly bounded and continuous on the interval $[0, 0.5]$ for any $\alpha > 0$. Therefore $\nu_K = \mathrm{Beta}(\gamma\alpha/K, \alpha)$.

In comparison, Doshi-Velez et al. (2009) approximate $\mathrm{BP}(\gamma, 1, 0)$ with each $\nu_K$ a $\mathrm{Beta}(\gamma/K, 1)$ distribution. Griffiths and Ghahramani (2011) also approximate $\mathrm{BP}(\gamma, \alpha, 0)$ with $\nu_K$ being $\mathrm{Beta}(\gamma\alpha/K, \alpha)$. Lastly, Paisley and Carin (2009) approximate $\mathrm{BP}(\gamma, \alpha, 0)$ with $\nu_K$ being a $\mathrm{Beta}(\gamma\alpha/K, \alpha(1 - 1/K))$ distribution, and the difference between $\mathrm{Beta}(\gamma\alpha/K, \alpha)$ and $\mathrm{Beta}(\gamma\alpha/K, \alpha(1 - 1/K))$ is not substantive.

Given that $\mathrm{IFA}_K$ is a converging approximation to the corresponding target CRM, it is natural to ask if the normalization of $\mathrm{IFA}_K$ converges to the corresponding normalization of the CRM, i.e., the NCRM. Our next result shows that the normalized IFA indeed converges, in the sense of exchangeable partition probability functions, or EPPFs (Pitman, 1995). The EPPF of an NCRM $\Xi$ gives the probability of partitions of $\{1, 2, \dots, N\}$ induced by sampling from $\Xi$. In particular, under the model $\Xi \sim \mathrm{NCRM}$, $V_n \mid \Xi \overset{\text{i.i.d.}}{\sim} \Xi$ for $1 \le n \le N$, with the effect of $\Xi$ marginalized out, the ties among the $V_n$'s induce a partition over the set $\{1, 2, \dots, N\}$. Let there be $t \le N$ distinct values among the $V_n$'s, and let $n_i$ be the number of elements in the $i$-th block of the partition induced by sampling from $\Xi$, so that $n_i \ge 1$, $\sum_{i=1}^{t} n_i = N$. The probability of the induced partition is a symmetric function $p(n_1, n_2, \dots, n_t)$ that depends only on the frequencies $n_i$ of each block. The EPPF of $\mathrm{IFA}_K$ is defined analogously.

Theorem 3.4.
Suppose Assumption 1 holds, and let $\Theta_K$ be as in Theorem 3.2. Let $p(n_1, n_2, \dots, n_t)$ be the EPPF of the NCRM $\Xi := \Theta / \Theta(\Psi)$ and let $p_K(n_1, n_2, \dots, n_t)$ be the EPPF of the normalized IFA $\Xi_K := \Theta_K / \Theta_K(\Psi)$. Then, for any $N$ and any $n_i \ge 1$ with $\sum_{i=1}^{t} n_i = N$,
$$\lim_{K \to \infty} p_K(n_1, n_2, \dots, n_t) = p(n_1, n_2, \dots, n_t).$$

The proof can be found in Appendix B.2. Since the EPPF gives the probability of each partition, the point-wise convergence in Theorem 3.4 certifies that the distribution over partitions induced by the normalized $\mathrm{IFA}_K$ converges to that induced by the target NCRM, for any finite data cardinality $N$.
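As a concrete illustration of the constructions in this section (ours, not from the paper), the sketch below samples the beta-process IFA of Example 3.1 and the induced Bernoulli trait counts; the uniform base measure $H$ is a placeholder assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
gam, alpha, K, N = 2.0, 1.5, 200, 50

# IFA_K for BP(gam, alpha, 0): K i.i.d. Beta(gam*alpha/K, alpha) atom weights.
theta = rng.beta(gam * alpha / K, alpha, size=K)
psi = rng.uniform(size=K)   # psi_i i.i.d. ~ H (uniform H is an assumption)

# Bernoulli likelihood process: x_{n,i} ~ Bern(theta_i), i.i.d. over n.
X = rng.binomial(1, theta, size=(N, K))

# K * E[theta_i] = gam / (1 + gam/K) -> gam, so the expected total mass matches
# the target beta process (E[Theta(Psi)] = gam) as K grows.
print(theta.sum(), X.sum(axis=1).mean())
```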
4. Non-asymptotic error bounds for CRM-based models
Theorem 3.2 justifies the use of $\mathrm{IFA}_K$ in the asymptotic limit $K \to \infty$ but does not provide guidance on choosing an appropriate approximation level for modeling a data process with a given cardinality $N$. In this section, we quantify the effect of replacing a CRM with $\mathrm{IFA}_K$ (for finite $K$) in probabilistic models using error bounds that are simple to manipulate, easily yielding a recommendation of the appropriate $K$ for a given $N$ and accuracy level.

The CRM prior on $\Theta$ is typically combined with a likelihood that generates trait counts for each data point. Let $l(\cdot \mid \theta)$ be a proper probability mass function on $\mathbb{N} \cup \{0\}$ for all $\theta$ in the support of $\nu$. Then a collection of conditionally independent observations $X_{1:N}$ given $\Theta$ are distributed according to the likelihood process $\mathrm{LP}(l, \Theta)$, i.e., $X_n := \sum_i x_{ni} \delta_{\psi_i} \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l, \Theta)$, if $x_{ni} \sim l(\cdot \mid \theta_i)$ independently across $i$ and i.i.d. across $n$. Since the trait counts are typically latent in a full generative model specification, define the observed data $Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n)$ for a probability kernel $f$. For instance, if the sequence $(\theta_i)_{i=1}^{\infty}$ represents the topic rates in a document corpus, $X_n$ might capture how many words in document $n$ are generated from each topic and $Y_n$ might be the observed collection of words for that document. The target nonparametric model can thus be summarized as
$$\Theta \sim \mathrm{CRM}(H, \nu), \quad X_n \mid \Theta \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta), \quad Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n), \quad n = 1, 2, \dots, N. \tag{5}$$
The approximating finite-dimensional model, with $\nu_K$ given in Theorem 3.2 (or Corollary 3.3), is
$$\Theta_K \sim \mathrm{IFA}_K(H, \nu_K), \quad Z_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta_K), \quad W_n \mid Z_n \overset{\text{indep}}{\sim} f(\cdot \mid Z_n), \quad n = 1, 2, \dots, N. \tag{6}$$
Let $P_{N,\infty}$ be the distribution of the observations $Y_{1:N}$, and $P_{N,K}$ the distribution of the observations $W_{1:N}$. We define the approximation error to be the total variation distance $d_{TV}(P_{N,K}, P_{N,\infty})$ between the two observational processes, one using the CRM and the other using the approximate $\mathrm{IFA}_K$ as the prior (Ishwaran and Zarepour, 2002; Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Campbell et al., 2019). Recall that the total variation distance is the supremum difference in probability mass over measurable sets: $d_{TV}(P_{N,K}, P_{N,\infty}) := \sup_A |P_{N,K}(A) - P_{N,\infty}(A)|$. We restrict attention to exponential family CRM-likelihood pairs. We require Definition 4.1 to express the assumptions on the target model.
Definition 4.1.
Suppose $l(\cdot \mid \theta)$ has the form Eq. (1) and $\nu(\theta)$ has the form Eq. (2). For $n \in \mathbb{N}$ and $x_{1:n-1} \in (\mathbb{N} \cup \{0\})^{n-1}$, define the shorthands $T_n := \sum_{m=1}^{n-1} t(x_m)$ and $\Phi_n := \sum_{m=1}^{n-1} \phi(x_m)$. For $x \in \mathbb{N} \cup \{0\}$, let
$$h_c(x \mid x_{1:n-1}) := \kappa(x)\, \frac{S\left( \Phi_n + \phi(x) - 1,\; \eta + \begin{pmatrix} T_n + t(x) \\ n \end{pmatrix} \right)}{S\left( \Phi_n - 1,\; \eta + \begin{pmatrix} T_n \\ n - 1 \end{pmatrix} \right)},$$
$$\widetilde{h}_c(x \mid x_{1:n-1}) := \kappa(x)\, \frac{S\left( c/K + \Phi_n + \phi(x) - 1,\; \eta + \begin{pmatrix} T_n + t(x) \\ n \end{pmatrix} \right)}{S\left( c/K + \Phi_n - 1,\; \eta + \begin{pmatrix} T_n \\ n - 1 \end{pmatrix} \right)},$$
and
$$M_{n,x} := \gamma'\, \kappa(0)^{n-1} \kappa(x)\, S\left( (n - 1)\phi(0) + \phi(x) - 1,\; \eta + \begin{pmatrix} (n - 1)\, t(0) + t(x) \\ n \end{pmatrix} \right).$$

We show in Appendix C that the functions $h_c$, $\widetilde{h}_c$ and $M_{n,x}$ govern the marginal process representation of the probabilistic models (Broderick, Wilson and Jordan, 2018, Section 6). Namely, the joint distribution of $X_{1:N}$ can be expressed in terms of the conditionals $X_n \mid X_{1:n-1}$, with $M_{n,x}$ and $h_c$ governing this process. Similarly, the joint distribution of $Z_{1:N}$ can be expressed in terms of the conditionals $Z_n \mid Z_{1:n-1}$, with $\widetilde{h}_c$ governing this process. For the beta-Bernoulli process with $d = 0$, the functions have particularly simple forms.

Example 4.1 (Beta-Bernoulli with $d = 0$). For the beta-Bernoulli model with $d = 0$, we have
$$h_c(x \mid x_{1:n-1}) = \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n}\, \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n}\, \mathbb{1}\{x = 0\},$$
$$\widetilde{h}_c(x \mid x_{1:n-1}) = \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K}\, \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n + \gamma\alpha/K}\, \mathbb{1}\{x = 0\},$$
$$M_{n,1} = \frac{\gamma\alpha}{\alpha - 1 + n}, \qquad M_{n,x} = 0 \text{ for } x > 1.$$
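As a numerical sanity check (ours, not from the paper), the beta-Bernoulli formulas above can be evaluated directly: $\widetilde{h}_c$ should approach $h_c$, and $K \widetilde{h}_c(1 \mid 0)$ should approach $M_{n,1}$, both at rate $O(1/K)$:

```python
gam, alpha = 2.0, 1.5
n, sum_x = 10, 3  # predicting observation n; 3 ones among x_1, ..., x_{n-1}

def h_c(sum_x, n, alpha):
    """P(x_n = 1 | history) at an instantiated atom under the target model."""
    return sum_x / (alpha - 1 + n)

def h_c_tilde(sum_x, n, alpha, gam, K):
    """Same predictive under the IFA with nu_K = Beta(gam*alpha/K, alpha)."""
    a = gam * alpha / K
    return (sum_x + a) / (alpha - 1 + n + a)

M_n1 = gam * alpha / (alpha - 1 + n)  # expected new-atom rate, target model

for K in [10, 100, 1000]:
    err_h = abs(h_c(sum_x, n, alpha) - h_c_tilde(sum_x, n, alpha, gam, K))
    err_M = abs(M_n1 - K * h_c_tilde(0, n, alpha, gam, K))
    print(K, err_h, err_M)  # both errors shrink at rate O(1/K)
```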
We now formulate the conditions which can be used to show that $d_{TV}(P_{N,K}, P_{N,\infty})$ is small.

Assumption 2. There exist positive constants $\{C_i\}$ such that the following hold.
1. For all $n \in \mathbb{N}$,
$$\sum_{x=1}^{\infty} M_{n,x} \le \frac{C_1}{n - C_2}. \tag{7}$$
2. For all $n \in \mathbb{N}$,
$$\sum_{x=1}^{\infty} \widetilde{h}_c(x \mid x_{1:n-1} = 0) \le \frac{1}{K} \frac{C_3}{n - C_4}. \tag{8}$$
3. For any $n \in \mathbb{N}$ and any $\{x_i\}_{i=1}^{n-1}$,
$$\sum_{x=0}^{\infty} \left| h_c(x \mid x_{1:n-1}) - \widetilde{h}_c(x \mid x_{1:n-1}) \right| \le \frac{1}{K} \frac{C_5}{n - C_6}. \tag{9}$$
4. For all $n \in \mathbb{N}$ and any $K \ge C_7 (\ln n + C_8)$,
$$\sum_{x=1}^{\infty} \left| M_{n,x} - K \widetilde{h}_c(x \mid x_{1:n-1} = 0) \right| \le \frac{1}{K} \frac{C_9 \ln n + C_{10}}{n - C_{11}}. \tag{10}$$

Note that the conditions depend only on the functions in Definition 4.1 and not on the observational likelihood $f(\cdot)$, which maps the latent states to the observations. The condition in Eq. (7) constrains the growth rate of the target model: $\sum_{n=1}^{N} \sum_{x=1}^{\infty} M_{n,x}$ is the expected number of components for data cardinality $N$, and since each $\sum_{x=1}^{\infty} M_{n,x}$ is at most $O(1/n)$, the total number of components is $O(\ln N)$. The condition in Eq. (9) means that $\widetilde{h}_c$ is a very good approximation of $h_c$ in total variation distance; furthermore, the longer the vector $\{x_i\}_{i=1}^{n-1}$, the smaller the error. Similarly, the condition in Eq. (10) means that $K \widetilde{h}_c(\cdot \mid 0)$ is a very accurate approximation of $M_{n,\cdot}$, and there is also a reduction in the error as $n$ increases. The set of constants $C_i$ which satisfy Assumption 2 is not unique: we are in general not interested in the best constants $C_i$, only in their existence. We speculate that such assumptions can be made more explicit in terms of the normalizer $S$. For instance, the $1/K$ dependence is due to smoothness of $S$ in its first argument, while the dependence on $n$ is due to some inherent notion of scale dictated by the second and third arguments.

Assumption 2 can be verified for the most important CRM models. In Example 4.2 we verify it for the beta-Bernoulli model, and in Appendix E we verify it for the beta-negative binomial and gamma-Poisson models.

Example 4.2 (Beta-Bernoulli with $d = 0$, continued). The growth rate of the target model is
$$\sum_{x=1}^{\infty} M_{n,x} = M_{n,1} = \frac{\gamma\alpha}{n - 1 + \alpha}.$$
Since $\widetilde{h}_c$ is supported on $\{0, 1\}$, the growth rate of the approximate model satisfies
$$\widetilde{h}_c(1 \mid x_{1:n-1} = 0) = \frac{\gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} \le \frac{1}{K} \frac{\gamma\alpha}{n - 1 + \alpha}.$$
Since both $h_c$ and $\widetilde{h}_c$ are supported on $\{0, 1\}$, Eq. (9) becomes
$$\left| h_c(1 \mid x_{1:n-1}) - \widetilde{h}_c(1 \mid x_{1:n-1}) \right| = \left| \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} - \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n} \right| \le \frac{\gamma\alpha}{K} \frac{1}{n - 1 + \alpha}.$$
Again, because $M_{n,x} = \widetilde{h}_c(x \mid \cdot) = 0$ for $x > 1$, Eq. (10) becomes
$$\left| M_{n,1} - K \widetilde{h}_c(1 \mid x_{1:n-1} = 0) \right| = \left| \frac{\gamma\alpha}{\alpha - 1 + n} - \frac{\gamma\alpha}{\alpha - 1 + n + \gamma\alpha/K} \right| \le \frac{\gamma^2\alpha^2}{K} \frac{1}{(n - 1 + \alpha)^2}.$$
Calibrating $\{C_i\}$ based on these inequalities is straightforward.

Under the aforementioned assumptions, Theorem 4.2 upper bounds the approximation error.
Theorem 4.2 (Upper bound for exponential family CRMs). If Assumption 2 holds, then there exist positive constants $C', C'', C'''$ depending only on $\{C_i\}$ such that
$$d_{TV}(P_{N,\infty}, P_{N,K}) \le \frac{C' + C'' \ln N + C''' \ln^2 N\, \ln K}{K}.$$
The proof can be found in Appendix F.1. Theorem 4.2 states that the IFA approximation error grows as $O(\ln^2 N)$ for fixed $K$, and decreases as $O(\ln K / K)$ for fixed $N$. On the one hand, for fixed $K$, it is expected that the error increases as $N$ increases: with more data, the number of latent components in the data increases, demanding finite approximations of increasingly larger sizes. In particular, $O(\ln N)$ is the standard Bayesian nonparametric growth rate for non-power-law models (Griffiths and Ghahramani, 2011). It is likely that the $O(\ln^2 N)$ factor can be improved to $O(\ln N)$; more generally, we conjecture that the error depends directly on the expected number of latent components in a model for $N$ observations. On the other hand, for fixed $N$, the error goes to zero at least as fast as $O(\ln K / K)$. We also suspect the $\ln K$ factor in the numerator can be removed.

As Theorem 4.2 is only an upper bound, a natural question to investigate is the tightness of the bound in terms of $N$ and $K$. In this section, we focus on the beta-Bernoulli process with $d = 0$; i.e., $P_{N,\infty}$ refers to the observational process coming from $\mathrm{BP}(\gamma, \alpha, 0)$ and $P_{N,K}$ refers to the observational process of $\mathrm{IFA}_K$ with $\nu_K$ as in Example 3.1.

We first look at the dependence of the error bound in terms of $\ln N$. For any $N \in \mathbb{N}$, $\alpha > 0$, we define the growth function
$$C(N, \alpha) := \sum_{n=1}^{N} \frac{\alpha}{n - 1 + \alpha}. \tag{11}$$
It is known that $C(N, \alpha) = \Omega(\ln N)$ (see Lemma D.9). Theorem 4.3 shows that finite approximations cannot be accurate if the approximation level is too small compared to the growth function $C(N, \alpha)$.

Theorem 4.3 ($\ln N$ is necessary). For the beta-Bernoulli model with $d = 0$, there exists an observation likelihood $f$, independent of $K$ and $N$, such that for any $N$, if $K \le \gamma C(N, \alpha)$, then
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge 1 - C N^{-\gamma\alpha/2},$$
where $C$ only depends on the hyper-parameters of the beta process, i.e., $\gamma, \alpha$.

The proof is given in Appendix F.2. Theorem 4.3 implies that as $N$ grows, if the approximation level $K$ fails to surpass the $\gamma C(N, \alpha) = \Omega(\ln N)$ threshold, then the total variation between the approximate and the target model remains bounded away from zero; in fact, the error tends to one.

Now turning to the dependence on $K$ of the upper bound in Theorem 4.2, we discuss a lower bound on the approximation error, which reveals that the $1/K$ factor in the upper bound is tight (modulo logarithmic factors).

Theorem 4.4 (Lower bound of $1/K$). For the beta-Bernoulli model with $d = 0$, there exists an observation likelihood $f$, independent of $K$ and $N$, such that for any $N$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge \frac{C(\gamma)}{K},$$
where $C(\gamma) > 0$ is an explicit constant depending only on $\gamma$.

The proof can be found in Appendix F.2. While Theorem 4.2 implies that an IFA with $K = O(\mathrm{poly}(\ln N)/\epsilon)$ atoms suffices to approximate the target model to less than $\epsilon$ error, Theorem 4.4 implies that an IFA with $K = \Omega(1/\epsilon)$ atoms is necessary in the worst case. This dependence on the accuracy level means that IFAs are worse than TFAs in theory. For example, consider Bondesson approximations (Bondesson, 1982) of $\mathrm{BP}(\gamma, \alpha, 0)$.

Example 4.3 (Bondesson approximation (Bondesson, 1982)). Let $\alpha \ge 1$. Let $E_l \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(1)$ and $\Gamma_k = \sum_{l=1}^{k} E_l$. The level-$K$ Bondesson approximation of $\mathrm{BP}(\gamma, \alpha, 0)$ is a TFA $\sum_{k=1}^{K} \theta_k \delta_{\psi_k}$, where $\theta_k = V_k \exp(-\Gamma_k / \gamma\alpha)$, $V_k \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha - 1)$ and $\psi_k \overset{\text{i.i.d.}}{\sim} H$.

The following result gives a bound on the error of the Bondesson approximation.

Proposition 4.5 (Campbell et al., 2019). For $\gamma > 0$, $\alpha \ge 1$, let $\Theta_K$ be distributed according to a level-$K$ Bondesson approximation of $\mathrm{BP}(\gamma, \alpha, 0)$, $R_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta_K)$, $T_n \mid R_n \overset{\text{indep}}{\sim} f(\cdot \mid R_n)$ with $N$ observations. Let $Q_{N,K}$ be the distribution of the observations $T_{1:N}$. Then:
$$d_{TV}(P_{N,\infty}, Q_{N,K}) \le N\gamma \left( \frac{\gamma\alpha}{\gamma\alpha + 1} \right)^K.$$

Proposition 4.5 implies that a TFA with $K = O(\ln(N/\epsilon))$ atoms suffices to approximate the target model to less than $\epsilon$ error. Modulo log factors, comparing the necessary $1/\epsilon$ level for IFAs and the sufficient $\ln(1/\epsilon)$ level for TFAs, we conclude that the necessary size for an IFA is exponentially larger than the sufficient size for a TFA, in the worst case.
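For contrast with the i.i.d. IFA weights, the following minimal sketch (ours) samples the level-$K$ Bondesson TFA of Example 4.3; note how every weight reuses the cumulative sums $\Gamma_k$, so the atoms are sequentially coupled. The uniform base measure is again a placeholder assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
gam, alpha, K = 2.0, 2.0, 100                    # Example 4.3 assumes alpha >= 1

Gamma = np.cumsum(rng.exponential(1.0, size=K))  # Gamma_k = E_1 + ... + E_k
V = rng.beta(1.0, alpha - 1.0, size=K)           # V_k i.i.d. Beta(1, alpha - 1)
                                                 # (for alpha = 1, V_k degenerates at 1)
theta = V * np.exp(-Gamma / (gam * alpha))       # sequentially coupled weights
psi = rng.uniform(size=K)                        # psi_k i.i.d. ~ H (assumption)

# The exp(-Gamma_k/(gam*alpha)) factor makes the weights stochastically
# decreasing in k, so truncating at level K discards only small atoms.
print(theta[:5], theta.sum())
```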
5. Non-asymptotic error bounds for Dirichlet process-based models
Having analyzed the error incurred by $\mathrm{IFA}_K$ in CRM-based models like beta-Bernoulli, gamma-Poisson and beta-negative binomial, we now turn to the approximation error in NCRM-based models. Our notion of approximation error remains the total variation distance between the target and the approximate observational processes. The forms of the upper and lower bounds are very similar to Theorems 4.2, 4.3 and 4.4. We leave the derivation of bounds for more general NCRMs to future work.

We focus on the Dirichlet process [DP] (Ferguson, 1973; Sethuraman, 1994), which is the normalization of a non-power-law gamma process, and the finite symmetric Dirichlet [FSD] distribution, which is the normalization of the IFA for the gamma process. The Dirichlet process is one of the most widely used nonparametric priors. The gamma process CRM has rate measure
$$\nu(\mathrm{d}\theta) = \gamma \frac{\lambda^{1-d}}{\Gamma(1 - d)}\, \theta^{-d-1} e^{-\lambda\theta}\, \mathrm{d}\theta.$$
We denote its distribution as $\Gamma\mathrm{P}(\gamma, \lambda, d)$. The normalization of $\Gamma\mathrm{P}(\gamma, 1, 0)$ is a Dirichlet process with mass parameter $\gamma$ (Kingman, 1975; Ferguson, 1973). By Corollary 3.3, $\mathrm{IFA}_K(H, \nu_K)$ with $\nu_K(\theta) = \mathrm{Gam}(\theta; \gamma/K, 1)$ converges in distribution to $\Gamma\mathrm{P}(\gamma, 1, 0)$, and the normalization of $\mathrm{IFA}_K(H, \nu_K)$ is equal in distribution to $\sum_{i=1}^{K} p_i \delta_{\psi_i}$, where $\psi_i \overset{\text{i.i.d.}}{\sim} H$ and $\{p_i\}_{i=1}^{K} \sim \mathrm{Dir}(\frac{\gamma}{K} 1_K)$. We denote this as $\mathrm{FSD}_K(\gamma, H)$.

We consider Dirichlet process mixture models (Antoniak, 1974):
$$\Theta \sim \mathrm{DP}(\alpha, H), \quad X_n \mid \Theta \overset{\text{i.i.d.}}{\sim} \Theta, \quad Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n) \tag{12}$$
with corresponding approximation
$$\Theta_K \sim \mathrm{FSD}_K(\alpha, H), \quad Z_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \Theta_K, \quad W_n \mid Z_n \overset{\text{indep}}{\sim} f(\cdot \mid Z_n). \tag{13}$$
Let $P_{N,\infty}$ be the distribution of the observations $Y_{1:N}$, and let $P_{N,K}$ be the distribution of the observations $W_{1:N}$.

Upper bounds on the error made by $\mathrm{FSD}_K$ can be used to determine the sufficient $K$ to approximate the target process for a given $N$ and accuracy level. We upper bound $d_{TV}(P_{N,\infty}, P_{N,K})$ in Theorem 5.1.
Theorem 5.1 (Upper bound for DP mixture models). For some constants $C_1, C_2, C_3$ that only depend on $\alpha$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \le \frac{C_1 + C_2 \ln N + C_3 \ln^2 N\, \ln K}{K}.$$
The proof is given in Appendix G.1. Theorem 5.1 is similar to Theorem 4.2. The $O(\ln^2 N)$ growth of the bound for fixed $K$ can likely be reduced to $O(\ln N)$, the inherent growth rate of DP mixture models (Miller and Harrison, 2013). The $O(\ln K / K)$ rate of decrease to zero is tight because of a $1/K$ lower bound on the approximation error. Theorem 5.1 is an improvement over the existing theory for $\mathrm{FSD}_K$, in the sense that Ishwaran and Zarepour (2002, Theorem 4) provide an upper bound on $d_{TV}(P_{N,\infty}, P_{N,K})$ that lacks an explicit dependence on $K$ or $N$; that bound cannot be inverted to determine the sufficient $K$ to approximate the target to a given accuracy, while doing so is simple using Theorem 5.1.

Theorem 5.1 can also be used to analyze models with additional hierarchical structure. For instance, the hierarchical Dirichlet process [HDP] and variants are important use cases of the DP and have demonstrated great practical use. We will analyze the error made by $\mathrm{FSD}_K$ for a variant of the HDP we call the modified HDP. In the HDP, there is a population measure generated by a DP, $G \sim \mathrm{DP}(\omega, H)$, and for each sub-population indexed by $d$, the sub-population measure is generated as $G_d \mid G \sim \mathrm{DP}(\alpha, G)$. In the modified HDP, the sub-population measure is instead distributed as $G_d \mid G \sim \mathrm{TSB}_T(\alpha, G)$, where the TSB distribution is defined in Example 5.1.

Example 5.1 (Stick-breaking approximation (Sethuraman, 1994)). For $i = 1, 2, \dots, K - 1$, let $v_i \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha)$. Set $v_K = 1$. Let $\xi_i = v_i \prod_{j=1}^{i-1} (1 - v_j)$. Let $\psi_k \overset{\text{i.i.d.}}{\sim} H$, and $\Xi_K = \sum_{k=1}^{K} \xi_k \delta_{\psi_k}$. We denote the distribution of $\Xi_K$ as $\mathrm{TSB}_K(\alpha, H)$.

In all, the generative process of the modified HDP is
$$G \sim \mathrm{DP}(\omega, H), \qquad H_d \mid G \overset{\text{indep}}{\sim} \mathrm{TSB}_T(\alpha, G) \text{ across } d,$$
$$\beta_{dn} \mid H_d \overset{\text{indep}}{\sim} H_d(\cdot), \qquad W_{dn} \mid \beta_{dn} \overset{\text{indep}}{\sim} f(\cdot \mid \beta_{dn}) \text{ across } d, n. \tag{14}$$
Observation groups are indexed by $d$ and individual observations are indexed by $n, d$. Each group manifests at most $T$ distinct atoms of the population-level measure in the style of Example 5.1. The number of groups is $D$, and the number of observations in each group is $N$.

The finite approximation we consider replaces the population-level DP with $\mathrm{FSD}_K$, keeping the other conditionals intact:
$$G_K \sim \mathrm{FSD}_K(\omega, H), \qquad F_d \mid G_K \overset{\text{indep}}{\sim} \mathrm{TSB}_T(\alpha, G_K) \text{ across } d,$$
$$\psi_{dn} \mid F_d \overset{\text{indep}}{\sim} F_d(\cdot), \qquad Z_{dn} \mid \psi_{dn} \overset{\text{indep}}{\sim} f(\cdot \mid \psi_{dn}) \text{ across } d, n. \tag{15}$$
Let $P_{(N,D),\infty}$ be the distribution of the observations $\{W_{dn}\}$ and $P_{(N,D),K}$ the distribution of the observations $\{Z_{dn}\}$. We have the following corollary to Theorem 5.1.

Corollary 5.2 (Upper bound for the modified HDP). For some constants $C_1, C_2, C_3$ which depend only on $\omega$,
$$d_{TV}\left( P_{(N,D),\infty}, P_{(N,D),K} \right) \le \frac{C_1 + C_2 \ln(DT) + C_3 \ln^2(DT)\, \ln K}{K}.$$
The proof can be found in Appendix G.1. For fixed $K$, Corollary 5.2 is independent of $N$, the number of observations in each group, but grows like $O(\mathrm{poly}(\ln D))$ with the number of groups $D$. For fixed $D$, the approximation error decreases to zero at a rate no slower than $O(\ln K / K)$.

As Theorem 5.1 is only an upper bound, we now investigate the tightness of the inequality in terms of $N$ and $K$. We return to DP mixture models. We first look at the dependence of the error bound in terms of $\ln N$. Theorem 5.3 shows that finite approximations cannot be accurate if the approximation level is too small compared to the growth rate $\ln N$.

Theorem 5.3 ($\ln N$ is necessary). There exists a probability kernel $f(\cdot)$, independent of $K$ and $N$, such that for any $N \ge 2$, if $K \le C(N, \alpha)$, then
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge 1 - C' N^{-\alpha/2},$$
where $C'$ is a constant dependent only on $\alpha$.

The proof is given in Appendix G.2. Theorem 5.3 implies that as $N$ grows, if the approximation level $K$ fails to surpass the $C(N, \alpha)$ threshold, then the total variation between the approximate and the target model remains bounded away from zero; in fact, the error tends to one. Recall that $C(N, \alpha) = \Omega(\ln N)$, so the necessary approximation level is $\Omega(\ln N)$. Theorem 5.3 is the analog of Theorem 4.3.

We also investigate the tightness of Theorem 5.1 in terms of $K$. In Theorem 5.4, our lower bound indicates that the $1/K$ factor in Theorem 5.1 is tight (up to log factors).

Theorem 5.4 ($1/K$ lower bound). There exists a probability kernel $f(\cdot)$, independent of $K$ and $N$, such that for any $N \ge 2$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge \frac{C(\alpha)}{K},$$
where $C(\alpha) > 0$ is an explicit constant depending only on $\alpha$.
The proof is given in Appendix G.2. While Theorem 5.1 implies that the normalized $\mathrm{IFA}_K$ with $K = O(\mathrm{poly}(\ln N)/\epsilon)$ atoms suffices to approximate the DP mixture model to less than $\epsilon$ error, Theorem 5.4 implies that a normalized IFA with $K = \Omega(1/\epsilon)$ atoms is necessary in the worst case. This worst-case behavior is analogous to that of Theorem 4.4 for CRM-based models.

The $1/\epsilon$ dependence means that IFAs are worse than TFAs in theory. It is known that small TFA models are already excellent approximations of the DP. Example 5.1 is a very well-known finite approximation whose error is upper bounded in Proposition 5.5.

Proposition 5.5 (Ishwaran and James, 2001, Theorem 2). Let $\Xi_K \sim \mathrm{TSB}_K(\alpha, H)$, $R_n \mid \Xi_K \overset{\text{i.i.d.}}{\sim} \Xi_K$, $T_n \mid R_n \overset{\text{indep}}{\sim} f(\cdot \mid R_n)$ with $N$ observations. Let $Q_{N,K}$ be the distribution of the observations $T_{1:N}$. Then
$$d_{TV}(P_{N,\infty}, Q_{N,K}) \le N \exp\left( -\frac{K - 1}{\alpha} \right).$$

Proposition 5.5 implies that a TFA with $K = O(\ln(N/\epsilon))$ atoms suffices to approximate the DP mixture model to less than $\epsilon$ error. Modulo log factors, comparing the necessary $1/\epsilon$ level for the normalized IFA and the sufficient $\ln(1/\epsilon)$ level for the TFA, we conclude that the necessary size for the normalized IFA is exponentially larger than the sufficient size for the TFA, in the worst case.
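Both finite approximations of the DP discussed in this section take only a few lines to sample. A minimal sketch (ours), with the gamma-normalization route to the symmetric Dirichlet mirroring the construction in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, K = 1.0, 50

# FSD_K(alpha, H): normalize K i.i.d. Gamma(alpha/K, 1) weights, which gives
# the symmetric Dirichlet Dir(alpha/K, ..., alpha/K).
g = rng.gamma(alpha / K, 1.0, size=K)
p_fsd = g / g.sum()  # equivalently rng.dirichlet(np.full(K, alpha / K))

# TSB_K(alpha, H): truncated stick-breaking of Example 5.1.
v = rng.beta(1.0, alpha, size=K)
v[-1] = 1.0                                        # v_K = 1 closes the stick
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
p_tsb = v * remaining                              # xi_i = v_i * prod_{j<i}(1 - v_j)

assert np.isclose(p_fsd.sum(), 1.0) and np.isclose(p_tsb.sum(), 1.0)
```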
6. Conceptual benefits of finite approximations
As part of Bayesian inference, we need to compute the posterior over the latent variables in our finite-dimensional probabilistic models (Eq. (6)). To set up notation, we denote by $\theta = (\theta_i)_{i=1}^{K}$ the collection of atom sizes, $\psi = (\psi_i)_{i=1}^{K}$ the collection of atom locations and $x = (x_{n,i})$ the trait counts of each observation.

Standard tools to explore or approximate the posterior distribution $P(\theta, \psi, x \mid \text{data})$ require easy-to-simulate Gibbs conditional distributions or tractable expectations. On the one hand, because of the discreteness of the trait counts $x$, even with the recent advances in Hamiltonian Monte Carlo (Hoffman and Gelman, 2014), successful Markov chain Monte Carlo (MCMC) algorithms have been based largely on Gibbs sampling (Geman and Geman, 1984). In particular, blocked Gibbs sampling utilizing the natural Markov blanket structure is straightforward to implement when the complete conditionals $P(\theta \mid x, \psi, \text{data})$, $P(x \mid \psi, \theta, \text{data})$ or $P(\psi \mid x, \theta, \text{data})$ are easy to simulate from. On the other hand, variational inference using a mean-field approximation and KL divergence (Wainwright and Jordan, 2008) requires analytical expectations. The variational distributions are typically chosen to match the parametric form of the complete conditionals: information about the latent variables is easily summarized, and the divergence between approximation and target is (locally) optimized using coordinate ascent updates. Such updates require expectations of the form $E_{\theta \sim q}[\ln l(x \mid \theta)]$, where $q(\theta)$ is the variational distribution over atom sizes.

Since finite approximations (IFAs/TFAs) with the same number of atoms $K$ differ only in the prior $P(\theta)$, to compare the ease of use of IFAs and TFAs, it suffices to compare the tractability of $P(\theta \mid x, \psi, \text{data})$ under the different approximations. For exponential family CRMs with $d = 0$, IFAs are highly compatible with standard inference schemes, because the Gibbs conditional $P(\theta \mid x, \psi, \text{data})$ comes from the same exponential family as the prior $\nu_K$.

Lemma 6.1 (Conditional conjugacy of IFAs). Suppose the likelihood is Eq. (1) and the IFA prior $\nu_K$ is as in Corollary 3.3. Then the complete conditional of the atom sizes factorizes across atoms:
$$P(\theta \mid x, \psi, \mathrm{data}) = \prod_{k=1}^{K} P(\theta_k \mid x_{\cdot,k}).$$
Furthermore, each $P(\theta_k \mid x_{\cdot,k})$ is in the same exponential family as the IFA prior, with density proportional to
$$\mathbb{1}\{\theta \in U\}\, \theta^{c/K + \sum_{n=1}^{N} \phi(x_{n,k}) - 1} \exp\left( \left\langle \psi + \sum_{n=1}^{N} t(x_{n,k}), \mu(\theta) \right\rangle + (\lambda + N)[-A(\theta)] \right) \mathrm{d}\theta. \tag{16}$$

The proof follows from the results in Appendix C. Lemma 6.1 implies that the derivation of simulation steps and expectation equations for IFAs of common models such as beta-Bernoulli, gamma-Poisson and beta-negative binomial is straightforward. The complete conditionals over atom sizes are easy to simulate because they are well-known exponential families (beta and gamma). Also, the expectations of $\ln l(x \mid \theta)$ when $\theta$ has the exponential family distribution (Eq. (16)) are tractable because of the exponential family algebra between log-likelihood and prior. Finally, a parallelization strategy that utilizes the factorization structure across atoms can yield user-time speed-ups, with the gains being greatest when there are many instantiated atoms.
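For instance, for the beta-Bernoulli IFA, the conditional in Lemma 6.1 is an ordinary beta distribution, so a blocked Gibbs update of all $K$ atom sizes is a single vectorized draw. A minimal sketch (ours, with synthetic counts purely for illustration):

```python
import numpy as np

def gibbs_update_theta(X, gam, alpha, rng):
    """One blocked Gibbs draw of all atom sizes for the beta-Bernoulli IFA.

    X is the (N, K) binary trait matrix x_{n,k}. Lemma 6.1 specializes to
      theta_k | x_{.,k} ~ Beta(gam*alpha/K + sum_n x_{n,k},
                               alpha + N - sum_n x_{n,k}),
    and the update factorizes across the K atoms, so it parallelizes trivially.
    """
    N, K = X.shape
    counts = X.sum(axis=0)
    return rng.beta(gam * alpha / K + counts, alpha + N - counts)

rng = np.random.default_rng(4)
X = rng.binomial(1, 0.3, size=(100, 20))  # synthetic counts, illustration only
theta = gibbs_update_theta(X, gam=2.0, alpha=1.5, rng=rng)
```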
There are many different types of TFAs, but in general the derivation of simulation steps and expectation equations is much more involved than for IFAs. While the prior $P(\theta)$ can be reasonably easy to sample from, the incorporation of trait counts leads to intractable conditionals $P(\theta \mid x)$. We consider two illustrative examples, both for exponential CRMs with $d = 0$. In Example 6.1, the complete conditional of the atom sizes is both hard to sample from and leads to analytically intractable expectations. In Example 6.2, the complete conditional of the atom sizes can be sampled from without introducing auxiliary variables, but important expectations are not analytically tractable.

Example 6.1 (Stick-breaking approximation (Broderick, Jordan and Pitman, 2012; Paisley, Carin and Blei, 2011)). The following finite approximation is a TFA for $\mathrm{BP}(\gamma, \alpha, 0)$:
$$\Theta_K = \sum_{i=1}^{K} \sum_{j=1}^{C_i} V_{i,j}^{(i)} \prod_{l=1}^{i-1} \left( 1 - V_{i,j}^{(l)} \right) \delta_{\psi_{ij}},$$
where $C_i \overset{\text{i.i.d.}}{\sim} \mathrm{Poisson}(\gamma)$, $V_{i,j}^{(l)} \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha)$ and $\psi_{i,j} \overset{\text{i.i.d.}}{\sim} H$. A priori, the atom sizes $V_{i,j}^{(i)} \prod_{l=1}^{i-1} (1 - V_{i,j}^{(l)})$ can be sampled (using the stick-breaking proportions $V_{i,j}^{(l)}$), but there is no tractable way to sample from or compute expectations with respect to the conditional distribution $P(\theta \mid x)$, because of the dependence on $C_i$ as well as the entangled form of each $\theta$. Strategies to make the model more tractable include introducing auxiliary round-indicator variables $r_k$ (Broderick, Jordan and Pitman, 2012; Paisley, Carin and Blei, 2011), marginalizing out the stick-breaking proportions (Broderick, Jordan and Pitman, 2012) or replacing the product $\prod_{l=1}^{i-1} (1 - V_{i,j}^{(l)})$ with a more succinct representation (Paisley, Carin and Blei, 2011). However, the final models from these attempts all contain at least one Gibbs conditional that is either difficult to sample from (Broderick, Jordan and Pitman, 2012, Equation 37) or lacks tractable expectations (Paisley, Carin and Blei, 2011, Section 3.3).

Other superposition-based approximations, like the decoupled Bondesson or power-law approximations (Campbell et al., 2019), will similarly struggle with the number-of-atoms-per-round variables $C_i$ and the entanglement among the atom sizes.

Example 6.2 (Bondesson approximation (Doshi-Velez et al., 2009; Teh, Görür and Ghahramani, 2007)). When $\alpha = 1$, the Bondesson approximation in Example 4.3 becomes
$$\Theta_K = \sum_{i=1}^{K} \left( \prod_{j=1}^{i} p_j \right) \delta_{\psi_i}, \quad \text{where } p_j \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(\gamma, 1) \text{ and } \psi_i \overset{\text{i.i.d.}}{\sim} H.$$
The atom sizes are entangled by the $p_j$'s, $\theta_i = \prod_{j=1}^{i} p_j$, but the complete conditional of the atom sizes $P(\theta \mid x)$ admits a density with respect to Lebesgue measure, and it is proportional to
$$\mathbb{1}\{0 \le \theta_K \le \theta_{K-1} \le \dots \le \theta_1 \le 1\} \prod_{j=1}^{K} \theta_j^{\gamma \mathbb{1}\{j = K\} + \sum_{n=1}^{N} x_{n,j} - 1} (1 - \theta_j)^{N - \sum_{n=1}^{N} x_{n,j}}.$$
The conditional distributions $P(\theta_i \mid \theta_{-i}, x)$ are truncated betas, so adaptive rejection sampling (Gilks and Wild, 1992) can be used as a sub-routine to sample each $P(\theta_i \mid \theta_{-i}, x)$ and then sweep over all atom sizes. However, for this exponential family, expectations of the sufficient statistics $\ln \theta_i$ and $\ln(1 - \theta_i)$ are not tractable: variational inference as conducted in Doshi-Velez et al. (2009) required additional approximations.

Other series-based approximations, like thinning or rejection sampling (Campbell et al., 2019), have even more intractable dependencies between atom sizes in both the prior and the conditional $P(\theta \mid x)$.
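For Example 6.2, each conditional $P(\theta_i \mid \theta_{-i}, x)$ is a beta density restricted to the interval between its ordered neighbors; when adaptive rejection sampling is not at hand, inverse-CDF sampling is a simple alternative. A sketch of the sub-routine (ours, assuming SciPy; the numbers in the demo call are hypothetical):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def sample_truncated_beta(a, b, lo, hi, rng):
    """Draw from Beta(a, b) restricted to [lo, hi] via inverse-CDF sampling.
    Assumes a, b > 0 so that the untruncated beta is proper."""
    u = rng.uniform(beta_dist.cdf(lo, a, b), beta_dist.cdf(hi, a, b))
    return float(beta_dist.ppf(u, a, b))

rng = np.random.default_rng(5)
# E.g., an interior atom with 3 active counts out of N = 10 observations,
# constrained between ordered neighbors theta_{i+1} = 0.1 and theta_{i-1} = 0.6;
# a Gibbs sweep applies such a draw to each atom in turn.
print(sample_truncated_beta(a=3.0, b=8.0, lo=0.1, hi=0.6, rng=rng))
```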
7. Empirical evaluation
We compare the practical performance of IFAs and TFAs on two real-data examples: an image denoising application using the beta-Bernoulli model and topic modeling using the modified HDP. Existing empirical work (e.g., Doshi-Velez et al. (2009, Tables 1, 2) and Kurihara, Welling and Teh (2007, Figure 4)) suggests two patterns: the approximations improve in performance as the number of instantiated atoms $K$ increases, and for the same $K$, normalized IFAs and TFAs have similar performance. Our experiments confirm and expand upon these previous findings.

Image denoising through dictionary learning is an application where finite approximations of a BNP model, in particular beta-Bernoulli with $d = 0$, have proven useful (Zhou et al., 2009). The goal is recovering the original noiseless image (left of Fig. 1) from a corrupted one (right of Fig. 1). To do so, the input image is deconstructed into small contiguous patches, and we postulate that each patch is a combination of underlying basis elements. By estimating the coefficients expressing the combination, possibly in addition to estimating the basis elements themselves, one can denoise the individual patches and ultimately the overall image. The beta-Bernoulli process allows simultaneous estimation of basis elements and basis assignments. The nonparametric nature sidesteps the cumbersome problem of calibrating the number of basis elements. The number of extracted patches depends on both the patch size and the dimensions of the input image: even on the same input image, the analysis might process a varying number of "observations." Better denoised images have a higher peak signal-to-noise ratio, or PSNR (Hore and Ziou, 2010), with respect to the noiseless image; the PSNR between two identical images is $\infty$.
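Concretely, PSNR is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal sketch (ours), assuming a peak value of 255 for 8-bit images:

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better, and identical
    images give +infinity."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```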
To compare IFA and TFA, we considered the beta process $\mathrm{BP}(\gamma, 1, 0)$, since past work suggests that the hyper-parameters $\gamma, \alpha$ do not play a large role (Zhou et al., 2009). Each configuration of the latent variables $x, \psi, \theta$ leads to a candidate denoised image. By default, a sequential Gibbs sampler traverses the posterior over latent variables (patches, i.e., observations, are gradually introduced in epochs, and the sampler only modifies the latent variables of the current epoch's observations); the final denoised image is a weighted average of the candidate images encountered during the sampler run. There is randomness in how the latent variables are initialized, as well as in the simulation of the Gibbs conditionals. The gradual data introduction employed in the Gibbs sampler can be thought of as a way to initialize the latent variables for the entire set of observations. For a 256 × 256 image like the right panel of Fig. 1, the number of extracted patches, $N$, is about 60k. More details about the finite approximations, hyper-parameter settings and inference can be found in Appendix H.1.

Fig 1: Original versus corrupted images. The number plotted on top of the noisy image is the peak signal-to-noise ratio, or PSNR, with respect to the noiseless image.

In Fig. 2, the quality of the denoised images improves with increasing $K$; furthermore, the quality is very similar across the two types of approximation. Both kinds perform much better than the baseline, i.e., the noisy input image. The improvement with $K$ is largest for small $K$ and plateaus for larger values of $K$. For a given approximation level, the quality of TFA denoising and that of IFA are almost the same. Furthermore, the denoised image from TFA is more similar to the denoised image from IFA than it is to the original image, as indicated by the large gap in PSNR. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials.

Fig. 3 shows that the modes of the TFA posterior are centers of regions of attraction in the IFA posterior, and vice versa. For both kinds of approximation, $K = 60$. Rather than randomly initializing the latent variables at the beginning of the Gibbs sampler of one model (cold start), we can use the last configuration of latent variables visited in the other model as the initial state of the Gibbs sampler (warm start). To isolate the effect of the initial conditions, all the patches are available from the start, as opposed to being gradually introduced. For both kinds of approximation, the Gibbs sampler initialized at the warm start visits candidate images that have essentially the same PSNR as the starting configuration. The early iterates of the cold-start Gibbs sampler are noticeably lower in quality than the warm-start iterates, and the quality at the plateau is still lower than that of the warm start. Each trace of PSNR for cold-start Gibbs corresponds to a random seed in initialization and simulation of the conditionals, while each trace of warm-start PSNR corresponds to a different final state of the alternative model's training. The variation across warm starts is tiny; the variation across cold starts is larger but still very small.

Experiments on other noisy images can be found in Appendix I; the trends are the same.

Fig 2: Finite approximations have similar performance across approximation levels. For each $K$, the final denoised image is a weighted average of candidate images encountered during Gibbs sampling.

Fig 3: The output of one model is a good initialization for the training of the other one. (a) TFA training; (b) IFA training.
Finally, we compare the performance of the normalized IFA (i.e., $\mathrm{FSD}_K$) and the TFA (i.e., $\mathrm{TSB}_K$) when used in a DP-based model. In this section, we provide evidence of the same trends in the modified HDP, a more complicated model than a Dirichlet process mixture, when analyzing Wikipedia documents.

For both IFA and TFA, we use stochastic variational inference with a mean-field factorization (Hoffman et al., 2013) to approximate the posterior over the latent topics based on training documents. The training corpus is nearly one million documents from Wikipedia. There is randomness in the initial values of the variational parameters, as well as in the order in which data minibatches are processed. The quality of inferred topics is measured by the predictive log-likelihood on a set of 10k held-out documents. More details about the finite approximations, hyper-parameter settings, variational inference and the definition of test log-likelihood can be found in Appendix H.2.

In Fig. 4, the quality of the inferred topics improves as the approximation level grows; furthermore, the quality is very similar across the two types of approximation. The improvement with $K$ is largest for small $K$: the slope plateaus for large $K$. For a given approximation level, the quality of TFA topics and that of normalized IFA topics are almost the same. The error bars reflect variation across both the random initialization and the ordering of data minibatches processed by stochastic variational inference.

Fig 4: Finite approximations have similar performance across approximation levels.

In Fig. 5, the modes of the TFA posterior are centers of regions of attraction in the IFA posterior, and vice versa. The number of topics is fixed to be $K = 300$. Rather than randomly initializing the variational parameters at the start of variational inference of one model (cold start), we can use the variational parameters at the end of the other model's training as the initialization (warm start). The learning rate for warm-start training is slightly different from that for cold start, to reflect the fact that many batches of data had been processed leading up to the warm-start variational parameters. For both kinds of approximation, the test log-likelihood essentially stays the same over warm-start training iterates, hinting that such an initialization lies in a region of attraction. The early iterates of cold start are noticeably lower in quality than the warm-start iterates; however, at the end of training, the test log-likelihoods are nearly the same. Each trace of cold start corresponds to a different initialization and ordering of data batches processed. Each trace of warm start corresponds to a different output of the other model's training and a different ordering of data batches processed. The variation across either cold starts or warm starts is small.

Fig 5: The output of one model is a good initialization for the training of the other one. (a) TFA training; (b) IFA training.
8. Discussion
We have provided a general construction of independent finite approximations for completelyrandom measures, analyzed error bounds on IFAs for conjugate exponential family CRMwith no power law and the Dirichlet process, and investigated how they compare to truncatedfinite approximations in realistic data applications. Our error bounds reveal that in the worstcase, for the same number of atoms instantiated, IFA has larger error than TFA. However,we have not observed the worst case in our experiments, suggesting that either the errorbounds can be tightened for relevant conditional densities f or that additional sources oferror, such as those from approximate inference, dominate approximation error made by thefinite approximations. From a practical point of view, IFA is easier than TFA to work with.Our analyses and experiments suggest a number of directions for future work. For exam-ple, the error bound analysis could be extended for conjugate family CRM with power-lawbehavior. We speculate that in such situations, the O (ln N ) factor appearing in the numera-tor of the upper bounds will be replaced by O ( N a ) where O ( N a ) is the growth rate of BNPmodels with power law behavior. References
Acharya, A. , Ghosh, J. and
Zhou, M. (2015). Nonparametric Bayesian factor analysisfor dynamic count matrices. In
AISTATS . Adell, J. A. and
Lekuona, A. (2005). Sharp estimates in signed Poisson approximationof Poisson mixtures.
Bernoulli Aldous, D. (1985). Exchangeability and related topics. ´Ecole d’ ´Et´e de Probabilit´es de Saint-Flour XIII—1983
Alzer, H. (1997). On some inequalities for the gamma and psi functions.
Mathematics ofcomputation Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesiannonparametric problems.
The Annals of Statistics Barbour, A. D. and
Hall, P. (1984). On the rate of Poisson convergence. In
MathematicalProceedings of the Cambridge Philosophical Society . Nguyen et al./Independent finite approximations Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv.org . Blackwell, D. and
MacQueen, J. B. (1973). Ferguson Distributions Via Polya UrnSchemes.
Ann. Statist. Blei, D. M. , Griffiths, T. L. and
Jordan, M. I. (2010). The nested Chinese restaurantprocess and Bayesian nonparametric inference of topic hierarchies.
Journal of the ACM Blei, D. M. and
Jordan, M. I. (2006). Variational Inference for Dirichlet Process Mixtures.
Bayesian Analysis Bondesson, L. (1982). On simulation from infinitely divisible distributions.
Advances inApplied Probability . Brix, A. (1999). Generalized gamma measures and shot-noise Cox processes.
Advances inApplied Probability Broderick, T. , Jordan, M. I. and
Pitman, J. (2012). Beta processes, stick-breakingand power laws.
Bayesian analysis Broderick, T. , Wilson, A. C. and
Jordan, M. I. (2018). Posteriors, conjugacy, andexponential families for completely random measures.
Bernoulli Broderick, T. , Mackey, L. , Paisley, J. and
Jordan, M. I. (2015). Combinatorial Clus-tering and the Beta Negative Binomial Process.
IEEE Transactions on Pattern Analysisand Machine Intelligence Campbell, T. , Huggins, J. H. , How, J. P. and
Broderick, T. (2019). Truncatedrandom measures.
Bernoulli Carpenter, B. , Gelman, A. , Hoffman, M. D. , Lee, D. , Goodrich, B. , Betan-court, M. , Brubaker, M. , Guo, J. , Li, P. and
Riddell, A. (2017). Stan: A Proba-bilistic Programming Language.
Journal of Statistical Software . Doshi-Velez, F. , Miller, K. T. , Van Gael, J. and
Teh, Y. W. (2009). Variationalinference for the Indian buffet process. In
Artificial Intelligence and Statistics
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.
The Annalsof Statistics
Ferguson, T. S. and
Klass, M. J. (1972). A representation of independent incrementprocesses without Gaussian components.
The Annals of Mathematical Statistics . Fox, E. B. , Sudderth, E. , Jordan, M. I. and
Willsky, A. S. (2010). A Sticky HDP-HMM with Application to Speaker Diarization.
The Annals of Applied Statistics Geman, S. and
Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and theBayesian Restoration of Images.
Pattern Analysis and Machine Intelligence, IEEE Trans-actions on Gilks, W. R. and
Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling.
Journal of the Royal Statistical Society: Series C (Applied Statistics) Gnedin, A. V. (1998). On convergence and extensions of size-biased permutations.
Journalof Applied Probability Gordon, L. (1994). A Stochastic Approach to the Gamma Function.
The American Math-ematical Monthly
Griffiths, T. L. and
Ghahramani, Z. (2005). Infinite Latent Feature models and theIndian Buffet Process. In
Advances in Neural Information Processing Systems . Griffiths, T. L. and
Ghahramani, Z. (2011). The Indian Buffet Process: An Introductionand Review.
Journal of Machine Learning Research Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models . Nguyen et al./Independent finite approximations for life history data. the Annals of Statistics Hoffman, M. , Bach, F. R. and
Blei, D. M. (2010). Online learning for latent Dirichletallocation. In
Advances in Neural Information Processing Systems
Hoffman, M. D. and
Gelman, A. (2014). The No-U-Turn sampler: adaptively settingpath lengths in Hamiltonian Monte Carlo.
Journal of Machine Learning Research Hoffman, M. D. , Blei, D. M. , Wang, C. and
Paisley, J. (2013). Stochastic variationalinference.
Journal of Machine Learning Research Hore, A. and
Ziou, D. (2010). Image quality metrics: PSNR vs. SSIM. In
Ishwaran, H. and
James, L. F. (2001). Gibbs sampling methods for stick-breaking priors.
Journal of the American Statistical Association . Ishwaran, H. and
Zarepour, M. (2002). Exact and approximate sum representations forthe Dirichlet process.
Canadian Journal of Statistics James, L. F. (2013). Stick-breaking PG( α , ζ )-Generalized Gamma Processes. arXiv.org . James, L. F. (2017). Bayesian Poisson calculus for latent feature modeling via generalizedIndian Buffet Process priors.
The Annals of Statistics Johnson, N. L. , Kemp, A. W. and
Kotz, S. (2005).
Univariate Discrete Distributions . Wiley Series in Probability and Statistics . Wiley.
Johnson, M. J. and
Willsky, A. S. (2013). Bayesian Nonparametric Hidden Semi-MarkovModels.
Journal of Machine Learning Research Kallenberg, O. (2002).
Foundations of modern probability , 2nd ed. Springer, New York.
Kingman, J. F. C. (1967). Completely random measures.
Pacific Journal of Mathematics Kingman, J. F. C. (1975). Random discrete distributions.
Journal of the Royal StatisticalSociety B Kucukelbir, A. , Ranganath, R. , Gelman, A. and
Blei, D. M. (2015). AutomaticVariational Inference in Stan. In
Advances in Neural Information Processing Systems . Kurihara, K. , Welling, M. and
Teh, Y. W. (2007). Collapsed Variational DirichletProcess Mixture Models. In
International Joint Conference on Artificial Intelligence
Last, G. and
Penrose, M. (2017).
Lectures on the Poisson Process . Institute of Mathe-matical Statistics Textbooks . Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution.
PacificJ. Math. Lee, J. , James, L. F. and
Choi, S. (2016). Finite-dimensional BFRY priors and variationalBayesian inference for power law models. In
Advances in Neural Information ProcessingSystems
Lee, J. , Miscouridou, X. and
Caron, F. (2019). A unified construction for series rep-resentations and finite approximations of completely random measures. arXiv preprintarXiv:1905.10733 . Loeve, M. (1956). Ranking Limit Problem. In
Proceedings of the Third Berkeley Symposiumon Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory
Madras, N. and
Sezer, D. (2010). Quantitative bounds for Markov chain convergence:Wasserstein and total variation distances.
Bernoulli Miller, J. W. and
Harrison, M. T. (2013). A simple example of Dirichlet process mix-ture inconsistency for the number of components. In
Advances in Neural Information . Nguyen et al./Independent finite approximations Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani andK. Q. Weinberger, eds.) 199–206. Curran Associates, Inc.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In
Handbook of Markov ChainMonte Carlo
Orbanz, P. (2010). Conjugate Projective Limits. arXiv.org . Paisley, J. , Blei, D. M. and
Jordan, M. I. (2012). Stick-breaking beta processes andthe Poisson process. In
Artificial Intelligence and Statistics
Paisley, J. and
Carin, L. (2009). Nonparametric factor analysis with beta process priors.In
Proceedings of the 26th Annual International Conference on Machine Learning . ICML’09
Paisley, J. , Carin, L. and
Blei, D. (2011). Variational inference for stick-breaking betaprocess priors. In
Proceedings of the 28th International Conference on International Con-ference on Machine Learning
Palla, K. , Knowles, D. A. and
Ghahramani, Z. (2012). An Infinite Latent AttributeModel for Network Data. In
International Conference on Machine Learning . Universityof Cambridge.
Perman, M. , Pitman, J. and
Yor, M. (1992). Size-biased sampling of Poisson pointprocesses and excursions.
Probability Theory and Related Fields Pitman, J. (1995). Exchangeable and partially exchangeable random partitions.
Probabilitytheory and related fields
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme.
LectureNotes-Monograph Series
Pollard, D. (2001).
A User’s Guide to Measure Theoretic Probability . Cambridge Uni-versity Press. Ranganath, R. , Gerrish, S. and
Blei, D. M. (2014). Black Box Variational Inference.In
International Conference on Artificial Intelligence and Statistics
Roberts, G. O. and
Tweedie, R. L. (1996). Exponential convergence of Langevin distri-butions and their discrete approximations.
Bernoulli Roychowdhury, A. and
Kulis, B. (2015). Gamma processes, stick-breaking, and varia-tional inference. In
Artificial Intelligence and Statistics
Saria, S. , Koller, D. and
Penn, A. (2010). Learning individual and population leveltraits from clinical temporal data Technical Report.
Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors.
Statistica Sinica Teh, Y. W. , G¨or¨ur, D. and
Ghahramani, Z. (2007). Stick-breaking Construction for theIndian Buffet Process. In
International Conference on Artificial Intelligence and Statis-tics . Teh, Y. W. and
G¨or¨ur, D. (2009). Indian buffet processes with power-law behavior. In
Advances in Neural Information Processing Systems . Teh, Y. W. , Jordan, M. I. , Beal, M. J. and
Blei, D. M. (2006). Hierarchical DirichletProcesses.
Journal of the American Statistical Association
Thibaux, R. and
Jordan, M. I. (2007). Hierarchical Beta Processes and the Indian BuffetProcess. In
International Conference on Artificial Intelligence and Statistics . Titsias, M. (2008). The infinite gamma-Poisson feature model. In
Advances in NeuralInformation Processing Systems . Wainwright, M. J. and
Jordan, M. I. (2008). Graphical Models, Exponential Families,and Variational Inference.
Foundations and Trends R (cid:13) in Machine Learning Wang, C. , Paisley, J. and
Blei, D. (2011). Online variational inference for the hier- . Nguyen et al./Independent finite approximations archical Dirichlet process. In Proceedings of the Fourteenth International Conference onArtificial Intelligence and Statistics
Zhou, M. , Chen, H. , Ren, L. , Sapiro, G. , Carin, L. and
Paisley, J. W. (2009). Non-parametric Bayesian dictionary learning for sparse image representations. In
Advances inNeural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty,C. K. I. Williams and A. Culotta, eds.) 2295–2303. Curran Associates, Inc.
Zhou, M. , Hannah, L. , Dunson, D. and
Carin, L. (2012). Beta-negative binomial pro-cess and Poisson factor analysis. In
Artificial Intelligence and Statistics
Appendix A: Additional examples of IFA construction
Let B ( α, β ) = Γ( α )Γ( β )Γ( α + β ) denote the beta function. Example A.1 (Beta process) . Taking E = R + , g ( θ ) = 1, h ( θ ; η ) = (1 − θ ) η − [ θ ≤ Z ( ξ, η ) = B ( ξ, η ) in Theorem 3.2 yields the beta process BP( γ, η − d, d ), which has ratemeasure ν (d θ ) = γ [ θ ≤ B ( η, − d ) θ − − d (1 − θ ) η − d θ. Since h is continuous and bounded on [0 , / Example A.2 (Beta prime process) . Taking E = R + , g ( θ ) = (1 + θ ) − , h ( θ ; η ) = (1 + θ ) − η ,and Z ( ξ, η ) = B ( ξ, η ) in Theorem 3.2 yields the beta prime process, which has rate measure ν (d θ ) = γB ( η, − d ) θ − − d (1 + θ ) − d − η d θ. Since g is continuous, g (0) = 1, 1 ≤ g ( θ ) ≤ θ , and h ( θ ; η ) is continuous and bounded on[0 , d = 0, c = γη and ν n ( θ ) = Beta (cid:48) ( θ ; γη/n, η ) . Example A.3 (Gamma process) . Taking E = R + , g ( θ ) = 1, h ( θ ; η ) = e − ηθ , and Z ( ξ, η ) =Γ( ξ ) η − ξ in Theorem 3.2 yields the gamma process, with rate measure ν (d θ ) = γ λ − d Γ(1 − d ) θ − d − e − λθ d θ. Since h ( θ ; η ) is continuous and bounded on [0 , Example A.4 (Generalized gamma process) . Taking E = R , g ( θ ) = 1, h ( θ ; η ) = e − ( η θ ) η ,and Z ( ξ, η ) = Γ( ξ/η )( η η ) − ξ in Theorem 3.2 yields the generalized gamma distribution Gam ( ξ, η , η ). The corresponding rate measure is ν (d θ ) = γ ( η η ) − d Γ((1 − d ) /η ) θ − d − e − ( η θ ) η d θ, which is the rate measure for the gamma process ΓP( γ, η, d ). Since h ( θ ; η ) is continuous andbounded on [0 , d = 0, c = γη η Γ( η − ) and ν n ( θ ) = Gam (cid:18) θ ; γη η n Γ( η − ) , η , η (cid:19) . https://en.wikipedia.org/wiki/Generalized_gamma_distribution . Nguyen et al./Independent finite approximations Appendix B: Proof of IFA convergence
B.1. IFA converges to CRM in distribution
In order to prove our main result, we require a few auxiliary results.
Lemma B.1 ((Kallenberg, 2002, Lemmas 12.1 and 12.2)) . Let Θ be a random measure and Θ , Θ , . . . a sequence of random measures. If for all measurable sets A and t > , lim K →∞ E [ e − t Θ K ( A ) ] = E [ e − t Θ( A ) ] , then Θ K D = ⇒ Θ . For a density f , let µ ( t, f ) : θ (cid:55)→ (1 − e − tθ ) f ( θ ). In results that follow we assume allmeasures on R + have densities with respect to Lebesgue measure. We abuse notation anduse the same symbol to denote the measure and the density. Proposition B.2.
Let Θ ∼ CRM(
H, ν ) and for K = 1 , , . . . , let Θ K ∼ IFA K ( H, ν K ) where ν is a measure and ν , ν , . . . are probability measures on R + , all absolutely continuous withrespect to Lebesgue measure. If (cid:107) µ (1 , nν K ) − µ (1 , ν ) (cid:107) → , then Θ K D = ⇒ Θ .Proof. Let t > A a measurable set. First, recall that the Laplace functional of theCRM Θ is E [ e − t Θ( A ) ] = exp (cid:26) − H ( A ) (cid:90) ∞ µ ( t, ν )( θ ) d θ (cid:27) . We have E [ e − tθ K, ( ψ K, ∈ A ) ] = P ( ψ K, ∈ A ) E [ e − tθ K, ] + P ( ψ K, / ∈ A )= H ( A ) E [ e − tθ K, ] + 1 − H ( A )= 1 − H ( A )(1 − E [ e − tθ K, ])= 1 − H ( A ) K (cid:90) ∞ µ ( t, Kν K )( θ ) d θ. Since | − e − tθ || − e − θ | ≤ max(1 , t ), it follows by hypothesis that (cid:107) µ ( t, Kν K ) − µ ( t, ν ) (cid:107) →
0. Thus,by dominated convergence and the standard exponential limit,lim K →∞ E [ e − tθ K, ( ψ K, ∈ A ) ] K = lim K →∞ (cid:18) − H ( A ) K (cid:90) ∞ µ ( t, Kν K )( θ ) d θ (cid:19) K = exp (cid:26) − lim K →∞ H ( A ) (cid:90) ∞ µ ( t, Kν K )( θ ) d θ (cid:27) = exp (cid:26) − H ( A ) (cid:90) ∞ µ ( t, ν )( θ ) d θ (cid:27) . Finally, by the independence of the random variables { θ K,i } Ki =1 ,lim K →∞ E [ e − t Θ K ( A ) ] = lim K →∞ E [ e − tθ K, ( ψ K, ∈ A ) ] K , so result follows from Lemma B.1. . Nguyen et al./Independent finite approximations Lemma B.3.
If there exist measures π ( θ ) d θ and π (cid:48) ( θ ) d θ on R + such that for some κ > ,1. the measures µ, µ , µ , . . . have densities f, f , f , . . . wrt π and densities f (cid:48) , f (cid:48) , f (cid:48) , . . . wrt π (cid:48) ,2. (cid:82) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ → ,3. sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | → ,4. sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) ≤ c (cid:48) < ∞ , and5. (cid:82) ∞ κ π ( θ ) d θ ≤ c < ∞ ,then (cid:107) µ − µ K (cid:107) → . Proof.
We have, using the assumptions and H¨older’s inequality, (cid:107) µ − µ K (cid:107) = (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | π (cid:48) (d θ ) + (cid:90) ∞ κ | f ( θ ) − f K ( θ ) | π (d θ ) ≤ (cid:32) sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) (cid:33) (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ + (cid:32) sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | (cid:33) (cid:90) ∞ κ π (d θ ) ≤ c (cid:48) (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ + c sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | . The conclusion follows by dominated convergence.
Proof of Theorem 3.2.
Note that since h is continuous and bounded on [0 , (cid:15) ], c < ∞ . We willapply Lemma B.3 with κ as given in the theorem statement, µ = µ (1 , ν ), µ K = µ (1 , nν K ), π ( θ ) = p ( θ ; 1 − d, η ) = θ − d g ( θ ) − d h ( θ ; η ) Z (1 − d, η ) , and π (cid:48) ( θ ) := ( θg ( θ )) d π ( θ ). Thus, f ( θ ) = γ (1 − e − θ )( θg ( θ )) − , f K ( θ ) = nZ − K (1 − e − θ ) θ − cK − + d − dS bK ( θ − aK − ) g ( θ ) − cK − , and f (cid:48) ( θ ) = ( θg ( θ )) − d f ( θ ), and f (cid:48) K ( θ ) = ( θg ( θ )) − d f K ( θ ).We now note a few useful properties that we will use repeatedly in the proof. Observethat ( a/K ) cK − = 1 + o (1). The assumption that h is bounded and continuous implies thaton [0 , a/K ], h ( θ ; η ) = h (0; η ) + o (1). Similarly, for any δ > g ( θ ) is bounded and continuousfor θ ∈ [0 , δ ] and therefore, together with the fact that g (0) = 1, we can conclude that on[0 , a/K ], g ( θ ) = 1 + o (1).For the remainder of the proof we will consider K large enough that aK − + 2 b K and cK − are less than κ . The normalizing constant Z K can be written as Z K = (cid:90) a/K ( θg ( θ )) − cK − π (cid:48) (d θ )+ (cid:90) κa/K θ − cK − − dS bK ( θ − aK − ) g ( θ ) − cK − π (cid:48) (d θ )+ (cid:90) ∞ κ ( θg ( θ )) − cK − − d π (cid:48) (d θ ) . . Nguyen et al./Independent finite approximations We rewrite each term in turn. For the first term, (cid:90) a/K θ − cK − g ( θ ) − cK − π (cid:48) (d θ ) = ( c/γ + o (1)) (cid:90) a/K θ − cK − d θ = ( c/γ + o (1)) Kc (cid:16) aK (cid:17) cK − = Kγ + o ( K ) . Since κ ≤ S b K ∈ [0 , θ ∈ [ a/K, κ ], θ − dS bK ( θ − aK − ) ≤ θ − d . Since g (0) = 1, c ∗ ≤ g ( θ ) − cK − ≤ c − c ∗ . Hence the second term is upper bounded by c − c ∗ (cid:90) κa/K θ − cK − − d π (cid:48) (d θ ) ≤ c − ∗ ( c/γ + O (1)) K d a d Kc ( κ cK − − ( a/K ) cK − )= O ( K d ) × O (log K )= o ( K ) . For the third term, (cid:90) ∞ κ ( θg ( θ )) − cK − − d π (cid:48) (d θ ) = (cid:90) ∞ κ ( θg ( θ )) − cK − π (d θ ) ≤ ( κc ∗ ) − cK − (cid:90) ∞ κ π (d θ ) ≤ ( κc ∗ ) − . Hence, Z K = Kγ + o ( K ) and KZ − K = γ (1 + e K ), where e K = o (1).Next, we have sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | = sup θ ∈ [ κ, ∞ ) (1 − e − θ )( θg ( θ )) − | γ − KZ − K ( θg ( θ )) cK − |≤ sup θ ∈ [ κ, ∞ ) γ ( θg ( θ )) − | − (1 + e K )( θg ( θ )) cK − |≤ γ sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − | − ( θg ( θ )) cK − | + γe K sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − cK − . (B.1)To bound the two terms we will use the fact that if θ ≥ κ , then θg ( θ ) ≥ θc ∗ (1 + θ ) ≥ κc ∗ (1 + κ ) =: ˜ κ and if θ ≤ θg ( θ ) ≤ c ∗ ≤
1. Hence, letting ψ := θg ( θ ), for the first term in Eq. (B.1) . Nguyen et al./Independent finite approximations we have γ sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − | − ( θg ( θ )) cK − |≤ γ sup ψ ∈ [˜ κ, ∞ ) ψ − | − ψ cK − |≤ γ sup ψ ∈ [˜ κ, ψ − | − ψ cK − | + γ sup ψ ∈ [1 , ∞ ) ψ − | − ψ cK − |≤ γ ˜ κ − sup ψ ∈ [˜ κ, | − ψ cK − | + γ (cid:18) K − cK (cid:19) Kc − (cid:12)(cid:12)(cid:12)(cid:12) − KK − c (cid:12)(cid:12)(cid:12)(cid:12) ≤ γ ˜ κ − (1 − ˜ κ cK − ) + O (1) × cK − c = γ ˜ κ − × o (1) + O ( K − ) → . Similarly, for the second term in Eq. (B.1) we have γe K sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − cK − ≤ γe K sup ψ ∈ [˜ κ, ∞ ) ψ − cK − ≤ γ ˜ κ − e K → . Since g ( θ ) is bounded on [0 , κ ], g ( θ ) cK − = 1+ o (1) and therefore (1+ e K ) g ( θ ) cK − = 1+ e (cid:48) K ,where e (cid:48) K = o (1). Using this observation together with the bound (1 − e − θ ) θ − ≤
1, we have (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ = (cid:90) κ ( θg ( θ )) − d | f ( θ ) − f K ( θ ) | d θ = (cid:90) κ (1 − e − θ )( θg ( θ )) − − d | γ − KZ − K θ cK − + d − dS bK ( θ − aK − ) g ( θ ) cK − | d θ ≤ γ [ c ∗ (1 + κ )] d (cid:90) κ θ − d | − (1 + e (cid:48) K ) θ cK − + d − dS bK ( θ − aK − ) | d θ ≤ γ (cid:90) κ θ − d | − θ cK − + d − dS bK ( θ − aK − ) | d θ + γe (cid:48) K (cid:90) κ θ cK − + d − dS bK ( θ − aK − ) d θ. (B.2)We bound the first integral in Eq. (B.2) in four parts: from 0 to aK − , from aK − to aK − + b K , from aK − + b K to κ − b K , and from κ − b K to κ . The first part is equal to (cid:90) aK − θ − d | − θ d + cK − | d θ ≤ (cid:90) aK − θ − d + θ cK − d θ = θ − d − d + Kc + K θ cK − (cid:12)(cid:12)(cid:12)(cid:12) aK − = 11 − d ( aK − ) − d + Kc + K ( aK − ) cK − → . . Nguyen et al./Independent finite approximations The second part is equal to (cid:90) aK − + b K aK − θ − d | − θ cK − + d − dS bK ( θ − aK − ) | d θ ≤ (cid:90) aK − + b K aK − θ − d + θ cK − − d d θ ≤ (cid:90) aK − + b K aK − θ − d d θ = 21 − d θ − d (cid:12)(cid:12)(cid:12)(cid:12) aK − + b K aK − = 21 − d (cid:0) ( aK − + b K ) − d − ( aK − ) − d (cid:1) → . The third part is equal to (cid:90) κ − b K aK − + b K θ − d | − θ cK − | d θ = (cid:90) κ − b K aK − + b K θ − d − θ cK − − d d θ = 11 − d θ − d − Kc + K (1 − d ) θ − d + cK − (cid:12)(cid:12)(cid:12)(cid:12) κ − b K aK − + b K = ( κ − b K ) − d − d − Kc + K (1 − d ) ( κ − b K ) − d + cK − − ( aK − + b K ) − d − d + Kc + K ( aK − + b K ) − d + cK − → . The fourth part is equal to (cid:90) κκ − b K θ − d | − θ cK − | d θ ≤ (cid:90) κκ − b K θ − d + θ cK − − d d θ → γe (cid:48) K (cid:90) κ θ cK − − dS bK ( θ − aK − ) d θ ≤ γe (cid:48) K (cid:90) κ θ − d d θ = γe (cid:48) K κ − d − d = o ( K ) . Since sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) < ∞ by the boundedness of g and h and π is a probability densityby construction, conclude using Lemma B.3 that (cid:107) µ − µ K (cid:107) →
0. It then follows fromLemma B.1 that Θ K D = ⇒ Θ. B.2. Normalized IFA EPPF converges to NCRM EPPF
Proof of Theorem 3.4.
First, we show that the total mass of IFA converges in distributionto the total mass of CRM. Through Appendix B.1, we have shown that for all measurablesets A and t >
0, the Laplace functionals converge:lim K →∞ E [ e − t Θ K ( A ) ] = E [ e − t Θ( A ) ] , . Nguyen et al./Independent finite approximations By choosing A = Ψ i.e. the ground space, we have that Θ K (Ψ) is the total mass of IFA andΘ(Ψ) is the total mass of CRMΘ K (Ψ) = K (cid:88) i =1 θ K,i , Θ(Ψ) = ∞ (cid:88) i =1 θ i . Since for any t >
0, the Laplace transform of Θ K (Ψ) converges to that of Θ(Ψ), we concludethat Θ K (Ψ) converges to Θ(Ψ) in distribution (Kallenberg, 2002, Theorem 5.3): K (cid:88) i =1 θ K,i D = ⇒ Θ(Ψ) . (B.3)Second, we show that the decreasing order statistics of IFA atom sizes converges (infinite-dimensional distributions i.e., in f.d.d) to the decreasing order statistics of CRM atomsizes. For each K , the decreasing order statistics of IFA atoms is denoted by { θ K, ( i ) } Ki =1 : θ K, (1) ≥ θ K, (2) ≥ · · · ≥ θ K, ( K ) . We will leverage (Loeve, 1956, Theorem 4 and page 191) to find the limiting distribution { θ K, ( i ) } Ki =1 as K → ∞ . It is easy to verify the conditions to use the theorem: because thesums (cid:80) Ki =1 θ K,i converge in distribution to a limit, we know that all the θ K,i ’s are uniformlyasymptotically negligible (Kallenberg, 2002, Lemma 15.13). Now, we discuss what the limitsare. It is well-known that Θ(Ψ) is an infinitely divisible positive random variable with nodrift component and Levy measure exactly ν ( dθ ) Perman, Pitman and Yor (1992). In theterminology of (Loeve, 1956, Equation 2), the characteristics of Θ(Ψ) are a = b = 0 (nodrift or Gaussian parts), L ( x M ( x ) := − ν ([ x, ∞ )) . Let I be a counting process in reverse over (0 , ∞ ) defined based on the Poisson pointprocess { θ i } ∞ i =1 in the following way. For any x , I ( x ) is the number of points θ i exceedingthe threshold x : I ( x ) := |{ i : θ i ≥ x }| . We augment I (0) = ∞ and I ( ∞ ) = 0. As a stochastic process, I has independent increments,in that for all 0 = t < t < · · · < t k , the increments I ( t i ) − I ( t i − ) are independent,furthermore the law of the increments is I ( t i − ) − I ( t i ) ∼ Poisson( M ( t i ) − M ( t i − )). Theseproperties are simple consequences of the counting measure induced by the Poisson pointprocess. According to (Loeve, 1956, Page 191), the limiting distribution of { θ K, ( i ) } Ki =1 isgoverned by I , in the sense that for any fixed t ∈ N , for any x , x , . . . , x t ∈ [0 , ∞ ):lim K →∞ P ( θ K, (1) < x , θ K, (2) < x , . . . , θ K, ( t ) < x t )= P ( I ( x ) < , I ( x ) < , . . . , I ( x t ) < t ) . (B.4)Because the θ i ’s induce I , we can relate the left hand side to the order statistics of thePoisson point process. We denote the decreasing order statistic of the { θ i } ∞ i =1 as: θ (1) ≥ θ (2) ≥ · · · ≥ θ ( n ) ≥ · · · . Nguyen et al./Independent finite approximations Clearly, for any t ∈ N , the event that I ( x ) exceeds t is the same as the top t jumps amongthe { θ i } ∞ i =1 exceed x: I ( x ) ≥ t ⇐⇒ θ ( t ) ≥ x . Therefore Eq. (B.4) can be rewritten as, forany fixed t ∈ N , for any x , x , . . . , x t ∈ [0 , ∞ ):lim K →∞ P ( θ K, (1) < x , θ K, (2) < x , . . . , θ K, ( t ) < x t ) = P ( θ (1) < x , θ (2) < x , . . . , θ ( t ) < x t )(B.5)It is well-known that convergence of the distribution function imply weak convergence– for instance, see Problem 1 of https://link.springer.com/content/pdf/10.1007/978-1-4612-5254-2_3.pdf . Actually, from (Loeve, 1956, Theorem 5 and page 194), forany fixed t ∈ N , the convergence in distribution of { θ K, ( i ) } ti =1 to { θ i } ti =1 holds jointly withthe convergence of (cid:80) Ki =1 θ K, ( i ) to (cid:80) ∞ i =1 θ i : the two conditions of the theorem, which arecontinuity of the distribution function of each θ K,i and M (0) = −∞ (there is a typo inLoeve (1956)), are easily verified. 
Therefore, by continuous mapping theorem, if we definethe normalized atom sizes: p K, ( s ) := θ K, ( s ) (cid:80) Ki =1 θ K,i p ( s ) := θ ( s ) (cid:80) ∞ i =1 θ i we also have that the normalized decreasing order statistics converge:( p K,i ) Ki =1 f.d.d. → ( p K, ( i ) ) ∞ i =1 Finally we show that the EPPFs converge. In addition, if we define the size-biased per-mutation (in the sense of (Gnedin, 1998, Section 2) ) of the normalized atom sizes: { (cid:101) p K,i } ∼
SBP( p K, ( s ) ) { (cid:101) p i } ∼ SBP( p ( s ) )then by (Gnedin, 1998, Theorem 1), the finite-dimensional distributions of the size-biasedpermutation also converges: ( (cid:101) p K,i ) Ki =1 f.d.d. → ( (cid:101) p i ) ∞ i =1 (B.6)From here, we fix the number of samples N , the number of components t and the size ofthe clusters n i . (Pitman, 1996, Equation 45) gives the EPPF of Ξ = Θ / Θ(Ψ): p ( n , n , . . . , n t ) = E t (cid:89) i =1 (cid:101) p n i − i t − (cid:89) i =1 − i (cid:88) j =1 (cid:101) p j , Likewise, the EPPF of Ξ K = Θ K / Θ K (Ψ) is: p K ( n , n , . . . , n t ) = E t (cid:89) i =1 (cid:101) p n i − K,i t − (cid:89) i =1 − i (cid:88) j =1 (cid:101) p K,j
Since t is fixed, and each p j is [0 ,
1] valued, the mapping from the t -dimensional vector p tothe product (cid:81) ti =1 p n i − i (cid:81) t − i =1 (cid:16) − (cid:80) ij =1 p j (cid:17) is continuous and bounded. The choice of N , t , n i have been fixed but arbitrary. Hence, the convergence in finite-dimensional distributionsof in Eq. (B.6) imply that the EPPFs converge. . Nguyen et al./Independent finite approximations Appendix C: Marginal processes of exponential CRMs
The marginal process characterization describes the probabilistic model not through thetwo-stage sampling Θ ∼ CRM(
H, ν ) and X n | Θ iid ∼ LP( l ; Θ), but through the conditionaldistributions X n | X n − , X n − , . . . , X i.e. the underlying Θ has been marginalized out . Thisperspective removes the need to infer a countably infinite set of target variables. In addition,the exchangeability between X , X , . . . , X N i.e. the joint distribution’s invariance with re-spect to ordering of observations Aldous (1985), often enables the development of inferencealgorithms, namely Gibbs samplers.(Broderick, Wilson and Jordan, 2018, Corollary 6.2) derives the conditional distributions X n | X n − , X n − , . . . , X for general exponential family CRMs Eqs. (1) and (2). Proposition C.1 (Target’s marginal process (Broderick, Wilson and Jordan, 2018, Corol-lary 6.2)) . For any n , X n | X n − , . . . , X is a random measure with finite support.1. Let { ζ i } K n − i =1 be the union of atom locations in X , X , . . . , X n − . For ≤ m ≤ n − , let x m,j be the atom size of X m at atom location ζ j . Denote x n,i to be the atom size of X n at atom location ζ i . The x n,i ’s are independent across i and the p.m.f. of x n,i at x is: κ ( x ) S (cid:18) − (cid:80) n − m =1 φ ( x m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( x m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) − (cid:80) n − m =1 φ ( x m,i ) , η + (cid:18)(cid:80) n − m =1 t ( x m,i ) n − (cid:19)(cid:19) .
2. For each x ∈ N , X n has p n,x atoms whose atom size is exactly x . The locations ofeach atom are iid H : as H is diffuse, they are disjoint from the existing union of atoms { ζ i } K n − i =1 . p n,x is Poisson-distributed, independently across x , with mean: γ (cid:48) κ (0) n − κ ( x ) S (cid:18) c/K − n − φ (0) + φ ( x ) , η + (cid:18) ( n − t (0) + t ( x )) n (cid:19)(cid:19) . In Proposition C.2, we state a similar characterization of Z n | Z n − , Z n − , . . . , Z for finite-dimensional model Eq. (6) and give the proof. Proposition C.2 (Approximation’s marginal process) . For any n , Z n | Z n − , . . . , Z is arandom measure with finite support.1. Let { ζ i } K n − i =1 be the union of atom locations in Z , Z , . . . , Z n − . For ≤ m ≤ n − , let z m,j be the atom size of Z m at atom location ζ j . Denote z n,i to be the atom size of Z n at atom location ζ i . z n,i ’s are independently across i and the p.m.f. of z n,i at x is: κ ( x ) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . K − K n − atom locations are generated iid from H . Z n has p n,x atoms whose size isexactly x (for x ∈ N ∪ { } ) over these K − K n − atom locations (the p n, atoms whoseatom size is can be interpreted as not present in Z n ). The joint distribution of p n,x is . Nguyen et al./Independent finite approximations a Multinomial with K − K n − trials, with success of type x having probability: κ ( x ) S (cid:18) c/K − n − φ (0) + φ ( x ) , η + (cid:18) ( n − t (0) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − n − φ (0) , η + (cid:18) ( n − t (0) n − (cid:19)(cid:19) . Proof of Proposition C.2.
We only need to prove the conditional distributions for the atomsizes: that the K distinct atom locations are generated iid from the base measure is clear.First we consider n = 1. By construction Corollary 3.3, a priori, the trait frequencies { θ i } Ki =1 are independent, each following the distribution: P ( θ i ∈ dθ ) = { θ ∈ U } S ( c/K − , η ) θ c/K − exp (cid:18) (cid:104) η, (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) . Conditioned on { θ i } Ki =1 , the atom sizes z ,i that Z puts on the i th atom location areindependent across i and each is distributed as: P ( z ,i = x | θ i ) = κ ( x ) θ φ ( x ) exp ( (cid:104) µ ( θ i ) , t ( x ) (cid:105) − A ( θ i )) . Integrating out θ i , the marginal distribution for z ,i is: P ( z ,i = x ) = (cid:90) P ( z ,i = x | θ i = θ ) P ( θ i ∈ dθ )= κ ( x ) S ( c/K − , η ) (cid:90) U θ c/K − φ ( x ) exp (cid:18) (cid:104) η + (cid:18) t ( x )1 (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθ = κ ( x ) S (cid:18) c/K − φ ( x ) , η + (cid:18) t ( x )1 (cid:19)(cid:19) S ( c/K − , η ) , by definition of S as the normalizer Eq. (3).Now we consider n ≥
2. The distribution of z n,i only depends on the distribution of z n − ,i , z n − ,i , . . . , z ,i since the atom sizes across different atoms are independent of eachother both a priori and a posteriori. The predictive distribution is an integral: P ( z n,i = x | z n − ,i ) = (cid:90) P ( z n,i = x | θ i ) P ( θ i ∈ dθ | z n − ,i ) . Because the prior over θ i is conjugate for the likelihood z i,j | θ i , and the observations z i,j are conditionally indepndent given θ i , the posterior P ( θ i ∈ dθ | z n − ,i ) is in the sameexponential family but with different natural parameters: { θ ∈ U } θ c/K − (cid:80) n − m =1 φ ( z m,i ) exp (cid:18) (cid:104) η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθS (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . . Nguyen et al./Independent finite approximations This means that the predictive distribution P ( z n,i = x | z n − ,i ) equals: κ ( x ) (cid:82) U θ c/K − (cid:80) n − m =1 φ ( z m,i )+ φ ( x ) exp (cid:18) (cid:104) η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθS (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) = κ ( x ) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . The predictive distribution P ( z n,i = x | z n − ,i ) govern both the distribution of atom sizesfor known atom locations and new atom locations. Appendix D: Technical lemmas
D.1. Concentration
Lemma D.1 (Modified upper tail Chernoff bound) . Let X = (cid:80) ni =1 X i , where X i = 1 withprobability p i and X i = 0 with probability − p i , and all X i are independent. Let µ be anupper bound on E ( X ) = (cid:80) ni =1 p i . Then for all δ > : P ( X ≥ (1 + δ ) µ ) ≤ exp (cid:18) − δ δ µ (cid:19) . Proof of Lemma D.1.
The proof relies on the regular upper tail Chernoff bound http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf and an argument using stochas-tic domination. Truly, we pad the first n Poisson trials that define X with additional trials X n +1 , X n +2 , . . . , X n + m where m is the smallest natural number such that µ − E [ X ] m ≤ X n + i is a Bernoulli with probability µ − E [ X ] m , and the trials are independent. Then Y = X + (cid:80) mj =1 X n + j is itself the sum of Poisson trials with mean exactly µ , so the regularChernoff bound applies: P ( Y ≥ (1 + δ ) µ ) ≤ exp (cid:18) − δ δ µ (cid:19) . However by construction, X is stochastically dominated by Y , so the tail probabilities of X is bounded by the tail probabilities of Y . Lemma D.2 (Lower tail Chernoff bound) . Let X = (cid:80) ni =1 X i , where X i = 1 with probability p i and X i = 0 with probability − p i , and all X i are independent. Let µ := E ( X ) = (cid:80) ni =1 p i .Then for all δ ∈ (0 , : P ( X ≤ (1 − δ ) µ ) ≤ exp( − µδ / . Lemma D.3 (Tail bounds for Poisson distribution) . If X ∼ Poisson( λ ) then for any x > : P ( X ≥ λ + x ) ≤ exp (cid:18) − x λ + x ) (cid:19) , and for any < x < λ : P ( X ≤ λ − x ) ≤ exp (cid:18) − x λ (cid:19) . . Nguyen et al./Independent finite approximations Proof of Lemma D.3.
For x ≥ −
1, let ψ ( x ) := 2((1 + x ) ln(1 + x ) − x ) /x .We first inspect the upper tail bound. If X ∼ Poisson( λ ), for any x >
0, (Pollard, 2001,Exercise 3 p.272) implies that: P ( Z ≥ λ + x ) ≤ exp (cid:18) − x λ ψ (cid:16) xλ (cid:17)(cid:19) . To show the upper tail bound, it suffices to prove that x λ ψ (cid:0) xλ (cid:1) is greater than x λ + x ) . Ingeneral, we show that for u ≥
0: ( u + 1) ψ ( u ) − ≥ . (D.1)The denominator of ( u + 1) ψ ( u ) − u +1) ψ ( u ) −
1, which is g ( u ) := 2(( u + 1) ln( u + 1) − u ( u + 1) − u . Its 1st and 2nd derivativesare: g (cid:48) ( u ) = 4( u + 1) ln( u + 1) − u + 1 g (cid:48)(cid:48) ( u ) = 4 ln( u + 1) + 2 . Since g (cid:48)(cid:48) ( u ) ≥ g (cid:48) ( u ) is monotone increasing. Since g (cid:48) (0) = 1, g (cid:48) ( u ) > u ≥
0, hence g ( u ) is monotone increasing. Because g (0) = 0, we conclude that g ( u ) ≥ u > u = x/λ : ψ (cid:16) xλ (cid:17) ≥
11 + xλ = λx + λ , which shows x λ ψ (cid:0) xλ (cid:1) ≥ x λ + x ) .Now we inspect the lower tail bound. We follow the proof of . We first argue that: P ( X ≤ λ − x ) ≤ exp (cid:18) − x λ ψ (cid:16) − xλ (cid:17)(cid:19) . (D.2)For any θ , the moment generating function E [exp( θX )] is well-defined and well-known: E [exp( θX )] := exp( λ (exp( θ ) − . Therefore: P ( X ≤ λ − x ) ≤ P (exp( θX ) ≤ exp( θ ( λ − x )) ≤ P (exp( θ ( λ − x − X )) ≥ ≤ exp( θ ( λ − x )) E [exp( − θX )] , where we have used Markov’s inequality. We now aim to minimize exp( θ ( λ − x )) E [exp( − θX )]as a function of θ . Its logarithm is: λ (exp( − θ ) −
1) + θ ( λ − x ) . This is a convex function, whose derivative vanishes at θ = − ln (cid:0) − xλ (cid:1) . Overall this meansthe best upper bound on P ( X ≤ λ − x ) is:exp (cid:16) − λ (cid:16) xλ + (1 − xλ ) ln(1 − xλ ) (cid:17)(cid:17) , . Nguyen et al./Independent finite approximations which is exactly the right hand side of Eq. (D.2). Hence to demonstrate the lower tail bound,it suffices to show that: ψ (cid:16) − xλ (cid:17) ≥ . More generally, we show that for − ≤ u ≤ ψ ( u ) − ≥
0. Consider the numerator of ψ ( u ) −
1, which is h ( u ) := 2((1 + u ) ln(1 + u ) − u ) − u . The first two derivatives are: h (cid:48) ( u ) = 2(1 + ln(1 + u )) − uh (cid:48)(cid:48) ( u ) = 21 + u − h (cid:48)(cid:48) ( u ) ≥ h ( u ) is convex on [ − , h (0) = 0. Also, by simple continuityargument, h ( −
1) = 2. Therefore, h is non-negative on [0 , ψ ( u ) ≥ Lemma D.4 (Multinomial-Poisson approximation) . Let { p i } ∞ i =1 , p i ≥ , (cid:80) ∞ i =1 p i < .Suppose there are n independent trials: in each trial, success of type i has probability p i . Let X = { X i } ∞ i =1 be the number of type i successes after n trial. Let Y = { Y i } ∞ i =1 be independentPoisson random variables, where Y i has mean np i . Then: d T V ( X, Y ) ≤ n (cid:32) ∞ (cid:88) i =1 p i (cid:33) . Proof of Lemma D.4.
First we remark that both X and Y can be sampled in two-steps. • Regarding X , first sample N ∼ Binom ( n, (cid:80) ∞ i =1 p i ). Then, for each 1 ≤ k (cid:54) = N ,sample Z k where P ( Z k = i ) = p i (cid:80) ∞ j =1 p j . Then, X i = (cid:80) N k =1 { Z k = i } for each i . • Regarding Y , first sample N ∼ Poisson ( n (cid:80) ∞ i =1 p i ). Then, for each 1 ≤ k ≤ N ,sample T k where P ( T k = i ) = p i (cid:80) ∞ j =1 p j . Then, Y i = (cid:80) N k =1 { T k = i } for each i .The two-step sampling perspective for X comes from rejection sampling: to generate asuccess of type k , we first generate some type of success, and then re-calibrate to get theright proportion for type k . The two-step perspective for Y comes from the thinning propertyof Poisson distribution (Last and Penrose, 2017, Exercise 1.5). The thinning property impliesthat for any finite index set K , all { Y i } for i ∈ K are mutually independent and marginally, Y i ∼ Poisson( np i ). Hence the whole collection { Y i } i =1 are independent Poissons and themean of Y i is np i .Observing that the conditional X | N = n is the same as Y | N = n , we use propagationrule Lemma D.7: d T V ( X, Y ) ≤ d T V ( N , N ) . Total variation between N and N is just the classic Binomial-Poisson approximationLe Cam (1960). d T V ( N , N ) ≤ n (cid:32) ∞ (cid:88) i =1 p i (cid:33) . Lemma D.5 (Total variation between Poissons (Adell and Lekuona, 2005, Corrollary 3.1)) . Let P be the Poisson distribution with mean s , P the Poisson distribution with mean t .Then: d T V ( P , P ) ≤ − exp( −| s − t | ) ≤ | s − t | . . Nguyen et al./Independent finite approximations D.2. Total variation
First is the chain rule, which will be applied to compare joint distributions that admitdensities.
Lemma D.6 (Chain rule) . Suppose ( X , Y ) and ( X , Y ) are two distributions, over A×B ,that have densities w.r.t a common measure. Then: d T V ( P X ,Y , P X ,Y ) ≤ d T V ( P X , P X ) + sup a ∈A d T V ( P Y | X = a , P Y | X = a ) . Proof of Lemma D.6.
Because both P X ,Y and P X ,Y have densities, total variation dis-tance is half of L distance between the densities: d TV ( P X ,Y , P X ,Y ) = 12 (cid:90) ( a,b ) ∈A×B | P X ,Y ( a, b ) − P X ,Y ( a, b ) | dadb = 12 (cid:90) ( a,b ) ∈A×B | P X ,Y ( a, b ) − P X ( a ) P Y | X ( b | a ) + P X ( a ) P Y | X ( b | a ) − P X ,Y ( a, b ) | dadb ≤ (cid:90) ( a,b ) ∈A×B (cid:0) P Y | X ( b | a ) | P X ( a ) − P X ( a ) | + P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | (cid:1) dadb = 12 (cid:90) ( a,b ) ∈A×B P Y | X ( b | a ) | P X ( a ) − P X ( a ) | dadb + 12 (cid:90) ( a,b ) ∈A×B P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | dadb. where we have used triangle inequality. Regarding the first term, using Fubini:12 (cid:90) ( a,b ) ∈A×B P Y | X ( b | a ) | P X ( a ) − P X ( a ) | dadb = 12 (cid:90) a ∈A (cid:18)(cid:90) b ∈B P Y | X ( b | a ) db (cid:19) | P X ( a ) − P X ( a ) | da = 12 (cid:90) a ∈A | P X ( a ) − P X ( a ) | da = d TV ( P X , P X ) . Regarding the second term:12 (cid:90) ( a,b ) ∈A×B P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | dadb = (cid:90) a ∈A (cid:18) (cid:90) b ∈B | P Y | X ( b | a ) − P Y | X ( b | a ) | db (cid:19) P X ( a ) da ≤ (cid:18) sup a ∈A d TV ( P Y | X = a , P Y | X = a ) (cid:19) (cid:90) a ∈A P X ( a ) da = sup a ∈A d TV ( P Y | X = a , P Y | X = a ) Sum of the first and second upper bound give the total variation chain rule.Second is the propagation rule, which applies even if distributions don’t have densities.
Lemma D.7 (Propagation rule) . Suppose ( X , Y ) and ( X , Y ) are two distributions over A × B . Suppose the conditional Y | X = a is the same as the conditional Y | X = a , whichwe just denote as Y | X = a . Then: d T V ( P Y , P Y ) ≤ d T V ( P X , P X ) . . Nguyen et al./Independent finite approximations Proof of Lemma D.7.
It is well-known that total variation between P U and P V is the in-fimum of P ( U (cid:54) = V ) over all couplings ( U, V ) where U ∼ P U and V ∼ P V ((Madras andSezer, 2010, Equation 13)). For any joint distribution of ( X , Y , X , Y ) where marginally( X , Y ) ∼ P X ,Y and ( X , Y ) ∼ P X ,Y , ( Y , Y ) is a coupling where Y ∼ P Y and Y ∼ P Y . Therefore: d T V ( P Y , P Y ) ≤ P ( Y (cid:54) = Y ) = P ( Y (cid:54) = Y , X (cid:54) = X ) + P ( Y (cid:54) = Y , X = X ) . Now suppose the joint distribution over ( X , Y , X , Y ) is such that, conditioned on X = X = a for any a , P ( Y = Y | X = X = a ) = 1 (when X (cid:54) = X , it doesn’t matterthe relationship between Y | X = a and Y | X = b ). This is possible since the conditional Y | X = a is the same as the conditional Y | X = a . For such a distribution, P ( Y (cid:54) = Y , X = X ) = 0. Hence: d T V ( P Y , P Y ) ≤ P ( Y (cid:54) = Y , X (cid:54) = X ) ≤ P ( X (cid:54) = X ) . Now, we recognize that ( X , X ) is an arbitrary coupling between P X and P X . Takinginfimum over all couplings, we arrive at the propagation rule.Third is the product rule. Lemma D.8 (Product rule) . Z = ( X , Y ) and Z = ( X , Y ) are two distributions over A × B . Suppose P X ,Y factorizes into P X P Y and similarly P X ,Y = P X P Y . Then: inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ inf coupling P X ,P X P ( X (cid:54) = X ) + inf coupling P Y ,P Y P ( Y (cid:54) = Y ) Proof of Lemma D.8.
Consider any ( X , X ) that is a coupling of P X and P X , and any( Y , Y ) that is a coupling of P Y and P Y . Because of the factorization structure betweenthe X (cid:48) s and the Y (cid:48) , we can construct ( X (cid:48) , X (cid:48) , Y (cid:48) , Y (cid:48) ) such that ( X (cid:48) , X (cid:48) ) D = ( X , X ),( Y (cid:48) , Y (cid:48) ) D = ( Y , Y ), ( X (cid:48) , Y (cid:48) ) ∼ P X ,Y , ( X (cid:48) , Y (cid:48) ) ∼ P X ,Y . By union bound: P (( X (cid:48) , Y (cid:48) ) (cid:54) = ( X (cid:48) , Y (cid:48) )) ≤ P ( X (cid:48) (cid:54) = X (cid:48) ) + P ( Y (cid:48) (cid:54) = Y (cid:48) )Because inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ P (( X (cid:48) , Y (cid:48) ) (cid:54) = ( X (cid:48) , Y (cid:48) )), we have:inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ P ( X (cid:48) (cid:54) = X (cid:48) ) + P ( Y (cid:48) (cid:54) = Y (cid:48) ) . We finish the proof by taking the infimum over couplings ( X , X ) and ( Y , Y ) of theRHS. D.3. Miscellaneous
Lemma D.9 (Order of growth of harmonic-like sums) . N (cid:88) n =1 αn − α ≥ α (ln N − ψ ( α ) − . where ψ is the digamma function. . Nguyen et al./Independent finite approximations Proof of Lemma D.9.
It is well-known (for instance https://en.wikipedia.org/wiki/Chinese_restaurant_process ) that: N (cid:88) n =1 αn − α = α [ ψ ( α + N ) − ψ ( α )](Gordon, 1994, Theorem 5) says that ψ ( α + N ) ≥ ln( α + N ) − α + N ) − α + N ) ≥ ln N − . We list a collection of technical lemmas that are used when verifying Assumption 2 forthe recurring examples.The first set assists in the beta-Bernoulli model. • For α > i = 1 , , , . . . :1 i + α − ≤ (cid:18) α { i = 1 } + 1 i { i > } (cid:19) . (D.3) • For m, x, y > m ≤ y : (cid:12)(cid:12)(cid:12)(cid:12) m + xy + x − my (cid:12)(cid:12)(cid:12)(cid:12) ≤ xy . (D.4) Proof of Eq. (D.3) . If i = 1, i + α − = α . If i ≥ i + α − ≤ i − ≤ i . Proof of Eq. (D.4) . (cid:12)(cid:12)(cid:12)(cid:12) m + xy + x − my (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ( m + x ) y − m ( y + x ) y ( y + x ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) x ( y − m ) y ( y + x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ xy . The second set aid in the gamma-Poisson model. • For x ∈ [0 , − x ) ln(1 − x ) + x ≥ . (D.5) • For x ∈ (0 , p ≥
0: (1 − x ) p + p x − x ≥ . (D.6) • For λ >
0, for m > , t > , x > d T V (cid:0)
NB( m, t − ) , NB( m + x, t − ) (cid:1) ≤ x /t − /t . (D.7) • For y ≥ , m > , K > (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) ≤ e m K . (D.8)where e is the Euler constant. . Nguyen et al./Independent finite approximations Proof of Eq. (D.5) . Set g ( x ) to be the function on the right hand side. Then its derivativeis g (cid:48) ( x ) = − ln(1 − x ) ≥
0, meaning the function is monotone increasing. Since g (0) = 0, it’strue that g ( x ) ≥ , Proof of Eq. (D.6) . Let f ( p ) = (1 − x ) p + p x − x −
1. Then f (cid:48) ( p ) = ln(1 − x )(1 − x ) p + x − x .Also f (cid:48)(cid:48) ( p ) = (ln(1 − x )) (1 − x ) p >
0. So f (cid:48) ( p ) is monotone increasing. At p = 0, f (cid:48) (0) =ln(1 − x ) + x − x >
0. Therefore f (cid:48) ( p ) ≥ p . So f ( p ) is increasing. Since f (0) = 0, it’strue that f ( p ) ≥ p . Proof of Eq. (D.7) . It is known that NB( r, θ ) is a Poisson stopped sum distribution (John-son, Kemp and Kotz, 2005, Equation 5.15): • N ∼ Poisson( − r ln(1 − θ )). • Y i iid ∼ Log( θ ) where the Log( θ ) distribution’s pmf at k equals − θ k k ln(1 − θ ) . • (cid:80) Ni =1 Y i ∼ NB( r, θ ).Therefore, by total variation’s chain rule Lemma D.6, to compare NB( m, t − ) withNB( m + γ/K, t − ) it suffices to compare the two generating Poissons. d T V (cid:0)
NB( m, t − ) , NB( m + γ/K, t − ) (cid:1) ≤ d T V (Poisson( − m ln(1 − t − ) , Poisson( − ( m + γλ/K ) ln(1 − t − ))) ≤ − ln(1 − t − ) γλK ≤ t − − t − γλK . We have used the fact that total variation distance between Poissons is dominated by theirdifferent in means Lemma D.5 and Eq. (D.5) where x = ( λ + i ) − . Proof of Eq. (D.8) . Since Γ (cid:0) mK + y (cid:1) = (cid:16)(cid:81) y − j =0 ( mK + j ) (cid:17) Γ (cid:0) mK (cid:1) = Γ (cid:0) mK (cid:1) mK (cid:81) y − j =1 ( mK + j ), wehave: (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) = my y − (cid:89) j =1 m/K + jj − . We inspect the product in more detail. y − (cid:89) j =1 m/K + jj = y − (cid:89) j =1 (cid:18) m/Kj (cid:19) ≤ y − (cid:89) j =1 exp (cid:18) m/Kj (cid:19) = exp mK y − (cid:88) j =1 j ≤ exp (cid:16) mK (ln y + 1) (cid:17) = ( ey ) m/K . where the ( y − y + 1. In all: (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) ≤ my (cid:16) ( ey ) m/K − (cid:17) ≤ my mK ( ey − ≤ e m K .
The third set aid in the beta-negative binomial model. . Nguyen et al./Independent finite approximations • For x > z ≥ y ≥ B ( x, y ) − B ( x, z ) ≤ ( z − y ) B ( x + 1 , y − . ≤ ( z − y ) B ( x + 1 , y − . (D.9) • For any θ ∈ [0 , r > b ≥ ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) B ( y, b + r ) ≤ rb − . . (D.10) • For b ≥
1, for any c >
0, for any K ≥ c : (cid:12)(cid:12)(cid:12)(cid:12) − Γ( b )Γ( b + c/K ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK (2 + ln b ) . (D.11) • There exists a constant D (cid:48)(cid:48) such that for all b > , c > , K ≥ c (ln b + 2): (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK (3 ln b + 8) . (D.12) Proof of Eq. (D.9) . First we prove that for any x ∈ [0 , √ − x ln(1 − x ) + x ≥ . Truly, let g ( x ) to be the function on the right hand side. Then its derivative is: g (cid:48) ( x ) = 2 √ − x − ln(1 − x ) − √ − x . Denote the numerator function by h ( x ). Its derivative is: h (cid:48) ( x ) = 11 − x − √ − x ≥ , since x ∈ [0 ,
1] meaning h is monotone increasing. Since h (0) = 0, it means h ( x ) ≥
0. Thismeans g (cid:48) ( x ) ≥ g itself is monotone increasing. Since g (0) = 0 it’s true that g ( x ) ≥ x ∈ [0 , x ∈ [0 , p ≥ − x ) p + p x √ − x − ≥ . (D.13)Truly, let f ( p ) = (1 − x ) p + p x √ − x −
1. Then f (cid:48) ( p ) = ln(1 − x )(1 − x ) p + x √ − x . Also f (cid:48)(cid:48) ( p ) = (ln(1 − x )) (1 − x ) p >
0. So f (cid:48) ( p ) is monotone increasing. At p = 0, f (cid:48) (0) =ln(1 − x ) + x √ − x >
0. Therefore f (cid:48) ( p ) ≥ p . So f ( p ) is increasing. Since f (0) = 0,it’s true that f ( p ) ≥ p .We finally prove the inequality about beta functions. B ( x, y ) − B ( x, z ) = (cid:90) θ x − (1 − θ ) y − (1 − (1 − θ ) z − y ) dθ ≤ (cid:90) θ x − (1 − θ ) y − ( z − y ) θ (1 − θ ) − . dθ = ( z − y ) (cid:90) θ x (1 − θ ) y − . dθ = ( z − y ) B ( x + 1 , y − . . where we have usd 1 − (1 − θ ) z − y ≤ ( z − y ) θ (1 − θ ) − / . As for B ( x +1 , y − . ≤ B ( x +1 , y − . Nguyen et al./Independent finite approximations Proof of Eq. (D.10) . ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) B ( y, b + r ) = (cid:90) ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) θ y − (1 − θ ) b + r − dθ = (cid:90) θ − (cid:32) ∞ (cid:88) y =1 Γ( y + r ) y ! Γ( r ) θ y (cid:33) (1 − θ ) b + r − dθ = (cid:90) (cid:18) θ − (cid:18) − θ ) r − (cid:19)(cid:19) (1 − θ ) b + r − dθ = (cid:90) (cid:0) θ − (1 − (1 − θ ) r ) (cid:1) (1 − θ ) b − dθ ≤ (cid:90) θ − r θ √ − θ (1 − θ ) b − dθ = r (cid:90) (1 − θ ) b − . dθ = rb − . , where the identity (cid:80) ∞ y =1 Γ( y + r ) y ! Γ( r ) θ y = − θ ) r − − (1 − θ ) r . Proof of Eq. (D.11) . First we prove that:1 − Γ( b )Γ( b + c/K ) ≤ cK (2 + ln b ) . The recursion defining Γ( b ) allows us to write:1 − Γ( b )Γ( b + c/K ) = 1 − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) . The argument proceeds in one of two ways. If Γ( b −(cid:98) b (cid:99) +1)Γ( b + c/K −(cid:98) b (cid:99) +1) ≥
1, then we have:1 − Γ( b )Γ( b + c/K ) ≤ − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i = (cid:18) − b − b + c/K − (cid:19) + b − b + c/K − − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i = cK b + c/K − b − b + c/K − − (cid:98) b (cid:99)− (cid:89) i =2 b − ib + c/K − i ≤ cK b − − (cid:98) b (cid:99)− (cid:89) i =2 b − ib + c/K − i ≤ ... ≤ cK (cid:98) b (cid:99)− (cid:88) i =1 b − i ≤ cK (ln b + 1) . . Nguyen et al./Independent finite approximations Else, Γ( b −(cid:98) b (cid:99) +1)Γ( b + c/K −(cid:98) b (cid:99) +1) < − Γ( b )Γ( b + c/K )= 1 − Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) + Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i ≤ (cid:18) − Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) (cid:19) + cK (ln b + 1) . We now argue that for all x ∈ [1 , K ≥ c , 1 − Γ( x )Γ( x + c/K ) ≤ cK . By convexity of Γ( x ),we know that Γ( x ) ≥ Γ( x + c/K ) − cK Γ (cid:48) ( x + c/K ). Hence Γ( x )Γ( x +1 /K ) ≥ − cK Γ (cid:48) ( x + c/K )Γ( x + c/K ) . Since x + c/K ∈ [1 ,
3) and ψ ( y ) = Γ (cid:48) ( y )Γ( y ) , the digamma function, is a monotone increasing function(it is the derivative of a ln Γ( x ), which is also convex), (cid:12)(cid:12)(cid:12) Γ (cid:48) ( x + c/K )Γ( x + c/K ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) Γ (cid:48) (3)Γ(3) (cid:12)(cid:12)(cid:12) ≤
1. Applyingthis to x = b − (cid:98) b (cid:99) + 1, we conclude that:1 − Γ( b )Γ( b + c/K ) ≤ cK (2 + ln b ) . We now show that: Γ( b )Γ( b + c/K ) − ≥ − cK (ln b + ln 2) . Convexity of Γ( y ) means that:Γ( b ) ≥ Γ( b + c/K ) − cK Γ (cid:48) ( b + c/K ) −→ Γ( b )Γ( b + c/K ) − ≥ − cK Γ (cid:48) ( b + c/K )Γ( b + c/K ) . From (Alzer, 1997, Equation 2.2), we know that ψ ( x ) ≤ ln( x ) for positive x . Therefore: − cK Γ (cid:48) ( b + c/K )Γ( b + c/K ) ≥ − cK ln( b + c/K ) ≥ − cK (ln b + ln 2)since b + cK ≤ b .We combine two sides of the inequality to conclude that the absolute value is at most cK (2 + ln b ). Proof of Eq. (D.12) . (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) = c (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) = c (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) (cid:18) Γ( c/K + b )Γ( b ) − (cid:19) + (cid:18) K/c Γ( c/K ) − (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18) K/c Γ( c/K ) (cid:12)(cid:12)(cid:12)(cid:12) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) . On the one hand:
K/c Γ( c/K ) = Γ(1)Γ(1 + c/K ) . . Nguyen et al./Independent finite approximations From Eq. (D.11), we know: (cid:12)(cid:12)(cid:12)(cid:12)
Γ(1)Γ(1 + c/K ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK . On the other hand, let y = Γ( c/K + b )Γ( b ) . Then: (cid:12)(cid:12)(cid:12)(cid:12) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) y − (cid:12)(cid:12)(cid:12)(cid:12) = | − y | y ≤ cK (2 + ln b ) . Again using Eq. (D.11), | − y | ≤ cK (2 + ln b ). Since K ≥ c (ln b + 2), this is at most 0 . y ≥ .
5. In all: (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18)(cid:18) cK (cid:19) cK (2 + ln b ) + 2 cK (cid:19) ≤ cK (3 ln b + 8) . Appendix E: Verification of upper bound’s assumptions for additionalexamples
E.1. Gamma-Poisson
First we write down the functions in Definition 4.1 for gamma-Poisson. This requires ex-pressing the rate measure and likelihood in exponential-family form: h ( x | θ ) = 1 x ! θ x exp( − θ ) , ν ( dθ ) = γλθ − exp( − λθ ) , which means that κ ( x ) = 1 /x ! , φ ( x ) = x, µ ( θ ) = 0 , A ( θ ) = θ . This leads to the normalizer: S = (cid:90) ∞ θ ξ exp( − λθ ) dθ = Γ( ξ + 1) λ − ( ξ +1) . Therefore, h c is: h c ( x n = x | x n − ) = 1 x ! Γ( − (cid:80) n − i =1 x i + x + 1)( λ + n ) − (cid:80) n − i =1 x i + x +1 Γ( − (cid:80) n − i =1 x i + 1)( λ + n − − (cid:80) n − i =1 x i +1 = 1 x ! Γ( (cid:80) n − i =1 x i + x )Γ( (cid:80) n − i =1 x i ) (cid:18) λ + n (cid:19) x (cid:18) − λ + n (cid:19) (cid:80) n − i =1 x i , and similarly (cid:101) h c is: (cid:101) h c ( x n = x | x n − ) = 1 x ! Γ( − (cid:80) n − i =1 x i + x + 1 + γλ/K )( λ + n ) − (cid:80) n − i =1 x i + x +1+ γλ/K Γ( − (cid:80) n − i =1 x i + 1 + γλ/K )( λ + n − − (cid:80) n − i =1 x i +1+ γλ/K = 1 x ! Γ( (cid:80) n − i =1 x i + x + γλ/K )Γ( (cid:80) n − i =1 x i + γλ/K ) (cid:18) λ + n (cid:19) x (cid:18) − λ + n (cid:19) (cid:80) n − i =1 x i + γλ/K , . Nguyen et al./Independent finite approximations and M n,x is: M n,x = γλ x ! Γ( x )( λ + n ) − x = γλx ( λ + n ) x . Now, we state the constants so that gamma-Poisson satisfies Assumption 2, and give theproof.
Proposition E.1 (Gamma-Poisson satisfies Assumption 2) . The following hold for arbitrary γ, λ > . For any n : ∞ (cid:88) x =1 M n,x ≤ γλn − λ . ∞ (cid:88) x =1 (cid:101) h c ( x | x n − = 0) ≤ γλn − λ . For any K : ∞ (cid:88) x =0 (cid:12)(cid:12)(cid:12) h c ( x | x n − ) − (cid:101) h c ( x | x n − ) (cid:12)(cid:12)(cid:12) ≤ γλK n − λ . For any K : ∞ (cid:88) x =1 (cid:12)(cid:12)(cid:12) M n,x − K (cid:101) h c ( x | x n − = 0) (cid:12)(cid:12)(cid:12) ≤ γ λ + eγ λ K n − λ . Proof of Proposition E.1.
The growth rate condition of target model is simple: ∞ (cid:88) x =1 M n,x = γλ ∞ (cid:88) x =1 x ( λ + n ) x ≤ γλ ∞ (cid:88) x =1 λ + n ) x = γλn − λ . The growth rate condition of approximate model is also simple: ∞ (cid:88) x =1 (cid:101) h c ( x | x n − = 0) = 1 − (cid:101) h c (0 | x n − = 0) = (cid:18) − λ + n (cid:19) γλ/K ≤ γλK ( λ + n ) − − ( λ + n ) − = 1 K γλn − λ , where we have used Eq. (D.6) with p = γλK , x = ( λ + n ) − .For the total variation between h c and (cid:101) h c condition, observe that h c and (cid:101) h c are probabilitymass functions of negative binomial distributions, namely: h c ( x | x n − ) = NB (cid:32) x | n − (cid:88) i =1 x i , ( λ + n ) − (cid:33)(cid:101) h c ( x | x n − ) = NB (cid:32) x | n − (cid:88) i =1 x i + γλ/K, ( λ + n ) − (cid:33) . The two negative binomial distributions have the same success probability and only differin the number of trials. Hence using Eq. (D.7), we have: ∞ (cid:88) x =0 (cid:12)(cid:12)(cid:12) h c ( x | x n − ) − (cid:101) h c ( x | x n − ) (cid:12)(cid:12)(cid:12) ≤ γλK ( λ + n ) − − ( λ + n ) − = 2 γλK n − λ . . Nguyen et al./Independent finite approximations For the total variation between M n,. and K (cid:101) h c ( · |
For the total variation between $M_{n,\cdot}$ and $K\tilde h_c(\cdot\mid 0)$ condition:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\tilde h_c(x\mid x_{1:n-1}=0)\right| = \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K}\right|$$
$$\le \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\left(\left|\frac{\gamma\lambda}{x}\left(1 - \left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K}\right)\right| + \left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\right|\right).$$
Using Eqs. (D.7) and (D.8) we can upper bound:
$$1 - \left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K} \le \frac{\gamma\lambda}{K}\,\frac{1}{\lambda+n-1}, \qquad \left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\right| \le \frac{e\gamma^2\lambda^2}{K}.$$
This means:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\,\frac{\gamma\lambda}{x}\,\frac{\gamma\lambda}{K}\,\frac{1}{\lambda+n-1} + \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\,\frac{e\gamma^2\lambda^2}{K} \le \frac{\gamma^2\lambda^2}{K}\,\frac{1}{\lambda+n-1} + \frac{e\gamma^2\lambda^2}{K}\,\frac{1}{\lambda+n-1} \le \frac{\gamma^2\lambda^2 + e\gamma^2\lambda^2}{K}\,\frac{1}{n-1+\lambda}.$$

E.2. Beta-negative binomial
First we write down the functions in Definition 4.1 for beta-negative binomial. This requires expressing the rate measure and likelihood in exponential-family form:
$$h(x\mid\theta) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\theta^x\exp(r\log(1-\theta)), \qquad \nu(d\theta) = \gamma\alpha\,\theta^{-1}\exp\big(\log(1-\theta)(\alpha-1)\big)\,\mathbf{1}\{\theta\le 1\}\,d\theta,$$
which means that $\kappa(x) = \Gamma(x+r)/(\Gamma(r)\,x!)$, $\phi(x) = x$, $\mu(\theta) = 0$, $A(\theta) = -r\log(1-\theta)$. This leads to the normalizer:
$$S = \int_0^1\theta^{\xi}(1-\theta)^{r\lambda}\,d\theta = B(\xi+1,\ r\lambda+1).$$
To match the parametrizations, we need to set $\lambda = (\alpha-1)/r$, i.e. $r\lambda = \alpha-1$. Therefore, $h_c$ is:
$$h_c(x_n = x\mid x_{1:n-1}) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\frac{B\!\left(\sum_{i=1}^{n-1}x_i + x,\ rn+\alpha\right)}{B\!\left(\sum_{i=1}^{n-1}x_i,\ r(n-1)+\alpha\right)},$$
and $\tilde h_c$ is:
$$\tilde h_c(x_n = x\mid x_{1:n-1}) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\frac{B\!\left(\gamma\alpha/K + \sum_{i=1}^{n-1}x_i + x,\ rn+\alpha\right)}{B\!\left(\gamma\alpha/K + \sum_{i=1}^{n-1}x_i,\ r(n-1)+\alpha\right)},$$
and $M_{n,x}$ is:
$$M_{n,x} = \gamma\alpha\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,B(x,\ rn+\alpha).$$
Now, we state the constants so that beta-negative binomial satisfies Assumption 2, and give the proof.
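The beta-function formulas make $h_c$ and $\tilde h_c$ straightforward to evaluate on a log scale. A minimal sketch (ours, with hypothetical parameter values) checks that both are normalized p.m.f.s and that they merge as $K$ grows:

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_hc(x, A, r, n, alpha, shift=0.0):
    """Log-predictive for beta-negative binomial; shift = gamma*alpha/K gives tilde h_c."""
    return (gammaln(x + r) - gammaln(r) - gammaln(x + 1)
            + betaln(shift + A + x, r * n + alpha)
            - betaln(shift + A, r * (n - 1) + alpha))

gamma_, alpha, r, n, K = 1.0, 2.0, 3.0, 5, 100
A = 4.0                              # sum of previous counts (must be positive)
x = np.arange(0, 2000)

p = np.exp(log_hc(x, A, r, n, alpha))
p_tilde = np.exp(log_hc(x, A, r, n, alpha, shift=gamma_ * alpha / K))
print(p.sum(), p_tilde.sum())        # both should be close to 1
print(0.5 * np.abs(p - p_tilde).sum())  # total variation; shrinks like 1/K
```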
Proposition E.2 (Beta-negative binomial satisfies Assumption 2). The following hold for any $\gamma > 0$, $\alpha \ge 1$. For any $n$:
$$\sum_{x=1}^\infty M_{n,x} \le \frac{2\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For any $n$, any $K$:
$$\sum_{x=1}^\infty \tilde h_c(x\mid x_{1:n-1}=0) \le \frac{2}{K}\,\frac{\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For any $K$:
$$\sum_{x=0}^\infty\left|h_c(x\mid x_{1:n-1}) - \tilde h_c(x\mid x_{1:n-1})\right| \le \frac{2\gamma\alpha}{K}\,\frac{1}{n-1+(\alpha-1)/r}.$$
For any $n$, for $K \ge \gamma\alpha(3\ln(r(n-1)+\alpha)+8)$:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\,\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha+3)\ln(rn+\alpha+1) + (10+2r)\gamma\alpha + 24}{n-1+(\alpha-0.5)/r}.$$

Proof of Proposition E.2.
The first growth rate condition is easy to verify:
$$\sum_{x=1}^\infty M_{n,x} = \gamma\alpha\sum_{x=1}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}\,B(x,\ rn+\alpha) \le \gamma\alpha\,\frac{2r}{r(n-1)+\alpha-0.5},$$
where we have used Eq. (D.10) with $b = r(n-1)+\alpha$.

As for the other growth rate condition,
$$\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:n-1}=0) = 1 - \tilde h_c(0\mid x_{1:n-1}=0) = 1 - \frac{B(\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} = \frac{B(\gamma\alpha/K,\ r(n-1)+\alpha) - B(\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}.$$
The numerator is small because of Eq. (D.9), where $x = \gamma\alpha/K$, $y = r(n-1)+\alpha$, $z = rn+\alpha$:
$$B(\gamma\alpha/K,\ r(n-1)+\alpha) - B(\gamma\alpha/K,\ rn+\alpha) \le r\,B(\gamma\alpha/K+1,\ r(n-1)+\alpha-0.5).$$
The denominator is large because of Eq. (D.12) with $c = \gamma\alpha$, $b = r(n-1)+\alpha$:
$$\frac{1}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} \le \frac{2\gamma\alpha}{K}.$$
Combining the two and using a simple bound on the beta function yields:
$$\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:n-1}=0) \le \frac{2}{K}\,\frac{\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For the total variation between $h_c$ and $\tilde h_c$ condition, we first discuss how each function can be expressed as a probability mass function of a so-called beta-negative-binomial, i.e. BNB (Johnson, Kemp and Kotz, 2005, Section 6.2.3), distribution. Let $A = \sum_{i=1}^{n-1}x_i$. Observe that:
$$\frac{\Gamma(x+r)}{\Gamma(r)\,x!}\,\frac{B(A+x,\ rn+\alpha)}{B(A,\ r(n-1)+\alpha)} = \frac{\Gamma(A+x)}{\Gamma(A)\,x!}\,\frac{B(r+x,\ A+r(n-1)+\alpha)}{B(r,\ r(n-1)+\alpha)}. \quad (E.1)$$
The random variable $V$ whose p.m.f. at $x$ appears on the right hand side of Eq. (E.1) is the result of a two-step sampling procedure:
$$P \sim \mathrm{Beta}(r,\ r(n-1)+\alpha), \qquad V\mid P \sim \mathrm{NB}(A;\ P).$$
We denote such a distribution by $V \sim \mathrm{BNB}(A;\ r,\ r(n-1)+\alpha)$. An analogous argument applies to $\tilde h_c$:
$$P \sim \mathrm{Beta}(r,\ r(n-1)+\alpha), \qquad V\mid P \sim \mathrm{NB}\!\left(A + \frac{\gamma\alpha}{K};\ P\right).$$
Therefore:
$$h_c(x\mid x_{1:n-1}) = \mathrm{BNB}(x\mid A;\ r,\ r(n-1)+\alpha), \qquad \tilde h_c(x\mid x_{1:n-1}) = \mathrm{BNB}\!\left(x\,\Big|\,A+\frac{\gamma\alpha}{K};\ r,\ r(n-1)+\alpha\right).$$
We now bound the total variation between the BNB distributions. Because they have a common mixing distribution, we can upper bound the distance with an integral using simple triangle inequalities:
$$d_{TV}(h_c, \tilde h_c) = \frac{1}{2}\sum_{x=0}^\infty|P(V_1=x) - P(V_2=x)| = \frac{1}{2}\sum_{x=0}^\infty\left|\int\big(P(V_1=x\mid P=p) - P(V_2=x\mid P=p)\big)\,P(P\in dp)\right|$$
$$\le \frac{1}{2}\int\left(\sum_{x=0}^\infty|P(V_1=x\mid P=p) - P(V_2=x\mid P=p)|\right)P(P\in dp) = \int d_{TV}\big(\mathrm{NB}(A, p),\ \mathrm{NB}(A+\gamma\alpha/K,\ p)\big)\,P(P\in dp).$$
For any $p$, Eq. (D.7) is used to upper bound the total variation distance between negative binomial distributions. Therefore:
$$d_{TV}(h_c, \tilde h_c) \le \int\frac{\gamma\alpha}{K}\,\frac{p}{1-p}\,P(P\in dp) = \frac{\gamma\alpha}{K}\,\frac{1}{B(r,\ r(n-1)+\alpha)}\int_0^1 p^{r}(1-p)^{r(n-1)+\alpha-2}\,dp = \frac{\gamma\alpha}{K}\,\frac{B(r+1,\ r(n-1)+\alpha-1)}{B(r,\ r(n-1)+\alpha)} = \frac{\gamma\alpha}{K}\,\frac{1}{n-1+(\alpha-1)/r},$$
so the sum of absolute differences is at most twice this quantity.

Finally, we verify the condition between $K\tilde h_c$ and $M_{n,\cdot}$, which amounts to showing that the following sum is small:
$$\sum_{x=1}^\infty\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\left|\gamma\alpha\,B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right|.$$
We look at the summand for $x = 1$ and the summation from $x = 2$ through $\infty$ separately. For $x = 1$, we prove that:
$$\left|\gamma\alpha\,B(1,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+1,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{4\gamma^2\alpha^2}{K}\,\frac{2+\ln(rn+\alpha+1)}{rn+\alpha}. \quad (E.2)$$
Expanding gives:
$$\left|\gamma\alpha\,B(1,\ rn+\alpha) - K\,\frac{B(1+\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| = \frac{\big|\gamma\alpha\,B(1,\ rn+\alpha)\,B(\gamma\alpha/K,\ r(n-1)+\alpha) - K\,B(1+\gamma\alpha/K,\ rn+\alpha)\big|}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}. \quad (E.3)$$
We look at the numerator of the right hand side in Eq. (E.3):
$$\left|\gamma\alpha\,B(1,\ rn+\alpha)\,\frac{\Gamma(\gamma\alpha/K)\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - K\,\frac{\Gamma(1+\gamma\alpha/K)\Gamma(rn+\alpha)}{\Gamma(1+\gamma\alpha/K+rn+\alpha)}\right|$$
$$= \gamma\alpha\,\Gamma(\gamma\alpha/K)\left|\frac{1}{rn+\alpha}\,\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - \frac{\Gamma(rn+\alpha)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)}\right|$$
$$= \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\left|\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - \frac{\Gamma(rn+\alpha+1)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)}\right|$$
$$\le \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\left(\left|\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - 1\right| + \left|\frac{\Gamma(rn+\alpha+1)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)} - 1\right|\right)$$
$$\le \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\,\frac{2\gamma\alpha}{K}\,(2+\ln(rn+\alpha+1)),$$
where we have used Eq. (D.11) with $c = \gamma\alpha$ and $b = r(n-1)+\alpha$ or $b = rn+\alpha+1$. In all, Eq. (E.3) is upper bounded by:
$$\frac{2\gamma^2\alpha^2}{rn+\alpha}\,\frac{2+\ln(rn+\alpha+1)}{K}\,\frac{\Gamma(\gamma\alpha/K)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} = \frac{2\gamma^2\alpha^2}{rn+\alpha}\,\frac{2+\ln(rn+\alpha+1)}{K}\,\frac{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)}{\Gamma(r(n-1)+\alpha)} \le \frac{4\gamma^2\alpha^2}{K}\,\frac{2+\ln(rn+\alpha+1)}{rn+\alpha},$$
since $\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(r(n-1)+\alpha+\gamma\alpha/K)} \ge 1 - \frac{\gamma\alpha}{K}(2+\ln(r(n-1)+\alpha)) \ge 0.5$ for $K \ge 2\gamma\alpha(2+\ln(r(n-1)+\alpha))$, and this is the proof of Eq. (E.2).

We now move on to the summands from $x = 2$ to $\infty$. By the triangle inequality:
$$\left|\gamma\alpha\,B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le T_1(x) + T_2(x),$$
where:
$$T_1(x) := B(x,\ rn+\alpha)\left|\gamma\alpha - \frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right|, \qquad T_2(x) := K\,\frac{\left|B(x,\ rn+\alpha) - B(\gamma\alpha/K+x,\ rn+\alpha)\right|}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}.$$
The helper inequalities we have proven are once again useful:
$$\left|\gamma\alpha - \frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{\gamma^2\alpha^2}{K}(3\ln(r(n-1)+\alpha)+8),$$
$$\frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} \le \gamma\alpha + \frac{\gamma^2\alpha^2}{K}(3\ln(r(n-1)+\alpha)+8) \le 2\gamma\alpha,$$
$$\left|B(x,\ rn+\alpha) - B(\gamma\alpha/K+x,\ rn+\alpha)\right| \le \frac{\gamma\alpha}{K}\,B(x-1,\ rn+\alpha+1);$$
since $K \ge \gamma\alpha(3\ln(r(n-1)+\alpha)+8)$, we have applied Eq. (D.12) in the first and second inequalities and Eq. (D.9) in the third one. So for each $x \ge 2$, each summand is at most:
$$\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\left|\gamma\alpha B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{\gamma^2\alpha^2(3\ln(r(n-1)+\alpha)+8)}{K}\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}B(x,\ rn+\alpha) + \frac{2\gamma^2\alpha^2}{K}\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}B(x-1,\ rn+\alpha+1).$$
To upper bound the summation from $x = 2$ to $\infty$, it suffices to bound:
$$\sum_{x=2}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x,\ rn+\alpha) \le \sum_{x=1}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x,\ rn+\alpha) \le \frac{2r}{r(n-1)+\alpha-0.5},$$
and:
$$\sum_{x=2}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x-1,\ rn+\alpha+1) \le r\sum_{x=2}^\infty\frac{\Gamma(x-1+r+1)}{\Gamma(r+1)\,(x-1)!}B(x-1,\ rn+\alpha+1) = r\sum_{z=1}^\infty\frac{\Gamma(z+r+1)}{\Gamma(r+1)\,z!}B(z,\ rn+\alpha+1) \le \frac{2r(r+1)}{r(n-1)+\alpha-0.5}.$$
The summation from $x = 2$ to $\infty$ is therefore upper bounded by:
$$\frac{\gamma^2\alpha^2(3\ln(r(n-1)+\alpha)+8)}{K}\,\frac{2r}{r(n-1)+\alpha-0.5} + \frac{2\gamma^2\alpha^2}{K}\,\frac{2r(r+1)}{r(n-1)+\alpha-0.5}. \quad (E.4)$$
Eqs. (E.2) and (E.4) combine to give:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\,\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha+3)\ln(rn+\alpha+1) + (10+2r)\gamma\alpha + 24}{n-1+(\alpha-0.5)/r}.$$

Appendix F: Proofs of CRM bounds
F.1. Upper bound
Proof of Theorem 4.2.
Let $\beta$ be the smallest positive constant such that $\frac{\beta^2}{2(1+\beta)}C_1 \ge 2$. We will focus on the case where the approximation level $K$ is essentially $\Omega(\ln N)$:
$$K \ge \max\Big\{(\beta+1)\max\big(C(K, C_2),\ C(N, C_2)\big),\ C_1(\ln N + C_2)\Big\}. \quad (F.1)$$
To see why this is sufficient, observe that the upper bound in Theorem 4.2 naturally holds for $K$ smaller than $\ln N$: total variation distance is always upper bounded by 1, so if $K = o(\ln N)$, then by selecting reasonable constants $C', C'', C'''$ we can make the right hand side at least 1 and satisfy the inequality. In the sequel, we only consider the situation in Eq. (F.1).

First, we argue that it suffices to bound the total variation distance between the feature-allocation matrices coming from the target model and the approximate model. Given the latent measures $X_1, X_2, \ldots, X_N$ from the target model, we can read off the feature-allocation matrix $F$, which has $N$ rows and as many columns as there are unique atom locations among the $X_i$'s:
1. The $i$-th row of $F$ records the atom sizes of $X_i$.
2. Each column corresponds to an atom location: the locations are sorted first according to the index of the first measure $X_i$ to manifest it (counting from $1, 2, \ldots$), and then by its atom size in $X_i$.
The marginal process that describes the atom sizes of $X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1$ in Proposition C.1 is also the description of how the rows of $F$ are generated. The joint distribution of $X_1, X_2, \ldots, X_N$ can be two-step sampled. First, the feature-allocation matrix $F$ is sampled. Then the atom locations are drawn i.i.d. from the base measure $H$: each column of $F$ is assigned an atom location, and the latent measure $X_i$ has atom size $F_{i,j}$ at the $j$-th atom location. A similar two-step sampling generates $Z_1, Z_2, \ldots, Z_N$, the latent measures under the approximate model: the distribution over the feature-allocation matrix $F'$ follows Proposition C.2 instead of Proposition C.1, but conditioned on the feature-allocation matrix, the process generating atom locations and constructing latent measures is exactly the same. In other words, the conditional distributions $Y_{1:N} \mid F = f$ and $W_{1:N} \mid F' = f$ are the same, since both models have the same observational likelihood $f$ given latent measures 1 through $N$. Denote by $P_F$ the distribution of the feature-allocation matrix under the target model, and by $P_{F'}$ that under the approximate model. Lemma D.7 implies that:
$$d_{TV}(P_{W_{1:N}}, P_{Y_{1:N}}) \le d_{TV}(P_F, P_{F'}). \quad (F.2)$$
Next, we parametrize the feature-allocation matrices in a way that is convenient for the analysis of total variation distance. Let $J$ be the number of columns of $F$. Our parametrization involves $d_{n,x}$, for $n \in [N]$ and $x \in \mathbb{N}$, and $s_j$, for $j \in [J]$:
1. For $n = 1, 2, \ldots, N$:
(a) If $n = 1$, for each $x \in \mathbb{N}$, $d_{1,x}$ counts the number of columns $j$ where $F_{1,j} = x$.
(b) For $n \ge 2$, for each $x \in \mathbb{N}$, let $J_n = \{j : \forall i < n,\ F_{i,j} = 0\}$, i.e. no observation before $n$ manifests the atom locations indexed by columns in $J_n$. For each $x \in \mathbb{N}$, $d_{n,x}$ counts the number of columns $j \in J_n$ where $F_{n,j} = x$.
2. For $j = 1, 2, \ldots, J$, let $I_j = \min\{i : F_{i,j} > 0\}$, i.e. the first row to manifest the $j$-th atom location. Let $s_j = F_{I_j:N,\,j}$, i.e. the history of the $j$-th atom location.
In words, $d_{n,x}$ is the number of atom locations first instantiated by individual $n$ with atom size $x$, while $s_j$ is the history of the $j$-th atom location. $\sum_{n=1}^N\sum_{x=1}^\infty d_{n,x}$ is exactly $J$, the number of columns. We use the shorthand $d$ for the collection of the $d_{n,x}$ and $s$ for the collection of the $s_j$. There is a one-to-one mapping between $(d, s)$ and the feature-allocation matrix $f$. Let $(D, S)$ be the distribution of $d$ and $s$ under the target model, and $(D', S')$ the distribution under the approximate model. We now aim to compare the joint distributions:
$$d_{TV}(P_F, P_{F'}) = d_{TV}(P_{D,S}, P_{D',S'}).$$
Because total variation distance is the infimum of the difference probability over all couplings, to find an upper bound on $d_{TV}(P_{D,S}, P_{D',S'})$ it suffices to demonstrate a joint distribution under which $P((D,S) \ne (D',S'))$ is small. The rest of the proof is dedicated to that end. To start, we only assume that $(D, S, D', S')$ is a proper coupling, in that marginally $(D,S) \sim P_{D,S}$ and $(D',S') \sim P_{D',S'}$. As we progress, gradually more structure is added to the joint distribution $(D, S, D', S')$ to control $P((D,S) \ne (D',S'))$.

We first decompose $P((D,S) \ne (D',S'))$ into other probabilistic quantities which can be analyzed using Assumption 2. Define the typical set:
$$\mathcal{D}^* = \left\{d : \sum_{n=1}^N\sum_{x=1}^\infty d_{n,x} \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\right\}.$$
$d \in \mathcal{D}^*$ means that the feature-allocation matrix $f$ has a bounded number of columns. The claim is that:
$$P((D,S) \ne (D',S')) \le P(D \ne D') + P(S \ne S' \mid D = D', D \in \mathcal{D}^*) + P(D \notin \mathcal{D}^*). \quad (F.3)$$
This is true from basic properties of probabilities and conditional probabilities:
$$P((D,S) \ne (D',S')) = P(D \ne D') + P(S \ne S', D = D') = P(D \ne D') + P(S \ne S', D = D', D \in \mathcal{D}^*) + P(S \ne S', D = D', D \notin \mathcal{D}^*)$$
$$\le P(D \ne D') + P(S \ne S' \mid D = D', D \in \mathcal{D}^*) + P(D \notin \mathcal{D}^*).$$
The three ideas behind this upper bound are the following. First, because of the growth condition, we can analyze the atypical set probability $P(D \notin \mathcal{D}^*)$. Second, because of the total variation between $h_c$ and $\tilde h_c$, we can analyze $P(S \ne S' \mid D = D', D \in \mathcal{D}^*)$. Finally, we can analyze $P(D \ne D')$ because of the total variation between $K\tilde h_c$ and $M_{n,\cdot}$. In what follows we carry out the program.

Atypical set probability. The $P(D \notin \mathcal{D}^*)$ term in Eq. (F.3) is easiest to control. Under the target model (Proposition C.1), the $D_{i,x}$'s are independent Poissons with means $M_{i,x}$, so the sum $\sum_{i=1}^N\sum_{x=1}^\infty D_{i,x}$ is itself Poisson with mean $M = \sum_{i=1}^N\sum_{x=1}^\infty M_{i,x}$. Because of Lemma D.3, for any $x > 0$:
$$P\left(\sum_{i=1}^N\sum_{x=1}^\infty D_{i,x} > M + x\right) \le \exp\left(-\frac{x^2}{2(M+x)}\right).$$
For the event $\{D \notin \mathcal{D}^*\}$, $M + x = (\beta+1)\max(C(K,C_2), C(N,C_2))$ and $M \le C(N, C_2)$ due to Eq. (7), so that $x \ge \beta\max(C(K,C_2), C(N,C_2))$. Therefore:
$$P(D \notin \mathcal{D}^*) \le \exp\left(-\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2),\ C(N,C_2)\big)\right). \quad (F.4)$$

Difference between histories. To minimize the difference probability between the histories of atom sizes, i.e. the $P(S \ne S' \mid D = D', D \in \mathcal{D}^*)$ term in Eq. (F.3), we will use Eq. (9). The claim is that there exists a coupling of $S' \mid D'$ and $S \mid D$ such that:
$$P(S \ne S' \mid D = D', D \in \mathcal{D}^*) \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\,\frac{C(N,C_2)}{K}. \quad (F.5)$$
Fix some $d \in \mathcal{D}^*$ — since we are in the typical set, the number of columns in the feature-allocation matrix is at most $(\beta+1)\max(C(K,C_2), C(N,C_2))$. Conditioned on $D = d$, there is a finite number of history variables $S_j$, one for each atom location; similarly for the conditioning of $S'$ on $D' = d$. For both the target and the approximate model, the density of the joint distribution factorizes:
$$P(S = s \mid D = d) = \prod_{j=1}^J P(S_j = s_j \mid D = d), \qquad P(S' = s \mid D' = d) = \prod_{j=1}^J P(S'_j = s_j \mid D' = d),$$
since in both marginal processes the atom sizes for different atom locations are independent of each other. This means we can use Lemma D.8:
$$d_{TV}(P_{S\mid D=d}, P_{S'\mid D'=d}) \le \sum_{j=1}^J d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d}).$$
We inspect each $d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d})$. Fixing $d$ also fixes $I_j$, the first row to manifest the $j$-th atom location. The history $s_j$ is then an $N - I_j + 1$ dimensional integer vector, whose $t$-th entry is the atom size at the $j$-th atom location of the $(t+I_j-1)$-th latent measure. For any partial history $s_0$, the conditional distributions $S_j(t)$ and $S'_j(t)$ are very similar: $S_j(t) \mid D = d, S_j(1:(t-1)) = s_0$ is governed by $h_c$ (Proposition C.1) while $S'_j(t) \mid D' = d, S'_j(1:(t-1)) = s_0$ is governed by $\tilde h_c$ (Proposition C.2). Hence:
$$d_{TV}\left(P_{S_j(t)\mid D=d, S_j(1:(t-1))=s_0},\ P_{S'_j(t)\mid D'=d, S'_j(1:(t-1))=s_0}\right) \le \frac{1}{K}\,\frac{C_1}{t+I_j-1+C_2},$$
for any partial history $s_0$. To use this conditional bound, we again leverage Lemma D.6 to compare the joint $S_j = (S_j(1), \ldots, S_j(N-I_j+1))$ with the joint $S'_j = (S'_j(1), \ldots, S'_j(N-I_j+1))$, peeling off one layer at a time:
$$d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d}) \le \sum_{t=1}^{N-I_j+1}\max_{s_0}\ d_{TV}\left(P_{S_j(t)\mid D=d, S_j(1:(t-1))=s_0},\ P_{S'_j(t)\mid D'=d, S'_j(1:(t-1))=s_0}\right) \le \sum_{t=1}^{N-I_j+1}\frac{1}{K}\,\frac{C_1}{t+I_j-1+C_2} \le \frac{C(N, C_2)}{K}.$$
Multiplying the right hand side by $(\beta+1)\max(C(K,C_2), C(N,C_2))$, the upper bound on $J$, we arrive at the upper bound for the total variation between $P_{S\mid D=d}$ and $P_{S'\mid D'=d}$ in Eq. (F.5). Furthermore, our analysis of the total variation can be back-tracked to construct the coupling between the conditional distributions $S\mid D=d$ and $S'\mid D'=d$ which attains that small probability of difference. Since the choice of conditioning $d \in \mathcal{D}^*$ was arbitrary, we have actually shown Eq. (F.5).

Difference between new atom sizes. Finally, to control the difference probability for the distribution over new atom sizes, i.e. the $P(D \ne D')$ term in Eq. (F.3), we utilize Eqs. (8) and (10). For each $n$, define the shorthand $d_{1:n}$ for the collection $d_{i,x}$, $i \in [n]$, $x \in \mathbb{N}$, and the typical sets:
$$\mathcal{D}^*_n = \left\{d_{1:n} : \sum_{i=1}^n\sum_{x=1}^\infty d_{i,x} \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\right\}.$$
The type of expansion performed in Eq. (F.3) can be done once here to see that:
$$P(D \ne D') = P((D_{1:N-1}, D_N) \ne (D'_{1:N-1}, D'_N)) \le P(D_{1:N-1} \ne D'_{1:N-1}) + P(D_N \ne D'_N \mid D_{1:N-1} = D'_{1:N-1}, D_{1:N-1} \in \mathcal{D}^*_{N-1}) + P(D_{1:N-1} \notin \mathcal{D}^*_{N-1}).$$
Apply the expansion once more to $P(D_{1:N-1} \ne D'_{1:N-1})$, then to $P(D_{1:N-2} \ne D'_{1:N-2})$. If we define:
$$B_j = P(D_j \ne D'_j \mid D_{1:j-1} = D'_{1:j-1}, D_{1:j-1} \in \mathcal{D}^*_{j-1}),$$
with the special case $B_1$ simply being $P(D_1 \ne D'_1)$, then:
$$P(D \ne D') \le \sum_{j=1}^N B_j + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}). \quad (F.6)$$
The second summation in Eq. (F.6), comprising only atypical probabilities, is easier to control. For any $j$, since $\sum_{i=1}^{j-1}\sum_{x=1}^\infty D_{i,x} \le \sum_{i=1}^N\sum_{x=1}^\infty D_{i,x}$, $P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le P(D \notin \mathcal{D}^*)$, so a generous upper bound for the contribution of all the atypical probabilities, including the first one from Eq. (F.4), is:
$$P(D \notin \mathcal{D}^*) + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le \exp\left(-\left(\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2), C(N,C_2)\big) - \ln N\right)\right).$$
By Lemma D.9, $\max(C(K,C_2), C(N,C_2)) \ge C_1(\max(\ln N, \ln K) - C_2(\psi(C_2)+1))$. Since we have set $\beta$ so that $\frac{\beta^2}{2(\beta+1)}C_1 = 2$, we have:
$$\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2), C(N,C_2)\big) - \ln N \ge \ln K - \text{constant},$$
meaning the overall atypical probability is at most:
$$P(D \notin \mathcal{D}^*) + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le \frac{\text{constant}}{K}. \quad (F.7)$$
As for the first summation in Eq. (F.6), we look at the individual $B_j$'s. For any fixed $d_{1:j-1} \in \mathcal{D}^*_{j-1}$, we claim that there exists a coupling between the conditionals $D_j \mid D_{1:j-1} = d_{1:j-1}$ and $D'_j \mid D'_{1:j-1} = d_{1:j-1}$ such that $P(D_j \ne D'_j \mid D_{1:j-1} = D'_{1:j-1} = d_{1:j-1})$ is at most:
$$\frac{\text{constant}}{K\,(j-1+C_2)^2} + \frac{\text{constant}\,(\ln N + \ln K)}{K\,(j-1+C_2)}. \quad (F.8)$$
Because the upper bound holds for arbitrary values $d_{1:j-1}$, the coupling actually ensures that, as long as $D_{1:j-1} = D'_{1:j-1}$ for some value in $\mathcal{D}^*_{j-1}$, the probability of difference between $D_j$ and $D'_j$ is small, i.e. $B_j$ is at most the right hand side.

Such a coupling exists because the total variation between the two distributions $P_{D_j\mid D_{1:j-1}=d_{1:j-1}}$ and $P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}}$ is small. In particular, there exists a distribution $U = \{U_x\}_{x=1}^\infty$ of independent Poisson random variables such that both the total variation between $P_{D_j\mid D_{1:j-1}=d_{1:j-1}}$ and $P_U$ and the total variation between $P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}}$ and $P_U$ are small — we then use the triangle inequality to bound the original total variation. Here, each $U_x$ has mean:
$$E(U_x) = \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\tilde h_c(x\mid x_{1:j-1} = 0).$$
On the one hand, conditioned on $D'_{1:j-1} = d_{1:j-1}$, $D'_j = \{D'_{j,x}\}_{x=1}^\infty$ is the joint distribution of the numbers of successes of each type $x$, where there are $K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}$ independent trials and a type-$x$ success has probability $\tilde h_c(x\mid x_{1:j-1}=0)$, by Proposition C.2. Because of Lemma D.4 and Eq. (8):
$$d_{TV}\left(P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}},\ P_U\right) \le \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\left(\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:j-1}=0)\right)^2 \le K\left(\frac{1}{K}\,\frac{C_1}{j-1+C_2}\right)^2 \le \frac{C_1^2}{K\,(j-1+C_2)^2}. \quad (F.9)$$
On the other hand, conditioned on $D_{1:j-1}$, $D_j = \{D_{j,x}\}_{x=1}^\infty$ consists of independent Poissons, where the mean of $D_{j,x}$ is $M_{j,x}$ by Proposition C.1. We recursively apply Lemma D.8 and Lemma D.5:
$$d_{TV}(P_U, P_{D_j}) \le \sum_{x=1}^\infty d_{TV}(P_{U_x}, P_{D_{j,x}}) \le \sum_{x=1}^\infty\left|M_{j,x} - \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\tilde h_c(x\mid x_{1:j-1}=0)\right|$$
$$\le \sum_{x=1}^\infty\left|M_{j,x} - K\tilde h_c(x\mid x_{1:j-1}=0)\right| + \left(\sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\left(\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:j-1}=0)\right). \quad (F.10)$$
The first term is upper bounded by Eq. (10). Regarding the second term, since we are in the typical set, $\sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}$ is upper bounded, so the overall bound on the second term is:
$$(\beta+1)\max\big(C(K,C_2), C(N,C_2)\big)\,\frac{1}{K}\,\frac{C_1}{j-1+C_2}.$$
Combining the two bounds gives the bound on $d_{TV}(P_U, P_{D_j})$:
$$\frac{1}{K}\,\frac{C_3\ln j + C_4}{j-1+C_2} + (\beta+1)\max\big(C(K,C_2), C(N,C_2)\big)\,\frac{1}{K}\,\frac{C_1}{j-1+C_2} \le \frac{\text{constant}\,(\ln N + \ln K)}{K\,(j-1+C_2)}. \quad (F.11)$$
Combining Eqs. (F.9) and (F.11) gives the upper bound in Eq. (F.8). The summation of the right hand side of Eq. (F.8) across $j$ leads to:
$$\sum_{j=1}^N B_j \le \frac{\text{constant}}{K} + \frac{\text{constant}\,(\ln N + \ln K)\ln N}{K}. \quad (F.12)$$
In all, because of Eqs. (F.7) and (F.12), we can couple $D$ and $D'$ such that $P(D \ne D')$ is at most:
$$\frac{\text{constant}}{K} + \frac{\text{constant}\,(\ln N + \ln K)\ln N}{K}. \quad (F.13)$$
Aggregating the results from Eqs. (F.4), (F.5) and (F.13), we are done.
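The $(d, s)$ parametrization used in the proof is mechanical to compute from a feature-allocation matrix. The sketch below is our own illustration (the matrix is hypothetical): it reads off $d_{n,x}$ and the histories $s_j$ exactly as defined above.

```python
import numpy as np

# F: feature-allocation matrix; rows = observations, columns = atom locations.
F = np.array([[2, 0, 1, 0],
              [1, 3, 0, 0],
              [0, 1, 0, 2]])
N, J = F.shape

first_row = [np.flatnonzero(F[:, j])[0] for j in range(J)]   # I_j (0-indexed)

# d[n][x]: number of columns first manifested by observation n with atom size x.
d = {}
for j in range(J):
    i, size = first_row[j], F[first_row[j], j]
    d.setdefault(i, {})
    d[i][size] = d[i].get(size, 0) + 1

# s[j]: history of column j from its first manifestation onward.
s = [F[first_row[j]:, j].tolist() for j in range(J)]
print(d)   # {0: {2: 1, 1: 1}, 1: {3: 1}, 2: {2: 1}}
print(s)   # [[2, 1, 0], [3, 1], [1, 0, 0], [2]]
```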
F.2. Lower bound
Proof of Theorem 4.3.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid X) := \delta_X(\cdot)$. With this conditional likelihood, $X_n = Y_n$ and $Z_n = W_n$, meaning:
$$d_{TV}(P_{N,\infty}, P_{N,K}) = d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}}).$$
Now we discuss why the total variation is lower bounded by the function of $N$. Let $A$ be the event that there are at least $\frac{\gamma}{2}C(N,\alpha)$ unique atom locations among the latent states:
$$A := \left\{x_{1:N} : \#\{\text{unique atom locations in } x_{1:N}\} \ge \frac{\gamma\,C(N,\alpha)}{2}\right\}.$$
The probabilities assigned to this event by the approximate and the target models are very different from each other. On the one hand, since $K < \frac{\gamma\,C(N,\alpha)}{2}$, under IFA$_K$, $A$ has measure zero:
$$P_{Z_{1:N}}(A) = 0. \quad (F.14)$$
On the other hand, under beta-Bernoulli, the number of unique atom locations drawn is a Poisson random variable with mean exactly $\gamma\,C(N,\alpha)$ — see Proposition C.1 and Example 4.2. The complement of $A$ is a lower tail event. By Lemma D.3 with $\lambda = \gamma\,C(N,\alpha)$ and $x = \gamma\,C(N,\alpha)/2$:
$$P_{X_{1:N}}(A) \ge 1 - \exp\left(-\frac{\gamma\,C(N,\alpha)}{8}\right). \quad (F.15)$$
Because of Lemma D.9, we can lower bound $C(N,\alpha)$ by a multiple of $\ln N$:
$$\exp\left(-\frac{\gamma\,C(N,\alpha)}{8}\right) \le \exp\left(-\frac{\gamma\alpha\ln N - \alpha\gamma(\psi(\alpha)+1)}{8}\right) = \frac{\text{constant}}{N^{\gamma\alpha/8}}.$$
We now combine Eqs. (F.14) and (F.15) and recall that total variation is the maximum over probability discrepancies.

The proof of Theorem 4.4 relies on the ability to compute a lower bound on the total variation distance between a Binomial distribution and a Poisson distribution.
Proposition F.1 (Lower bound on total variation between Binomial and Poisson). For all $K$, it is true that:
$$d_{TV}\left(\mathrm{Poisson}(\gamma),\ \mathrm{Binom}\left(K,\ \frac{\gamma/K}{\gamma/K+1}\right)\right) \ge C(\gamma)\,K\left(\frac{\gamma/K}{\gamma/K+1}\right)^2,$$
where:
$$C(\gamma) = \frac{1}{8}\,\frac{1}{\gamma + \exp(-1)(\gamma+1)\max(12\gamma^2,\ 48\gamma,\ 28)}.$$
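Before the proof, the inequality is easy to probe numerically. The following is a minimal sketch (ours, not the paper's code), taking $C(\gamma)$ as reconstructed in the statement above; the exact total variation should dominate the stated lower bound for every $K$:

```python
import numpy as np
from scipy.stats import binom, poisson

def tv_binom_poisson(gamma, K, trunc=200):
    """Exact TV between Binom(K, (g/K)/(g/K+1)) and Poisson(g), truncated sum."""
    x = np.arange(trunc)
    p = (gamma / K) / (gamma / K + 1.0)
    return 0.5 * np.abs(binom.pmf(x, K, p) - poisson.pmf(x, gamma)).sum()

gamma = 1.0
C = (1.0 / 8.0) / (gamma + np.exp(-1.0) * (gamma + 1.0)
                   * max(12 * gamma**2, 48 * gamma, 28))
for K in [5, 10, 50, 100]:
    p = (gamma / K) / (gamma / K + 1.0)
    print(K, tv_binom_poisson(gamma, K), C * K * p**2)
```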
Proof of Proposition F.1. We adapt the proof of (Barbour and Hall, 1984, Theorem 2) to our setting. The Poisson($\gamma$) distribution satisfies the functional equality:
$$E[\gamma\,y(Z+1) - Z\,y(Z)] = 0, \quad (F.16)$$
where $y$ is any real-valued function and $Z \sim \mathrm{Poisson}(\gamma)$.

Denote $\gamma_K = \frac{\gamma}{\gamma/K+1}$. For $m \in \mathbb{N}$, let
$$x(m) = m\exp\left(-\frac{m^2}{\theta\gamma_K}\right),$$
where $\theta$ is a constant which will be specified later. $x(m)$ serves as a test function to lower bound the total variation distance between $\mathrm{Poisson}(\gamma)$ and $\mathrm{Binom}(K, \gamma_K/K)$. Let $X_i \sim \mathrm{Ber}(\gamma_K/K)$, independently across $i$ from 1 to $K$, and $W = \sum_{i=1}^K X_i$. Then $W \sim \mathrm{Binomial}(K, \gamma_K/K)$. The following identity is adapted from (Barbour and Hall, 1984, Equation 2.1):
$$E[\gamma_K\,x(W+1) - W\,x(W)] = \left(\frac{\gamma_K}{K}\right)^2\sum_{i=1}^K E[x(W_i+2) - x(W_i+1)], \quad (F.17)$$
where $W_i = W - X_i$.

We first argue that the right hand side is not too small, i.e. for any $i$:
$$E[x(W_i+2) - x(W_i+1)] \ge 1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta\gamma_K}. \quad (F.18)$$
Consider the derivative of $x(m)$:
$$\frac{d}{dm}x(m) = \exp\left(-\frac{m^2}{\theta\gamma_K}\right)\left(1 - \frac{2m^2}{\theta\gamma_K}\right) \ge 1 - \frac{3m^2}{\theta\gamma_K},$$
because of the easy-to-verify inequality $e^{-x}(1-2x) \ge 1-3x$ for $x \ge 0$. This means:
$$x(W_i+2) - x(W_i+1) \ge \int_{W_i+1}^{W_i+2}\left(1 - \frac{3m^2}{\theta\gamma_K}\right)dm = 1 - \frac{3W_i^2 + 9W_i + 7}{\theta\gamma_K}.$$
Taking expectations, noting that $E(W_i) \le \gamma_K$ and $E(W_i^2) = \mathrm{Var}(W_i) + [E(W_i)]^2 \le \sum_{j=1}^K\frac{\gamma_K}{K} + \gamma_K^2 = \gamma_K + \gamma_K^2$, we have proven Eq. (F.18).

Now, because of the positivity of $x$, and that $\gamma \ge \gamma_K$, we trivially have:
$$E[\gamma\,x(W+1) - W\,x(W)] \ge E[\gamma_K\,x(W+1) - W\,x(W)]. \quad (F.19)$$
Combining Eq. (F.17), Eq. (F.18) and Eq. (F.19) we have:
$$E[\gamma\,x(W+1) - W\,x(W)] \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta\gamma_K}\right).$$
Recalling Eq. (F.16), for any coupling $(W, Z)$ such that $W \sim \mathrm{Binom}\!\left(K, \frac{\gamma/K}{\gamma/K+1}\right)$ and $Z \sim \mathrm{Poisson}(\gamma)$:
$$E[\gamma(x(W+1) - x(Z+1)) + Z\,x(Z) - W\,x(W)] \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}\right).$$
Suppose $(W, Z)$ is the maximal coupling attaining the total variation distance between $P_W$ and $P_Z$, i.e. $P(W \ne Z) = d_{TV}(P_W, P_Z)$. Clearly:
$$\gamma(x(W+1) - x(Z+1)) + Z\,x(Z) - W\,x(W) \le \mathbf{1}\{W \ne Z\}\sup_{m_1, m_2}\left|(\gamma x(m_1+1) - m_1 x(m_1)) - (\gamma x(m_2+1) - m_2 x(m_2))\right| \le 2\cdot\mathbf{1}\{W \ne Z\}\sup_m\left|\gamma x(m+1) - m x(m)\right|.$$
Taking expectations on both sides, we conclude that:
$$2\,d_{TV}(P_W, P_Z)\,\sup_m\left|\gamma x(m+1) - m x(m)\right| \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}\right). \quad (F.20)$$
It remains to upper bound $\sup_m|\gamma x(m+1) - m x(m)|$. Recall that the derivative of $x$ is $\exp(-m^2/(\theta\gamma_K))(1 - 2m^2/(\theta\gamma_K))$, taking values in $[-2e^{-3/2}, 1]$, so $-2e^{-3/2} \le x(m+1) - x(m) \le 1$. Hence:
$$|\gamma x(m+1) - m x(m)| = |\gamma(x(m+1) - x(m)) + (\gamma - m)x(m)| \le \gamma + (m+\gamma)\,m\exp\left(-\frac{m^2}{\theta\gamma_K}\right) \le \gamma + (\gamma+1)\,m^2\exp\left(-\frac{m^2}{\theta\gamma_K}\right) \le \gamma + \theta\gamma_K(\gamma+1)\exp(-1), \quad (F.21)$$
where the last inequality owes to the easy-to-verify $x\exp(-x) \le \exp(-1)$. Combining Eq. (F.20) and Eq. (F.21):
$$d_{TV}\left(\mathrm{Binomial}\left(K, \frac{\gamma/K}{\gamma/K+1}\right),\ \mathrm{Poisson}(\gamma)\right) \ge \frac{1}{2}\,\frac{1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}}{\gamma + (\gamma+1)\theta\gamma_K\exp(-1)}\,K\left(\frac{\gamma_K}{K}\right)^2.$$
Finally, we calibrate $\theta$. By selecting $\theta = \max(12\gamma_K,\ 48,\ 28/\gamma_K)$, so that $\theta\gamma_K = \max(12\gamma_K^2, 48\gamma_K, 28)$, we have that the numerator of the unwieldy fraction is at least $\frac{1}{4}$ and its denominator is at most $\gamma + \exp(-1)(\gamma+1)\max(12\gamma^2, 48\gamma, 28)$, since $\gamma_K < \gamma$. This completes the proof.
Proof of Theorem 4.4. First we mention which probability kernel $f$ results in the large total variation distance. For any discrete measure $\sum_{i=1}^M\delta_{\psi_i}$, $f$ is the Dirac measure sitting on $M$, the number of atoms:
$$f\left(\cdot\,\Big|\,\sum_{i=1}^M\delta_{\psi_i}\right) := \delta_M(\cdot). \quad (F.22)$$
Now we show that under such $f$, the total variation distance is lower bounded. First, observe that:
$$d_{TV}(P_{Y_{1:N}}, P_{W_{1:N}}) \ge d_{TV}(P_{Y_1}, P_{W_1}). \quad (F.23)$$
Truly, suppose $(Y_{1:N}, W_{1:N})$ is any coupling of $P_{Y_{1:N}}, P_{W_{1:N}}$. Elementarily we have $P(Y_{1:N} \ne W_{1:N}) \ge P(Y_1 \ne W_1)$. Taking the infimum over couplings to attain the total variation distance, we have shown Eq. (F.23). Hence it suffices to show:
$$d_{TV}(P_{Y_1}, P_{W_1}) \ge C(\gamma)\,\frac{\gamma^2}{K(1+\gamma/K)^2}.$$
Recall the generative processes defining $P_{Y_1}$ and $P_{W_1}$. $Y_1$ is an observation from the target beta-Bernoulli model, so by Proposition C.1:
$$N_T \sim \mathrm{Poisson}(\gamma), \qquad \psi_k \overset{iid}{\sim} H, \qquad X_1 = \sum_{k=1}^{N_T}\delta_{\psi_k}, \qquad Y_1 \sim f(\cdot\mid X_1).$$
$W_1$ is an observation from the approximate model, so by Proposition C.2:
$$N_A \sim \mathrm{Binom}\left(K,\ \frac{\gamma/K}{1+\gamma/K}\right), \qquad \phi_k \overset{iid}{\sim} H, \qquad Z_1 = \sum_{k=1}^{N_A}\delta_{\phi_k}, \qquad W_1 \sim f(\cdot\mid Z_1).$$
Because of the choice of $f$, $Y_1 = N_T$ and $W_1 = N_A$. Hence, by Proposition F.1:
$$d_{TV}(P_{Y_1}, P_{W_1}) = d_{TV}(P_{N_T}, P_{N_A}) \ge C(\gamma)\,\frac{\gamma^2}{K(1+\gamma/K)^2}.$$

Appendix G: Proofs of DP bounds
Our technique to analyze the error made by FSD$_K$ follows a similar vein to the technique in Appendix F. We compare the joint distributions of the latents $X_{1:N}$ and $Z_{1:N}$ (with the underlying $\Theta$ or $\Theta_K$ marginalized out) using the conditional distributions $X_n \mid X_{1:n-1}$ and $Z_n \mid Z_{1:n-1}$. Before going into the proofs, we give the form of the conditionals. The conditional $X_n \mid X_{1:n-1}$ is the well-known Blackwell-MacQueen prediction rule.

Proposition G.1.
(Blackwell and MacQueen, 1973) For $n = 1$, $X_1 \sim H$. For $n \ge 2$:
$$X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1 \sim \frac{\alpha}{n-1+\alpha}\,H + \sum_j\frac{n_j}{n-1+\alpha}\,\delta_{\psi_j},$$
where $\{\psi_j\}$ is the set of unique values among $X_{n-1}, X_{n-2}, \ldots, X_1$ and $n_j$ is the cardinality of the set $\{i : 1 \le i \le n-1,\ X_i = \psi_j\}$.

The conditionals $Z_n \mid Z_{1:n-1}$ are related to the Blackwell-MacQueen prediction rule.
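Proposition G.1's prediction rule is straightforward to simulate. The following is a minimal sketch (ours, not from the paper), assuming a standard-normal base measure $H$ for concreteness:

```python
import numpy as np

def blackwell_macqueen(N, alpha, seed=0):
    """Sample X_1..X_N from the Blackwell-MacQueen urn; base measure H = N(0, 1)."""
    rng = np.random.default_rng(seed)
    draws = []
    for n in range(1, N + 1):
        if rng.uniform() < alpha / (n - 1 + alpha):
            draws.append(rng.standard_normal())       # fresh draw from H
        else:
            draws.append(draws[rng.integers(n - 1)])  # repeat psi_j w.p. n_j/(n-1+alpha)
    return np.array(draws)

x = blackwell_macqueen(1000, alpha=2.0)
print(len(np.unique(x)))   # number of unique values grows like alpha * ln(N)
```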
Proposition G.2. (Pitman, 1996) For $n = 1$, $Z_1 \sim H$. For $n \ge 2$, let $\{\psi_j\}_{j=1}^{J_n}$ be the set of unique values among $Z_{n-1}, Z_{n-2}, \ldots, Z_1$ and let $n_j$ be the cardinality of the set $\{i : 1 \le i \le n-1,\ Z_i = \psi_j\}$. If $J_n < K$:
$$Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1 \sim \frac{(K - J_n)\,\alpha/K}{n-1+\alpha}\,H + \sum_{j=1}^{J_n}\frac{n_j + \alpha/K}{n-1+\alpha}\,\delta_{\psi_j}.$$
Otherwise, if $J_n = K$, there is zero probability of drawing a fresh component from $H$, i.e. $Z_n$ comes only from $\{\psi_j\}_{j=1}^{J_n}$:
$$Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1 \sim \sum_{j=1}^{J_n}\frac{n_j + \alpha/K}{n-1+\alpha}\,\delta_{\psi_j}.$$
$J_n \le K$ is an invariant of these prediction rules: once $J_n = K$, all subsequent $J_m$ for $m \ge n$ are also equal to $K$.

G.1. Upper bounds
Proof of Theorem 5.1.
First, because of Lemma D.7, it suffices to show that $d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}})$ is small, since the conditional distributions of the observations given the latent variables are the same across target and approximate models.

To show that $d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}})$ is small, we will construct a coupling of $X_{1:N}$ and $Z_{1:N}$ such that for any $n \ge 1$:
$$P(X_n \ne Z_n \mid X_{1:n-1} = Z_{1:n-1}) \le \frac{2\alpha}{K}\,\frac{J_n}{n-1+\alpha}, \quad (G.1)$$
where $J_n$ is the number of unique atom locations among $X_{1:n-1}$. Such a coupling exists because the total variation distance between the prediction rules $X_n \mid X_{1:n-1}$ and $Z_n \mid Z_{1:n-1}$ is small: as total variation is the minimum difference probability, there exists a coupling that achieves the total variation distance. Consider any measurable set $A$. If $J_n < K$, the probabilities of $A$ under the two rules are respectively:
$$\frac{\alpha(1 - J_n/K)}{n-1+\alpha}\,H(A) + \sum_{j=1}^{J_n}\frac{n_j+\alpha/K}{n-1+\alpha}\,\delta_{\psi_j}(A), \qquad \frac{\alpha}{n-1+\alpha}\,H(A) + \sum_{j=1}^{J_n}\frac{n_j}{n-1+\alpha}\,\delta_{\psi_j}(A),$$
meaning the absolute difference in probability mass is:
$$\left|\frac{\alpha}{K}\,\frac{J_n\,H(A)}{n-1+\alpha} - \frac{\alpha}{K}\sum_{j=1}^{J_n}\frac{\delta_{\psi_j}(A)}{n-1+\alpha}\right| \le \left|\frac{\alpha}{K}\,\frac{J_n\,H(A)}{n-1+\alpha}\right| + \left|\frac{\alpha}{K}\sum_{j=1}^{J_n}\frac{\delta_{\psi_j}(A)}{n-1+\alpha}\right| \le \frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha} + \frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha} = \frac{2\alpha}{K}\,\frac{J_n}{n-1+\alpha}.$$
The same upper bound holds for the case $J_n = K$. The couplings for different $n$ are naturally glued together because of the recursive nature of the conditional distributions.

We now show that for the coupling satisfying Eq. (G.1), the overall probability of difference $P(X_{1:N} \ne Z_{1:N})$ is small. Define the shorthand:
$$C(N, \alpha) := \sum_{n=1}^N\frac{\alpha}{n-1+\alpha}.$$
The definition of the typical set depends on the relative deviation $\delta$, which we calibrate at the end of the proof. Define the typical set:
$$D_n := \left\{x_{1:n-1} : J_n \le (1+\delta)\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\right\}.$$
In other words, the number of unique values among the $x_{1:n-1}$ is small. The following decomposition is used to investigate the difference probability on the typical set:
$$P(X_{1:N} \ne Z_{1:N}) = P((X_{1:N-1}, X_N) \ne (Z_{1:N-1}, Z_N)) = P(X_{1:N-1} \ne Z_{1:N-1}) + P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1}). \quad (G.2)$$
The second term can be further expanded:
$$P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N) + P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \notin D_N).$$
The former term is at most $P(X_N \ne Z_N \mid X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N)$, while the latter term is at most $P(X_{1:N-1} \notin D_N)$. To recap, we can bound $P(X_{1:N} \ne Z_{1:N})$ by bounding three quantities:
1. The difference probability of a shorter process, $P(X_{1:N-1} \ne Z_{1:N-1})$.
2. The difference probability of the prediction rule on typical sets, $P(X_N \ne Z_N \mid X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N)$.
3. The probability of the atypical set, $P(X_{1:N-1} \notin D_N)$.
By recursively applying the expansion initiated in Eq. (G.2) to $P(X_{1:N-1} \ne Z_{1:N-1})$, we actually only need to bound the difference probabilities of the prediction rules on typical sets and the atypical set probabilities.

Regarding the difference probability of the prediction rules, being in the typical set allows us to control $J_n$ in Eq. (G.1). Summation across $n = 1$ through $N$ gives the overall bound of:
$$\frac{2}{K}(1+\delta)\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\,C(N,\alpha) \le \frac{\text{constant}\,\ln N\,(\ln N + \ln K)}{K}. \quad (G.3)$$
Regarding the atypical set probabilities, because $J_{n-1}$ is stochastically dominated by $J_n$ — i.e., the number of unique values at time $n$ is at least the number at time $n-1$ — all the atypical set probabilities are upper bounded by the last one, $P(X_{1:N-1} \notin D_N)$. $J_N$ is the sum of independent Poisson trials, with an overall mean equaling exactly $C(N-1, \alpha)$. Therefore, the atypical event has small probability because of Lemma D.1:
$$P\big(J_N > (1+\delta)\max(C(N-1,\alpha),\ C(K,\alpha))\big) \le \exp\left(-\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\right).$$
Even accounting for all $N$ atypical events, the total probability is small:
$$\exp\left(-\left(\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big) - \ln(N-1)\right)\right).$$
By Lemma D.9, $\max(C(N-1,\alpha), C(K,\alpha)) \ge \alpha\max(\ln(N-1), \ln K) - \alpha(\psi(\alpha)+1)$. Therefore, if we set $\delta$ such that $\frac{\delta^2}{2+\delta}\alpha = 2$, we have:
$$\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big) - \ln(N-1) \ge \ln K - \text{constant},$$
meaning the overall atypical probability is at most:
$$\frac{\text{constant}}{K}. \quad (G.4)$$
The overall total variation bound combines Eqs. (G.3) and (G.4).
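The per-step bound in Eq. (G.1) can be checked directly: for fixed cluster counts, the two prediction rules differ only through the $\alpha/K$ corrections. A small sketch (our own illustration, with hypothetical counts):

```python
import numpy as np

def prediction_rule_tv(counts, alpha, K):
    """Exact TV between the Blackwell-MacQueen and FSD_K prediction rules given
    cluster sizes `counts` (J_n = len(counts) < K). The fresh-draw mass can be
    lumped into one entry because both rules spread it over the same diffuse H."""
    J = len(counts)
    denom = sum(counts) + alpha          # n - 1 + alpha
    bm = np.array([alpha] + list(counts)) / denom
    fsd = np.array([(K - J) * alpha / K] + [c + alpha / K for c in counts]) / denom
    return 0.5 * np.abs(bm - fsd).sum()

counts, alpha, K = [5, 3, 1], 1.0, 100
tv = prediction_rule_tv(counts, alpha, K)
bound = 2 * alpha / K * len(counts) / (sum(counts) + alpha)   # Eq. (G.1)
print(tv, bound, tv <= bound + 1e-12)
```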
Proof of Corollary 5.2. The main idea is reducing to the Dirichlet process mixture model situation. This can be done in two steps.

First, the conditional distribution of the observations $W_{1:D} \mid H_{1:D}$ of the target model is the same as the conditional distribution $Z_{1:D} \mid F_{1:D}$ of the approximate model if $H_{1:D} = F_{1:D}$. Hence to control the total variation between $P_W$ and $P_Z$ it suffices to control the total variation between $P_{H_{1:D}}$ and $P_{F_{1:D}}$, because of Lemma D.7. Second, the distance between $P_{H_{1:D}}$ and $P_{F_{1:D}}$ can be upper bounded by the distance between the atom locations that define $H_{1:D}$ and $F_{1:D}$. Recall the construction of the $F_d$ in terms of atom locations $\phi_{d,j}$ and stick-breaking weights $\gamma_{d,j}$:
$$G_K \sim \mathrm{FSD}_K(\omega, H), \qquad \phi_{dj} \mid G_K \overset{iid}{\sim} G_K(\cdot) \text{ across } d, j, \qquad \gamma_{dj} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha) \text{ across } d, j \text{ (except } \gamma_{dT} = 1\text{)},$$
$$F_d \mid \phi_{d,\cdot}, \gamma_{d,\cdot} = \sum_{i=1}^T\gamma_{di}\prod_{j<i}(1-\gamma_{dj})\,\delta_{\phi_{di}}.$$
Proof of Theorem 5.3.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid x) = \delta_x(\cdot)$. With this conditional likelihood, $X_n = Y_n$ and $Z_n = W_n$, meaning:
$$d_{TV}(P_{N,\infty}, P_{N,K}) = d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}}).$$
Now we discuss why the total variation is lower bounded by the function of $N$. Let $A$ be the event that there are at least $\frac{1}{2}C(N,\alpha)$ unique components among the latent states:
$$A := \left\{x_{1:N} : \#\{\text{unique components in } x_{1:N}\} \ge \frac{C(N,\alpha)}{2}\right\}.$$
The probabilities assigned to this event by the approximate and the target models are very different from each other. On the one hand, since $K < \frac{C(N,\alpha)}{2}$, under FSD$_K$, $A$ has measure zero:
$$P_{Z_{1:N}}(A) = 0. \quad (G.5)$$
On the other hand, under the DP, the number of unique atoms drawn is the sum of Poisson trials with expectation exactly $C(N,\alpha)$. The complement of $A$ is a lower tail event. Hence by Lemma D.2 with $\delta = 1/2$ and $\mu = C(N,\alpha)$, we have:
$$P_{X_{1:N}}(A) \ge 1 - \exp\left(-\frac{C(N,\alpha)}{8}\right). \quad (G.6)$$
Because of Lemma D.9, we can lower bound $C(N,\alpha)$ by a multiple of $\ln N$:
$$\exp\left(-\frac{C(N,\alpha)}{8}\right) \le \exp\left(-\frac{\alpha\ln N - \alpha(\psi(\alpha)+1)}{8}\right) = \frac{\text{constant}}{N^{\alpha/8}}.$$
We now combine Eqs. (G.5) and (G.6) and recall that total variation is the maximum over probability discrepancies.
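The event $A$ is also easy to probe by simulation: under the DP, the number of unique clusters concentrates around $C(N,\alpha)$, so any truncation $K$ below $C(N,\alpha)/2$ cannot cover it. A sketch (ours, with hypothetical settings):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 2000, 1.0
C_N = (alpha / (np.arange(N) + alpha)).sum()   # C(N, alpha) = sum_n alpha/(n-1+alpha)

def num_unique(N, alpha):
    """Number of unique clusters in one Blackwell-MacQueen draw of length N."""
    labels, J = np.empty(N, dtype=int), 0
    for n in range(N):
        if rng.uniform() < alpha / (n + alpha):
            labels[n], J = J, J + 1            # fresh cluster
        else:
            labels[n] = labels[rng.integers(n)]
    return J

draws = np.array([num_unique(N, alpha) for _ in range(200)])
print(C_N, draws.mean())                # these should roughly agree
print((draws >= C_N / 2).mean())        # P(A) near 1; FSD_K gives 0 when K < C_N/2
```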
Proof of Theorem 5.4.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid x) = \delta_x(\cdot)$.

Now we show that under such $f$, the total variation distance is lower bounded. Observe that it suffices to understand the total variation between $P_{Y_1, Y_2}$ and $P_{W_1, W_2}$. Truly, suppose $(Y_{1:N}, W_{1:N})$ is any coupling of $P_{Y_{1:N}}$ and $P_{W_{1:N}}$. Elementarily we have $P(Y_{1:N} \ne W_{1:N}) \ge P((Y_1, Y_2) \ne (W_1, W_2))$. Taking the infimum, we have:
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge d_{TV}(P_{Y_1, Y_2}, P_{W_1, W_2}).$$
Since $f$ is Dirac, $X_n = Y_n$ and $Z_n = W_n$, and we have:
$$d_{TV}(P_{Y_1, Y_2}, P_{W_1, W_2}) = d_{TV}(P_{X_1, X_2}, P_{Z_1, Z_2}).$$
Now, let $((X_1, X_2), (Z_1, Z_2))$ be any coupling of $P_{X_1, X_2}$ and $P_{Z_1, Z_2}$. We have:
$$P((X_1, X_2) \ne (Z_1, Z_2)) = P(X_1 \ne Z_1) + P(X_2 \ne Z_2 \mid X_1 = Z_1)\,P(X_1 = Z_1) \ge P(X_2 \ne Z_2 \mid X_1 = Z_1),$$
where the final inequality holds because the middle expression equals $P(X_2 \ne Z_2 \mid X_1 = Z_1) + (1 - P(X_2 \ne Z_2 \mid X_1 = Z_1))\,P(X_1 \ne Z_1)$, a quantity at least the conditional probability. We now investigate how small $P(X_2 \ne Z_2 \mid X_1 = Z_1)$ can be. In the conditioning $X_1 = Z_1$, let the common atom be $\psi$. The prediction rule $X_2 \mid X_1 = \psi$ puts mass $\frac{1}{1+\alpha}$ on $\psi$, while the prediction rule $Z_2 \mid Z_1 = \psi$ puts mass $\frac{1+\alpha/K}{1+\alpha}$. This means that the total variation distance between the two prediction rules is at least:
$$\frac{1+\alpha/K}{1+\alpha} - \frac{1}{1+\alpha} = \frac{\alpha}{1+\alpha}\,\frac{1}{K}.$$
Since the probability of difference under any coupling is at least the total variation distance, we conclude that for any coupling $((X_1, X_2), (Z_1, Z_2))$:
$$P(X_2 \ne Z_2 \mid X_1 = Z_1) \ge \frac{\alpha}{1+\alpha}\,\frac{1}{K}.$$
Hence we have a lower bound on $P((X_1, X_2) \ne (Z_1, Z_2))$ itself. As the coupling was arbitrary, we take the infimum to attain the lower bound on total variation.

Appendix H: Experimental setup
H.1. Image denoising
The experiments in this section aim to isolate the effect of TFA versus IFA by fitting different approximations of the beta-Bernoulli model to denoise an image. We give a description of our models and their hyper-parameter settings. Each patch $x_i$ is flattened into a vector in $\mathbb{R}^n$. Let $I_n$ be the $n \times n$ identity matrix, and similarly for $I_K$. The base measure generating the basis elements is the same:
$$\psi_k \overset{iid}{\sim} N(0,\ n^{-1}I_n), \quad k = 1, 2, \ldots, K.$$
The observational likelihood conditioned on the feature-allocation matrix $F \in \{0,1\}^{N\times K}$ and basis elements $\{\psi_k\}_{k=1}^K$ is the same for both models:
$$\gamma_w \sim \mathrm{Gamma}(10^{-6}, 10^{-6}), \qquad \gamma_e \sim \mathrm{Gamma}(10^{-6}, 10^{-6}),$$
$$w_i \overset{iid}{\sim} N(0,\ \gamma_w^{-1}I_K), \quad i = 1, 2, \ldots, N, \qquad \epsilon_i \overset{iid}{\sim} N(0,\ \gamma_e^{-1}I_n), \quad i = 1, 2, \ldots, N,$$
$$x_i = \sum_{k=1}^K F_{i,k}\,w_{i,k}\,\psi_k + \epsilon_i, \quad i = 1, 2, \ldots, N, \quad (\mathrm{H.1})$$
where we are using the shape-rate parametrization of the gamma. Finally, how the feature-allocation matrix $F$ is generated is the sole difference between TFA and IFA. The underlying beta process being approximated has rate measure $\nu(d\theta) = \theta^{-1}\,\mathbf{1}\{\theta \le 1\}\,d\theta$.
• TFA:
$$v_k \overset{iid}{\sim} \mathrm{Beta}(1, 1), \quad \pi_k = \prod_{i=1}^k v_i, \quad k = 1, 2, \ldots, K, \qquad F_{i,k} \mid \pi_k \overset{indep}{\sim} \mathrm{Ber}(\pi_k), \quad i = 1, 2, \ldots, N.$$
• IFA:
$$\pi_k \overset{iid}{\sim} \mathrm{Beta}\left(\frac{1}{K},\ 1\right), \quad k = 1, 2, \ldots, K, \qquad F_{i,k} \mid \pi_k \overset{indep}{\sim} \mathrm{Ber}(\pi_k), \quad i = 1, 2, \ldots, N.$$
In Eq. (H.1), we enrich the basic feature-allocation structure by introducing weights $w_{i,k}$, which allow an observation to manifest a non-integer (and potentially negative) scaled version of the basis element. Following (Zhou et al., 2009), we are uninformative about the noise precisions by choosing $\mathrm{Gamma}(10^{-6}, 10^{-6})$. Regarding the choice of hyper-parameters for the underlying beta process, (Zhou et al., 2009) suggests that the performance of the denoising routine is insensitive to the choice of $\gamma$ and $\alpha$: we picked $\gamma = \alpha = 1$ for computational convenience, especially since the beta process with $\alpha = 1$ admits the simple stick-breaking construction.

The posterior over (trait, frequency) pairs and per-observation allocations is traversed for a certain number of steps using a Gibbs sampler. Each visited dictionary and assignment is used to compute each patch's mean value: the candidate output pixel value is the mean over patches covering that pixel. We aggregate the output images across Gibbs steps by a weighted averaging mechanism.
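The only difference between the two priors is how the frequencies $\pi_k$ are drawn. A minimal sketch (ours; the seed is arbitrary, $\gamma = 1$ as in the text) samples both, showing the sequentially coupled stick-breaking weights of the TFA against the exchangeable IFA draws, which could be simulated on separate workers:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_, K = 1.0, 60

# TFA: sequentially coupled weights via stick breaking (beta process, alpha = 1).
v = rng.beta(gamma_, 1.0, size=K)
pi_tfa = np.cumprod(v)                  # pi_k = v_1 * ... * v_k, decreasing in k

# IFA: independent weights; each atom is simulable in parallel.
pi_ifa = rng.beta(gamma_ / K, 1.0, size=K)

print(pi_tfa.sum(), pi_ifa.sum())       # both total masses concentrate near gamma_
```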
H.2. Topic modelling

Nearly 1 million random Wikipedia documents were downloaded and processed following (Hoffman, Bach and Blei, 2010).

IFA:
$$G_0 \sim \mathrm{FSD}_K(\omega,\ \mathrm{Dir}(\eta\mathbf{1}_V)),$$
$$G_d \sim \text{T-DP}_T(\alpha, G_0) \quad \text{independently across } d = 1, 2, \ldots, D,$$
$$\beta_{dn} \mid G_d \sim G_d(\cdot) \quad \text{independently across } n = 1, 2, \ldots, N_d,$$
$$w_{dn} \mid \beta_{dn} \sim \mathrm{Categorical}(\beta_{dn}) \quad \text{independently across } n = 1, 2, \ldots, N_d.$$
TFA:
$$G_0 \sim \text{T-DP}_K(\omega,\ \mathrm{Dir}(\eta\mathbf{1}_V)),$$
$$G_d \sim \text{T-DP}_T(\alpha, G_0) \quad \text{independently across } d = 1, 2, \ldots, D,$$
$$\beta_{dn} \mid G_d \sim G_d(\cdot) \quad \text{independently across } n = 1, 2, \ldots, N_d,$$
$$w_{dn} \mid \beta_{dn} \sim \mathrm{Categorical}(\beta_{dn}) \quad \text{independently across } n = 1, 2, \ldots, N_d.$$
Hyper-parameter settings follow (Wang, Paisley and Blei, 2011), in that $\eta = 0.01$, $\alpha = 1.0$, $\omega = 1.0$, $T = 20$.

We approximate the posterior in each model using stochastic variational inference (Hoffman et al., 2013). Both models have conditional conjugacies that allow the use of exponential-family variational distributions and closed-form expectation equations. The batch size is 500, with learning rate parametrized by $\rho_t = (t+\tau)^{-\kappa}$, where by default $\tau = 1.0$ and $\kappa = 0.9$.

We now discuss how held-out log-likelihood is computed. Each held-out document $d'$ is separated into two parts, $w_{ho}$ and $w_{obs}$, with no common words between the two. In our experiments, we set 75% of words to be observed and the remaining 25% unseen. (How each document is separated into these two parts can have an impact on the range of test log-likelihood values encountered. For instance, if the first $x\%$ of words, in order of appearance in the document, were the observed words and the last $(100-x)\%$ the unseen ones, then the test log-likelihood is low, presumably since predicting future words using only past words and without any filtering is challenging. Randomly assigning words to be observed and unseen gives better test log-likelihood.) The predictive distribution of each word $w_{new}$ in $w_{ho}$ is exactly equal to:
$$p(w_{new} \mid \mathcal{D}, w_{obs}) = \int_{\theta_{d'}, \beta} p(w_{new} \mid \theta_{d'}, \beta)\,p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs})\,d\theta_{d'}\,d\beta.$$
This is an intractable computation, as the posterior $p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs})$ is not analytical. We approximate it with a factorized distribution:
$$p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs}) \approx q(\beta \mid \mathcal{D})\,q(\theta_{d'}),$$
where $q(\beta \mid \mathcal{D})$ is fixed to be the variational approximation found during training and $q(\theta_{d'})$ minimizes the KL divergence between the variational distribution and the posterior. Operationally, we do an E-step for the document $d'$ based on the variational distribution of $\beta$ and the observed words $w_{obs}$, and discard the distribution over $z_{d',\cdot}$, the per-word topic assignments, because of the mean-field assumption. Using those approximations, the predictive approximation is approximately:
$$p(w_{new} \mid \mathcal{D}, w_{obs}) \approx \tilde p(w_{new} \mid \mathcal{D}, w_{obs}) = \sum_{k=1}^K E_q[\theta_{d'}(k)]\,E_q[\beta_k(w_{new})],$$
and the final number we report for document $d'$ is:
$$\frac{1}{|w_{ho}|}\sum_{w \in w_{ho}}\log\tilde p(w \mid \mathcal{D}, w_{obs}).$$
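Given the variational expectations, the reported quantity is a single matrix product. The following is a minimal sketch of the final scoring step (ours; the random arrays are hypothetical stand-ins for the fitted variational means):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 20, 1000
E_theta = rng.dirichlet(np.ones(K))          # E_q[theta_d'(k)], from the held-out E-step
E_beta = rng.dirichlet(np.ones(V), size=K)   # E_q[beta_k(w)], fixed from training

w_ho = rng.integers(0, V, size=50)           # vocabulary indices of the unseen words

# p~(w | D, w_obs) = sum_k E_q[theta_d'(k)] * E_q[beta_k(w)]
pred = E_theta @ E_beta                      # (V,) predictive distribution over words
score = np.log(pred[w_ho]).mean()            # average held-out log-likelihood
print(score)
```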
Appendix I: Additional experiments

I.1. Plane
The results for the plane image are Figs. I.1, I.2 and I.3.
I.2. Truck
The results for the truck image are Figs. I.4, I.5 and I.6.

Fig I.1: Original versus corrupted image for plane.
Fig I.2: PSNR versus approximation level for plane.
Fig I.3: (a) TFA training; (b) IFA training. The output of one model is a good initialization for the training of the other one. Here K = 60.
Fig I.4: Original versus corrupted images for truck.
Fig I.5: PSNR versus approximation level for truck.
Fig I.6: (a) TFA training; (b) IFA training. The output of one model is a good initialization for the training of the other one. Here K = 60.