Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick

CSAIL, MIT, e-mail: [email protected]; [email protected]; [email protected]
Department of Statistics & Mathematics, Boston University, e-mail: [email protected]
Microsoft Research, e-mail: [email protected]
Abstract:
Bayesian nonparametrics based on completely random measures (CRMs) offers a flexible modeling approach when the number of clusters or latent components in a dataset is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which (1) makes IFAs well-suited for parallel and distributed computation and (2) facilitates more convenient inference schemes. While IFAs have been developed in certain special cases in the past, there has not yet been a general template for their construction or a systematic comparison to TFAs. We show how to construct IFAs for approximating distributions in a large family of CRMs, encompassing all those typically used in practice. We quantify the approximation error between IFAs and the target nonparametric prior, and prove that, in the worst case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.
1. Introduction
Many data analysis problems can be seen as discovering a latent set of traits in a population. For instance, we might recover topics or themes from scientific papers, ancestral populations from genetic data, interest groups from social network data, or unique speakers across audio recordings of many meetings (Palla, Knowles and Ghahramani, 2012; Blei, Griffiths and Jordan, 2010; Fox et al., 2010). In all of these cases, we might reasonably expect the number of latent traits present in a data set to grow with the size of the data. One modeling option is to choose a different prior for different data set sizes, but this approach is unwieldy and inconvenient. A simpler option is to choose a single prior that naturally yields different expected numbers of traits for different numbers of data points. In theory,
Bayesian nonparametrics provides a rich set of priors with exactly this desirable property thanks to a countable infinity of traits, so that there are always more traits to reveal through the accumulation of more data. This latent, infinite-dimensional parameter presents a major practical challenge, though. In what follows, we propose a simple approximation that applies across a wide range of BNP models and can be seen as a generalization of certain existing special cases. Furthermore, it is amenable to modern, efficient inference schemes and black-box code; fits easily within complex, potentially deep generative models; and admits straightforward parallelization.

Background
A particular challenge of the infinite-dimensional parameter is that it is impossible to store an infinity of random variables in memory or learn the distribution over an infinite number of variables in finite time. Some authors have developed conjugate priors and likelihoods (Orbanz, 2010) to circumvent the infinite representation via marginalization and thereby perform exact Bayesian posterior inference (Broderick, Wilson and Jordan, 2018; James, 2017). However, these priors and likelihoods are often just a single piece within a more complex generative model, which is no longer fully conjugate and therefore requires an approximate posterior inference scheme such as Markov chain Monte Carlo (MCMC) or variational Bayes (VB). Some local steps in, e.g., an MCMC sampler can still take advantage of conditional conjugacy via special marginal forms such as the Chinese restaurant process (Teh et al., 2006) or the Indian buffet process (Griffiths and Ghahramani, 2005); see Broderick, Wilson and Jordan (2018) and James (2017) for general treatments. But using these marginal distributions rather than a full and explicit representation of the latent variables typically necessitates a Gibbs sampler, which can be slow to mix and may require special-purpose, model-specific sampling moves. To take advantage of black-box variational inference methods (Ranganath, Gerrish and Blei, 2014; Kucukelbir et al., 2015), modern MCMC methods such as the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996) or Hamiltonian Monte Carlo (HMC) (Neal, 2011; Betancourt, 2017), or modern probabilistic programming systems such as Stan (Carpenter et al., 2017), a full trait representation is generally required. An alternative approach that still allows use of these convenient inference methods is to approximate the infinite-dimensional prior with a finite-dimensional prior that essentially replaces the infinite collection of random traits by a finite subset of "likely" traits. Unlike a fixed finite-dimensional prior across all data set sizes, this finite-dimensional prior is seen as an approximation to the BNP prior, and thereby its cardinality is informed directly by the BNP prior. Note that any moderately complex model will necessitate approximate inference, so as long as the approximation error from using the finite-dimensional prior is on the order of the approximation error from MCMC or VB, no inferential quality has been lost.

Much of the previous work on finite approximations developed and analyzed truncations of the random measures underlying the nonparametric prior (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Roychowdhury and Kulis, 2015; Campbell et al., 2019); we call these truncated finite approximations (TFAs) and refer to Campbell et al. (2019) for a thorough study of constructions for TFAs. In the present work, we instead consider a finite approximation consisting of independent and identically distributed (i.i.d.) representations of the traits together with their rates within the population; we call these independent finite approximations (IFAs). The IFA approach has the potential to be simpler to incorporate in a complex hierarchical model, to exhibit improved mixing, and to be amenable to parallelizing computation during inference. There are not many known finite approximations using i.i.d. random variables, and we are unaware of any general-purpose results on constructing them.
Our Contributions
We propose a construction for IFAs that subsumes a number of special cases which have already been successfully used in applications, with practitioners reporting similar performance to the truncation approach but with faster mixing (Kurihara, Welling and Teh, 2007; Saria, Koller and Penn, 2010; Fox et al., 2010; Johnson and Willsky, 2013). On the other hand, our construction is distinct from that presented in Lee, James and Choi (2016), which has an arguably smaller scope of application. We propose a broad mechanism for our i.i.d. finite approximations and relate them to existing work. We then quantify the effect of replacing the infinite-dimensional priors with an IFA in probabilistic models, providing interpretable error bounds with explicit dependence on the size of the approximation and the data cardinality. The error bounds reveal that, in the worst case, approximating the target to a given accuracy requires a large IFA model where a small TFA model would suffice. However, such differences have not been observed in practice, and we confirm through experiments with image denoising and topic modeling that IFAs and TFAs perform similarly on applied problems, with IFAs benefiting from conceptual ease of use.
2. Background
We start by summarizing relevant background on nonparametric priors constructed from completely random measures, and on how truncated and independent finite approximations for these priors are constructed. Let $\psi_i$ represent the $i$-th trait of interest and let $\theta_i$ represent the rate, or frequency, of this trait in the population. We can collect the pairs of traits with their frequencies $(\psi_i, \theta_i)$ in a measure that places non-negative mass $\theta_i$ at location $\psi_i$: $\Theta := \sum_{i=1}^{I} \theta_i \delta_{\psi_i}$. The total number of traits $I$ may be finite or, as in the nonparametric setting, countably infinite. To perform Bayesian inference, we need to choose a prior distribution on $\Theta$ and a likelihood for the observed data $Y_{1:N} := \{Y_n\}_{n=1}^{N}$ given $\Theta$, and then we must apply Bayes' theorem to obtain the posterior on $\Theta$ given the observed data.
Completely random measures

Most common BNP priors can be conveniently formulated as (normalizations of) completely random measures (CRMs). CRMs are constructed from Poisson point processes, which are straightforward to manipulate both analytically and algorithmically. Consider a Poisson point process on $\mathbb{R}_+ := [0, \infty)$ with rate measure $\nu(\mathrm{d}\theta)$ such that $\nu(\mathbb{R}_+) = \infty$ and $\int \min(1, \theta)\, \nu(\mathrm{d}\theta) < \infty$. Such a process generates an infinite number of rates $(\theta_i)_{i=1}^{\infty}$, $\theta_i \in \mathbb{R}_+$, having an almost surely finite sum $\sum_{i=1}^{\infty} \theta_i < \infty$. We assume throughout that $\psi_i \in \Psi$ for some space $\Psi$ and $\psi_i \overset{\text{i.i.d.}}{\sim} H$ for some diffuse distribution $H$. $H$ serves as a prior on the trait values: in topic modeling, each topic is a probability vector in the simplex of vocabulary words, and it is typical to use $H = \mathrm{Dir}$. The resulting measure $\Theta$ in this case is a completely random measure (CRM) (Kingman, 1967). As shorthand, we will write $\mathrm{CRM}(H, \nu)$ for the completely random measure generated as just described: $\Theta := \sum_i \theta_i \delta_{\psi_i} \sim \mathrm{CRM}(H, \nu)$. The corresponding normalized CRM (NCRM) is $\Xi := \Theta / \Theta(\Psi)$, which is a discrete probability measure. The set of atom locations of $\Xi$ is the same as that of $\Theta$, while the atom sizes are normalized: $\Xi = \sum_i \xi_i \delta_{\psi_i}$, where $\xi_i = \theta_i / (\sum_j \theta_j)$.
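In code, a draw with finitely many atoms can be represented by parallel arrays of weights and locations. The following minimal sketch (ours, for illustration only; the Beta weights and uniform base measure $H$ are placeholder assumptions) shows the normalization step that turns $\Theta$ into $\Xi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder finitely supported measure: I atoms with weight theta_i at psi_i.
# Beta weights and a uniform base measure H are illustrative assumptions only.
I = 20
theta = rng.beta(0.1, 1.0, size=I)   # non-negative atom weights theta_i
psi = rng.uniform(0.0, 1.0, size=I)  # atom locations psi_i ~ H

total_mass = theta.sum()             # Theta(Psi), the total mass
xi = theta / total_mass              # normalized weights of Xi = Theta/Theta(Psi)
assert np.isclose(xi.sum(), 1.0)     # Xi is a probability measure
```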
Finite approximations

Since the sequence $(\theta_i)_{i=1}^{\infty}$ is countably infinite, it may be difficult to simulate or perform posterior inference in the full model. One approximation scheme is to define the finite approximation $\Theta_K := \sum_{i=1}^{K} \theta_i \delta_{\psi_i}$. Since it involves a finite number of parameters, $\Theta_K$ can be used for efficient posterior inference, including with black-box MCMC and VB algorithms, but some approximation error is introduced by not using the full CRM $\Theta$. (The possible fixed-location and deterministic components of an (N)CRM (Kingman, 1967) are not considered here for brevity; these components can be added, assuming they are purely atomic, and our analysis modified without undue effort.)

A truncated finite approximation (TFA) (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Roychowdhury and Kulis, 2015) requires constructing an ordering on the sequence $(\theta_i)_{i=1}^{\infty}$ such that $\theta_i$ is a function of some auxiliary random variables $\xi_1, \dots, \xi_i$; hence, $\theta_{i+1}$ reuses the same auxiliary randomness as $\theta_i$, plus an additional random variable $\xi_{i+1}$. Thus, the value of $\theta_{i+1}$ implicitly depends on the values of $\theta_1, \dots, \theta_i$. Truncated finite approximations are attractive because the approximations are nested in $K$: in general, the approximation quality increases with $K$, and to refine an existing truncation, it suffices to generate the next terms in the sequence. On the other hand, the complex dependences between the atoms $\theta_1, \theta_2, \dots$ potentially make inference more challenging.

We here instead pursue what we call an independent finite approximation (IFA), which involves choosing a sequence of probability measures $\nu_1, \nu_2, \dots$ such that for any approximation level $K$, we choose $\theta_1, \dots, \theta_K \overset{\text{i.i.d.}}{\sim} \nu_K$. The $\nu_K$ are chosen in such a way that $\Theta_K \overset{D}{\Longrightarrow} \Theta$ as $K \to \infty$; that is, the IFAs converge in distribution to the CRM. The pros and cons of the IFA invert those of the TFA: the atoms are now i.i.d., potentially making inference easier, but a completely new approximation must be constructed if $K$ changes. Existing work (Paisley and Carin, 2009; Broderick et al., 2015; Acharya, Ghosh and Zhou, 2015; Lee, James and Choi, 2016; Lee, Miscouridou and Caron, 2019) has only developed i.i.d. finite approximations on a case-by-case basis, whereas our focus is a general-purpose mechanism.

For the normalized atom sizes $\xi_i = \theta_i / \sum_j \theta_j$, finite approximations also involve random measures with finite support, $\Xi_K = \sum_{i=1}^{K} \xi_i \delta_{\psi_i}$. TFAs can be defined in one of two ways. In the first approach, the TFA corresponding to the CRM is normalized to form the approximation of the NCRM (Campbell et al., 2019). The second approach instead directly constructs an ordering over the sequence $(\xi_i)_{i=1}^{\infty}$ and truncates this representation (Ishwaran and James, 2001; Blei and Jordan, 2006). Regarding the independent approach, we will only normalize the IFAs that target a given CRM to form the approximation of the corresponding NCRM.
The beta process

For concreteness, we consider the beta process (Teh and Görür, 2009; Broderick, Jordan and Pitman, 2012) as a running example of a CRM. We denote its distribution as $\mathrm{BP}(\gamma, \alpha, d)$, with discount parameter $d \in [0, 1)$, concentration parameter $\alpha > -d$, mass parameter $\gamma > 0$, and rate measure
$$\nu(\mathrm{d}\theta) = \gamma \frac{\Gamma(\alpha + 1)}{\Gamma(1 - d)\Gamma(\alpha + d)}\, \mathbb{1}[\theta \le 1]\, \theta^{-d-1} (1 - \theta)^{\alpha + d - 1}\, \mathrm{d}\theta.$$
The case in which $d = 0$ is the standard beta process (Hjort, 1990; Thibaux and Jordan, 2007). The beta process is typically paired with the Bernoulli likelihood process $l(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$; the combination has been used for factor analysis (Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012) and dictionary learning (Zhou et al., 2009).
3. Constructing independent finite approximations
We first show how to easily construct independent finite approximations to a completely random measure. Specifically, our first main result shows how to construct IFAs that converge in distribution to CRMs with rate measures of a particular form. As an important special case, if the CRM is an exponential family CRM (Broderick, Wilson and Jordan, 2018) and the "discount" parameter $d = 0$, then the IFA is constructed from random variables in the same exponential family, a connection which is useful not only for approximate inference algorithms but also for the theoretical analysis of the approximation itself. Finally, we show how normalized IFAs converge to the corresponding NCRM, in the sense that the partition induced by the IFA converges to that induced by the NCRM.

Formally, IFAs take the following form. For probability measures $H$ and $\nu_K$, write $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$ if
$$\Theta_K = \sum_{i=1}^{K} \theta_{K,i} \delta_{\psi_{K,i}}, \qquad \theta_{K,i} \overset{\text{i.i.d.}}{\sim} \nu_K, \qquad \psi_{K,i} \overset{\text{i.i.d.}}{\sim} H.$$
We consider CRMs with rate measures $\nu$ whose densities, near zero, are (essentially) proportional to $\theta^{-1-d}$, where $d \in [0, 1)$ is the "discount" parameter. The explicit assumptions on $\nu$ are given in Assumption 1.

Assumption 1.
For $d \in [0, 1)$ and $\eta \in E$, where $E$ is a subset of a finite-dimensional Euclidean space, let $\Theta \sim \mathrm{CRM}(H, \nu(\cdot\,; d, \eta))$, where
$$\nu(\mathrm{d}\theta; d, \eta) := \gamma\, \frac{\theta^{-1-d}\, g(\theta)^{-d}\, h(\theta; \eta)}{Z(1 - d, \eta)}\, \mathrm{d}\theta.$$
Assume that:
1. for $\xi > 0$ and $\eta \in E$, $Z(\xi, \eta) = \int \theta^{\xi - 1} g(\theta)^{\xi} h(\theta; \eta)\, \mathrm{d}\theta < \infty$;
2. $g$ is continuous, $g(0) = 1$, and there exist $0 < c_* \le c^* < \infty$ such that $c_* \le g(\theta)^{-1} \le c^* (1 + \theta)$; and
3. there exists $\epsilon > 0$ such that, for all $\eta \in E$, $\theta \mapsto h(\theta; \eta)$ is continuous and bounded on $[0, \epsilon]$.

Other than the discount $d$ and mass $\gamma$, the rate measure $\nu$ potentially has additional hyperparameters, which are encapsulated by $\eta$. The finiteness of the normalizer $Z$ is necessary for defining finite-dimensional distributions whose densities are very similar in form to $\nu$. The conditions on the behavior of $g(\theta)$ and $h(\theta; \eta)$ imply that the overall rate measure's behavior near $\theta = 0$ is dominated by the $\theta^{-1-d}$ term. These are mild regularity conditions: most popular BNP priors can be cast in this form, and the functions $g(\theta)$ and $h(\theta; \eta)$ are such that all three assumptions can be easily verified. Appendix A shows how common processes such as the beta, gamma (Ferguson and Klass, 1972; Kingman, 1975; Brix, 1999; Titsias, 2008; James, 2013), beta prime (Broderick, Wilson and Jordan, 2018) and generalized gamma processes satisfy Assumption 1.

We will now define a sequence of IFAs that converge in distribution to such a CRM. Our IFA construction requires the following definition.

Definition 3.1.
The parameterized function family $\{S_b\}_{b \in \mathbb{R}_+}$ are approximate indicators if, for any $b \in \mathbb{R}_+$, $S_b(\theta)$ is a real increasing function such that $S_b(\theta) = 0$ for $\theta \le 0$ and $S_b(\theta) = 1$ for $\theta \ge b$.

Valid examples of approximate indicators are the indicator function $S_b(\theta) = \mathbb{1}[\theta > 0]$ and the smoothed indicator function
$$S_b(\theta) = \begin{cases} \exp\left( \dfrac{-1}{1 - (\theta - b)^2/b^2} + 1 \right) & \text{if } \theta \in (0, b), \\ \mathbb{1}[\theta > 0] & \text{otherwise.} \end{cases}$$
Our first result now shows how to construct IFAs that provably converge to our family of CRMs.
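Before stating that result, here is a minimal sketch (our illustration, not from the paper) of the smoothed indicator above; the vectorized form is an implementation convenience:

```python
import numpy as np

def smoothed_indicator(theta, b):
    """Approximate indicator S_b: 0 for theta <= 0, 1 for theta >= b, and the
    smooth increasing ramp exp(-1/(1 - (theta-b)^2/b^2) + 1) on (0, b)."""
    theta = np.asarray(theta, dtype=float)
    out = (theta > 0).astype(float)      # 1[theta > 0] outside the ramp
    ramp = (theta > 0) & (theta < b)
    z = (theta[ramp] - b) / b            # z in (-1, 0) on the ramp
    out[ramp] = np.exp(-1.0 / (1.0 - z ** 2) + 1.0)
    return out

# S_b(0) = 0, S_b(b) = 1, and S_b increases in between:
print(smoothed_indicator([0.0, 0.25, 0.5, 1.0, 2.0], b=1.0))
```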
Theorem 3.2.
Suppose Assumption 1 holds. Let $\{S_b\}_{b \in \mathbb{R}_+}$ be a family of approximate indicators. Fix $a > 0$ and a decreasing sequence $(b_K)_{K \in \mathbb{N}}$ such that $b_K \to 0$. For $c := \gamma\, h(0; \eta) / Z(1 - d, \eta)$ and $\kappa := \min(1, \epsilon)$, let
$$\nu_K(\mathrm{d}\theta) := Z_K^{-1}\, \theta^{cK^{-1} - 1 - d\, S_{b_K}(\theta - aK^{-1})}\, g(\theta)^{cK^{-1} - d}\, h(\theta; \eta)\, \mathrm{d}\theta$$
be a family of probability densities, where $Z_K$ is chosen such that $\int \nu_K(\mathrm{d}\theta) = 1$. If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\Longrightarrow} \Theta$ as $K \to \infty$.

The proof can be found in Appendix B.1. The scope of Theorem 3.2 is broader than that of known i.i.d. finite approximations. Namely, Lee, James and Choi (2016, Theorem 2) design i.i.d. finite approximations that converge in distribution to either a beta process with $d > 0$ or a generalized gamma process with $d > 0$; those constructions do not cover the case $d = 0$, whereas our construction naturally incorporates this situation.

An important corollary of Theorem 3.2 applies to exponential family CRMs with $d = 0$. In common BNP models, the relationship between the likelihood $l(\cdot \mid \theta)$ and the CRM prior is closely related to the well-known conjugacy in exponential families (Broderick, Wilson and Jordan, 2018, Section 4). In particular, the likelihood has an exponential family form
$$l(x \mid \theta) := \kappa(x)\, \theta^{\phi(x)} \exp\left( \langle \mu(\theta), t(x) \rangle - A(\theta) \right). \tag{1}$$
Here $x \in \mathbb{N} \cup \{0\}$, $\kappa(x)$ is the base density, $[t(x), \phi(x)]^T$ is the vector of sufficient statistics, $A(\theta)$ is the log partition function, $[\mu(\theta), \log \theta]^T$ is the vector of natural parameters, and $\langle \mu(\theta), t(x) \rangle$ is an inner product. As for the rate measure, we will analyze those that behave like $\theta^{-1}$ near 0:
$$\nu(\theta) := \gamma' \theta^{-1} \exp\left\{ \left\langle \begin{pmatrix} \psi \\ \lambda \end{pmatrix}, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathbb{1}\{\theta \in U\}, \tag{2}$$
where $\gamma' > 0$, $\lambda > 0$, and $U \subset \mathbb{R}_+$ is the support of $\nu$. Eq. (2) leads to the suggestive terminology of exponential
CRMs. The $\theta^{-1}$ dependence near 0 means that these models lack power-law behavior (e.g., in the beta process; see Teh and Görür (2009)). Models that can be cast in this form include the beta process with Bernoulli likelihood, the beta process with negative binomial likelihood (Broderick et al., 2015; Zhou et al., 2012) and the gamma process with Poisson likelihood (Acharya, Ghosh and Zhou, 2015; Roychowdhury and Kulis, 2015). For shorthand, we refer to these models as beta–Bernoulli, beta–negative binomial and gamma–Poisson, respectively. The normalizer
$$S(\xi, \eta) := \int_U \theta^{\xi} \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathrm{d}\theta \tag{3}$$
of the exponential family distribution plays an important role in the sequel. Note that $S$ is equal to the normalization quantity $Z$ appearing in Assumption 1, but specialized to the exponential family rate measure.

We now state the simple form taken by $\mathrm{IFA}_K$ for exponential family CRMs. The assumptions are the natural analogues of Assumption 1, specialized to exponential family rate measures.

Corollary 3.3.
Let $\nu$ be of the form Eq. (2), and assume that:
1. $S(\xi, \eta) < \infty$ for $\xi > -1$;
2. there exists $\epsilon > 0$ such that, for any $\psi, \lambda$, $\theta \mapsto \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\} \mathbb{1}\{\theta \in U\}$ is a continuous and bounded function of $\theta$ on $[0, \epsilon]$.

For $c := \gamma' \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(0) \\ -A(0) \end{pmatrix} \right\rangle \right\}$, let
$$\nu_K(\theta) := \frac{\mathbb{1}\{\theta \in U\}}{S(c/K - 1, \eta)}\, \theta^{c/K - 1} \exp\left\{ \left\langle \eta, \begin{pmatrix} \mu(\theta) \\ -A(\theta) \end{pmatrix} \right\rangle \right\}. \tag{4}$$
If $\Theta_K \sim \mathrm{IFA}_K(H, \nu_K)$, then $\Theta_K \overset{D}{\Longrightarrow} \Theta$.

Corollary 3.3 is sufficient to recover known IFA results for $\mathrm{BP}(\gamma, \alpha, 0)$ (Doshi-Velez et al., 2009; Paisley and Carin, 2009; Griffiths and Ghahramani, 2011). Appendix A uses Corollary 3.3 to construct IFAs for more example CRMs.
Example 3.1 (Beta process). When $d = 0$, the rate measure of the beta process is $\nu(\theta) = \gamma\alpha\, \theta^{-1} \exp((\alpha - 1) \log(1 - \theta))\, \mathbb{1}\{0 \le \theta \le 1\}$. The normalizer depends only on $\xi$ and $\alpha - 1$: $S_{\mathrm{BP}} = \int_0^1 \theta^{\xi} (1 - \theta)^{\alpha - 1} \exp(0)\, \mathrm{d}\theta = B(\xi + 1, \alpha)$. The assumptions in Corollary 3.3 can be quickly verified: $S_{\mathrm{BP}} < \infty$ for $\xi > -1$, since $B(\xi + 1, \alpha) < \infty$ for $\xi + 1 > 0$, $\alpha > 0$, and $\theta \mapsto (1 - \theta)^{\alpha - 1}$ is clearly bounded and continuous on the interval $[0, 0.5]$ for any $\alpha > 0$. Therefore $\nu_K = \mathrm{Beta}(\gamma\alpha/K, \alpha)$.

In comparison, Doshi-Velez et al. (2009) approximate $\mathrm{BP}(\gamma, 1, 0)$ with each $\nu_K$ a $\mathrm{Beta}(\gamma/K, 1)$ distribution. Griffiths and Ghahramani (2011) also approximate $\mathrm{BP}(\gamma, \alpha, 0)$ with $\nu_K$ being $\mathrm{Beta}(\gamma\alpha/K, \alpha)$. Lastly, Paisley and Carin (2009) approximate $\mathrm{BP}(\gamma, \alpha, 0)$ with $\nu_K$ being a $\mathrm{Beta}(\gamma\alpha/K, \alpha(1 - 1/K))$ distribution, and the difference between $\mathrm{Beta}(\gamma\alpha/K, \alpha)$ and $\mathrm{Beta}(\gamma\alpha/K, \alpha(1 - 1/K))$ is not substantive.

Given that $\mathrm{IFA}_K$ is a converging approximation to the corresponding target CRM, it is natural to ask if the normalization of $\mathrm{IFA}_K$ converges to the corresponding normalization of the CRM, i.e., the NCRM. Our next result shows that the normalized IFA indeed converges, in the sense of exchangeable partition probability functions, or EPPFs (Pitman, 1995). The EPPF of an NCRM $\Xi$ gives the probability of partitions of $\{1, 2, \dots, N\}$ induced by sampling from $\Xi$. In particular, under the model $\Xi \sim \mathrm{NCRM}$, $V_n \mid \Xi \overset{\text{i.i.d.}}{\sim} \Xi$ for $1 \le n \le N$, with the effect of $\Xi$ marginalized out, the ties among the $V_n$'s induce a partition over the set $\{1, 2, \dots, N\}$. Let there be $t \le N$ distinct values among the $V_n$'s, and let $n_i$ be the number of elements in the $i$-th block of the partition induced by sampling from $\Xi$, so that $n_i \ge 1$, $\sum_{i=1}^{t} n_i = N$. The probability of the induced partition is a symmetric function $p(n_1, n_2, \dots, n_t)$ that depends only on the frequencies $n_i$ of each block. The EPPF of $\mathrm{IFA}_K$ is defined analogously.

Theorem 3.4.
Suppose Assumption 1 holds, and let $\Theta_K$ be as in Theorem 3.2. Let $p(n_1, n_2, \dots, n_t)$ be the EPPF of the NCRM $\Xi := \Theta / \Theta(\Psi)$ and let $p_K(n_1, n_2, \dots, n_t)$ be the EPPF of the normalized IFA $\Xi_K := \Theta_K / \Theta_K(\Psi)$. Then, for any $N$ and any $n_i \ge 1$ with $\sum_{i=1}^{t} n_i = N$,
$$\lim_{K \to \infty} p_K(n_1, n_2, \dots, n_t) = p(n_1, n_2, \dots, n_t).$$

The proof can be found in Appendix B.2. Since the EPPF gives the probability of each partition, the point-wise convergence in Theorem 3.4 certifies that the distribution over partitions induced by the normalized $\mathrm{IFA}_K$ converges to that induced by the target NCRM, for any finite data cardinality $N$.
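As a concrete illustration of the constructions in this section (ours, not from the paper), the sketch below samples the beta-process IFA of Example 3.1 and the induced Bernoulli trait counts; the uniform base measure $H$ is a placeholder assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
gam, alpha, K, N = 2.0, 1.5, 200, 50

# IFA_K for BP(gam, alpha, 0): K i.i.d. Beta(gam*alpha/K, alpha) atom weights.
theta = rng.beta(gam * alpha / K, alpha, size=K)
psi = rng.uniform(size=K)   # psi_i i.i.d. ~ H (uniform H is an assumption)

# Bernoulli likelihood process: x_{n,i} ~ Bern(theta_i), i.i.d. over n.
X = rng.binomial(1, theta, size=(N, K))

# K * E[theta_i] = gam / (1 + gam/K) -> gam, so the expected total mass matches
# the target beta process (E[Theta(Psi)] = gam) as K grows.
print(theta.sum(), X.sum(axis=1).mean())
```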
4. Non-asymptotic error bounds for CRM-based models
Theorem 3.2 justifies the use of $\mathrm{IFA}_K$ in the asymptotic limit $K \to \infty$ but does not provide guidance on choosing an appropriate approximation level for modeling a data process with a given cardinality $N$. In this section, we quantify the effect of replacing a CRM with $\mathrm{IFA}_K$ (for finite $K$) in probabilistic models using error bounds that are simple to manipulate, easily yielding a recommendation of the appropriate $K$ for a given $N$ and accuracy level.

The CRM prior on $\Theta$ is typically combined with a likelihood that generates trait counts for each data point. Let $l(\cdot \mid \theta)$ be a proper probability mass function on $\mathbb{N} \cup \{0\}$ for all $\theta$ in the support of $\nu$. Then a collection of conditionally independent observations $X_{1:N}$ given $\Theta$ are distributed according to the likelihood process $\mathrm{LP}(l, \Theta)$, i.e., $X_n := \sum_i x_{ni} \delta_{\psi_i} \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l, \Theta)$, if $x_{ni} \sim l(\cdot \mid \theta_i)$ independently across $i$ and i.i.d. across $n$. Since the trait counts are typically latent in a full generative model specification, define the observed data $Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n)$ for a probability kernel $f$. For instance, if the sequence $(\theta_i)_{i=1}^{\infty}$ represents the topic rates in a document corpus, $X_n$ might capture how many words in document $n$ are generated from each topic and $Y_n$ might be the observed collection of words for that document. The target nonparametric model can thus be summarized as
$$\Theta \sim \mathrm{CRM}(H, \nu), \quad X_n \mid \Theta \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta), \quad Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n), \quad n = 1, 2, \dots, N. \tag{5}$$
The approximating finite-dimensional model, with $\nu_K$ given in Theorem 3.2 (or Corollary 3.3), is
$$\Theta_K \sim \mathrm{IFA}_K(H, \nu_K), \quad Z_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta_K), \quad W_n \mid Z_n \overset{\text{indep}}{\sim} f(\cdot \mid Z_n), \quad n = 1, 2, \dots, N. \tag{6}$$
Let $P_{N,\infty}$ be the distribution of the observations $Y_{1:N}$, and $P_{N,K}$ the distribution of the observations $W_{1:N}$. We define the approximation error to be the total variation distance $d_{TV}(P_{N,K}, P_{N,\infty})$ between the two observational processes, one using the CRM and the other using the approximate $\mathrm{IFA}_K$ as the prior (Ishwaran and Zarepour, 2002; Doshi-Velez et al., 2009; Paisley, Blei and Jordan, 2012; Campbell et al., 2019). Recall that the total variation distance is the supremum difference in probability mass over measurable sets: $d_{TV}(P_{N,K}, P_{N,\infty}) := \sup_A |P_{N,K}(A) - P_{N,\infty}(A)|$. We restrict attention to exponential family CRM-likelihood pairs. We require Definition 4.1 to express the assumptions on the target model.
Definition 4.1.
Suppose $l(\cdot \mid \theta)$ has the form Eq. (1) and $\nu(\theta)$ has the form Eq. (2). For $n \in \mathbb{N}$ and $x_{1:n-1} \in (\mathbb{N} \cup \{0\})^{n-1}$, define the shorthands $T_n := \sum_{m=1}^{n-1} t(x_m)$ and $\Phi_n := \sum_{m=1}^{n-1} \phi(x_m)$. For $x \in \mathbb{N} \cup \{0\}$, let
$$h_c(x \mid x_{1:n-1}) := \kappa(x)\, \frac{S\left( \Phi_n + \phi(x) - 1,\; \eta + \begin{pmatrix} T_n + t(x) \\ n \end{pmatrix} \right)}{S\left( \Phi_n - 1,\; \eta + \begin{pmatrix} T_n \\ n - 1 \end{pmatrix} \right)},$$
$$\widetilde{h}_c(x \mid x_{1:n-1}) := \kappa(x)\, \frac{S\left( c/K + \Phi_n + \phi(x) - 1,\; \eta + \begin{pmatrix} T_n + t(x) \\ n \end{pmatrix} \right)}{S\left( c/K + \Phi_n - 1,\; \eta + \begin{pmatrix} T_n \\ n - 1 \end{pmatrix} \right)},$$
and
$$M_{n,x} := \gamma'\, \kappa(0)^{n-1} \kappa(x)\, S\left( (n - 1)\phi(0) + \phi(x) - 1,\; \eta + \begin{pmatrix} (n - 1)\, t(0) + t(x) \\ n \end{pmatrix} \right).$$

We show in Appendix C that the functions $h_c$, $\widetilde{h}_c$ and $M_{n,x}$ govern the marginal process representation of the probabilistic models (Broderick, Wilson and Jordan, 2018, Section 6). Namely, the joint distribution of $X_{1:N}$ can be expressed in terms of the conditionals $X_n \mid X_{1:n-1}$, with $M_{n,x}$ and $h_c$ governing this process. Similarly, the joint distribution of $Z_{1:N}$ can be expressed in terms of the conditionals $Z_n \mid Z_{1:n-1}$, with $\widetilde{h}_c$ governing this process. For the beta-Bernoulli process with $d = 0$, the functions have particularly simple forms.

Example 4.1 (Beta-Bernoulli with $d = 0$). For the beta-Bernoulli model with $d = 0$, we have
$$h_c(x \mid x_{1:n-1}) = \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n}\, \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n}\, \mathbb{1}\{x = 0\},$$
$$\widetilde{h}_c(x \mid x_{1:n-1}) = \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K}\, \mathbb{1}\{x = 1\} + \frac{\alpha + \sum_{i=1}^{n-1} (1 - x_i)}{\alpha - 1 + n + \gamma\alpha/K}\, \mathbb{1}\{x = 0\},$$
$$M_{n,1} = \frac{\gamma\alpha}{\alpha - 1 + n}, \qquad M_{n,x} = 0 \text{ for } x > 1.$$
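As a numerical sanity check (ours, not from the paper), the beta-Bernoulli formulas above can be evaluated directly: $\widetilde{h}_c$ should approach $h_c$, and $K \widetilde{h}_c(1 \mid 0)$ should approach $M_{n,1}$, both at rate $O(1/K)$:

```python
gam, alpha = 2.0, 1.5
n, sum_x = 10, 3  # predicting observation n; 3 ones among x_1, ..., x_{n-1}

def h_c(sum_x, n, alpha):
    """P(x_n = 1 | history) at an instantiated atom under the target model."""
    return sum_x / (alpha - 1 + n)

def h_c_tilde(sum_x, n, alpha, gam, K):
    """Same predictive under the IFA with nu_K = Beta(gam*alpha/K, alpha)."""
    a = gam * alpha / K
    return (sum_x + a) / (alpha - 1 + n + a)

M_n1 = gam * alpha / (alpha - 1 + n)  # expected new-atom rate, target model

for K in [10, 100, 1000]:
    err_h = abs(h_c(sum_x, n, alpha) - h_c_tilde(sum_x, n, alpha, gam, K))
    err_M = abs(M_n1 - K * h_c_tilde(0, n, alpha, gam, K))
    print(K, err_h, err_M)  # both errors shrink at rate O(1/K)
```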
We now formulate the conditions which can be used to show that $d_{TV}(P_{N,K}, P_{N,\infty})$ is small.

Assumption 2. There exist positive constants $\{C_i\}$ such that the following hold.
1. For all $n \in \mathbb{N}$,
$$\sum_{x=1}^{\infty} M_{n,x} \le \frac{C_1}{n - C_2}. \tag{7}$$
2. For all $n \in \mathbb{N}$,
$$\sum_{x=1}^{\infty} \widetilde{h}_c(x \mid x_{1:n-1} = 0) \le \frac{1}{K} \frac{C_3}{n - C_4}. \tag{8}$$
3. For any $n \in \mathbb{N}$ and any $\{x_i\}_{i=1}^{n-1}$,
$$\sum_{x=0}^{\infty} \left| h_c(x \mid x_{1:n-1}) - \widetilde{h}_c(x \mid x_{1:n-1}) \right| \le \frac{1}{K} \frac{C_5}{n - C_6}. \tag{9}$$
4. For all $n \in \mathbb{N}$ and any $K \ge C_7 (\ln n + C_8)$,
$$\sum_{x=1}^{\infty} \left| M_{n,x} - K \widetilde{h}_c(x \mid x_{1:n-1} = 0) \right| \le \frac{1}{K} \frac{C_9 \ln n + C_{10}}{n - C_{11}}. \tag{10}$$

Note that the conditions depend only on the functions in Definition 4.1 and not on the observational likelihood $f(\cdot)$, which maps the latent states to the observations. The condition in Eq. (7) constrains the growth rate of the target model: $\sum_{n=1}^{N} \sum_{x=1}^{\infty} M_{n,x}$ is the expected number of components for data cardinality $N$, and since each $\sum_{x=1}^{\infty} M_{n,x}$ is at most $O(1/n)$, the total number of components is $O(\ln N)$. The condition in Eq. (9) means that $\widetilde{h}_c$ is a very good approximation of $h_c$ in total variation distance; furthermore, the longer the vector $\{x_i\}_{i=1}^{n-1}$, the smaller the error. Similarly, the condition in Eq. (10) means that $K \widetilde{h}_c(\cdot \mid 0)$ is a very accurate approximation of $M_{n,\cdot}$, and there is also a reduction in the error as $n$ increases. The set of constants $C_i$ which satisfy Assumption 2 is not unique: we are in general not interested in the best constants $C_i$, only in their existence. We speculate that such assumptions can be made more explicit in terms of the normalizer $S$. For instance, the $1/K$ dependence is due to smoothness of $S$ in its first argument, while the dependence on $n$ is due to some inherent notion of scale dictated by the second and third arguments.

Assumption 2 can be verified for the most important CRM models. In Example 4.2 we verify it for the beta-Bernoulli model, and in Appendix E we verify it for the beta-negative binomial and gamma-Poisson models.

Example 4.2 (Beta-Bernoulli with $d = 0$, continued). The growth rate of the target model is
$$\sum_{x=1}^{\infty} M_{n,x} = M_{n,1} = \frac{\gamma\alpha}{n - 1 + \alpha}.$$
Since $\widetilde{h}_c$ is supported on $\{0, 1\}$, the growth rate of the approximate model satisfies
$$\widetilde{h}_c(1 \mid x_{1:n-1} = 0) = \frac{\gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} \le \frac{1}{K} \frac{\gamma\alpha}{n - 1 + \alpha}.$$
Since both $h_c$ and $\widetilde{h}_c$ are supported on $\{0, 1\}$, Eq. (9) becomes
$$\left| h_c(1 \mid x_{1:n-1}) - \widetilde{h}_c(1 \mid x_{1:n-1}) \right| = \left| \frac{\sum_{i=1}^{n-1} x_i + \gamma\alpha/K}{\alpha - 1 + n + \gamma\alpha/K} - \frac{\sum_{i=1}^{n-1} x_i}{\alpha - 1 + n} \right| \le \frac{\gamma\alpha}{K} \frac{1}{n - 1 + \alpha}.$$
Again, because $M_{n,x} = \widetilde{h}_c(x \mid \cdot) = 0$ for $x > 1$, Eq. (10) becomes
$$\left| M_{n,1} - K \widetilde{h}_c(1 \mid x_{1:n-1} = 0) \right| = \left| \frac{\gamma\alpha}{\alpha - 1 + n} - \frac{\gamma\alpha}{\alpha - 1 + n + \gamma\alpha/K} \right| \le \frac{\gamma^2\alpha^2}{K} \frac{1}{(n - 1 + \alpha)^2}.$$
Calibrating $\{C_i\}$ based on these inequalities is straightforward.

Under the aforementioned assumptions, Theorem 4.2 upper bounds the approximation error.
Theorem 4.2 (Upper bound for exponential family CRMs). If Assumption 2 holds, then there exist positive constants $C', C'', C'''$ depending only on $\{C_i\}$ such that
$$d_{TV}(P_{N,\infty}, P_{N,K}) \le \frac{C' + C'' \ln N + C''' \ln^2 N\, \ln K}{K}.$$
The proof can be found in Appendix F.1. Theorem 4.2 states that the IFA approximation error grows as $O(\ln^2 N)$ for fixed $K$, and decreases as $O(\ln K / K)$ for fixed $N$. On the one hand, for fixed $K$, it is expected that the error increases as $N$ increases: with more data, the number of latent components in the data increases, demanding finite approximations of increasingly larger sizes. In particular, $O(\ln N)$ is the standard Bayesian nonparametric growth rate for non-power-law models (Griffiths and Ghahramani, 2011). It is likely that the $O(\ln^2 N)$ factor can be improved to $O(\ln N)$; more generally, we conjecture that the error depends directly on the expected number of latent components in a model for $N$ observations. On the other hand, for fixed $N$, the error goes to zero at least as fast as $O(\ln K / K)$. We also suspect the $\ln K$ factor in the numerator can be removed.

As Theorem 4.2 is only an upper bound, a natural question to investigate is the tightness of the bound in terms of $N$ and $K$. In this section, we focus on the beta-Bernoulli process with $d = 0$; i.e., $P_{N,\infty}$ refers to the observational process coming from $\mathrm{BP}(\gamma, \alpha, 0)$ and $P_{N,K}$ refers to the observational process of $\mathrm{IFA}_K$ with $\nu_K$ as in Example 3.1.

We first look at the dependence of the error bound in terms of $\ln N$. For any $N \in \mathbb{N}$, $\alpha > 0$, we define the growth function
$$C(N, \alpha) := \sum_{n=1}^{N} \frac{\alpha}{n - 1 + \alpha}. \tag{11}$$
It is known that $C(N, \alpha) = \Omega(\ln N)$ (see Lemma D.9). Theorem 4.3 shows that finite approximations cannot be accurate if the approximation level is too small compared to the growth function $C(N, \alpha)$.

Theorem 4.3 ($\ln N$ is necessary). For the beta-Bernoulli model with $d = 0$, there exists an observation likelihood $f$, independent of $K$ and $N$, such that for any $N$, if $K \le \gamma C(N, \alpha)$, then
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge 1 - C N^{-\gamma\alpha/2},$$
where $C$ only depends on the hyper-parameters of the beta process, i.e., $\gamma, \alpha$.

The proof is given in Appendix F.2. Theorem 4.3 implies that as $N$ grows, if the approximation level $K$ fails to surpass the $\gamma C(N, \alpha) = \Omega(\ln N)$ threshold, then the total variation between the approximate and the target model remains bounded away from zero; in fact, the error tends to one.

Now turning to the dependence on $K$ of the upper bound in Theorem 4.2, we discuss a lower bound on the approximation error, which reveals that the $1/K$ factor in the upper bound is tight (modulo logarithmic factors).

Theorem 4.4 (Lower bound of $1/K$). For the beta-Bernoulli model with $d = 0$, there exists an observation likelihood $f$, independent of $K$ and $N$, such that for any $N$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge \frac{C(\gamma)}{K},$$
where $C(\gamma) > 0$ is an explicit constant depending only on $\gamma$.

The proof can be found in Appendix F.2. While Theorem 4.2 implies that an IFA with $K = O(\mathrm{poly}(\ln N)/\epsilon)$ atoms suffices to approximate the target model to less than $\epsilon$ error, Theorem 4.4 implies that an IFA with $K = \Omega(1/\epsilon)$ atoms is necessary in the worst case. This dependence on the accuracy level means that IFAs are worse than TFAs in theory. For example, consider Bondesson approximations (Bondesson, 1982) of $\mathrm{BP}(\gamma, \alpha, 0)$.

Example 4.3 (Bondesson approximation (Bondesson, 1982)). Let $\alpha \ge 1$. Let $E_l \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(1)$ and $\Gamma_k = \sum_{l=1}^{k} E_l$. The level-$K$ Bondesson approximation of $\mathrm{BP}(\gamma, \alpha, 0)$ is a TFA $\sum_{k=1}^{K} \theta_k \delta_{\psi_k}$, where $\theta_k = V_k \exp(-\Gamma_k / \gamma\alpha)$, $V_k \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha - 1)$ and $\psi_k \overset{\text{i.i.d.}}{\sim} H$.

The following result gives a bound on the error of the Bondesson approximation.

Proposition 4.5 (Campbell et al., 2019). For $\gamma > 0$, $\alpha \ge 1$, let $\Theta_K$ be distributed according to a level-$K$ Bondesson approximation of $\mathrm{BP}(\gamma, \alpha, 0)$, $R_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \mathrm{LP}(l; \Theta_K)$, $T_n \mid R_n \overset{\text{indep}}{\sim} f(\cdot \mid R_n)$ with $N$ observations. Let $Q_{N,K}$ be the distribution of the observations $T_{1:N}$. Then:
$$d_{TV}(P_{N,\infty}, Q_{N,K}) \le N\gamma \left( \frac{\gamma\alpha}{\gamma\alpha + 1} \right)^K.$$

Proposition 4.5 implies that a TFA with $K = O(\ln(N/\epsilon))$ atoms suffices to approximate the target model to less than $\epsilon$ error. Modulo log factors, comparing the necessary $1/\epsilon$ level for IFAs and the sufficient $\ln(1/\epsilon)$ level for TFAs, we conclude that the necessary size for an IFA is exponentially larger than the sufficient size for a TFA, in the worst case.
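For contrast with the i.i.d. IFA weights, the following minimal sketch (ours) samples the level-$K$ Bondesson TFA of Example 4.3; note how every weight reuses the cumulative sums $\Gamma_k$, so the atoms are sequentially coupled. The uniform base measure is again a placeholder assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
gam, alpha, K = 2.0, 2.0, 100                    # Example 4.3 assumes alpha >= 1

Gamma = np.cumsum(rng.exponential(1.0, size=K))  # Gamma_k = E_1 + ... + E_k
V = rng.beta(1.0, alpha - 1.0, size=K)           # V_k i.i.d. Beta(1, alpha - 1)
                                                 # (for alpha = 1, V_k degenerates at 1)
theta = V * np.exp(-Gamma / (gam * alpha))       # sequentially coupled weights
psi = rng.uniform(size=K)                        # psi_k i.i.d. ~ H (assumption)

# The exp(-Gamma_k/(gam*alpha)) factor makes the weights stochastically
# decreasing in k, so truncating at level K discards only small atoms.
print(theta[:5], theta.sum())
```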
5. Non-asymptotic error bounds for Dirichlet process-based models
Having analyzed the error incurred by $\mathrm{IFA}_K$ in CRM-based models like beta-Bernoulli, gamma-Poisson and beta-negative binomial, we now turn to the approximation error in NCRM-based models. Our notion of approximation error remains the total variation distance between the target and the approximate observational processes. The forms of the upper and lower bounds are very similar to Theorems 4.2, 4.3 and 4.4. We leave the derivation of bounds for more general NCRMs to future work.

We focus on the Dirichlet process [DP] (Ferguson, 1973; Sethuraman, 1994), which is the normalization of a non-power-law gamma process, and the finite symmetric Dirichlet [FSD] distribution, which is the normalization of the IFA for the gamma process. The Dirichlet process is one of the most widely used nonparametric priors. The gamma process CRM has rate measure
$$\nu(\mathrm{d}\theta) = \gamma \frac{\lambda^{1-d}}{\Gamma(1 - d)}\, \theta^{-d-1} e^{-\lambda\theta}\, \mathrm{d}\theta.$$
We denote its distribution as $\Gamma\mathrm{P}(\gamma, \lambda, d)$. The normalization of $\Gamma\mathrm{P}(\gamma, 1, 0)$ is a Dirichlet process with mass parameter $\gamma$ (Kingman, 1975; Ferguson, 1973). By Corollary 3.3, $\mathrm{IFA}_K(H, \nu_K)$ with $\nu_K(\theta) = \mathrm{Gam}(\theta; \gamma/K, 1)$ converges in distribution to $\Gamma\mathrm{P}(\gamma, 1, 0)$, and the normalization of $\mathrm{IFA}_K(H, \nu_K)$ is equal in distribution to $\sum_{i=1}^{K} p_i \delta_{\psi_i}$, where $\psi_i \overset{\text{i.i.d.}}{\sim} H$ and $\{p_i\}_{i=1}^{K} \sim \mathrm{Dir}(\frac{\gamma}{K} 1_K)$. We denote this as $\mathrm{FSD}_K(\gamma, H)$.

We consider Dirichlet process mixture models (Antoniak, 1974):
$$\Theta \sim \mathrm{DP}(\alpha, H), \quad X_n \mid \Theta \overset{\text{i.i.d.}}{\sim} \Theta, \quad Y_n \mid X_n \overset{\text{indep}}{\sim} f(\cdot \mid X_n) \tag{12}$$
with corresponding approximation
$$\Theta_K \sim \mathrm{FSD}_K(\alpha, H), \quad Z_n \mid \Theta_K \overset{\text{i.i.d.}}{\sim} \Theta_K, \quad W_n \mid Z_n \overset{\text{indep}}{\sim} f(\cdot \mid Z_n). \tag{13}$$
Let $P_{N,\infty}$ be the distribution of the observations $Y_{1:N}$, and let $P_{N,K}$ be the distribution of the observations $W_{1:N}$.

Upper bounds on the error made by $\mathrm{FSD}_K$ can be used to determine the sufficient $K$ to approximate the target process for a given $N$ and accuracy level. We upper bound $d_{TV}(P_{N,\infty}, P_{N,K})$ in Theorem 5.1.
Theorem 5.1 (Upper bound for DP mixture models). For some constants $C_1, C_2, C_3$ that only depend on $\alpha$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \le \frac{C_1 + C_2 \ln N + C_3 \ln^2 N\, \ln K}{K}.$$
The proof is given in Appendix G.1. Theorem 5.1 is similar to Theorem 4.2. The $O(\ln^2 N)$ growth of the bound for fixed $K$ can likely be reduced to $O(\ln N)$, the inherent growth rate of DP mixture models (Miller and Harrison, 2013). The $O(\ln K / K)$ rate of decrease to zero is tight because of a $1/K$ lower bound on the approximation error. Theorem 5.1 is an improvement over the existing theory for $\mathrm{FSD}_K$, in the sense that Ishwaran and Zarepour (2002, Theorem 4) provide an upper bound on $d_{TV}(P_{N,\infty}, P_{N,K})$ that lacks an explicit dependence on $K$ or $N$; that bound cannot be inverted to determine the sufficient $K$ to approximate the target to a given accuracy, while doing so is simple using Theorem 5.1.

Theorem 5.1 can also be used to analyze models with additional hierarchical structure. For instance, the hierarchical Dirichlet process [HDP] and variants are important use cases of the DP and have demonstrated great practical use. We will analyze the error made by $\mathrm{FSD}_K$ for a variant of the HDP we call the modified HDP. In the HDP, there is a population measure generated by a DP, $G \sim \mathrm{DP}(\omega, H)$, and for each sub-population indexed by $d$, the sub-population measure is generated as $G_d \mid G \sim \mathrm{DP}(\alpha, G)$. In the modified HDP, the sub-population measure is instead distributed as $G_d \mid G \sim \mathrm{TSB}_T(\alpha, G)$, where the TSB distribution is defined in Example 5.1.

Example 5.1 (Stick-breaking approximation (Sethuraman, 1994)). For $i = 1, 2, \dots, K - 1$, let $v_i \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha)$. Set $v_K = 1$. Let $\xi_i = v_i \prod_{j=1}^{i-1} (1 - v_j)$. Let $\psi_k \overset{\text{i.i.d.}}{\sim} H$, and $\Xi_K = \sum_{k=1}^{K} \xi_k \delta_{\psi_k}$. We denote the distribution of $\Xi_K$ as $\mathrm{TSB}_K(\alpha, H)$.

In all, the generative process of the modified HDP is
$$G \sim \mathrm{DP}(\omega, H), \qquad H_d \mid G \overset{\text{indep}}{\sim} \mathrm{TSB}_T(\alpha, G) \text{ across } d,$$
$$\beta_{dn} \mid H_d \overset{\text{indep}}{\sim} H_d(\cdot), \qquad W_{dn} \mid \beta_{dn} \overset{\text{indep}}{\sim} f(\cdot \mid \beta_{dn}) \text{ across } d, n. \tag{14}$$
Observation groups are indexed by $d$ and individual observations are indexed by $n, d$. Each group manifests at most $T$ distinct atoms of the population-level measure in the style of Example 5.1. The number of groups is $D$, and the number of observations in each group is $N$.

The finite approximation we consider replaces the population-level DP with $\mathrm{FSD}_K$, keeping the other conditionals intact:
$$G_K \sim \mathrm{FSD}_K(\omega, H), \qquad F_d \mid G_K \overset{\text{indep}}{\sim} \mathrm{TSB}_T(\alpha, G_K) \text{ across } d,$$
$$\psi_{dn} \mid F_d \overset{\text{indep}}{\sim} F_d(\cdot), \qquad Z_{dn} \mid \psi_{dn} \overset{\text{indep}}{\sim} f(\cdot \mid \psi_{dn}) \text{ across } d, n. \tag{15}$$
Let $P_{(N,D),\infty}$ be the distribution of the observations $\{W_{dn}\}$ and $P_{(N,D),K}$ the distribution of the observations $\{Z_{dn}\}$. We have the following corollary to Theorem 5.1.

Corollary 5.2 (Upper bound for the modified HDP). For some constants $C_1, C_2, C_3$ which depend only on $\omega$,
$$d_{TV}\left( P_{(N,D),\infty}, P_{(N,D),K} \right) \le \frac{C_1 + C_2 \ln(DT) + C_3 \ln^2(DT)\, \ln K}{K}.$$
The proof can be found in Appendix G.1. For fixed $K$, Corollary 5.2 is independent of $N$, the number of observations in each group, but grows like $O(\mathrm{poly}(\ln D))$ with the number of groups $D$. For fixed $D$, the approximation error decreases to zero at a rate no slower than $O(\ln K / K)$.

As Theorem 5.1 is only an upper bound, we now investigate the tightness of the inequality in terms of $N$ and $K$. We return to DP mixture models. We first look at the dependence of the error bound in terms of $\ln N$. Theorem 5.3 shows that finite approximations cannot be accurate if the approximation level is too small compared to the growth rate $\ln N$.

Theorem 5.3 ($\ln N$ is necessary). There exists a probability kernel $f(\cdot)$, independent of $K$ and $N$, such that for any $N \ge 2$, if $K \le C(N, \alpha)$, then
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge 1 - C' N^{-\alpha/2},$$
where $C'$ is a constant dependent only on $\alpha$.

The proof is given in Appendix G.2. Theorem 5.3 implies that as $N$ grows, if the approximation level $K$ fails to surpass the $C(N, \alpha)$ threshold, then the total variation between the approximate and the target model remains bounded away from zero; in fact, the error tends to one. Recall that $C(N, \alpha) = \Omega(\ln N)$, so the necessary approximation level is $\Omega(\ln N)$. Theorem 5.3 is the analog of Theorem 4.3.

We also investigate the tightness of Theorem 5.1 in terms of $K$. In Theorem 5.4, our lower bound indicates that the $1/K$ factor in Theorem 5.1 is tight (up to log factors).

Theorem 5.4 ($1/K$ lower bound). There exists a probability kernel $f(\cdot)$, independent of $K$ and $N$, such that for any $N \ge 2$,
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge \frac{C(\alpha)}{K},$$
where $C(\alpha) > 0$ is an explicit constant depending only on $\alpha$.
The proof is given in Appendix G.2. While Theorem 5.1 implies that the normalized $\mathrm{IFA}_K$ with $K = O(\mathrm{poly}(\ln N)/\epsilon)$ atoms suffices to approximate the DP mixture model to less than $\epsilon$ error, Theorem 5.4 implies that a normalized IFA with $K = \Omega(1/\epsilon)$ atoms is necessary in the worst case. This worst-case behavior is analogous to that of Theorem 4.4 for CRM-based models.

The $1/\epsilon$ dependence means that IFAs are worse than TFAs in theory. It is known that small TFA models are already excellent approximations of the DP. Example 5.1 is a very well-known finite approximation whose error is upper bounded in Proposition 5.5.

Proposition 5.5 (Ishwaran and James, 2001, Theorem 2). Let $\Xi_K \sim \mathrm{TSB}_K(\alpha, H)$, $R_n \mid \Xi_K \overset{\text{i.i.d.}}{\sim} \Xi_K$, $T_n \mid R_n \overset{\text{indep}}{\sim} f(\cdot \mid R_n)$ with $N$ observations. Let $Q_{N,K}$ be the distribution of the observations $T_{1:N}$. Then
$$d_{TV}(P_{N,\infty}, Q_{N,K}) \le N \exp\left( -\frac{K - 1}{\alpha} \right).$$

Proposition 5.5 implies that a TFA with $K = O(\ln(N/\epsilon))$ atoms suffices to approximate the DP mixture model to less than $\epsilon$ error. Modulo log factors, comparing the necessary $1/\epsilon$ level for the normalized IFA and the sufficient $\ln(1/\epsilon)$ level for the TFA, we conclude that the necessary size for the normalized IFA is exponentially larger than the sufficient size for the TFA, in the worst case.
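Both finite approximations of the DP discussed in this section take only a few lines to sample. A minimal sketch (ours), with the gamma-normalization route to the symmetric Dirichlet mirroring the construction in the text:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, K = 1.0, 50

# FSD_K(alpha, H): normalize K i.i.d. Gamma(alpha/K, 1) weights, which gives
# the symmetric Dirichlet Dir(alpha/K, ..., alpha/K).
g = rng.gamma(alpha / K, 1.0, size=K)
p_fsd = g / g.sum()  # equivalently rng.dirichlet(np.full(K, alpha / K))

# TSB_K(alpha, H): truncated stick-breaking of Example 5.1.
v = rng.beta(1.0, alpha, size=K)
v[-1] = 1.0                                        # v_K = 1 closes the stick
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
p_tsb = v * remaining                              # xi_i = v_i * prod_{j<i}(1 - v_j)

assert np.isclose(p_fsd.sum(), 1.0) and np.isclose(p_tsb.sum(), 1.0)
```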
6. Conceptual benefits of finite approximations
As part of Bayesian inference, we need to compute the posterior over the latent variables in our finite-dimensional probabilistic models (Eq. (6)). To set up notation, we denote by $\theta = (\theta_i)_{i=1}^{K}$ the collection of atom sizes, $\psi = (\psi_i)_{i=1}^{K}$ the collection of atom locations and $x = (x_{n,i})$ the trait counts of each observation.

Standard tools to explore or approximate the posterior distribution $P(\theta, \psi, x \mid \text{data})$ require easy-to-simulate Gibbs conditional distributions or tractable expectations. On the one hand, because of the discreteness of the trait counts $x$, even with the recent advances in Hamiltonian Monte Carlo (Hoffman and Gelman, 2014), successful Markov chain Monte Carlo (MCMC) algorithms have been based largely on Gibbs sampling (Geman and Geman, 1984). In particular, blocked Gibbs sampling utilizing the natural Markov blanket structure is straightforward to implement when the complete conditionals $P(\theta \mid x, \psi, \text{data})$, $P(x \mid \psi, \theta, \text{data})$ or $P(\psi \mid x, \theta, \text{data})$ are easy to simulate from. On the other hand, variational inference using a mean-field approximation and KL divergence (Wainwright and Jordan, 2008) requires analytical expectations. The variational distributions are typically chosen to match the parametric form of the complete conditionals: information about the latent variables is easily summarized, and the divergence between approximation and target is (locally) optimized using coordinate ascent updates. Such updates require expectations of the form $E_{\theta \sim q}[\ln l(x \mid \theta)]$, where $q(\theta)$ is the variational distribution over atom sizes.

Since finite approximations (IFAs/TFAs) with the same number of atoms $K$ differ only in the prior $P(\theta)$, to compare the ease of use of IFAs and TFAs, it suffices to compare the tractability of $P(\theta \mid x, \psi, \text{data})$ under the different approximations. For exponential family CRMs with $d = 0$, IFAs are highly compatible with standard inference schemes, because the Gibbs conditional $P(\theta \mid x, \psi, \text{data})$ comes from the same exponential family as the prior $\nu_K$.

Lemma 6.1 (Conditional conjugacy of IFAs). Suppose the likelihood is Eq. (1) and the IFA prior $\nu_K$ is as in Corollary 3.3. Then the complete conditional of the atom sizes factorizes across atoms:
$$P(\theta \mid x, \psi, \mathrm{data}) = \prod_{k=1}^{K} P(\theta_k \mid x_{\cdot,k}).$$
Furthermore, each $P(\theta_k \mid x_{\cdot,k})$ is in the same exponential family as the IFA prior, with density proportional to
$$\mathbb{1}\{\theta \in U\}\, \theta^{c/K + \sum_{n=1}^{N} \phi(x_{n,k}) - 1} \exp\left( \left\langle \psi + \sum_{n=1}^{N} t(x_{n,k}), \mu(\theta) \right\rangle + (\lambda + N)[-A(\theta)] \right) \mathrm{d}\theta. \tag{16}$$

The proof follows from the results in Appendix C. Lemma 6.1 implies that the derivation of simulation steps and expectation equations for IFAs of common models such as beta-Bernoulli, gamma-Poisson and beta-negative binomial is straightforward. The complete conditionals over atom sizes are easy to simulate because they are well-known exponential families (beta and gamma). Also, the expectations of $\ln l(x \mid \theta)$ when $\theta$ has the exponential family distribution (Eq. (16)) are tractable because of the exponential family algebra between log-likelihood and prior. Finally, a parallelization strategy that utilizes the factorization structure across atoms can yield user-time speed-ups, with the gains being greatest when there are many instantiated atoms.
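For instance, for the beta-Bernoulli IFA, the conditional in Lemma 6.1 is an ordinary beta distribution, so a blocked Gibbs update of all $K$ atom sizes is a single vectorized draw. A minimal sketch (ours, with synthetic counts purely for illustration):

```python
import numpy as np

def gibbs_update_theta(X, gam, alpha, rng):
    """One blocked Gibbs draw of all atom sizes for the beta-Bernoulli IFA.

    X is the (N, K) binary trait matrix x_{n,k}. Lemma 6.1 specializes to
      theta_k | x_{.,k} ~ Beta(gam*alpha/K + sum_n x_{n,k},
                               alpha + N - sum_n x_{n,k}),
    and the update factorizes across the K atoms, so it parallelizes trivially.
    """
    N, K = X.shape
    counts = X.sum(axis=0)
    return rng.beta(gam * alpha / K + counts, alpha + N - counts)

rng = np.random.default_rng(4)
X = rng.binomial(1, 0.3, size=(100, 20))  # synthetic counts, illustration only
theta = gibbs_update_theta(X, gam=2.0, alpha=1.5, rng=rng)
```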
There are many different types of TFAs, but in general the derivation of simulation steps and expectation equations is much more involved than for IFAs. While the prior $P(\theta)$ can be reasonably easy to sample from, the incorporation of trait counts leads to intractable conditionals $P(\theta \mid x)$. We consider two illustrative examples, both for exponential CRMs with $d = 0$. In Example 6.1, the complete conditional of the atom sizes is both hard to sample from and leads to analytically intractable expectations. In Example 6.2, the complete conditional of the atom sizes can be sampled from without introducing auxiliary variables, but important expectations are not analytically tractable.

Example 6.1 (Stick-breaking approximation (Broderick, Jordan and Pitman, 2012; Paisley, Carin and Blei, 2011)). The following finite approximation is a TFA for $\mathrm{BP}(\gamma, \alpha, 0)$:
$$\Theta_K = \sum_{i=1}^{K} \sum_{j=1}^{C_i} V_{i,j}^{(i)} \prod_{l=1}^{i-1} \left( 1 - V_{i,j}^{(l)} \right) \delta_{\psi_{ij}},$$
where $C_i \overset{\text{i.i.d.}}{\sim} \mathrm{Poisson}(\gamma)$, $V_{i,j}^{(l)} \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(1, \alpha)$ and $\psi_{i,j} \overset{\text{i.i.d.}}{\sim} H$. A priori, the atom sizes $V_{i,j}^{(i)} \prod_{l=1}^{i-1} (1 - V_{i,j}^{(l)})$ can be sampled (using the stick-breaking proportions $V_{i,j}^{(l)}$), but there is no tractable way to sample from or compute expectations with respect to the conditional distribution $P(\theta \mid x)$, because of the dependence on $C_i$ as well as the entangled form of each $\theta$. Strategies to make the model more tractable include introducing auxiliary round-indicator variables $r_k$ (Broderick, Jordan and Pitman, 2012; Paisley, Carin and Blei, 2011), marginalizing out the stick-breaking proportions (Broderick, Jordan and Pitman, 2012) or replacing the product $\prod_{l=1}^{i-1} (1 - V_{i,j}^{(l)})$ with a more succinct representation (Paisley, Carin and Blei, 2011). However, the final models from these attempts all contain at least one Gibbs conditional that is either difficult to sample from (Broderick, Jordan and Pitman, 2012, Equation 37) or lacks tractable expectations (Paisley, Carin and Blei, 2011, Section 3.3).

Other superposition-based approximations, like the decoupled Bondesson or power-law approximations (Campbell et al., 2019), will similarly struggle with the number-of-atoms-per-round variables $C_i$ and the entanglement among the atom sizes.

Example 6.2 (Bondesson approximation (Doshi-Velez et al., 2009; Teh, Görür and Ghahramani, 2007)). When $\alpha = 1$, the Bondesson approximation in Example 4.3 becomes
$$\Theta_K = \sum_{i=1}^{K} \left( \prod_{j=1}^{i} p_j \right) \delta_{\psi_i}, \quad \text{where } p_j \overset{\text{i.i.d.}}{\sim} \mathrm{Beta}(\gamma, 1) \text{ and } \psi_i \overset{\text{i.i.d.}}{\sim} H.$$
The atom sizes are entangled by the $p_j$'s, $\theta_i = \prod_{j=1}^{i} p_j$, but the complete conditional of the atom sizes $P(\theta \mid x)$ admits a density with respect to Lebesgue measure, and it is proportional to
$$\mathbb{1}\{0 \le \theta_K \le \theta_{K-1} \le \dots \le \theta_1 \le 1\} \prod_{j=1}^{K} \theta_j^{\gamma \mathbb{1}\{j = K\} + \sum_{n=1}^{N} x_{n,j} - 1} (1 - \theta_j)^{N - \sum_{n=1}^{N} x_{n,j}}.$$
The conditional distributions $P(\theta_i \mid \theta_{-i}, x)$ are truncated betas, so adaptive rejection sampling (Gilks and Wild, 1992) can be used as a sub-routine to sample each $P(\theta_i \mid \theta_{-i}, x)$ and then sweep over all atom sizes. However, for this exponential family, expectations of the sufficient statistics $\ln \theta_i$ and $\ln(1 - \theta_i)$ are not tractable: variational inference as conducted in Doshi-Velez et al. (2009) required additional approximations.

Other series-based approximations, like thinning or rejection sampling (Campbell et al., 2019), have even more intractable dependencies between atom sizes in both the prior and the conditional $P(\theta \mid x)$.
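For Example 6.2, each conditional $P(\theta_i \mid \theta_{-i}, x)$ is a beta density restricted to the interval between its ordered neighbors; when adaptive rejection sampling is not at hand, inverse-CDF sampling is a simple alternative. A sketch of the sub-routine (ours, assuming SciPy; the numbers in the demo call are hypothetical):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def sample_truncated_beta(a, b, lo, hi, rng):
    """Draw from Beta(a, b) restricted to [lo, hi] via inverse-CDF sampling.
    Assumes a, b > 0 so that the untruncated beta is proper."""
    u = rng.uniform(beta_dist.cdf(lo, a, b), beta_dist.cdf(hi, a, b))
    return float(beta_dist.ppf(u, a, b))

rng = np.random.default_rng(5)
# E.g., an interior atom with 3 active counts out of N = 10 observations,
# constrained between ordered neighbors theta_{i+1} = 0.1 and theta_{i-1} = 0.6;
# a Gibbs sweep applies such a draw to each atom in turn.
print(sample_truncated_beta(a=3.0, b=8.0, lo=0.1, hi=0.6, rng=rng))
```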
7. Empirical evaluation
We compare the practical performance of IFAs and TFAs on two real-data examples: an image denoising application using the beta-Bernoulli model and topic modeling using the modified HDP. Existing empirical work (e.g., Doshi-Velez et al. (2009, Tables 1, 2) and Kurihara, Welling and Teh (2007, Figure 4)) suggests two patterns: the approximations improve in performance as the number of instantiated atoms $K$ increases, and for the same $K$, normalized IFAs and TFAs have similar performance. Our experiments confirm and expand upon these previous findings.

Image denoising through dictionary learning is an application where finite approximations of a BNP model, in particular beta-Bernoulli with $d = 0$, have proven useful (Zhou et al., 2009). The goal is recovering the original noiseless image (left of Fig. 1) from a corrupted one (right of Fig. 1). To do so, the input image is deconstructed into small contiguous patches, and we postulate that each patch is a combination of underlying basis elements. By estimating the coefficients expressing the combination, possibly in addition to estimating the basis elements themselves, one can denoise the individual patches and ultimately the overall image. The beta-Bernoulli process allows simultaneous estimation of basis elements and basis assignments. The nonparametric nature sidesteps the cumbersome problem of calibrating the number of basis elements. The number of extracted patches depends on both the patch size and the dimensions of the input image: even on the same input image, the analysis might process a varying number of "observations." Better denoised images have a higher peak signal-to-noise ratio, or PSNR (Hore and Ziou, 2010), with respect to the noiseless image; the PSNR between two identical images is $\infty$.
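Concretely, PSNR is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; a minimal sketch (ours), assuming a peak value of 255 for 8-bit images:

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher is better, and identical
    images give +infinity."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```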
To compare IFA and TFA, we considered the beta process $\mathrm{BP}(\gamma, 1, 0)$, since past work suggests that the hyper-parameters $\gamma, \alpha$ do not play a large role (Zhou et al., 2009). Each configuration of the latent variables $x, \psi, \theta$ leads to a candidate denoised image. By default, a sequential Gibbs sampler traverses the posterior over latent variables (patches, i.e., observations, are gradually introduced in epochs, and the sampler only modifies the latent variables of the current epoch's observations); the final denoised image is a weighted average of the candidate images encountered during the sampler run. There is randomness in how the latent variables are initialized, as well as in the simulation of the Gibbs conditionals. The gradual data introduction employed in the Gibbs sampler can be thought of as a way to initialize the latent variables for the entire set of observations. For a 256 × 256 image like the right panel of Fig. 1, the number of extracted patches, $N$, is about 60k. More details about the finite approximations, hyper-parameter settings and inference can be found in Appendix H.1.

Fig 1: Original versus corrupted images. The number plotted on top of the noisy image is the peak signal-to-noise ratio, or PSNR, with respect to the noiseless image.

In Fig. 2, the quality of the denoised images improves with increasing $K$; furthermore, the quality is very similar across the two types of approximation. Both kinds perform much better than the baseline, i.e., the noisy input image. The improvement with $K$ is largest for small $K$ and plateaus for larger values of $K$. For a given approximation level, the quality of TFA denoising and that of IFA are almost the same. Furthermore, the denoised image from TFA is more similar to the denoised image from IFA than it is to the original image, as indicated by the large gap in PSNR. The error bars reflect randomness in both initialization and simulation of the conditionals across 5 trials.

Fig. 3 shows that the modes of the TFA posterior are centers of regions of attraction in the IFA posterior, and vice versa. For both kinds of approximation, $K = 60$. Rather than randomly initializing the latent variables at the beginning of the Gibbs sampler of one model (cold start), we can use the last configuration of latent variables visited in the other model as the initial state of the Gibbs sampler (warm start). To isolate the effect of the initial conditions, all the patches are available from the start, as opposed to being gradually introduced. For both kinds of approximation, the Gibbs sampler initialized at the warm start visits candidate images that have essentially the same PSNR as the starting configuration. The early iterates of the cold-start Gibbs sampler are noticeably lower in quality than the warm-start iterates, and the quality at the plateau is still lower than that of the warm start. Each trace of PSNR for cold-start Gibbs corresponds to a random seed in initialization and simulation of the conditionals, while each trace of warm-start PSNR corresponds to a different final state of the alternative model's training. The variation across warm starts is tiny; the variation across cold starts is larger but still very small.

Experiments on other noisy images can be found in Appendix I; the trends are the same.

Fig 2: Finite approximations have similar performance across approximation levels. For each $K$, the final denoised image is a weighted average of candidate images encountered during Gibbs sampling.

Fig 3: The output of one model is a good initialization for the training of the other one. (a) TFA training; (b) IFA training.
Finally, we compare the performance of the normalized IFA (i.e., $\mathrm{FSD}_K$) and the TFA (i.e., $\mathrm{TSB}_K$) when used in a DP-based model. In this section, we provide evidence of the same trends in the modified HDP, a more complicated model than a Dirichlet process mixture, when analyzing Wikipedia documents.

For both IFA and TFA, we use stochastic variational inference with a mean-field factorization (Hoffman et al., 2013) to approximate the posterior over the latent topics based on training documents. The training corpus is nearly one million documents from Wikipedia. There is randomness in the initial values of the variational parameters, as well as in the order in which data minibatches are processed. The quality of inferred topics is measured by the predictive log-likelihood on a set of 10k held-out documents. More details about the finite approximations, hyper-parameter settings, variational inference and the definition of test log-likelihood can be found in Appendix H.2.

In Fig. 4, the quality of the inferred topics improves as the approximation level grows; furthermore, the quality is very similar across the two types of approximation. The improvement with $K$ is largest for small $K$: the slope plateaus for large $K$. For a given approximation level, the quality of TFA topics and that of normalized IFA topics are almost the same. The error bars reflect variation across both the random initialization and the ordering of data minibatches processed by stochastic variational inference.

Fig 4: Finite approximations have similar performance across approximation levels.

In Fig. 5, the modes of the TFA posterior are centers of regions of attraction in the IFA posterior, and vice versa. The number of topics is fixed to be $K = 300$. Rather than randomly initializing the variational parameters at the start of variational inference of one model (cold start), we can use the variational parameters at the end of the other model's training as the initialization (warm start). The learning rate for warm-start training is slightly different from that for cold start, to reflect the fact that many batches of data had been processed leading up to the warm-start variational parameters. For both kinds of approximation, the test log-likelihood essentially stays the same over warm-start training iterates, hinting that such an initialization lies in a region of attraction. The early iterates of cold start are noticeably lower in quality than the warm-start iterates; however, at the end of training, the test log-likelihoods are nearly the same. Each trace of cold start corresponds to a different initialization and ordering of data batches processed. Each trace of warm start corresponds to a different output of the other model's training and a different ordering of data batches processed. The variation across either cold starts or warm starts is small.

Fig 5: The output of one model is a good initialization for the training of the other one. (a) TFA training; (b) IFA training.
8. Discussion
We have provided a general construction of independent finite approximations for completelyrandom measures, analyzed error bounds on IFAs for conjugate exponential family CRMwith no power law and the Dirichlet process, and investigated how they compare to truncatedfinite approximations in realistic data applications. Our error bounds reveal that in the worstcase, for the same number of atoms instantiated, IFA has larger error than TFA. However,we have not observed the worst case in our experiments, suggesting that either the errorbounds can be tightened for relevant conditional densities f or that additional sources oferror, such as those from approximate inference, dominate approximation error made by thefinite approximations. From a practical point of view, IFA is easier than TFA to work with.Our analyses and experiments suggest a number of directions for future work. For exam-ple, the error bound analysis could be extended for conjugate family CRM with power-lawbehavior. We speculate that in such situations, the O (ln N ) factor appearing in the numera-tor of the upper bounds will be replaced by O ( N a ) where O ( N a ) is the growth rate of BNPmodels with power law behavior. References
Acharya, A. , Ghosh, J. and
Zhou, M. (2015). Nonparametric Bayesian factor analysisfor dynamic count matrices. In
AISTATS . Adell, J. A. and
Lekuona, A. (2005). Sharp estimates in signed Poisson approximationof Poisson mixtures.
Bernoulli Aldous, D. (1985). Exchangeability and related topics. ´Ecole d’ ´Et´e de Probabilit´es de Saint-Flour XIII—1983
Alzer, H. (1997). On some inequalities for the gamma and psi functions.
Mathematics ofcomputation Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesiannonparametric problems.
The Annals of Statistics Barbour, A. D. and
Hall, P. (1984). On the rate of Poisson convergence. In
MathematicalProceedings of the Cambridge Philosophical Society . Nguyen et al./Independent finite approximations Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv.org . Blackwell, D. and
MacQueen, J. B. (1973). Ferguson Distributions Via Polya UrnSchemes.
Ann. Statist. Blei, D. M. , Griffiths, T. L. and
Jordan, M. I. (2010). The nested Chinese restaurantprocess and Bayesian nonparametric inference of topic hierarchies.
Journal of the ACM Blei, D. M. and
Jordan, M. I. (2006). Variational Inference for Dirichlet Process Mixtures.
Bayesian Analysis Bondesson, L. (1982). On simulation from infinitely divisible distributions.
Advances inApplied Probability . Brix, A. (1999). Generalized gamma measures and shot-noise Cox processes.
Advances inApplied Probability Broderick, T. , Jordan, M. I. and
Pitman, J. (2012). Beta processes, stick-breakingand power laws.
Bayesian analysis Broderick, T. , Wilson, A. C. and
Jordan, M. I. (2018). Posteriors, conjugacy, andexponential families for completely random measures.
Bernoulli Broderick, T. , Mackey, L. , Paisley, J. and
Jordan, M. I. (2015). Combinatorial Clus-tering and the Beta Negative Binomial Process.
IEEE Transactions on Pattern Analysisand Machine Intelligence Campbell, T. , Huggins, J. H. , How, J. P. and
Broderick, T. (2019). Truncatedrandom measures.
Bernoulli Carpenter, B. , Gelman, A. , Hoffman, M. D. , Lee, D. , Goodrich, B. , Betan-court, M. , Brubaker, M. , Guo, J. , Li, P. and
Riddell, A. (2017). Stan: A Proba-bilistic Programming Language.
Journal of Statistical Software . Doshi-Velez, F. , Miller, K. T. , Van Gael, J. and
Teh, Y. W. (2009). Variationalinference for the Indian buffet process. In
Artificial Intelligence and Statistics
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems.
The Annalsof Statistics
Ferguson, T. S. and
Klass, M. J. (1972). A representation of independent incrementprocesses without Gaussian components.
The Annals of Mathematical Statistics . Fox, E. B. , Sudderth, E. , Jordan, M. I. and
Willsky, A. S. (2010). A Sticky HDP-HMM with Application to Speaker Diarization.
The Annals of Applied Statistics Geman, S. and
Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and theBayesian Restoration of Images.
Pattern Analysis and Machine Intelligence, IEEE Trans-actions on Gilks, W. R. and
Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling.
Journal of the Royal Statistical Society: Series C (Applied Statistics) Gnedin, A. V. (1998). On convergence and extensions of size-biased permutations.
Journalof Applied Probability Gordon, L. (1994). A Stochastic Approach to the Gamma Function.
The American Math-ematical Monthly
Griffiths, T. L. and
Ghahramani, Z. (2005). Infinite Latent Feature models and theIndian Buffet Process. In
Advances in Neural Information Processing Systems . Griffiths, T. L. and
Ghahramani, Z. (2011). The Indian Buffet Process: An Introductionand Review.
Journal of Machine Learning Research Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models . Nguyen et al./Independent finite approximations for life history data. the Annals of Statistics Hoffman, M. , Bach, F. R. and
Blei, D. M. (2010). Online learning for latent Dirichletallocation. In
Advances in Neural Information Processing Systems
Hoffman, M. D. and
Gelman, A. (2014). The No-U-Turn sampler: adaptively settingpath lengths in Hamiltonian Monte Carlo.
Journal of Machine Learning Research Hoffman, M. D. , Blei, D. M. , Wang, C. and
Paisley, J. (2013). Stochastic variationalinference.
Journal of Machine Learning Research Hore, A. and
Ziou, D. (2010). Image quality metrics: PSNR vs. SSIM. In
Ishwaran, H. and
James, L. F. (2001). Gibbs sampling methods for stick-breaking priors.
Journal of the American Statistical Association . Ishwaran, H. and
Zarepour, M. (2002). Exact and approximate sum representations forthe Dirichlet process.
Canadian Journal of Statistics James, L. F. (2013). Stick-breaking PG( α , ζ )-Generalized Gamma Processes. arXiv.org . James, L. F. (2017). Bayesian Poisson calculus for latent feature modeling via generalizedIndian Buffet Process priors.
The Annals of Statistics Johnson, N. L. , Kemp, A. W. and
Kotz, S. (2005).
Univariate Discrete Distributions . Wiley Series in Probability and Statistics . Wiley.
Johnson, M. J. and
Willsky, A. S. (2013). Bayesian Nonparametric Hidden Semi-MarkovModels.
Journal of Machine Learning Research Kallenberg, O. (2002).
Foundations of modern probability , 2nd ed. Springer, New York.
Kingman, J. F. C. (1967). Completely random measures.
Pacific Journal of Mathematics Kingman, J. F. C. (1975). Random discrete distributions.
Journal of the Royal StatisticalSociety B Kucukelbir, A. , Ranganath, R. , Gelman, A. and
Blei, D. M. (2015). AutomaticVariational Inference in Stan. In
Advances in Neural Information Processing Systems . Kurihara, K. , Welling, M. and
Teh, Y. W. (2007). Collapsed Variational DirichletProcess Mixture Models. In
International Joint Conference on Artificial Intelligence
Last, G. and
Penrose, M. (2017).
Lectures on the Poisson Process . Institute of Mathe-matical Statistics Textbooks . Le Cam, L. (1960). An approximation theorem for the Poisson binomial distribution.
PacificJ. Math. Lee, J. , James, L. F. and
Choi, S. (2016). Finite-dimensional BFRY priors and variationalBayesian inference for power law models. In
Advances in Neural Information ProcessingSystems
Lee, J. , Miscouridou, X. and
Caron, F. (2019). A unified construction for series rep-resentations and finite approximations of completely random measures. arXiv preprintarXiv:1905.10733 . Loeve, M. (1956). Ranking Limit Problem. In
Proceedings of the Third Berkeley Symposiumon Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory
Madras, N. and
Sezer, D. (2010). Quantitative bounds for Markov chain convergence:Wasserstein and total variation distances.
Bernoulli Miller, J. W. and
Harrison, M. T. (2013). A simple example of Dirichlet process mix-ture inconsistency for the number of components. In
Advances in Neural Information . Nguyen et al./Independent finite approximations Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani andK. Q. Weinberger, eds.) 199–206. Curran Associates, Inc.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In
Handbook of Markov ChainMonte Carlo
Orbanz, P. (2010). Conjugate Projective Limits. arXiv.org . Paisley, J. , Blei, D. M. and
Jordan, M. I. (2012). Stick-breaking beta processes andthe Poisson process. In
Artificial Intelligence and Statistics
Paisley, J. and
Carin, L. (2009). Nonparametric factor analysis with beta process priors.In
Proceedings of the 26th Annual International Conference on Machine Learning . ICML’09
Paisley, J. , Carin, L. and
Blei, D. (2011). Variational inference for stick-breaking betaprocess priors. In
Proceedings of the 28th International Conference on International Con-ference on Machine Learning
Palla, K. , Knowles, D. A. and
Ghahramani, Z. (2012). An Infinite Latent AttributeModel for Network Data. In
International Conference on Machine Learning . Universityof Cambridge.
Perman, M. , Pitman, J. and
Yor, M. (1992). Size-biased sampling of Poisson pointprocesses and excursions.
Probability Theory and Related Fields Pitman, J. (1995). Exchangeable and partially exchangeable random partitions.
Probabilitytheory and related fields
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme.
LectureNotes-Monograph Series
Pollard, D. (2001).
A User’s Guide to Measure Theoretic Probability . Cambridge Uni-versity Press. Ranganath, R. , Gerrish, S. and
Blei, D. M. (2014). Black Box Variational Inference.In
International Conference on Artificial Intelligence and Statistics
Roberts, G. O. and
Tweedie, R. L. (1996). Exponential convergence of Langevin distri-butions and their discrete approximations.
Bernoulli Roychowdhury, A. and
Kulis, B. (2015). Gamma processes, stick-breaking, and varia-tional inference. In
Artificial Intelligence and Statistics
Saria, S. , Koller, D. and
Penn, A. (2010). Learning individual and population leveltraits from clinical temporal data Technical Report.
Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors.
Statistica Sinica Teh, Y. W. , G¨or¨ur, D. and
Ghahramani, Z. (2007). Stick-breaking Construction for theIndian Buffet Process. In
International Conference on Artificial Intelligence and Statis-tics . Teh, Y. W. and
G¨or¨ur, D. (2009). Indian buffet processes with power-law behavior. In
Advances in Neural Information Processing Systems . Teh, Y. W. , Jordan, M. I. , Beal, M. J. and
Blei, D. M. (2006). Hierarchical DirichletProcesses.
Journal of the American Statistical Association
Thibaux, R. and
Jordan, M. I. (2007). Hierarchical Beta Processes and the Indian BuffetProcess. In
International Conference on Artificial Intelligence and Statistics . Titsias, M. (2008). The infinite gamma-Poisson feature model. In
Advances in NeuralInformation Processing Systems . Wainwright, M. J. and
Jordan, M. I. (2008). Graphical Models, Exponential Families,and Variational Inference.
Foundations and Trends R (cid:13) in Machine Learning Wang, C. , Paisley, J. and
Blei, D. (2011). Online variational inference for the hier- . Nguyen et al./Independent finite approximations archical Dirichlet process. In Proceedings of the Fourteenth International Conference onArtificial Intelligence and Statistics
Zhou, M. , Chen, H. , Ren, L. , Sapiro, G. , Carin, L. and
Paisley, J. W. (2009). Non-parametric Bayesian dictionary learning for sparse image representations. In
Advances inNeural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty,C. K. I. Williams and A. Culotta, eds.) 2295–2303. Curran Associates, Inc.
Zhou, M. , Hannah, L. , Dunson, D. and
Carin, L. (2012). Beta-negative binomial pro-cess and Poisson factor analysis. In
Artificial Intelligence and Statistics
Appendix A: Additional examples of IFA construction
Let B ( α, β ) = Γ( α )Γ( β )Γ( α + β ) denote the beta function. Example A.1 (Beta process) . Taking E = R + , g ( θ ) = 1, h ( θ ; η ) = (1 − θ ) η − [ θ ≤ Z ( ξ, η ) = B ( ξ, η ) in Theorem 3.2 yields the beta process BP( γ, η − d, d ), which has ratemeasure ν (d θ ) = γ [ θ ≤ B ( η, − d ) θ − − d (1 − θ ) η − d θ. Since h is continuous and bounded on [0 , / Example A.2 (Beta prime process) . Taking E = R + , g ( θ ) = (1 + θ ) − , h ( θ ; η ) = (1 + θ ) − η ,and Z ( ξ, η ) = B ( ξ, η ) in Theorem 3.2 yields the beta prime process, which has rate measure ν (d θ ) = γB ( η, − d ) θ − − d (1 + θ ) − d − η d θ. Since g is continuous, g (0) = 1, 1 ≤ g ( θ ) ≤ θ , and h ( θ ; η ) is continuous and bounded on[0 , d = 0, c = γη and ν n ( θ ) = Beta (cid:48) ( θ ; γη/n, η ) . Example A.3 (Gamma process) . Taking E = R + , g ( θ ) = 1, h ( θ ; η ) = e − ηθ , and Z ( ξ, η ) =Γ( ξ ) η − ξ in Theorem 3.2 yields the gamma process, with rate measure ν (d θ ) = γ λ − d Γ(1 − d ) θ − d − e − λθ d θ. Since h ( θ ; η ) is continuous and bounded on [0 , Example A.4 (Generalized gamma process) . Taking E = R , g ( θ ) = 1, h ( θ ; η ) = e − ( η θ ) η ,and Z ( ξ, η ) = Γ( ξ/η )( η η ) − ξ in Theorem 3.2 yields the generalized gamma distribution Gam ( ξ, η , η ). The corresponding rate measure is ν (d θ ) = γ ( η η ) − d Γ((1 − d ) /η ) θ − d − e − ( η θ ) η d θ, which is the rate measure for the gamma process ΓP( γ, η, d ). Since h ( θ ; η ) is continuous andbounded on [0 , d = 0, c = γη η Γ( η − ) and ν n ( θ ) = Gam (cid:18) θ ; γη η n Γ( η − ) , η , η (cid:19) . https://en.wikipedia.org/wiki/Generalized_gamma_distribution . Nguyen et al./Independent finite approximations Appendix B: Proof of IFA convergence
B.1. IFA converges to CRM in distribution
In order to prove our main result, we require a few auxiliary results.
Lemma B.1 ((Kallenberg, 2002, Lemmas 12.1 and 12.2)) . Let Θ be a random measure and Θ , Θ , . . . a sequence of random measures. If for all measurable sets A and t > , lim K →∞ E [ e − t Θ K ( A ) ] = E [ e − t Θ( A ) ] , then Θ K D = ⇒ Θ . For a density f , let µ ( t, f ) : θ (cid:55)→ (1 − e − tθ ) f ( θ ). In results that follow we assume allmeasures on R + have densities with respect to Lebesgue measure. We abuse notation anduse the same symbol to denote the measure and the density. Proposition B.2.
Let Θ ∼ CRM(
H, ν ) and for K = 1 , , . . . , let Θ K ∼ IFA K ( H, ν K ) where ν is a measure and ν , ν , . . . are probability measures on R + , all absolutely continuous withrespect to Lebesgue measure. If (cid:107) µ (1 , nν K ) − µ (1 , ν ) (cid:107) → , then Θ K D = ⇒ Θ .Proof. Let t > A a measurable set. First, recall that the Laplace functional of theCRM Θ is E [ e − t Θ( A ) ] = exp (cid:26) − H ( A ) (cid:90) ∞ µ ( t, ν )( θ ) d θ (cid:27) . We have E [ e − tθ K, ( ψ K, ∈ A ) ] = P ( ψ K, ∈ A ) E [ e − tθ K, ] + P ( ψ K, / ∈ A )= H ( A ) E [ e − tθ K, ] + 1 − H ( A )= 1 − H ( A )(1 − E [ e − tθ K, ])= 1 − H ( A ) K (cid:90) ∞ µ ( t, Kν K )( θ ) d θ. Since | − e − tθ || − e − θ | ≤ max(1 , t ), it follows by hypothesis that (cid:107) µ ( t, Kν K ) − µ ( t, ν ) (cid:107) →
0. Thus,by dominated convergence and the standard exponential limit,lim K →∞ E [ e − tθ K, ( ψ K, ∈ A ) ] K = lim K →∞ (cid:18) − H ( A ) K (cid:90) ∞ µ ( t, Kν K )( θ ) d θ (cid:19) K = exp (cid:26) − lim K →∞ H ( A ) (cid:90) ∞ µ ( t, Kν K )( θ ) d θ (cid:27) = exp (cid:26) − H ( A ) (cid:90) ∞ µ ( t, ν )( θ ) d θ (cid:27) . Finally, by the independence of the random variables { θ K,i } Ki =1 ,lim K →∞ E [ e − t Θ K ( A ) ] = lim K →∞ E [ e − tθ K, ( ψ K, ∈ A ) ] K , so result follows from Lemma B.1. . Nguyen et al./Independent finite approximations Lemma B.3.
If there exist measures π ( θ ) d θ and π (cid:48) ( θ ) d θ on R + such that for some κ > ,1. the measures µ, µ , µ , . . . have densities f, f , f , . . . wrt π and densities f (cid:48) , f (cid:48) , f (cid:48) , . . . wrt π (cid:48) ,2. (cid:82) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ → ,3. sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | → ,4. sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) ≤ c (cid:48) < ∞ , and5. (cid:82) ∞ κ π ( θ ) d θ ≤ c < ∞ ,then (cid:107) µ − µ K (cid:107) → . Proof.
We have, using the assumptions and H¨older’s inequality, (cid:107) µ − µ K (cid:107) = (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | π (cid:48) (d θ ) + (cid:90) ∞ κ | f ( θ ) − f K ( θ ) | π (d θ ) ≤ (cid:32) sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) (cid:33) (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ + (cid:32) sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | (cid:33) (cid:90) ∞ κ π (d θ ) ≤ c (cid:48) (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ + c sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | . The conclusion follows by dominated convergence.
Proof of Theorem 3.2.
Note that since h is continuous and bounded on [0 , (cid:15) ], c < ∞ . We willapply Lemma B.3 with κ as given in the theorem statement, µ = µ (1 , ν ), µ K = µ (1 , nν K ), π ( θ ) = p ( θ ; 1 − d, η ) = θ − d g ( θ ) − d h ( θ ; η ) Z (1 − d, η ) , and π (cid:48) ( θ ) := ( θg ( θ )) d π ( θ ). Thus, f ( θ ) = γ (1 − e − θ )( θg ( θ )) − , f K ( θ ) = nZ − K (1 − e − θ ) θ − cK − + d − dS bK ( θ − aK − ) g ( θ ) − cK − , and f (cid:48) ( θ ) = ( θg ( θ )) − d f ( θ ), and f (cid:48) K ( θ ) = ( θg ( θ )) − d f K ( θ ).We now note a few useful properties that we will use repeatedly in the proof. Observethat ( a/K ) cK − = 1 + o (1). The assumption that h is bounded and continuous implies thaton [0 , a/K ], h ( θ ; η ) = h (0; η ) + o (1). Similarly, for any δ > g ( θ ) is bounded and continuousfor θ ∈ [0 , δ ] and therefore, together with the fact that g (0) = 1, we can conclude that on[0 , a/K ], g ( θ ) = 1 + o (1).For the remainder of the proof we will consider K large enough that aK − + 2 b K and cK − are less than κ . The normalizing constant Z K can be written as Z K = (cid:90) a/K ( θg ( θ )) − cK − π (cid:48) (d θ )+ (cid:90) κa/K θ − cK − − dS bK ( θ − aK − ) g ( θ ) − cK − π (cid:48) (d θ )+ (cid:90) ∞ κ ( θg ( θ )) − cK − − d π (cid:48) (d θ ) . . Nguyen et al./Independent finite approximations We rewrite each term in turn. For the first term, (cid:90) a/K θ − cK − g ( θ ) − cK − π (cid:48) (d θ ) = ( c/γ + o (1)) (cid:90) a/K θ − cK − d θ = ( c/γ + o (1)) Kc (cid:16) aK (cid:17) cK − = Kγ + o ( K ) . Since κ ≤ S b K ∈ [0 , θ ∈ [ a/K, κ ], θ − dS bK ( θ − aK − ) ≤ θ − d . Since g (0) = 1, c ∗ ≤ g ( θ ) − cK − ≤ c − c ∗ . Hence the second term is upper bounded by c − c ∗ (cid:90) κa/K θ − cK − − d π (cid:48) (d θ ) ≤ c − ∗ ( c/γ + O (1)) K d a d Kc ( κ cK − − ( a/K ) cK − )= O ( K d ) × O (log K )= o ( K ) . For the third term, (cid:90) ∞ κ ( θg ( θ )) − cK − − d π (cid:48) (d θ ) = (cid:90) ∞ κ ( θg ( θ )) − cK − π (d θ ) ≤ ( κc ∗ ) − cK − (cid:90) ∞ κ π (d θ ) ≤ ( κc ∗ ) − . Hence, Z K = Kγ + o ( K ) and KZ − K = γ (1 + e K ), where e K = o (1).Next, we have sup θ ∈ [ κ, ∞ ) | f ( θ ) − f K ( θ ) | = sup θ ∈ [ κ, ∞ ) (1 − e − θ )( θg ( θ )) − | γ − KZ − K ( θg ( θ )) cK − |≤ sup θ ∈ [ κ, ∞ ) γ ( θg ( θ )) − | − (1 + e K )( θg ( θ )) cK − |≤ γ sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − | − ( θg ( θ )) cK − | + γe K sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − cK − . (B.1)To bound the two terms we will use the fact that if θ ≥ κ , then θg ( θ ) ≥ θc ∗ (1 + θ ) ≥ κc ∗ (1 + κ ) =: ˜ κ and if θ ≤ θg ( θ ) ≤ c ∗ ≤
1. Hence, letting ψ := θg ( θ ), for the first term in Eq. (B.1) . Nguyen et al./Independent finite approximations we have γ sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − | − ( θg ( θ )) cK − |≤ γ sup ψ ∈ [˜ κ, ∞ ) ψ − | − ψ cK − |≤ γ sup ψ ∈ [˜ κ, ψ − | − ψ cK − | + γ sup ψ ∈ [1 , ∞ ) ψ − | − ψ cK − |≤ γ ˜ κ − sup ψ ∈ [˜ κ, | − ψ cK − | + γ (cid:18) K − cK (cid:19) Kc − (cid:12)(cid:12)(cid:12)(cid:12) − KK − c (cid:12)(cid:12)(cid:12)(cid:12) ≤ γ ˜ κ − (1 − ˜ κ cK − ) + O (1) × cK − c = γ ˜ κ − × o (1) + O ( K − ) → . Similarly, for the second term in Eq. (B.1) we have γe K sup θ ∈ [ κ, ∞ ) ( θg ( θ )) − cK − ≤ γe K sup ψ ∈ [˜ κ, ∞ ) ψ − cK − ≤ γ ˜ κ − e K → . Since g ( θ ) is bounded on [0 , κ ], g ( θ ) cK − = 1+ o (1) and therefore (1+ e K ) g ( θ ) cK − = 1+ e (cid:48) K ,where e (cid:48) K = o (1). Using this observation together with the bound (1 − e − θ ) θ − ≤
1, we have (cid:90) κ | f (cid:48) ( θ ) − f (cid:48) K ( θ ) | d θ = (cid:90) κ ( θg ( θ )) − d | f ( θ ) − f K ( θ ) | d θ = (cid:90) κ (1 − e − θ )( θg ( θ )) − − d | γ − KZ − K θ cK − + d − dS bK ( θ − aK − ) g ( θ ) cK − | d θ ≤ γ [ c ∗ (1 + κ )] d (cid:90) κ θ − d | − (1 + e (cid:48) K ) θ cK − + d − dS bK ( θ − aK − ) | d θ ≤ γ (cid:90) κ θ − d | − θ cK − + d − dS bK ( θ − aK − ) | d θ + γe (cid:48) K (cid:90) κ θ cK − + d − dS bK ( θ − aK − ) d θ. (B.2)We bound the first integral in Eq. (B.2) in four parts: from 0 to aK − , from aK − to aK − + b K , from aK − + b K to κ − b K , and from κ − b K to κ . The first part is equal to (cid:90) aK − θ − d | − θ d + cK − | d θ ≤ (cid:90) aK − θ − d + θ cK − d θ = θ − d − d + Kc + K θ cK − (cid:12)(cid:12)(cid:12)(cid:12) aK − = 11 − d ( aK − ) − d + Kc + K ( aK − ) cK − → . . Nguyen et al./Independent finite approximations The second part is equal to (cid:90) aK − + b K aK − θ − d | − θ cK − + d − dS bK ( θ − aK − ) | d θ ≤ (cid:90) aK − + b K aK − θ − d + θ cK − − d d θ ≤ (cid:90) aK − + b K aK − θ − d d θ = 21 − d θ − d (cid:12)(cid:12)(cid:12)(cid:12) aK − + b K aK − = 21 − d (cid:0) ( aK − + b K ) − d − ( aK − ) − d (cid:1) → . The third part is equal to (cid:90) κ − b K aK − + b K θ − d | − θ cK − | d θ = (cid:90) κ − b K aK − + b K θ − d − θ cK − − d d θ = 11 − d θ − d − Kc + K (1 − d ) θ − d + cK − (cid:12)(cid:12)(cid:12)(cid:12) κ − b K aK − + b K = ( κ − b K ) − d − d − Kc + K (1 − d ) ( κ − b K ) − d + cK − − ( aK − + b K ) − d − d + Kc + K ( aK − + b K ) − d + cK − → . The fourth part is equal to (cid:90) κκ − b K θ − d | − θ cK − | d θ ≤ (cid:90) κκ − b K θ − d + θ cK − − d d θ → γe (cid:48) K (cid:90) κ θ cK − − dS bK ( θ − aK − ) d θ ≤ γe (cid:48) K (cid:90) κ θ − d d θ = γe (cid:48) K κ − d − d = o ( K ) . Since sup θ ∈ [0 ,κ ] π (cid:48) ( θ ) < ∞ by the boundedness of g and h and π is a probability densityby construction, conclude using Lemma B.3 that (cid:107) µ − µ K (cid:107) →
0. It then follows fromLemma B.1 that Θ K D = ⇒ Θ. B.2. Normalized IFA EPPF converges to NCRM EPPF
Proof of Theorem 3.4.
First, we show that the total mass of IFA converges in distributionto the total mass of CRM. Through Appendix B.1, we have shown that for all measurablesets A and t >
0, the Laplace functionals converge:lim K →∞ E [ e − t Θ K ( A ) ] = E [ e − t Θ( A ) ] , . Nguyen et al./Independent finite approximations By choosing A = Ψ i.e. the ground space, we have that Θ K (Ψ) is the total mass of IFA andΘ(Ψ) is the total mass of CRMΘ K (Ψ) = K (cid:88) i =1 θ K,i , Θ(Ψ) = ∞ (cid:88) i =1 θ i . Since for any t >
0, the Laplace transform of Θ K (Ψ) converges to that of Θ(Ψ), we concludethat Θ K (Ψ) converges to Θ(Ψ) in distribution (Kallenberg, 2002, Theorem 5.3): K (cid:88) i =1 θ K,i D = ⇒ Θ(Ψ) . (B.3)Second, we show that the decreasing order statistics of IFA atom sizes converges (infinite-dimensional distributions i.e., in f.d.d) to the decreasing order statistics of CRM atomsizes. For each K , the decreasing order statistics of IFA atoms is denoted by { θ K, ( i ) } Ki =1 : θ K, (1) ≥ θ K, (2) ≥ · · · ≥ θ K, ( K ) . We will leverage (Loeve, 1956, Theorem 4 and page 191) to find the limiting distribution { θ K, ( i ) } Ki =1 as K → ∞ . It is easy to verify the conditions to use the theorem: because thesums (cid:80) Ki =1 θ K,i converge in distribution to a limit, we know that all the θ K,i ’s are uniformlyasymptotically negligible (Kallenberg, 2002, Lemma 15.13). Now, we discuss what the limitsare. It is well-known that Θ(Ψ) is an infinitely divisible positive random variable with nodrift component and Levy measure exactly ν ( dθ ) Perman, Pitman and Yor (1992). In theterminology of (Loeve, 1956, Equation 2), the characteristics of Θ(Ψ) are a = b = 0 (nodrift or Gaussian parts), L ( x M ( x ) := − ν ([ x, ∞ )) . Let I be a counting process in reverse over (0 , ∞ ) defined based on the Poisson pointprocess { θ i } ∞ i =1 in the following way. For any x , I ( x ) is the number of points θ i exceedingthe threshold x : I ( x ) := |{ i : θ i ≥ x }| . We augment I (0) = ∞ and I ( ∞ ) = 0. As a stochastic process, I has independent increments,in that for all 0 = t < t < · · · < t k , the increments I ( t i ) − I ( t i − ) are independent,furthermore the law of the increments is I ( t i − ) − I ( t i ) ∼ Poisson( M ( t i ) − M ( t i − )). Theseproperties are simple consequences of the counting measure induced by the Poisson pointprocess. According to (Loeve, 1956, Page 191), the limiting distribution of { θ K, ( i ) } Ki =1 isgoverned by I , in the sense that for any fixed t ∈ N , for any x , x , . . . , x t ∈ [0 , ∞ ):lim K →∞ P ( θ K, (1) < x , θ K, (2) < x , . . . , θ K, ( t ) < x t )= P ( I ( x ) < , I ( x ) < , . . . , I ( x t ) < t ) . (B.4)Because the θ i ’s induce I , we can relate the left hand side to the order statistics of thePoisson point process. We denote the decreasing order statistic of the { θ i } ∞ i =1 as: θ (1) ≥ θ (2) ≥ · · · ≥ θ ( n ) ≥ · · · . Nguyen et al./Independent finite approximations Clearly, for any t ∈ N , the event that I ( x ) exceeds t is the same as the top t jumps amongthe { θ i } ∞ i =1 exceed x: I ( x ) ≥ t ⇐⇒ θ ( t ) ≥ x . Therefore Eq. (B.4) can be rewritten as, forany fixed t ∈ N , for any x , x , . . . , x t ∈ [0 , ∞ ):lim K →∞ P ( θ K, (1) < x , θ K, (2) < x , . . . , θ K, ( t ) < x t ) = P ( θ (1) < x , θ (2) < x , . . . , θ ( t ) < x t )(B.5)It is well-known that convergence of the distribution function imply weak convergence– for instance, see Problem 1 of https://link.springer.com/content/pdf/10.1007/978-1-4612-5254-2_3.pdf . Actually, from (Loeve, 1956, Theorem 5 and page 194), forany fixed t ∈ N , the convergence in distribution of { θ K, ( i ) } ti =1 to { θ i } ti =1 holds jointly withthe convergence of (cid:80) Ki =1 θ K, ( i ) to (cid:80) ∞ i =1 θ i : the two conditions of the theorem, which arecontinuity of the distribution function of each θ K,i and M (0) = −∞ (there is a typo inLoeve (1956)), are easily verified. 
Therefore, by continuous mapping theorem, if we definethe normalized atom sizes: p K, ( s ) := θ K, ( s ) (cid:80) Ki =1 θ K,i p ( s ) := θ ( s ) (cid:80) ∞ i =1 θ i we also have that the normalized decreasing order statistics converge:( p K,i ) Ki =1 f.d.d. → ( p K, ( i ) ) ∞ i =1 Finally we show that the EPPFs converge. In addition, if we define the size-biased per-mutation (in the sense of (Gnedin, 1998, Section 2) ) of the normalized atom sizes: { (cid:101) p K,i } ∼
SBP( p K, ( s ) ) { (cid:101) p i } ∼ SBP( p ( s ) )then by (Gnedin, 1998, Theorem 1), the finite-dimensional distributions of the size-biasedpermutation also converges: ( (cid:101) p K,i ) Ki =1 f.d.d. → ( (cid:101) p i ) ∞ i =1 (B.6)From here, we fix the number of samples N , the number of components t and the size ofthe clusters n i . (Pitman, 1996, Equation 45) gives the EPPF of Ξ = Θ / Θ(Ψ): p ( n , n , . . . , n t ) = E t (cid:89) i =1 (cid:101) p n i − i t − (cid:89) i =1 − i (cid:88) j =1 (cid:101) p j , Likewise, the EPPF of Ξ K = Θ K / Θ K (Ψ) is: p K ( n , n , . . . , n t ) = E t (cid:89) i =1 (cid:101) p n i − K,i t − (cid:89) i =1 − i (cid:88) j =1 (cid:101) p K,j
Since t is fixed, and each p j is [0 ,
1] valued, the mapping from the t -dimensional vector p tothe product (cid:81) ti =1 p n i − i (cid:81) t − i =1 (cid:16) − (cid:80) ij =1 p j (cid:17) is continuous and bounded. The choice of N , t , n i have been fixed but arbitrary. Hence, the convergence in finite-dimensional distributionsof in Eq. (B.6) imply that the EPPFs converge. . Nguyen et al./Independent finite approximations Appendix C: Marginal processes of exponential CRMs
The marginal process characterization describes the probabilistic model not through thetwo-stage sampling Θ ∼ CRM(
H, ν ) and X n | Θ iid ∼ LP( l ; Θ), but through the conditionaldistributions X n | X n − , X n − , . . . , X i.e. the underlying Θ has been marginalized out . Thisperspective removes the need to infer a countably infinite set of target variables. In addition,the exchangeability between X , X , . . . , X N i.e. the joint distribution’s invariance with re-spect to ordering of observations Aldous (1985), often enables the development of inferencealgorithms, namely Gibbs samplers.(Broderick, Wilson and Jordan, 2018, Corollary 6.2) derives the conditional distributions X n | X n − , X n − , . . . , X for general exponential family CRMs Eqs. (1) and (2). Proposition C.1 (Target’s marginal process (Broderick, Wilson and Jordan, 2018, Corol-lary 6.2)) . For any n , X n | X n − , . . . , X is a random measure with finite support.1. Let { ζ i } K n − i =1 be the union of atom locations in X , X , . . . , X n − . For ≤ m ≤ n − , let x m,j be the atom size of X m at atom location ζ j . Denote x n,i to be the atom size of X n at atom location ζ i . The x n,i ’s are independent across i and the p.m.f. of x n,i at x is: κ ( x ) S (cid:18) − (cid:80) n − m =1 φ ( x m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( x m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) − (cid:80) n − m =1 φ ( x m,i ) , η + (cid:18)(cid:80) n − m =1 t ( x m,i ) n − (cid:19)(cid:19) .
2. For each x ∈ N , X n has p n,x atoms whose atom size is exactly x . The locations ofeach atom are iid H : as H is diffuse, they are disjoint from the existing union of atoms { ζ i } K n − i =1 . p n,x is Poisson-distributed, independently across x , with mean: γ (cid:48) κ (0) n − κ ( x ) S (cid:18) c/K − n − φ (0) + φ ( x ) , η + (cid:18) ( n − t (0) + t ( x )) n (cid:19)(cid:19) . In Proposition C.2, we state a similar characterization of Z n | Z n − , Z n − , . . . , Z for finite-dimensional model Eq. (6) and give the proof. Proposition C.2 (Approximation’s marginal process) . For any n , Z n | Z n − , . . . , Z is arandom measure with finite support.1. Let { ζ i } K n − i =1 be the union of atom locations in Z , Z , . . . , Z n − . For ≤ m ≤ n − , let z m,j be the atom size of Z m at atom location ζ j . Denote z n,i to be the atom size of Z n at atom location ζ i . z n,i ’s are independently across i and the p.m.f. of z n,i at x is: κ ( x ) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . K − K n − atom locations are generated iid from H . Z n has p n,x atoms whose size isexactly x (for x ∈ N ∪ { } ) over these K − K n − atom locations (the p n, atoms whoseatom size is can be interpreted as not present in Z n ). The joint distribution of p n,x is . Nguyen et al./Independent finite approximations a Multinomial with K − K n − trials, with success of type x having probability: κ ( x ) S (cid:18) c/K − n − φ (0) + φ ( x ) , η + (cid:18) ( n − t (0) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − n − φ (0) , η + (cid:18) ( n − t (0) n − (cid:19)(cid:19) . Proof of Proposition C.2.
We only need to prove the conditional distributions for the atomsizes: that the K distinct atom locations are generated iid from the base measure is clear.First we consider n = 1. By construction Corollary 3.3, a priori, the trait frequencies { θ i } Ki =1 are independent, each following the distribution: P ( θ i ∈ dθ ) = { θ ∈ U } S ( c/K − , η ) θ c/K − exp (cid:18) (cid:104) η, (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) . Conditioned on { θ i } Ki =1 , the atom sizes z ,i that Z puts on the i th atom location areindependent across i and each is distributed as: P ( z ,i = x | θ i ) = κ ( x ) θ φ ( x ) exp ( (cid:104) µ ( θ i ) , t ( x ) (cid:105) − A ( θ i )) . Integrating out θ i , the marginal distribution for z ,i is: P ( z ,i = x ) = (cid:90) P ( z ,i = x | θ i = θ ) P ( θ i ∈ dθ )= κ ( x ) S ( c/K − , η ) (cid:90) U θ c/K − φ ( x ) exp (cid:18) (cid:104) η + (cid:18) t ( x )1 (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθ = κ ( x ) S (cid:18) c/K − φ ( x ) , η + (cid:18) t ( x )1 (cid:19)(cid:19) S ( c/K − , η ) , by definition of S as the normalizer Eq. (3).Now we consider n ≥
2. The distribution of z n,i only depends on the distribution of z n − ,i , z n − ,i , . . . , z ,i since the atom sizes across different atoms are independent of eachother both a priori and a posteriori. The predictive distribution is an integral: P ( z n,i = x | z n − ,i ) = (cid:90) P ( z n,i = x | θ i ) P ( θ i ∈ dθ | z n − ,i ) . Because the prior over θ i is conjugate for the likelihood z i,j | θ i , and the observations z i,j are conditionally indepndent given θ i , the posterior P ( θ i ∈ dθ | z n − ,i ) is in the sameexponential family but with different natural parameters: { θ ∈ U } θ c/K − (cid:80) n − m =1 φ ( z m,i ) exp (cid:18) (cid:104) η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθS (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . . Nguyen et al./Independent finite approximations This means that the predictive distribution P ( z n,i = x | z n − ,i ) equals: κ ( x ) (cid:82) U θ c/K − (cid:80) n − m =1 φ ( z m,i )+ φ ( x ) exp (cid:18) (cid:104) η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19) , (cid:18) µ ( θ ) − A ( θ ) (cid:19) (cid:105) (cid:19) dθS (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) = κ ( x ) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) + φ ( x ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) + t ( x ) n (cid:19)(cid:19) S (cid:18) c/K − (cid:80) n − m =1 φ ( z m,i ) , η + (cid:18)(cid:80) n − m =1 t ( z m,i ) n − (cid:19)(cid:19) . The predictive distribution P ( z n,i = x | z n − ,i ) govern both the distribution of atom sizesfor known atom locations and new atom locations. Appendix D: Technical lemmas
D.1. Concentration
Lemma D.1 (Modified upper tail Chernoff bound) . Let X = (cid:80) ni =1 X i , where X i = 1 withprobability p i and X i = 0 with probability − p i , and all X i are independent. Let µ be anupper bound on E ( X ) = (cid:80) ni =1 p i . Then for all δ > : P ( X ≥ (1 + δ ) µ ) ≤ exp (cid:18) − δ δ µ (cid:19) . Proof of Lemma D.1.
The proof relies on the regular upper tail Chernoff bound http://math.mit.edu/~goemans/18310S15/chernoff-notes.pdf and an argument using stochas-tic domination. Truly, we pad the first n Poisson trials that define X with additional trials X n +1 , X n +2 , . . . , X n + m where m is the smallest natural number such that µ − E [ X ] m ≤ X n + i is a Bernoulli with probability µ − E [ X ] m , and the trials are independent. Then Y = X + (cid:80) mj =1 X n + j is itself the sum of Poisson trials with mean exactly µ , so the regularChernoff bound applies: P ( Y ≥ (1 + δ ) µ ) ≤ exp (cid:18) − δ δ µ (cid:19) . However by construction, X is stochastically dominated by Y , so the tail probabilities of X is bounded by the tail probabilities of Y . Lemma D.2 (Lower tail Chernoff bound) . Let X = (cid:80) ni =1 X i , where X i = 1 with probability p i and X i = 0 with probability − p i , and all X i are independent. Let µ := E ( X ) = (cid:80) ni =1 p i .Then for all δ ∈ (0 , : P ( X ≤ (1 − δ ) µ ) ≤ exp( − µδ / . Lemma D.3 (Tail bounds for Poisson distribution) . If X ∼ Poisson( λ ) then for any x > : P ( X ≥ λ + x ) ≤ exp (cid:18) − x λ + x ) (cid:19) , and for any < x < λ : P ( X ≤ λ − x ) ≤ exp (cid:18) − x λ (cid:19) . . Nguyen et al./Independent finite approximations Proof of Lemma D.3.
For x ≥ −
1, let ψ ( x ) := 2((1 + x ) ln(1 + x ) − x ) /x .We first inspect the upper tail bound. If X ∼ Poisson( λ ), for any x >
0, (Pollard, 2001,Exercise 3 p.272) implies that: P ( Z ≥ λ + x ) ≤ exp (cid:18) − x λ ψ (cid:16) xλ (cid:17)(cid:19) . To show the upper tail bound, it suffices to prove that x λ ψ (cid:0) xλ (cid:1) is greater than x λ + x ) . Ingeneral, we show that for u ≥
0: ( u + 1) ψ ( u ) − ≥ . (D.1)The denominator of ( u + 1) ψ ( u ) − u +1) ψ ( u ) −
1, which is g ( u ) := 2(( u + 1) ln( u + 1) − u ( u + 1) − u . Its 1st and 2nd derivativesare: g (cid:48) ( u ) = 4( u + 1) ln( u + 1) − u + 1 g (cid:48)(cid:48) ( u ) = 4 ln( u + 1) + 2 . Since g (cid:48)(cid:48) ( u ) ≥ g (cid:48) ( u ) is monotone increasing. Since g (cid:48) (0) = 1, g (cid:48) ( u ) > u ≥
0, hence g ( u ) is monotone increasing. Because g (0) = 0, we conclude that g ( u ) ≥ u > u = x/λ : ψ (cid:16) xλ (cid:17) ≥
11 + xλ = λx + λ , which shows x λ ψ (cid:0) xλ (cid:1) ≥ x λ + x ) .Now we inspect the lower tail bound. We follow the proof of . We first argue that: P ( X ≤ λ − x ) ≤ exp (cid:18) − x λ ψ (cid:16) − xλ (cid:17)(cid:19) . (D.2)For any θ , the moment generating function E [exp( θX )] is well-defined and well-known: E [exp( θX )] := exp( λ (exp( θ ) − . Therefore: P ( X ≤ λ − x ) ≤ P (exp( θX ) ≤ exp( θ ( λ − x )) ≤ P (exp( θ ( λ − x − X )) ≥ ≤ exp( θ ( λ − x )) E [exp( − θX )] , where we have used Markov’s inequality. We now aim to minimize exp( θ ( λ − x )) E [exp( − θX )]as a function of θ . Its logarithm is: λ (exp( − θ ) −
1) + θ ( λ − x ) . This is a convex function, whose derivative vanishes at θ = − ln (cid:0) − xλ (cid:1) . Overall this meansthe best upper bound on P ( X ≤ λ − x ) is:exp (cid:16) − λ (cid:16) xλ + (1 − xλ ) ln(1 − xλ ) (cid:17)(cid:17) , . Nguyen et al./Independent finite approximations which is exactly the right hand side of Eq. (D.2). Hence to demonstrate the lower tail bound,it suffices to show that: ψ (cid:16) − xλ (cid:17) ≥ . More generally, we show that for − ≤ u ≤ ψ ( u ) − ≥
0. Consider the numerator of ψ ( u ) −
1, which is h ( u ) := 2((1 + u ) ln(1 + u ) − u ) − u . The first two derivatives are: h (cid:48) ( u ) = 2(1 + ln(1 + u )) − uh (cid:48)(cid:48) ( u ) = 21 + u − h (cid:48)(cid:48) ( u ) ≥ h ( u ) is convex on [ − , h (0) = 0. Also, by simple continuityargument, h ( −
1) = 2. Therefore, h is non-negative on [0 , ψ ( u ) ≥ Lemma D.4 (Multinomial-Poisson approximation) . Let { p i } ∞ i =1 , p i ≥ , (cid:80) ∞ i =1 p i < .Suppose there are n independent trials: in each trial, success of type i has probability p i . Let X = { X i } ∞ i =1 be the number of type i successes after n trial. Let Y = { Y i } ∞ i =1 be independentPoisson random variables, where Y i has mean np i . Then: d T V ( X, Y ) ≤ n (cid:32) ∞ (cid:88) i =1 p i (cid:33) . Proof of Lemma D.4.
First we remark that both X and Y can be sampled in two-steps. • Regarding X , first sample N ∼ Binom ( n, (cid:80) ∞ i =1 p i ). Then, for each 1 ≤ k (cid:54) = N ,sample Z k where P ( Z k = i ) = p i (cid:80) ∞ j =1 p j . Then, X i = (cid:80) N k =1 { Z k = i } for each i . • Regarding Y , first sample N ∼ Poisson ( n (cid:80) ∞ i =1 p i ). Then, for each 1 ≤ k ≤ N ,sample T k where P ( T k = i ) = p i (cid:80) ∞ j =1 p j . Then, Y i = (cid:80) N k =1 { T k = i } for each i .The two-step sampling perspective for X comes from rejection sampling: to generate asuccess of type k , we first generate some type of success, and then re-calibrate to get theright proportion for type k . The two-step perspective for Y comes from the thinning propertyof Poisson distribution (Last and Penrose, 2017, Exercise 1.5). The thinning property impliesthat for any finite index set K , all { Y i } for i ∈ K are mutually independent and marginally, Y i ∼ Poisson( np i ). Hence the whole collection { Y i } i =1 are independent Poissons and themean of Y i is np i .Observing that the conditional X | N = n is the same as Y | N = n , we use propagationrule Lemma D.7: d T V ( X, Y ) ≤ d T V ( N , N ) . Total variation between N and N is just the classic Binomial-Poisson approximationLe Cam (1960). d T V ( N , N ) ≤ n (cid:32) ∞ (cid:88) i =1 p i (cid:33) . Lemma D.5 (Total variation between Poissons (Adell and Lekuona, 2005, Corrollary 3.1)) . Let P be the Poisson distribution with mean s , P the Poisson distribution with mean t .Then: d T V ( P , P ) ≤ − exp( −| s − t | ) ≤ | s − t | . . Nguyen et al./Independent finite approximations D.2. Total variation
First is the chain rule, which will be applied to compare joint distributions that admitdensities.
Lemma D.6 (Chain rule) . Suppose ( X , Y ) and ( X , Y ) are two distributions, over A×B ,that have densities w.r.t a common measure. Then: d T V ( P X ,Y , P X ,Y ) ≤ d T V ( P X , P X ) + sup a ∈A d T V ( P Y | X = a , P Y | X = a ) . Proof of Lemma D.6.
Because both P X ,Y and P X ,Y have densities, total variation dis-tance is half of L distance between the densities: d TV ( P X ,Y , P X ,Y ) = 12 (cid:90) ( a,b ) ∈A×B | P X ,Y ( a, b ) − P X ,Y ( a, b ) | dadb = 12 (cid:90) ( a,b ) ∈A×B | P X ,Y ( a, b ) − P X ( a ) P Y | X ( b | a ) + P X ( a ) P Y | X ( b | a ) − P X ,Y ( a, b ) | dadb ≤ (cid:90) ( a,b ) ∈A×B (cid:0) P Y | X ( b | a ) | P X ( a ) − P X ( a ) | + P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | (cid:1) dadb = 12 (cid:90) ( a,b ) ∈A×B P Y | X ( b | a ) | P X ( a ) − P X ( a ) | dadb + 12 (cid:90) ( a,b ) ∈A×B P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | dadb. where we have used triangle inequality. Regarding the first term, using Fubini:12 (cid:90) ( a,b ) ∈A×B P Y | X ( b | a ) | P X ( a ) − P X ( a ) | dadb = 12 (cid:90) a ∈A (cid:18)(cid:90) b ∈B P Y | X ( b | a ) db (cid:19) | P X ( a ) − P X ( a ) | da = 12 (cid:90) a ∈A | P X ( a ) − P X ( a ) | da = d TV ( P X , P X ) . Regarding the second term:12 (cid:90) ( a,b ) ∈A×B P X ( a ) | P Y | X ( b | a ) − P Y | X ( b | a ) | dadb = (cid:90) a ∈A (cid:18) (cid:90) b ∈B | P Y | X ( b | a ) − P Y | X ( b | a ) | db (cid:19) P X ( a ) da ≤ (cid:18) sup a ∈A d TV ( P Y | X = a , P Y | X = a ) (cid:19) (cid:90) a ∈A P X ( a ) da = sup a ∈A d TV ( P Y | X = a , P Y | X = a ) Sum of the first and second upper bound give the total variation chain rule.Second is the propagation rule, which applies even if distributions don’t have densities.
Lemma D.7 (Propagation rule) . Suppose ( X , Y ) and ( X , Y ) are two distributions over A × B . Suppose the conditional Y | X = a is the same as the conditional Y | X = a , whichwe just denote as Y | X = a . Then: d T V ( P Y , P Y ) ≤ d T V ( P X , P X ) . . Nguyen et al./Independent finite approximations Proof of Lemma D.7.
It is well-known that total variation between P U and P V is the in-fimum of P ( U (cid:54) = V ) over all couplings ( U, V ) where U ∼ P U and V ∼ P V ((Madras andSezer, 2010, Equation 13)). For any joint distribution of ( X , Y , X , Y ) where marginally( X , Y ) ∼ P X ,Y and ( X , Y ) ∼ P X ,Y , ( Y , Y ) is a coupling where Y ∼ P Y and Y ∼ P Y . Therefore: d T V ( P Y , P Y ) ≤ P ( Y (cid:54) = Y ) = P ( Y (cid:54) = Y , X (cid:54) = X ) + P ( Y (cid:54) = Y , X = X ) . Now suppose the joint distribution over ( X , Y , X , Y ) is such that, conditioned on X = X = a for any a , P ( Y = Y | X = X = a ) = 1 (when X (cid:54) = X , it doesn’t matterthe relationship between Y | X = a and Y | X = b ). This is possible since the conditional Y | X = a is the same as the conditional Y | X = a . For such a distribution, P ( Y (cid:54) = Y , X = X ) = 0. Hence: d T V ( P Y , P Y ) ≤ P ( Y (cid:54) = Y , X (cid:54) = X ) ≤ P ( X (cid:54) = X ) . Now, we recognize that ( X , X ) is an arbitrary coupling between P X and P X . Takinginfimum over all couplings, we arrive at the propagation rule.Third is the product rule. Lemma D.8 (Product rule) . Z = ( X , Y ) and Z = ( X , Y ) are two distributions over A × B . Suppose P X ,Y factorizes into P X P Y and similarly P X ,Y = P X P Y . Then: inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ inf coupling P X ,P X P ( X (cid:54) = X ) + inf coupling P Y ,P Y P ( Y (cid:54) = Y ) Proof of Lemma D.8.
Consider any ( X , X ) that is a coupling of P X and P X , and any( Y , Y ) that is a coupling of P Y and P Y . Because of the factorization structure betweenthe X (cid:48) s and the Y (cid:48) , we can construct ( X (cid:48) , X (cid:48) , Y (cid:48) , Y (cid:48) ) such that ( X (cid:48) , X (cid:48) ) D = ( X , X ),( Y (cid:48) , Y (cid:48) ) D = ( Y , Y ), ( X (cid:48) , Y (cid:48) ) ∼ P X ,Y , ( X (cid:48) , Y (cid:48) ) ∼ P X ,Y . By union bound: P (( X (cid:48) , Y (cid:48) ) (cid:54) = ( X (cid:48) , Y (cid:48) )) ≤ P ( X (cid:48) (cid:54) = X (cid:48) ) + P ( Y (cid:48) (cid:54) = Y (cid:48) )Because inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ P (( X (cid:48) , Y (cid:48) ) (cid:54) = ( X (cid:48) , Y (cid:48) )), we have:inf coupling P Z ,P Z P ( Z (cid:54) = Z ) ≤ P ( X (cid:48) (cid:54) = X (cid:48) ) + P ( Y (cid:48) (cid:54) = Y (cid:48) ) . We finish the proof by taking the infimum over couplings ( X , X ) and ( Y , Y ) of theRHS. D.3. Miscellaneous
Lemma D.9 (Order of growth of harmonic-like sums) . N (cid:88) n =1 αn − α ≥ α (ln N − ψ ( α ) − . where ψ is the digamma function. . Nguyen et al./Independent finite approximations Proof of Lemma D.9.
It is well-known (for instance https://en.wikipedia.org/wiki/Chinese_restaurant_process ) that: N (cid:88) n =1 αn − α = α [ ψ ( α + N ) − ψ ( α )](Gordon, 1994, Theorem 5) says that ψ ( α + N ) ≥ ln( α + N ) − α + N ) − α + N ) ≥ ln N − . We list a collection of technical lemmas that are used when verifying Assumption 2 forthe recurring examples.The first set assists in the beta-Bernoulli model. • For α > i = 1 , , , . . . :1 i + α − ≤ (cid:18) α { i = 1 } + 1 i { i > } (cid:19) . (D.3) • For m, x, y > m ≤ y : (cid:12)(cid:12)(cid:12)(cid:12) m + xy + x − my (cid:12)(cid:12)(cid:12)(cid:12) ≤ xy . (D.4) Proof of Eq. (D.3) . If i = 1, i + α − = α . If i ≥ i + α − ≤ i − ≤ i . Proof of Eq. (D.4) . (cid:12)(cid:12)(cid:12)(cid:12) m + xy + x − my (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ( m + x ) y − m ( y + x ) y ( y + x ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) x ( y − m ) y ( y + x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ xy . The second set aid in the gamma-Poisson model. • For x ∈ [0 , − x ) ln(1 − x ) + x ≥ . (D.5) • For x ∈ (0 , p ≥
0: (1 − x ) p + p x − x ≥ . (D.6) • For λ >
0, for m > , t > , x > d T V (cid:0)
NB( m, t − ) , NB( m + x, t − ) (cid:1) ≤ x /t − /t . (D.7) • For y ≥ , m > , K > (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) ≤ e m K . (D.8)where e is the Euler constant. . Nguyen et al./Independent finite approximations Proof of Eq. (D.5) . Set g ( x ) to be the function on the right hand side. Then its derivativeis g (cid:48) ( x ) = − ln(1 − x ) ≥
0, meaning the function is monotone increasing. Since g (0) = 0, it’strue that g ( x ) ≥ , Proof of Eq. (D.6) . Let f ( p ) = (1 − x ) p + p x − x −
1. Then f (cid:48) ( p ) = ln(1 − x )(1 − x ) p + x − x .Also f (cid:48)(cid:48) ( p ) = (ln(1 − x )) (1 − x ) p >
0. So f (cid:48) ( p ) is monotone increasing. At p = 0, f (cid:48) (0) =ln(1 − x ) + x − x >
0. Therefore f (cid:48) ( p ) ≥ p . So f ( p ) is increasing. Since f (0) = 0, it’strue that f ( p ) ≥ p . Proof of Eq. (D.7) . It is known that NB( r, θ ) is a Poisson stopped sum distribution (John-son, Kemp and Kotz, 2005, Equation 5.15): • N ∼ Poisson( − r ln(1 − θ )). • Y i iid ∼ Log( θ ) where the Log( θ ) distribution’s pmf at k equals − θ k k ln(1 − θ ) . • (cid:80) Ni =1 Y i ∼ NB( r, θ ).Therefore, by total variation’s chain rule Lemma D.6, to compare NB( m, t − ) withNB( m + γ/K, t − ) it suffices to compare the two generating Poissons. d T V (cid:0)
NB( m, t − ) , NB( m + γ/K, t − ) (cid:1) ≤ d T V (Poisson( − m ln(1 − t − ) , Poisson( − ( m + γλ/K ) ln(1 − t − ))) ≤ − ln(1 − t − ) γλK ≤ t − − t − γλK . We have used the fact that total variation distance between Poissons is dominated by theirdifferent in means Lemma D.5 and Eq. (D.5) where x = ( λ + i ) − . Proof of Eq. (D.8) . Since Γ (cid:0) mK + y (cid:1) = (cid:16)(cid:81) y − j =0 ( mK + j ) (cid:17) Γ (cid:0) mK (cid:1) = Γ (cid:0) mK (cid:1) mK (cid:81) y − j =1 ( mK + j ), wehave: (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) = my y − (cid:89) j =1 m/K + jj − . We inspect the product in more detail. y − (cid:89) j =1 m/K + jj = y − (cid:89) j =1 (cid:18) m/Kj (cid:19) ≤ y − (cid:89) j =1 exp (cid:18) m/Kj (cid:19) = exp mK y − (cid:88) j =1 j ≤ exp (cid:16) mK (ln y + 1) (cid:17) = ( ey ) m/K . where the ( y − y + 1. In all: (cid:12)(cid:12)(cid:12)(cid:12) my − K Γ( m/K + y )Γ( m/K ) y ! (cid:12)(cid:12)(cid:12)(cid:12) ≤ my (cid:16) ( ey ) m/K − (cid:17) ≤ my mK ( ey − ≤ e m K .
The third set aid in the beta-negative binomial model. . Nguyen et al./Independent finite approximations • For x > z ≥ y ≥ B ( x, y ) − B ( x, z ) ≤ ( z − y ) B ( x + 1 , y − . ≤ ( z − y ) B ( x + 1 , y − . (D.9) • For any θ ∈ [0 , r > b ≥ ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) B ( y, b + r ) ≤ rb − . . (D.10) • For b ≥
1, for any c >
0, for any K ≥ c : (cid:12)(cid:12)(cid:12)(cid:12) − Γ( b )Γ( b + c/K ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK (2 + ln b ) . (D.11) • There exists a constant D (cid:48)(cid:48) such that for all b > , c > , K ≥ c (ln b + 2): (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK (3 ln b + 8) . (D.12) Proof of Eq. (D.9) . First we prove that for any x ∈ [0 , √ − x ln(1 − x ) + x ≥ . Truly, let g ( x ) to be the function on the right hand side. Then its derivative is: g (cid:48) ( x ) = 2 √ − x − ln(1 − x ) − √ − x . Denote the numerator function by h ( x ). Its derivative is: h (cid:48) ( x ) = 11 − x − √ − x ≥ , since x ∈ [0 ,
1] meaning h is monotone increasing. Since h (0) = 0, it means h ( x ) ≥
0. Thismeans g (cid:48) ( x ) ≥ g itself is monotone increasing. Since g (0) = 0 it’s true that g ( x ) ≥ x ∈ [0 , x ∈ [0 , p ≥ − x ) p + p x √ − x − ≥ . (D.13)Truly, let f ( p ) = (1 − x ) p + p x √ − x −
1. Then f (cid:48) ( p ) = ln(1 − x )(1 − x ) p + x √ − x . Also f (cid:48)(cid:48) ( p ) = (ln(1 − x )) (1 − x ) p >
0. So f (cid:48) ( p ) is monotone increasing. At p = 0, f (cid:48) (0) =ln(1 − x ) + x √ − x >
0. Therefore f (cid:48) ( p ) ≥ p . So f ( p ) is increasing. Since f (0) = 0,it’s true that f ( p ) ≥ p .We finally prove the inequality about beta functions. B ( x, y ) − B ( x, z ) = (cid:90) θ x − (1 − θ ) y − (1 − (1 − θ ) z − y ) dθ ≤ (cid:90) θ x − (1 − θ ) y − ( z − y ) θ (1 − θ ) − . dθ = ( z − y ) (cid:90) θ x (1 − θ ) y − . dθ = ( z − y ) B ( x + 1 , y − . . where we have usd 1 − (1 − θ ) z − y ≤ ( z − y ) θ (1 − θ ) − / . As for B ( x +1 , y − . ≤ B ( x +1 , y − . Nguyen et al./Independent finite approximations Proof of Eq. (D.10) . ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) B ( y, b + r ) = (cid:90) ∞ (cid:88) y =1 Γ( y + r ) y !Γ( r ) θ y − (1 − θ ) b + r − dθ = (cid:90) θ − (cid:32) ∞ (cid:88) y =1 Γ( y + r ) y ! Γ( r ) θ y (cid:33) (1 − θ ) b + r − dθ = (cid:90) (cid:18) θ − (cid:18) − θ ) r − (cid:19)(cid:19) (1 − θ ) b + r − dθ = (cid:90) (cid:0) θ − (1 − (1 − θ ) r ) (cid:1) (1 − θ ) b − dθ ≤ (cid:90) θ − r θ √ − θ (1 − θ ) b − dθ = r (cid:90) (1 − θ ) b − . dθ = rb − . , where the identity (cid:80) ∞ y =1 Γ( y + r ) y ! Γ( r ) θ y = − θ ) r − − (1 − θ ) r . Proof of Eq. (D.11) . First we prove that:1 − Γ( b )Γ( b + c/K ) ≤ cK (2 + ln b ) . The recursion defining Γ( b ) allows us to write:1 − Γ( b )Γ( b + c/K ) = 1 − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) . The argument proceeds in one of two ways. If Γ( b −(cid:98) b (cid:99) +1)Γ( b + c/K −(cid:98) b (cid:99) +1) ≥
1, then we have:1 − Γ( b )Γ( b + c/K ) ≤ − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i = (cid:18) − b − b + c/K − (cid:19) + b − b + c/K − − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i = cK b + c/K − b − b + c/K − − (cid:98) b (cid:99)− (cid:89) i =2 b − ib + c/K − i ≤ cK b − − (cid:98) b (cid:99)− (cid:89) i =2 b − ib + c/K − i ≤ ... ≤ cK (cid:98) b (cid:99)− (cid:88) i =1 b − i ≤ cK (ln b + 1) . . Nguyen et al./Independent finite approximations Else, Γ( b −(cid:98) b (cid:99) +1)Γ( b + c/K −(cid:98) b (cid:99) +1) < − Γ( b )Γ( b + c/K )= 1 − Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) + Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) − (cid:98) b (cid:99)− (cid:89) i =1 b − ib + c/K − i ≤ (cid:18) − Γ( b − (cid:98) b (cid:99) + 1)Γ( b + c/K − (cid:98) b (cid:99) + 1) (cid:19) + cK (ln b + 1) . We now argue that for all x ∈ [1 , K ≥ c , 1 − Γ( x )Γ( x + c/K ) ≤ cK . By convexity of Γ( x ),we know that Γ( x ) ≥ Γ( x + c/K ) − cK Γ (cid:48) ( x + c/K ). Hence Γ( x )Γ( x +1 /K ) ≥ − cK Γ (cid:48) ( x + c/K )Γ( x + c/K ) . Since x + c/K ∈ [1 ,
3) and ψ ( y ) = Γ (cid:48) ( y )Γ( y ) , the digamma function, is a monotone increasing function(it is the derivative of a ln Γ( x ), which is also convex), (cid:12)(cid:12)(cid:12) Γ (cid:48) ( x + c/K )Γ( x + c/K ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) Γ (cid:48) (3)Γ(3) (cid:12)(cid:12)(cid:12) ≤
1. Applyingthis to x = b − (cid:98) b (cid:99) + 1, we conclude that:1 − Γ( b )Γ( b + c/K ) ≤ cK (2 + ln b ) . We now show that: Γ( b )Γ( b + c/K ) − ≥ − cK (ln b + ln 2) . Convexity of Γ( y ) means that:Γ( b ) ≥ Γ( b + c/K ) − cK Γ (cid:48) ( b + c/K ) −→ Γ( b )Γ( b + c/K ) − ≥ − cK Γ (cid:48) ( b + c/K )Γ( b + c/K ) . From (Alzer, 1997, Equation 2.2), we know that ψ ( x ) ≤ ln( x ) for positive x . Therefore: − cK Γ (cid:48) ( b + c/K )Γ( b + c/K ) ≥ − cK ln( b + c/K ) ≥ − cK (ln b + ln 2)since b + cK ≤ b .We combine two sides of the inequality to conclude that the absolute value is at most cK (2 + ln b ). Proof of Eq. (D.12) . (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) = c (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) = c (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) (cid:18) Γ( c/K + b )Γ( b ) − (cid:19) + (cid:18) K/c Γ( c/K ) − (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18) K/c Γ( c/K ) (cid:12)(cid:12)(cid:12)(cid:12) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) K/c Γ( c/K ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) . On the one hand:
K/c Γ( c/K ) = Γ(1)Γ(1 + c/K ) . . Nguyen et al./Independent finite approximations From Eq. (D.11), we know: (cid:12)(cid:12)(cid:12)(cid:12)
Γ(1)Γ(1 + c/K ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ cK . On the other hand, let y = Γ( c/K + b )Γ( b ) . Then: (cid:12)(cid:12)(cid:12)(cid:12) Γ( c/K + b )Γ( b ) − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) y − (cid:12)(cid:12)(cid:12)(cid:12) = | − y | y ≤ cK (2 + ln b ) . Again using Eq. (D.11), | − y | ≤ cK (2 + ln b ). Since K ≥ c (ln b + 2), this is at most 0 . y ≥ .
5. In all: (cid:12)(cid:12)(cid:12)(cid:12) c − KB ( c/K, b ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ c (cid:18)(cid:18) cK (cid:19) cK (2 + ln b ) + 2 cK (cid:19) ≤ cK (3 ln b + 8) . Appendix E: Verification of upper bound’s assumptions for additionalexamples
E.1. Gamma-Poisson
First we write down the functions in Definition 4.1 for gamma-Poisson. This requires ex-pressing the rate measure and likelihood in exponential-family form: h ( x | θ ) = 1 x ! θ x exp( − θ ) , ν ( dθ ) = γλθ − exp( − λθ ) , which means that κ ( x ) = 1 /x ! , φ ( x ) = x, µ ( θ ) = 0 , A ( θ ) = θ . This leads to the normalizer: S = (cid:90) ∞ θ ξ exp( − λθ ) dθ = Γ( ξ + 1) λ − ( ξ +1) . Therefore, h c is: h c ( x n = x | x n − ) = 1 x ! Γ( − (cid:80) n − i =1 x i + x + 1)( λ + n ) − (cid:80) n − i =1 x i + x +1 Γ( − (cid:80) n − i =1 x i + 1)( λ + n − − (cid:80) n − i =1 x i +1 = 1 x ! Γ( (cid:80) n − i =1 x i + x )Γ( (cid:80) n − i =1 x i ) (cid:18) λ + n (cid:19) x (cid:18) − λ + n (cid:19) (cid:80) n − i =1 x i , and similarly (cid:101) h c is: (cid:101) h c ( x n = x | x n − ) = 1 x ! Γ( − (cid:80) n − i =1 x i + x + 1 + γλ/K )( λ + n ) − (cid:80) n − i =1 x i + x +1+ γλ/K Γ( − (cid:80) n − i =1 x i + 1 + γλ/K )( λ + n − − (cid:80) n − i =1 x i +1+ γλ/K = 1 x ! Γ( (cid:80) n − i =1 x i + x + γλ/K )Γ( (cid:80) n − i =1 x i + γλ/K ) (cid:18) λ + n (cid:19) x (cid:18) − λ + n (cid:19) (cid:80) n − i =1 x i + γλ/K , . Nguyen et al./Independent finite approximations and M n,x is: M n,x = γλ x ! Γ( x )( λ + n ) − x = γλx ( λ + n ) x . Now, we state the constants so that gamma-Poisson satisfies Assumption 2, and give theproof.
Proposition E.1 (Gamma-Poisson satisfies Assumption 2) . The following hold for arbitrary γ, λ > . For any n : ∞ (cid:88) x =1 M n,x ≤ γλn − λ . ∞ (cid:88) x =1 (cid:101) h c ( x | x n − = 0) ≤ γλn − λ . For any K : ∞ (cid:88) x =0 (cid:12)(cid:12)(cid:12) h c ( x | x n − ) − (cid:101) h c ( x | x n − ) (cid:12)(cid:12)(cid:12) ≤ γλK n − λ . For any K : ∞ (cid:88) x =1 (cid:12)(cid:12)(cid:12) M n,x − K (cid:101) h c ( x | x n − = 0) (cid:12)(cid:12)(cid:12) ≤ γ λ + eγ λ K n − λ . Proof of Proposition E.1.
The growth rate condition of target model is simple: ∞ (cid:88) x =1 M n,x = γλ ∞ (cid:88) x =1 x ( λ + n ) x ≤ γλ ∞ (cid:88) x =1 λ + n ) x = γλn − λ . The growth rate condition of approximate model is also simple: ∞ (cid:88) x =1 (cid:101) h c ( x | x n − = 0) = 1 − (cid:101) h c (0 | x n − = 0) = (cid:18) − λ + n (cid:19) γλ/K ≤ γλK ( λ + n ) − − ( λ + n ) − = 1 K γλn − λ , where we have used Eq. (D.6) with p = γλK , x = ( λ + n ) − .For the total variation between h c and (cid:101) h c condition, observe that h c and (cid:101) h c are probabilitymass functions of negative binomial distributions, namely: h c ( x | x n − ) = NB (cid:32) x | n − (cid:88) i =1 x i , ( λ + n ) − (cid:33)(cid:101) h c ( x | x n − ) = NB (cid:32) x | n − (cid:88) i =1 x i + γλ/K, ( λ + n ) − (cid:33) . The two negative binomial distributions have the same success probability and only differin the number of trials. Hence using Eq. (D.7), we have: ∞ (cid:88) x =0 (cid:12)(cid:12)(cid:12) h c ( x | x n − ) − (cid:101) h c ( x | x n − ) (cid:12)(cid:12)(cid:12) ≤ γλK ( λ + n ) − − ( λ + n ) − = 2 γλK n − λ . . Nguyen et al./Independent finite approximations For the total variation between M n,. and K (cid:101) h c ( · |
For the total variation between $M_{n,\cdot}$ and $K\tilde h_c(\cdot\mid 0)$ condition:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\tilde h_c(x\mid x_{1:n-1}=0)\right| = \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K}\right|$$
$$\le \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\left(\left|\frac{\gamma\lambda}{x}\left(1 - \left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K}\right)\right| + \left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\right|\right).$$
Using Eqs. (D.7) and (D.8) we can upper bound:
$$1 - \left(1-\frac{1}{\lambda+n}\right)^{\gamma\lambda/K} \le \frac{\gamma\lambda}{K}\,\frac{1}{\lambda+n-1}, \qquad \left|\frac{\gamma\lambda}{x} - K\,\frac{\Gamma(\gamma\lambda/K+x)}{\Gamma(\gamma\lambda/K)\,x!}\right| \le \frac{e\gamma^2\lambda^2}{K}.$$
This means:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\,\frac{\gamma\lambda}{x}\,\frac{\gamma\lambda}{K}\,\frac{1}{\lambda+n-1} + \sum_{x=1}^\infty\frac{1}{(\lambda+n)^x}\,\frac{e\gamma^2\lambda^2}{K} \le \frac{\gamma^2\lambda^2}{K}\,\frac{1}{\lambda+n-1} + \frac{e\gamma^2\lambda^2}{K}\,\frac{1}{\lambda+n-1} \le \frac{\gamma^2\lambda^2 + e\gamma^2\lambda^2}{K}\,\frac{1}{n-1+\lambda}.$$

E.2. Beta-negative binomial
First we write down the functions in Definition 4.1 for beta-negative binomial. This requires expressing the rate measure and likelihood in exponential-family form:
$$h(x\mid\theta) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\theta^x\exp(r\log(1-\theta)), \qquad \nu(d\theta) = \gamma\alpha\,\theta^{-1}\exp\big(\log(1-\theta)(\alpha-1)\big)\,\mathbf{1}\{\theta\le 1\}\,d\theta,$$
which means that $\kappa(x) = \Gamma(x+r)/(\Gamma(r)\,x!)$, $\phi(x) = x$, $\mu(\theta) = 0$, $A(\theta) = -r\log(1-\theta)$. This leads to the normalizer:
$$S = \int_0^1\theta^{\xi}(1-\theta)^{r\lambda}\,d\theta = B(\xi+1,\ r\lambda+1).$$
To match the parametrizations, we need to set $\lambda = (\alpha-1)/r$, i.e. $r\lambda = \alpha-1$. Therefore, $h_c$ is:
$$h_c(x_n = x\mid x_{1:n-1}) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\frac{B\!\left(\sum_{i=1}^{n-1}x_i + x,\ rn+\alpha\right)}{B\!\left(\sum_{i=1}^{n-1}x_i,\ r(n-1)+\alpha\right)},$$
and $\tilde h_c$ is:
$$\tilde h_c(x_n = x\mid x_{1:n-1}) = \frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,\frac{B\!\left(\gamma\alpha/K + \sum_{i=1}^{n-1}x_i + x,\ rn+\alpha\right)}{B\!\left(\gamma\alpha/K + \sum_{i=1}^{n-1}x_i,\ r(n-1)+\alpha\right)},$$
and $M_{n,x}$ is:
$$M_{n,x} = \gamma\alpha\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\,B(x,\ rn+\alpha).$$
Now, we state the constants so that beta-negative binomial satisfies Assumption 2, and give the proof.
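The beta-function formulas make $h_c$ and $\tilde h_c$ straightforward to evaluate on a log scale. A minimal sketch (ours, with hypothetical parameter values) checks that both are normalized p.m.f.s and that they merge as $K$ grows:

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_hc(x, A, r, n, alpha, shift=0.0):
    """Log-predictive for beta-negative binomial; shift = gamma*alpha/K gives tilde h_c."""
    return (gammaln(x + r) - gammaln(r) - gammaln(x + 1)
            + betaln(shift + A + x, r * n + alpha)
            - betaln(shift + A, r * (n - 1) + alpha))

gamma_, alpha, r, n, K = 1.0, 2.0, 3.0, 5, 100
A = 4.0                              # sum of previous counts (must be positive)
x = np.arange(0, 2000)

p = np.exp(log_hc(x, A, r, n, alpha))
p_tilde = np.exp(log_hc(x, A, r, n, alpha, shift=gamma_ * alpha / K))
print(p.sum(), p_tilde.sum())        # both should be close to 1
print(0.5 * np.abs(p - p_tilde).sum())  # total variation; shrinks like 1/K
```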
Proposition E.2 (Beta-negative binomial satisfies Assumption 2). The following hold for any $\gamma > 0$, $\alpha \ge 1$. For any $n$:
$$\sum_{x=1}^\infty M_{n,x} \le \frac{2\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For any $n$, any $K$:
$$\sum_{x=1}^\infty \tilde h_c(x\mid x_{1:n-1}=0) \le \frac{2}{K}\,\frac{\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For any $K$:
$$\sum_{x=0}^\infty\left|h_c(x\mid x_{1:n-1}) - \tilde h_c(x\mid x_{1:n-1})\right| \le \frac{2\gamma\alpha}{K}\,\frac{1}{n-1+(\alpha-1)/r}.$$
For any $n$, for $K \ge \gamma\alpha(3\ln(r(n-1)+\alpha)+8)$:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\,\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha+3)\ln(rn+\alpha+1) + (10+2r)\gamma\alpha + 24}{n-1+(\alpha-0.5)/r}.$$

Proof of Proposition E.2.
The first growth rate condition is easy to verify:
$$\sum_{x=1}^\infty M_{n,x} = \gamma\alpha\sum_{x=1}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}\,B(x,\ rn+\alpha) \le \gamma\alpha\,\frac{2r}{r(n-1)+\alpha-0.5},$$
where we have used Eq. (D.10) with $b = r(n-1)+\alpha$.

As for the other growth rate condition,
$$\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:n-1}=0) = 1 - \tilde h_c(0\mid x_{1:n-1}=0) = 1 - \frac{B(\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} = \frac{B(\gamma\alpha/K,\ r(n-1)+\alpha) - B(\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}.$$
The numerator is small because of Eq. (D.9), where $x = \gamma\alpha/K$, $y = r(n-1)+\alpha$, $z = rn+\alpha$:
$$B(\gamma\alpha/K,\ r(n-1)+\alpha) - B(\gamma\alpha/K,\ rn+\alpha) \le r\,B(\gamma\alpha/K+1,\ r(n-1)+\alpha-0.5).$$
The denominator is large because of Eq. (D.12) with $c = \gamma\alpha$, $b = r(n-1)+\alpha$:
$$\frac{1}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} \le \frac{2\gamma\alpha}{K}.$$
Combining the two and using a simple bound on the beta function yields:
$$\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:n-1}=0) \le \frac{2}{K}\,\frac{\gamma\alpha}{n-1+(\alpha-0.5)/r}.$$
For the total variation between $h_c$ and $\tilde h_c$ condition, we first discuss how each function can be expressed as a probability mass function of a so-called beta-negative-binomial, i.e. BNB (Johnson, Kemp and Kotz, 2005, Section 6.2.3), distribution. Let $A = \sum_{i=1}^{n-1}x_i$. Observe that:
$$\frac{\Gamma(x+r)}{\Gamma(r)\,x!}\,\frac{B(A+x,\ rn+\alpha)}{B(A,\ r(n-1)+\alpha)} = \frac{\Gamma(A+x)}{\Gamma(A)\,x!}\,\frac{B(r+x,\ A+r(n-1)+\alpha)}{B(r,\ r(n-1)+\alpha)}. \quad (E.1)$$
The random variable $V$ whose p.m.f. at $x$ appears on the right hand side of Eq. (E.1) is the result of a two-step sampling procedure:
$$P \sim \mathrm{Beta}(r,\ r(n-1)+\alpha), \qquad V\mid P \sim \mathrm{NB}(A;\ P).$$
We denote such a distribution by $V \sim \mathrm{BNB}(A;\ r,\ r(n-1)+\alpha)$. An analogous argument applies to $\tilde h_c$:
$$P \sim \mathrm{Beta}(r,\ r(n-1)+\alpha), \qquad V\mid P \sim \mathrm{NB}\!\left(A + \frac{\gamma\alpha}{K};\ P\right).$$
Therefore:
$$h_c(x\mid x_{1:n-1}) = \mathrm{BNB}(x\mid A;\ r,\ r(n-1)+\alpha), \qquad \tilde h_c(x\mid x_{1:n-1}) = \mathrm{BNB}\!\left(x\,\Big|\,A+\frac{\gamma\alpha}{K};\ r,\ r(n-1)+\alpha\right).$$
We now bound the total variation between the BNB distributions. Because they have a common mixing distribution, we can upper bound the distance with an integral using simple triangle inequalities:
$$d_{TV}(h_c, \tilde h_c) = \frac{1}{2}\sum_{x=0}^\infty|P(V_1=x) - P(V_2=x)| = \frac{1}{2}\sum_{x=0}^\infty\left|\int\big(P(V_1=x\mid P=p) - P(V_2=x\mid P=p)\big)\,P(P\in dp)\right|$$
$$\le \frac{1}{2}\int\left(\sum_{x=0}^\infty|P(V_1=x\mid P=p) - P(V_2=x\mid P=p)|\right)P(P\in dp) = \int d_{TV}\big(\mathrm{NB}(A, p),\ \mathrm{NB}(A+\gamma\alpha/K,\ p)\big)\,P(P\in dp).$$
For any $p$, Eq. (D.7) is used to upper bound the total variation distance between negative binomial distributions. Therefore:
$$d_{TV}(h_c, \tilde h_c) \le \int\frac{\gamma\alpha}{K}\,\frac{p}{1-p}\,P(P\in dp) = \frac{\gamma\alpha}{K}\,\frac{1}{B(r,\ r(n-1)+\alpha)}\int_0^1 p^{r}(1-p)^{r(n-1)+\alpha-2}\,dp = \frac{\gamma\alpha}{K}\,\frac{B(r+1,\ r(n-1)+\alpha-1)}{B(r,\ r(n-1)+\alpha)} = \frac{\gamma\alpha}{K}\,\frac{1}{n-1+(\alpha-1)/r},$$
so the sum of absolute differences is at most twice this quantity.

Finally, we verify the condition between $K\tilde h_c$ and $M_{n,\cdot}$, which amounts to showing that the following sum is small:
$$\sum_{x=1}^\infty\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\left|\gamma\alpha\,B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right|.$$
We look at the summand for $x = 1$ and the summation from $x = 2$ through $\infty$ separately. For $x = 1$, we prove that:
$$\left|\gamma\alpha\,B(1,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+1,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{4\gamma^2\alpha^2}{K}\,\frac{2+\ln(rn+\alpha+1)}{rn+\alpha}. \quad (E.2)$$
Expanding gives:
$$\left|\gamma\alpha\,B(1,\ rn+\alpha) - K\,\frac{B(1+\gamma\alpha/K,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| = \frac{\big|\gamma\alpha\,B(1,\ rn+\alpha)\,B(\gamma\alpha/K,\ r(n-1)+\alpha) - K\,B(1+\gamma\alpha/K,\ rn+\alpha)\big|}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}. \quad (E.3)$$
We look at the numerator of the right hand side in Eq. (E.3):
$$\left|\gamma\alpha\,B(1,\ rn+\alpha)\,\frac{\Gamma(\gamma\alpha/K)\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - K\,\frac{\Gamma(1+\gamma\alpha/K)\Gamma(rn+\alpha)}{\Gamma(1+\gamma\alpha/K+rn+\alpha)}\right|$$
$$= \gamma\alpha\,\Gamma(\gamma\alpha/K)\left|\frac{1}{rn+\alpha}\,\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - \frac{\Gamma(rn+\alpha)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)}\right|$$
$$= \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\left|\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - \frac{\Gamma(rn+\alpha+1)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)}\right|$$
$$\le \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\left(\left|\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)} - 1\right| + \left|\frac{\Gamma(rn+\alpha+1)}{\Gamma(\gamma\alpha/K+1+rn+\alpha)} - 1\right|\right)$$
$$\le \gamma\alpha\,\Gamma(\gamma\alpha/K)\,\frac{1}{rn+\alpha}\,\frac{2\gamma\alpha}{K}\,(2+\ln(rn+\alpha+1)),$$
where we have used Eq. (D.11) with $c = \gamma\alpha$ and $b = r(n-1)+\alpha$ or $b = rn+\alpha+1$. In all, Eq. (E.3) is upper bounded by:
$$\frac{2\gamma^2\alpha^2}{rn+\alpha}\,\frac{2+\ln(rn+\alpha+1)}{K}\,\frac{\Gamma(\gamma\alpha/K)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} = \frac{2\gamma^2\alpha^2}{rn+\alpha}\,\frac{2+\ln(rn+\alpha+1)}{K}\,\frac{\Gamma(\gamma\alpha/K+r(n-1)+\alpha)}{\Gamma(r(n-1)+\alpha)} \le \frac{4\gamma^2\alpha^2}{K}\,\frac{2+\ln(rn+\alpha+1)}{rn+\alpha},$$
since $\frac{\Gamma(r(n-1)+\alpha)}{\Gamma(r(n-1)+\alpha+\gamma\alpha/K)} \ge 1 - \frac{\gamma\alpha}{K}(2+\ln(r(n-1)+\alpha)) \ge 0.5$ for $K \ge 2\gamma\alpha(2+\ln(r(n-1)+\alpha))$, and this is the proof of Eq. (E.2).

We now move on to the summands from $x = 2$ to $\infty$. By the triangle inequality:
$$\left|\gamma\alpha\,B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le T_1(x) + T_2(x),$$
where:
$$T_1(x) := B(x,\ rn+\alpha)\left|\gamma\alpha - \frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right|, \qquad T_2(x) := K\,\frac{\left|B(x,\ rn+\alpha) - B(\gamma\alpha/K+x,\ rn+\alpha)\right|}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}.$$
The helper inequalities we have proven are once again useful:
$$\left|\gamma\alpha - \frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{\gamma^2\alpha^2}{K}(3\ln(r(n-1)+\alpha)+8),$$
$$\frac{K}{B(\gamma\alpha/K,\ r(n-1)+\alpha)} \le \gamma\alpha + \frac{\gamma^2\alpha^2}{K}(3\ln(r(n-1)+\alpha)+8) \le 2\gamma\alpha,$$
$$\left|B(x,\ rn+\alpha) - B(\gamma\alpha/K+x,\ rn+\alpha)\right| \le \frac{\gamma\alpha}{K}\,B(x-1,\ rn+\alpha+1);$$
since $K \ge \gamma\alpha(3\ln(r(n-1)+\alpha)+8)$, we have applied Eq. (D.12) in the first and second inequalities and Eq. (D.9) in the third one. So for each $x \ge 2$, each summand is at most:
$$\frac{\Gamma(x+r)}{x!\,\Gamma(r)}\left|\gamma\alpha B(x,\ rn+\alpha) - K\,\frac{B(\gamma\alpha/K+x,\ rn+\alpha)}{B(\gamma\alpha/K,\ r(n-1)+\alpha)}\right| \le \frac{\gamma^2\alpha^2(3\ln(r(n-1)+\alpha)+8)}{K}\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}B(x,\ rn+\alpha) + \frac{2\gamma^2\alpha^2}{K}\,\frac{\Gamma(x+r)}{x!\,\Gamma(r)}B(x-1,\ rn+\alpha+1).$$
To upper bound the summation from $x = 2$ to $\infty$, it suffices to bound:
$$\sum_{x=2}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x,\ rn+\alpha) \le \sum_{x=1}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x,\ rn+\alpha) \le \frac{2r}{r(n-1)+\alpha-0.5},$$
and:
$$\sum_{x=2}^\infty\frac{\Gamma(x+r)}{\Gamma(r)\,x!}B(x-1,\ rn+\alpha+1) \le r\sum_{x=2}^\infty\frac{\Gamma(x-1+r+1)}{\Gamma(r+1)\,(x-1)!}B(x-1,\ rn+\alpha+1) = r\sum_{z=1}^\infty\frac{\Gamma(z+r+1)}{\Gamma(r+1)\,z!}B(z,\ rn+\alpha+1) \le \frac{2r(r+1)}{r(n-1)+\alpha-0.5}.$$
The summation from $x = 2$ to $\infty$ is therefore upper bounded by:
$$\frac{\gamma^2\alpha^2(3\ln(r(n-1)+\alpha)+8)}{K}\,\frac{2r}{r(n-1)+\alpha-0.5} + \frac{2\gamma^2\alpha^2}{K}\,\frac{2r(r+1)}{r(n-1)+\alpha-0.5}. \quad (E.4)$$
Eqs. (E.2) and (E.4) combine to give:
$$\sum_{x=1}^\infty\left|M_{n,x} - K\,\tilde h_c(x\mid x_{1:n-1}=0)\right| \le \frac{\gamma\alpha}{K}\,\frac{(4\gamma\alpha+3)\ln(rn+\alpha+1) + (10+2r)\gamma\alpha + 24}{n-1+(\alpha-0.5)/r}.$$

Appendix F: Proofs of CRM bounds
F.1. Upper bound
Proof of Theorem 4.2.
Let $\beta$ be the smallest positive constant such that $\frac{\beta^2}{2(1+\beta)}C_1 \ge 2$. We will focus on the case where the approximation level $K$ is essentially $\Omega(\ln N)$:
$$K \ge \max\Big\{(\beta+1)\max\big(C(K, C_2),\ C(N, C_2)\big),\ C_1(\ln N + C_2)\Big\}. \quad (F.1)$$
To see why this is sufficient, observe that the upper bound in Theorem 4.2 naturally holds for $K$ smaller than $\ln N$: total variation distance is always upper bounded by 1, so if $K = o(\ln N)$, then by selecting reasonable constants $C', C'', C'''$ we can make the right hand side at least 1 and satisfy the inequality. In the sequel, we only consider the situation in Eq. (F.1).

First, we argue that it suffices to bound the total variation distance between the feature-allocation matrices coming from the target model and the approximate model. Given the latent measures $X_1, X_2, \ldots, X_N$ from the target model, we can read off the feature-allocation matrix $F$, which has $N$ rows and as many columns as there are unique atom locations among the $X_i$'s:
1. The $i$-th row of $F$ records the atom sizes of $X_i$.
2. Each column corresponds to an atom location: the locations are sorted first according to the index of the first measure $X_i$ to manifest it (counting from $1, 2, \ldots$), and then by its atom size in $X_i$.
The marginal process that describes the atom sizes of $X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1$ in Proposition C.1 is also the description of how the rows of $F$ are generated. The joint distribution of $X_1, X_2, \ldots, X_N$ can be two-step sampled. First, the feature-allocation matrix $F$ is sampled. Then the atom locations are drawn i.i.d. from the base measure $H$: each column of $F$ is assigned an atom location, and the latent measure $X_i$ has atom size $F_{i,j}$ at the $j$-th atom location. A similar two-step sampling generates $Z_1, Z_2, \ldots, Z_N$, the latent measures under the approximate model: the distribution over the feature-allocation matrix $F'$ follows Proposition C.2 instead of Proposition C.1, but conditioned on the feature-allocation matrix, the process generating atom locations and constructing latent measures is exactly the same. In other words, the conditional distributions $Y_{1:N} \mid F = f$ and $W_{1:N} \mid F' = f$ are the same, since both models have the same observational likelihood $f$ given latent measures 1 through $N$. Denote by $P_F$ the distribution of the feature-allocation matrix under the target model, and by $P_{F'}$ that under the approximate model. Lemma D.7 implies that:
$$d_{TV}(P_{W_{1:N}}, P_{Y_{1:N}}) \le d_{TV}(P_F, P_{F'}). \quad (F.2)$$
Next, we parametrize the feature-allocation matrices in a way that is convenient for the analysis of total variation distance. Let $J$ be the number of columns of $F$. Our parametrization involves $d_{n,x}$, for $n \in [N]$ and $x \in \mathbb{N}$, and $s_j$, for $j \in [J]$:
1. For $n = 1, 2, \ldots, N$:
(a) If $n = 1$, for each $x \in \mathbb{N}$, $d_{1,x}$ counts the number of columns $j$ where $F_{1,j} = x$.
(b) For $n \ge 2$, for each $x \in \mathbb{N}$, let $J_n = \{j : \forall i < n,\ F_{i,j} = 0\}$, i.e. no observation before $n$ manifests the atom locations indexed by columns in $J_n$. For each $x \in \mathbb{N}$, $d_{n,x}$ counts the number of columns $j \in J_n$ where $F_{n,j} = x$.
2. For $j = 1, 2, \ldots, J$, let $I_j = \min\{i : F_{i,j} > 0\}$, i.e. the first row to manifest the $j$-th atom location. Let $s_j = F_{I_j:N,\,j}$, i.e. the history of the $j$-th atom location.
In words, $d_{n,x}$ is the number of atom locations first instantiated by individual $n$ with atom size $x$, while $s_j$ is the history of the $j$-th atom location. $\sum_{n=1}^N\sum_{x=1}^\infty d_{n,x}$ is exactly $J$, the number of columns. We use the shorthand $d$ for the collection of the $d_{n,x}$ and $s$ for the collection of the $s_j$. There is a one-to-one mapping between $(d, s)$ and the feature-allocation matrix $f$. Let $(D, S)$ be the distribution of $d$ and $s$ under the target model, and $(D', S')$ the distribution under the approximate model. We now aim to compare the joint distributions:
$$d_{TV}(P_F, P_{F'}) = d_{TV}(P_{D,S}, P_{D',S'}).$$
Because total variation distance is the infimum of the difference probability over all couplings, to find an upper bound on $d_{TV}(P_{D,S}, P_{D',S'})$ it suffices to demonstrate a joint distribution under which $P((D,S) \ne (D',S'))$ is small. The rest of the proof is dedicated to that end. To start, we only assume that $(D, S, D', S')$ is a proper coupling, in that marginally $(D,S) \sim P_{D,S}$ and $(D',S') \sim P_{D',S'}$. As we progress, gradually more structure is added to the joint distribution $(D, S, D', S')$ to control $P((D,S) \ne (D',S'))$.

We first decompose $P((D,S) \ne (D',S'))$ into other probabilistic quantities which can be analyzed using Assumption 2. Define the typical set:
$$\mathcal{D}^* = \left\{d : \sum_{n=1}^N\sum_{x=1}^\infty d_{n,x} \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\right\}.$$
$d \in \mathcal{D}^*$ means that the feature-allocation matrix $f$ has a bounded number of columns. The claim is that:
$$P((D,S) \ne (D',S')) \le P(D \ne D') + P(S \ne S' \mid D = D', D \in \mathcal{D}^*) + P(D \notin \mathcal{D}^*). \quad (F.3)$$
This is true from basic properties of probabilities and conditional probabilities:
$$P((D,S) \ne (D',S')) = P(D \ne D') + P(S \ne S', D = D') = P(D \ne D') + P(S \ne S', D = D', D \in \mathcal{D}^*) + P(S \ne S', D = D', D \notin \mathcal{D}^*)$$
$$\le P(D \ne D') + P(S \ne S' \mid D = D', D \in \mathcal{D}^*) + P(D \notin \mathcal{D}^*).$$
The three ideas behind this upper bound are the following. First, because of the growth condition, we can analyze the atypical set probability $P(D \notin \mathcal{D}^*)$. Second, because of the total variation between $h_c$ and $\tilde h_c$, we can analyze $P(S \ne S' \mid D = D', D \in \mathcal{D}^*)$. Finally, we can analyze $P(D \ne D')$ because of the total variation between $K\tilde h_c$ and $M_{n,\cdot}$. In what follows we carry out the program.

Atypical set probability. The $P(D \notin \mathcal{D}^*)$ term in Eq. (F.3) is easiest to control. Under the target model (Proposition C.1), the $D_{i,x}$'s are independent Poissons with means $M_{i,x}$, so the sum $\sum_{i=1}^N\sum_{x=1}^\infty D_{i,x}$ is itself Poisson with mean $M = \sum_{i=1}^N\sum_{x=1}^\infty M_{i,x}$. Because of Lemma D.3, for any $x > 0$:
$$P\left(\sum_{i=1}^N\sum_{x=1}^\infty D_{i,x} > M + x\right) \le \exp\left(-\frac{x^2}{2(M+x)}\right).$$
For the event $\{D \notin \mathcal{D}^*\}$, $M + x = (\beta+1)\max(C(K,C_2), C(N,C_2))$ and $M \le C(N, C_2)$ due to Eq. (7), so that $x \ge \beta\max(C(K,C_2), C(N,C_2))$. Therefore:
$$P(D \notin \mathcal{D}^*) \le \exp\left(-\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2),\ C(N,C_2)\big)\right). \quad (F.4)$$

Difference between histories. To minimize the difference probability between the histories of atom sizes, i.e. the $P(S \ne S' \mid D = D', D \in \mathcal{D}^*)$ term in Eq. (F.3), we will use Eq. (9). The claim is that there exists a coupling of $S' \mid D'$ and $S \mid D$ such that:
$$P(S \ne S' \mid D = D', D \in \mathcal{D}^*) \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\,\frac{C(N,C_2)}{K}. \quad (F.5)$$
Fix some $d \in \mathcal{D}^*$ — since we are in the typical set, the number of columns in the feature-allocation matrix is at most $(\beta+1)\max(C(K,C_2), C(N,C_2))$. Conditioned on $D = d$, there is a finite number of history variables $S_j$, one for each atom location; similarly for the conditioning of $S'$ on $D' = d$. For both the target and the approximate model, the density of the joint distribution factorizes:
$$P(S = s \mid D = d) = \prod_{j=1}^J P(S_j = s_j \mid D = d), \qquad P(S' = s \mid D' = d) = \prod_{j=1}^J P(S'_j = s_j \mid D' = d),$$
since in both marginal processes the atom sizes for different atom locations are independent of each other. This means we can use Lemma D.8:
$$d_{TV}(P_{S\mid D=d}, P_{S'\mid D'=d}) \le \sum_{j=1}^J d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d}).$$
We inspect each $d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d})$. Fixing $d$ also fixes $I_j$, the first row to manifest the $j$-th atom location. The history $s_j$ is then an $N - I_j + 1$ dimensional integer vector, whose $t$-th entry is the atom size at the $j$-th atom location of the $(t+I_j-1)$-th latent measure. For any partial history $s_0$, the conditional distributions $S_j(t)$ and $S'_j(t)$ are very similar: $S_j(t) \mid D = d, S_j(1:(t-1)) = s_0$ is governed by $h_c$ (Proposition C.1) while $S'_j(t) \mid D' = d, S'_j(1:(t-1)) = s_0$ is governed by $\tilde h_c$ (Proposition C.2). Hence:
$$d_{TV}\left(P_{S_j(t)\mid D=d, S_j(1:(t-1))=s_0},\ P_{S'_j(t)\mid D'=d, S'_j(1:(t-1))=s_0}\right) \le \frac{1}{K}\,\frac{C_1}{t+I_j-1+C_2},$$
for any partial history $s_0$. To use this conditional bound, we again leverage Lemma D.6 to compare the joint $S_j = (S_j(1), \ldots, S_j(N-I_j+1))$ with the joint $S'_j = (S'_j(1), \ldots, S'_j(N-I_j+1))$, peeling off one layer at a time:
$$d_{TV}(P_{S_j\mid D=d}, P_{S'_j\mid D'=d}) \le \sum_{t=1}^{N-I_j+1}\max_{s_0}\ d_{TV}\left(P_{S_j(t)\mid D=d, S_j(1:(t-1))=s_0},\ P_{S'_j(t)\mid D'=d, S'_j(1:(t-1))=s_0}\right) \le \sum_{t=1}^{N-I_j+1}\frac{1}{K}\,\frac{C_1}{t+I_j-1+C_2} \le \frac{C(N, C_2)}{K}.$$
Multiplying the right hand side by $(\beta+1)\max(C(K,C_2), C(N,C_2))$, the upper bound on $J$, we arrive at the upper bound for the total variation between $P_{S\mid D=d}$ and $P_{S'\mid D'=d}$ in Eq. (F.5). Furthermore, our analysis of the total variation can be back-tracked to construct the coupling between the conditional distributions $S\mid D=d$ and $S'\mid D'=d$ which attains that small probability of difference. Since the choice of conditioning $d \in \mathcal{D}^*$ was arbitrary, we have actually shown Eq. (F.5).

Difference between new atom sizes. Finally, to control the difference probability for the distribution over new atom sizes, i.e. the $P(D \ne D')$ term in Eq. (F.3), we utilize Eqs. (8) and (10). For each $n$, define the shorthand $d_{1:n}$ for the collection $d_{i,x}$, $i \in [n]$, $x \in \mathbb{N}$, and the typical sets:
$$\mathcal{D}^*_n = \left\{d_{1:n} : \sum_{i=1}^n\sum_{x=1}^\infty d_{i,x} \le (\beta+1)\max\big(C(K,C_2),\ C(N,C_2)\big)\right\}.$$
The type of expansion performed in Eq. (F.3) can be done once here to see that:
$$P(D \ne D') = P((D_{1:N-1}, D_N) \ne (D'_{1:N-1}, D'_N)) \le P(D_{1:N-1} \ne D'_{1:N-1}) + P(D_N \ne D'_N \mid D_{1:N-1} = D'_{1:N-1}, D_{1:N-1} \in \mathcal{D}^*_{N-1}) + P(D_{1:N-1} \notin \mathcal{D}^*_{N-1}).$$
Apply the expansion once more to $P(D_{1:N-1} \ne D'_{1:N-1})$, then to $P(D_{1:N-2} \ne D'_{1:N-2})$. If we define:
$$B_j = P(D_j \ne D'_j \mid D_{1:j-1} = D'_{1:j-1}, D_{1:j-1} \in \mathcal{D}^*_{j-1}),$$
with the special case $B_1$ simply being $P(D_1 \ne D'_1)$, then:
$$P(D \ne D') \le \sum_{j=1}^N B_j + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}). \quad (F.6)$$
The second summation in Eq. (F.6), comprising only atypical probabilities, is easier to control. For any $j$, since $\sum_{i=1}^{j-1}\sum_{x=1}^\infty D_{i,x} \le \sum_{i=1}^N\sum_{x=1}^\infty D_{i,x}$, $P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le P(D \notin \mathcal{D}^*)$, so a generous upper bound for the contribution of all the atypical probabilities, including the first one from Eq. (F.4), is:
$$P(D \notin \mathcal{D}^*) + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le \exp\left(-\left(\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2), C(N,C_2)\big) - \ln N\right)\right).$$
By Lemma D.9, $\max(C(K,C_2), C(N,C_2)) \ge C_1(\max(\ln N, \ln K) - C_2(\psi(C_2)+1))$. Since we have set $\beta$ so that $\frac{\beta^2}{2(\beta+1)}C_1 = 2$, we have:
$$\frac{\beta^2}{2(\beta+1)}\max\big(C(K,C_2), C(N,C_2)\big) - \ln N \ge \ln K - \text{constant},$$
meaning the overall atypical probability is at most:
$$P(D \notin \mathcal{D}^*) + \sum_{j=2}^N P(D_{1:j-1} \notin \mathcal{D}^*_{j-1}) \le \frac{\text{constant}}{K}. \quad (F.7)$$
As for the first summation in Eq. (F.6), we look at the individual $B_j$'s. For any fixed $d_{1:j-1} \in \mathcal{D}^*_{j-1}$, we claim that there exists a coupling between the conditionals $D_j \mid D_{1:j-1} = d_{1:j-1}$ and $D'_j \mid D'_{1:j-1} = d_{1:j-1}$ such that $P(D_j \ne D'_j \mid D_{1:j-1} = D'_{1:j-1} = d_{1:j-1})$ is at most:
$$\frac{\text{constant}}{K\,(j-1+C_2)^2} + \frac{\text{constant}\,(\ln N + \ln K)}{K\,(j-1+C_2)}. \quad (F.8)$$
Because the upper bound holds for arbitrary values $d_{1:j-1}$, the coupling actually ensures that, as long as $D_{1:j-1} = D'_{1:j-1}$ for some value in $\mathcal{D}^*_{j-1}$, the probability of difference between $D_j$ and $D'_j$ is small, i.e. $B_j$ is at most the right hand side.

Such a coupling exists because the total variation between the two distributions $P_{D_j\mid D_{1:j-1}=d_{1:j-1}}$ and $P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}}$ is small. In particular, there exists a distribution $U = \{U_x\}_{x=1}^\infty$ of independent Poisson random variables such that both the total variation between $P_{D_j\mid D_{1:j-1}=d_{1:j-1}}$ and $P_U$ and the total variation between $P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}}$ and $P_U$ are small — we then use the triangle inequality to bound the original total variation. Here, each $U_x$ has mean:
$$E(U_x) = \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\tilde h_c(x\mid x_{1:j-1} = 0).$$
On the one hand, conditioned on $D'_{1:j-1} = d_{1:j-1}$, $D'_j = \{D'_{j,x}\}_{x=1}^\infty$ is the joint distribution of the numbers of successes of each type $x$, where there are $K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}$ independent trials and a type-$x$ success has probability $\tilde h_c(x\mid x_{1:j-1}=0)$, by Proposition C.2. Because of Lemma D.4 and Eq. (8):
$$d_{TV}\left(P_{D'_j\mid D'_{1:j-1}=d_{1:j-1}},\ P_U\right) \le \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\left(\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:j-1}=0)\right)^2 \le K\left(\frac{1}{K}\,\frac{C_1}{j-1+C_2}\right)^2 \le \frac{C_1^2}{K\,(j-1+C_2)^2}. \quad (F.9)$$
On the other hand, conditioned on $D_{1:j-1}$, $D_j = \{D_{j,x}\}_{x=1}^\infty$ consists of independent Poissons, where the mean of $D_{j,x}$ is $M_{j,x}$ by Proposition C.1. We recursively apply Lemma D.8 and Lemma D.5:
$$d_{TV}(P_U, P_{D_j}) \le \sum_{x=1}^\infty d_{TV}(P_{U_x}, P_{D_{j,x}}) \le \sum_{x=1}^\infty\left|M_{j,x} - \left(K - \sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\tilde h_c(x\mid x_{1:j-1}=0)\right|$$
$$\le \sum_{x=1}^\infty\left|M_{j,x} - K\tilde h_c(x\mid x_{1:j-1}=0)\right| + \left(\sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}\right)\left(\sum_{x=1}^\infty\tilde h_c(x\mid x_{1:j-1}=0)\right). \quad (F.10)$$
The first term is upper bounded by Eq. (10). Regarding the second term, since we are in the typical set, $\sum_{i=1}^{j-1}\sum_{y=1}^\infty d_{i,y}$ is upper bounded, so the overall bound on the second term is:
$$(\beta+1)\max\big(C(K,C_2), C(N,C_2)\big)\,\frac{1}{K}\,\frac{C_1}{j-1+C_2}.$$
Combining the two bounds gives the bound on $d_{TV}(P_U, P_{D_j})$:
$$\frac{1}{K}\,\frac{C_3\ln j + C_4}{j-1+C_2} + (\beta+1)\max\big(C(K,C_2), C(N,C_2)\big)\,\frac{1}{K}\,\frac{C_1}{j-1+C_2} \le \frac{\text{constant}\,(\ln N + \ln K)}{K\,(j-1+C_2)}. \quad (F.11)$$
Combining Eqs. (F.9) and (F.11) gives the upper bound in Eq. (F.8). The summation of the right hand side of Eq. (F.8) across $j$ leads to:
$$\sum_{j=1}^N B_j \le \frac{\text{constant}}{K} + \frac{\text{constant}\,(\ln N + \ln K)\ln N}{K}. \quad (F.12)$$
In all, because of Eqs. (F.7) and (F.12), we can couple $D$ and $D'$ such that $P(D \ne D')$ is at most:
$$\frac{\text{constant}}{K} + \frac{\text{constant}\,(\ln N + \ln K)\ln N}{K}. \quad (F.13)$$
Aggregating the results from Eqs. (F.4), (F.5) and (F.13), we are done.
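The $(d, s)$ parametrization used in the proof is mechanical to compute from a feature-allocation matrix. The sketch below is our own illustration (the matrix is hypothetical): it reads off $d_{n,x}$ and the histories $s_j$ exactly as defined above.

```python
import numpy as np

# F: feature-allocation matrix; rows = observations, columns = atom locations.
F = np.array([[2, 0, 1, 0],
              [1, 3, 0, 0],
              [0, 1, 0, 2]])
N, J = F.shape

first_row = [np.flatnonzero(F[:, j])[0] for j in range(J)]   # I_j (0-indexed)

# d[n][x]: number of columns first manifested by observation n with atom size x.
d = {}
for j in range(J):
    i, size = first_row[j], F[first_row[j], j]
    d.setdefault(i, {})
    d[i][size] = d[i].get(size, 0) + 1

# s[j]: history of column j from its first manifestation onward.
s = [F[first_row[j]:, j].tolist() for j in range(J)]
print(d)   # {0: {2: 1, 1: 1}, 1: {3: 1}, 2: {2: 1}}
print(s)   # [[2, 1, 0], [3, 1], [1, 0, 0], [2]]
```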
F.2. Lower bound
Proof of Theorem 4.3.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid X) := \delta_X(\cdot)$. With this conditional likelihood, $X_n = Y_n$ and $Z_n = W_n$, meaning:
$$d_{TV}(P_{N,\infty}, P_{N,K}) = d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}}).$$
Now we discuss why the total variation is lower bounded by the function of $N$. Let $A$ be the event that there are at least $\frac{\gamma}{2}C(N,\alpha)$ unique atom locations among the latent states:
$$A := \left\{x_{1:N} : \#\{\text{unique atom locations in } x_{1:N}\} \ge \frac{\gamma\,C(N,\alpha)}{2}\right\}.$$
The probabilities assigned to this event by the approximate and the target models are very different from each other. On the one hand, since $K < \frac{\gamma\,C(N,\alpha)}{2}$, under IFA$_K$, $A$ has measure zero:
$$P_{Z_{1:N}}(A) = 0. \quad (F.14)$$
On the other hand, under beta-Bernoulli, the number of unique atom locations drawn is a Poisson random variable with mean exactly $\gamma\,C(N,\alpha)$ — see Proposition C.1 and Example 4.2. The complement of $A$ is a lower tail event. By Lemma D.3 with $\lambda = \gamma\,C(N,\alpha)$ and $x = \gamma\,C(N,\alpha)/2$:
$$P_{X_{1:N}}(A) \ge 1 - \exp\left(-\frac{\gamma\,C(N,\alpha)}{8}\right). \quad (F.15)$$
Because of Lemma D.9, we can lower bound $C(N,\alpha)$ by a multiple of $\ln N$:
$$\exp\left(-\frac{\gamma\,C(N,\alpha)}{8}\right) \le \exp\left(-\frac{\gamma\alpha\ln N - \alpha\gamma(\psi(\alpha)+1)}{8}\right) = \frac{\text{constant}}{N^{\gamma\alpha/8}}.$$
We now combine Eqs. (F.14) and (F.15) and recall that total variation is the maximum over probability discrepancies.

The proof of Theorem 4.4 relies on the ability to compute a lower bound on the total variation distance between a Binomial distribution and a Poisson distribution.
Proposition F.1 (Lower bound on total variation between Binomial and Poisson). For all $K$, it is true that:
$$d_{TV}\left(\mathrm{Poisson}(\gamma),\ \mathrm{Binom}\left(K,\ \frac{\gamma/K}{\gamma/K+1}\right)\right) \ge C(\gamma)\,K\left(\frac{\gamma/K}{\gamma/K+1}\right)^2,$$
where:
$$C(\gamma) = \frac{1}{8}\,\frac{1}{\gamma + \exp(-1)(\gamma+1)\max(12\gamma^2,\ 48\gamma,\ 28)}.$$
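Before the proof, the inequality is easy to probe numerically. The following is a minimal sketch (ours, not the paper's code), taking $C(\gamma)$ as reconstructed in the statement above; the exact total variation should dominate the stated lower bound for every $K$:

```python
import numpy as np
from scipy.stats import binom, poisson

def tv_binom_poisson(gamma, K, trunc=200):
    """Exact TV between Binom(K, (g/K)/(g/K+1)) and Poisson(g), truncated sum."""
    x = np.arange(trunc)
    p = (gamma / K) / (gamma / K + 1.0)
    return 0.5 * np.abs(binom.pmf(x, K, p) - poisson.pmf(x, gamma)).sum()

gamma = 1.0
C = (1.0 / 8.0) / (gamma + np.exp(-1.0) * (gamma + 1.0)
                   * max(12 * gamma**2, 48 * gamma, 28))
for K in [5, 10, 50, 100]:
    p = (gamma / K) / (gamma / K + 1.0)
    print(K, tv_binom_poisson(gamma, K), C * K * p**2)
```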
Proof of Proposition F.1. We adapt the proof of (Barbour and Hall, 1984, Theorem 2) to our setting. The Poisson($\gamma$) distribution satisfies the functional equality:
$$E[\gamma\,y(Z+1) - Z\,y(Z)] = 0, \quad (F.16)$$
where $y$ is any real-valued function and $Z \sim \mathrm{Poisson}(\gamma)$.

Denote $\gamma_K = \frac{\gamma}{\gamma/K+1}$. For $m \in \mathbb{N}$, let
$$x(m) = m\exp\left(-\frac{m^2}{\theta\gamma_K}\right),$$
where $\theta$ is a constant which will be specified later. $x(m)$ serves as a test function to lower bound the total variation distance between $\mathrm{Poisson}(\gamma)$ and $\mathrm{Binom}(K, \gamma_K/K)$. Let $X_i \sim \mathrm{Ber}(\gamma_K/K)$, independently across $i$ from 1 to $K$, and $W = \sum_{i=1}^K X_i$. Then $W \sim \mathrm{Binomial}(K, \gamma_K/K)$. The following identity is adapted from (Barbour and Hall, 1984, Equation 2.1):
$$E[\gamma_K\,x(W+1) - W\,x(W)] = \left(\frac{\gamma_K}{K}\right)^2\sum_{i=1}^K E[x(W_i+2) - x(W_i+1)], \quad (F.17)$$
where $W_i = W - X_i$.

We first argue that the right hand side is not too small, i.e. for any $i$:
$$E[x(W_i+2) - x(W_i+1)] \ge 1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta\gamma_K}. \quad (F.18)$$
Consider the derivative of $x(m)$:
$$\frac{d}{dm}x(m) = \exp\left(-\frac{m^2}{\theta\gamma_K}\right)\left(1 - \frac{2m^2}{\theta\gamma_K}\right) \ge 1 - \frac{3m^2}{\theta\gamma_K},$$
because of the easy-to-verify inequality $e^{-x}(1-2x) \ge 1-3x$ for $x \ge 0$. This means:
$$x(W_i+2) - x(W_i+1) \ge \int_{W_i+1}^{W_i+2}\left(1 - \frac{3m^2}{\theta\gamma_K}\right)dm = 1 - \frac{3W_i^2 + 9W_i + 7}{\theta\gamma_K}.$$
Taking expectations, noting that $E(W_i) \le \gamma_K$ and $E(W_i^2) = \mathrm{Var}(W_i) + [E(W_i)]^2 \le \sum_{j=1}^K\frac{\gamma_K}{K} + \gamma_K^2 = \gamma_K + \gamma_K^2$, we have proven Eq. (F.18).

Now, because of the positivity of $x$, and that $\gamma \ge \gamma_K$, we trivially have:
$$E[\gamma\,x(W+1) - W\,x(W)] \ge E[\gamma_K\,x(W+1) - W\,x(W)]. \quad (F.19)$$
Combining Eq. (F.17), Eq. (F.18) and Eq. (F.19) we have:
$$E[\gamma\,x(W+1) - W\,x(W)] \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2 + 12\gamma_K + 7}{\theta\gamma_K}\right).$$
Recalling Eq. (F.16), for any coupling $(W, Z)$ such that $W \sim \mathrm{Binom}\!\left(K, \frac{\gamma/K}{\gamma/K+1}\right)$ and $Z \sim \mathrm{Poisson}(\gamma)$:
$$E[\gamma(x(W+1) - x(Z+1)) + Z\,x(Z) - W\,x(W)] \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}\right).$$
Suppose $(W, Z)$ is the maximal coupling attaining the total variation distance between $P_W$ and $P_Z$, i.e. $P(W \ne Z) = d_{TV}(P_W, P_Z)$. Clearly:
$$\gamma(x(W+1) - x(Z+1)) + Z\,x(Z) - W\,x(W) \le \mathbf{1}\{W \ne Z\}\sup_{m_1, m_2}\left|(\gamma x(m_1+1) - m_1 x(m_1)) - (\gamma x(m_2+1) - m_2 x(m_2))\right| \le 2\cdot\mathbf{1}\{W \ne Z\}\sup_m\left|\gamma x(m+1) - m x(m)\right|.$$
Taking expectations on both sides, we conclude that:
$$2\,d_{TV}(P_W, P_Z)\,\sup_m\left|\gamma x(m+1) - m x(m)\right| \ge K\left(\frac{\gamma_K}{K}\right)^2\left(1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}\right). \quad (F.20)$$
It remains to upper bound $\sup_m|\gamma x(m+1) - m x(m)|$. Recall that the derivative of $x$ is $\exp(-m^2/(\theta\gamma_K))(1 - 2m^2/(\theta\gamma_K))$, taking values in $[-2e^{-3/2}, 1]$, so $-2e^{-3/2} \le x(m+1) - x(m) \le 1$. Hence:
$$|\gamma x(m+1) - m x(m)| = |\gamma(x(m+1) - x(m)) + (\gamma - m)x(m)| \le \gamma + (m+\gamma)\,m\exp\left(-\frac{m^2}{\theta\gamma_K}\right) \le \gamma + (\gamma+1)\,m^2\exp\left(-\frac{m^2}{\theta\gamma_K}\right) \le \gamma + \theta\gamma_K(\gamma+1)\exp(-1), \quad (F.21)$$
where the last inequality owes to the easy-to-verify $x\exp(-x) \le \exp(-1)$. Combining Eq. (F.20) and Eq. (F.21):
$$d_{TV}\left(\mathrm{Binomial}\left(K, \frac{\gamma/K}{\gamma/K+1}\right),\ \mathrm{Poisson}(\gamma)\right) \ge \frac{1}{2}\,\frac{1 - \frac{3\gamma_K^2+12\gamma_K+7}{\theta\gamma_K}}{\gamma + (\gamma+1)\theta\gamma_K\exp(-1)}\,K\left(\frac{\gamma_K}{K}\right)^2.$$
Finally, we calibrate $\theta$. By selecting $\theta = \max(12\gamma_K,\ 48,\ 28/\gamma_K)$, so that $\theta\gamma_K = \max(12\gamma_K^2, 48\gamma_K, 28)$, we have that the numerator of the unwieldy fraction is at least $\frac{1}{4}$ and its denominator is at most $\gamma + \exp(-1)(\gamma+1)\max(12\gamma^2, 48\gamma, 28)$, since $\gamma_K < \gamma$. This completes the proof.
Proof of Theorem 4.4. First we mention which probability kernel $f$ results in the large total variation distance. For any discrete measure $\sum_{i=1}^M\delta_{\psi_i}$, $f$ is the Dirac measure sitting on $M$, the number of atoms:
$$f\left(\cdot\,\Big|\,\sum_{i=1}^M\delta_{\psi_i}\right) := \delta_M(\cdot). \quad (F.22)$$
Now we show that under such $f$, the total variation distance is lower bounded. First, observe that:
$$d_{TV}(P_{Y_{1:N}}, P_{W_{1:N}}) \ge d_{TV}(P_{Y_1}, P_{W_1}). \quad (F.23)$$
Truly, suppose $(Y_{1:N}, W_{1:N})$ is any coupling of $P_{Y_{1:N}}, P_{W_{1:N}}$. Elementarily we have $P(Y_{1:N} \ne W_{1:N}) \ge P(Y_1 \ne W_1)$. Taking the infimum over couplings to attain the total variation distance, we have shown Eq. (F.23). Hence it suffices to show:
$$d_{TV}(P_{Y_1}, P_{W_1}) \ge C(\gamma)\,\frac{\gamma^2}{K(1+\gamma/K)^2}.$$
Recall the generative processes defining $P_{Y_1}$ and $P_{W_1}$. $Y_1$ is an observation from the target beta-Bernoulli model, so by Proposition C.1:
$$N_T \sim \mathrm{Poisson}(\gamma), \qquad \psi_k \overset{iid}{\sim} H, \qquad X_1 = \sum_{k=1}^{N_T}\delta_{\psi_k}, \qquad Y_1 \sim f(\cdot\mid X_1).$$
$W_1$ is an observation from the approximate model, so by Proposition C.2:
$$N_A \sim \mathrm{Binom}\left(K,\ \frac{\gamma/K}{1+\gamma/K}\right), \qquad \phi_k \overset{iid}{\sim} H, \qquad Z_1 = \sum_{k=1}^{N_A}\delta_{\phi_k}, \qquad W_1 \sim f(\cdot\mid Z_1).$$
Because of the choice of $f$, $Y_1 = N_T$ and $W_1 = N_A$. Hence, by Proposition F.1:
$$d_{TV}(P_{Y_1}, P_{W_1}) = d_{TV}(P_{N_T}, P_{N_A}) \ge C(\gamma)\,\frac{\gamma^2}{K(1+\gamma/K)^2}.$$

Appendix G: Proofs of DP bounds
Our technique to analyze the error made by FSD$_K$ follows a similar vein to the technique in Appendix F. We compare the joint distributions of the latents $X_{1:N}$ and $Z_{1:N}$ (with the underlying $\Theta$ or $\Theta_K$ marginalized out) using the conditional distributions $X_n \mid X_{1:n-1}$ and $Z_n \mid Z_{1:n-1}$. Before going into the proofs, we give the form of the conditionals. The conditional $X_n \mid X_{1:n-1}$ is the well-known Blackwell-MacQueen prediction rule.

Proposition G.1.
(Blackwell and MacQueen, 1973) For $n = 1$, $X_1 \sim H$. For $n \ge 2$:
$$X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1 \sim \frac{\alpha}{n-1+\alpha}\,H + \sum_j\frac{n_j}{n-1+\alpha}\,\delta_{\psi_j},$$
where $\{\psi_j\}$ is the set of unique values among $X_{n-1}, X_{n-2}, \ldots, X_1$ and $n_j$ is the cardinality of the set $\{i : 1 \le i \le n-1,\ X_i = \psi_j\}$.

The conditionals $Z_n \mid Z_{1:n-1}$ are related to the Blackwell-MacQueen prediction rule.
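Proposition G.1's prediction rule is straightforward to simulate. The following is a minimal sketch (ours, not from the paper), assuming a standard-normal base measure $H$ for concreteness:

```python
import numpy as np

def blackwell_macqueen(N, alpha, seed=0):
    """Sample X_1..X_N from the Blackwell-MacQueen urn; base measure H = N(0, 1)."""
    rng = np.random.default_rng(seed)
    draws = []
    for n in range(1, N + 1):
        if rng.uniform() < alpha / (n - 1 + alpha):
            draws.append(rng.standard_normal())       # fresh draw from H
        else:
            draws.append(draws[rng.integers(n - 1)])  # repeat psi_j w.p. n_j/(n-1+alpha)
    return np.array(draws)

x = blackwell_macqueen(1000, alpha=2.0)
print(len(np.unique(x)))   # number of unique values grows like alpha * ln(N)
```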
Proposition G.2. (Pitman, 1996) For $n = 1$, $Z_1 \sim H$. For $n \ge 2$, let $\{\psi_j\}_{j=1}^{J_n}$ be the set of unique values among $Z_{n-1}, Z_{n-2}, \ldots, Z_1$ and let $n_j$ be the cardinality of the set $\{i : 1 \le i \le n-1,\ Z_i = \psi_j\}$. If $J_n < K$:
$$Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1 \sim \frac{(K - J_n)\,\alpha/K}{n-1+\alpha}\,H + \sum_{j=1}^{J_n}\frac{n_j + \alpha/K}{n-1+\alpha}\,\delta_{\psi_j}.$$
Otherwise, if $J_n = K$, there is zero probability of drawing a fresh component from $H$, i.e. $Z_n$ comes only from $\{\psi_j\}_{j=1}^{J_n}$:
$$Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1 \sim \sum_{j=1}^{J_n}\frac{n_j + \alpha/K}{n-1+\alpha}\,\delta_{\psi_j}.$$
$J_n \le K$ is an invariant of these prediction rules: once $J_n = K$, all subsequent $J_m$ for $m \ge n$ are also equal to $K$.

G.1. Upper bounds
Proof of Theorem 5.1.
First, because of Lemma D.7, it suffices to show that $d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}})$ is small, since the conditional distributions of the observations given the latent variables are the same across target and approximate models.

To show that $d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}})$ is small, we will construct a coupling of $X_{1:N}$ and $Z_{1:N}$ such that for any $n \ge 1$:
$$P(X_n \ne Z_n \mid X_{1:n-1} = Z_{1:n-1}) \le \frac{2\alpha}{K}\,\frac{J_n}{n-1+\alpha}, \quad (G.1)$$
where $J_n$ is the number of unique atom locations among $X_{1:n-1}$. Such a coupling exists because the total variation distance between the prediction rules $X_n \mid X_{1:n-1}$ and $Z_n \mid Z_{1:n-1}$ is small: as total variation is the minimum difference probability, there exists a coupling that achieves the total variation distance. Consider any measurable set $A$. If $J_n < K$, the probabilities of $A$ under the two rules are respectively:
$$\frac{\alpha(1 - J_n/K)}{n-1+\alpha}\,H(A) + \sum_{j=1}^{J_n}\frac{n_j+\alpha/K}{n-1+\alpha}\,\delta_{\psi_j}(A), \qquad \frac{\alpha}{n-1+\alpha}\,H(A) + \sum_{j=1}^{J_n}\frac{n_j}{n-1+\alpha}\,\delta_{\psi_j}(A),$$
meaning the absolute difference in probability mass is:
$$\left|\frac{\alpha}{K}\,\frac{J_n\,H(A)}{n-1+\alpha} - \frac{\alpha}{K}\sum_{j=1}^{J_n}\frac{\delta_{\psi_j}(A)}{n-1+\alpha}\right| \le \left|\frac{\alpha}{K}\,\frac{J_n\,H(A)}{n-1+\alpha}\right| + \left|\frac{\alpha}{K}\sum_{j=1}^{J_n}\frac{\delta_{\psi_j}(A)}{n-1+\alpha}\right| \le \frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha} + \frac{\alpha}{K}\,\frac{J_n}{n-1+\alpha} = \frac{2\alpha}{K}\,\frac{J_n}{n-1+\alpha}.$$
The same upper bound holds for the case $J_n = K$. The couplings for different $n$ are naturally glued together because of the recursive nature of the conditional distributions.

We now show that for the coupling satisfying Eq. (G.1), the overall probability of difference $P(X_{1:N} \ne Z_{1:N})$ is small. Define the shorthand:
$$C(N, \alpha) := \sum_{n=1}^N\frac{\alpha}{n-1+\alpha}.$$
The definition of the typical set depends on the relative deviation $\delta$, which we calibrate at the end of the proof. Define the typical set:
$$D_n := \left\{x_{1:n-1} : J_n \le (1+\delta)\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\right\}.$$
In other words, the number of unique values among the $x_{1:n-1}$ is small. The following decomposition is used to investigate the difference probability on the typical set:
$$P(X_{1:N} \ne Z_{1:N}) = P((X_{1:N-1}, X_N) \ne (Z_{1:N-1}, Z_N)) = P(X_{1:N-1} \ne Z_{1:N-1}) + P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1}). \quad (G.2)$$
The second term can be further expanded:
$$P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N) + P(X_N \ne Z_N,\ X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \notin D_N).$$
The former term is at most $P(X_N \ne Z_N \mid X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N)$, while the latter term is at most $P(X_{1:N-1} \notin D_N)$. To recap, we can bound $P(X_{1:N} \ne Z_{1:N})$ by bounding three quantities:
1. The difference probability of a shorter process, $P(X_{1:N-1} \ne Z_{1:N-1})$.
2. The difference probability of the prediction rule on typical sets, $P(X_N \ne Z_N \mid X_{1:N-1} = Z_{1:N-1},\ X_{1:N-1} \in D_N)$.
3. The probability of the atypical set, $P(X_{1:N-1} \notin D_N)$.
By recursively applying the expansion initiated in Eq. (G.2) to $P(X_{1:N-1} \ne Z_{1:N-1})$, we actually only need to bound the difference probabilities of the prediction rules on typical sets and the atypical set probabilities.

Regarding the difference probability of the prediction rules, being in the typical set allows us to control $J_n$ in Eq. (G.1). Summation across $n = 1$ through $N$ gives the overall bound of:
$$\frac{2}{K}(1+\delta)\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\,C(N,\alpha) \le \frac{\text{constant}\,\ln N\,(\ln N + \ln K)}{K}. \quad (G.3)$$
Regarding the atypical set probabilities, because $J_{n-1}$ is stochastically dominated by $J_n$ — i.e., the number of unique values at time $n$ is at least the number at time $n-1$ — all the atypical set probabilities are upper bounded by the last one, $P(X_{1:N-1} \notin D_N)$. $J_N$ is the sum of independent Poisson trials, with an overall mean equaling exactly $C(N-1, \alpha)$. Therefore, the atypical event has small probability because of Lemma D.1:
$$P\big(J_N > (1+\delta)\max(C(N-1,\alpha),\ C(K,\alpha))\big) \le \exp\left(-\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big)\right).$$
Even accounting for all $N$ atypical events, the total probability is small:
$$\exp\left(-\left(\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big) - \ln(N-1)\right)\right).$$
By Lemma D.9, $\max(C(N-1,\alpha), C(K,\alpha)) \ge \alpha\max(\ln(N-1), \ln K) - \alpha(\psi(\alpha)+1)$. Therefore, if we set $\delta$ such that $\frac{\delta^2}{2+\delta}\alpha = 2$, we have:
$$\frac{\delta^2}{2+\delta}\max\big(C(N-1,\alpha),\ C(K,\alpha)\big) - \ln(N-1) \ge \ln K - \text{constant},$$
meaning the overall atypical probability is at most:
$$\frac{\text{constant}}{K}. \quad (G.4)$$
The overall total variation bound combines Eqs. (G.3) and (G.4).
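The per-step bound in Eq. (G.1) can be checked directly: for fixed cluster counts, the two prediction rules differ only through the $\alpha/K$ corrections. A small sketch (our own illustration, with hypothetical counts):

```python
import numpy as np

def prediction_rule_tv(counts, alpha, K):
    """Exact TV between the Blackwell-MacQueen and FSD_K prediction rules given
    cluster sizes `counts` (J_n = len(counts) < K). The fresh-draw mass can be
    lumped into one entry because both rules spread it over the same diffuse H."""
    J = len(counts)
    denom = sum(counts) + alpha          # n - 1 + alpha
    bm = np.array([alpha] + list(counts)) / denom
    fsd = np.array([(K - J) * alpha / K] + [c + alpha / K for c in counts]) / denom
    return 0.5 * np.abs(bm - fsd).sum()

counts, alpha, K = [5, 3, 1], 1.0, 100
tv = prediction_rule_tv(counts, alpha, K)
bound = 2 * alpha / K * len(counts) / (sum(counts) + alpha)   # Eq. (G.1)
print(tv, bound, tv <= bound + 1e-12)
```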
Proof of Corollary 5.2. The main idea is reducing to the Dirichlet process mixture model situation. This can be done in two steps.

First, the conditional distribution of the observations $W_{1:D} \mid H_{1:D}$ of the target model is the same as the conditional distribution $Z_{1:D} \mid F_{1:D}$ of the approximate model if $H_{1:D} = F_{1:D}$. Hence to control the total variation between $P_W$ and $P_Z$ it suffices to control the total variation between $P_{H_{1:D}}$ and $P_{F_{1:D}}$, because of Lemma D.7. Second, the distance between $P_{H_{1:D}}$ and $P_{F_{1:D}}$ can be upper bounded by the distance between the atom locations that define $H_{1:D}$ and $F_{1:D}$. Recall the construction of the $F_d$ in terms of atom locations $\phi_{d,j}$ and stick-breaking weights $\gamma_{d,j}$:
$$G_K \sim \mathrm{FSD}_K(\omega, H), \qquad \phi_{dj} \mid G_K \overset{iid}{\sim} G_K(\cdot) \text{ across } d, j, \qquad \gamma_{dj} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha) \text{ across } d, j \text{ (except } \gamma_{dT} = 1\text{)},$$
$$F_d \mid \phi_{d,\cdot}, \gamma_{d,\cdot} = \sum_{i=1}^T\gamma_{di}\prod_{j<i}(1-\gamma_{dj})\,\delta_{\phi_{di}}.$$
Proof of Theorem 5.3.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid x) = \delta_x(\cdot)$. With this conditional likelihood, $X_n = Y_n$ and $Z_n = W_n$, meaning:
$$d_{TV}(P_{N,\infty}, P_{N,K}) = d_{TV}(P_{X_{1:N}}, P_{Z_{1:N}}).$$
Now we discuss why the total variation is lower bounded by the function of $N$. Let $A$ be the event that there are at least $\frac{1}{2}C(N,\alpha)$ unique components among the latent states:
$$A := \left\{x_{1:N} : \#\{\text{unique components in } x_{1:N}\} \ge \frac{C(N,\alpha)}{2}\right\}.$$
The probabilities assigned to this event by the approximate and the target models are very different from each other. On the one hand, since $K < \frac{C(N,\alpha)}{2}$, under FSD$_K$, $A$ has measure zero:
$$P_{Z_{1:N}}(A) = 0. \quad (G.5)$$
On the other hand, under the DP, the number of unique atoms drawn is the sum of Poisson trials with expectation exactly $C(N,\alpha)$. The complement of $A$ is a lower tail event. Hence by Lemma D.2 with $\delta = 1/2$ and $\mu = C(N,\alpha)$, we have:
$$P_{X_{1:N}}(A) \ge 1 - \exp\left(-\frac{C(N,\alpha)}{8}\right). \quad (G.6)$$
Because of Lemma D.9, we can lower bound $C(N,\alpha)$ by a multiple of $\ln N$:
$$\exp\left(-\frac{C(N,\alpha)}{8}\right) \le \exp\left(-\frac{\alpha\ln N - \alpha(\psi(\alpha)+1)}{8}\right) = \frac{\text{constant}}{N^{\alpha/8}}.$$
We now combine Eqs. (G.5) and (G.6) and recall that total variation is the maximum over probability discrepancies.
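The event $A$ is also easy to probe by simulation: under the DP, the number of unique clusters concentrates around $C(N,\alpha)$, so any truncation $K$ below $C(N,\alpha)/2$ cannot cover it. A sketch (ours, with hypothetical settings):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 2000, 1.0
C_N = (alpha / (np.arange(N) + alpha)).sum()   # C(N, alpha) = sum_n alpha/(n-1+alpha)

def num_unique(N, alpha):
    """Number of unique clusters in one Blackwell-MacQueen draw of length N."""
    labels, J = np.empty(N, dtype=int), 0
    for n in range(N):
        if rng.uniform() < alpha / (n + alpha):
            labels[n], J = J, J + 1            # fresh cluster
        else:
            labels[n] = labels[rng.integers(n)]
    return J

draws = np.array([num_unique(N, alpha) for _ in range(200)])
print(C_N, draws.mean())                # these should roughly agree
print((draws >= C_N / 2).mean())        # P(A) near 1; FSD_K gives 0 when K < C_N/2
```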
Proof of Theorem 5.4.
First we mention which probability kernel $f$ results in the large total variation distance: the pathological $f$ is the Dirac measure, i.e., $f(\cdot\mid x) = \delta_x(\cdot)$.

Now we show that under such $f$, the total variation distance is lower bounded. Observe that it suffices to understand the total variation between $P_{Y_1, Y_2}$ and $P_{W_1, W_2}$. Truly, suppose $(Y_{1:N}, W_{1:N})$ is any coupling of $P_{Y_{1:N}}$ and $P_{W_{1:N}}$. Elementarily we have $P(Y_{1:N} \ne W_{1:N}) \ge P((Y_1, Y_2) \ne (W_1, W_2))$. Taking the infimum, we have:
$$d_{TV}(P_{N,\infty}, P_{N,K}) \ge d_{TV}(P_{Y_1, Y_2}, P_{W_1, W_2}).$$
Since $f$ is Dirac, $X_n = Y_n$ and $Z_n = W_n$, and we have:
$$d_{TV}(P_{Y_1, Y_2}, P_{W_1, W_2}) = d_{TV}(P_{X_1, X_2}, P_{Z_1, Z_2}).$$
Now, let $((X_1, X_2), (Z_1, Z_2))$ be any coupling of $P_{X_1, X_2}$ and $P_{Z_1, Z_2}$. We have:
$$P((X_1, X_2) \ne (Z_1, Z_2)) = P(X_1 \ne Z_1) + P(X_2 \ne Z_2 \mid X_1 = Z_1)\,P(X_1 = Z_1) \ge P(X_2 \ne Z_2 \mid X_1 = Z_1),$$
where the final inequality holds because the middle expression equals $P(X_2 \ne Z_2 \mid X_1 = Z_1) + (1 - P(X_2 \ne Z_2 \mid X_1 = Z_1))\,P(X_1 \ne Z_1)$, a quantity at least the conditional probability. We now investigate how small $P(X_2 \ne Z_2 \mid X_1 = Z_1)$ can be. In the conditioning $X_1 = Z_1$, let the common atom be $\psi$. The prediction rule $X_2 \mid X_1 = \psi$ puts mass $\frac{1}{1+\alpha}$ on $\psi$, while the prediction rule $Z_2 \mid Z_1 = \psi$ puts mass $\frac{1+\alpha/K}{1+\alpha}$. This means that the total variation distance between the two prediction rules is at least:
$$\frac{1+\alpha/K}{1+\alpha} - \frac{1}{1+\alpha} = \frac{\alpha}{1+\alpha}\,\frac{1}{K}.$$
Since the probability of difference under any coupling is at least the total variation distance, we conclude that for any coupling $((X_1, X_2), (Z_1, Z_2))$:
$$P(X_2 \ne Z_2 \mid X_1 = Z_1) \ge \frac{\alpha}{1+\alpha}\,\frac{1}{K}.$$
Hence we have a lower bound on $P((X_1, X_2) \ne (Z_1, Z_2))$ itself. As the coupling was arbitrary, we take the infimum to attain the lower bound on total variation.

Appendix H: Experimental setup
H.1. Image denoising
The experiments in this section aim to isolate the effect of TFA versus IFA by fitting different approximations of the beta-Bernoulli model to denoise an image. We give a description of our models and their hyper-parameter settings. Each patch $x_i$ is flattened into a vector in $\mathbb{R}^n$. Let $I_n$ be the $n \times n$ identity matrix, and similarly for $I_K$. The base measure generating the basis elements is the same:
$$\psi_k \overset{iid}{\sim} N(0,\ n^{-1}I_n), \quad k = 1, 2, \ldots, K.$$
The observational likelihood conditioned on the feature-allocation matrix $F \in \{0,1\}^{N\times K}$ and basis elements $\{\psi_k\}_{k=1}^K$ is the same for both models:
$$\gamma_w \sim \mathrm{Gamma}(10^{-6}, 10^{-6}), \qquad \gamma_e \sim \mathrm{Gamma}(10^{-6}, 10^{-6}),$$
$$w_i \overset{iid}{\sim} N(0,\ \gamma_w^{-1}I_K), \quad i = 1, 2, \ldots, N, \qquad \epsilon_i \overset{iid}{\sim} N(0,\ \gamma_e^{-1}I_n), \quad i = 1, 2, \ldots, N,$$
$$x_i = \sum_{k=1}^K F_{i,k}\,w_{i,k}\,\psi_k + \epsilon_i, \quad i = 1, 2, \ldots, N, \quad (\mathrm{H.1})$$
where we are using the shape-rate parametrization of the gamma. Finally, how the feature-allocation matrix $F$ is generated is the sole difference between TFA and IFA. The underlying beta process being approximated has rate measure $\nu(d\theta) = \theta^{-1}\,\mathbf{1}\{\theta \le 1\}\,d\theta$.
• TFA:
$$v_k \overset{iid}{\sim} \mathrm{Beta}(1, 1), \quad \pi_k = \prod_{i=1}^k v_i, \quad k = 1, 2, \ldots, K, \qquad F_{i,k} \mid \pi_k \overset{indep}{\sim} \mathrm{Ber}(\pi_k), \quad i = 1, 2, \ldots, N.$$
• IFA:
$$\pi_k \overset{iid}{\sim} \mathrm{Beta}\left(\frac{1}{K},\ 1\right), \quad k = 1, 2, \ldots, K, \qquad F_{i,k} \mid \pi_k \overset{indep}{\sim} \mathrm{Ber}(\pi_k), \quad i = 1, 2, \ldots, N.$$
In Eq. (H.1), we enrich the basic feature-allocation structure by introducing weights $w_{i,k}$, which allow an observation to manifest a non-integer (and potentially negative) scaled version of the basis element. Following (Zhou et al., 2009), we are uninformative about the noise precisions by choosing $\mathrm{Gamma}(10^{-6}, 10^{-6})$. Regarding the choice of hyper-parameters for the underlying beta process, (Zhou et al., 2009) suggests that the performance of the denoising routine is insensitive to the choice of $\gamma$ and $\alpha$: we picked $\gamma = \alpha = 1$ for computational convenience, especially since the beta process with $\alpha = 1$ admits the simple stick-breaking construction.

The posterior over (trait, frequency) pairs and per-observation allocations is traversed for a certain number of steps using a Gibbs sampler. Each visited dictionary and assignment is used to compute each patch's mean value: the candidate output pixel value is the mean over patches covering that pixel. We aggregate the output images across Gibbs steps by a weighted averaging mechanism.
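The only difference between the two priors is how the frequencies $\pi_k$ are drawn. A minimal sketch (ours; the seed is arbitrary, $\gamma = 1$ as in the text) samples both, showing the sequentially coupled stick-breaking weights of the TFA against the exchangeable IFA draws, which could be simulated on separate workers:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_, K = 1.0, 60

# TFA: sequentially coupled weights via stick breaking (beta process, alpha = 1).
v = rng.beta(gamma_, 1.0, size=K)
pi_tfa = np.cumprod(v)                  # pi_k = v_1 * ... * v_k, decreasing in k

# IFA: independent weights; each atom is simulable in parallel.
pi_ifa = rng.beta(gamma_ / K, 1.0, size=K)

print(pi_tfa.sum(), pi_ifa.sum())       # both total masses concentrate near gamma_
```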
H.2. Topic modelling

Nearly 1 million random Wikipedia documents were downloaded and processed following (Hoffman, Bach and Blei, 2010).

IFA:
$$G_0 \sim \mathrm{FSD}_K(\omega,\ \mathrm{Dir}(\eta\mathbf{1}_V)),$$
$$G_d \sim \text{T-DP}_T(\alpha, G_0) \quad \text{independently across } d = 1, 2, \ldots, D,$$
$$\beta_{dn} \mid G_d \sim G_d(\cdot) \quad \text{independently across } n = 1, 2, \ldots, N_d,$$
$$w_{dn} \mid \beta_{dn} \sim \mathrm{Categorical}(\beta_{dn}) \quad \text{independently across } n = 1, 2, \ldots, N_d.$$
TFA:
$$G_0 \sim \text{T-DP}_K(\omega,\ \mathrm{Dir}(\eta\mathbf{1}_V)),$$
$$G_d \sim \text{T-DP}_T(\alpha, G_0) \quad \text{independently across } d = 1, 2, \ldots, D,$$
$$\beta_{dn} \mid G_d \sim G_d(\cdot) \quad \text{independently across } n = 1, 2, \ldots, N_d,$$
$$w_{dn} \mid \beta_{dn} \sim \mathrm{Categorical}(\beta_{dn}) \quad \text{independently across } n = 1, 2, \ldots, N_d.$$
Hyper-parameter settings follow (Wang, Paisley and Blei, 2011), in that $\eta = 0.01$, $\alpha = 1.0$, $\omega = 1.0$, $T = 20$.

We approximate the posterior in each model using stochastic variational inference (Hoffman et al., 2013). Both models have conditional conjugacies that allow the use of exponential-family variational distributions and closed-form expectation equations. The batch size is 500, with learning rate parametrized by $\rho_t = (t+\tau)^{-\kappa}$, where by default $\tau = 1.0$ and $\kappa = 0.9$.

We now discuss how held-out log-likelihood is computed. Each held-out document $d'$ is separated into two parts, $w_{ho}$ and $w_{obs}$, with no common words between the two. In our experiments, we set 75% of words to be observed and the remaining 25% unseen. (How each document is separated into these two parts can have an impact on the range of test log-likelihood values encountered. For instance, if the first $x\%$ of words, in order of appearance in the document, were the observed words and the last $(100-x)\%$ the unseen ones, then the test log-likelihood is low, presumably since predicting future words using only past words and without any filtering is challenging. Randomly assigning words to be observed and unseen gives better test log-likelihood.) The predictive distribution of each word $w_{new}$ in $w_{ho}$ is exactly equal to:
$$p(w_{new} \mid \mathcal{D}, w_{obs}) = \int_{\theta_{d'}, \beta} p(w_{new} \mid \theta_{d'}, \beta)\,p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs})\,d\theta_{d'}\,d\beta.$$
This is an intractable computation, as the posterior $p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs})$ is not analytical. We approximate it with a factorized distribution:
$$p(\theta_{d'}, \beta \mid \mathcal{D}, w_{obs}) \approx q(\beta \mid \mathcal{D})\,q(\theta_{d'}),$$
where $q(\beta \mid \mathcal{D})$ is fixed to be the variational approximation found during training and $q(\theta_{d'})$ minimizes the KL divergence between the variational distribution and the posterior. Operationally, we do an E-step for the document $d'$ based on the variational distribution of $\beta$ and the observed words $w_{obs}$, and discard the distribution over $z_{d',\cdot}$, the per-word topic assignments, because of the mean-field assumption. Using those approximations, the predictive approximation is approximately:
$$p(w_{new} \mid \mathcal{D}, w_{obs}) \approx \tilde p(w_{new} \mid \mathcal{D}, w_{obs}) = \sum_{k=1}^K E_q[\theta_{d'}(k)]\,E_q[\beta_k(w_{new})],$$
and the final number we report for document $d'$ is:
$$\frac{1}{|w_{ho}|}\sum_{w \in w_{ho}}\log\tilde p(w \mid \mathcal{D}, w_{obs}).$$
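Given the variational expectations, the reported quantity is a single matrix product. The following is a minimal sketch of the final scoring step (ours; the random arrays are hypothetical stand-ins for the fitted variational means):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 20, 1000
E_theta = rng.dirichlet(np.ones(K))          # E_q[theta_d'(k)], from the held-out E-step
E_beta = rng.dirichlet(np.ones(V), size=K)   # E_q[beta_k(w)], fixed from training

w_ho = rng.integers(0, V, size=50)           # vocabulary indices of the unseen words

# p~(w | D, w_obs) = sum_k E_q[theta_d'(k)] * E_q[beta_k(w)]
pred = E_theta @ E_beta                      # (V,) predictive distribution over words
score = np.log(pred[w_ho]).mean()            # average held-out log-likelihood
print(score)
```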
Appendix I: Additional experiments

I.1. Plane
The results for the plane image are Figs. I.1, I.2 and I.3.
I.2. Truck
The results for the truck image are Figs. I.4, I.5 and I.6.

Fig I.1: Original versus corrupted image for plane.
Fig I.2: PSNR versus approximation level for plane.
Fig I.3: (a) TFA training; (b) IFA training. The output of one model is a good initialization for the training of the other one. Here K = 60.
Fig I.4: Original versus corrupted images for truck.
Fig I.5: PSNR versus approximation level for truck.
Fig I.6: (a) TFA training; (b) IFA training. The output of one model is a good initialization for the training of the other one. Here K = 60.