Bayesian Update with Importance Sampling: Required Sample Size
Daniel Sanz-Alonso and Zijian Wang
University of Chicago
Abstract
Importance sampling is used to approximate Bayes' rule in many computational approaches to Bayesian inverse problems, data assimilation and machine learning. This paper reviews and further investigates the required sample size for importance sampling in terms of the χ²-divergence between target and proposal. We develop general abstract theory and illustrate through numerous examples the roles that dimension, noise-level and other model parameters play in approximating the Bayesian update with importance sampling. Our examples also facilitate a new direct comparison of standard and optimal proposals for particle filtering.

1 Introduction

Importance sampling is a mechanism to approximate expectations with respect to a target distribution using independent weighted samples from a proposal distribution. The variance of the weights --quantified by the χ²-divergence between target and proposal-- gives both necessary and sufficient conditions on the sample size to achieve a desired worst-case error over large classes of test functions. This paper contributes to the understanding of importance sampling to approximate the Bayesian update, where the target is a posterior distribution obtained by conditioning the proposal to observed data. We consider illustrative examples where the χ²-divergence between target and proposal admits a closed formula and it is hence possible to characterize explicitly the required sample size. These examples showcase the fundamental challenges that importance sampling encounters in high dimension and small noise regimes where target and proposal are far apart. They also facilitate a direct comparison of standard and optimal proposals for particle filtering.

We denote the target distribution by μ and the proposal by π, and assume that both are probability distributions on Euclidean space R^d. We further suppose that the target is absolutely continuous with respect to the proposal, and denote by g the unnormalized density between target and proposal, so that, for any suitable test function φ,

    \int_{\mathbb{R}^d} \phi(u)\,\mu(du) = \frac{\int_{\mathbb{R}^d} \phi(u)\,g(u)\,\pi(du)}{\int_{\mathbb{R}^d} g(u)\,\pi(du)}.    (1.1)

We write this succinctly as μ(φ) = π(φg)/π(g). Importance sampling approximates μ(φ) using independent samples {u^(n)}_{n=1}^N from the proposal π, computing the numerator and denominator in (1.1) by Monte Carlo integration,

    \mu(\phi) \approx \frac{\frac{1}{N}\sum_{n=1}^N \phi(u^{(n)})\,g(u^{(n)})}{\frac{1}{N}\sum_{n=1}^N g(u^{(n)})} = \sum_{n=1}^N w^{(n)}\,\phi(u^{(n)}), \qquad w^{(n)} := \frac{g(u^{(n)})}{\sum_{\ell=1}^N g(u^{(\ell)})}.    (1.2)

The weights w^(n) --called autonormalized or self-normalized since they add up to one-- can be computed as long as the unnormalized density g can be evaluated point-wise; knowledge of the normalizing constant π(g) is not needed. We write (1.2) briefly as μ(φ) ≈ μ^N(φ), where μ^N is the random autonormalized particle approximation measure

    \mu^N := \sum_{n=1}^N w^{(n)}\,\delta_{u^{(n)}}, \qquad u^{(n)} \ \text{i.i.d.} \sim \pi.    (1.3)

This paper is concerned with the study of importance sampling in Bayesian formulations of inverse problems, data assimilation and machine learning tasks [1, 26, 3, 13, 14], where the relationship μ(du) ∝ g(u)π(du) arises from application of Bayes' rule P(u|y) ∝ P(y|u)P(u); we interpret u ∈ R^d as a parameter of interest, π ≡ P(u) as a prior distribution on u, g(u) ≡ g(u; y) ≡ P(y|u) as a likelihood function which tacitly depends on observed data y ∈ R^k, and μ ≡ P(u|y) as the posterior distribution of u given y.
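As a concrete companion to (1.2)-(1.3), the following short Python sketch (our own illustration, not part of the paper; the one-dimensional Gaussian model, the function names and the parameter values are illustrative assumptions) implements the autonormalized importance sampling estimator with a Gaussian prior as proposal and a Gaussian likelihood as unnormalized density g.

```python
import numpy as np

def self_normalized_is(phi, g, proposal_sampler, N, rng=np.random.default_rng(0)):
    """Autonormalized importance sampling estimate of mu(phi), as in eq. (1.2).

    phi, g           : callables evaluated point-wise on the samples
    proposal_sampler : draws N i.i.d. samples from the proposal pi
    """
    u = proposal_sampler(N, rng)       # u^(n) i.i.d. ~ pi
    gu = g(u)                          # unnormalized weights g(u^(n))
    w = gu / gu.sum()                  # autonormalized weights w^(n), summing to one
    return np.sum(w * phi(u)), w

# Illustrative model (our own choice): prior pi = N(0, 1), observed y, and
# Gaussian likelihood g(u) = exp(-(y - u)^2 / (2 gamma^2)).
y, gamma = 1.0, 0.5
g = lambda u: np.exp(-0.5 * (y - u) ** 2 / gamma ** 2)
phi = lambda u: (u > 0).astype(float)  # a bounded test function
sampler = lambda N, rng: rng.standard_normal(N)

estimate, w = self_normalized_is(phi, g, sampler, N=10_000)
print("IS estimate of posterior P(u > 0 | y):", estimate)
print("largest autonormalized weight:", w.max())
```

The largest autonormalized weight printed at the end is the diagnostic used repeatedly in the numerical experiments of Section 3.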
With this interpretation and terminology, the goal of importance sampling is to approximate posterior expectations using prior samples. Since the prior has fatter tails than the posterior, the Bayesian setting imposes further structure on the analysis of importance sampling. In addition, there are several specific features of the application of importance sampling in Bayesian inverse problems, data assimilation and machine learning that shape our presentation and results.

First, Bayesian formulations have the potential to provide uncertainty quantification by computing several posterior quantiles. This motivates considering a worst-case error analysis [11] of importance sampling over large classes of test functions φ or, equivalently, bounding a certain distance between the random particle approximation measure μ^N and the target μ; see [1]. As we will review in Section 2, a key quantity in controlling the error of importance sampling with bounded test functions is the χ²-divergence between target and proposal, given by

    d_{\chi^2}(\mu\,\|\,\pi) = \frac{\pi(g^2)}{\pi(g)^2} - 1.

Second, importance sampling in inverse problems, data assimilation and machine learning applications is often used as a building block of more sophisticated computational methods, and in such a case there may be little or no freedom in the choice of proposal. For this reason, throughout this paper we view both target and proposal as given and we focus on investigating the required sample size for accurate importance sampling with bounded test functions, following a similar perspective as [8, 1, 25]. The complementary question of how to choose the proposal to achieve a small variance for a given test function is not considered here. This latter question is of central interest in the simulation of rare events [23] and has been widely studied since the introduction of importance sampling in [16, 15], leading to a plethora of adaptive importance sampling schemes [7].

Third, high dimensional and small noise settings are standard in inverse problems, data assimilation and machine learning, and it is essential to understand the scalability of sampling algorithms in these challenging regimes. The curse of dimension of importance sampling has been extensively investigated [4, 5, 27, 22, 9, 1]. The early works [4, 5] demonstrated a weight collapse phenomenon, by which, unless the number of samples is scaled exponentially with the dimension of the parameter, the maximum weight converges to one. The paper [1] also considered small noise limits and further emphasized the need to define precisely the dimension of learning problems. Indeed, while many inverse problems, data assimilation models and machine learning tasks are defined in terms of millions of parameters, their intrinsic dimension is often substantially lower since (i) all parameters are typically not equally important; (ii) substantial a priori information about some parameters may be available; and (iii) the data may be lower dimensional than the parameter space. Here we will provide a unified and accessible understanding of the roles that dimension, noise-level and other model parameters play in approximating the Bayesian update. We will do so through examples where it is possible to compute explicitly the χ²-divergence between target and proposal, and hence the required sample size.

Finally, in the Bayesian context the normalizing constant π(g) represents the marginal likelihood and is often computationally intractable.
This motivates our focus on the auto-normalized importance sampling estimator in (1.2), which estimates both π(gφ) and π(g) using Monte Carlo integration, as opposed to unnormalized variants of importance sampling [25].

1.1 Main Goals, Specific Contributions and Outline
The main goal of this paper is to provide a rich and unified understanding of the use of importance sampling to approximate the Bayesian update, while keeping the presentation accessible to a large audience. In Section 2 we investigate the required sample size for importance sampling in terms of the χ²-divergence between target and proposal. Section 3 builds on the results in Section 2 to illustrate through numerous examples the fundamental challenges that importance sampling encounters when approximating the Bayesian update in small noise and high dimensional settings. In Section 4 we show how our concrete examples facilitate a new direct comparison of standard and optimal proposals for particle filtering. These examples also allow us to identify model problems where the advantage of the optimal proposal over the standard one can be dramatic.

Next, we provide further details on the specific contributions of each section and link them to the literature. We refer to [1] for a more exhaustive literature review.

∙ Section 2 provides a unified perspective on the sufficiency and necessity of having a sample size of the order of the χ²-divergence between target and proposal to guarantee accurate importance sampling with bounded test functions. Our analysis and presentation are informed by the specific features that shape the use of importance sampling to approximate Bayes' rule. The key role of the second moment of the weights, equivalently of the χ²-divergence, has long been acknowledged [19, 21], and it is intimately related to an effective sample size used by practitioners to monitor the performance of importance sampling [17, 18]. A topic of recent interest is the development of adaptive importance sampling schemes where the proposal is chosen by minimizing, over some admissible family of distributions, the χ²-divergence with respect to the target [24, 2]. The main original contributions of Section 2 are Proposition 2.2 and Theorem 2.3, which demonstrate the necessity of suitably increasing the sample size with the χ²-divergence along singular limit regimes. The idea of Proposition 2.2 is inspired by [8], but adapted here from relative entropy to χ²-divergence. Our results complement sufficient conditions on the sample size derived in [1] and necessary conditions for unnormalized (as opposed to autonormalized) importance sampling in [25].

∙ In Section 3, Proposition 3.1 gives a closed formula for the χ²-divergence between posterior and prior in a linear-Gaussian Bayesian inverse problem setting. This formula allows us to investigate the scaling of the χ²-divergence (and thereby the rate at which the sample size needs to grow) in several singular limit regimes, including small observation noise, large prior covariance and large dimension. Numerical examples motivate and complement the theoretical results. In an infinite dimensional setting, Corollary 3.8 establishes an equivalence between absolute continuity, finite χ²-divergence and finite intrinsic dimension. A similar result was proved in more generality in [1] using the advanced theory of Gaussian measures in Hilbert space [6]; our presentation and proof here are elementary, while still giving the same degree of understanding.

∙ In Section 4 we follow [4, 5, 27, 28, 1] and investigate the use of importance sampling to approximate Bayes' rule within one filtering step in a linear-Gaussian setting. We build on the examples and results in Section 3 to identify model regimes where the performance of standard and optimal proposals can be dramatically different.
We refer to [12, 26] for an introduction to standard and optimal proposals for particle filtering, and to [10] for a more advanced presentation. The main original contribution of this section is Theorem 4.1, which gives a direct comparison of the χ²-divergence between target and standard/optimal proposals. This result improves on [1], where only a comparison between the intrinsic dimensions was established.

2 Importance Sampling and the χ²-divergence

The aim of this section is to demonstrate the central role of the χ²-divergence between target and proposal in determining the accuracy of importance sampling. In Subsection 2.1 we show how the χ²-divergence arises in both sufficient and necessary conditions on the sample size for accurate importance sampling with bounded test functions. Subsection 2.2 describes a well-known connection between the effective sample size and the χ²-divergence. Our investigation of importance sampling to approximate the Bayesian update, developed in Sections 3 and 4, will make use of a closed formula for the χ²-divergence between Gaussians, which we include in Subsection 2.3 for later reference.

2.1 Sufficient and Necessary Sample Size

Here we provide general sufficient and necessary conditions on the sample size in terms of

    \rho := d_{\chi^2}(\mu\,\|\,\pi) + 1.

We first review upper bounds on the worst-case bias and mean-squared error of importance sampling with bounded test functions, which imply that accurate importance sampling is guaranteed if N ≫ ρ. The proofs can be found in [1, 26] and are therefore omitted.

Proposition 2.1 (Sufficient Sample Size). It holds that

    \sup_{|\phi|_\infty \le 1} \Bigl| \mathbb{E}\bigl[\mu^N(\phi) - \mu(\phi)\bigr] \Bigr| \le \frac{12}{N}\,\rho, \qquad
    \sup_{|\phi|_\infty \le 1} \mathbb{E}\Bigl[\bigl(\mu^N(\phi) - \mu(\phi)\bigr)^2\Bigr] \le \frac{4}{N}\,\rho.
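A quick numerical illustration of Proposition 2.1 follows. This sketch is our own addition; the one-dimensional model, the test function and all parameter values are illustrative assumptions, and ρ is computed by numerical quadrature rather than in closed form.

```python
import numpy as np
from scipy import integrate, stats

# Illustrative model: proposal pi = N(0,1), unnormalized density
# g(u) = exp(-(y-u)^2/(2 gamma^2)), i.e. a Gaussian likelihood with observed y.
y, gamma = 1.0, 0.3
pi_pdf = stats.norm(0.0, 1.0).pdf
g = lambda u: np.exp(-0.5 * (y - u) ** 2 / gamma ** 2)

# rho = pi(g^2)/pi(g)^2, computed by quadrature on a generous finite window.
pig  = integrate.quad(lambda u: g(u) * pi_pdf(u), -15, 15)[0]
pig2 = integrate.quad(lambda u: g(u) ** 2 * pi_pdf(u), -15, 15)[0]
rho = pig2 / pig ** 2

phi = lambda u: np.tanh(u)            # a bounded test function, |phi|_inf <= 1
mu_phi = integrate.quad(lambda u: phi(u) * g(u) * pi_pdf(u), -15, 15)[0] / pig

rng = np.random.default_rng(0)
N, reps = 200, 2000
errs = []
for _ in range(reps):
    u = rng.standard_normal(N)
    w = g(u); w = w / w.sum()
    errs.append((np.sum(w * phi(u)) - mu_phi) ** 2)

print(f"rho = {rho:.2f},  empirical MSE = {np.mean(errs):.4f},  bound 4*rho/N = {4 * rho / N:.4f}")
```

The empirical mean-squared error for this particular bounded test function stays below the worst-case bound 4ρ/N, as the proposition guarantees.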
The next result shows the existence of bounded test functions for which the error may be large with high probability if N ≪ ρ. The idea is taken from [8], but we adapt it here to obtain a result in terms of the χ²-divergence rather than relative entropy. We denote by ḡ := g/π(g) the normalized density between μ and π, and note that ρ = π(ḡ²) = μ(ḡ).

Proposition 2.2 (Necessary Sample Size). Let U ∼ μ. For any N ≥ 1 and α ∈ (0, 1), there exists a test function φ with |φ|_∞ ≤ 1 such that

    \mathbb{P}\Bigl( \bigl|\mu^N(\phi) - \mu(\phi)\bigr| = \mathbb{P}\bigl(\bar g(U) > \alpha\rho\bigr) \Bigr) \ge 1 - \frac{N}{\alpha\rho}.    (2.1)
Proof.
Observe that for the test function φ(u) := 1{ḡ(u) ≤ αρ} we have μ(φ) = P(ḡ(U) ≤ αρ). On the other hand, μ^N(φ) = 1 if and only if ḡ(u^(n)) ≤ αρ for all 1 ≤ n ≤ N. This implies that

    \mathbb{P}\Bigl( \bigl|\mu^N(\phi) - \mu(\phi)\bigr| = \mathbb{P}\bigl(\bar g(U) > \alpha\rho\bigr) \Bigr)
    \ge 1 - N\, \mathbb{P}\bigl(\bar g(u^{(1)}) > \alpha\rho\bigr) \ge 1 - \frac{N}{\alpha\rho},    (2.2)

where the last inequality follows from Markov's inequality, since E_π[ḡ(u^(1))] = 1.

The power of Proposition 2.2 is due to the fact that in some singular limit regimes the distribution of ḡ(U) concentrates around its expected value ρ. In such a case, for any fixed α ∈ (0, 1) the probability of the event ḡ(U) > αρ will not vanish as the singular limit is approached. This idea will become clear in the proof of Theorem 2.3 below.

In Sections 3 and 4 we will investigate the required sample size for importance sampling approximation of the Bayesian update in various singular limits, where target and proposal become further apart as a result of reducing the observation noise, increasing the prior uncertainty, or increasing the dimension of the problem. To formalize the discussion in a general abstract setting, let {(μ_θ, π_θ)}_{θ>0} be a family of targets and proposals such that ρ_θ := d_{χ²}(μ_θ‖π_θ) + 1 → ∞ as θ → ∞. The parameter θ may represent for instance the size of the precision of the observation noise, the size of the prior covariance, or a suitable notion of dimension. Our next result shows a clear dichotomy in the performance of importance sampling along the singular limit depending on whether the sample size grows sublinearly or superlinearly with ρ_θ.

Theorem 2.3.
Suppose that ρ_θ → ∞ and that

    \mathcal{V} := \sup_{\theta} \frac{\mathbb{V}[\bar g_\theta(U_\theta)]}{\rho_\theta^2} < 1,

where U_θ ∼ μ_θ and ḡ_θ denotes the normalized density between μ_θ and π_θ. Let δ > 0.

(i) If N_θ = ρ_θ^{1+δ}, then

    \lim_{\theta\to\infty} \sup_{|\phi|_\infty \le 1} \mathbb{E}\Bigl[\bigl(\mu_\theta^{N_\theta}(\phi) - \mu_\theta(\phi)\bigr)^2\Bigr] = 0.    (2.3)

(ii) If N_θ = ρ_θ^{1-δ}, then there exists a fixed c ∈ (0, 1) such that

    \lim_{\theta\to\infty} \sup_{|\phi|_\infty \le 1} \mathbb{P}\Bigl( \bigl|\mu_\theta^{N_\theta}(\phi) - \mu_\theta(\phi)\bigr| > c \Bigr) = 1.    (2.4)

Proof.
The proof of (i) follows directly from Proposition 2.1. For (ii) we fix α ∈ (0, 1 − √𝒱) and c ∈ (0, 1 − 𝒱/(1−α)²). Let φ_θ(u) := 1{ḡ_θ(u) ≤ αρ_θ} as in the proof of Proposition 2.2. Then, by Chebyshev's inequality and the fact that E[ḡ_θ(U_θ)] = ρ_θ,

    \mathbb{P}\bigl(\bar g_\theta(U_\theta) > \alpha\rho_\theta\bigr)
    \ge 1 - \mathbb{P}\bigl( |\rho_\theta - \bar g_\theta(U_\theta)| \ge (1-\alpha)\rho_\theta \bigr)
    \ge 1 - \frac{\mathbb{V}[\bar g_\theta(U_\theta)]}{(1-\alpha)^2\rho_\theta^2}
    \ge 1 - \frac{\mathcal{V}}{(1-\alpha)^2} > c.

The bound in (2.2) implies that

    \mathbb{P}\Bigl( \bigl|\mu_\theta^{N_\theta}(\phi_\theta) - \mu_\theta(\phi_\theta)\bigr| > c \Bigr)
    \ge \mathbb{P}\Bigl( \bigl|\mu_\theta^{N_\theta}(\phi_\theta) - \mu_\theta(\phi_\theta)\bigr| = \mathbb{P}\bigl(\bar g_\theta(U_\theta) > \alpha\rho_\theta\bigr) \Bigr)
    \ge 1 - \frac{N_\theta}{\alpha\rho_\theta}.

This completes the proof, since if N_θ = ρ_θ^{1−δ} the right-hand side goes to 1 as θ → ∞.

The assumption that 𝒱 < 1 is verified in a concrete family of singular limits in Example 3.5 below. Earlier works gave empirical and asymptotic evidence of the need to increase N with ρ along such singular limits in order to avoid a weight-collapse phenomenon; further theoretical evidence was given for unnormalized importance sampling in [25].

2.2 χ²-divergence and Effective Sample Size

The previous subsection provides theoretical non-asymptotic and asymptotic evidence that a sample size larger than ρ is necessary and sufficient for accurate importance sampling. Here we recall a well-known connection between the χ²-divergence and the effective sample size

    \mathrm{ESS} := \frac{1}{\sum_{n=1}^N \bigl(w^{(n)}\bigr)^2},    (2.5)

widely used by practitioners to monitor the performance of importance sampling. Note that always 1 ≤ ESS ≤ N; it is intuitive that ESS = 1 if the maximum weight is one and ESS = N if the maximum weight is 1/N. To see the connection between ESS and ρ, note that

    \frac{\mathrm{ESS}}{N}
    = \frac{1}{N\sum_{n=1}^N (w^{(n)})^2}
    = \frac{\bigl(\sum_{n=1}^N g(u^{(n)})\bigr)^2}{N\sum_{n=1}^N g(u^{(n)})^2}
    = \frac{\bigl(\frac{1}{N}\sum_{n=1}^N g(u^{(n)})\bigr)^2}{\frac{1}{N}\sum_{n=1}^N g(u^{(n)})^2}
    \approx \frac{\pi(g)^2}{\pi(g^2)} = \frac{1}{\rho}.

Therefore, ESS ≈ N/ρ: if the sample-based estimate of ρ is significantly larger than N, the ESS will be small, which gives a warning sign that a larger sample size N may be needed.

2.3 χ²-divergence Between Gaussians

We conclude this section by recalling an analytical expression for the χ²-divergence between Gaussians. In order to make our presentation self-contained, we include a proof in Appendix A.

Proposition 2.4.
Let μ = 𝒩(m, C) and π = 𝒩(0, Σ). If 2Σ ≻ C, then

    \rho = \frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}}\, \exp\bigl( m'(2\Sigma - C)^{-1}m \bigr).

Otherwise, ρ = ∞.

It is important to note that non-degenerate Gaussians μ = 𝒩(m, C) and π = 𝒩(0, Σ) on R^d are always equivalent. However, ρ = ∞ unless 2Σ ≻ C. In Sections 3 and 4 we will interpret μ as a posterior and π as a prior, in which case automatically 2Σ ≻ C and ρ < ∞.

3 Importance Sampling for Inverse Problems

In this section we study the use of importance sampling in a linear Bayesian inverse problem setting where the target and the proposal represent, respectively, the posterior and the prior distribution. In Subsection 3.1 we describe our setting and we also derive an explicit formula for the χ²-divergence between the posterior and the prior. This explicit formula allows us to determine the scaling of the χ²-divergence in small noise regimes (Subsection 3.2), in the limit of large prior covariance (Subsection 3.3), and in a high dimensional limit (Subsection 3.4). Our overarching goal is to show how the sample size for importance sampling needs to grow along these limiting regimes in order to maintain the same level of accuracy.

3.1 Inverse Problem Setting and χ²-divergence Between Posterior and Prior

Let A ∈ R^{k×d} be a given design matrix and consider the linear inverse problem of recovering u ∈ R^d from data y ∈ R^k related by

    y = Au + \eta, \qquad \eta \sim \mathcal{N}(0, \Gamma),    (3.1)

where η represents measurement noise. We assume henceforth that we are in the underdetermined case k ≤ d, and that A is full rank. We follow a Bayesian perspective and set a Gaussian prior on u, u ∼ π = 𝒩(0, Σ). We assume throughout that Σ and Γ are given symmetric positive definite matrices. The solution to the Bayesian formulation of the inverse problem is the posterior distribution μ of u given y. We are interested in studying the performance of importance sampling with proposal π (the prior) and target μ (the posterior). We recall that under this linear-Gaussian model the posterior distribution is Gaussian [26], and we denote it by μ = 𝒩(m, C). In order to characterize the posterior mean m and covariance C, we introduce standard data assimilation notation

    S := A\Sigma A' + \Gamma, \qquad K := \Sigma A' S^{-1},

where K is the Kalman gain. Then we have

    m = Ky, \qquad C = (I - KA)\Sigma.    (3.2)

Proposition 2.4 allows us to obtain a closed formula for the quantity ρ = d_{χ²}(μ‖π) + 1, noting that (3.2) implies that

    2\Sigma - C = (I + KA)\Sigma = \Sigma + \Sigma A' S^{-1} A \Sigma \succ 0.

The proof of the following result is then immediate and therefore omitted.
Proposition 3.1.
Consider the inverse problem (3.1) with prior u ∼ π = 𝒩(0, Σ) and posterior μ = 𝒩(m, C) with m and C defined in (3.2). Then ρ = d_{χ²}(μ‖π) + 1 admits the explicit characterization

    \rho = \bigl( |I + KA|\,|I - KA| \bigr)^{-1/2} \exp\Bigl( y' K' \bigl[ (I + KA)\Sigma \bigr]^{-1} K y \Bigr).

In the following two subsections we employ this result to derive by direct calculation the rate at which the posterior and prior become further apart, in χ²-divergence, in small noise and large prior regimes. To carry out the analysis we use parameters γ, σ > 0 to scale the noise covariance, γ²Γ, and the prior covariance, σ²Σ.

3.2 Small Noise Regime

To illustrate the behavior of importance sampling in small noise regimes, we first introduce a motivating numerical study. A similar numerical setup was used in [4] to demonstrate the curse of dimension of importance sampling. We consider the inverse problem setting in Equation (3.1) with d = k = 5 and noise covariance γ²Γ. We conduct 18 numerical experiments with a fixed data y. For each experiment, we perform importance sampling 400 times, and report in Figure 1 a histogram with the largest autonormalized weight in each of the 400 realizations. The 18 experiments differ in the sample size N and the size of the observation noise γ. In both Figures 1.a and 1.b we consider three choices of N (rows) and three choices of γ (columns). These choices are made so that in Figure 1.a it holds that N = γ^{-4} along the bottom-left to top-right diagonal, while in Figure 1.b N = γ^{-6} along the same diagonal.

Figure 1: Noise scaling with d = k = 5. (a) N = γ^{-4}. (b) N = γ^{-6}. Histograms of the largest autonormalized weight over 400 realizations.
We can see from Figure 1.a that N = γ^{-4} is not a fast enough growth of N to avoid weight collapse: the histograms skew to the right along the bottom-left to top-right diagonal, suggesting that weight collapse (i.e. one weight dominating the rest, and therefore the variance of the weights being large) is bound to occur in the joint limit N → ∞, γ → 0 with N = γ^{-4}. In contrast, the histograms in Figure 1.b skew to the left along the same diagonal, suggesting that the probability of weight collapse is significantly reduced if N = γ^{-6}. We observe a similar behavior with other choices of dimension d by conducting experiments with sample sizes N = γ^{-d+1} and N = γ^{-d-1}, and we include the histograms with d = k = 4 in Appendix C. Our next result shows that these empirical findings are in agreement with the scaling of the χ²-divergence between target and proposal in the small noise limit.

Proposition 3.2.
Consider the inverse problem setting

    y = Au + \eta, \qquad \eta \sim \mathcal{N}(0, \gamma^2\Gamma), \qquad u \sim \pi = \mathcal{N}(0, \Sigma).

Let μ_γ denote the posterior and let ρ_γ = d_{χ²}(μ_γ‖π) + 1. Then, for almost every y,

    \rho_\gamma \sim \mathcal{O}(\gamma^{-k})

in the small noise limit γ → 0.

Proof.
Let K_γ = ΣA'(AΣA' + γ²Γ)^{-1} denote the Kalman gain. We observe that K_γ → ΣA'(AΣA')^{-1} under our standing assumption that A is full rank. Let U'ΞV be the singular value decomposition of Γ^{-1/2}AΣ^{1/2} and let {ξ_i}_{i=1}^k be its singular values. Then we have

    K_\gamma A \sim \Sigma^{1/2} A' \Gamma^{-1/2} \bigl( \Gamma^{-1/2} A\Sigma A' \Gamma^{-1/2} + \gamma^2 I \bigr)^{-1} \Gamma^{-1/2} A \Sigma^{1/2}
    = V'\Xi'U \bigl( U'\Xi V V'\Xi'U + \gamma^2 I \bigr)^{-1} U'\Xi V
    \sim \Xi'(\Xi\Xi' + \gamma^2 I)^{-1}\Xi,

where here "∼" denotes matrix similarity. It follows that I + K_γA converges to a finite limit, and so does the exponent y'K_γ'Σ^{-1}(I + K_γA)^{-1}K_γy in Proposition 3.1. On the other hand,

    |I - K_\gamma A|^{-1/2} = \Bigl( \prod_{i=1}^k \frac{\gamma^2}{\xi_i^2 + \gamma^2} \Bigr)^{-1/2} \sim \mathcal{O}(\gamma^{-k})

as γ → 0, while |I + K_γA|^{-1/2} converges to a finite positive limit. The conclusion follows.
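A short numerical check of Proposition 3.2 can be carried out with the closed formula of Proposition 3.1. The sketch below is our own illustration; the matrices, the data vector and the grid of γ values are arbitrary assumptions used only to exercise the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
d = k = 5
A = rng.standard_normal((k, d))      # full-rank design matrix (almost surely)
Sigma = np.eye(d)                    # prior covariance
Gamma = np.eye(k)                    # noise covariance before the gamma^2 scaling
y = rng.standard_normal(k)           # fixed data

def rho(gamma):
    """Closed formula of Proposition 3.1 with noise covariance gamma^2 * Gamma."""
    S = A @ Sigma @ A.T + gamma**2 * Gamma
    K = Sigma @ A.T @ np.linalg.inv(S)
    KA = K @ A
    pref = (np.linalg.det(np.eye(d) + KA) * np.linalg.det(np.eye(d) - KA)) ** -0.5
    expo = y @ K.T @ np.linalg.solve((np.eye(d) + KA) @ Sigma, K @ y)
    return pref * np.exp(expo)

for gamma in [1.0, 0.5, 0.25, 0.125]:
    r = rho(gamma)
    print(f"gamma = {gamma:6.3f}   rho = {r:12.4e}   rho * gamma^k = {r * gamma**k:10.4f}")
# The product rho * gamma^k settles towards a constant as gamma decreases,
# reflecting the rate rho_gamma = O(gamma^{-k}).
```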
3.3 Large Prior Covariance Regime

Here we illustrate the behavior of importance sampling in the limit of large prior covariance. We start again with a motivating numerical example, similar to the one reported in Figure 1. The behavior is analogous to the small noise regime, which is expected since the ratio of prior and noise covariances determines the closeness between target and proposal. Figure 2 shows that when d = k = 5 weight collapse is observed frequently when the sample size N grows as σ^4, but not so often with sample size N = σ^6. Similar histograms with d = k = 4 are included in Appendix C. These empirical results are in agreement with the theoretical growth rate of the χ²-divergence between target and proposal in the limit of large prior covariance, as we prove next.

Figure 2: Prior scaling with d = k = 5. (a) N = σ^4. (b) N = σ^6.
Proposition 3.3.
Consider the inverse problem setting

    y = Au + \eta, \qquad \eta \sim \mathcal{N}(0, \Gamma), \qquad u \sim \pi_\sigma = \mathcal{N}(0, \sigma^2\Sigma).

Let μ_σ denote the posterior and ρ_σ = d_{χ²}(μ_σ‖π_σ) + 1. Then, for almost every y,

    \rho_\sigma \sim \mathcal{O}(\sigma^{k})

in the large prior limit σ → ∞.

Proof.
Let Σ_σ = σ²Σ and let K_σ = Σ_σA'(AΣ_σA' + Γ)^{-1} be the Kalman gain. Observing that K_σ = K_γ with γ = 1/σ, we apply the proof of Proposition 3.2 and deduce that, as σ → ∞:

1. K_σ = ΣA'(AΣA' + σ^{-2}Γ)^{-1} → ΣA'(AΣA')^{-1};
2. I + K_σA has a well-defined and invertible limit;
3. |I − K_σA|^{-1/2} ∼ 𝒪(σ^k).

On the other hand, we notice that the quadratic term

    K_\sigma' \Sigma_\sigma^{-1} (I + K_\sigma A)^{-1} K_\sigma = \sigma^{-2} K_\sigma' \Sigma^{-1} (I + K_\sigma A)^{-1} K_\sigma

vanishes in the limit. The conclusion follows by Proposition 3.1.

3.4 High Dimensional Limit

In this subsection we study importance sampling in high dimensional limits. To that end, we let {a_i}_{i=1}^∞, {γ_i}_{i=1}^∞ and {σ_i}_{i=1}^∞ be infinite sequences and we define, for any d ≥ 1,

    A_d := \mathrm{diag}\{a_1, \ldots, a_d\} \in \mathbb{R}^{d\times d}, \quad
    \Gamma_d := \mathrm{diag}\{\gamma_1^2, \ldots, \gamma_d^2\} \in \mathbb{R}^{d\times d}, \quad
    \Sigma_d := \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_d^2\} \in \mathbb{R}^{d\times d}.

We then consider the inverse problem of reconstructing u ∈ R^d from data y ∈ R^d under the setting

    y = A_d u + \eta, \qquad \eta \sim \mathcal{N}(0, \Gamma_d), \qquad u \sim \pi_d = \mathcal{N}(0, \Sigma_d).    (3.3)

We denote the corresponding posterior distribution by μ_d, which is Gaussian with a diagonal covariance. Given observation y, we may find the posterior distribution μ_i of u_i by solving the one dimensional linear-Gaussian inverse problem

    y_i = a_i u_i + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \gamma_i^2), \qquad 1 \le i \le d,    (3.4)

with prior π_i = 𝒩(0, σ_i²). In this way we have defined, for each d ∈ ℕ ∪ {∞}, an inverse problem with prior and posterior

    \pi_d = \prod_{i=1}^d \pi_i, \qquad \mu_d = \prod_{i=1}^d \mu_i.    (3.5)

In Subsection 3.4.1 we include an explicit calculation in the one dimensional inverse setting (3.4), which will be used in Subsection 3.4.2 to establish the rate of growth of ρ_d = d_{χ²}(μ_d‖π_d) + 1, and thereby how the sample size needs to be scaled along the high dimensional limit d → ∞ to maintain the same accuracy. Finally, in Subsection 3.4.3 we establish from first principles and our simple one dimensional calculation the equivalence between (i) a certain notion of dimension being finite; (ii) ρ_∞ < ∞; and (iii) absolute continuity of μ_∞ with respect to π_∞.

3.4.1 One Dimensional Setting

Let a ∈ R be given and consider the one dimensional inverse problem of reconstructing u ∈ R from data y ∈ R, under the setting

    y = au + \eta, \qquad \eta \sim \mathcal{N}(0, \gamma^2), \qquad u \sim \pi = \mathcal{N}(0, \sigma^2).    (3.6)

By defining

    g(u) := \exp\Bigl( -\frac{a^2}{2\gamma^2}u^2 + \frac{ay}{\gamma^2}u \Bigr),

we can write the posterior distribution μ(du) as μ(du) ∝ g(u)π(du). The next result gives a simplified closed formula for ρ = d_{χ²}(μ‖π) + 1. In addition, it gives a closed formula for the Hellinger integral

    \mathcal{H}(\mu, \pi) := \frac{\pi(\sqrt{g})}{\sqrt{\pi(g)}},

which will facilitate the study of the case d = ∞ in Subsection 3.4.3.

Lemma 3.4.
Consider the inverse problem in (3.6). Let λ := a²σ²/γ² and z² := y²/(a²σ² + γ²). Then, for any ℓ > 0,

    \frac{\pi(g^\ell)}{\pi(g)^\ell} = \frac{(\lambda+1)^{\ell/2}}{\sqrt{\ell\lambda + 1}} \exp\Bigl( \frac{(\ell^2-\ell)\,\lambda}{2(\ell\lambda+1)}\, z^2 \Bigr).    (3.7)

In particular,

    \rho = \frac{\lambda+1}{\sqrt{2\lambda+1}} \exp\Bigl( \frac{\lambda}{2\lambda+1}\, z^2 \Bigr),    (3.8)

    \mathcal{H}(\mu, \pi) = \sqrt{\frac{2\sqrt{\lambda+1}}{\lambda+2}}\, \exp\Bigl( -\frac{\lambda z^2}{4(\lambda+2)} \Bigr).    (3.9)

Proof.
A direct calculation shows that

    \pi(g) = \frac{1}{\sqrt{\lambda+1}} \exp\Bigl( \frac{\lambda y^2}{2(a^2\sigma^2 + \gamma^2)} \Bigr) = \frac{1}{\sqrt{\lambda+1}} \exp\Bigl( \frac{\lambda z^2}{2} \Bigr).

The same calculation, but replacing γ² by γ²/ℓ and λ by ℓλ, gives a similar expression for π(g^ℓ), which leads to (3.7). The other two equations follow by setting ℓ to be 2 and 1/2.

Lemma 3.4 will be used in the two following subsections to study high dimensional limits. Here we show how this lemma also allows us to verify directly that the assumption 𝒱 < 1 of Theorem 2.3 can hold along singular limits.

Example 3.5.
Consider a sequence of inverse problems of the form (3.6) with λ = a²σ²/γ² approaching infinity. Let {(μ_λ, π_λ)}_{λ>0} be the corresponding family of posteriors and priors and let ḡ_λ be the normalized density. Lemma 3.4 implies that

    \frac{\pi_\lambda(\bar g_\lambda^3)}{\pi_\lambda(\bar g_\lambda^2)^2}
    = \frac{2\lambda+1}{\sqrt{(3\lambda+1)(\lambda+1)}} \exp\Bigl( \frac{\lambda z^2}{(2\lambda+1)(3\lambda+1)} \Bigr)
    \to \frac{2}{\sqrt{3}} < 2, \qquad \lambda \to \infty.

This implies that, for λ sufficiently large,

    \frac{\mathbb{V}[\bar g_\lambda(U_\lambda)]}{\rho_\lambda^2} = \frac{\pi_\lambda(\bar g_\lambda^3)}{\pi_\lambda(\bar g_\lambda^2)^2} - 1 < 1.

3.4.2 Growth of ρ_d with the Dimension

Now we investigate the behavior of importance sampling in the limit of large dimension, in the inverse problem setting (3.3). We start with an example similar to the ones in Figure 1 and Figure 2. Figure 3 shows that, for a fixed constant value λ_i ≡ λ, weight collapse is observed frequently if N grows only polynomially in d, but not so often if N grows at rate 𝒪(∏_{i=1}^d (λ+1)/√(2λ+1) · e^{λz_i²/(2λ+1)}). Similar histograms for a larger value of λ are included in Appendix C. These empirical findings agree with the growth rate of ρ_d in the large d limit.

Figure 3: Dimensional scaling with constant λ_i ≡ λ. (a) N = 𝒪(∏_{i=1}^d (λ+1)/√(2λ+1) · e^{λz_i²/(2λ+1)}). (b) N polynomial in d.
Proposition 3.6.
For any d ∈ ℕ ∪ {∞},

    \rho_d = \prod_{i=1}^d \Bigl( \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\, e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}} \Bigr), \qquad
    \mathbb{E}_{z^d}[\rho_d] = \prod_{i=1}^d (\lambda_i + 1).

Proof.
The formula for ρ_d is a direct consequence of Equation (3.8) and the product structure (3.5). Similarly, averaging over the data, under which the z_i are independent standard normal, we have

    \mathbb{E}_{z_i}\Bigl[ \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\, e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}} \Bigr]
    = \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z_i^2}{2} + \frac{\lambda_i z_i^2}{2\lambda_i+1}}\, dz_i
    = \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z_i^2}{2(2\lambda_i+1)}}\, dz_i
    = \lambda_i + 1.

Proposition 3.6 implies that, for d ∈ ℕ ∪ {∞},

    \sup_{|\phi|_\infty \le 1} \mathbb{E}\bigl[(\mu_d^N(\phi) - \mu_d(\phi))^2\bigr] \le \frac{4}{N} \prod_{i=1}^d \Bigl( \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\, e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}} \Bigr), \qquad
    \mathbb{E}\Bigl[ \sup_{|\phi|_\infty \le 1} \mathbb{E}\bigl[(\mu_d^N(\phi) - \mu_d(\phi))^2\bigr] \Bigr] \le \frac{4}{N} \prod_{i=1}^d (\lambda_i + 1).

Note that the outer expected value in the latter equation averages over the data, while the inner one averages over sampling from the prior π_d. This suggests that

    \log \mathbb{E}\Bigl[ \sup_{|\phi|_\infty \le 1} \mathbb{E}\bigl[(\mu_d^N(\phi) - \mu_d(\phi))^2\bigr] \Bigr] \lesssim \sum_{i=1}^d \lambda_i.

The quantity τ := Σ_{i=1}^d λ_i has been used as an intrinsic dimension of the inverse problem (3.3). This simple heuristic together with Theorem 2.3 suggest that increasing N exponentially with τ is both necessary and sufficient to maintain accurate importance sampling along the high dimensional limit d → ∞. In particular, if all coordinates of the problem play the same role, this implies that N needs to grow exponentially with d, a manifestation of the curse of dimension of importance sampling [1, 4, 5].

3.4.3 The Case d = ∞: Absolute Continuity and Effective Dimension

Finally, we investigate the case d = ∞. Our goal in this subsection is to establish a connection between the effective dimension, the quantity ρ_∞, and absolute continuity. The main result, Corollary 3.8, had been proved in more generality in [1]. However, our proof and presentation here require minimal technical background and are based on the explicit calculations obtained in the previous subsections and in the following lemma.

Lemma 3.7.
It holds that μ_∞ is absolutely continuous with respect to π_∞ if and only if

    \mathcal{H}(\mu_\infty, \pi_\infty) = \prod_{i=1}^\infty \frac{\pi_i(\sqrt{g_i})}{\sqrt{\pi_i(g_i)}} > 0,    (3.10)

where g_i is an unnormalized density between μ_i and π_i. Moreover, we have the following explicit characterizations of the Hellinger integral ℋ(μ_∞, π_∞) and its average with respect to data realizations:

    \mathcal{H}(\mu_\infty, \pi_\infty) = \prod_{i=1}^\infty \biggl( \sqrt{\frac{2\sqrt{\lambda_i+1}}{\lambda_i+2}}\; e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)}} \biggr), \qquad
    \mathbb{E}_{z^\infty}\bigl[ \mathcal{H}(\mu_\infty, \pi_\infty) \bigr] = \prod_{i=1}^\infty \frac{2(\lambda_i+1)^{1/4}}{\sqrt{3\lambda_i+4}}.

Proof.
The formula for the Hellinger integral is a direct consequence of Equation (3.9) and the product structure. On the other hand,

    \mathbb{E}_{z_i}\biggl[ \sqrt{\frac{2\sqrt{\lambda_i+1}}{\lambda_i+2}}\; e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)}} \biggr]
    = \sqrt{\frac{2\sqrt{\lambda_i+1}}{\lambda_i+2}} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)} - \frac{z_i^2}{2}}\, dz_i
    = \frac{2(\lambda_i+1)^{1/4}}{\sqrt{3\lambda_i+4}}.

The proof of the equivalence between positivity of the Hellinger integral and absolute continuity is given in Appendix B.
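As a small numerical companion to Lemma 3.7 (our own sketch; the sequences λ_i and all parameter values are arbitrary illustrative choices), one can evaluate the partial Hellinger products: for summable λ_i the product stabilizes at a positive value, while for non-summable λ_i it keeps decreasing towards zero, signalling loss of absolute continuity.

```python
import numpy as np

rng = np.random.default_rng(4)
d_max = 2000
z = rng.standard_normal(d_max)   # z_i are standard normal under the model

def hellinger_partial_products(lam, z):
    """Partial products of the per-coordinate Hellinger factors of eq. (3.9)."""
    lam = np.asarray(lam)
    factors = np.sqrt(2 * np.sqrt(lam + 1) / (lam + 2)) * np.exp(-lam * z**2 / (4 * (lam + 2)))
    return np.cumprod(factors)

lam_summable = 1.0 / np.arange(1, d_max + 1) ** 2   # sum(lambda_i) < infinity
lam_harmonic = 1.0 / np.arange(1, d_max + 1)        # sum(lambda_i) = infinity

for name, lam in [("summable", lam_summable), ("non-summable", lam_harmonic)]:
    H = hellinger_partial_products(lam, z)
    print(f"{name:13s}: H_100 = {H[99]:.4f},  H_2000 = {H[-1]:.4f}")
# The summable case approaches a positive limit; the non-summable case keeps
# decreasing towards zero (slowly, since lambda_i -> 0 in this example).
```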
Corollary 3.8.
The following statements are equivalent:

(i) τ = Σ_{i=1}^∞ λ_i < ∞;

(ii) ρ_∞ < ∞ for almost every y;

(iii) μ_∞ ≪ π_∞ for almost every y.

Proof. Observe that if λ_i does not converge to 0 then all three statements fail; we may therefore assume that λ_i → 0.

(i) ⇔ (ii): By Proposition 3.6,

    \log\bigl( \mathbb{E}_{z^\infty}[\rho_\infty] \bigr) = \sum_{i=1}^\infty \log(1+\lambda_i) = \mathcal{O}\Bigl( \sum_{i=1}^\infty \lambda_i \Bigr),

since log(1+λ_i) ≈ λ_i for large i.

(i) ⇔ (iii): Similarly, we have

    \log\bigl( \mathbb{E}_{z^\infty}[\mathcal{H}(\mu_\infty, \pi_\infty)] \bigr)
    = -\frac{1}{4}\sum_{i=1}^\infty \log\frac{(3\lambda_i+4)^2}{16(\lambda_i+1)}
    = -\frac{1}{4}\sum_{i=1}^\infty \log\Bigl( 1 + \frac{9\lambda_i^2 + 8\lambda_i}{16\lambda_i+16} \Bigr)
    = -\mathcal{O}\Bigl( \sum_{i=1}^\infty \lambda_i \Bigr).

The conclusion follows from Lemma 3.7.
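Before turning to data assimilation, the following sketch (our own addition; the λ_i sequences are arbitrary assumptions) evaluates the product formula of Proposition 3.6 for a summable and a non-summable sequence λ_i, making the role of the intrinsic dimension τ = Σλ_i concrete: ρ_d stabilizes in the first case and grows exponentially in the second.

```python
import numpy as np

rng = np.random.default_rng(2)

def rho_d(lambdas, z):
    """Product formula of Proposition 3.6, computed in log space for stability."""
    lam = np.asarray(lambdas)
    log_factors = np.log(lam + 1) - 0.5 * np.log(2 * lam + 1) + lam * z**2 / (2 * lam + 1)
    return np.exp(np.cumsum(log_factors))       # rho_d for d = 1, 2, ..., len(lam)

d_max = 50
z = rng.standard_normal(d_max)                  # z_i are standard normal under the model

lam_summable = 1.0 / np.arange(1, d_max + 1) ** 2   # tau = sum(lambda_i) finite
lam_constant = np.full(d_max, 1.2)                  # tau grows linearly in d

for name, lam in [("summable lambda_i", lam_summable), ("constant lambda_i", lam_constant)]:
    r = rho_d(lam, z)
    print(f"{name}: rho_10 = {r[9]:.3e},  rho_50 = {r[49]:.3e},  tau_50 = {lam.sum():.2f}")
```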
4 Importance Sampling for Data Assimilation
In this section, we study the use of importance sampling in a particle filtering setting. Following [4, 5, 27] we focus on one filtering step. Our goal is to provide a new and concrete comparison of two proposals, referred to as standard and optimal in the literature [1]. In Subsection 4.1 we introduce the setting and both proposals, and show that the χ²-divergence between target and standard proposal is larger than the χ²-divergence between target and optimal proposal. Subsections 4.3 and 4.4 identify small noise and large dimensional limiting regimes where the sample size for the standard proposal needs to grow unboundedly to maintain the same level of accuracy, but the required sample size for the optimal proposal remains bounded.

4.1 One-Step Filtering Setting: Standard and Optimal Proposals

Let M and H be given matrices. We consider the one-step filtering problem of recovering (v_0, v_1) from data y, under the following setting:

    v_1 = M v_0 + \xi, \qquad v_0 \sim \mathcal{N}(0, P), \qquad \xi \sim \mathcal{N}(0, Q),    (4.1)
    y = H v_1 + \zeta, \qquad \zeta \sim \mathcal{N}(0, R).    (4.2)

Similar to the setting in Subsection 3.1, we assume that P, Q, R are symmetric positive definite and that M and H are full rank. From a Bayesian point of view, we would like to sample from the target distribution P_{v_0, v_1 | y}. To achieve this, we can use either π_std = P_{v_1|v_0} P_{v_0} or π_opt = P_{v_1|v_0,y} P_{v_0} as the proposal distribution.

The standard proposal π_std is the prior distribution of (v_0, v_1) determined by the prior v_0 ∼ 𝒩(0, P) and the signal dynamics encoded in Equation (4.1). Then assimilating the observation y leads to an inverse problem [1, 26] with design matrix, noise covariance, and prior covariance given by

    A_{\mathrm{std}} := H, \qquad \Gamma_{\mathrm{std}} := R, \qquad \Sigma_{\mathrm{std}} := M P M' + Q.    (4.3)

We denote by π_std = 𝒩(0, Σ_std) the prior distribution and by μ_std the corresponding posterior distribution.

The optimal proposal π_opt samples v_0 from the prior and v_1 from the conditional kernel P_{v_1|v_0,y}. Then assimilating y leads to the inverse problem [1, 26]

    y = H M v_0 + H\xi + \zeta,

where the design matrix, noise covariance and prior covariance are given by

    A_{\mathrm{opt}} := HM, \qquad \Gamma_{\mathrm{opt}} := H Q H' + R, \qquad \Sigma_{\mathrm{opt}} := P.    (4.4)

We denote by π_opt = 𝒩(0, Σ_opt) the prior distribution and by μ_opt the corresponding posterior distribution.

4.2 χ²-divergence Comparison between Standard and Optimal Proposal

Here we show that

    \rho_{\mathrm{std}} := d_{\chi^2}(\mu_{\mathrm{std}} \,\|\, \pi_{\mathrm{std}}) + 1 > d_{\chi^2}(\mu_{\mathrm{opt}} \,\|\, \pi_{\mathrm{opt}}) + 1 =: \rho_{\mathrm{opt}}.

The proof is a direct calculation using the explicit formula in Proposition 3.1. We introduce, as in Section 3, standard Kalman notation

    K_{\mathrm{std}} := \Sigma_{\mathrm{std}} A_{\mathrm{std}}' S_{\mathrm{std}}^{-1}, \qquad S_{\mathrm{std}} := A_{\mathrm{std}} \Sigma_{\mathrm{std}} A_{\mathrm{std}}' + \Gamma_{\mathrm{std}},
    K_{\mathrm{opt}} := \Sigma_{\mathrm{opt}} A_{\mathrm{opt}}' S_{\mathrm{opt}}^{-1}, \qquad S_{\mathrm{opt}} := A_{\mathrm{opt}} \Sigma_{\mathrm{opt}} A_{\mathrm{opt}}' + \Gamma_{\mathrm{opt}}.

It follows from the definitions in (4.3) and (4.4) that

    S_{\mathrm{std}} = H(MPM' + Q)H' + R = HMPM'H' + HQH' + R = S_{\mathrm{opt}}.

Since S_std = S_opt we drop the subscripts in what follows, and denote both simply by S.

Theorem 4.1.
Consider the one-step filtering setting in Equations (4.1) and (4.2). If M and H are full rank and P, Q, R are symmetric positive definite, then, for almost every y,

    \rho_{\mathrm{std}} > \rho_{\mathrm{opt}}.

Proof.
By Proposition 3.1 we have

    \rho_{\mathrm{std}} = \bigl( |I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}| \bigr)^{-1/2}
        \exp\Bigl( y'K_{\mathrm{std}}'\bigl[(I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}}\bigr]^{-1}K_{\mathrm{std}}y \Bigr),
    \rho_{\mathrm{opt}} = \bigl( |I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}| \bigr)^{-1/2}
        \exp\Bigl( y'K_{\mathrm{opt}}'\bigl[(I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}}\bigr]^{-1}K_{\mathrm{opt}}y \Bigr).

Therefore, it suffices to prove the following two inequalities:

    |I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}| < |I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}|,    (4.5)

    K_{\mathrm{std}}'\bigl[(I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}}\bigr]^{-1}K_{\mathrm{std}} \succ K_{\mathrm{opt}}'\bigl[(I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}}\bigr]^{-1}K_{\mathrm{opt}}.    (4.6)

We start with inequality (4.6). Note that

    (I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}} = \Sigma_{\mathrm{std}} + \Sigma_{\mathrm{std}}A_{\mathrm{std}}'S^{-1}A_{\mathrm{std}}\Sigma_{\mathrm{std}}, \qquad
    (I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}} = \Sigma_{\mathrm{opt}} + \Sigma_{\mathrm{opt}}A_{\mathrm{opt}}'S^{-1}A_{\mathrm{opt}}\Sigma_{\mathrm{opt}}.

Using the definitions in (4.3) and (4.4) it follows that

    K_{\mathrm{std}}'\Sigma_{\mathrm{std}}^{-1}(I + K_{\mathrm{std}}A_{\mathrm{std}})^{-1}K_{\mathrm{std}}
      = S^{-1}H\bigl\{ (MPM'+Q)^{-1} + H'S^{-1}H \bigr\}^{-1}H'S^{-1}
      \succ S^{-1}H\bigl\{ (MPM')^{-1} + H'S^{-1}H \bigr\}^{-1}H'S^{-1}
      = K_{\mathrm{opt}}'\Sigma_{\mathrm{opt}}^{-1}(I + K_{\mathrm{opt}}A_{\mathrm{opt}})^{-1}K_{\mathrm{opt}},

since (MPM'+Q)^{-1} ≺ (MPM')^{-1}. For (4.5), note that

    K_{\mathrm{std}}A_{\mathrm{std}} = (MPM'+Q)H'S^{-1}H = M\tilde P M' H'S^{-1}H \sim (H'S^{-1}H)^{1/2} M\tilde P M' (H'S^{-1}H)^{1/2},
    K_{\mathrm{opt}}A_{\mathrm{opt}} = PM'H'S^{-1}HM \sim (H'S^{-1}H)^{1/2} M P M' (H'S^{-1}H)^{1/2},

where P̃ := P + M^{-1}QM'^{-1}. Therefore K_optA_opt ≺ K_stdA_std which, together with K_stdA_std ≺ I, implies that

    |I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}| - |I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}|
    = |I - (K_{\mathrm{opt}}A_{\mathrm{opt}})^2| - |I - (K_{\mathrm{std}}A_{\mathrm{std}})^2| > 0,

as desired.

4.3 Standard and Optimal Proposals in the Small Noise Limit

It is possible that along a certain limiting regime ρ diverges for the standard proposal, but not for the optimal proposal. The proposition below gives an explicit example of this scenario. Precisely, consider the following one-step filtering setting:

    v_1 = M v_0 + \xi, \qquad v_0 \sim \mathcal{N}(0, P), \qquad \xi \sim \mathcal{N}(0, Q),
    y = H v_1 + \zeta, \qquad \zeta \sim \mathcal{N}(0, r^2 R),

where r → 0. Let μ_opt^(r), μ_std^(r) be the optimal/standard targets and π_opt^(r), π_std^(r) be the optimal/standard proposals. We assume that M ∈ R^{d×d} and H ∈ R^{k×d} are full rank.

Proposition 4.2. In the small noise limit r → 0, ρ_opt^(r) remains bounded, while

    \rho_{\mathrm{std}}^{(r)} \sim \mathcal{O}(r^{-k}).

Proof.
Consider the two inverse problems that correspond to (μ_opt^(r), π_opt^(r)) and (μ_std^(r), π_std^(r)). Note that in both problems the prior and the design matrix do not depend on r; only the noise covariances do. Let Γ_opt^(r) and Γ_std^(r) denote the noise covariances in those two inverse problems. As r → 0, we observe that

    \Gamma_{\mathrm{opt}}^{(r)} = r^2 R + H Q H' \to H Q H', \qquad \Gamma_{\mathrm{std}}^{(r)} = r^2 R \to 0.

Therefore, ρ_opt^(r) converges to a finite value, but Proposition 3.2 implies that ρ_std^(r) diverges at rate 𝒪(r^{-k}).

4.4 Standard and Optimal Proposal in High Dimension

The previous subsection shows that the standard and optimal proposals can have dramatically different behavior in the small noise regime r → 0. Here we show that both proposals can also lead to dramatically different behavior in high dimensional limits. Precisely, as a consequence of Corollary 3.8 we can easily identify the exact regimes where both proposals converge or diverge in the limit. The notation is analogous to that in Subsection 3.4, and so we omit the details.
Proposition 4.3.
Consider the sequence of particle filters defined as above. We have the following convergence criteria:

1. μ_opt^(1:∞) ≪ π_opt^(1:∞) and ρ_opt < ∞ if and only if Σ_{i=1}^∞ h_i²m_i²p_i²/(h_i²q_i² + r_i²) < ∞;

2. μ_std^(1:∞) ≪ π_std^(1:∞) and ρ_std < ∞ if and only if Σ_{i=1}^∞ h_i²m_i²p_i²/r_i² < ∞ and Σ_{i=1}^∞ h_i²q_i²/r_i² < ∞.

Proof. By direct computation, we have

    \lambda_{\mathrm{std}}^{(i)} = \frac{h_i^2 m_i^2 p_i^2 + h_i^2 q_i^2}{r_i^2} = \frac{h_i^2 m_i^2 p_i^2}{r_i^2} + \frac{h_i^2 q_i^2}{r_i^2}, \qquad
    \lambda_{\mathrm{opt}}^{(i)} = \frac{h_i^2 m_i^2 p_i^2}{h_i^2 q_i^2 + r_i^2}.

Corollary 3.8 gives the desired result.
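To illustrate Theorem 4.1 and Proposition 4.2 numerically, the following sketch (our own addition; the matrices, the data vector and the grid of noise levels are illustrative assumptions) computes ρ_std and ρ_opt via Proposition 3.1 for a one-step filtering problem and shows how they behave as the observation noise r decreases.

```python
import numpy as np

rng = np.random.default_rng(3)
d = k = 3
M = rng.standard_normal((d, d)) + 2 * np.eye(d)   # full-rank signal dynamics (a.s.)
H = rng.standard_normal((k, d))                   # full-rank observation matrix (a.s.)
P = np.eye(d); Q = 0.5 * np.eye(d); R = np.eye(k)
y = rng.standard_normal(k)

def rho_from_inverse_problem(A, Gamma, Sigma, y):
    """rho = d_chi2(posterior || prior) + 1 via the closed formula of Proposition 3.1."""
    dd = Sigma.shape[0]
    S = A @ Sigma @ A.T + Gamma
    K = Sigma @ A.T @ np.linalg.inv(S)
    KA = K @ A
    pref = (np.linalg.det(np.eye(dd) + KA) * np.linalg.det(np.eye(dd) - KA)) ** -0.5
    expo = y @ K.T @ np.linalg.solve((np.eye(dd) + KA) @ Sigma, K @ y)
    return pref * np.exp(expo)

for r in [1.0, 0.5, 0.1, 0.02]:
    # Standard proposal: design H, noise r^2 R, prior covariance M P M' + Q  (eq. (4.3)).
    rho_std = rho_from_inverse_problem(H, r**2 * R, M @ P @ M.T + Q, y)
    # Optimal proposal: design H M, noise H Q H' + r^2 R, prior covariance P  (eq. (4.4)).
    rho_opt = rho_from_inverse_problem(H @ M, H @ Q @ H.T + r**2 * R, P, y)
    print(f"r = {r:5.2f}   rho_std = {rho_std:12.4e}   rho_opt = {rho_opt:12.4e}")
# rho_std exceeds rho_opt and blows up like r^{-k} as r -> 0, while rho_opt
# approaches a finite value, in line with Theorem 4.1 and Proposition 4.2.
```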
Example 4.4.
As a simple example where absolute continuity holds for the optimal proposal but not for the standard one, let h_i = m_i = p_i = r_i = 1. Then ρ_std = ∞, but ρ_opt < ∞ provided that Σ_{i=1}^∞ 1/(q_i² + 1) < ∞.

Acknowledgement
The work of DSA was supported by NSF and NGA through the grant DMS-2027056. DSA also acknowledges partial support from the NSF Grant DMS-1912818/1912802.
References

[1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: intrinsic dimension and computational cost. Statistical Science, 32(3):405–431, 2017.
[2] Ö. Deniz Akyildiz and J. Míguez. Convergence rates for optimised adaptive importance samplers. arXiv preprint arXiv:1903.12044, 2019.
[3] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[4] T. Bengtsson, P. Bickel, B. Li, et al. Curse-of-dimensionality revisited: collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, pages 316–334. Institute of Mathematical Statistics, 2008.
[5] P. Bickel, B. Li, T. Bengtsson, et al. Sharp failure rates for the bootstrap particle filter in high dimensions. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, pages 318–329. Institute of Mathematical Statistics, 2008.
[6] V. I. Bogachev. Gaussian Measures. Number 62. American Mathematical Society, 1998.
[7] M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, and P. M. Djuric. Adaptive importance sampling: the past, the present, and the future. IEEE Signal Processing Magazine, 34(4):60–79, 2017.
[8] S. Chatterjee and P. Diaconis. The sample size required in importance sampling. arXiv preprint arXiv:1511.01437, 2015.
[9] A. J. Chorin and M. Morzfeld. Conditions for successful data assimilation. Journal of Geophysical Research: Atmospheres, 118(20):11–522, 2013.
[10] P. Del Moral. Feynman-Kac Formulae. Springer, 2004.
[11] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133, 2013.
[12] A. Doucet, N. De Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pages 3–14. Springer, 2001.
[13] N. Garcia Trillos, Z. Kaplan, T. Samakhoana, and D. Sanz-Alonso. On the consistency of graph-based Bayesian semi-supervised learning and the scalability of sampling algorithms. Journal of Machine Learning Research, 21(28):1–47, 2020.
[14] N. Garcia Trillos and D. Sanz-Alonso. The Bayesian update: variational formulations and gradient flows. Bayesian Analysis, 2018.
[15] H. Kahn. Use of Different Monte Carlo Sampling Techniques. Rand Corporation, 1955.
[16] H. Kahn and A. W. Marshall. Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278, 1953.
[17] A. Kong. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep. 348, 1992.
[18] A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425):278–288, 1994.
[19] J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996.
[20] F. Nielsen and R. Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.
[21] M. K. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.
[22] P. Rebeschini and R. Van Handel. Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866, 2015.
[23] G. Rubino and B. Tuffin. Rare Event Simulation Using Monte Carlo Methods. John Wiley & Sons, 2009.
[24] E. K. Ryu and S. P. Boyd. Adaptive importance sampling via stochastic convex programming. arXiv preprint arXiv:1412.4845, 2014.
[25] D. Sanz-Alonso. Importance sampling and necessary sample size: an information theory approach. SIAM/ASA Journal on Uncertainty Quantification, 6(2):867–879, 2018.
[26] D. Sanz-Alonso, A. M. Stuart, and A. Taeb. Inverse problems and data assimilation. arXiv preprint arXiv:1810.06191, 2018.
[27] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008.
[28] C. Snyder, T. Bengtsson, and M. Morzfeld. Performance bounds for particle filters using the optimal proposal. Monthly Weather Review, 143(11):4750–4761, 2015.

A χ²-divergence between Gaussians

We recall that the distribution P_θ parameterized by θ belongs to the exponential family ℰ_F(Θ) over a natural parameter space Θ if θ ∈ Θ and P_θ has density of the form

    f(u; \theta) = e^{\langle t(u), \theta \rangle - F(\theta) + k(u)},

where the natural parameter space is given by

    \Theta = \Bigl\{ \theta : \int e^{\langle t(u), \theta \rangle + k(u)}\, du < \infty \Bigr\}.

The following result can be found in [20].
Lemma A.1.
Suppose θ_1, θ_2 ∈ Θ are parameters for probability densities

    f(u; \theta_{1,2}) = e^{\langle t(u), \theta_{1,2}\rangle - F(\theta_{1,2}) + k(u)},

with 2θ_1 − θ_2 ∈ Θ. Then,

    d_{\chi^2}\bigl( f(\,\cdot\,; \theta_1) \,\|\, f(\,\cdot\,; \theta_2) \bigr) = e^{F(2\theta_1 - \theta_2) - 2F(\theta_1) + F(\theta_2)} - 1.

Proof.
By direct computation,

    d_{\chi^2}\bigl( f(\,\cdot\,;\theta_1) \,\|\, f(\,\cdot\,;\theta_2) \bigr) + 1
    = \int \frac{f(u; \theta_1)^2}{f(u; \theta_2)}\, du
    = \int e^{\langle t(u),\, 2\theta_1 - \theta_2\rangle - (2F(\theta_1) - F(\theta_2)) + k(u)}\, du
    = e^{F(2\theta_1 - \theta_2) - 2F(\theta_1) + F(\theta_2)} \int f(u;\, 2\theta_1 - \theta_2)\, du
    = e^{F(2\theta_1 - \theta_2) - 2F(\theta_1) + F(\theta_2)}.

Note that ∫ f(u; 2θ_1 − θ_2) du = 1, since 2θ_1 − θ_2 ∈ Θ by assumption.

Using Lemma A.1 we can compute the χ²-divergence between Gaussians. To do so, we note that d-dimensional Gaussians 𝒩(μ, Σ) belong to the exponential family over the parameter space R^d ⊕ R^{d×d} by letting θ = [Σ^{-1}μ; −(1/2)Σ^{-1}] and F(θ) = (1/2)μ'Σ^{-1}μ + (1/2)log|Σ|. In the context of Gaussians, an exponential parameter θ = [Σ^{-1}μ; −(1/2)Σ^{-1}] belongs to the natural parameter space Θ if and only if Σ is symmetric and positive definite. Indeed, the integral ∫ exp(−(1/2)(u − μ)'Σ^{-1}(u − μ)) du is finite if and only if Σ ≻ 0.

Proof of Proposition 2.4.
Let θ_μ, θ_π be the exponential parameters of μ and π. Then 2θ_μ − θ_π corresponds to a Gaussian with mean (2C^{-1} − Σ^{-1})^{-1}(2C^{-1}m) and covariance (2C^{-1} − Σ^{-1})^{-1}. We have

    F(2\theta_\mu - \theta_\pi) - 2F(\theta_\mu) + F(\theta_\pi)
      = \tfrac12 \log\bigl| (2C^{-1} - \Sigma^{-1})^{-1} \bigr| - \log|C| + \tfrac12 \log|\Sigma|
        + \tfrac12 (2C^{-1}m)'(2C^{-1} - \Sigma^{-1})^{-1}(2C^{-1}m) - m'C^{-1}m
      = \log\sqrt{\frac{|\Sigma|}{|2C^{-1} - \Sigma^{-1}|\,|C|^2}}
        + m'\bigl( 2C^{-1}(2C^{-1} - \Sigma^{-1})^{-1}C^{-1} \bigr)m
        - m'\bigl( C^{-1}(2C^{-1} - \Sigma^{-1})^{-1}(2C^{-1} - \Sigma^{-1}) \bigr)m
      = \log\frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}}
        + m'\bigl( C^{-1}(2C^{-1} - \Sigma^{-1})^{-1}\Sigma^{-1} \bigr)m
      = \log\frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}} + m'(2\Sigma - C)^{-1}m.

Applying Lemma A.1 gives

    d_{\chi^2}(\mu\,\|\,\pi) = \exp\bigl( F(2\theta_\mu - \theta_\pi) - 2F(\theta_\mu) + F(\theta_\pi) \bigr) - 1
    = \frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}}\, \exp\bigl( m'(2\Sigma - C)^{-1}m \bigr) - 1,

provided that 2θ_μ − θ_π ∈ Θ, in other words provided that the corresponding covariance matrix (2C^{-1} − Σ^{-1})^{-1} is positive definite, which is equivalent to 2Σ ≻ C.

Remark A.2.
By translation invariance of Lebesgue measure, we can obtain the more general formula for the χ²-divergence between two Gaussians with non-zero means by replacing m with the difference between the two mean vectors:

    d_{\chi^2}\bigl( \mathcal{N}(m_1, C) \,\|\, \mathcal{N}(m_2, \Sigma) \bigr)
    = \frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}}\, e^{(m_1 - m_2)'(2\Sigma - C)^{-1}(m_1 - m_2)} - 1.

B Proof of Lemma 3.7
Proof.
Dividing g by its normalizing constant, we may assume without loss of generality that g is exactly the Radon-Nikodym derivative dμ/dπ and that ℋ(μ, π) = π(√g).

If μ_∞ ≪ π_∞, then the Radon-Nikodym derivative g_∞ cannot be π_∞-a.e. zero, since π_∞ and μ_∞ are probability measures. As a consequence, ∏_{i=1}^∞ π_i(√g_i) = π_∞(√g_∞) > 0, which gives the necessity of condition (3.10) for absolute continuity of μ_∞ with respect to π_∞.

Now we assume ∏_{i=1}^∞ π_i(√g_i) > 0. It suffices to show that g_∞ is well-defined, i.e. that g_L := ∏_{i=1}^L g_i converges in L¹(π_∞) as L → ∞. It suffices to prove that the sequence is Cauchy, in other words that

    \lim_{L, \ell \to \infty} \pi_\infty\bigl( |g_{L+\ell} - g_L| \bigr) = 0.

We observe that

    \|g_{L+\ell} - g_L\|_{L^1}
    \le \|\sqrt{g_{L+\ell}} - \sqrt{g_L}\|_{L^2}\, \|\sqrt{g_{L+\ell}} + \sqrt{g_L}\|_{L^2}
    \le \|\sqrt{g_{L+\ell}} - \sqrt{g_L}\|_{L^2} \bigl( \|\sqrt{g_{L+\ell}}\|_{L^2} + \|\sqrt{g_L}\|_{L^2} \bigr)
    = 2\, \|\sqrt{g_{L+\ell}} - \sqrt{g_L}\|_{L^2}.

Expanding the square of the right-hand side gives

    \pi_\infty\bigl( |\sqrt{g_{L+\ell}} - \sqrt{g_L}|^2 \bigr)
    = \pi_\infty\bigl( g_{L+\ell} + g_L - 2\sqrt{g_{L+\ell}\, g_L} \bigr)
    = 2 - 2\, \pi_L(g_L)\, \pi_{L+1:\infty}\Bigl( \sqrt{\tfrac{g_{L+\ell}}{g_L}} \Bigr)
    = 2 \biggl( 1 - \frac{\pi_{L+\ell}(\sqrt{g_{L+\ell}})}{\pi_L(\sqrt{g_L})} \biggr).

Therefore, it is enough to show that

    \lim_{L, \ell \to \infty} \frac{\pi_{L+\ell}(\sqrt{g_{L+\ell}})}{\pi_L(\sqrt{g_L})} = 1.

By Jensen's inequality, for any two probability measures μ ≪ π with density g we have

    \pi(\sqrt{g}) \le \sqrt{\pi(g)} = 1.    (B.1)

Combining with our assumption, we deduce that

    0 < \prod_{i=1}^\infty \pi_i(\sqrt{g_i}) = \pi_\infty(\sqrt{g_\infty}) \le 1,

which is equivalent to

    -\infty < \sum_{i=1}^\infty \log\bigl( \pi_i(\sqrt{g_i}) \bigr) \le 0.

The partial sums of this series are monotonically decreasing by (B.1) and bounded below, so the series converges and

    \lim_{L, \ell \to \infty} \frac{\pi_{L+\ell}(\sqrt{g_{L+\ell}})}{\pi_L(\sqrt{g_L})}
    = \lim_{L, \ell \to \infty} e^{\sum_{i=L+1}^{L+\ell} \log( \pi_i(\sqrt{g_i}) )} = 1.

C Additional Figures

Figure 4: Noise scaling with d = k = 4. (a) N = γ^{-(d-1)}. (b) N = γ^{-(d+1)}.
Figure 5: Prior scaling with d = k = 4. (a) N = σ^{d-1}. (b) N = σ^{d+1}.
Figure 6: Dimensional scaling with constant λ_i ≡ λ, for a larger value of λ than in Figure 3. (a) N polynomial in d. (b) N = 𝒪(d_{χ²}(μ_d‖π_d)).