Note on approximating the Laplace transform of a Gaussian on a complex disk
Yury Polyanskiy and Yihong Wu ∗
September 1, 2020
Abstract
In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps the natural conjecture is that, for large $a$, the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such an approximation achieves the optimal rate of $e^{-\Theta(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, if we consider the $L_\infty$-distance between Laplace transforms on a complex disk, the optimal rate is $e^{-\Theta(a^2\log a)}$, while truncation still only attains $e^{-\Theta(a^2)}$. The optimal rate can be attained by the Gauss-Hermite quadrature. As a corollary, we also construct a "super-flat" Gaussian mixture of $\Theta(a^2)$ components with means in $[-a,a]$ and whose density has all derivatives bounded by $e^{-\Omega(a^2\log(a))}$ in the $O(1)$-neighborhood of the origin.

We study the best approximation of a Gaussian distribution by compactly supported measures, in the sense of uniform approximation of the Laplace transform on a complex disk. Let $L_\pi(z) = \int_{\mathbb{R}} d\pi(y)\, e^{zy}$ be the Laplace transform, $z \in \mathbb{C}$, of the measure $\pi$ and let $\Psi_\pi(t) \triangleq L_\pi(it)$ be its characteristic function. Denote by $L_0(z) = e^{z^2/2}$ and $\Psi_0(t) = e^{-t^2/2}$ the Laplace transform and the characteristic function corresponding to the standard Gaussian $\pi_0 = N(0,1)$ with density $\phi(x) \triangleq \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $x \in \mathbb{R}$.

How well can a measure $\pi_1$ with support on $[-a,a]$ approximate $\pi_0$? Perhaps the most natural choice for $\pi_1$ is the truncated $\pi_0$:
\[ \pi_1(dx) \triangleq \phi_a(x)\,dx, \qquad \phi_a(x) \triangleq \frac{\phi(x)}{1-2Q(a)}\, 1\{|x| \le a\}, \tag{1} \]
where $Q(a) = P[N(0,1) > a]$. Indeed, truncation is asymptotically optimal (as $a \to \infty$) in approximating the characteristic function, as made precise by the following result:

Proposition 1.
There exists some $c > 0$ such that for all $a \ge 1$ and any probability measure $\pi_1$ supported on $[-a,a]$ we have
\[ \sup_{t\in\mathbb{R}} |\Psi_{\pi_1}(t) - e^{-t^2/2}| \ge c\,e^{-ca^2}. \tag{2} \]
Furthermore, truncation (1) satisfies (for $a \ge 1$)
\[ \sup_{t\in\mathbb{R}} |\Psi_{\pi_1}(t) - e^{-t^2/2}| \le 2e^{-a^2/2}. \tag{3} \]

∗ Y.P. is with the Department of EECS, MIT, Cambridge, MA, email: [email protected]. Y.W. is with the Department of Statistics and Data Science, Yale University, New Haven, CT, email: [email protected]. arXiv [math.ST], Aug.

Proof.
Let us define $B(z) \triangleq L_{\pi_1}(z) - e^{z^2/2}$, a holomorphic (entire) function on $\mathbb{C}$. Note that if $\Re(z) = r$ then $|L_{\pi_1}(z)| \le e^{a|r|}$ and $|e^{z^2/2}| \le e^{r^2/2}$, and thus for $r \ge 2a$
\[ b(r) \triangleq \sup\{|B(z)| : \Re(z) = r\} \le e^{ar} + e^{r^2/2} \le 2e^{r^2/2}. \tag{4} \]
On the other hand, for every $r \ge 3a$, $a \ge 1$,
\[ b(r) \ge |B(r)| \ge e^{r^2/2} - e^{ar} \ge \tfrac{1}{2} e^{r^2/2}. \tag{5} \]
Applying the Hadamard three-lines theorem to $B(z)$, we conclude that $r \mapsto \log b(r)$ is convex and hence
\[ b(3a) \le (b(0))^{1/2} (b(6a))^{1/2}. \tag{6} \]
Since the left-hand side of (2) equals $b(0)$, (2) then follows from (4)-(6): indeed, $b(0) \ge b(3a)^2 / b(6a) \ge \tfrac{1}{8} e^{-9a^2}$.

For the converse part, in view of (1), the total variation between $\pi_0$ and its conditional version $\pi_1$ is given by
\[ \int_{\mathbb{R}} |\phi_a(x) - \phi(x)|\,dx = 2\|\pi_1 - \pi_0\|_{\mathrm{TV}} = 2\pi_0([-a,a]^c) = 4Q(a). \]
Therefore, for the Fourier transform of $\phi_a - \phi$ we get
\[ \sup_{t\in\mathbb{R}} |\Psi_{\pi_1}(t) - e^{-t^2/2}| \le 4Q(a) \le \frac{4}{\sqrt{2\pi}\,a}\, e^{-a^2/2} \le 2e^{-a^2/2}. \]

Despite this evidence, it turns out that for the purpose of approximating the Laplace transform in a neighborhood of 0, there is a much better approximation than (1).
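The total-variation bound used for (3) can be sanity-checked numerically. The sketch below is an illustration only and is not part of the original argument: it approximates the characteristic function of the truncated Gaussian with a composite Simpson rule and compares the error, maximized over a finite grid of $t$, with the quantity $4Q(a)$ from the proof. The helper names (`truncated_cf`, `sup_cf_error`), the grid sizes and the cutoff `t_max` are assumptions made for this example, and a finite grid can only approximate the supremum from below.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gaussian_tail(a):
    """Q(a) = P[N(0,1) > a], via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def truncated_cf(t, a, n=2000):
    """Characteristic function of N(0,1) truncated to [-a, a].

    By symmetry it equals the integral of phi_a(x) * cos(t x) over [-a, a],
    approximated with the composite Simpson rule on n (even) subintervals.
    """
    h = 2.0 * a / n
    total = 0.0
    for j in range(n + 1):
        x = -a + j * h
        weight = 1 if j in (0, n) else (4 if j % 2 else 2)
        total += weight * phi(x) * math.cos(t * x)
    return (h / 3.0) * total / (1.0 - 2.0 * gaussian_tail(a))

def sup_cf_error(a, t_max=10.0, grid=100):
    """Grid approximation of sup_t |Psi_{pi_1}(t) - exp(-t^2/2)| over [0, t_max]."""
    ts = [t_max * j / grid for j in range(grid + 1)]
    return max(abs(truncated_cf(t, a) - math.exp(-0.5 * t * t)) for t in ts)

for a in (2.0, 3.0, 4.0):
    err = sup_cf_error(a)
    print(f"a={a}: grid sup error {err:.3e}, 4Q(a) = {4 * gaussian_tail(a):.3e}, "
          f"exp(-a^2/2) = {math.exp(-0.5 * a * a):.3e}")
```

For each value of `a`, the printed grid error should stay below `4Q(a)`, which in turn decays like $e^{-a^2/2}$.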
Theorem 2.
There exists some constant $c > 0$ such that for any probability measure $\pi_1$ supported on $[-a,a]$, $a \ge 1$, we have
\[ \sup_{|z|\le 1,\ z\in\mathbb{C}} |e^{z^2/2} - L_{\pi_1}(z)| \ge c\,e^{-ca^2\log(a)}. \tag{7} \]
Furthermore, there exists an absolute constant $c_1 > 0$ so that for all $b \ge 1$ and all $a \ge 2c_1 b$ there exists a distribution $\pi_1$ (the Gauss-Hermite quadrature) supported on $[-a,a]$ such that
\[ \sup_{|z|\le b,\ z\in\mathbb{C}} |e^{z^2/2} - L_{\pi_1}(z)| \le 3\,(c_1 b/a)^{a^2/4}. \]
Taking $b = 1$ implies that the bound (7) is order-optimal.

Remark 1.
When $\pi_1$ is given by the truncation (1), an explicit calculation for $z \in \mathbb{R}$ gives
\[ L_{\pi_1}(z) = e^{z^2/2}\, \frac{\Phi(a+z) + \Phi(a-z) - 1}{2\Phi(a) - 1}. \]
The same expression (by analytic continuation) holds for arbitrary $z \in \mathbb{C}$ if $\Phi(z)$ is understood as the solution of $\Phi'(z) = \phi(z)$, $\Phi(0) = 1/2$. For $|z| = O(1)$, the approximation error is $e^{-\Omega(a^2)}$, rather than $e^{-\Omega(a^2\log a)}$. The suboptimality of truncation is demonstrated in Fig. 1.
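The explicit calculation behind the formula in Remark 1 is a completion of the square. A short derivation for real $z$, using only the definitions in (1) and writing $\Phi$ for the standard normal distribution function:

```latex
\begin{align*}
\int_{-a}^{a} \phi(x)\, e^{zx}\,dx
  &= e^{z^2/2} \int_{-a}^{a} \frac{1}{\sqrt{2\pi}}\, e^{-(x-z)^2/2}\,dx
     && \text{since } zx - \tfrac{x^2}{2} = \tfrac{z^2}{2} - \tfrac{(x-z)^2}{2} \\
  &= e^{z^2/2} \bigl[\Phi(a-z) - \Phi(-a-z)\bigr] \\
  &= e^{z^2/2} \bigl[\Phi(a-z) + \Phi(a+z) - 1\bigr]
     && \text{since } \Phi(-u) = 1 - \Phi(u).
\end{align*}
% Dividing by the normalizing constant 1 - 2Q(a) = 2\Phi(a) - 1:
\[
L_{\pi_1}(z) = e^{z^2/2}\, \frac{\Phi(a+z) + \Phi(a-z) - 1}{2\Phi(a) - 1}.
\]
```

Subtracting $e^{z^2/2}$ and using $\Phi = 1 - Q$ shows that the error equals $e^{z^2/2}\,[2Q(a) - Q(a+z) - Q(a-z)]/(1-2Q(a))$. For fixed real $z \neq 0$ this is dominated by the term $Q(a-|z|) = e^{-a^2/2 + O(a)}$, in line with the $e^{-\Omega(a^2)}$ rate stated in the remark.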
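Figure 1: Comparison of approximations of $N(0,1)$ by distributions supported on $[-a,a]$, as measured by the (log of the) $L_\infty$ distance between the Laplace transforms on the unit disk in $\mathbb{C}$. The plotted quantity is $\log|L(z) - e^{z^2/2}|$, with one curve labeled Hermite and one labeled Truncation.

Proof.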
As above, denote $B(z) = L_{\pi_1}(z) - e^{z^2/2}$, and define
\[ M(r) = \sup_{|z|\le r,\ z\in\mathbb{C}} |B(z)|. \]
From (5) we have, for any $r \ge 3a$, $a \ge 1$,
\[ M(r) \ge \tfrac{1}{2} e^{r^2/2}, \]
and from $|L_{\pi_1}(z)| \le e^{a|z|}$ we also have (for any $r \ge 2a$)
\[ M(r) \le e^{ar} + e^{r^2/2} \le 2e^{r^2/2}. \]
Applying the Hadamard three-circles theorem, we have that $\log r \mapsto \log M(r)$ is convex, and hence
\[ M(3a) \le (M(1))^{1-\lambda} (M(5a))^{\lambda}, \]
where $\lambda = \frac{\log(3a)}{\log(5a)}$ and $1 - \lambda = \frac{\log(5/3)}{\log(5a)} = \Theta(1/\log a)$. Since $\lambda \le 1$, the two bounds on $M$ give $(M(1))^{1-\lambda} \ge M(3a)/(M(5a))^{\lambda} \ge \tfrac{1}{4} e^{-8a^2}$. From here we obtain, for some constant $c > 0$,
\[ M(1) \ge e^{-c(a^2+1)\log(a)}, \]
which proves (7).

For the upper bound, take $\pi_1$ to be the $k$-point Gauss-Hermite quadrature of $N(0,1)$ (cf. [SB02, Section 3.6]). This is the unique $k$-atomic distribution that matches the first $2k-1$ moments of $N(0,1)$. Specifically:
• $\pi_1$ is supported on the roots of the degree-$k$ Hermite polynomial, which lie in $[-\sqrt{4k+2}, \sqrt{4k+2}]$ [Sze75, Theorem 6.32];
• The $i$-th moment of $\pi_1$, denoted by $m_i(\pi_1)$, satisfies $m_i(\pi_1) = m_i(N(0,1))$ for $i = 1, \dots, 2k-1$; moreover, $\pi_1$ is symmetric, so that all odd moments are zero.
We set $k = \lceil a^2/8 \rceil$, so that $\pi_1$ is supported on $[-a,a]$.

Let us denote $X \sim \pi_1$ and $G \sim N(0,1)$. Then
\[ B(z) = E[e^{zX}] - E[e^{zG}] = \sum_{m=2k}^{\infty} \frac{1}{m!}\, z^m \bigl(E[X^m] - E[G^m]\bigr) = \sum_{\ell=k}^{\infty} \frac{1}{(2\ell)!}\, z^{2\ell} \bigl(E[X^{2\ell}] - E[G^{2\ell}]\bigr). \]
Now, we will bound $(2\ell)! \ge (2\ell/e)^{2\ell}$, $E[X^{2\ell}] \le a^{2\ell}$, and $E[G^{2\ell}] = (2\ell-1)!! \le (2\ell)^{\ell}$. This implies that for all $|z| \le b$ we have
\[ |B(z)| \le \sum_{\ell\ge k} \left\{ \left(\frac{ea}{2\ell}\right)^{2\ell} + \left(\frac{e}{\sqrt{2\ell}}\right)^{2\ell} \right\} |z|^{2\ell} \le \sum_{\ell\ge k} \left\{ \left(\frac{ea}{2k}\right)^{2\ell} + \left(\frac{e}{\sqrt{2k}}\right)^{2\ell} \right\} |z|^{2\ell} \le \sum_{\ell\ge k} 2\,(c_1 b/a)^{2\ell}, \]
where in the last step we used $\frac{ea}{2k}, \frac{e}{\sqrt{2k}} \le \frac{c_1}{a}$ for some absolute constant $c_1$. In all, we have that whenever $c_1 b/a \le 1/2$,
\[ |B(z)| \le \frac{2}{1 - (c_1 b/a)^2}\, (c_1 b/a)^{2k} \le 3\,(c_1 b/a)^{a^2/4}. \]

Remark 2.
Note that our proof does not show that for any $\pi_1$ supported on $[-a,a]$, its characteristic function restricted to $[-1,1]$ must satisfy
\[ \sup_{|t|\le 1} |\Psi_{\pi_1}(t) - e^{-t^2/2}| \ge c\,e^{-ca^2\log a}. \]
It is natural to conjecture that this should hold, though.
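The Gauss-Hermite construction used in the proof of Theorem 2 is easy to try out numerically. The following sketch is illustrative only. It computes the roots of the probabilists' Hermite polynomial by bisection (using the interlacing of roots of consecutive degrees), forms the probability weights, and evaluates the error on the unit circle, which by the maximum modulus principle carries the supremum over the unit disk. As a reference it also evaluates the truncation error at the single real point $z = 1$ through the closed form of Remark 1, which is a lower bound on truncation's error over the disk. The helper names, the value $a = 6$, and the rule "largest $k$ with $\sqrt{4k+2} \le a$" are assumptions made for this example and are not necessarily the choices made in the proof.

```python
import cmath
import math

def hermite_pair(n, x):
    """Return (He_n(x), He_{n-1}(x)) for the probabilists' Hermite polynomials,
    using the recurrence He_{j+1}(x) = x He_j(x) - j He_{j-1}(x)."""
    prev, cur = 0.0, 1.0  # He_{-1} := 0, He_0 = 1
    for j in range(n):
        prev, cur = cur, x * cur - j * prev
    return cur, prev

def gauss_hermite(k, iters=100):
    """Nodes and probability weights of the k-point Gauss-Hermite rule for N(0,1).

    Roots of He_n are located by bisection: they interlace with the roots of
    He_{n-1} and lie strictly inside (-sqrt(4n+2), sqrt(4n+2)).
    """
    roots = []
    for n in range(1, k + 1):
        bound = math.sqrt(4 * n + 2)
        brackets = [-bound] + roots + [bound]
        new_roots = []
        for lo, hi in zip(brackets, brackets[1:]):
            f_lo = hermite_pair(n, lo)[0]
            for _ in range(iters):
                mid = 0.5 * (lo + hi)
                f_mid = hermite_pair(n, mid)[0]
                if (f_mid > 0) == (f_lo > 0):
                    lo, f_lo = mid, f_mid
                else:
                    hi = mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    # Weights are proportional to 1 / He_{k-1}(x_i)^2; normalize them to sum to one.
    raw = [1.0 / hermite_pair(k, x)[1] ** 2 for x in roots]
    total = sum(raw)
    return roots, [w / total for w in raw]

def laplace_discrete(z, nodes, weights):
    """Laplace transform sum_i w_i exp(z x_i) of a discrete distribution."""
    return sum(w * cmath.exp(z * x) for x, w in zip(nodes, weights))

def sup_error_on_circle(nodes, weights, b=1.0, angles=360):
    """Grid approximation of sup_{|z|=b} |L_pi(z) - exp(z^2/2)|."""
    worst = 0.0
    for j in range(angles):
        z = cmath.rect(b, 2.0 * math.pi * j / angles)
        err = abs(laplace_discrete(z, nodes, weights) - cmath.exp(0.5 * z * z))
        worst = max(worst, err)
    return worst

def truncation_error_at(z, a):
    """|L_{pi_1}(z) - exp(z^2/2)| for the truncated Gaussian at a real point z,
    from the closed form of Remark 1 rewritten with Q = 1 - Phi for accuracy."""
    Q = lambda u: 0.5 * math.erfc(u / math.sqrt(2.0))
    return math.exp(0.5 * z * z) * abs(2 * Q(a) - Q(a + z) - Q(a - z)) / (1 - 2 * Q(a))

a = 6.0
k = int((a * a - 2) // 4)  # largest k with sqrt(4k + 2) <= a (illustrative choice)
nodes, weights = gauss_hermite(k)
herm = sup_error_on_circle(nodes, weights)
trunc = truncation_error_at(1.0, a)
print(f"a={a}, k={k}: Hermite error on |z|=1 ~ {herm:.3e}, truncation error at z=1 ~ {trunc:.3e}")
```

With these settings the quadrature uses only a handful of atoms, yet its error on the unit circle should come out well below the truncation error, in line with the comparison shown in Fig. 1.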
Remark 3.
Note also that the Gauss-Hermite quadrature considered in the theorem, while essentially optimal on complex disks, is not uniformly better than the naive truncation. For example, due to its finite support, the Gauss-Hermite quadrature is a very bad approximation in the sense of (2). Indeed, for any finite discrete distribution $\pi$ we have $\limsup_{t\to\infty} |\Psi_\pi(t)| = 1$, thus only attaining the trivial bound of 1 in the right-hand side of (2). (To see this, note that $\Psi_\pi(t) = \sum_{j=1}^{k} p_j e^{it\omega_j}$. By simultaneous rational approximation (see, e.g., [Cas72, Theorem VI, p. 13]), there are infinitely many values $q$ such that for all $j$ we have $|q\,\frac{\omega_j}{2\pi} - n_j| < q^{-1/k}$ for some $n_j \in \mathbb{Z}$. In turn, this implies that $\liminf_{t\to\infty} \max_{j=1,\dots,k} (t\omega_j \bmod 2\pi) = 0$, and that $\Psi_\pi(t) \to 1$ along the subsequence of $t$ attaining the $\liminf$.)

As a corollary of the construction in the previous section, we can also derive a curious discrete distribution $\pi = \sum_m w_m \delta_{x_m}$ supported on $[-a,a]$ such that its convolution with the Gaussian kernel, $\pi * \phi$, is maximally flat near the origin. More precisely, we have the following result.

Corollary 3.
There exist constants $C_1, C > 0$ such that for every $a > 1$ there exist $k = \Theta(a^2)$, $w_m \ge 0$ with $\sum_{m=1}^{k} w_m = 1$ and $x_m \in [-a,a]$, $m \in \{1,\dots,k\}$, such that
\[ \left| \left(\frac{d}{dz}\right)^{n} \sum_{m=1}^{k} w_m \phi(z - x_m) \right| \le n! \cdot C_1 e^{-Ca^2\log(a)} \qquad \forall z \in \mathbb{C},\ |z| \le 1,\ n \in \{1, 2, \dots\}. \]

Proof.
Consider the distribution $\pi = \sum_{m=1}^{k} \tilde{w}_m \delta_{x_m}$ claimed by Theorem 2 for $b = 2$. Then (here and below $C$ designates some absolute constant, possibly different in every occurrence) we have
\[ \sup_{|z|\le 2} |L_\pi(z) - e^{z^2/2}| \le Ce^{-Ca^2\log(a)}. \]
Note that the function $e^{-z^2/2}$ is also bounded on $|z| \le 2$, and thus
\[ \sup_{|z|\le 2} |L_\pi(z)\, e^{-z^2/2} - 1| \le Ce^{-Ca^2\log(a)}. \]
By the Cauchy formula, this also implies that the derivatives of the two functions inside $|\cdot|$ must satisfy the same estimate on a smaller disk, i.e., for every $n \ge 1$,
\[ \sup_{|z|\le 1} \left| \left(\frac{d}{dz}\right)^{n} L_\pi(z)\, e^{-z^2/2} \right| \le n! \cdot Ce^{-Ca^2\log(a)}. \tag{8} \]
Now, define $w_m = \frac{1}{B}\, \tilde{w}_m e^{x_m^2/2}$, where $B = \sum_m \tilde{w}_m e^{x_m^2/2}$. We then have the identity
\[ L_\pi(z)\, e^{-z^2/2} = B \sum_m w_m e^{-(z - x_m)^2/2}. \]
Plugging this into (8) and noticing that $B \ge 1$ completes the proof.

Remark 4.
This corollary was in fact the main motivation of this note. More exactly, in the study of the properties of non-parametric maximum-likelihood estimation of Gaussian mixtures, we conjectured that certain mixtures must possess some special $z$ in the unit disk on $\mathbb{C}$ such that $|\sum_m w_m \phi'(z - x_m)| \ge e^{-O(a^2)}$. The stated corollary shows that this is not true for all mixtures. See [PW20, Section 5.3] for more details on why lower-bounding the derivative is important. In particular, one open question is whether the lower bound $e^{-O(a^2)}$ holds (with high probability) for the case when $a \asymp \sqrt{\log k}$ and $x_m$, $m \in \{1,\dots,k\}$, are iid samples of $N(0,1)$, while the $w_m$'s can be chosen arbitrarily given the $x_m$'s.

References
[Cas72] J. W. S. Cassels. An Introduction to Diophantine Approximation. Cambridge University Press, Cambridge, United Kingdom, 1972.
[PW20] Yury Polyanskiy and Yihong Wu. Self-regularizing property of nonparametric maximum likelihood estimator in mixture models. arXiv preprint arXiv:2008.08244, Aug 2020.
[SB02] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer-Verlag, New York, NY, 3rd edition, 2002.
[Sze75] G. Szegö.