Kernel quadrature by applying a point-wise gradient descent method to discrete energies
Ken'ichiro Tanaka*†

February 24, 2021

* Department of Mathematical Informatics, Graduate School of Information Science and Technology, University of Tokyo. 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan. e-mail: [email protected]
† PRESTO, Japan Science and Technology Agency (JST), 4-1-8 Honcho, Kawaguchi-shi, Saitama, 332-0012, Japan.
Abstract
We propose a method for generating nodes for kernel quadrature by a point-wise gradient descent method. For kernel quadrature, most methods for generating nodes are based on the worst-case error of a quadrature formula in a reproducing kernel Hilbert space corresponding to the kernel. In typical methods of this type, a new node is chosen from a candidate set of points in each step by solving an optimization problem with respect to the new node. Although such sequential methods are appropriate for adaptive quadrature, it is difficult to apply standard routines for mathematical optimization to this problem. In this paper, we propose a method that updates a set of points one by one with a simple gradient descent method. To this end, we provide an upper bound of the worst-case error by using the fundamental solution of the Laplacian on $\mathbb{R}^d$. We observe the good performance of the proposed method in numerical experiments.

Keywords
Kernel quadrature, Point-wise gradient descent method, Discrete energy, Reproducing kernel Hilbert space, Gaussian kernel
Mathematics Subject Classification
1 Introduction

This paper is concerned with kernel quadrature, an approach to deriving numerical integration formulas with kernels. Let $d \in \mathbb{Z}_{>0}$ be a positive integer, $\Omega \subset \mathbb{R}^d$ a region with a non-empty interior in $\mathbb{R}^d$, and $\mu$ a Borel measure on $\Omega$. A formula for quadrature is formally expressed as
$$\int_\Omega f(x)\,\mathrm{d}\mu(x) \approx \sum_{j=1}^{N} w_j f(x_j), \qquad (1.1)$$
where $X_N = \{x_1, \ldots, x_N\} \subset \Omega$ and $W_N = \{w_1, \ldots, w_N\} \subset \mathbb{R}$ are the sets of distinct nodes and weights, respectively.

Besides Monte Carlo (MC) methods and quasi-Monte Carlo (QMC) methods, kernels have recently been used to derive numerical integration formulas. There are several categories of such methods. One category consists of random methods sampling the nodes $x_j$. It includes importance sampling [8], random feature expansions [1], kernel quadrature with determinantal point processes (DPPs) [3], etc.

On the other hand, there are categories of deterministic methods. One of them consists of sequential algorithms choosing the nodes $x_j$ one by one in greedy ways. Such algorithms include kernel herding (KH) [2, 7], sequential Bayesian quadrature (SBQ) [10, 14], orthogonal matching pursuit (OMP) [14], and their variants [5, 6, 11, 12, 15, 16, 20]. For example, SBQ is realized by the procedure
$$x_{N+1} \in \operatorname*{argmin}_{x \in \Omega}\, e^{\mathrm{wor}}(X_N \cup \{x\}, W^*_{N+1}; \mathcal{H}_K) \qquad (N = 1, 2, \ldots), \qquad (1.2)$$
where $e^{\mathrm{wor}}$, $W^*_{N+1}$, and $\mathcal{H}_K$ denote the worst-case error, the optimal set of weights, and the reproducing kernel Hilbert space defined later in (2.2), (3.1), and Section 2, respectively. The methods in this category are reasonable in that we can add new nodes while keeping the existing ones. Such procedures are useful for adaptive quadrature, which is valuable in the case of a high-dimensional and/or complicated region $\Omega$. However, each step like (1.2) requires nonlinear and nonconvex optimization, typically choosing a node among candidates prepared in the region $\Omega$.
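For concreteness, the following sketch mimics the SBQ-type greedy step (1.2) for the one-dimensional Gaussian kernel $K(x,y) = \exp(-\varepsilon^2(x-y)^2)$ on $\Omega = [0,1]$ with the Lebesgue measure, using the optimal-weight expression of the worst-case error derived in Sections 2 and 3 below. This is a minimal illustration rather than the implementation of any method discussed in this paper; the candidate grid, $\varepsilon = 2$, and $N = 10$ are arbitrary choices.

```python
import numpy as np
from scipy.special import erf

eps = 2.0  # kernel shape parameter (assumed value for illustration)

def kern(x, y):
    # 1D Gaussian kernel K(x, y) = exp(-eps^2 (x - y)^2)
    return np.exp(-eps**2 * np.subtract.outer(x, y)**2)

def kvec(x):
    # k_i = \int_0^1 K(t, x_i) dt, in closed form via the error function
    return np.sqrt(np.pi) / (2 * eps) * (erf(eps * (1 - x)) + erf(eps * x))

# k00 = \int_0^1 \int_0^1 K(s, t) ds dt, also in closed form
k00 = np.sqrt(np.pi) / eps * erf(eps) + (np.exp(-eps**2) - 1) / eps**2

def sq_wor_error(nodes):
    # squared worst-case error with optimal weights: k00 - k^T K^{-1} k
    K = kern(nodes, nodes)
    k = kvec(nodes)
    return k00 - k @ np.linalg.solve(K, k)

cands = list(np.linspace(0.01, 0.99, 99))  # candidate set prepared in Omega
nodes = []
for _ in range(10):  # choose 10 nodes greedily, in the spirit of (1.2)
    errs = [sq_wor_error(np.array(nodes + [c])) for c in cands]
    nodes.append(cands.pop(int(np.argmin(errs))))
print(sorted(nodes))
```

Each greedy step scans all remaining candidates, which is exactly the kind of optimization over a prepared candidate set that the method proposed in this paper avoids.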
There is another category of methods obtaining all the nodes at once. For example, Oettershagen [14] proposes a method solving a nonlinear equation. Furthermore, Fekete points, which maximize the determinant of a kernel matrix, are known to be useful as nodes for quadrature. Unfortunately, it is hard to find them exactly because the maximization of the determinant is intractable in general. However, some approximate optimization methods have been proposed [13, 18], although these are aimed at the approximation of functions. In fact, these optimization methods employ convex objective functions for finding a set of nodes. They enable us to find a good set of nodes by standard optimization routines without preparing a candidate set of nodes in $\Omega$. In particular, a logarithmic energy with an external field
$$I(x_1, \ldots, x_N) = \varepsilon^2 \sum_{k=1}^{N} x_k^2 + \sum_{1 \le i < j \le N} \log \frac{1}{|x_i - x_j|} \qquad (1.3)$$
appears in the case of the one-dimensional Gaussian kernel with $\varepsilon > 0$. This tool is restricted to the one-dimensional Gaussian kernel and has the following problems:

1. It is difficult to extend it to higher-dimensional cases.
2. The relationship between it and the worst-case error $e^{\mathrm{wor}}$ is unclear.

Therefore it would be useful for kernel quadrature if we could obtain an extension of the energy in (1.3) that is an upper bound of the worst-case error. Here we note that finding a set of nodes at once by optimization of such an extended energy may be difficult when we need to find very many nodes in a very high-dimensional region $\Omega$. However, in such a case, such an energy can also be used in a sequential algorithm finding nodes one by one.

In this paper, we provide an extension of the energy in (1.3) by using the fundamental solution of the Laplacian on $\mathbb{R}^d$ and show that it provides an upper bound of the worst-case error. The method for deriving these results is based on the idea of [17]. Based on the bound, we propose a method that updates a set of points one by one with a simple gradient descent method, which we call a point-wise gradient descent method. This method does not require a candidate set of points in $\Omega$ and is relatively easy to implement.

The rest of this paper is organized as follows. In Section 2, we review some basic notions for kernel quadrature. In Section 3, we review the basics of the energy in (1.3). In Section 4, we derive an extension of the energy by using the fundamental solution of the Laplacian on $\mathbb{R}^d$. In Section 5, we propose the point-wise gradient descent method. In Section 6, we show some results of numerical experiments. In Section 7, we conclude this paper.

2 Basic notions for kernel quadrature

Let $K : \Omega \times \Omega \to \mathbb{R}$ be a continuous and symmetric function and assume that it is positive definite. Let $\mathcal{H}_K(\Omega)$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $K$. It has the following properties:

1. $\forall x \in \Omega$, $K(\cdot, x) \in \mathcal{H}_K(\Omega)$;
2. $\forall f \in \mathcal{H}_K(\Omega)$, $\forall x \in \Omega$, $\langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = f(x)$.

The latter is called the reproducing property. We consider the quadrature formula (1.1) for a function $f \in \mathcal{H}_K(\Omega)$. This approximation can be regarded as an approximation of the measure $\mu$:
$$\mu \approx \sum_{j=1}^{N} w_j \delta_{x_j}. \qquad (2.1)$$
For the formula in (1.1), we can define the worst-case error $e^{\mathrm{wor}}(X_N, W_N; \mathcal{H}_K)$ by
$$e^{\mathrm{wor}}(X_N, W_N; \mathcal{H}_K) = \sup_{\substack{f \in \mathcal{H}_K \\ \|f\|_{\mathcal{H}_K} \le 1}} \left| \int_\Omega f(x)\,\mathrm{d}\mu(x) - \sum_{i=1}^{N} w_i f(x_i) \right|. \qquad (2.2)$$
In general, it is desirable to construct a good point set $X_N$ and weight set $W_N$ that make the worst-case error as small as possible. To approach this goal, the following well-known expression of the worst-case error is useful:
$$(e^{\mathrm{wor}}(X_N, W_N; \mathcal{H}_K))^2 = \left\| \int_\Omega K(\cdot, x)\,\mathrm{d}\mu(x) - \sum_{j=1}^{N} w_j K(\cdot, x_j) \right\|_{\mathcal{H}_K}^2 = \int_\Omega \int_\Omega K(y, z)\,\mathrm{d}\mu(y)\,\mathrm{d}\mu(z) - 2 \sum_{j=1}^{N} w_j \int_\Omega K(x_j, x)\,\mathrm{d}\mu(x) + \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j K(x_i, x_j). \qquad (2.3)$$
This is owing to the reproducing property $\langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} = f(x)$. Therefore sets $X_N$ and $W_N$ minimizing the value
$$-2 \sum_{j=1}^{N} w_j \int_\Omega K(x_j, x)\,\mathrm{d}\mu(x) + \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j K(x_i, x_j) \qquad (2.4)$$
are required for a quadrature formula. We often assume that
$$w_i = \frac{1}{N} \quad (i = 1, \ldots, N) \qquad (2.5)$$
for simplicity.

3 A discrete energy for the one-dimensional Gaussian kernel

3.1 Optimal weights

If we fix the point set $X_N$, the right-hand side of (2.3) becomes a quadratic form with respect to the weight set $W_N$. Therefore, we can easily find its minimizer $W^*_N = \{w^*_i\}$ as follows:
$$\boldsymbol{w}^* = K_{X_N}^{-1} \int_\Omega \boldsymbol{k}_{X_N}(x)\,\mathrm{d}\mu(x), \qquad (3.1)$$
where $\boldsymbol{w}^* = (w^*_1, \ldots, w^*_N)^{\mathrm{T}}$, and
$$K_{X_N} = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_N) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & \cdots & K(x_N, x_N) \end{pmatrix}, \qquad \boldsymbol{k}_{X_N}(x) = \begin{pmatrix} K(x, x_1) \\ K(x, x_2) \\ \vdots \\ K(x, x_N) \end{pmatrix}. \qquad (3.2)$$
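As an illustration of (3.1), the following sketch computes the optimal weights for a given node set and the Gaussian kernel $K(x,y) = \exp(-\|x-y\|^2)$ on $\Omega = [0,1]^2$ with the Lebesgue measure; here the integrals $\int_\Omega K(x, x_i)\,\mathrm{d}x$ factorize into products of one-dimensional error-function integrals (cf. (5.2) below). The random node set is an arbitrary stand-in.

```python
import numpy as np
from scipy.special import erf

def J(x):
    # J(x) = \int_0^1 exp(-(x - y)^2) dy = (sqrt(pi)/2) (erf(1 - x) + erf(x))
    return 0.5 * np.sqrt(np.pi) * (erf(1.0 - x) + erf(x))

def optimal_weights(X):
    # X: (N, d) nodes in [0,1]^d; Gaussian kernel K(x, y) = exp(-||x - y||^2)
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    K = np.exp(-D2)                  # kernel matrix K_{X_N}, cf. (3.2)
    k = np.prod(J(X), axis=1)        # \int_\Omega K(x, x_i) dx factorizes over coordinates
    return np.linalg.solve(K, k)     # w* = K_{X_N}^{-1} \int k_{X_N}(x) dx, cf. (3.1)

rng = np.random.default_rng(0)
X = rng.random((20, 2))              # 20 random nodes in [0,1]^2
w = optimal_weights(X)
# quadrature rule: \int f dx ~ sum_j w_j f(x_j); e.g. f = 1 integrates to 1
print(w.sum())                       # typically close to 1 for a reasonable node set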
By using these, we can express the squared worst-case error with the optimal weights $W^*_N$ as
$$(e^{\mathrm{wor}}(X_N, W^*_N; \mathcal{H}_K))^2 = \frac{1}{\det K_{X_N}} \det \begin{pmatrix} k_{00} & k_1 & \cdots & k_N \\ k_1 & & & \\ \vdots & & K_{X_N} & \\ k_N & & & \end{pmatrix}, \qquad (3.3)$$
where
$$k_{00} = \int_\Omega \int_\Omega K(x, y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y), \qquad k_i = \int_\Omega K(x, x_i)\,\mathrm{d}\mu(x) \quad (i = 1, \ldots, N). \qquad (3.4)$$
Clearly, the value $e^{\mathrm{wor}}(X_N, W^*_N; \mathcal{H}_K)$ is less than or equal to the worst-case error with the equal weights in (2.5):
$$e^{\mathrm{wor}}(X_N, W^*_N; \mathcal{H}_K) \le e^{\mathrm{wor}}(X_N, \{1/N\}_{i=1}^{N}; \mathcal{H}_K). \qquad (3.5)$$
According to Formula (3.3), maximization of the determinant of the matrix $K_{X_N}$ seems to be useful, although we do not have a reasonable estimate of the worst-case error at its maximizer, the Fekete points. Unfortunately, the maximization of $\det K_{X_N}$ is not tractable in general. However, in the case of the one-dimensional Gaussian kernel, we can obtain a tractable approximation of the determinant by expanding the kernel [13]. Although the derivation of Formula (3.3) is elementary, we write it in Appendix A for the readers' convenience.

3.2 Expansion and truncation of the one-dimensional Gaussian kernel

Let $K(x, y) := \exp(-\varepsilon^2 (x - y)^2)$ be the one-dimensional Gaussian kernel. Then we have
$$K(x, y) \approx \hat{K}(x, y) = \sum_{\ell=0}^{n-1} \varphi_\ell(x)\,\varphi_\ell(y), \qquad \text{where} \quad \varphi_\ell(x) := \exp(-\varepsilon^2 x^2) \sqrt{\frac{2^\ell \varepsilon^{2\ell}}{\ell!}}\, x^\ell.$$
Let $\hat{K}_{X_N} = (\hat{K}(x_k, x_m))_{k,m=1}^{N} \in \mathbb{R}^{N \times N}$ be the corresponding kernel matrix, where $X_N = \{x_1, \ldots, x_N\}$. Then it is shown in [13] that
$$-\log \left( \det \hat{K}_{X_N} \right)^{1/2} = C_{\varepsilon,N} + \varepsilon^2 \sum_{k=1}^{N} x_k^2 + \sum_{1 \le i < j \le N} \log \frac{1}{|x_i - x_j|},$$
where $C_{\varepsilon,N}$ is a constant that does not depend on the nodes. Thus minimizing the energy in (1.3) corresponds to approximately maximizing $\det K_{X_N}$, i.e., to finding approximate Fekete points.
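The following sketch numerically illustrates the expansion above: for a modest truncation level, the matrix $\hat{K}_{X_N}$ is close to $K_{X_N}$, and $-\tfrac{1}{2}\log\det \hat{K}_{X_N}$ differs from the node-dependent energy by a quantity that is (approximately) independent of the nodes. The values of $\varepsilon$, $n$, and the nodes are arbitrary choices.

```python
import numpy as np
from math import factorial, log

eps, n = 1.0, 40                       # kernel parameter and truncation level (assumed values)
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 6))     # N = 6 distinct nodes

# truncated expansion K_hat(x, y) = sum_l phi_l(x) phi_l(y)
Phi = np.stack([np.exp(-eps**2 * x**2) *
                np.sqrt(2.0**l * eps**(2 * l) / factorial(l)) * x**l
                for l in range(n)], axis=1)          # (N, n) feature matrix
K_hat = Phi @ Phi.T

# exact Gaussian kernel matrix for comparison
K = np.exp(-eps**2 * np.subtract.outer(x, x)**2)
print(np.max(np.abs(K - K_hat)))       # small for sufficiently large n

# node-dependent part of -log(det K_hat)^(1/2), cf. the identity above
energy = eps**2 * np.sum(x**2) + sum(log(1.0 / abs(x[i] - x[j]))
                                     for i in range(6) for j in range(i + 1, 6))
# the difference should be (approximately) a constant C_{eps,N} independent of the nodes
print(-0.5 * np.log(np.linalg.det(K_hat)) - energy)
```

Rerunning with a different seed changes both terms but leaves their difference nearly unchanged, which is what makes the energy a usable surrogate for the log-determinant.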
4 An extension of the energy via the fundamental solution of the Laplacian

4.1 A renormalized energy

Let $d \ge 2$. We use the $d$-dimensional fundamental solution $G_d(x, y)$ of the Laplacian $\Delta$ on $\mathbb{R}^d$. It is given by the following expression:
$$G_d(x, y) = \begin{cases} \dfrac{1}{2\pi} \log \|x - y\| & (d = 2), \\[1mm] -\dfrac{1}{(d-2)\, s_d\, \|x - y\|^{d-2}} & (d \ge 3), \end{cases} \qquad (4.1)$$
where $s_d$ is the surface area of the $(d-1)$-dimensional unit sphere. The fundamental solution satisfies
$$\Delta_x G_d(x, y) = \delta_y(x) \ (= \delta(x - y)). \qquad (4.2)$$
The following lemmas are based on the ideas in [17], whereas their proofs are slightly different from those in that paper. The proofs are provided in Appendix C.

Lemma 4.1. Let $s$ and $t$ be positive real numbers and let $a$ and $b$ be points in $\mathbb{R}^d$. Then the following equality holds:
$$\int_{\mathbb{R}^d} \mathrm{d}x \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x, y)\, e^{s\Delta_x}\delta_a(x)\, e^{t\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(a, y)\, e^{(s+t)\Delta_y}\delta_b(y).$$

Lemma 4.2. Let $t$ be a positive real number and let $b$ be a point in $\mathbb{R}^d$. Then the value
$$\int_{\mathbb{R}^d} \mathrm{d}y\, G_d(b, y)\, e^{t\Delta_y}\delta_b(y)$$
is bounded and depends only on $d$ and $t$.

Lemma 4.3. Let $t$ be a positive real number and let $a$ and $b$ be distinct points in $\mathbb{R}^d$. Then the following equality holds:
$$\int_{\mathbb{R}^d} \mathrm{d}y\, G_d(a, y)\, e^{t\Delta_y}\delta_b(y) = G_d(a, b) + \int_0^t \frac{1}{(4\pi s)^{d/2}} \exp\left( -\frac{\|a - b\|^2}{4s} \right) \mathrm{d}s.$$

For a discrete measure
$$\mu_N = \frac{1}{N} \sum_{j=1}^{N} \delta_{x_j} \qquad (4.3)$$
with distinct points $\{x_i\}_{i=1}^{N}$, we introduce a renormalized energy. To this end, for $\mu_N$ and a positive number $t > 0$, we define $A_d(t, \mu_N)$ by
$$A_d(t, \mu_N) := \int_{\mathbb{R}^d} \mathrm{d}x \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x, y)\, e^{\frac{t}{2}\Delta_x}\mu_N(x)\, e^{\frac{t}{2}\Delta_y}\mu_N(y). \qquad (4.4)$$
Furthermore, we define a constant:
$$C_d(t) := \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(0, y)\, e^{t\Delta_y}\delta_0(y). \qquad (4.5)$$
Then we can derive the following relation by using the above lemmas.

Theorem 4.4. Let $t > 0$. Then
$$\frac{1}{N^2} \sum_{i \ne j} \int_0^t \frac{1}{(4\pi s)^{d/2}} \exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right) \mathrm{d}s = A_d(t, \mu_N) - \frac{C_d(t)}{N} - \frac{1}{N^2} \sum_{i \ne j} G_d(x_i, x_j). \qquad (4.6)$$

Proof. By using Lemmas 4.1, 4.2, and 4.3, we can derive the following equalities:
$$A_d(t, \mu_N) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{\mathbb{R}^d} \mathrm{d}x \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x, y)\, e^{\frac{t}{2}\Delta_x}\delta_{x_i}(x)\, e^{\frac{t}{2}\Delta_y}\delta_{x_j}(y)$$
$$= \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x_i, y)\, e^{t\Delta_y}\delta_{x_j}(y) \quad (\because \text{Lemma 4.1})$$
$$= \frac{1}{N^2} \left( \sum_{i = j} + \sum_{i \ne j} \right) \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x_i, y)\, e^{t\Delta_y}\delta_{x_j}(y)$$
$$= \frac{C_d(t)}{N} + \frac{1}{N^2} \sum_{i \ne j} \left\{ G_d(x_i, x_j) + \int_0^t \frac{1}{(4\pi s)^{d/2}} \exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right) \mathrm{d}s \right\} \quad (\because \text{Lemmas 4.2 and 4.3}).$$
Hence the conclusion follows.

Here we present a sketch of an idea to estimate both sides of (4.6). Suppose that

- the points $x_j$ are in a bounded region $\Omega \subset B[0, r] \subset \mathbb{R}^d$, and
- $t$ is sufficiently large so that $A_d(t, \mu_N)$ is almost independent of $\{x_j\}_{j=1}^{N}$.

Under these conditions, we may have
$$\int_0^t \frac{\exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right)}{(4\pi s)^{d/2}}\, \mathrm{d}s = \left( \int_0^{r^2} + \int_{r^2}^t \right) \frac{\exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right)}{(4\pi s)^{d/2}}\, \mathrm{d}s \approx \int_0^{r^2} \frac{\exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right)}{(4\pi s)^{d/2}}\, \mathrm{d}s + \int_{r^2}^t \frac{\mathrm{d}s}{(4\pi s)^{d/2}} \qquad (4.7)$$
$$\approx \hat{C}_{d,r} \exp\left( -\frac{\|x_i - x_j\|^2}{4r^2} \right) + \int_{r^2}^t \frac{\mathrm{d}s}{(4\pi s)^{d/2}}. \qquad (4.8)$$
Therefore we may state that the terms depending on $\{x_j\}_{j=1}^{N}$ in (4.6) are

- $\displaystyle \frac{1}{N^2} \sum_{i \ne j} \hat{C}_{d,r} \exp\left( -\frac{\|x_i - x_j\|^2}{4r^2} \right)$ (on the LHS) and
- $\displaystyle -\frac{1}{N^2} \sum_{i \ne j} G_d(x_i, x_j)$ (on the RHS).

We make the above rough idea rigorous in Section 4.2 below. That is, we derive a bound of the energy of the Gaussian kernel by using Theorem 4.4.

4.2 A bound of the energy of the Gaussian kernel

We provide a lower bound of the integral of the heat kernel in Theorem 4.4 by the following lemma, whose proof is provided in Appendix C.

Lemma 4.5. Let $\Omega \subset \mathbb{R}^d$ be a bounded region with $D := \operatorname{diam} \Omega < \infty$ and let $x, y \in \Omega$ be distinct points in the region. Then, for any $t$ with $t \ge D^2/d$, we have
$$\int_0^t \frac{1}{(4\pi s)^{d/2}} \exp\left( -\frac{\|x - y\|^2}{4s} \right) \mathrm{d}s \ge \frac{1}{(4\pi)^{d/2}} \left[ \frac{d^{d/2-1}}{2 D^{d-2}} \exp\left( -\frac{d\, \|x - y\|^2}{4 D^2} \right) + h_{d,D}(t) \right], \qquad (4.9)$$
where
$$h_{d,D}(t) := \begin{cases} e^{-d/4}\, \dfrac{2}{d-2} \left( \dfrac{d^{d/2-1}}{D^{d-2}} - \dfrac{1}{t^{d/2-1}} \right) & (d \ge 3), \\[2mm] e^{-1/2} \log\left( \dfrac{2t}{D^2} \right) & (d = 2). \end{cases} \qquad (4.10)$$

Remark 4.1. The function $h_{d,D}$ in (4.10) satisfies $h_{d,D}(t) \ge 0$ for any $t$ with $t \ge D^2/d$.

By combining Theorem 4.4 and Lemma 4.5, we obtain the following estimate of the energy of the Gaussian kernel.

Theorem 4.6. Let $\Omega \subset \mathbb{R}^d$ be a bounded region with $D := \operatorname{diam} \Omega < \infty$. Let $\{x_i\}_{i=1}^{N} \subset \Omega$ be a set of distinct points in $\Omega$ and let $\mu_N$ be the discrete measure given by (4.3). Then, for any $t$ with $t \ge D^2/d$ and any $a$ with $a \ge \sqrt{d}/(2D)$, we have
$$\frac{1}{N^2} \sum_{i \ne j} \exp\left( -a^2 \|x_i - x_j\|^2 \right) \le \frac{2 (4\pi)^{d/2} D^{d-2}}{d^{d/2-1}} \left( -\frac{1}{N^2} \sum_{i \ne j} G_d(x_i, x_j) + A_d(t, \mu_N) - \frac{C_d(t)}{N} - \frac{1}{(4\pi)^{d/2}}\, \frac{N-1}{N}\, h_{d,D}(t) \right), \qquad (4.11)$$
where $A_d(t, \mu_N)$ and $C_d(t)$ are given by (4.4) and (4.5), respectively.

Proof. It follows from inequality (4.9) in Lemma 4.5 and equality (4.6) in Theorem 4.4 that
$$\frac{1}{N^2} \sum_{i \ne j} \frac{1}{(4\pi)^{d/2}}\, \frac{d^{d/2-1}}{2 D^{d-2}} \exp\left( -a^2 \|x_i - x_j\|^2 \right) \le \frac{1}{N^2} \sum_{i \ne j} \frac{1}{(4\pi)^{d/2}}\, \frac{d^{d/2-1}}{2 D^{d-2}} \exp\left( -\frac{d\, \|x_i - x_j\|^2}{4 D^2} \right)$$
$$\le \frac{1}{N^2} \sum_{i \ne j} \int_0^t \frac{\exp\left( -\frac{\|x_i - x_j\|^2}{4s} \right)}{(4\pi s)^{d/2}}\, \mathrm{d}s - \frac{1}{(4\pi)^{d/2}}\, \frac{N-1}{N}\, h_{d,D}(t) = A_d(t, \mu_N) - \frac{C_d(t)}{N} - \frac{1}{N^2} \sum_{i \ne j} G_d(x_i, x_j) - \frac{1}{(4\pi)^{d/2}}\, \frac{N-1}{N}\, h_{d,D}(t).$$
Thus we have the conclusion.

In the parentheses on the RHS of (4.11), the term $A_d(t, \mu_N)$ also depends on the set $\{x_i\}_{i=1}^{N}$. However, we can expect this dependence to disappear as $t \to \infty$. Therefore minimization of
$$-\frac{1}{N^2} \sum_{i \ne j} G_d(x_i, x_j)$$
will provide an approximate minimizer of the energy of the Gaussian kernel.
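As a numerical sanity check of Lemma 4.5 as reconstructed here, the following snippet compares the heat-kernel integral with the lower bound (4.9) for sample values of $d$, $D$, $\alpha = \|x - y\|$, and $t \ge D^2/d$; the particular values are arbitrary, and the formulas for $h_{d,D}$ follow (4.10).

```python
import numpy as np
from scipy.integrate import quad

# numerical check of the lower bound (4.9) for sample parameters
d, D, alpha, t = 3, np.sqrt(3.0), 0.7, 5.0   # assumed values; alpha = ||x - y|| <= D

lhs, _ = quad(lambda s: (4*np.pi*s)**(-d/2) * np.exp(-alpha**2/(4*s)), 0.0, t)

if d == 2:
    h = np.exp(-0.5) * np.log(2*t/D**2)
else:
    h = np.exp(-d/4) * 2/(d - 2) * (d**(d/2 - 1)/D**(d - 2) - t**(1 - d/2))

rhs = (4*np.pi)**(-d/2) * (d**(d/2 - 1)/(2*D**(d - 2))
                           * np.exp(-d*alpha**2/(4*D**2)) + h)
print(lhs >= rhs, lhs, rhs)   # the bound should hold whenever t >= D^2/d
```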
5 Point-wise gradient descent method

In this section, we present a method for generating points for quadrature based on the arguments in Section 4. Then, based on the relation in (3.5), we compute the optimal weights by using Formula (3.1) to obtain a quadrature formula.

5.1 Objective functions

According to Theorem 4.6, we can provide an upper bound of the value in (2.4) in the case of the Gaussian kernel $K(x, y) = \exp(-a^2 \|x - y\|^2)$ with $a \ge \sqrt{d}/(2D)$ and the equal weights (2.5). That is, the value
$$-\frac{2}{N} \sum_{j=1}^{N} \int_\Omega K(x_j, x)\,\mathrm{d}\mu(x) - \frac{\hat{C}_{d,D}}{N^2} \sum_{i \ne j} G_d(x_i, x_j) + \hat{C}_{d,D} \left( A_d(t, \mu_N) - \frac{C_d(t)}{N} - \frac{1}{(4\pi)^{d/2}}\, \frac{N-1}{N}\, h_{d,D}(t) \right) \qquad (5.1)$$
is such an upper bound, up to the additive constant $1/N$ arising from the diagonal terms $K(x_j, x_j) = 1$, where
$$\hat{C}_{d,D} := \frac{2 (4\pi)^{d/2} D^{d-2}}{d^{d/2-1}}.$$
Since the part of (5.1) in the parentheses is almost independent of the set $\{x_j\}_{j=1}^{N}$, we minimize the other part of (5.1) to obtain an approximate minimizer of the worst-case error. In the following, we deal with the Gaussian kernel $K(x, y) = \exp(-\|x - y\|^2)$; that is, we consider the case $a = 1$ in Theorem 4.6. Let $J(x)$ be defined by
$$J(x) := \int_0^1 \exp(-(x - y)^2)\,\mathrm{d}y = \frac{\sqrt{\pi}}{2} \left( \operatorname{erf}(1 - x) + \operatorname{erf}(x) \right).$$
In the following, we consider the two- and three-dimensional cases.

The case $d = 2$. We consider the region $\Omega = [0, 1]^2$ and the measure $\mathrm{d}\mu(x) = \mathrm{d}x$. Then we have $D = \sqrt{2}$ and $\hat{C}_{d,D} = 8\pi$, and we can confirm that $a = 1 \ge \sqrt{d}/(2D) = 1/2$. Furthermore, we have
$$\int_\Omega K(x, y)\,\mathrm{d}\mu(y) = J(x^{(1)})\, J(x^{(2)}) =: J_2(x), \qquad (5.2)$$
where $x^{(1)}$ and $x^{(2)}$ are the first and second components of $x \in [0, 1]^2$, respectively. From these and the two-dimensional fundamental solution in (4.1), the objective function in (5.1) is written in the form
$$I_2(x_1, \ldots, x_N) := -\frac{2}{N} \sum_{j=1}^{N} J_2(x_j) + \frac{4}{N^2} \sum_{i \ne j} \log \frac{1}{\|x_i - x_j\|}. \qquad (5.3)$$

The case $d = 3$. We consider the region $\Omega = [0, 1]^3$ and the measure $\mathrm{d}\mu(x) = \mathrm{d}x$. Then we have $D = \sqrt{3}$ and $\hat{C}_{d,D} = 16\pi^{3/2}$, and we can confirm that $a = 1 \ge \sqrt{d}/(2D) = 1/2$. Furthermore, we have
$$\int_\Omega K(x, y)\,\mathrm{d}\mu(y) = J(x^{(1)})\, J(x^{(2)})\, J(x^{(3)}) =: J_3(x), \qquad (5.4)$$
where $x^{(1)}$, $x^{(2)}$, and $x^{(3)}$ are the first, second, and third components of $x \in [0, 1]^3$, respectively. From these and the three-dimensional fundamental solution in (4.1), the objective function in (5.1) is written in the form
$$I_3(x_1, \ldots, x_N) := -\frac{2}{N} \sum_{j=1}^{N} J_3(x_j) + \frac{2\sqrt{\pi}}{N^2} \sum_{i \ne j} \frac{1}{\|x_i - x_j\|}. \qquad (5.5)$$

Remark 5.1. The functions $I_2$ and $I_3$ are similar to the Riesz energies (see, e.g., [4]), which have an intrinsic repelling property. As shown in Section 6, this property seems to be well suited to the method introduced in Section 5.3. Similar energies are considered, and different algorithms are applied to them, in [11, 12].

5.2 Regularization terms

In the functions $I_2$ and $I_3$, the sums of $J_2(x_j)$ and $J_3(x_j)$ play the role of regularization terms, respectively. However, we observed that their effects were so weak that minimization of $I_2$ and $I_3$ made the points $x_j$ accumulate near the boundary of $\Omega$. Since such a distribution is not appropriate for quadrature on $\Omega$, we introduce stronger regularization terms.

To this end, we begin by approximating the characteristic functions of the regions $[0, 1]^2$ and $[0, 1]^3$:
$$\delta_{[0,1]^d}(x) := \begin{cases} 0 & (x \in [0, 1]^d), \\ \infty & (x \notin [0, 1]^d), \end{cases} \qquad (d = 2, 3).$$
There are various possible choices of such approximate functions. In this paper, by using a real number $M > 0$, we choose
$$\tilde{\delta}^{(2)}_M(x) := \sum_{\ell=1}^{2} \left( \log \frac{1}{x^{(\ell)} + M} + \log \frac{1}{1 + M - x^{(\ell)}} \right)$$
and
$$\tilde{\delta}^{(3)}_M(x) := \sum_{\ell=1}^{3} \left( \frac{1}{x^{(\ell)} + M} + \frac{1}{1 + M - x^{(\ell)}} \right)$$
for $\delta_{[0,1]^2}(x)$ and $\delta_{[0,1]^3}(x)$, respectively. The number $M$ sets a margin outside the boundaries of the regions. By using these, we introduce
$$R_2(x_1, \ldots, x_N) := \frac{1}{N^P} \sum_{i=1}^{N} \tilde{\delta}^{(2)}_M(x_i) \qquad (5.6)$$
and
$$R_3(x_1, \ldots, x_N) := \frac{1}{N^P} \sum_{i=1}^{N} \tilde{\delta}^{(3)}_M(x_i) \qquad (5.7)$$
as regularization terms in the cases $d = 2$ and $d = 3$, respectively. The parameter $P$ determines the strength of these terms.
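A direct transcription of $I_2 + R_2$ from (5.3) and (5.6) reads as follows; the hyper-parameter values $(P, M)$ below are placeholders, since the paper tunes them according to $N$.

```python
import numpy as np
from scipy.special import erf

def J(x):
    # J(x) = (sqrt(pi)/2) (erf(1 - x) + erf(x)), the coordinate-wise integral
    return 0.5 * np.sqrt(np.pi) * (erf(1.0 - x) + erf(x))

def I2(X):
    # discrete energy (5.3); X has shape (N, 2)
    N = len(X)
    term1 = -2.0 / N * np.prod(J(X), axis=1).sum()
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(N, k=1)
    # sum over i != j equals twice the sum over the upper triangle
    term2 = 4.0 / N**2 * 2.0 * np.log(1.0 / D[iu]).sum()
    return term1 + term2

def R2(X, P, M):
    # regularization term (5.6): logarithmic barriers at -M and 1 + M
    bar = -np.log(X + M) - np.log(1.0 + M - X)
    return bar.sum() / len(X)**P

rng = np.random.default_rng(0)
X = rng.random((50, 2))
print(I2(X) + R2(X, P=0.5, M=0.1))   # (P, M) = (0.5, 0.1) is a hypothetical choice
```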
We generate a set $\{x_j\}$ of points by minimizing the functions
$$I_d(x_1, \ldots, x_N) + R_d(x_1, \ldots, x_N) \qquad (d = 2, 3).$$
The hyper-parameters $P$ and $M$ are chosen so that a good distribution of points is obtained by the algorithm proposed in Section 5.3 below.

Remark 5.2. The functions $\tilde{\delta}^{(2)}_M$ and $\tilde{\delta}^{(3)}_M$ are not based on any theory, although they imitate the corresponding fundamental solutions. In addition, the factor $N^{-P}$ in (5.6) and (5.7) may have room for improvement. Finding more appropriate regularization terms based on a theory is a topic for future work.

Remark 5.3. In [9], the authors use heat kernels, which are time-dependent Gaussian kernels in a special case, to generate points on compact manifolds via simulated annealing. They show the superiority of the points given by the heat kernels over those given by the Riesz kernels and other QMC sequences. In Section 6 of this paper, we observe that the functions $I_d$ with the regularization terms $R_d$ can be superior to the functions of the worst-case error for the Gaussian kernel.

5.3 The algorithm

Here we propose an algorithm for generating a set $\{x_j\}$ of points by using the function $I_d + R_d$. We intend to realize an algorithm so simple and computationally cheap that it can be applied to high-dimensional cases with many points. To this end, it is better to avoid the preparation of many candidate points in $\Omega$ among which points for quadrature are selected.

Taking these considerations into account, we propose Algorithm 1 for generating points for quadrature. We call it a point-wise gradient descent method (PWGD). After preparing an initial set $\{x_j\}_{j=1}^{N} \subset \Omega$, the algorithm updates its members one by one with the gradient of the function $I_d + R_d$ with respect to each member.

Algorithm 1 Point-wise gradient descent method (PWGD)
Require: a number $N$ of points, a default step size $\gamma > 0$, a tolerance $\epsilon > 0$
Ensure: a set $\{x_i\}_{i=1}^{N} \subset \Omega$ of points
  generate an initial set $\{x_i\}_{i=1}^{N} \subset \Omega$ randomly
  for $k = 1, \ldots, K_{\max}$ do
    for $i = 1, \ldots, N$ do
      $g_i := \nabla_{x_i} \{ I_d(x_1, \ldots, x_N) + R_d(x_1, \ldots, x_N) \}$
      $\gamma' := \max\{ \beta \ge 0 \mid x_i - \beta g_i \in \Omega \}$
      $x_i \leftarrow x_i - \min\{\gamma, \gamma'\}\, g_i$
    end for
    if $\max_{1 \le i \le N} \|g_i\| < \epsilon$ then
      break
    end if
  end for
  return $\{x_i\}_{i=1}^{N}$

Finally, we obtain a set $\{(x_j, w^*_j)\}_{j=1}^{N}$ of points and weights for quadrature by the following procedure.

1. Obtain a set $\{x_j\}_{j=1}^{N} \subset \Omega$ by Algorithm 1 (PWGD) with the equal weights $w_j = 1/N$.
2. Compute the optimal weights $\{w^*_j\}_{j=1}^{N}$ by Formula (3.1): $\boldsymbol{w}^* = K_{X_N}^{-1} \int_\Omega \boldsymbol{k}_{X_N}(x)\,\mathrm{d}\mu(x)$.
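The following is a minimal Python sketch of Algorithm 1 for $d = 2$ (the paper's own implementation is in MATLAB [19]). For brevity, the gradient $\nabla_{x_i}(I_2 + R_2)$ is approximated by central differences rather than evaluated analytically, and the step is shrunk so that each update stays in $\Omega = [0,1]^2$; all parameter values are illustrative.

```python
import numpy as np
from scipy.special import erf

def J(x):
    return 0.5 * np.sqrt(np.pi) * (erf(1.0 - x) + erf(x))

def objective(X, P=0.5, M=0.1):
    # I_2 + R_2 from (5.3) and (5.6); (P, M) are hypothetical values
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pair = -np.log(D[np.triu_indices(N, k=1)]).sum() * 8.0 / N**2
    main = -2.0 / N * np.prod(J(X), axis=1).sum() + pair
    reg = (-np.log(X + M) - np.log(1.0 + M - X)).sum() / N**P
    return main + reg

def pwgd(N=50, gamma=1.0, K_max=50, tol=1e-4, seed=0):
    # point-wise gradient descent (Algorithm 1), gradient by central differences
    rng = np.random.default_rng(seed)
    X, h = rng.random((N, 2)), 1e-6
    for _ in range(K_max):
        gmax = 0.0
        for i in range(N):
            g = np.zeros(2)
            for l in range(2):           # finite-difference gradient w.r.t. x_i
                E = np.zeros((N, 2)); E[i, l] = h
                g[l] = (objective(X + E) - objective(X - E)) / (2 * h)
            step = gamma
            # shrink the step so that x_i - step * g stays in [0,1]^2
            for l in range(2):
                if g[l] > 0: step = min(step, X[i, l] / g[l])
                if g[l] < 0: step = min(step, (X[i, l] - 1.0) / g[l])
            X[i] -= step * g
            gmax = max(gmax, np.linalg.norm(g))
        if gmax < tol:
            break
    return X

X = pwgd()
print(X.min(), X.max())   # all points remain in [0,1]^2
```

Note that each sweep costs $O(N^2 d)$ objective-gradient work and requires no candidate set, which is the point of the method; an analytic gradient (as in the MATLAB programs [19]) would remove the finite-difference overhead.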
6 Numerical experiments

We compute the sets $\{(x_j, w^*_j)\}_{j=1}^{N}$ by the proposed procedure in the cases $K(x, y) = \exp(-\|x - y\|^2)$ and $\Omega = [0, 1]^d$ for $d = 2, 3$. We set $\gamma = 1$ in Algorithm 1 and choose the hyper-parameters $P$ and $M$ experimentally. For comparison, we also use the following methods:

M1 Sequential Bayesian quadrature (SBQ);
M2 Application of the point-wise gradient descent method to the worst-case error with the equal weights, $e^{\mathrm{wor}}(X_N, \{1/N\}_{i=1}^{N}; \mathcal{H}_K)$.

In particular, we deal with the latter method to confirm the effect of the proposed objective. For M2, we set a smaller default step size $\gamma$, and we used tolerances $\epsilon$ of the form $10^{-k}$ in Algorithm 1 in the cases $d = 2$ and $d = 3$, respectively. MATLAB programs are used for all the computation in this section, and the computation is done with double-precision floating-point numbers. The programs used for the computation are available on the web page [19].

6.1 The case d = 2

First, we show the generated points in Figures 1–4. We can observe that the points given by the proposed procedure are separated from each other and do not gather around certain points, as opposed to those given by the other methods.

[Figure 1: 50 points given by M1 (SBQ).]
[Figure 2: 50 points given by M2 (PWGD for the original worst-case error).]
[Figures 3 and 4: 50 points given by the proposed procedure for two settings of the hyper-parameters (P, M).]

The default step size $\gamma$ in Algorithm 1 is determined experimentally. In most cases, we observed that the inner iteration terminated before it reached the maximum number $K_{\max}$ of iterations. The hyper-parameters $(P, M)$ are set appropriately according to $N$, the number of points.

Next, we show the squared worst-case errors in Figures 5 and 6.

[Figures 5 and 6: Squared worst-case errors for d = 2. The horizontal axis corresponds to N. The legend "PWGD with Gauss" indicates method M2, and "PWGD (P, M)" indicates the proposed procedure with the hyper-parameters (P, M).]

6.2 The case d = 3

First, we show the generated points in Figures 7–10. We can observe situations similar to the two-dimensional case.

[Figure 7: 100 points given by M1 (SBQ).]
[Figure 8: 100 points given by M2 (PWGD for the original worst-case error).]
[Figures 9 and 10: 100 points given by the proposed procedure for two settings of the hyper-parameters (P, M).]

Next, we show the squared worst-case errors in Figures 11 and 12. The results indicate that the discrete energy $I_d + R_d$ is well suited to the point-wise gradient descent algorithm. On the other hand, the performance of the proposed procedure seems to be slightly worse than that of method M1 (SBQ), although the former outperforms the latter in some cases.

[Figures 11 and 12: Squared worst-case errors for d = 3. The horizontal axis corresponds to N. The legend "PWGD with Gauss" indicates method M2, and "PWGD (P, M)" indicates the proposed procedure with the hyper-parameters (P, M).]

7 Concluding remarks

By using the fundamental solution of the Laplacian on $\mathbb{R}^d$, we have provided the upper bound in (5.1) for the main terms in (2.4) of the worst-case error in the RKHS of the Gaussian kernel. Based on the bound, we have proposed the objective functions with the regularization terms and Algorithm 1 (PWGD) for generating points for quadrature in Section 5. Then quadrature formulas are obtained by calculating the optimal weights with respect to the generated points. By the numerical experiments in Section 6, we have observed that this procedure can outperform SBQ and the PWGD applied to the original worst-case error if the hyper-parameters are appropriate. We surmise that direct application of the PWGD to the original worst-case error tends to make a set of points trapped in a bad local minimum.

We mention some topics for future work. Since the upper bound in (5.1) is loose and the regularization terms are artificial, it will be better to find discrete energies that are more suited to the PWGD. In addition, a theoretical guarantee of the performance of the PWGD is required. Furthermore, we will consider the generalization of these results to other kernels.

Acknowledgements

This work was partly supported by JST, PRESTO Grant Number JPMJPR2023, Japan.
References

[1] Bach, F.: On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18, 714–751 (2017)
[2] Bach, F., Lacoste-Julien, S., and Obozinski, G.: On the equivalence between herding and conditional gradient algorithms. ICML'12: Proceedings of the 29th International Conference on Machine Learning, 1355–1362 (2012)
[3] Belhadji, A., Bardenet, R., and Chainais, P.: Kernel quadrature with DPPs. In: Advances in Neural Information Processing Systems, 12927–12937 (2019)
[4] Brauchart, J. S., and Grabner, P. J.: Distributing many points on spheres: Minimal energy and designs. J. Complexity 31, 293–326 (2015)
[5] Briol, F.-X., Oates, C., Girolami, M., and Osborne, M.: Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In: Advances in Neural Information Processing Systems, 1162–1170 (2015)
[6] Chen, W., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C.: Stein points. In: Proceedings of the 35th International Conference on Machine Learning, PMLR 80, 844–853 (2018)
[7] Chen, Y., Welling, M., and Smola, A.: Super-samples from kernel herding. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, UAI'10, 109–116, AUAI Press, Arlington, Virginia, United States (2010)
[8] Liu, Q., and Lee, J. D.: Black-box importance sampling. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 54, 952–961 (2017)
[9] Lu, J., Sachs, M., and Steinerberger, S.: Quadrature points via heat kernel repulsion. Constr. Approx. 51, 27–48 (2020)
[10] Huszár, F., and Duvenaud, D.: Optimally-weighted herding is Bayesian quadrature. UAI'12: Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 377–386 (2012)
[11] Joseph, V., Dasgupta, T., Tuo, R., and Wu, C.: Sequential exploration of complex surfaces using minimum energy designs. Technometrics, 57(1), 64–74 (2015)
[12] Joseph, V., Wang, D., Gu, L., Lyu, S., and Tuo, R.: Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics, 61(3), 297–308 (2019)
[13] Karvonen, T., Särkkä, S., and Tanaka, K.: Kernel-based interpolation at approximate Fekete points. Numer. Algor. (2020)
[14] Oettershagen, J.: Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. Ph.D. thesis, Institut für Numerische Simulation, Universität Bonn (2017)
[15] Pronzato, L., and Zhigljavsky, A.: Bayesian quadrature, energy minimization and space-filling design. SIAM/ASA J. Uncertainty Quantification, 8(3), 959–1011 (2020)
[16] Pronzato, L., and Zhigljavsky, A.: Minimum-energy measures for singular kernels. Journal of Computational and Applied Mathematics 382, 113089 (2021)
[17] Steinerberger, S.: On the logarithmic energy of points on S^2. arXiv:2011.04630 (2020)
[18] Tanaka, K.: Generation of point sets by convex optimization for interpolation in reproducing kernel Hilbert spaces. Numer. Algor. 84, 1049–1079 (2020)
[19] Tanaka, K.: MATLAB programs for a point-wise gradient descent method for kernel quadrature. https://github.com/KeTanakaN/mat_PWGD_for_KQ (last accessed on February 21, 2021)
[20] Teymur, O., Gorham, J., Riabiz, M., and Oates, C.: Optimal quantisation of probability measures using maximum mean discrepancy. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 130 (2021)
A Derivation of Formula (3.3)

Note that the right-hand side of (2.3) is rewritten in the form
$$\int_\Omega \int_\Omega K(x, y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y) - 2 \int_\Omega \boldsymbol{w}^{\mathrm{T}} \boldsymbol{k}_{X_N}(x)\,\mathrm{d}\mu(x) + \boldsymbol{w}^{\mathrm{T}} K_{X_N} \boldsymbol{w} \qquad (\mathrm{A.1})$$
by using $\boldsymbol{w} = (w_1, \ldots, w_N)^{\mathrm{T}}$. Then, by letting $\boldsymbol{w} = \boldsymbol{w}^*$ in (A.1) and writing $\boldsymbol{k} = (k_1, \ldots, k_N)^{\mathrm{T}} = \int_\Omega \boldsymbol{k}_{X_N}(x)\,\mathrm{d}\mu(x)$ with $k_{00}$ and $k_i$ given by (3.4), we have
$$k_{00} - 2 \boldsymbol{k}^{\mathrm{T}} K_{X_N}^{-1} \boldsymbol{k} + \boldsymbol{k}^{\mathrm{T}} K_{X_N}^{-1} \boldsymbol{k} = k_{00} - \boldsymbol{k}^{\mathrm{T}} K_{X_N}^{-1} \boldsymbol{k} = k_{00} + \frac{1}{\det K_{X_N}} \det \begin{pmatrix} 0 & k_1 & \cdots & k_N \\ k_1 & & & \\ \vdots & & K_{X_N} & \\ k_N & & & \end{pmatrix} = \frac{1}{\det K_{X_N}} \det \begin{pmatrix} k_{00} & k_1 & \cdots & k_N \\ k_1 & & & \\ \vdots & & K_{X_N} & \\ k_N & & & \end{pmatrix},$$
where the last two equalities follow from the standard expansion of a bordered determinant. This is Formula (3.3).
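The bordered-determinant identity used in the last step, $k_{00} - \boldsymbol{k}^{\mathrm{T}} K_{X_N}^{-1} \boldsymbol{k} = \det(\cdot)/\det K_{X_N}$, can be checked numerically; the matrices below are random stand-ins for the kernel quantities, not actual kernel integrals.

```python
import numpy as np

# check: k00 - k^T K^{-1} k == det([[k00, k^T], [k, K]]) / det(K)
rng = np.random.default_rng(0)
N = 6
A = rng.standard_normal((N, N))
K = A @ A.T + N * np.eye(N)          # positive definite stand-in for K_{X_N}
k = rng.standard_normal(N)           # stand-in for (k_1, ..., k_N)
k00 = rng.random() + 1.0             # stand-in for the double integral k_00

lhs = k00 - k @ np.linalg.solve(K, k)
B = np.block([[np.atleast_2d(k00), k[None, :]], [k[:, None], K]])
rhs = np.linalg.det(B) / np.linalg.det(K)
print(np.isclose(lhs, rhs))          # True up to rounding
```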
B Integrability of the fundamental solution of the Laplacian

Regard the fundamental solution $G_d(x, y)$ in (4.1) as a function of $x$ for a fixed $y$. Then, for $d \ge 2$, it has a singularity at $x = y$. Here we confirm that it is integrable on a bounded neighborhood of the singularity. In the following, we assume $y = 0$ without loss of generality. Let $B_d[0, R] := \{ x \in \mathbb{R}^d \mid \|x\| \le R \}$ be a closed ball.

B.1 The case d = 2

We assume $R \le 1$. Then we have
$$\int_{B_2[0,R]} |G_2(x, 0)|\,\mathrm{d}x = -\int_0^R r\,\mathrm{d}r \int_0^{2\pi} \mathrm{d}\theta\, \frac{1}{2\pi} \log r = \frac{R^2 (1 - 2 \log R)}{4} < \infty.$$

B.2 The case d ≥ 3

We have
$$\int_{B_d[0,R]} |G_d(x, 0)|\,\mathrm{d}x = \int_0^R r^{d-1}\,\mathrm{d}r \cdot s_d \cdot \frac{1}{(d-2)\, s_d\, r^{d-2}} = \frac{R^2}{2(d-2)} < \infty.$$

C Proofs

Proof of Lemma 4.1. By integration by parts and equality (4.2), we have
$$\frac{\partial}{\partial s}(\mathrm{LHS}) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y) \int_{\mathbb{R}^d} \mathrm{d}x\, G_d(x, y)\, \Delta_x e^{s\Delta_x}\delta_a(x) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y) \int_{\mathbb{R}^d} \mathrm{d}x\, \Delta_x G_d(x, y)\, e^{s\Delta_x}\delta_a(x)$$
$$= \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y) \int_{\mathbb{R}^d} \mathrm{d}x\, \delta_y(x)\, e^{s\Delta_x}\delta_a(x) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y)\, e^{s\Delta_y}\delta_a(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{(s+t)\Delta_y}\delta_b(y)\, \delta_a(y),$$
and
$$\frac{\partial}{\partial s}(\mathrm{RHS}) = \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(a, y)\, \Delta_y e^{(s+t)\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, \Delta_y G_d(a, y)\, e^{(s+t)\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, \delta_a(y)\, e^{(s+t)\Delta_y}\delta_b(y).$$
Therefore the two derivatives coincide. Furthermore, by taking the limits of both sides as $s \to +0$, we have
$$\lim_{s \to +0}(\mathrm{LHS}) = \int_{\mathbb{R}^d} \mathrm{d}x \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(x, y)\, \delta_a(x)\, e^{t\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y) \int_{\mathbb{R}^d} \mathrm{d}x\, G_d(x, y)\, \delta_a(x) = \int_{\mathbb{R}^d} \mathrm{d}y\, e^{t\Delta_y}\delta_b(y)\, G_d(a, y)$$
and
$$\lim_{s \to +0}(\mathrm{RHS}) = \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(a, y)\, e^{t\Delta_y}\delta_b(y).$$
Hence the conclusion holds.

Proof of Lemma 4.2. Note that $e^{t\Delta}\delta_b$ is the heat kernel with center $b$:
$$e^{t\Delta_y}\delta_b(y) = \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{\|y - b\|^2}{4t} \right). \qquad (\mathrm{C.1})$$
Therefore the function $e^{t\Delta_y}\delta_b(y)$ depends only on $t$ and the difference $y - b$. In addition, the function $G_d(b, y)$ depends only on $y - b$, as shown by formula (4.1). Then, by integration by substitution with $z = y - b$, we have
$$\int_{\mathbb{R}^d} \mathrm{d}y\, G_d(b, y)\, e^{t\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}z\, G_d(0, z)\, e^{t\Delta_z}\delta_0(z) = \frac{1}{(4\pi t)^{d/2}} \int_{\mathbb{R}^d} \mathrm{d}z\, G_d(z, 0) \exp\left( -\frac{\|z\|^2}{4t} \right).$$
Clearly this value is independent of $b$. Furthermore, to show the boundedness of this value, we use the following estimate:
$$\left| \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(b, y)\, e^{t\Delta_y}\delta_b(y) \right| \le \frac{1}{(4\pi t)^{d/2}} \left( \int_{B_d[0,1]} + \int_{\mathbb{R}^d \setminus B_d[0,1]} \right) \mathrm{d}z\, |G_d(z, 0)| \exp\left( -\frac{\|z\|^2}{4t} \right) \le \frac{1}{(4\pi t)^{d/2}} \left( \int_{B_d[0,1]} \mathrm{d}z\, |G_d(z, 0)| + \int_{\mathbb{R}^d \setminus B_d[0,1]} \mathrm{d}z\, |G_d(z, 0)| \exp\left( -\frac{\|z\|^2}{4t} \right) \right). \qquad (\mathrm{C.2})$$
The first term in the parentheses is bounded because of the argument in Appendix B. To estimate the second term, we note that $|G_d(z, 0)|$ depends only on $\|z\|$ and is bounded by $c_d \|z\|$ for some constant $c_d > 0$ when $\|z\| \ge 1$. Therefore we have
$$\int_{\mathbb{R}^d \setminus B_d[0,1]} \mathrm{d}z\, |G_d(z, 0)| \exp\left( -\frac{\|z\|^2}{4t} \right) \le s_d \int_1^\infty r^{d-1}\,\mathrm{d}r \cdot c_d\, r \cdot \exp\left( -\frac{r^2}{4t} \right) = s_d c_d \int_1^\infty r^d \exp\left( -\frac{r^2}{4t} \right) \mathrm{d}r < \infty.$$
From these, the RHS of (C.2) is bounded.

Proof of Lemma 4.3. When $t \to +0$, the LHS tends to $G_d(a, b)$. By differentiating the LHS, we have
$$\frac{\partial}{\partial t}(\mathrm{LHS}) = \int_{\mathbb{R}^d} \mathrm{d}y\, G_d(a, y)\, \Delta_y e^{t\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, \Delta_y G_d(a, y)\, e^{t\Delta_y}\delta_b(y) = \int_{\mathbb{R}^d} \mathrm{d}y\, \delta_a(y)\, e^{t\Delta_y}\delta_b(y) = \left( e^{t\Delta}\delta_b \right)(a) = \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{\|a - b\|^2}{4t} \right),$$
where we use formula (C.1) in the last equality. Then the conclusion follows.

Proof of Lemma 4.5. Set $\alpha := \|x - y\|$, which satisfies $0 < \alpha \le D$. For preparation, we define a function $g$ by
$$g(s) := \frac{1}{s^{d/2}} \exp\left( -\frac{\alpha^2}{4s} \right) \qquad (s > 0).$$
Since
$$g'(s) = \frac{1}{4 s^{d/2+2}} \exp\left( -\frac{\alpha^2}{4s} \right) (\alpha^2 - 2ds),$$
the function $g$ is unimodal and attains its maximum at $s = \alpha^2/(2d)$. Then, if we set $s^* = D^2/(2d)$, the function $g$ is monotone decreasing on $[s^*, \infty)$ because $\alpha \le D$. Therefore we have $g(s) \ge g(2s^*)$ for any $s$ with $s^* \le s \le 2s^*$. By using this inequality, we can derive the following estimate:
$$\int_0^t \frac{1}{(4\pi s)^{d/2}} \exp\left( -\frac{\alpha^2}{4s} \right) \mathrm{d}s \ge \frac{1}{(4\pi)^{d/2}} \left( \int_{s^*}^{2s^*} + \int_{2s^*}^t \right) g(s)\,\mathrm{d}s \ge \frac{1}{(4\pi)^{d/2}} \left( s^* g(2s^*) + \exp\left( -\frac{\alpha^2}{8 s^*} \right) \int_{2s^*}^t \frac{\mathrm{d}s}{s^{d/2}} \right)$$
$$\ge \frac{1}{(4\pi)^{d/2}} \left( s^* g(2s^*) + \exp\left( -\frac{D^2}{8 s^*} \right) \int_{2s^*}^t \frac{\mathrm{d}s}{s^{d/2}} \right) = \begin{cases} \dfrac{1}{(4\pi)^{d/2}} \left[ \dfrac{d^{d/2-1}}{2 D^{d-2}} \exp\left( -\dfrac{d \alpha^2}{4 D^2} \right) + e^{-d/4}\, \dfrac{2}{d-2} \left( \dfrac{d^{d/2-1}}{D^{d-2}} - \dfrac{1}{t^{d/2-1}} \right) \right] & (d \ge 3), \\[2mm] \dfrac{1}{4\pi} \left[ \dfrac{1}{2} \exp\left( -\dfrac{\alpha^2}{2 D^2} \right) + e^{-1/2} \log\left( \dfrac{2t}{D^2} \right) \right] & (d = 2), \end{cases}$$
which is the bound (4.9).