A Mean-Field Theory for Learning the Schönberg Measure of Radial Basis Functions
Masoud Badiei Khuzani†,‡, Yinyu Ye†, Sandy Napel‡,∗, Lei Xing‡,∗

† Department of Management Science, Stanford University, CA 94043, USA.
∗ Department of Electrical Engineering, Stanford University, CA 94043, USA.
‡ Department of Radiation Oncology, Stanford University, CA 94043, USA.

Abstract.
We develop and analyze a projected particle Langevin optimization method to learn the distribution in the Schönberg integral representation of the radial basis functions from training samples. More specifically, we characterize a distributionally robust optimization method with respect to the Wasserstein distance to optimize the distribution in the Schönberg integral representation. To provide theoretical performance guarantees, we analyze the scaling limits of a projected particle online (stochastic) optimization method in the mean-field regime. In particular, we prove that in the scaling limits, the empirical measure of the Langevin particles converges to the law of a reflected Itô diffusion-drift process. The distinguishing feature of the derived process is that its drift component is also a function of the law of the underlying process. Using the Itô lemma for semi-martingales and Girsanov's change of measure for Wiener processes, we then derive a McKean-Vlasov type partial differential equation (PDE) with Robin boundary conditions that describes the evolution of the empirical measure of the projected Langevin particles in the mean-field regime. In addition, we establish the existence and uniqueness of the steady-state solutions of the derived PDE in the weak sense. We apply our learning approach to train radial kernels in the kernel locally sensitive hash (LSH) functions, where the training data-set is generated via a $k$-means clustering method on a small subset of the data-base. We subsequently apply our kernel LSH with a trained kernel to an image retrieval task on the MNIST data-set, and demonstrate the efficacy of our kernel learning approach. We also apply our kernel learning approach in conjunction with kernel support vector machines (SVMs) for classification of benchmark data-sets.

1. Introduction

Radial basis functions (RBFs) are key tools in statistical machine learning to approximate multivariate functions by linear combinations of terms based on a single univariate function. Such non-parametric regression schemes are typically used to approximate multivariate functions or high-dimensional data [39], [8] which are only known at a finite number of points or are too difficult to evaluate otherwise, so that evaluations of the approximating function can take place often and efficiently. The accuracy and performance of such techniques, however, rely to a large extent on a good choice of RBF kernel that captures the structure of the data. Standard regression methods based on RBF kernels require the input of a user-defined kernel, a drawback if a good representation of the underlying data is unknown a priori. While statistical model selection techniques such as cross-validation or the jackknife [14] can conceptually resolve those model selection issues, they typically slow down the training process on complex data-sets as they repeatedly refit the model. It is thus imperative to devise efficient learning algorithms to facilitate such model selection problems.

In this paper we put forth a novel optimization framework to learn a good radial kernel from training data. Our kernel learning approach is based on a distributionally robust optimization problem [12, 16] to learn a good distribution for the Schönberg integral representation of the radial basis functions from training samples [47, Thm. 1]. Since optimization with respect to the distribution of RBF kernels is intractable, we consider a Monte Carlo sample average approximation to obtain a solvable finite dimensional optimization problem with respect to the samples of the distribution. We then use a projected particle Langevin optimization method to solve the approximated finite dimensional optimization problem. We provide a theoretical guarantee for the consistency of the finite sample-average approximations.
Based on a mean-field analysis, we also show the consistency of the proposed projected particle Langevin optimization method in the sense that when the number of particles tends to infinity, the empirical distribution of the Langevin particles follows the path of the gradient descent flow of the distributionally robust optimization problem on the Wasserstein manifold.

1.1. Related works.
The proposed kernel learning approach in this paper is closely related to the previous work of the authors in [26]. Therein, we proposed a particle stochastic gradient descent (SGD) method in conjunction with the random feature model of Rahimi and Recht [41, 42] to optimize the distribution of the random features in generative and discriminative machine learning models using training samples. We also showed numerically that, compared to importance sampling techniques for kernel learning (e.g., [48]), the particle SGD in conjunction with kernel SVMs yields lower training and test errors on benchmark data-sets. Nevertheless, we observed that the particle SGD scales rather poorly with the data dimension and the number of random feature samples, rendering it inapplicable for discriminative analysis of high dimensional data-sets. In this work, we address the scalability issue by optimizing the kernel over the sub-class of radial kernels, which includes important special cases such as the Gaussian, inverse multiquadric, and Matérn kernels. While we characterize a distributionally robust optimization framework similar to [26], in this work we optimize distributions defined on the real line instead of the high dimensional distributions in [26].

This work is also closely related to the copious literature on the kernel model selection problem; see, e.g., [28, 9, 1, 19]. For classification problems using kernel SVMs, Cortes et al. studied a kernel learning procedure from the class of mixtures of base kernels. They have also studied the generalization bounds of the proposed methods. The same authors have also studied a two-stage kernel learning in [9] based on a notion of kernel alignment. The first stage of this technique consists of learning a kernel that is a convex combination of a set of base kernels. The second stage consists of using the learned kernel with a standard kernel-based learning algorithm, such as SVMs, to select a prediction hypothesis. In [28], the authors have proposed a semi-definite program for the kernel-target alignment problem. However, semi-definite programs based on interior point methods scale rather poorly to a large number of base kernels.

Our proposed kernel learning framework is related to the work of Sinha and Duchi [48] for learning shift invariant kernels with random features. Therein, the authors have proposed a distributionally robust optimization for the importance sampling of random features using $f$-divergences. In [26], we proposed a particle stochastic gradient descent to directly optimize the samples of the random features in the distributionally robust optimization framework, instead of optimizing the weights (importance) of the samples. However, for high dimensional data-sets the particles are high dimensional, and the particle SGD scales poorly. In this paper, we address the scalability issue by restricting the kernel class to the radial kernels.

The mean-field description of SGD dynamics has been studied in several prior works for different information processing tasks. Wang et al. [57] consider the problem of online learning for principal component analysis (PCA), and analyze the scaling limits of different online learning algorithms based on the notion of finite exchangeability. In their seminal papers, Montanari and co-authors [33, 23, 32] consider the scaling limits of SGD for training a two-layer neural network, and characterize the related McKean-Vlasov PDE for the limiting distribution of the empirical measure associated with the weights of the input layer.
They also establish the uniqueness and existence of the solution of the PDE using the connection between McKean-Vlasov type PDEs and gradient flows on Wasserstein manifolds established by Otto [36], and Jordan, Kinderlehrer, and Otto [24]. Similar mean-field type results for two-layer neural networks have also been studied recently in [45, 50].

1.2. Paper Outline.
The rest of this paper is organized as follows:
• Empirical Risk Minimization in Reproducing Kernel Hilbert Spaces: In Section 2, we review some preliminaries regarding empirical risk minimization in reproducing kernel Hilbert spaces. We also provide the notion of kernel-target alignment for optimizing the kernel in support vector machines (SVMs). We then characterize a distributionally robust optimization problem for multiple kernel learning.
• Theoretical Results: In Section 3, we provide theoretical guarantees for the performance of our kernel learning algorithm. In particular, we establish the non-asymptotic consistency of the finite sample approximations. We also analyze the scaling limits of the kernel learning algorithm.
• Empirical Evaluation on Synthetic and Benchmark Data-Sets: In Section 4, we evaluate the performance of our proposed kernel learning model on synthetic and benchmark data-sets. In particular, we analyze the performance of our kernel learning approach for the hypothesis testing problem. We also apply our proposed kernel learning approach to develop hash codes for the image query task on large data-bases.
2. Preliminaries and the Optimization Problem for Kernel Learning
In this section, we review preliminaries of kernel methods in classification and regression problems.

2.1. Reproducing Kernel Hilbert Spaces and Kernel Alignment Optimization.
Let $\mathcal{X}$ be a metric space. A Mercer kernel on $\mathcal{X}$ is a continuous and symmetric function $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that for any finite set of points $\{x_1,\cdots,x_N\}\subset\mathcal{X}$, the kernel matrix $(K(x_i,x_j))_{1\le i,j\le N}$ is positive semi-definite.

The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the kernel $K$ is the completion of the linear span of the set of functions $\{K_x \stackrel{\mathrm{def}}{=} K(x,\cdot),\ x\in\mathcal{X}\}$ with the inner product structure $\langle\cdot,\cdot\rangle_{\mathcal{H}_K}$ defined by $\langle K_x,K_{x'}\rangle_{\mathcal{H}_K} = K(x,x')$. That is,

$$\Big\langle\sum_i\alpha_i K_{x_i},\ \sum_j\beta_j K_{x_j}\Big\rangle_{\mathcal{H}_K} = \sum_{i,j}\alpha_i\beta_j K(x_i,x_j). \tag{2.1}$$

The reproducing property takes the following form:

$$\langle K_x, f\rangle_{\mathcal{H}_K} = f(x),\quad\forall x\in\mathcal{X},\ f\in\mathcal{H}_K. \tag{2.2}$$

In the classical supervised learning models, we are given $n$ feature vectors and their corresponding univariate class labels $(x_1,y_1),\cdots,(x_n,y_n)\sim_{\mathrm{i.i.d.}} P_{X,Y}$, $(x_i,y_i)\in\mathcal{X}\times\mathcal{Y}\subset\mathbb{R}^d\times\mathbb{R}$. For the binary classification and regression tasks, the target space is given by $\mathcal{Y} = \{-1,1\}$ and $\mathcal{Y} = \mathbb{R}$, respectively. Given a loss function $\ell:\mathcal{Y}\times\mathbb{R}\to\mathbb{R}$, a classifier $f$ is learned from the function class $\mathcal{F}$ via the minimization of the regularized empirical risk

$$\inf_{f\in\mathcal{F}}\ \widehat{R}[f] \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n}\ell(y_i,f(x_i)) + \lambda\|f\|_{\mathcal{F}}, \tag{2.3}$$

where $\|\cdot\|_{\mathcal{F}}$ is a function norm, and $\lambda>0$ is a regularization parameter.
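As a concrete illustration of the Mercer condition above, the following minimal sketch forms the Gram matrix of a Gaussian radial kernel on random data and checks positive semi-definiteness numerically; the bandwidth and the sample sizes are illustrative choices, not values used in this paper.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gram matrix (K(x_i, x_j))_{ij} for K(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 points in R^5
K = gaussian_kernel(X, sigma=2.0)

# A Mercer kernel must yield a PSD Gram matrix: all eigenvalues >= 0 (up to round-off).
eigs = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigs.min())
```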
In the sequel, we let $\mathcal{F}$ be the RKHS $\mathcal{H}_K$ with the kernel function $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, and suppose $\mathcal{F} = \mathcal{H}_K\oplus\{1\}$, i.e., the RKHS augmented with the constant functions. Then, using the expansion $f(x) = \omega_0 + \sum_{i=1}^{n-1}\omega_i K(x,x_i)$, optimization over the kernel class $\mathcal{K}$ yields the following primal and dual optimization problems:

$$\text{Primal:}\quad \min_{\omega\in\mathbb{R}^n}\max_{K\in\mathcal{K}}\ \frac{1}{n}\sum_{i=1}^{n}\ell\Big(y_i,\ \omega_0 + \sum_{j=1}^{n-1}\omega_j K(x_i,x_j)\Big) + \lambda\|\omega\|_2^2, \tag{2.4a}$$

$$\text{Dual:}\quad \max_{\alpha\in\mathbb{R}^n}\min_{K\in\mathcal{K}}\ -\frac{1}{n}\sum_{i=1}^{n}\ell^*(\alpha_i,y_i) - \lambda\,\alpha^T K\alpha, \tag{2.4b}$$

respectively, where $\ell^*(\beta,y) = \sup_{z\in\mathbb{R}}\{\beta z - \ell(z,y)\}$ is the Fenchel conjugate, and $K \stackrel{\mathrm{def}}{=} (K(x_i,x_j))_{1\le i,j\le n}$ is the kernel Gram matrix.

In the particular case of the soft margin SVM classifier $\ell(y,z) = [1-yz]_+ \stackrel{\mathrm{def}}{=}\max\{0,\,1-yz\}$, the primal and dual optimizations take the following forms:

$$\text{Primal:}\quad \min_{\omega\in\mathbb{R}^n}\max_{K\in\mathcal{K}}\ \frac{1}{n}\sum_{i=1}^{n}\Big[1 - \omega_0 y_i - \sum_{j=1}^{n-1}\omega_j y_i K(x_i,x_j)\Big]_+ + \lambda\|\omega\|_2^2, \tag{2.5a}$$

$$\text{Dual:}\quad \max_{\beta\in\mathbb{R}^n:\ \langle\beta,y\rangle = 0,\ 0\preceq\beta\preceq C}\ \min_{K\in\mathcal{K}}\ \langle\beta,\mathbf{1}\rangle - \frac{1}{2}\mathrm{Tr}\big(K(\beta\odot y)(\beta\odot y)^T\big), \tag{2.5b}$$

where $\odot$ is the Hadamard (element-wise) product of vectors. The form of the dual optimization in Eq. (2.5) suggests that for a fixed dual vector $\beta\in\mathbb{R}^n$, the optimal kernel can be computed by optimizing the following $U$-statistic, known as the kernel-target alignment, i.e.,

$$\max_{K\in\mathcal{K}}\ \frac{2}{n(n-1)}\sum_{1\le i<j\le n} y_i y_j K(x_i,x_j). \tag{2.6}$$

In this paper, we focus on the class of radial kernels

$$\mathcal{K} \stackrel{\mathrm{def}}{=}\big\{K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}\ :\ K(x_i,x_j) = \psi(\|x_i-x_j\|_2),\ \psi\in C(\mathbb{R}_+)\big\},^{1} \tag{2.7}$$

which includes the following important special cases:
• Gaussian: $K(x,\tilde x) = \exp\big(-\frac{\|x-\tilde x\|_2^2}{\sigma^2}\big)$, for $\sigma>0$.
• Inverse multiquadrics: $K(x,\tilde x) = \big(c^2 + \|x-\tilde x\|_2^2\big)^{-\gamma}$, for $c,\gamma>0$.
• Matérn: $K(x,\tilde x) = \frac{c^{\tau-\frac{d}{2}}}{\Gamma(\tau-\frac{d}{2})\,2^{\tau-1-\frac{d}{2}}}\Big(\frac{\|x-\tilde x\|_2}{c}\Big)^{\tau-\frac{d}{2}} K_{\frac{d}{2}-\tau}(c\|x-\tilde x\|_2)$, where $K_\alpha$ is the modified Bessel function of the third kind, and $\Gamma$ is the Euler Gamma function.

¹ Compare the kernel class in Eq. (2.7) with the class $\mathcal{K} \stackrel{\mathrm{def}}{=}\{K:\mathcal{X}\times\mathcal{X}\to\mathbb{R} : K(x_i,x_j) = \psi(x_i-x_j),\ \psi\in C(\mathbb{R}^d)\}$ we considered in [26], where $\mathcal{K}$ is the set of all translation invariant kernels.

The radial kernel $K(x_i,x_j) = \psi(\|x_i-x_j\|_2)$ is a positive semi-definite kernel (a.k.a. Mercer kernel) if the univariate function $\psi:\mathbb{R}_+\to\mathbb{R}$ is positive semi-definite. A univariate function which is positive semi-definite or positive definite on every $\mathbb{R}^d$ admits an integral representation due to Schönberg [47, Thm. 1]:

Theorem 2.1. (I. J. Schönberg [47, Thm. 1]) A continuous function $\psi:\mathbb{R}_+\to\mathbb{R}$ is positive semi-definite if and only if it admits the following integral representation

$$\psi(r) = \int_0^\infty e^{-tr^2}\,\mathrm{d}\mu(t), \tag{2.8}$$

for a finite positive Borel measure $\mu$ on $\mathbb{R}_+$. Moreover, if $\mathrm{supp}(\mu)\neq\{0\}$ then $\psi$ is positive definite, where $\mathrm{supp}(\mu)$ denotes the support of the measure $\mu\in\mathcal{M}_+(\mathbb{R}_+)$.

From Schönberg's representation theorem [47], the following integral representation for radial kernels follows:

$$K(x,\hat x) = \int_0^\infty e^{-\xi\|x-\hat x\|_2^2}\,\mu(\mathrm{d}\xi),\quad\forall x,\hat x\in\mathbb{R}^d,\ \mu\in\mathcal{M}_+(\mathbb{R}_+), \tag{2.9}$$

where $\mathcal{M}_+(\mathbb{R}_+)$ is the set of all finite non-negative Borel measures on $\mathbb{R}_+$. Due to the integral representation of Equation (2.9), a kernel function $K$ is completely characterized in terms of the measure $\mu$. Therefore, we can reformulate the kernel-target alignment as an optimization with respect to the distribution of the kernel:

$$\sup_{\mu\in\mathcal{P}}\ \widehat{E}(\mu) \stackrel{\mathrm{def}}{=} \frac{2}{n(n-1)}\sum_{1\le i<j\le n} y_i y_j\int_0^\infty e^{-\xi\|x_i-x_j\|_2^2}\,\mu(\mathrm{d}\xi), \tag{2.10}$$

where $\mathcal{P}$ is a distribution ball of radius $R>0$, defined with respect to the Wasserstein distance, around a user-specified base measure $\nu\in\mathcal{M}_+(\mathbb{R}_+)$. In the finite-sample problem (2.14), the measure is represented by Langevin particles, $\widehat{E}_\gamma$ denotes a regularized empirical alignment, and $\widehat{\nu}^N$ denotes the empirical measure of the i.i.d. samples $\zeta^1,\cdots,\zeta^N\sim\nu$. Dualizing the constraint $W_2(\widehat\mu^N,\widehat\nu^N)\le R$ via a multiplier $h\ge 0$, a partial Lagrangian is

$$\inf_{\mu\in\mathcal{M}_+(\mathbb{R}_+)}\ \sup_{h\in\mathbb{R}_+}\ \widehat{J}_h(\mu) \stackrel{\mathrm{def}}{=} \widehat{E}_\gamma(\widehat\mu^N) + h\big(W_2(\widehat\mu^N,\widehat\nu^N) - R\big), \tag{2.16}$$

where $h>0$ is the Lagrange multiplier. We define the empirical measure of the particles $\xi^1_m,\cdots,\xi^N_m$ at each iteration $m = 0,1,2,\cdots,T$ as follows:

$$\widehat{\mu}^N_m(\xi) \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_{k=1}^{N}\delta(\xi-\xi^k_m). \tag{2.17}$$

Then, the surrogate loss for the Wasserstein distance in Eq. (2.14) can be obtained as follows. We denote the joint distribution of the particle pair $(\xi^i_m,\zeta^j)$ by $\pi_{ij}$. Then we obtain

$$W_2^2\big(\widehat\mu^N_m,\widehat\nu^N\big) = \min_{\pi\in\mathcal{C}_+}\sum_{i,j=1}^{N}\pi_{ij}\,|\xi^i_m-\zeta^j|^2, \tag{2.18}$$

where the set of couplings is given by

$$\mathcal{C}_+ \stackrel{\mathrm{def}}{=}\Big\{\pi\in\mathbb{R}^{N\times N}_+\ :\ \sum_{i=1}^N\pi_{ij} = \frac{1}{N},\ \sum_{j=1}^N\pi_{ij} = \frac{1}{N},\quad i,j = 1,2,\cdots,N\Big\}. \tag{2.19}$$

Equation (2.18) is a linear program that is challenging to solve in practice for a large number of particles $N$.
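To make the finite-sample objective tangible, the sketch below approximates a radial kernel by a Monte Carlo average over particles, i.e., Eq. (2.9) with $\mu$ replaced by an empirical measure, and evaluates the kernel-target alignment of Eq. (2.6); the particle distribution (a Gamma) and the data are illustrative placeholders rather than quantities from the paper.

```python
import numpy as np

def radial_kernel_mc(X, xi):
    """Monte Carlo estimate of K(x_i, x_j) = (1/N) sum_k exp(-xi_k ||x_i - x_j||^2),
    i.e. Eq. (2.9) with mu replaced by the empirical measure of the particles xi."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.mean(np.exp(-np.multiply.outer(xi, d2)), axis=0)

def alignment(K, y):
    """Empirical kernel-target alignment (2/(n(n-1))) * sum_{i<j} y_i y_j K_ij."""
    n = len(y)
    M = np.outer(y, y) * K
    return (M.sum() - np.trace(M)) / (n * (n - 1))   # off-diagonal sum, normalized

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.sign(rng.normal(size=40))
xi = rng.gamma(shape=2.0, scale=0.5, size=200)       # N = 200 particles on IR_+

K = radial_kernel_mc(X, xi)
print("alignment:", alignment(K, y))
```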
To improve the computational efficiency, Cuturi [10] proposed adding an entropic regularization to the Wasserstein distance. The resulting Sinkhorn divergence solves a regularized version of Equation (2.18):

$$\pi^* \stackrel{\mathrm{def}}{=}\arg\min_{\pi\in\mathcal{C}}\ \sum_{i,j=1}^{N}\pi_{ij}|\xi^i_m-\zeta^j|^2 - \varepsilon H(\pi), \tag{2.20}$$

where $H(\pi)\stackrel{\mathrm{def}}{=} -\sum_{i,j=1}^N\pi_{ij}\log(\pi_{ij})$ is the entropic barrier function enforcing the non-negativity constraint on the entries $\pi_{ij}$, with the regularization parameter $\varepsilon>0$. Moreover, $\mathcal{C}$ has a similar definition to that in Eq. (2.19), except that $(\pi_{ij})_{i,j}\in\mathbb{R}^{N\times N}$, and thus $\mathcal{C}_+\subseteq\mathcal{C}$. The Sinkhorn divergence is now computed as follows:

$$W_{\varepsilon,2}\big(\widehat\mu^N_m,\widehat\nu^N\big) = \sum_{i,j=1}^N\pi^*_{ij}|\xi^i_m-\zeta^j|^2. \tag{2.21}$$

Notice that the entropic regularization term $H(\pi)$ is absent from Eq. (2.21).²

² The divergence in Eq. (2.21) is sometimes referred to as the sharp Sinkhorn divergence, to differentiate it from its regularized counterpart $\widetilde W_{\varepsilon,2}(\widehat\mu^N_m,\widehat\nu^N)\stackrel{\mathrm{def}}{=}\sum_{i,j=1}^N\pi^*_{ij}|\xi^i_m-\zeta^j|^2 - \varepsilon H(\pi^*)$; see, e.g., [29].

We thus consider the following surrogate loss function:

$$\inf_{\xi\in\mathbb{R}^N_+}\ \widehat J^N_{h,\varepsilon}(\xi)\stackrel{\mathrm{def}}{=}\widehat E_\gamma(\xi) + h\,W_{\varepsilon,2}(\widehat\mu^N_m,\widehat\nu^N). \tag{2.22}$$

The loss function $\widehat J^N_{h,\varepsilon}(\xi)$ acts as a proxy for the empirical loss $\widehat J^N_h(\xi)$ that we wish to optimize.

After introducing the Lagrange multipliers $\lambda\stackrel{\mathrm{def}}{=}(\lambda_i)_{1\le i\le N}$ and $\tilde\lambda\stackrel{\mathrm{def}}{=}(\tilde\lambda_i)_{1\le i\le N}$ for the constraints on the marginals of $\pi_{ij}$, we obtain the Lagrangian

$$L(\pi;\lambda,\tilde\lambda)\stackrel{\mathrm{def}}{=}\sum_{i,j=1}^N\varepsilon\pi_{ij}\log(\pi_{ij}) + \sum_{i,j=1}^N\pi_{ij}|\xi^i_m-\zeta^j|^2 + \sum_{j=1}^N\lambda_j\Big(\sum_{i=1}^N\pi_{ij}-\frac{1}{N}\Big) + \sum_{i=1}^N\tilde\lambda_i\Big(\sum_{j=1}^N\pi_{ij}-\frac{1}{N}\Big). \tag{2.23}$$

The optimal values $\pi^*_{ij}$ can be derived explicitly from the KKT conditions:

$$\pi^*_{ij} = v_i\exp\Big(-\frac{|\xi^i_m-\zeta^j|^2}{\varepsilon}\Big)u_j, \tag{2.24}$$

where $v_i\stackrel{\mathrm{def}}{=}\exp(-\tilde\lambda_i\varepsilon^{-1})$ and $u_j\stackrel{\mathrm{def}}{=}\exp(-\lambda_j\varepsilon^{-1})$. Alternatively, the transportation matrix is given by

$$\pi^*_m\stackrel{\mathrm{def}}{=}\mathrm{diag}(v^*)\,e^{-\frac{D_m}{\varepsilon}}\,\mathrm{diag}(u^*), \tag{2.25}$$

where $v^*\stackrel{\mathrm{def}}{=}(v^*_i)_{1\le i\le N}$, $u^*\stackrel{\mathrm{def}}{=}(u^*_i)_{1\le i\le N}$, $D_m = (d^m_{ij})_{ij}$ with $d^m_{ij}\stackrel{\mathrm{def}}{=}|\xi^i_m-\zeta^j|^2$, and $\mathrm{diag}(\cdot)$ maps a vector to the corresponding diagonal matrix. The vectors $v^*$ and $u^*$ can be computed efficiently via the Sinkhorn-Knopp matrix scaling algorithm [49]:

$$(u^{k+1},v^{k+1}) = F(u^k,v^k),\quad k\in\mathbb{N}, \tag{2.26}$$

where $F:\mathbb{R}^N\times\mathbb{R}^N\to\mathbb{R}^N\times\mathbb{R}^N$ is the following contraction mapping:

$$F(u,v) = \frac{1}{N}\Big(\mathrm{diag}^{-1}\big(e^{-\frac{D_m}{\varepsilon}}v\big)\mathbf{1}_N,\ \mathrm{diag}^{-1}\big(e^{-\frac{D_m^T}{\varepsilon}}u\big)\mathbf{1}_N\Big), \tag{2.27}$$

where $\mathbf{1}_N\stackrel{\mathrm{def}}{=}(1,1,\cdots,1)\in\mathbb{R}^N$. Moreover,

$$W_{\varepsilon,2}\big(\widehat\mu^N_m,\widehat\nu^N\big) = \mathrm{Tr}(D_m\pi^*_m). \tag{2.28}$$

To optimize Eq. (2.14), we consider the projected noisy stochastic gradient descent method (a.k.a. Langevin dynamics). In particular, at each iteration $m = 0,1,\cdots,T$, two samples $z_m = (y_m,x_m)$ and $\tilde z_m = (\tilde y_m,\tilde x_m)$ are drawn uniformly at random from the training data-set. Then, we analyze the following iterative optimization method:

$$\xi_{m+1} = P_{\Xi^N}\Big(\xi_m - \eta_m\nabla\widehat J^N_{\varepsilon,h}(\xi_m;z_m,\tilde z_m) + \sqrt{\frac{2\eta_m}{\beta}}\,\zeta_m\Big), \tag{2.29}$$

where $(\xi^k_0)_{1\le k\le N}\sim_{\mathrm{i.i.d.}}\mu_0$ is the initialization. Above, $\beta>0$ is an inverse temperature parameter, $\eta_m>0$ is the step size, and $(\zeta_m)_{m\in\mathbb{N}}\sim_{\mathrm{i.i.d.}} N(0,I_{N\times N})$ is the isotropic Gaussian noise. Note that when $\beta\to\infty$, the iterations in Eq. (2.29) correspond to the projected stochastic gradient descent method. Moreover, $P_{\Xi^N}(\cdot)$ is the Euclidean projection onto the sub-set $\Xi^N\subseteq\mathbb{R}^N_+$.
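A minimal sketch of the Sinkhorn-Knopp scaling of Eqs. (2.25)-(2.28) for two one-dimensional particle sets with uniform marginals follows; rather than applying the paired map $F$ of Eq. (2.27), it alternates the two marginal rescalings, which has the same fixed point. The value of $\varepsilon$ and the iteration count are illustrative.

```python
import numpy as np

def sharp_sinkhorn(xi, zeta, eps=0.1, iters=500):
    """Sharp Sinkhorn divergence between (1/N) sum_k delta_{xi_k} and
    (1/N) sum_k delta_{zeta_k}: the transport cost of Eq. (2.28), without entropy."""
    N = len(xi)
    D = (xi[:, None] - zeta[None, :]) ** 2      # cost matrix d_ij = |xi_i - zeta_j|^2
    Kmat = np.exp(-D / eps)                     # Gibbs kernel e^{-D/eps}
    u = np.full(N, 1.0 / N)
    v = np.full(N, 1.0 / N)
    for _ in range(iters):
        u = (1.0 / N) / (Kmat @ v)              # enforce row marginals = 1/N
        v = (1.0 / N) / (Kmat.T @ u)            # enforce column marginals = 1/N
    pi = u[:, None] * Kmat * v[None, :]         # pi* = diag(u) e^{-D/eps} diag(v)
    return np.sum(pi * D)

rng = np.random.default_rng(2)
xi = rng.gamma(2.0, 0.5, size=64)
zeta = rng.gamma(2.0, 0.5, size=64)
print("W_eps:", sharp_sinkhorn(xi, zeta))
```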
Furthermore, $\nabla\widehat J^N_{\varepsilon,h}(\xi_m;z_m,\tilde z_m)\stackrel{\mathrm{def}}{=}\big(\nabla_k\widehat J^N_{\varepsilon,h}(\xi_m;z_m,\tilde z_m)\big)_{1\le k\le N}$ is the stochastic gradient, whose elements are

$$\nabla_k\widehat J^N_{\varepsilon,h}(\xi_m;z_m,\tilde z_m)\stackrel{\mathrm{def}}{=}\frac{\partial E_\gamma(\xi_m;z_m,\tilde z_m)}{\partial\xi^k_m} + h\,\frac{\partial W_{\varepsilon,2}(\widehat\mu^N_m,\widehat\nu^N)}{\partial\xi^k_m}, \tag{2.30}$$

where

$$\frac{\partial E_\gamma(\xi_m;z_m,\tilde z_m)}{\partial\xi^k_m} = \frac{1}{N}\Big(\gamma y_m\tilde y_m - \frac{1}{N}\sum_{\ell=1}^N e^{-\xi^\ell_m\|x_m-\tilde x_m\|_2^2}\Big)e^{-\xi^k_m\|x_m-\tilde x_m\|_2^2}\,\|x_m-\tilde x_m\|_2^2, \tag{2.31a}$$

$$\frac{\partial W_{\varepsilon,2}(\widehat\mu^N_m,\widehat\nu^N)}{\partial\xi^k_m} = 2\sum_{\ell=1}^{N} v^*_k u^*_\ell\Big(1-\frac{|\xi^k_m-\zeta^\ell|^2}{\varepsilon}\Big)e^{-\frac{|\xi^k_m-\zeta^\ell|^2}{\varepsilon}}\,(\xi^k_m-\zeta^\ell), \tag{2.31b}$$

for all $k = 1,2,\cdots,N$. To update the Lagrange multiplier $h$, we note that Eq. (2.22) is concave in $h>0$, and $h$ is a scalar. Therefore, a bisection method can be applied to optimize the Lagrange multiplier $h$. In Algorithm 1, we summarize the main steps of the proposed kernel learning approach. To analyze the complexity of Algorithm 1, we note that the Sinkhorn divergence can be computed in $O(N^2)$ time [10]. Since the Euclidean projection in Eq. (2.29) is onto the hyper-cube $\Xi^N$, it can be computed efficiently in $O(N)$ time by computing $\min\{\xi^k,\xi_u\}$ and $\max\{\xi^k,\xi_l\}$ for each particle $k = 1,2,\cdots,N$. Overall, an $\epsilon$-optimal solution to problem (2.22) can be reached in $O(N^2\log(1/\epsilon))$ time.

Algorithm 1: Distributionally Robust Optimization Method for Learning the Radial Kernels

Inputs: learning rate $\eta>0$, radius $R>0$, samples $(x_i,y_i)_{1\le i\le n}$, parameters $\beta>0$ and $\gamma>0$, samples $\zeta = (\zeta^1,\cdots,\zeta^N)\sim\nu$, divergence parameter $\varepsilon>0$.
Output: the samples $\xi\in\mathbb{R}^N_+$ that are an $\epsilon$-solution to Eq. (2.22).
Initialize: the particles $\xi\leftarrow\zeta$; set $h_u\leftarrow\infty$, $h_l\leftarrow 0$, $h_s\leftarrow 1$.
while $h_u = \infty$ do
  Draw $z,\tilde z\sim_{\mathrm{i.i.d.}}\mathrm{Uniform}\{(y_i,x_i)\}_{i=1}^n$ and $\zeta_0\sim N(0,I_{N\times N})$.
  Update the particles
    $\xi\leftarrow P_{\Xi^N}\Big(\xi - \eta\nabla\widehat J^N_{\varepsilon,h_s}(\xi;z,\tilde z) + \sqrt{\tfrac{2\eta}{\beta}}\,\zeta_0\Big)$.  (2.32)
  if $W_{\varepsilon,2}\Big(\frac{1}{N}\sum_{k=1}^N\delta_{\xi^k},\ \frac{1}{N}\sum_{k=1}^N\delta_{\zeta^k}\Big)\le R$ then $h_u\leftarrow h_s$ else $h_s\leftarrow 2h_s$.  ▷ Use Eqs. (2.25)-(2.28)
end while
while $h_u - h_l\ge\epsilon h_s$ do
  $h_s\leftarrow(h_u+h_l)/2$.
  Draw $z,\tilde z\sim_{\mathrm{i.i.d.}}\mathrm{Uniform}\{(y_i,x_i)\}_{i=1}^n$ and $\zeta_0\sim N(0,I_{N\times N})$.
  Update the particles
    $\xi\leftarrow P_{\Xi^N}\Big(\xi - \eta\nabla\widehat J^N_{\varepsilon,h_s}(\xi;z,\tilde z) + \sqrt{\tfrac{2\eta}{\beta}}\,\zeta_0\Big)$.  (2.33)
  if $W_{\varepsilon,2}\Big(\frac{1}{N}\sum_{k=1}^N\delta_{\xi^k},\ \frac{1}{N}\sum_{k=1}^N\delta_{\zeta^k}\Big)\le R$ then $h_u\leftarrow h_s$ else $h_l\leftarrow h_s$.  ▷ Use Eqs. (2.25)-(2.28)
end while

3. Main Results

In this section, we state our main theoretical results regarding the performance of the proposed kernel learning procedure. The proofs of the theoretical results are presented in the Appendix.

3.1. Assumptions. To state our theoretical results, we first state the technical assumptions underlying them:

(A.1) The feature space $\mathcal{X}\subset\mathbb{R}^d$ has a finite diameter, i.e., $K_0\stackrel{\mathrm{def}}{=}\sup_{x,y\in\mathcal{X}}\|x-y\|_2<\infty$.

(A.2) The Slater condition holds for the empirical loss functions. In particular, we suppose there exist $\xi_s\in\mathbb{R}^N_+$ and $\bar\xi_s\in\mathbb{R}^N_+$ such that $\xi_s\in\mathrm{relint}(\mathcal{P}^N)$ and $\bar\xi_s\in\mathrm{relint}(\mathcal{P}^N_\varepsilon)$.³

(A.3) The Langevin particles are confined to a compact sub-set of $\mathbb{R}_+$, i.e., $\Xi = [\xi_l,\xi_u]$, for some $0\le\xi_l<\xi_u<+\infty$.

(A.4) The Langevin particles are initialized by sampling from a distribution $\mu_0\in\mathcal{M}_+(\Xi)$ whose Lebesgue density exists. Let $q_0(\xi)\stackrel{\mathrm{def}}{=}\mathrm{d}\mu_0/\mathrm{d}\xi$ denote the associated Lebesgue density.

³ The relative interior of a convex set $C$, abbreviated $\mathrm{relint}(C)$, is defined as $\mathrm{relint}(C)\stackrel{\mathrm{def}}{=}\{x\in C:\exists\epsilon>0,\ \mathbb{B}_\epsilon(x)\cap\mathrm{aff}(C)\subseteq C\}$, where $\mathrm{aff}(C)$ denotes the affine hull of the set $C$, and $\mathbb{B}_\epsilon(x)$ is the ball of radius $\epsilon$ centered at $x$.

We remark that while (A.3) is essential to establish the theoretical performance guarantees for Algorithm 1 in this paper, in practice the same algorithm can be employed to optimize distributions with unbounded support.

3.2. Non-asymptotic consistency. In this part, we prove that the population objective evaluated at the solution of the finite sample optimization problem in Eq. (2.14) approaches its population value as the number of training data ($n$), the number of Langevin particles ($N$), the regularization parameter ($\gamma$), and the inverse of the parameter of the Sinkhorn transport ($\varepsilon^{-1}$) tend to infinity.

Theorem 3.1. (Non-asymptotic Consistency of the Finite-Sample Estimator) Consider the distribution balls $\mathcal{P}$ and $\mathcal{P}^N_\varepsilon$ of radius $R$, defined with respect to the 2-Wasserstein and Sinkhorn divergences, respectively. Define the optimizers of the population optimization and of its finite sample estimate

$$\mu^*(\xi)\stackrel{\mathrm{def}}{=}\inf_{\mu\in\mathcal{P}} E(\mu(\xi)),\quad E(\mu(\xi))\stackrel{\mathrm{def}}{=}\mathbb{E}_{P^{\otimes 2}_{x,y}}\big[\widehat E(\mu(\xi))\big],\qquad \widehat\mu^{N*}(\xi;\gamma,\varepsilon)\stackrel{\mathrm{def}}{=}\inf_{\widehat\mu^N\in\mathcal{P}^N_\varepsilon}\widehat E_\gamma(\widehat\mu^N),$$

respectively. Then, with probability of (at least) $1-\varrho$ over the training data samples $\{(x_i,y_i)\}_{i=1}^n$ and the Langevin particles $\{\xi^k\}_{k=1}^N$, the following non-asymptotic bound holds:

$$\Big|E(\mu^*(\xi)) - E(\widehat\mu^{N*}(\xi;\varepsilon,\gamma))\Big|\le\frac{\sqrt{2}\,(\xi_u-\xi_l)R}{\sqrt N}\Bigg(1+\sqrt{\ln\Big(\frac{\sqrt N}{(\xi_u-\xi_l)\varrho}\Big)}\Bigg) + 2\max\Bigg\{\frac{c_0 L}{n}\ln\Big(\frac{2}{\varrho}\Big),\ \sqrt{\frac{c_1 RL}{n}\ln\Big(\frac{2eL}{\varrho}\Big)}\Bigg\} + \frac{3}{\gamma} + c_2\, e^{-\frac{1}{\varepsilon}},$$

where $c_0$ and $c_1$ are universal numerical constants, and $c_2$ is a constant independent of $\varepsilon$.

The proof of Theorem 3.1 is presented in Appendix A.1.

3.3. Scaling limits for the unconstrained optimization. In this part, we provide a mean-field analysis of the projected particle optimization in Eq. (2.29) for the special case of unconstrained optimization. In particular, throughout this part we assume $R = \infty$ in the distributional ball (and thus $h = 0$). We then analyze the following Langevin optimization method for unconstrained distributional optimization:

$$\xi_{m+1} = P_{\Xi^N}\Big(\xi_m-\eta_m\nabla\widehat J^N(\xi_m;z_m,\tilde z_m)+\sqrt{\frac{2\eta_m}{\beta}}\,\zeta_m\Big), \tag{3.1}$$

where $\widehat J^N(\xi_m;z_m,\tilde z_m)\stackrel{\mathrm{def}}{=}\widehat J^N_{\varepsilon,h}(\xi_m;z_m,\tilde z_m)\big|_{h=0}$.
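A compact sketch of one iteration of the projected Langevin update of Eqs. (2.29)/(3.1): a noisy gradient step followed by a coordinate-wise clip onto the hyper-cube $\Xi^N = [\xi_l,\xi_u]^N$. The gradient routine below is a quadratic stand-in for the stochastic gradient of Eqs. (2.30)-(2.31), and all numerical values are illustrative.

```python
import numpy as np

def langevin_step(xi, grad_fn, eta=1e-3, beta=100.0, xi_l=0.0, xi_u=5.0, rng=None):
    """One projected Langevin update: xi <- P_Xi(xi - eta*grad + sqrt(2*eta/beta)*noise).
    The Euclidean projection onto [xi_l, xi_u]^N is a coordinate-wise clip, which
    costs O(N) as noted in the complexity discussion above."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=xi.shape)
    xi = xi - eta * grad_fn(xi) + np.sqrt(2.0 * eta / beta) * noise
    return np.clip(xi, xi_l, xi_u)

# Toy stand-in gradient (quadratic well); in Algorithm 1 this is Eq. (2.30).
grad_fn = lambda xi: xi - 1.0

rng = np.random.default_rng(3)
xi = rng.uniform(0.0, 5.0, size=256)
for m in range(10_000):
    xi = langevin_step(xi, grad_fn, rng=rng)
print("particle mean:", xi.mean())   # concentrates near the minimizer xi = 1
```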
The first result of this paper concerns the scaling limits of the Langevin optimization method in Eq. (3.1) in the mean-field regime:

Theorem 3.2. (Mean-Field Partial Differential Equation) Suppose Assumptions (A.1)-(A.4) are satisfied. Let $\mu^N_t$ denote the continuous-time embedding of the empirical measures $(\widehat\mu^N_k)_{k\in\mathbb{N}}$ associated with the projected Langevin particles in Eq. (3.1), i.e.,

$$\mu^N_t(\xi)\stackrel{\mathrm{def}}{=}\widehat\mu^N_{\lfloor t/\eta\rfloor} = \frac{1}{N}\sum_{k=1}^N\delta\big(\xi-\xi^k_{\lfloor t/\eta\rfloor}\big),\quad 0\le t\le T. \tag{3.2}$$

Suppose the step size $\eta = \eta_N$ satisfies $\eta_N\to 0$, $N/\log(\eta_N/N)\to\infty$, and $\eta_N/\log(\eta_N/N)\to 0$ as $N\to\infty$. Furthermore, suppose the Lebesgue density $q_0(\xi) = \mathrm{d}\mu_0/\mathrm{d}\xi$ exists. Then, for any fixed $t\in[0,T]$, $\mu^N_t\to\mu_t$ weakly as $N\to\infty$, where the Lebesgue density of the limiting measure, $p^*_t(\xi) = \mathrm{d}\mu^*_t/\mathrm{d}\xi$, is a solution of the following distributional dynamics with Robin boundary conditions and an initial datum:

$$\frac{\partial p_t(\xi)}{\partial t} = \frac{\partial}{\partial\xi}\Big(p_t(\xi)\frac{\partial}{\partial\xi}J(\xi,p_t(\xi))\Big) + \frac{1}{\beta}\frac{\partial^2}{\partial\xi^2}p_t(\xi),\quad\forall(t,\xi)\in[0,T]\times(\xi_l,\xi_u), \tag{3.3a}$$

$$\frac{\partial p_t(\xi)}{\partial\xi} + \beta p_t(\xi)\frac{\partial}{\partial\xi}J(\xi,p_t(\xi))\Big|_{\xi=\xi_l} = 0,\quad\forall t\in[0,T], \tag{3.3b}$$

$$\frac{\partial p_t(\xi)}{\partial\xi} + \beta p_t(\xi)\frac{\partial}{\partial\xi}J(\xi,p_t(\xi))\Big|_{\xi=\xi_u} = 0,\quad\forall t\in[0,T], \tag{3.3c}$$

$$p_0(\xi) = q_0(\xi),\quad\forall\xi\in[\xi_l,\xi_u], \tag{3.3d}$$

$$p_t(\xi)\ge 0,\ \forall\xi\in[\xi_l,\xi_u],\qquad\int_\Xi p_t(\xi)\,\mathrm{d}\xi = 1,\quad\forall t\in[0,T], \tag{3.3e}$$

where the functional $J(\xi,p_t(\xi))$ is defined as

$$J(\xi,p_t(\xi)) = \mathbb{E}\big[y\hat y\exp(-\xi\|x-\hat x\|_2^2)\big] + \frac{1}{\gamma}\int_0^\infty\mathbb{E}\big[\exp(-(\xi+\xi')\|x-\hat x\|_2^2)\big]\,p_t(\xi')\,\mathrm{d}\xi'.$$

Above, the expectations are taken with respect to the tensor product $P^{\otimes 2}_{x,y}$ of the joint distribution of the features and class labels, and with respect to its marginal $P^{\otimes 2}_x$, respectively. For a real $x\in\mathbb{R}$, $\lfloor x\rfloor$ stands for the largest integer not exceeding $x$. The notion of weak convergence of probability measures is formally defined in Appendix A.

The proof of Theorem 3.2 is presented in Appendix A.2.

Let us make several remarks regarding the distributional dynamics in Theorem 3.2. First, we notice that despite some resemblance between Eq. (3.3) and the Fokker-Planck (a.k.a. forward Kolmogorov) equations for diffusion-drift processes, they differ in that the drift $J(\xi,p_t(\xi))$ of Eq. (3.3) is a functional of the Lebesgue density of the law of the underlying process. In the context of the kinetic gas theory of statistical physics, such distributional dynamics are known as McKean-Vlasov partial differential equations; see, e.g., [6].

Second, the proof of Theorem 3.2 is based on the notion of propagation of chaos or asymptotic freedom [54, 26], meaning that the dynamics of the particles decouple in the asymptotic regime of infinitely many particles $N\to\infty$. To formalize this notion, we require the following definitions:

Definition 3.3. (Exchangeability) Let $\nu$ be a probability measure on a Polish space $S$. For $N\in\mathbb{N}$, we say that $\nu^{\otimes N}$ is an exchangeable probability measure on the product space $S^N$ if it is invariant under permutations $\pi\stackrel{\mathrm{def}}{=}(\pi(1),\cdots,\pi(N))$ of the indices. In particular,

$$\nu^{\otimes N}(\pi\cdot B) = \nu^{\otimes N}(B), \tag{3.4}$$

for all Borel subsets $B\in\mathcal{B}(S^N)$.

An interpretation of the exchangeability condition (3.4) can be provided via De Finetti's representation theorem, which states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution, and then the random variables in question were independent and identically distributed, conditioned on that parameter.

Next, we review the mathematical definition of chaoticity, as well as the propagation of chaos in product measure spaces:

Definition 3.4. (Chaoticity) Suppose $\nu^{\otimes N}$ is exchangeable. Then, the sequence $\{\nu^{\otimes N}\}_{N\in\mathbb{N}}$ is $\nu$-chaotic if, for any natural number $\ell\in\mathbb{N}$ and any test functions $f_1,f_2,\cdots,f_\ell\in C_b(S)$, we have

$$\lim_{N\to\infty}\Big\langle\prod_{k=1}^{\ell}f_k(s_k),\ \nu^{\otimes N}(\mathrm{d}s_1,\cdots,\mathrm{d}s_N)\Big\rangle = \prod_{k=1}^{\ell}\langle f_k,\nu\rangle. \tag{3.5}$$
According to Eq. (3.5) of Definition 3.4, a sequence of probability measures on the product spaces $S^N$ is $\nu$-chaotic if, for fixed $k$, the joint probability measures for the first $k$ coordinates tend to the product measure $\nu(\mathrm{d}s_1)\nu(\mathrm{d}s_2)\cdots\nu(\mathrm{d}s_k) = \nu^{\otimes k}$ on $S^k$. If the measures $\nu^{\otimes N}$ are thought of as giving the joint distribution of $N$ particles residing in the space $S$, then $\{\nu^{\otimes N}\}$ is $\nu$-chaotic if $k$ particles out of $N$ become more and more independent as $N$ tends to infinity, and each particle's distribution tends to $\nu$. A sequence of symmetric probability measures on $S^N$ is chaotic if it is $\nu$-chaotic for some probability measure $\nu$ on $S$. If a Markov process on $S^N$ begins in a random state with the distribution $\nu^{\otimes N}$, the distribution of the state after $t$ seconds of Markovian random motion can be expressed in terms of the transition function $K_N$ for the Markov process. The distribution at time $t>0$ is denoted by $U^N_t\nu^{\otimes N}$, defined by the kernel

$$U^N_t\nu^{\otimes N}(B)\stackrel{\mathrm{def}}{=}\int_{S^N} K_N(s,B,t)\,\nu^{\otimes N}(\mathrm{d}s). \tag{3.6}$$

Definition 3.5. (Propagation of Chaos) A sequence of functions

$$\big\{K_N(s,B,t)\big\}_{N\in\mathbb{N}}, \tag{3.7}$$

whose $N$-th term is a Markov transition function on $S^N$ that satisfies the permutation condition

$$K_N(s,B,t) = K_N(\pi\cdot s,\pi\cdot B,t), \tag{3.8}$$

propagates chaos if whenever $\{\nu^{\otimes N}\}_{N\in\mathbb{N}}$ is chaotic, so is $\{U^N_t\nu^{\otimes N}\}$ for any $t\ge 0$, where $U^N_t$ is defined in Eq. (3.6).

In the context of this paper, the propagation of chaos simply means that the time-scaled empirical measures $\widehat\mu^N_t$ (say, on the path space) converge weakly in probability to the deterministic measure $\mu_t$, or equivalently that the law of $(\xi^1_{\lfloor t/\eta\rfloor},\cdots,\xi^N_{\lfloor t/\eta\rfloor})$ converges weakly to the product measure $\mu^{\otimes N}_t$ for any fixed $N$, where $\mu_t$ is the law of the following reflected diffusion-drift Itô process:

$$\theta_t = \theta_0 - \int_0^t\frac{\partial}{\partial\xi}J(\xi_s,\mu_s(\xi))\,\mathrm{d}s + \sqrt{\frac{2}{\beta}}\,W_t + Z^-_t - Z^+_t,\qquad \mu_t\stackrel{\mathrm{law}}{=}\theta_t. \tag{3.9}$$

In Eq. (3.9), $Z^-_t$ and $Z^+_t$ are the non-negative reflection processes at the boundaries $\xi_l$ and $\xi_u$, respectively. In particular, $Z^-_t$ and $Z^+_t$ are non-decreasing, càdlàg, with initial values $Z^+_0 = Z^-_0 = 0$, and

$$\int_0^\infty(X_t-\xi_l)\,\mathrm{d}Z^-_t = 0,\qquad\int_0^\infty(\xi_u-X_t)\,\mathrm{d}Z^+_t = 0, \tag{3.10}$$

where the integrals are in the Stieltjes sense. In fact, we show that the law of the Langevin particles $(\xi^1_{\lfloor t/\eta\rfloor},\cdots,\xi^N_{\lfloor t/\eta\rfloor})$ converges in the 2-Wasserstein distance to $\mu^{\otimes N}_t$. Alternatively, due to the bounded equivalence of the Wasserstein distance and the total variation distance on compact metric spaces (cf. (A.3)), the sense in which $\widehat\mu^N_t$ converges in probability to $\mu_t$ can be strengthened; rather than working with the usual weak-* topology induced by duality with bounded continuous test functions (see, e.g., [26]), we work with the stronger topology induced by duality with bounded measurable test functions.
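The reflected dynamics (3.9) can be simulated by an Euler-Maruyama scheme in which each step is folded back into $[\xi_l,\xi_u]$; the sketch below uses a fixed placeholder drift rather than the McKean-Vlasov drift $\partial J/\partial\xi$, which would additionally depend on the empirical law of the particles. All numerical values are illustrative.

```python
import numpy as np

def reflect(x, lo, hi):
    """Fold x back into [lo, hi] (elastic reflection at the two barriers)."""
    width = hi - lo
    x = np.mod(x - lo, 2.0 * width)
    x = np.where(x > width, 2.0 * width - x, x)
    return x + lo

def simulate_reflected_sde(drift, theta0, T=10.0, dt=1e-3, beta=50.0,
                           lo=0.0, hi=2.0, seed=4):
    """Euler-Maruyama for d theta = -drift(theta) dt + sqrt(2/beta) dW on [lo, hi]."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(int(T / dt)):
        dW = rng.normal(scale=np.sqrt(dt), size=theta.shape)
        theta = theta - drift(theta) * dt + np.sqrt(2.0 / beta) * dW
        theta = reflect(theta, lo, hi)
    return theta

# Placeholder drift; the mean-field drift would be (d/dxi) J(xi, p_t).
theta_T = simulate_reflected_sde(lambda th: th - 1.0, theta0=np.full(1000, 0.5))
print("empirical mean at T:", theta_T.mean())
```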
Third, the Robin boundary conditions in Eqs. (3.3b)-(3.3c) capture the effect of an elastic reflection of the particles in the projected particle Langevin optimization method. Alternatively, the Robin boundary conditions assert that the flux of particles in and out of the boundaries is zero. In particular, the particles are constrained to stay in $(\xi_l,\xi_u)$ by barriers at $\xi = \xi_u$ and $\xi = \xi_l$, in such a way that when a particle hits a barrier with incoming velocity $v<0$, it instantly bounces back with velocity $-\rho v\ge 0$, where $\rho\ge 0$ is the velocity restitution coefficient. When $\rho = 1$, we say that the reflection is perfectly elastic; when $\rho = 0$, it is said to be a rigid reflection, resulting in Neumann boundary conditions instead of Robin boundary conditions. By the definition of the Euclidean projection $P_{\Xi^N}(\cdot)$ in the projected Langevin iteration of Eq. (3.1), the particles bounce off the boundary of the projection space with the elastic parameter $\rho = \beta$.

Fourth, we note that the solution of the distributional dynamics in Eq. (3.3) is absolutely continuous with respect to $\xi$, and lacks atoms at the boundaries $\xi = \xi_l$ and $\xi = \xi_u$. This is due to the fact that the boundaries are non-absorbing.

In the next proposition, we establish the existence and uniqueness of the steady-state solutions of the mean-field PDE (3.3):

Proposition 3.1. (Existence and Uniqueness of the Steady-State Solution) Suppose Assumptions (A.1)-(A.4) are satisfied. Then, the following assertions hold:
• There exists a unique non-negative steady-state solution for the mean-field PDE in Eq. (3.3) in the weak sense.
• The fixed point of the restricted Gibbs-Boltzmann measure

$$p^*(\xi) = \frac{\exp(-\beta J(\xi,p^*(\xi)))}{Z_\beta},\quad\xi\in\Xi,\qquad Z_\beta\stackrel{\mathrm{def}}{=}\int_\Xi\exp(-\beta J(\xi,p^*(\xi)))\,\mathrm{d}\xi, \tag{3.11}$$

is a steady-state solution of the mean-field PDE in Eq. (3.3).

The proof of Proposition 3.1 is deferred to Appendix A.3.

4. Performance Evaluation on Synthetic and Benchmark Data-Sets

For the simulations of this section, we consider kernel SVMs in conjunction with the random feature model of Rahimi and Recht [41, 42]. Specifically, we consider the following random feature model:

$$K(x,\tilde x) = \int_\Omega\varphi(x;\omega)\varphi(\tilde x;\omega)\,\nu(\mathrm{d}\omega), \tag{4.1}$$

where $\varphi:\mathcal{X}\times\Omega\to\mathbb{R}$ is the random feature map. For translation invariant kernels, we have $\varphi(x;\omega) = \sqrt{2}\cos(\langle x,\omega\rangle + b)$, where $b\sim\mathrm{Uniform}[-\pi,\pi]$ is a random bias term. Given the i.i.d. samples $\omega_1,\cdots,\omega_D\sim_{\mathrm{i.i.d.}}\nu$, we minimize the following risk function:

$$\min_{\alpha\in\mathbb{R}^D}\ \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\ 1-\sum_{k=1}^D y_i\alpha_k\varphi(x_i;\omega_k)\Big\} + \lambda\|\alpha\|_2^2. \tag{4.2}$$

To compute the random features associated with the radial kernels, we note that the probability measure of the random feature has the following Lebesgue density:

$$\frac{\mathrm{d}\nu}{\mathrm{d}\omega} = \int_\Xi\Big(\frac{1}{4\pi\xi}\Big)^{\frac{d}{2}}\exp\Big(-\frac{\|\omega\|_2^2}{4\xi}\Big)\,\mu(\mathrm{d}\xi).$$

More specifically, due to the fact that

$$\int_0^\infty\Bigg(\int_{\mathbb{R}^d}\Bigg|e^{-i\langle\omega,x-\tilde x\rangle}\Big(\frac{1}{4\pi\xi}\Big)^{\frac{d}{2}}\exp\Big(-\frac{\|\omega\|_2^2}{4\xi}\Big)\Bigg|\,\mathrm{d}\omega\Bigg)\mu(\mathrm{d}\xi)<\infty, \tag{4.3}$$

the Tonelli-Fubini theorem is applicable. Therefore, we may change the order of integration to derive

$$\int_{\mathbb{R}^d}e^{-i\langle\omega,x-\tilde x\rangle}\Bigg(\int_0^\infty\Big(\frac{1}{4\pi\xi}\Big)^{\frac{d}{2}}\exp\Big(-\frac{\|\omega\|_2^2}{4\xi}\Big)\mu(\mathrm{d}\xi)\Bigg)\mathrm{d}\omega = \int_0^\infty\Bigg(\int_{\mathbb{R}^d}e^{-i\langle\omega,x-\tilde x\rangle}\frac{1}{(4\pi\xi)^{\frac{d}{2}}}\exp\Big(-\frac{\|\omega\|_2^2}{4\xi}\Big)\mathrm{d}\omega\Bigg)\mu(\mathrm{d}\xi) = \int_0^\infty e^{-\xi\|x-\tilde x\|_2^2}\,\mu(\mathrm{d}\xi),$$

where we recognize the last term as the integral representation of the radial kernel in Eq. (2.9). Let $\xi = (\xi^k)_{1\le k\le N}\sim_{\mathrm{i.i.d.}}\widehat\mu^N$ denote the vector of outputs of Algorithm 1. We then draw the random features

$$\varphi(x)\stackrel{\mathrm{def}}{=}\big(\varphi(x;\omega_1),\cdots,\varphi(x;\omega_D)\big), \tag{4.4}$$

where $\omega_1,\cdots,\omega_D\sim_{\mathrm{i.i.d.}}\widehat\nu^N$.
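The mixture form of $\mathrm{d}\nu/\mathrm{d}\omega$ above suggests a simple two-stage sampler, sketched below under the assumption that each component $(1/(4\pi\xi))^{d/2}e^{-\|\omega\|_2^2/(4\xi)}$ is the density of $N(0,2\xi I_d)$: draw a particle $\xi$ from the output of Algorithm 1, then draw $\omega\mid\xi\sim N(0,2\xi I_d)$. The Gamma-distributed particles stand in for Algorithm 1's output.

```python
import numpy as np

def draw_radial_features(xi_particles, d, D, rng):
    """Sample D random features for the learned radial kernel:
    xi ~ empirical measure of the particles, then omega | xi ~ N(0, 2*xi*I_d)."""
    xi = rng.choice(xi_particles, size=D)                           # mixture component
    omega = rng.normal(size=(D, d)) * np.sqrt(2.0 * xi)[:, None]    # Gaussian slab
    b = rng.uniform(-np.pi, np.pi, size=D)                          # random bias terms
    return omega, b

def feature_map(X, omega, b):
    """phi(x) = sqrt(2/D) cos(<omega_k, x> + b_k), so phi(x)^T phi(x') ~ K(x, x')."""
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omega.T + b)

rng = np.random.default_rng(5)
xi_particles = rng.gamma(2.0, 0.5, size=500)    # stand-in for Algorithm 1 output
omega, b = draw_radial_features(xi_particles, d=10, D=2000, rng=rng)

X = rng.normal(size=(5, 10))
Phi = feature_map(X, omega, b)
K_rf = Phi @ Phi.T                               # random feature kernel estimate
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K_mc = np.mean(np.exp(-np.multiply.outer(xi_particles, d2)), axis=0)
print(np.max(np.abs(K_rf - K_mc)))               # small for large D
```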
We compare our method with three alternative kernel learning techniques, namely the importance sampling of Sinha and Duchi [48], Gaussian bandwidth optimization via $k$-nearest neighbors ($k$-NN) [7], and our particle SGD proposed in [26]. In the sequel, we provide a brief description of each method:
• Importance Sampling: the importance sampling of Sinha and Duchi [48] assigns a weight $w_m$ to each sample $\omega_m\in\mathbb{R}^d$ for $m = 1,2,\cdots,N$. The weights are then optimized via the following standard optimization procedure:

$$\max_{(w_1,\cdots,w_N)\in\mathcal{Q}_N}\ \frac{2}{n(n-1)}\sum_{1\le i<j\le n}y_i y_j\sum_{m=1}^N w_m\,\varphi(x_i;\omega_m)\varphi(x_j;\omega_m),$$

where $\mathcal{Q}_N$ is a constraint set for the weights.

4.1. Experiments on Synthetic Data. For the experiments with synthetic data, we use the setup of [26]. The synthetic data-set we consider is as follows:
• The distribution of the training data is $P_V = N(0,(1+\lambda)I_{d\times d})$.
• The distribution of the generated data is $P_W = N(0,(1-\lambda)I_{d\times d})$.
To reduce the dimensionality of the data, we consider the embedding $\iota:\mathbb{R}^d\mapsto\mathbb{R}^{d_0}$, $x\mapsto\iota(x) = \Sigma x$, where $\Sigma\in\mathbb{R}^{d_0\times d}$ and $d_0<d$. In this case, the distributions of the embedded features are $P_{X|Y=+1} = N(0,(1+\lambda)\Sigma\Sigma^T)$ and $P_{X|Y=-1} = N(0,(1-\lambda)\Sigma\Sigma^T)$.

Note that $\lambda\in[0,1]$ is a parameter that determines the separation of the two distributions. In particular, the Kullback-Leibler divergence of the two multivariate Gaussian distributions is controlled by $\lambda\in[0,1]$:

$$D_{\mathrm{KL}}\big(P_{X|Y=-1},P_{X|Y=+1}\big) = \frac{1}{2}\Big[d_0\log\Big(\frac{1+\lambda}{1-\lambda}\Big) - d_0 + d_0\Big(\frac{1-\lambda}{1+\lambda}\Big)\Big]. \tag{4.8}$$

Figure 1. Visualization of data-points using the synthetic data generation models with multivariate Gaussian distributions $P_V = N(0,(1+\lambda)I_{d\times d})$ and $P_W = N(0,(1-\lambda)I_{d\times d})$ for $d = 2$. Panel (a): $\lambda = 0.1$, Panel (b): $\lambda = 0.5$, Panel (c): $\lambda = 0.9$.

In Figure 1, we show i.i.d. samples from the distributions $P_V$ and $P_W$ for the variance parameters $\lambda = 0.1$, $\lambda = 0.5$, and $\lambda = 0.9$. Notice that for larger $\lambda$ the two point clouds overlap more, and thus performing the two-sample test is more difficult. From Figure 1, we clearly observe that for large values of $\lambda$, the data-points from the two distributions $P_V$ and $P_W$ have a large overlap, and conducting a statistical test to distinguish between these two distributions is more challenging.

4.1.1. Kernel Learning Approach. Figure 4 depicts our two-phase kernel learning procedure. The kernel learning approach consists of training the auto-encoder for dimensionality reduction and optimizing the kernel sequentially, i.e.,

$$\sup_{\widehat\mu^N\in\mathcal{P}^N}\ \sup_{\iota\in\mathcal{Q}}\ \widehat{\mathrm{MMD}}_{K_{\widehat\mu^N}\circ\iota}\big[P_V,P_W\big], \tag{4.9}$$

where the function class is defined as $\mathcal{Q}\stackrel{\mathrm{def}}{=}\{\iota(z) = \sigma(\Sigma z + b),\ \Sigma\in\mathbb{R}^{d_0\times d},\ b\in\mathbb{R}^{d_0}\}$, and $(K_{\widehat\mu^N}\circ\iota)(x_1,x_2) = K_{\widehat\mu^N}(\iota(x_1),\iota(x_2))$. Here, $\sigma(\cdot)$ is the sigmoid non-linearity. We consider a two-phase optimization procedure:
• Phase (I): we fix the kernel function, and optimize the auto-encoder to compute a covariance matrix $\Sigma$ and a bias term $b$ for the dimensionality reduction.
• Phase (II): we optimize the kernel based on the learned embedded features $\iota(x)$.
This two-phase procedure significantly improves the computational complexity of SGD, as it reduces the dimensionality of the random feature samples to $\mathbb{R}^{d_0}$ with $d_0\ll d$.
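The synthetic two-sample construction above admits a direct implementation; the sketch below draws samples from $P_V$ and $P_W$, applies a random linear embedding $\iota(x) = \Sigma x$, and evaluates the closed-form KL divergence of Eq. (4.8), assuming the isotropic covariances $(1\pm\lambda)I$ stated above. The seed and sample sizes are illustrative.

```python
import numpy as np

def make_two_sample(n, d, lam, rng):
    """V ~ N(0, (1+lam) I_d), W ~ N(0, (1-lam) I_d)."""
    V = rng.normal(size=(n, d)) * np.sqrt(1.0 + lam)
    W = rng.normal(size=(n, d)) * np.sqrt(1.0 - lam)
    return V, W

def kl_gaussians(lam, d0):
    """Eq. (4.8): KL between N(0, (1-lam) S) and N(0, (1+lam) S), rank(S) = d0."""
    r = (1.0 - lam) / (1.0 + lam)
    return 0.5 * d0 * (np.log(1.0 / r) - 1.0 + r)

rng = np.random.default_rng(6)
d, d0 = 100, 50
Sigma = rng.normal(size=(d0, d)) / np.sqrt(d)    # random embedding iota(x) = Sigma x
for lam in (0.1, 0.5, 0.9):
    V, W = make_two_sample(100, d, lam, rng)
    V_emb, W_emb = V @ Sigma.T, W @ Sigma.T      # embedded features in R^{d0}
    print(f"lambda={lam}:  KL={kl_gaussians(lam, d0):.3f}  shape={V_emb.shape}")
```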
4.1.2. Statistical Hypothesis Testing with the Kernel MMD. Let $V_1,\cdots,V_m\sim_{\mathrm{i.i.d.}}P_V = N(0,(1+\lambda)I_{d\times d})$ and $W_1,\cdots,W_n\sim_{\mathrm{i.i.d.}}P_W = N(0,(1-\lambda)I_{d\times d})$. Given these i.i.d. samples, the statistical test $T(\{V_i\}_{i=1}^m,\{W_j\}_{j=1}^n):\mathcal{V}^m\times\mathcal{W}^n\to\{0,1\}$ is used to distinguish between the hypotheses:
• Null hypothesis $H_0$: $P_V = P_W$ (thus $\lambda = 0$),
• Alternative hypothesis $H_1$: $P_V\neq P_W$ (thus $\lambda>0$).
To distinguish the two hypotheses, we employ the kernel maximum mean discrepancy (MMD) over a unit ball of $\mathcal{H}_\mathcal{X}$, where $\mathcal{H}_\mathcal{X}$ is a universal RKHS defined on a compact metric space $\mathcal{X}$. Universality requires that the kernel $K(\cdot,\cdot)$ be continuous and that $\mathcal{H}_\mathcal{X}$ be dense in $C(\mathcal{X})$. Under these conditions, the following theorem establishes that the kernel MMD is indeed a metric:

Theorem 4.1. (Metrizability of the RKHS) Let $\mathcal{F}$ denote a unit ball in a universal RKHS $\mathcal{H}_\mathcal{X}$ defined on a compact metric space $\mathcal{X}$ with the associated continuous kernel $K(\cdot,\cdot)$. Then, the kernel MMD is a metric in the sense that $\mathrm{MMD}_K[P_V,P_W] = 0$ if and only if $P_V = P_W$.

A radial kernel is universal if the support of the measure in the integral representation in Eq. (2.9) excludes the origin, i.e., $\mathrm{supp}(\mu)\neq\{0\}$; see [53].

To design a test, let $\widehat\mu^N_m(\xi) = \frac{1}{N}\sum_{k=1}^N\delta(\xi-\xi^k_m)$ denote the solution of the SGD in (4.7) for solving the optimization problem. Consider the following MMD estimator, consisting of two $U$-statistics and an empirical cross term:

$$\widehat{\mathrm{MMD}}_{K_{\widehat\mu^N_m}\circ\iota}\big[\{V_i\}_{i=1}^m,\{W_i\}_{i=1}^n\big] = \frac{1}{m(m-1)}\sum_{k=1}^N\sum_{i\neq j}\varphi(\iota(V_i);\omega^k_m)\varphi(\iota(V_j);\omega^k_m) + \frac{1}{n(n-1)}\sum_{k=1}^N\sum_{i\neq j}\varphi(\iota(W_i);\omega^k_m)\varphi(\iota(W_j);\omega^k_m) - \frac{2}{nm}\sum_{k=1}^N\sum_{i=1}^m\sum_{j=1}^n\varphi(\iota(W_i);\omega^k_m)\varphi(\iota(V_j);\omega^k_m). \tag{4.10}$$

Given the samples $\{V_i\}_{i=1}^m$ and $\{W_i\}_{i=1}^n$, we design the test statistic as below:

$$T(\{V_i\}_{i=1}^m,\{W_i\}_{i=1}^n)\stackrel{\mathrm{def}}{=}\begin{cases}H_0 & \text{if }\widehat{\mathrm{MMD}}_{K_{\widehat\mu^N_m}\circ\iota}\big[\{V_i\}_{i=1}^m,\{W_i\}_{i=1}^n\big]\le\tau,\\ H_1 & \text{if }\widehat{\mathrm{MMD}}_{K_{\widehat\mu^N_m}\circ\iota}\big[\{V_i\}_{i=1}^m,\{W_i\}_{i=1}^n\big]>\tau,\end{cases} \tag{4.11}$$

where $\tau\in\mathbb{R}$ is a threshold. Notice that the unbiased MMD estimator of (4.10) can be negative, despite the fact that the population MMD is non-negative. Consequently, negative values of the statistical threshold $\tau$ in (4.11) are admissible. Nevertheless, in our simulations we only consider non-negative values of the threshold $\tau$.

A Type I error is made when $H_0$ is rejected based on the observed samples, despite the null hypothesis having generated the data. Conversely, a Type II error occurs when $H_0$ is accepted despite the alternative hypothesis $H_1$ being true. The significance level $\alpha$ of a test is an upper bound on the probability of a Type I error: this is a design parameter of the test which must be set in advance, and is used to determine the threshold to which we compare the test statistic. The power of a test is the probability of rejecting the null hypothesis $H_0$ when it is indeed incorrect. In particular,

$$\mathrm{Power}\stackrel{\mathrm{def}}{=}\mathbb{P}(\text{reject } H_0\mid H_1\text{ is true}). \tag{4.12}$$

In this sense, the statistical power controls the probability of making Type II errors.
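A sketch of the unbiased estimator (4.10) and the threshold test (4.11), written for a generic random-feature map so each $U$-statistic reduces to sums of feature inner products; the feature map, the threshold, and the distributions below are illustrative stand-ins rather than the trained kernel of Algorithm 1.

```python
import numpy as np

def mmd2_unbiased(Phi_V, Phi_W):
    """Unbiased MMD^2 from feature maps: two U-statistics plus a cross term."""
    m, n = len(Phi_V), len(Phi_W)
    sv, sw = Phi_V.sum(axis=0), Phi_W.sum(axis=0)
    # sum_{i != j} <phi(V_i), phi(V_j)> = ||sum_i phi(V_i)||^2 - sum_i ||phi(V_i)||^2
    vv = (sv @ sv - np.sum(Phi_V**2)) / (m * (m - 1))
    ww = (sw @ sw - np.sum(Phi_W**2)) / (n * (n - 1))
    vw = (sv @ sw) / (m * n)
    return vv + ww - 2.0 * vw

def test_statistic(Phi_V, Phi_W, tau=0.01):
    """Decide H0 (0, same distribution) vs H1 (1), per Eq. (4.11)."""
    return int(mmd2_unbiased(Phi_V, Phi_W) > tau)

rng = np.random.default_rng(7)
omega = rng.normal(size=(500, 20))
b = rng.uniform(-np.pi, np.pi, 500)
phi = lambda X: np.sqrt(2.0 / 500) * np.cos(X @ omega.T + b)
V = rng.normal(size=(100, 20)) * np.sqrt(1.5)    # P_V
W = rng.normal(size=(100, 20)) * np.sqrt(0.5)    # P_W
print("reject H0:", test_statistic(phi(V), phi(W)))
```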
4.2. Empirical results on benchmark data-sets. In Figure 2, we evaluate the power of the test over 100 trials of the hypothesis test using the test statistic of (4.11). To obtain the results, we used an auto-encoder to reduce the dimension from $d = 100$ to $d_0 = 50$. Clearly, for the trained kernel in Panel (a) of Figure 2, the threshold $\tau$ for which Power $= 1$ increases after learning the kernel via the two-phase procedure described earlier. In comparison, in Panel (b), we observe that training an auto-encoder only, with a fixed standard Gaussian kernel $K(x,y) = \exp(-\|x-y\|_2^2)$, attains lower thresholds compared to our two-phase procedure. In Panel (c), we demonstrate the case of a fixed Gaussian kernel without an auto-encoder. In this case, the threshold is significantly lower due to the large dimensionality of the data. We also observe that the particle SGD we proposed in [26] to optimize translation invariant kernels provides the highest statistical power for a given threshold value. Nevertheless, as we observe in Figure 4, the run-time of the particle SGD for a given number of particles is significantly higher than that of Algorithm 1. This is due to the fact that the dimension of the particles in the particle SGD depends on the number of hidden layers of the auto-encoder. In contrast, the particles in Algorithm 1 are one dimensional.

Figure 2. The statistical power versus the threshold $\tau$ for binary hypothesis testing via the unbiased estimator of the kernel MMD. The parameters for these simulations are $\lambda\in\{0.1, 0.5, 0.9\}$, $d = 100$, $d_0 = 50$, $n + m = 100$, and $N = 5000$. Panel (a): trained radial kernel using the two-phase procedure with the particle Langevin optimization in Algorithm 1 and an auto-encoder. Panel (b): trained shift-invariant kernel using the particle SGD in Eq. (4.7) and an auto-encoder. Panel (c): trained with an auto-encoder and a fixed Gaussian kernel with bandwidth $\sigma = 1$. Panel (d): untrained kernel without an auto-encoder.

Figure 3. The evolution of the histogram of the Langevin particles $\xi^1,\cdots,\xi^N\in\Xi = \mathbb{R}_+$ and the predictions from the theory (Theorem 3.2), with $\gamma = 10000$, $R = 10000$, $\eta = 10^{-4}$. Panel (a): $m = 10$ ($t = 10^{-3}$s), Panel (b): $m = 1000$ ($t = 0.1$s), Panel (c): $m = 5000$ ($t = 0.5$s), Panel (d): $m = 10000$ ($t = 1$s), Panel (e): $m = 50000$ ($t = 5$s), Panel (f): $m = 100000$ ($t = 10$s), Panel (g): $m = 200000$ ($t = 20$s), Panel (h): $m = 300000$ ($t = 30$s).

Figure 4. The scatter plots of the normalized Hamming distance versus the Euclidean distance for the locally sensitive hash function of Lemma 5.2. The hash functions are generated with the random features associated with an untrained Gaussian kernel with bandwidth parameter $\sigma = 1$ (top row), and a kernel trained with Algorithm 1 in conjunction with a $k$-means clustering method (bottom row). Panels (a),(d): 128 bits. Panels (b),(e): 512 bits. Panels (c),(f): 4096 bits.

5. Application to Locally Sensitive Hashing

The research on computer systems and web search in the mid-nineties focused on designing "hash" functions that are sensitive to the topology of the input domain, and approximately preserve distances during hashing. Constructions of such hash functions led to efficient methods for detecting the similarity of files in distributed file systems and the proximity of documents on the web. We describe the problem and results below. Recall that a metric space $(\mathcal{C},d)$ is given by a set $\mathcal{C}$ and a distance measure $d:\mathcal{C}\times\mathcal{C}\to\mathbb{R}_+$ which satisfies the axioms of a metric.
Definition 5.1. (Locally Sensitive Hash Function) Given a metric space $(\mathcal{C},d)$ and a set $\mathcal{L}$, a family of functions $\mathcal{H}\subseteq\{h:\mathcal{C}\to\mathcal{L}\}$ is said to be a basic locality sensitive hash (LSH) family if there exists an increasing invertible function $\alpha:\mathbb{R}_+\to[0,1]$ such that for all $x,\tilde x\in\mathcal{C}$, we have

$$\mathbb{P}_{h\in\mathcal{H}}\big[h(x) = h(\tilde x)\big]\le\alpha\big(d(x,\tilde x)\big). \tag{5.1}$$

To contrast locally sensitive hash functions with standard hash function families, note that in the latter, the goal is to map a domain $\mathcal{C}$ to a range $\mathcal{L}$ such that the probability of a collision among any pair of elements $x\neq\tilde x\in\mathcal{C}$ is small. In contrast, with LSH families, we wish the probability of a collision to be small only when the pairwise distance $d(x,\tilde x)$ is large, and we want a high probability of collision when $d(x,\tilde x)$ is small. Indyk and Motwani [22] showed that such a hashing scheme facilitates the construction of efficient data structures for answering approximate nearest-neighbor queries on a collection of objects.

Figure 5. The precision-recall curves for kernel locally sensitive hashing. Panel (a): hash function $F^n:\mathcal{X}\to\mathbb{Z}^n_2$ of Lemma 5.2 with an untrained Gaussian kernel. Panel (b): hash function $F^n:\mathcal{X}\to\mathbb{Z}^n_2$ of Lemma 5.2 with the kernel trained with Algorithm 1, where the labeled data for kernel training is generated via $k$-means clustering with $k = 10$ and the one-versus-all rule. Panel (c): hash function $G^n:\mathcal{X}\to\mathbb{Z}^n_q$ of Lemma 5.3 with different quantization levels $q$ and the fixed code-word length $n = 512$.

Figure 6. The (normalized) histogram of the (rescaled) Lee distance (red) and the (rescaled) Euclidean distance (blue) of a sample query point from each point of the MNIST data-base, for different quantization levels $q$ in Lemma 5.3. Panel (a): $q = 2$, Panel (b): $q = 20$, Panel (c): $q = 50$. Increasing the quantization level $q$ in the Lee distance yields a more fine-grained approximation of the Euclidean distance. (best viewed in color)

In the sequel, we describe a family of locally sensitive hash functions due to Raginsky et al. [40], where the measure $d(x,\tilde x)$ is determined by a kernel function. We draw a random threshold $t\sim\mathrm{Uniform}[-1,1]$ and define the quantizer $Q_t(u) = \mathrm{sgn}(t+u)$. The following lemma is due to Raginsky et al. [40]:

Lemma 5.2. (Kernel Hash Function, Raginsky et al. [40]) Consider a translation invariant kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, $K(x,\tilde x) = \phi(x-\tilde x)$. Define the following Hamming $\mathbb{Z}^n_2$-code:

$$F^n(x)\stackrel{\mathrm{def}}{=}\big(F_{t_1,b_1,\omega_1}(x),\cdots,F_{t_n,b_n,\omega_n}(x)\big), \tag{5.2}$$

where $(b_i)_{1\le i\le n}\sim_{\mathrm{i.i.d.}}\mathrm{Uniform}[-\pi,\pi]$, $(t_i)_{1\le i\le n}\sim_{\mathrm{i.i.d.}}\mathrm{Uniform}[-1,1]$, and $(\omega_i)_{1\le i\le n}\sim_{\mathrm{i.i.d.}}\nu$, where $\nu$ is the distribution in the Rahimi and Recht random feature model

$$\phi(x-\tilde x) = \int_{\mathbb{R}^d}\varphi(x;\omega)\varphi(\tilde x;\omega)\,\mathrm{d}\nu(\omega). \tag{5.3}$$

Furthermore, $F_{t,b,\omega}:\mathbb{R}^d\to\mathbb{Z}_2$ is the mapping defined as

$$F_{t,b,\omega}(x)\stackrel{\mathrm{def}}{=}\frac{1}{2}\big[1 + Q_t(\cos(\langle\omega,x\rangle + b))\big]. \tag{5.4}$$

Fix $\epsilon,\delta\in(0,1)$. For any finite data set $\mathcal{D} = \{x_1,\cdots,x_N\}\subset\mathbb{R}^d$, $F^n$ is such that

$$h_K(x_i-x_j) - \delta\le\frac{1}{n}d_H\big(F^n(x_i),F^n(x_j)\big)\le h_K(x_i-x_j) + \delta, \tag{5.5}$$

where $d_H(\cdot,\cdot):\mathbb{Z}^n_2\times\mathbb{Z}^n_2\to\{0,1,\cdots,n\}$ is the Hamming distance. Furthermore,

$$h_K(x_i-x_j)\stackrel{\mathrm{def}}{=}\frac{8}{\pi^2}\sum_{m=0}^{\infty}\frac{1 - K\big((2m+1)x_i - (2m+1)x_j\big)}{(2m+1)^2}. \tag{5.6}$$

Figure 7. Image retrieval for three query digits $\{0, 1, 4\}$ on the MNIST database. The query image is in the top left of the collage (red box), and the incorrect retrieved images are shaded in gray. Panels (a)-(c): Euclidean distance. Panels (d)-(f): $\mathbb{Z}^n_2$ code with an untrained kernel and $n = 4096$ bits. Panels (g)-(i): $\mathbb{Z}^n_2$ code with a kernel trained with Algorithm 1 and $n = 4096$. Panels (j)-(l): $\mathbb{Z}^n_q$ code with $q = 10$ and $n = 1024$. (best viewed in color)
The construction of the LSH in Eq. (5.4) of Lemma 5.2 is based on the random feature model of Rahimi and Recht [41, 42], and is restricted to Hamming codes defined on a hyper-cube. In the sequel, we present an alternative kernel LSH using the embedding of functions in the $L^p(\Omega,\nu)$ space into $\ell_p$-spaces, generalizing the result of Lemma 5.2 to any finite integer alphabet $\mathbb{Z}_q\stackrel{\mathrm{def}}{=}\{0,1,\cdots,q-1\}$. To describe the result, we define the Lee distance between two code-words $x = (x_1,\cdots,x_n),\ y = (y_1,\cdots,y_n)\in\mathbb{Z}^n_q$ of length $n$ as follows:

$$d_{\mathrm{Lee}}(x,y)\stackrel{\mathrm{def}}{=}\sum_{i=1}^n\min\{(x_i-y_i)\bmod q,\ (y_i-x_i)\bmod q\} \tag{5.7a}$$

$$=\sum_{i=1}^n\min\{|y_i-x_i|,\ q-|y_i-x_i|\}. \tag{5.7b}$$

In the special cases of $q = 2$ and $q = 3$, the Lee distance corresponds to the Hamming distance, as the distances are 0 for two equal symbols and 1 for two non-equal symbols. For $q>3$, the Lee distance no longer coincides with the Hamming distance.

Lemma 5.3. (Generalized Kernel LSH) Consider a translation invariant kernel $K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, $K(x,\tilde x) = \psi(\|x-\tilde x\|_2)$. Define a random map $G_{t,w}:\mathbb{R}^d\to\mathbb{Z}_q$ through the following function:

$$G_{t,w}(x)\stackrel{\mathrm{def}}{=}\big\lceil q^{-1}\big(\langle w,\varphi^N(x)\rangle + t\big)\big\rceil\bmod q, \tag{5.8}$$

where $w\sim N(0,I_{N\times N})$ and $t\sim\mathrm{Uniform}[0,q]$. Moreover, $\varphi^N(x)\stackrel{\mathrm{def}}{=}\big(\cos(\langle\omega_k,x\rangle + b_k)\big)_{1\le k\le N}$, where $b_1,\cdots,b_N\sim_{\mathrm{i.i.d.}}\nu_0\stackrel{\mathrm{def}}{=}\mathrm{Uniform}[-\pi,\pi]$, and $\omega_1,\cdots,\omega_N\sim_{\mathrm{i.i.d.}}\nu$. Then, the probability of collision is bounded as follows:

$$\Big|\mathbb{P}\big[G_{t,w}(x) = G_{t,w}(\tilde x)\big] - \Psi_q\big(K(x,\tilde x)\big)\Big| = O\Big(\frac{\ln(N)}{N}\Big),$$

where $\Psi_q:\mathbb{R}_+\mapsto\mathbb{R}_+$ is defined as

$$\Psi_q(u)\stackrel{\mathrm{def}}{=}\int_0^q\frac{2}{\sqrt{2\pi(1-u)}}\,e^{-\frac{s^2}{2(1-u)}}\Big(1-\frac{s}{q}\Big)\,\mathrm{d}s. \tag{5.9}$$

The proof of Lemma 5.3 is presented in Appendix A.12, and uses the properties of stable distributions.

5.1. Description of the experiment. We present image retrieval results for 10,000 images from the MNIST database [10]. The images are 28 by 28 pixels and are compressed via an auto-encoder with 500 hidden units, which has proven to be effective at learning useful features. For this experiment, we randomly select 1,000 images to serve as queries, and the remaining 9,000 images make up the "database". To distinguish true positives from false positives in the evaluation of the retrieval performance, we select a "nominal" neighborhood.

5.2. Results. In Figure 4, we illustrate the scatter plots of the hash codes $F^n(x)$ of length $n\in\{128, 512, 4096\}$ bits. In Figure 4 (a)-(c), we show the scatter plots for Hamming codes constructed by Lemma 5.2 using an untrained Gaussian kernel $K(x,\hat x) = \exp(-\|x-\hat x\|_2^2)$. From Figure 4 (a)-(c), we observe that as the number of bits of the Hamming code increases, the scatter plots concentrate around a curve that has a flat region. Evidently, this curve deviates from the ideal curve, which is a straight line passing through the origin. In particular, in Figure 4 (a)-(c), as the Euclidean distance varies over the interval [2, 8] on the x-axis, the normalized Hamming distance of the constructed codes remains constant.

To attain proportionality between the Hamming distance and the Euclidean distance, we train the kernel using Algorithm 1.
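A sketch of the binary code of Lemma 5.2 and the normalized Hamming distance used in the scatter plots of Figure 4, assuming Gaussian random features (i.e., an untrained Gaussian kernel); the code length, bandwidth, and data are illustrative.

```python
import numpy as np

def kernel_lsh_bits(X, n_bits, d, sigma=1.0, seed=8):
    """Binary code F^n(x) of Lemma 5.2: one random feature per bit,
    F_{t,b,omega}(x) = (1/2)[1 + Q_t(cos(<omega, x> + b))], with Q_t(u) = sgn(t + u)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=(n_bits, d)) / sigma      # features of a Gaussian kernel
    b = rng.uniform(-np.pi, np.pi, size=n_bits)
    t = rng.uniform(-1.0, 1.0, size=n_bits)
    return (np.sign(t + np.cos(X @ omega.T + b)) + 1.0) / 2.0

def hamming(c1, c2):
    """Normalized Hamming distance between two codes of length n."""
    return np.mean(c1 != c2)

rng = np.random.default_rng(9)
X = rng.normal(size=(4, 32))
codes = kernel_lsh_bits(X, n_bits=512, d=32)
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[0]))   # latter is 0.0
```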
To generate the labels for kernel training, we first perform $k$-means clustering, with $k = 10$, on 1,000 images out of the 9,000 images of the data-base. Then, we generate binary class labels by assigning $y = +1$ to the points in a randomly chosen cluster, and $y = -1$ to the remaining points (i.e., the one-versus-all rule). After training the kernel, the kernel is used to generate the hash codes using the construction of Lemma 5.2. The resulting scatter plots are depicted in Figure 4 (d)-(f), where the Hamming distances are more proportional to the Euclidean distance.

Central to our study is the precision-recall curve, which captures the trade-off between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). Specifically,

$$\mathrm{Precision}\stackrel{\mathrm{def}}{=}\frac{|\{\text{relevant images}\}\cap\{\text{retrieved images}\}|}{|\{\text{retrieved images}\}|}, \tag{5.10a}$$

$$\mathrm{Recall}\stackrel{\mathrm{def}}{=}\frac{|\{\text{relevant images}\}\cap\{\text{retrieved images}\}|}{|\{\text{relevant images}\}|}. \tag{5.10b}$$

In Figure 5, we illustrate the precision-recall curves for Hamming codes of several different lengths $n$. We also depict the precision-recall curve for the Euclidean distance. In Figure 5(a), we depict the precision-recall curves for Hamming codes constructed from the Gaussian kernel. We observe that even Hamming codes of length 4096 bits attain a precision-recall curve that is significantly lower than the Euclidean curve. In Figure 5(b), we illustrate the precision-recall curves after training kernels using Algorithm 1. The curves from the Hamming codes are markedly closer to the curve associated with the Euclidean distance. In Figure 5(c), we depict the precision-recall curves using the $q$-ary codes of Lemma 5.3 for different quantization levels $q$. Clearly, increasing $q$ initially leads to better performance, while larger values of $q$ have a diminishing return. This can be justified using Figure 6, where we plot the normalized histogram of the distance of a sample query point from each point of the MNIST data-base consisting of 9,000 data-points. The blue histogram depicts the Euclidean distance, and the orange histogram with spikes depicts the Lee distance. Figures 6(a)-(c) are plotted with the quantization levels $q = 2$, $q = 20$, and $q = 50$, respectively. As the quantization level increases, the histogram of the Lee distance provides a better approximation of the Euclidean distance histogram. However, even for large $q$ there is a disparity between the two histograms, as the length of the $q$-ary codes is fixed at $n = 512$ and is finite.

In Figure 7, we depict the retrieval results for three query digits $\{0, 1, 4\}$ on the MNIST data-set. In Fig. 7(a)-(c), we show the performance of the retrieval system using the Euclidean distance for the nearest neighbor search. In particular, for each query digit, the retrieval system returns 99 similar images among the 9,000 images in the data-base using the nearest neighbor search. The retrieved images for less challenging digits such as 1 and 0 are all correct, while for more challenging query points such as 4, some erroneous images are retrieved.
In Figure 5, we illustrate the precision-recall curves for Hamming codes of four different lengths n, and we also depict the precision-recall curve for the Euclidean distance. In Figure 5(a), we depict the precision-recall curves for Hamming codes constructed from the Gaussian kernel. We observe that even Hamming codes of length 4096 bits attain a precision-recall curve that is significantly lower than the Euclidean curve. In Figure 5(b), we illustrate the precision-recall curves after training the kernel using Algorithm 1. The curves for the Hamming codes are markedly closer to the curve associated with the Euclidean distance. In Figure 5(c), we depict the precision-recall curves using the q-ary codes of Lemma 5.3 for several quantization levels q. Clearly, increasing q initially leads to better performance, while larger values of q have a diminishing return. This can be justified using Figure 6, where we plot the normalized histogram of the distance of a sample query point from each point of the MNIST database consisting of 9,000 data points. The blue histogram depicts the Euclidean distance, and the orange histogram with spikes depicts the Lee distance. Figures 6(a)-(c) are plotted with the quantization levels q = 2, q = 20, and q = 50, respectively. As the quantization level increases, the histogram of the Lee distance provides a better approximation of the Euclidean distance histogram. However, even for large q, a disparity between the two histograms remains, as the length of the q-ary codes is fixed at n = 512 and is finite.

In Figure 7, we depict the retrieval results for the three query digits {0, 1, 4} on the MNIST data-set. In Fig. 7(a)-(c), we show the performance of the retrieval system using the Euclidean distance for nearest neighbor search. In particular, for each query digit, the retrieval system returns the 99 most similar images among the 9,000 images in the database. The retrieved images for the less challenging digits 1 and 0 are all correct, while for more challenging query points such as 4, some erroneous images are retrieved. In Fig. 7(d)-(f), we depict the retrieved images using the Hamming distance for nearest neighbor search, where the Hamming codes for each feature vector are generated via the kernel LSH of Lemma 5.2 with a Gaussian kernel K(x, x̂) = exp(−‖x − x̂‖²). We observe that, compared to the Euclidean distance in Fig. 7(a)-(c), Hamming-based nearest neighbor search is significantly less accurate. We point out, however, that from a computational complexity perspective, computing the Hamming distance is significantly more efficient than computing the Euclidean distance, which is the main motivation for using LSH in nearest neighbor search.

In Fig. 7(g)-(i), we show the retrieval results after training the kernel using Algorithm 1. Clearly, the retrievals are more accurate on the digits 0 and 4. However, the length of the Hamming code required to achieve such performance is quite large, n = 4096. In Fig. 7(j)-(l), we show the retrieval performance using a q-ary code of length n = 1024 and quantization level q = 10. Although the length of the q-ary code is significantly smaller than that of the Hamming code, the retrieval error is slightly lower; the computational cost, however, is slightly higher than that of the Hamming code.

In the future, we will test our method on data-sets consisting of millions of data points. At present, our promising initial results on the MNIST data-set, combined with our theoretical analysis, demonstrate the potential usefulness of our kernel learning scheme for large-scale indexing and search applications.

6. Classification on Benchmark Data-Sets

We now apply our kernel learning method to classification and regression tasks on real-world data-sets.

6.1. Data-Sets Description. In Table 1, we provide the characteristics of each data-set. All of these data-sets are publicly available at the UCI repository (https://archive.ics.uci.edu/ml/index.php).

6.1.1. Online news popularity. This data-set summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).

6.1.2. Buzz in social media. This data-set contains examples of buzz events from two different social networks: Twitter, and Tom's Hardware, a forum network focusing on new technology with more conservative dynamics.

6.1.3. Adult. The Adult data-set contains census information about individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50K annually. The train and test sets are two separate files consisting of roughly 32,000 and 16,000 samples, respectively.

6.1.4. Epileptic Seizure Detection. The epileptic seizure detection data-set consists of EEG recordings of brain activity, each lasting 23.6 seconds and sampled into 4097 data points; in total there are 500 individuals, each contributing one such recording. Each data point is the value of the EEG recording at a particular point in time. The 4097 data points are then divided and shuffled into 23 segments, where each segment contains 178 data points corresponding to one second of recording.
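In the experiments below, the learned measure enters the classifiers only through the Gram matrix of the mixture kernel K_µ(x, x̂) = (1/N) Σ_k e^{−ξ_k ‖x − x̂‖²}, the Monte Carlo approximation of the Schönberg representation. The following sketch shows one way to plug such a kernel into an off-the-shelf SVM via a precomputed Gram matrix; the particles ξ_k are assumed to be the output of Algorithm 1, and the data here are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

def schonberg_kernel(X, Y, xis):
    # K_mu(x, y) = (1/N) sum_k exp(-xi_k * ||x - y||^2): a Monte Carlo
    # estimate of the Schoenberg integral with particles xi_k ~ mu.
    sq_dists = euclidean_distances(X, Y, squared=True)
    return np.mean([np.exp(-xi * sq_dists) for xi in xis], axis=0)

rng = np.random.default_rng(1)
xis = np.array([0.1, 0.5, 1.0, 2.0])      # hypothetical trained particles
X_train = rng.normal(size=(200, 35))      # d = 35 features, as in Section 6.2
y_train = np.sign(X_train[:, 0])          # synthetic +/-1 labels

clf = SVC(kernel="precomputed")
clf.fit(schonberg_kernel(X_train, X_train, xis), y_train)
y_pred = clf.predict(schonberg_kernel(X_train[:5], X_train, xis))
```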
6.2. Quantitative Comparison. In Figure 8, we present the training and test results for regression and classification tasks on the benchmark data-sets, using the top d = 35 features from each data-set and different numbers of random feature samples N. In all the experiments, the SGD method provides better accuracy in both the training and test phases. Nevertheless, in the case of seizure detection, we observe that for a small number of random feature samples, importance sampling and the Gaussian kernel with k-NN bandwidth selection outperform SGD.

In Figure 9, we illustrate the time consumed for training the kernel using the SGD and importance sampling methods versus the number of random features N. For linear regression on Buzz and Online news popularity, the differences between the run-times are negligible. However, for the classification tasks on Adult and Seizure, the differences in run-times are more pronounced.

These observations are also presented in tabular form in Table 2, where the training and test errors, as well as the algorithmic efficiencies (run-times), are reported for N = 100, N = 1000, N = 2000, and N = 3000 random features.

7. Conclusions

We have proposed and analyzed a distributionally robust optimization method to learn the distribution in the Schönberg integral representation of radial basis functions. In particular, we analyzed a projected particle Langevin dynamics to optimize the samples of the distribution in the integral representation of the radial kernels. We established theoretical performance guarantees for the proposed distributional optimization procedure. Specifically, we derived a non-asymptotic bound for the consistency of the finite sample approximations. Furthermore, using a mean-field analysis, we derived the scaling limits of a particle stochastic optimization method. We showed that in the scaling limits, the projected particle Langevin optimization converges to a reflected diffusion-drift process. We then derived a partial differential equation describing the evolution of the law of the underlying reflected process. We evaluated the performance of the proposed kernel learning approach for classification on benchmark data-sets. We also used the kernel learning approach in conjunction with a k-means clustering method to train the kernel in the kernel locally sensitive hash functions.

Table 1. Description of the benchmark data-sets used in this paper.

    Data-set                  Task            d    n_training   n_test
    Buzz                      Regression      77   93,800       46,200
    Online news popularity    Regression      58   26,561       13,083
    Adult                     Classification  122  32,561       16,281
    Seizure                   Classification  178  8,625        2,875

Figure 8. Training and test errors of Algorithm 1 (black lines), the SGD optimization (blue lines), the importance sampling (green lines), and a Gaussian kernel with the optimized bandwidth (red lines) for classification and linear regression tasks. The training and test errors are depicted with dashed and solid lines, respectively. Panel (a): Buzz, Panel (b): Online news popularity, Panel (c): Adult, Panel (d): Seizure. (Best viewed in color.)

Figure 9. The run-times of the kernel optimization algorithms using Algorithm 1, the SGD optimization, and importance sampling (IS). Panel (a): Buzz, Panel (b): Online news popularity, Panel (c): Adult, Panel (d): Seizure. (Best viewed in color.)

Appendix A. Proof of Main Results

Notations. We define the following notions of distance between two measures µ, ν ∈ M(X) on a metric space X:

• f-divergence:
    D_f(µ ‖ ν) = ∫_X f( dµ/dν ) dν.   (A.1)
Table 2. Performance comparison between the kernel learning method (Algorithm 1), the particle SGD [26], the importance sampling [48], and a regular Gaussian kernel in conjunction with kernel SVMs, for different numbers of random feature samples N on the benchmark data-sets.

    Buzz, Online news popularity: the corresponding entries are n/a.

    Adult                     N=100    N=1000   N=2000   N=3000
    Algorithm 1
      Training Error          0.221    0.201    0.199    0.199
      Test Error              0.219    0.207    0.205    0.205
      Run Time (sec)          0.155    0.645    1.426    1.802
    Particle SGD
      Training Error          0.240    0.204    0.199    0.197
      Test Error              0.236    0.209    0.208    0.210
      Run Time (sec)          0.274    1.784    9.272    11.295
    Importance Sampling
      Training Error          0.240    0.207    0.214    0.220
      Test Error              0.236    0.219    0.222    0.223
      Run Time (sec)          0.058    0.511    2.661    8.068
    Gaussian Kernel
      Training Error          0.240    0.206    0.201    0.197
      Test Error              0.236    0.222    0.216    0.214

    Seizure                   N=100    N=1000   N=2000   N=3000
    Algorithm 1
      Training Error          0.0814   0.0408   0.0364   0.0343
      Test Error              0.0873   0.0480   0.0449   0.0470
      Run Time (sec)          0.134    0.701    1.229    1.882
    Particle SGD
      Training Error          0.200    0.034    0.031    0.032
      Test Error              0.200    0.043    0.043    0.043
      Run Time (sec)          0.350    1.747    3.833    5.222
    Importance Sampling
      Training Error          0.135    0.078    0.055    0.051
      Test Error              0.144    0.099    0.063    0.056
      Run Time (sec)          0.038    0.138    0.352    0.404
    Gaussian Kernel
      Training Error          0.146    0.056    0.033    0.030
      Test Error              0.154    0.093    0.076    0.069

In the case of f(x) = x log x, the corresponding distance is the Kullback-Leibler divergence, which we denote by D_KL(· ‖ ·).

• Wasserstein distance:
    W_p(µ, ν) = inf_{π ∈ Π(µ,ν)} ( ∫_{X×X} d^p(x, x̃) π(dx, dx̃) )^{1/p},   (A.2)
where Π(µ, ν) is the set of all couplings of µ and ν, i.e., joint measures on X × X whose marginals are µ and ν. For p = 1, the Kantorovich-Rubinstein duality yields
    W_1(µ, ν) = sup_{f ∈ F_L} ∫_X f(x) ( dµ(x) − dν(x) ),   (A.3)
where F_L def= { f : X → IR : ‖f‖_Lip ≤ 1, f ∈ C(X) }, and C(X) is the class of continuous functions on X.

• Bounded Lipschitz metric:
    D_BL(µ, ν) = sup_{f ∈ F_BL} | ∫_X f(x) dµ(x) − ∫_X f(x) dν(x) |,   (A.4)
where F_BL def= { f : IR^d → IR : ‖f‖_Lip ≤ 1, ‖f‖_∞ ≤ 1 }.

We denote vectors by lower-case bold letters, e.g., x = (x_1, ..., x_n) ∈ IR^n, and matrices by upper-case bold letters, e.g., M = [M_ij] ∈ IR^{n×m}. For a real x ∈ IR, ⌊x⌋ stands for the largest integer not exceeding x. Let IB_r(x) def= { y ∈ IR^d : ‖y − x‖_2 ≤ r } denote the Euclidean ball of radius r centered at x. For a subset S ⊂ IR^d of the Euclidean space, we define its closure S̄ def= { s : s = lim_{n→∞} x_n, x_n ∈ S, ∀n ∈ IN }, and its boundary ∂S def= S̄ \ S. Given a random variable x, we denote its law by P_x, i.e., P_x law= x.
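For one-dimensional empirical measures, the infimum in Eq. (A.2) is attained by the monotone coupling and reduces to a comparison of order statistics, so W_1 can be evaluated exactly; a quick numerical sketch using SciPy (the Gaussian samples are illustrative).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=5000)  # samples from mu
nu_samples = rng.normal(loc=0.5, scale=1.0, size=5000)  # samples from nu

# scipy computes the 1-D W_1 distance from the empirical quantile functions;
# for two Gaussians of equal variance the exact value is the mean shift, 0.5.
print(wasserstein_distance(mu_samples, nu_samples))
```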
We use W^{k,p}(Ω) for 1 ≤ p ≤ ∞ to denote the Sobolev space of functions f ∈ L^p(Ω) whose weak derivatives up to order k are in L^p(Ω). The Sobolev space W^{1,2}(Ω) is denoted by H^1(Ω), and H_0^1(Ω) is the space of functions in H^1(Ω) that vanish at the boundary. The dual space of H_0^1(Ω) is denoted by H^{−1}(Ω).

To establish the concentration results in this paper, we require the following two definitions:

Definition A.1. (Sub-Gaussian Norm) The sub-Gaussian norm of a random variable Z, denoted by ‖Z‖_{ψ_2}, is defined as
    ‖Z‖_{ψ_2} = sup_{q ≥ 1} q^{−1/2} ( IE|Z|^q )^{1/q}.   (A.5)
For a random vector Z ∈ IR^n, its sub-Gaussian norm is defined as
    ‖Z‖_{ψ_2} = sup_{x ∈ S^{n−1}} ‖⟨x, Z⟩‖_{ψ_2}.   (A.6)

Definition A.2. (Sub-Exponential Norm) The sub-exponential norm of a random variable Z, denoted by ‖Z‖_{ψ_1}, is defined as
    ‖Z‖_{ψ_1} = sup_{q ≥ 1} q^{−1} ( IE[|Z|^q] )^{1/q}.   (A.7)
For a random vector Z ∈ IR^n, its sub-exponential norm is defined as
    ‖Z‖_{ψ_1} = sup_{x ∈ S^{n−1}} ‖⟨Z, x⟩‖_{ψ_1}.   (A.8)
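The Orlicz norms of Definitions A.1 and A.2 can be approximated numerically by truncating the supremum over q and replacing expectations with empirical means; a rough sketch (the truncation level q_max = 10 is arbitrary, and high empirical moments are noisy).

```python
import numpy as np

def subgaussian_norm(samples, q_max=10):
    # ||Z||_{psi_2} = sup_{q >= 1} q^{-1/2} (E|Z|^q)^{1/q}, cf. Eq. (A.5)
    qs = np.arange(1, q_max + 1)
    moments = [np.mean(np.abs(samples) ** q) ** (1.0 / q) for q in qs]
    return max(m / np.sqrt(q) for q, m in zip(qs, moments))

def subexponential_norm(samples, q_max=10):
    # ||Z||_{psi_1} = sup_{q >= 1} q^{-1} (E|Z|^q)^{1/q}, cf. Eq. (A.7)
    qs = np.arange(1, q_max + 1)
    moments = [np.mean(np.abs(samples) ** q) ** (1.0 / q) for q in qs]
    return max(m / q for q, m in zip(qs, moments))

z = np.random.default_rng(0).normal(size=100_000)
print(subgaussian_norm(z))        # O(1) for a standard Gaussian
print(subexponential_norm(z**2))  # the square of a Gaussian is sub-exponential
```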
A.1. Proof of Theorem 3.1. We begin the proof by recalling the definitions of the empirical and population kernel alignments from the main text,
    Ê(µ) def= (2/(n(n−1))) Σ_{1≤i<j≤n} y_i y_j K_µ(x_i, x_j),    E(µ) def= IE_{P⊗x,y}[ y ŷ K_µ(x, x̂) ],
as well as the regularized objective E_γ(µ) def= E(µ) − (1/γ) IE_{P⊗x}[ K_µ(x, x̂) ].

Lemma A.3. (Consistency with Respect to the Training Samples) Suppose Assumptions (A.1)-(A.4) hold. Consider the distribution ball P = { µ ∈ M_+(Ξ) : W_1(µ, µ_0) ≤ R }, where µ_0 ∈ M_+(Ξ) is an arbitrary distribution. Then,
    sup_{µ ∈ P} | E(µ) − Ê(µ) | ≤ max{ c_0 (K/√n) √(ln(2/ρ)), c_1 (RK/n) ln( e√K / ρ ) },   (A.12)
with probability at least 1 − ρ over the draw of the training data (x_i, y_i)_{1≤i≤n} ∼ i.i.d. P_{x,y}, where c_0, c_1 > 0 are absolute constants.

The proof of Lemma A.3 is provided in Section A.4. Lemma A.3 asserts that as the number of training samples n tends to infinity, the error due to the finite-sample approximation becomes negligible. In the next lemma, we provide a similar consistency-type result for the sample-average approximation with respect to the Langevin particles:

Lemma A.4. (Consistency with Respect to the Langevin Particles) Suppose Assumptions (A.1)-(A.4) hold. Consider the distributional ball P in Lemma A.3, and consider its empirical approximation P^N = { µ̂^N : W_1(µ̂^N, µ̂_0^N) ≤ R }. Then,
    | sup_{µ ∈ P} Ê(µ) − sup_{µ̂^N ∈ P^N} Ê(µ̂^N) | ≤ ( c_2 (ξ_u − ξ_l) R / √N ) √( log( √N (ξ_u − ξ_l) / ρ ) ),   (A.13)
with probability at least 1 − ρ over the draw of the initial particles ξ_0^1, ..., ξ_0^N ∼ i.i.d. µ_0, where c_2 > 0 is an absolute constant.

The proof of Lemma A.4 is presented in Section A.5. In the next lemma, we quantify the error incurred by including the regularization γ in the risk function:

Lemma A.5. (Regularization Error) Suppose Assumptions (A.1)-(A.4) hold. Consider the empirical distribution ball P^N of Lemma A.4, and define
    µ̂^N_*(γ) def= arg sup_{µ̂^N ∈ P^N} E_γ(µ̂^N).   (A.14)
Then, for any γ > 0, the following upper bound holds
    | sup_{µ̂^N ∈ P^N} E(µ̂^N) − E(µ̂^N_*(γ)) | ≤ 1/γ.   (A.15)

The proof of Lemma A.5 is presented in Section A.6. The last lemma of this part concerns the approximation error due to using the Sinkhorn divergence in lieu of the Wasserstein distance:

Lemma A.6. (Sinkhorn Divergence Approximation Error) Suppose Assumptions (A.1)-(A.4) hold. Let µ̂^N_*(γ) denote the empirical measure of Lemma A.5. Furthermore, consider the distribution ball P^N_ε = { µ̂^N : W_{1,ε}(µ̂^N, µ̂_0^N) ≤ R }, and define
    µ̂^N_*(γ, ε) def= arg sup_{µ̂^N ∈ P^N_ε} E_γ(µ̂^N).   (A.16)
Then, for any ε > 0, the following upper bound holds
    | E(µ̂^N_*(γ)) − E(µ̂^N_*(γ, ε)) | ≤ c e^{−1/ε} + 2/γ,   (A.17)
where c > 0 is a universal constant independent of ε.

The proof of Lemma A.6 is presented in Section A.7. Equipped with Lemmas A.3-A.6, we are now in a position to prove the main inequality of Theorem 3.1. We use the triangle inequality to decompose the error into five components:
    | E(µ^*) − E(µ̂^N_*(γ, ε)) | ≤ E_1 + E_2 + E_3 + E_4 + E_5,   (A.18)
where the error terms E_i, i = 1, ..., 5, are defined as
    E_1 = | E(µ^*) − sup_{µ ∈ P} Ê(µ) |,
    E_2 = | sup_{µ ∈ P} Ê(µ) − sup_{µ̂^N ∈ P^N} Ê(µ̂^N) |,
    E_3 = | sup_{µ̂^N ∈ P^N} Ê(µ̂^N) − E(µ̂^N_*) |,
    E_4 = | E(µ̂^N_*) − E(µ̂^N_*(γ)) |,
    E_5 = | E(µ̂^N_*(γ)) − E(µ̂^N_*(γ, ε)) |.
The error terms E_1 and E_3 can be bounded using Lemma A.3, E_2 using Lemma A.4, E_4 using Lemma A.5, and E_5 using Lemma A.6.

A.2. Proof of Theorem 3.2. Consider the projected particle Langevin dynamics in Eq. (2.14), which we repeat here for the convenience of the reader:
    ξ_m = P_{Ξ^N}( ξ_{m−1} − η ∇Ĵ_N(ξ_{m−1}; z_m, z̃_m) + √(2η/β) ζ_m ),   (A.19)
for all m ∈ [0, Tη^{−1}] ∩ IN, where ζ_m ∼ N(0, I_{N×N}). Associated with the discrete-time process (ξ_m)_{m ∈ IN}, we define the continuous-time càdlàg process (ξ̄_t)_{0≤t≤T} that is constant on each interval, i.e., ξ̄_t = ξ̄_{ηm} for all t ∈ [ηm, η(m+1)), and satisfies the following recursion:
    ξ̄_{ηm} = P_{Ξ^N}( ξ̄_{η(m−1)} − η ∇Ĵ_N(ξ̄_{η(m−1)}; z_m, z̃_m) + √(2/β) ζ_{ηm} ),   (A.20)
for all m ∈ [0, Tη^{−1}] ∩ IN, where ζ_{ηm} = W_{ηm} − W_{η(m−1)}, and (W_t) is an F_t-adapted Wiener process with initial value W_0 = 0 and independent increments
    IE[ e^{i⟨x, W_t − W_s⟩} | F_s ] = e^{−(t−s)‖x‖_2²/2},  ∀x ∈ IR^N.   (A.21)
Therefore, comparing the dynamics in Eqs. (A.19) and (A.20), we observe that ξ̄_{ηm} = ξ_m for all m ∈ [0, Tη^{−1}] ∩ IN.

We compare the iterations in Eq. (A.19) with the following decoupled N-dimensional reflected Itô stochastic differential equation:
    θ_t = θ_0 − ∫_0^t ∇J(θ_s; µ_s) ds + √(2/β) W_t + ∫_0^t n_s L(ds),  0 ≤ t ≤ T,   (A.22a)
    θ_t ∈ Ξ^N,  dL(s) ≥ 0,  ∫_0^t 1_{Ξ̄^N \ ∂Ξ^N}(θ_s) L(ds) = 0,   (A.22b)
where
• θ_0 = (θ_0^1, ..., θ_0^N) ∼ µ_0^{⊗N},
• µ_s^{⊗N} = P_{θ_s}, with P_{θ_s} law= θ_s,
• J(θ_s; µ_s) is the drift process defined as
    J(θ_s; µ_s) def= (1/N) Σ_{k=1}^N IE[ y ŷ e^{−θ_s^k ‖x−x̂‖_2²} ] + (1/(γN)) Σ_{k=1}^N ∫_0^∞ IE[ e^{−(θ_s^k + θ) ‖x−x̂‖_2²} ] µ_s(dθ)
(A.23)Moreover, ∇ J ( θ s ; µ s ) = ( ∇ k J ( θ s ; µ s )) ≤ k ≤ N is the gradient vector with the followingelements ∇ k J ( θ s ; µ s ) def = ∂∂θ ks J ( θ s ; µ s )= 1 N ∂∂θ ks IE (cid:104) y (cid:98) ye − θ ks (cid:107) x − (cid:98) x (cid:107) (cid:105) + 1 γN ∂∂θ ks (cid:90) ∞ IE (cid:104) e − ( θ ks + θ ) (cid:107) x − (cid:98) x (cid:107) (cid:105) µ s (d θ ) , (A.24) • L is the local time of θ t at the boundary of the projection space Ξ N . In particular, L isa measure on [0 , T ] that is non-negative, non-decreasing, and whose support is definedsupp( L ) ⊆ { t ≥ θ t ∈ ∂ Ξ N } , (A.25) • and, n t is the normal vector to the boundary ∂ Ξ N at the time t ∈ [0 , T ].Let us make some remarks about the stochastic differential equation in Eq. (A.22). First,let Z t = (cid:82) t n s L (d s ). Then, due to Skorokhod [51] and Tanka’s [55] theorems, it is knownthat the processes θ t and Z t are uniquely defined. The uniqueness property is also referredto as the Skorohod’s reflection mapping principle in stochastic calculus (cf. Definition A.18).Second, notice that at each given time t ∈ [0 , T ], the drift of the k th coordinate ∇ k J ( θ s , µ s )only depends on the location of the k th particle ( θ ks ) ≤ s ≤ t , and is independent of otherparticles ( θ js ) ≤ s ≤ t , j (cid:54) = k . xxx ow, consider the following c`adl`ag process (¯ θ t ) ≤ t ≤ T , where ¯ θ t = ¯ θ mη , ∀ t ∈ [ mη, ( m +1) η ), and ¯ θ ηm = P Ξ N (cid:18) ¯ θ η ( m − − η ∇ J (¯ θ η ( m − ; ρ η ( m − ) + (cid:114) β ζ ηm (cid:19) , (A.26)for all m ∈ (cid:2) , T η − (cid:3) ∩ IN, where ρ ⊗ Ns = P ¯ θ s , and ¯ θ = θ . We also consider the followingc`adl`ag process (˜ θ t ) ≤ t ≤ T that is constant on the interval ˜ θ t = ˜ θ mη , ∀ t ∈ [ mη, ( m + 1) η ), andhas the following recursions˜ θ ηm = P Ξ N (cid:18) ˜ θ η ( m − − η ∇ J (˜ θ η ( m − ; ν η ( m − ) + (cid:114) β ˜ ζ ηm (cid:19) , (A.27)for all m ∈ (cid:2) , T η − (cid:3) ∩ IN, where ν ⊗ Ns = P ˜ θ s , ˜ θ ηm = θ , and ˜ ζ ηm def = ˜ W ηm − ˜ W η ( m − is theWiener process with a drift. In particular,˜ W t = W t + (cid:114) β (cid:90) t ( ∇ J (˜ θ s , ν s ) − ∇ J ( θ s , µ s ))d s. (A.28)We notice that the processes (¯ θ t ) ≤ t ≤ T and (˜ θ t ) ≤ t ≤ T in Eqs. (A.26) and (A.27), respec-tively, only differ in the definition of the Wiener process.In the sequel, we establish three propositions to derive bounds on the Wasserstein dis-tances between the laws of c´adl´ag processes in Eqs. (A.20),(A.22),(A.26), and (A.27): Proposition A.1. (Wasserstein Distance between the Laws of C´adl´ag Pro-cesses) Consider the empirical measure µ Nt def = (cid:98) µ N (cid:98) tη (cid:99) def = N (cid:80) Nk =1 δ ξ k (cid:98) tη (cid:99) ( ξ ) associated withthe Langevin particles in Eq. (A.20) . Furthermore, let ρ ⊗ Nη (cid:98) tη (cid:99) = P ¯ θ η (cid:98) tη (cid:99) denotes the law ofthe random variable ¯ θ η (cid:98) tη (cid:99) defined by the recursion in Equation (A.20) . Then, the followingupper bound holds sup ≤ t ≤ T W (cid:16) µ Nt , ρ η (cid:98) tη (cid:99) (cid:17) ≤ ξ u − ξ l ) N (1 + γ − ) K (cid:115) ηT log (cid:18) Tηρ (cid:19) exp (cid:18) K (1 + γ − ) T √ N (cid:19) , (A.29) with the probability of (at least) − ρ . The proof of Proposition A.1 is presented in Appendix A.8. Proposition A.2. (Change of Measure) Consider the laws ρ ⊗ Nη (cid:98) tη (cid:99) = P ¯ θ η (cid:98) tη (cid:99) and ν ⊗ Nη (cid:98) tη (cid:99) = P ˜ θ η (cid:98) tη (cid:99) , where ¯ θ η (cid:98) tη (cid:99) and ˜ θ η (cid:98) tη (cid:99) evolve according to the recursions in Eqs. 
(A.26) and (A.27) ,respectively. Then, sup ≤ t ≤ T W (cid:16) ρ η (cid:98) tη (cid:99) , ν η (cid:98) tη (cid:99) (cid:17) ≤ β ( ξ u − ξ l ) K (1 + γ − ) N (cid:32)(cid:90) η (cid:98) Tη (cid:99) IE (cid:104) (cid:107) ˜ θ s − θ s (cid:107) (cid:105) d s (cid:33) . The proof of Proposition A.2 is presented in Appendix A.10. Proposition A.3. (Time Discreteization Error) Consider the laws µ ⊗ Ns = P θ s and ν ⊗ Nη (cid:98) tη (cid:99) = P ˜ θ η (cid:98) tη (cid:99) , where θ t and ˜ θ η (cid:98) tη (cid:99) are governed by the dynamics of Eqs. (A.26) and xxxi A.27) , respectively. Then, the following inequality holds. sup ≤ t ≤ T W (cid:16) ν η (cid:98) tη (cid:99) , µ t (cid:17) ≤ IE (cid:20) sup ≤ t ≤ T (cid:107) θ s − ˜ θ s (cid:107) (cid:21) , (A.30) Moreover, for any p ∈ IN , the following maximal inequality holds IE (cid:20) sup ≤ t ≤ T (cid:107) θ s − ˜ θ s (cid:107) p (cid:21) ≤ (cid:112) B p (cid:18) p − β p A p + 2 p − C p (cid:19) e p − T D p √ B p . (A.31) where the constants A p , B p , C p and D p are defined as follows A p def = T (cid:18) p p − (cid:19) p p − η p − N Γ (cid:16) N +2 p (cid:17) Γ (cid:0) N +22 (cid:1) , (A.32a) B p def = c p (cid:0) dist( θ , ∂ Ξ N ) (cid:1) − p (cid:18)(cid:18) β (cid:19) p N p T p + K p N p T p (cid:19) , (A.32b) C p def = 2 p − η K p N p (1 + γ − ) p , (A.32c) D p def = (cid:18) K p (1 + γ − ) p N p log (cid:18) p K p T (1 + γ − ) p N p (cid:19) + 2 p − K p N p (cid:19) . (A.32d) Above, c p > is a constant independent of N and T , and dist( θ , ∂ Ξ N ) = min θ ∈ ∂ Ξ N (cid:107) θ − θ (cid:107) is the distance of the initial value from the boundary. From the triangle inequality of the 2-Wasserstein distance, we obtain thatsup ≤ t ≤ T W (cid:16)(cid:98) µ Nη (cid:98) tη (cid:99) , µ t (cid:17) ≤ sup ≤ t ≤ T W (cid:16)(cid:98) µ Nη (cid:98) tη (cid:99) , ρ η (cid:98) tη (cid:99) (cid:17) + sup ≤ t ≤ T W (cid:16) ρ η (cid:98) tη (cid:99) , ν η (cid:98) tη (cid:99) (cid:17) + sup ≤ t ≤ T W (cid:16) ν η (cid:98) tη (cid:99) , µ t (cid:17) . (A.33)We now leverage the upper bounds in Propositions A.1, A.2, and A.3 to bound each termon the right hand side of Eq. (A.33). Taking the limit N → ∞ and choosing the prescribedstep-size η = η N in Theorem 3.2 yieldslim N →∞ sup ≤ t ≤ T W (cid:16)(cid:98) µ Nη (cid:98) tη (cid:99) , µ t (cid:17) = 0 . (A.34)For any pseudo-Lipschitz function ψ : IR + → IR and empirical measures { µ n } n ∈ IN , µ n ∈M + (IR + ), it holds that (cid:82) ψµ n → (cid:82) ψµ ∗ as n → ∞ if and only if W ( µ n , µ ) → 0, see [56,Thm. 6.9]. Therefore, due to Eq. (A.34), we conclude that (cid:98) µ N (cid:98) tη (cid:99) weakly → µ t uniformly on theinterval 0 ≤ t ≤ T .In the sequel, we characterize a distributional dynamics describing the evolution of theLebesgue density of the limiting measure µ t : Proposition A.4. (McKean-Vlaso Mean-Field Equation) Let (Ω , F , F t , IP) denotesa probability space, and consider the F t -adapted reflected diffusion-drift process X : Ω × [0 , T ] (cid:55)→ [ ξ l , ξ u ] described as follows X t = X + (cid:90) t g ( X s ; µ s )d s + 1 β W t − Z + t + Z − t , (A.35a) θ ∼ µ , supp( µ ) ⊂ Ξ = [ ξ l , ξ u ] , , ≤ t ≤ T, (A.35b) xxxii here µ s = P X s is the law of the underlying process, and Z + t and Z − t are the reflectionprocesses from the boundaries ξ u and ξ l , respectively. In particular, Z +0 = Z − = 0 , non-decreasing, c´adl´ag, and (cid:90) ∞ ( X t − ξ l )d Z − t = 0 , (cid:90) ∞ ( ξ u − X t )d Z + t = 0 . (A.36) Suppose the Lebesgue density q ( ξ ) = d µ d ξ exists. 
Then, the Lebesgue density of the law ofthe stochastic process ( p t ( ξ ) = d µ t / d ξ ) ≤ t at subsequent times is governed by the followingone dimensional partial differential equation with Robin boundary conditions ∂p ( t, ξ ) ∂t = − ∂∂ξ ( p t ( ξ ) g ( ξ, p t ( ξ ))) + 1 β ∂ ∂ξ p t ( ξ ) , ∀ t ∈ [0 , T ] , ∀ ξ ∈ ( ξ l , ξ u )(A.37a) ∂p t ( ξ ) ∂ξ + βp t ( ξ ) g ( ξ, µ t ) (cid:12)(cid:12)(cid:12) ξ = ξ l = 0 , ∂p t ( ξ ) ∂ξ + βp t ( ξ ) g ( ξ, µ t ) (cid:12)(cid:12)(cid:12) ξ = ξ u = 0 , ∀ t ∈ [0 , T ](A.37b) p ( ξ ) = q ( ξ ) , ∀ ξ ∈ Ξ . (A.37c)The proof of Theorem A.4 is a special case of the proof provided by Harrison and Reiman[20] for general multi-dimensional diffusion-drift processes that solve the Skorokhod problemfor a general reflection matrix (cf. Definition A.18). More specifically, the proof of [20] isbased on an extension of Itˆo’s formula proved by Kunita and Watanabe [27], and Grisanov’schange of measure technique in Theorem A.16.To finish the proof, we apply the result of Proposition A.4 to each coordinate of thestochastic process in Eqs. (A.22). (cid:4) A.3. Proof of Proposition A.3. We establish the existence and uniqueness via the stan-dard technique of Lax-Milgram theorem in the theory of elliptic PDEs, see [15, Chapters5,6], and [4, 5]. In particular, we closely follow the work of [4], and provide the proof for aslightly more general case, namely we consider the Poisson equationdiv F ( ξ ) = g ( ξ ) , F ( ξ ) def = 1 β ∇ p ∗ ( ξ ) + p ∗ ( ξ ) ∇ V ( ξ ) , ∀ ξ ∈ Ω , (A.38a) (cid:104) F ( ξ ) , n (cid:105) = 0 , ξ ∈ ∂ Ω . (A.38b)where n is the outward normal vectors at the boundary ∂ Ω. Furthermore, we supposethe potential satisfies V ∈ W , ∞ (Ω) for every t ∈ [0 , T ]. In the steady-state regime ( i.e. when ∂ t p t = 0), the mean-field PDE in (3.3) is a special case of the general form describedin Eq. (A.38) with Ω = Ξ ⊂ IR + , ∂ Ω = { ξ l , ξ u } , g ( ξ ) = 0 for all ξ ∈ Ω, and V ( ξ ) = J ( ξ, p ∗ ( ξ )), where p ∗ ( ξ ) is a steady-state solution of the mean-field PDE in (3.3). Equation(A.38) describes a probability in the space of probability measures in M + (Ω), and is to beinterpreted in the weak sense. The main difficulty arises from assigning boundary valuesalong ∂ Ω to a function p ∗ ∈ W ,p (Ω) as it is not in general continuous. The following tracetheorem attempts to address this issue: Definition A.7. ( Trace Theorem, see, e.g. , [15, Thm. 1]) Suppose Ω is bounded and ∂ Ω is in C . Then, there exists a bounded linear operator T : W ,p (Ω) → L p ( ∂ Ω) such that • T p ∗ = p ∗ | ∂ Ω if p ∗ ∈ W ,p (Ω) ∩ C (Ω) • (cid:107)T p ∗ (cid:107) L p ( ∂ Ω) ≤ c p, Ω (cid:107) p ∗ (cid:107) W ,p (Ω) ,for each p ∗ ∈ W ,p (Ω), with the constant c p, Ω depending only on p and Ω. Furthermore, T p ∗ is called the trace of p ∗ on ∂ Ω. xxxiii quipped with Definition A.7, we can now provide the form of weak solutions of the PDEin Eq. (A.38): Definition A.8. ( Weak Solutions of the Steady-State Poisson Equation ) Wesay that a function p ∗ ∈ H (Ω) is a weak solution to equation (A.38) supplemented withthe boundary condition if for all the test functions ψ ∈ H (Ω) the following identity holds1 β (cid:90) Ω (cid:104)∇ p ∗ ( ξ ) , ∇ ψ ( ξ ) (cid:105) d ξ + (cid:90) Ω (cid:104)∇ V ( ξ ) , ∇ ψ ( ξ ) (cid:105) p ∗ ( ξ )d ξ − (cid:90) ∂ Ω (cid:104)∇ V ( σ ) , n (cid:105)T p ∗ ( σ ) T ψ ( σ )d σ = (cid:104) g, ψ (cid:105) L (Ω) . 
(A.39)almost everywhere in time t ∈ [0 , T ].To establish the uniqueness of the steady-state solution, we reformulate the weak solutionin terms of the Slotboom variable ρ ∗ ( ξ ) = p ∗ ( ξ ) e βV ( ξ ) . The transformed flux is given by F ( ξ ) = 1 β e − βV ( ξ ) ∇ ρ ∗ ( ξ ). Then, the weak formulation in Eq. (A.39) can be rewritten asfollows1 β (cid:90) Ω e − βV ( ξ ) (cid:104)∇ ρ ∗ ( ξ ) , ∇ ψ ( ξ ) (cid:105) d ξ − (cid:90) ∂ Ω (cid:104)∇ V ( σ ) , n (cid:105)T ρ ∗ ( σ ) − βV ( σ ) T ψ d σ = (cid:104) g, ψ (cid:105) L (Ω) . (A.40)Define the following bilinear form B : H (Ω) × H (Ω) → IR , (A.41) ( φ, ψ ) (cid:55)→ B [ φ, ψ ] def = 1 β (cid:90) Ω e − βV ( ξ ) (cid:104)∇ φ, ∇ ψ (cid:105) d ξ − (cid:90) ∂ Ω (cid:104)∇ V, n (cid:105)T φe − βV T ψ d σ (A.42)The bilinear form B is continuous on H (Ω) due to the Cauchy-Schwarz inequality and thefact that the Sobolev norm is controlled by the L (Ω)-norm of the gradient. Moreover, thebilinear form is non-coercive. A weak solution ρ ∗ ∈ H (Ω) for Eq. (A.40) satisfies the following condition B [ ρ ∗ , ψ ] = (cid:104) g, ψ (cid:105) L (Ω) , ∀ ψ ∈ H (Ω) . (A.43)We claim B [ φ, ψ ] satisfies the hypotheses of Lax-Milgram on H (Ω). In particular, we mustshow that there exist constants c , c , c > ψ, φ ∈ H (Ω) (see [15, 6.2.2 Theorem 2])(Boundedness): | B [ φ, ψ ] | ≤ c (cid:107) φ (cid:107) H (Ω) (cid:107) ψ (cid:107) H (Ω) , (A.44a) (G˚arding Inequality): B [ φ, φ ] ≥ c (cid:107) φ (cid:107) H (Ω) − c (cid:107) φ (cid:107) L (Ω) . (A.44b)The first condition follows by boundedness of the trace operator T in Definition A.7 as wellas boundedness of V ( ξ ) and its gradient. In particular, | B [ φ, ψ ] | ≤ β c V (cid:107) φ (cid:107) H (Ω) (cid:107) ψ (cid:107) H (Ω) + c (cid:48) V (cid:13)(cid:13) T φe − βV (cid:13)(cid:13) L ( ∂ Ω) (cid:107)T ψ (cid:107) L ( ∂ Ω) ≤ β c V (cid:107) φ (cid:107) H (Ω) (cid:107) ψ (cid:107) H (Ω) + c , Ω c (cid:48) V c V (cid:107) φ (cid:107) H (Ω) (cid:107) ψ (cid:107) H (Ω) , (A.45)where c V def = sup ξ ∈ Ω e − βV ( ξ ) and c (cid:48) V = sup ξ ∈ Ω |(cid:104)∇ V ( ξ ) , n (cid:105)| .To establish the G˚arding Inequality, we invoke the following variation of Sobolev typeinequality for boundary value problems: A bilinear form B : H (Ω) × H (Ω) → IR is coercive if B [ ψ, ψ ] ≥ c (cid:107) ψ (cid:107) H (Ω) for some constant c > xxxiv heorem A.9. (Sobolev Type Inequality for Bounded Domains, [44, p. 4] ) Fora bounded domain Ω ⊂ IR d and for all the functions φ ∈ H (Ω) , there exists a constant K Ω > depending on Ω only, such that the following inequality holds (cid:107) φ (cid:107) H (Ω) ≤ K Ω (cid:16) (cid:107)∇ φ (cid:107) H (Ω) + (cid:107) φ (cid:107) L (Ω) (cid:17) . (A.46)Now, for the bilinear form in Eq. (A.41) we obtain the following lower bound B [ ψ, ψ ] = 1 β (cid:90) Ω e − βV ( ξ ) (cid:107)∇ ψ (cid:107) d ξ − (cid:90) ∂ Ω (cid:104)∇ V, n (cid:105)T ψe − βV T ψ d σ ≥ β ˜ c V (cid:90) Ω (cid:107)∇ ψ (cid:107) d ξ − c V c (cid:48) V (cid:90) ∂ Ω |T ψ | d σ = 1 β ˜ c V (cid:107)∇ ψ (cid:107) H (Ω) − c V c (cid:48) V (cid:107) ψ (cid:107) L (Ω) = 1 βK Ω ˜ c V (cid:107) ψ (cid:107) H (Ω) − (cid:18) βK Ω ˜ c V + c V c (cid:48) V (cid:19) (cid:107) ψ (cid:107) L (Ω) , where ˜ c V def = inf ξ ∈ Ω e − βV ( ξ ) , and the last inequality follows from Inequality (A.46) in Theo-rem A.9. Having verified the criteria described in Eqs. (A.44), the existence and uniquenessfollows by invoking the Lax-Milgram theorem.A.4. Proof of Lemma A.3. 
We begin the proof by writing the following basic inequalities (cid:12)(cid:12)(cid:12) sup µ ∈P E ( µ ) − sup µ ∈P (cid:98) E ( µ ) (cid:12)(cid:12)(cid:12) ≤ sup µ ∈P (cid:12)(cid:12) E ( µ ) − (cid:98) E ( µ ) (cid:12)(cid:12) = sup µ ∈P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n ( n − (cid:88) i (cid:54) = j y i y j K µ ( x i , x j ) − IE P ⊗ y, x [ y (cid:98) yK µ ( x , (cid:98) x )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 4 sup µ ∈P (cid:12)(cid:12) IE µ [ e n ( ξ )] (cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) sup µ ∈P IE µ [ e n ( ξ )] (cid:12)(cid:12)(cid:12)(cid:12) , where the error term is defined as follows e n ( ξ ) def = 2 n ( n − (cid:88) ≤ i IR is defined asfollows M αf ( y ) def = inf x ∈X (cid:26) α (cid:107) x − y (cid:107) + f ( x ) (cid:27) , ∀ y ∈ X , (A.50)where α > αf ( y ) def = arg inf x ∈X (cid:26) α (cid:107) x − y (cid:107) + f ( x ) (cid:27) , ∀ y ∈ X . (A.51)When the function f is differentiable, the following lemma is established in [26]: Lemma A.11. (Moreau’s envelope of Differentiable Functions) Suppose thefunction f : X → IR is differentiable. Then, the Moreau’s envelope defined in Eq. (A.50) has the following upper bound and lower bounds f ( y ) − α (cid:90) sup x ∈X (cid:107)∇ f ( y + s ( x − y )) (cid:107) d s ≤ M αf ( y ) ≤ f ( y ) . (A.52) In particular, when f is L f -Lipschitz, we have f ( y ) − αL f ≤ M αf ( y ) ≤ f ( y ) . (A.53)We now require the following result due to [43]: Proposition A.5. (Basic Properties of Moreau’s envelope, [43] ) Let f : IR → IR bea lower semi-continuous, proper and convex. The following statements hold for any α > : • The proximal operator prox αf ( x ) is unique and continuous, in the sense that prox α (cid:48) f ( x (cid:48) ) → prox αf ( x ) whenever ( x (cid:48) , α (cid:48) ) → ( x, α ) with α > . • The value of M α ( x ) is finite and depends continuously on ( α, x ) , with M α ( x ) → f ( x ) forall x ∈ IR as α → + . • The Moreau envelope function is differentiable with respect to x and the regularizationparameter α . Specifically, for all x ∈ IR , the following properties are true: dd x M αf ( x ) = 1 α ( x − prox αf ( x )) , (A.54a) dd α M αf ( x ) = − α ( x − prox αf ( x )) . (A.54b) xxxvi ow, we return to Equation (A.49). We leverage the lower bound on Moreau’s envelopein Eq. (A.52) of Lemma A.11 as follows (cid:12)(cid:12)(cid:12) sup µ ∈P E ( µ ) − sup µ ∈P (cid:98) E ( µ ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) min λ ≥ (cid:26) λR − (cid:90) IR D M λ − e n ( ξ ) µ (d ξ ) (cid:27)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) min λ ≥ (cid:40) λR + IE µ [ e n ( ξ )] + 14 λ IE µ (cid:34)(cid:90) sup ζ ∈ IR D (cid:107)∇ e n ((1 − s ) ξ + sζ ) (cid:107) d s (cid:35)(cid:41)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | IE µ [ e n ( ξ )] | + 4 R IE µ (cid:34)(cid:90) sup ζ ∈ IR D (cid:107)∇ e n ((1 − s ) ξ + sζ ) (cid:107) d s (cid:35) . (A.55)Let ζ ∗ = ζ ∗ ( ξ, s ) = arg sup ζ ∈ IR D (cid:107)∇ e n ((1 − s ) ξ + sζ ) (cid:107) . Then, applying the union bound inconjunction with Inequality (A.55) yieldsIP (cid:32)(cid:12)(cid:12)(cid:12) sup µ ∈P E ( µ ) − sup µ ∈P (cid:98) E ( µ ) (cid:12)(cid:12)(cid:12) ≥ δ (cid:33) ≤ IP (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:90) IR D e n ( ξ ) µ (d ξ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ δ (cid:33) + IP (cid:32) (cid:90) IR D (cid:90) (cid:107)∇ e n ((1 − s ) ξ + sζ ∗ ) (cid:107) d sµ (d ξ ) ≥ δ R (cid:33) . (A.56)Now, we state the following lemma: Lemma A.12. 
(Tail Bounds for the Finite Sample Estimation Error) Considerthe estimation error e n defined in Eq. (A.47) . Then, the following statements hold: • Z = (cid:107)∇ e n ( ξ ) (cid:107) is a sub-exponential random variable with the Orlicz norm of (cid:107) Z (cid:107) ψ ≤ √ K /n for every ξ ∈ IR + . Moreover, IP (cid:32) (cid:90) IR + (cid:90) (cid:107)∇ e n ((1 − s ) ξ + sζ ∗ ) (cid:107) d sµ (d ξ ) ≥ δ (cid:33) ≤ e − n δ √ K + √ K , (A.57) • e n ( ξ ) is a zero-mean sub-Gaussian random variable with the Orlicz norm of (cid:107) e n ( ξ ) (cid:107) ψ ≤ √ K n for every ξ ∈ IR + . Moreover, IP (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) Ξ e n ( ξ ) µ (d ξ ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ δ (cid:19) ≥ e − n δ √ K . (A.58)The proof of Lemma A.12 is deferred to Appendix B.1.Now, we leverage the concentration bounds (A.57) and (A.58) of Lemma A.12 to upperbound the terms on the right hand side of Eq. (A.56) as belowIP (cid:32)(cid:12)(cid:12)(cid:12) sup µ ∈P E ( µ ) − sup µ ∈P E ( µ ) (cid:12)(cid:12)(cid:12) ≥ δ (cid:33) ≤ e − n δ √ × × K + 2 e − n δ √ × × RL + √ K ≤ (cid:26) e − n δ √ × × K , e − n δ √ × × RK + √ K (cid:27) , (A.59)where the last inequality comes from a + b ≤ { a, b } . Therefore, with the probabilityof (at least) 1 − (cid:37) , we have that (cid:12)(cid:12)(cid:12) sup µ ∈P E ( µ ) − sup µ ∈P E ( µ ) (cid:12)(cid:12)(cid:12) ≤ max (cid:40) c K n (cid:115) ln (cid:18) (cid:37) (cid:19) , c RK n ln (cid:32) e √ K (cid:37) (cid:33)(cid:41) , xxxvii here c = 3 × , and c = √ × . (cid:4) A.5. Proof of Lemma A.4. We recall the definitions of the population and empiricaldistributional balls P = { µ ∈ M (IR + ) : W ( µ, µ ) ≤ R } . (A.60a) P N = { (cid:98) µ N ∈ M (IR + ) : W ( (cid:98) µ N , (cid:98) µ N ) ≤ R } . (A.60b)Due to the strong duality result of Theorem A.10, the following identity holdssup µ ∈P (cid:98) E ( µ ) = sup µ ∈P n ( n − (cid:88) ≤ i 2, we obtain that δR · R − N N (cid:88) k =1 inf ζ ∈ Ξ (cid:26) δR ( ξ k − ζ ) − φ n ( ζ ) (cid:27) (c) ≥ δ − N N (cid:88) k =1 φ n ( ξ k ) ≥ δ − N N (cid:88) k =1 | φ n ( ξ k ) | (d) ≥ δ − > , (A.67)where (c) is due to the upper bound on Moreau’s envelop in Lemma A.11, and (d) is due tothe upper bound in Eq. (A.66). From Inequalities (A.65) and (A.67) we conclude that theobjective value of the minimization for λ = δ/R , δ > λ = 0. Thus, necessarily (cid:98) λ N ∗ ≤ R . 
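The Moreau-envelope identities invoked above, Eq. (A.50) for the envelope, Eq. (A.51) for the proximal map, and the derivative formula (A.54a), are easy to verify numerically in one dimension. A small sketch, assuming the standard 1/(2α) scaling of the quadratic term (the scaling consistent with Eq. (A.54a)); f(x) = |x| is an illustrative convex test function.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox(f, y, alpha):
    # prox_{alpha f}(y) = argmin_x { (x - y)^2 / (2 alpha) + f(x) }, cf. Eq. (A.51)
    return minimize_scalar(lambda x: (x - y) ** 2 / (2 * alpha) + f(x)).x

def moreau(f, y, alpha):
    # M_{alpha f}(y) = inf_x { (x - y)^2 / (2 alpha) + f(x) }, cf. Eq. (A.50)
    x = prox(f, y, alpha)
    return (x - y) ** 2 / (2 * alpha) + f(x)

f, y, alpha = abs, 1.5, 0.2
# Envelope bounds (A.52)-(A.53): M_{alpha f}(y) lies within O(alpha) below f(y).
print(moreau(f, y, alpha), f(y))                  # 1.4 <= 1.5
# Derivative identity (A.54a): d/dy M_{alpha f}(y) = (y - prox_{alpha f}(y)) / alpha
eps = 1e-6
num_grad = (moreau(f, y + eps, alpha) - moreau(f, y - eps, alpha)) / (2 * eps)
print(num_grad, (y - prox(f, y, alpha)) / alpha)  # both approximately 1.0
```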
Using a similar argument ( mutatismutandis for the upper bound (A.66)), we can prove that λ ∗ ≤ R .Then, the following inequality holds (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup µ ∈P (cid:98) E ( µ ) − sup (cid:98) µ N ∈P N (cid:98) E ( (cid:98) µ N ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) inf λ ∈ Λ (cid:26) λR − (cid:90) Ξ inf ζ ∈ Ξ M λ φ n ( ξ ) µ (d ξ ) (cid:27) − inf λ ∈ Λ (cid:40) λR − N N (cid:88) k =1 inf ζ ∈ Ξ M λ φ n ( ξ k ) (cid:41) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup λ ∈ Λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) k =1 inf ζ ∈ Ξ M λ φ n ( ξ k ) − (cid:90) Ξ inf ζ ∈ Ξ M λ φ n ( ξ ) µ (d ξ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (A.68)where the last inequality follows by the fact that for two bounded functions f, g : X → IR,we have | inf X f − inf X g | ≤ sup X | f − g | .For any given λ ∈ Λ, define the following function T λ : IR N + → IR ξ = ( ξ , · · · , ξ N ) (cid:55)→ T λ ( ξ ) = 1 N N (cid:88) k =1 M λ φ n ( ξ k ) − (cid:90) Ξ M λ φ n ( ξ ) µ (d ξ ) , Then, we rewrite Inequality (A.68) as follows (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup µ ∈P (cid:98) E ( µ ) − sup (cid:98) µ N ∈P N (cid:98) E ( (cid:98) µ N ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup λ ∈ Λ | T λ ( ξ ) | . (A.69) xxxix et ξ = ( ξ , · · · , ξ k , · · · , ξ N ) ∈ Ξ N + and (cid:101) ξ = ( ξ , · · · , (cid:101) ξ k , · · · , ξ N ) ∈ Ξ N + denote twosequences that differs in the k -th coordinate for 1 ≤ k ≤ N . Then, (cid:12)(cid:12) T λ ( ξ ) − T λ ( (cid:101) ξ ) (cid:12)(cid:12) ≤ N (cid:12)(cid:12)(cid:12) M λ φ n ( ξ k ) − M λ φ n ( (cid:101) ξ k ) (cid:12)(cid:12)(cid:12) (A.70) ≤ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) inf ζ ∈ Ξ λ | ζ − ξ k | − n ( n − (cid:88) ≤ i 0. Consider an ε -net covering of the interval Λ denoted by N ( ε, Λ) = { λ , · · · , λ N ( ε, Λ) } , where N ( ε, Λ) ≤ R ε .The mapping λ (cid:55)→ T λ ( ξ ) is Lipschitz on the domain λ ∈ Λ. Indeed, for any λ , λ ∈ Λ,we obtain | T λ ( ξ ) − T λ ( ξ ) | ≤ N N (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) M λ φ n ( ξ k ) − M λ φ n ( ξ k ) (cid:12)(cid:12)(cid:12)(cid:12) + (cid:90) Ξ (cid:12)(cid:12)(cid:12)(cid:12) M λ φ n ( ξ ) − M λ φ n ( ξ ) (cid:12)(cid:12)(cid:12)(cid:12) µ (d ξ ) . We leverage Eq. (A.54b) of Proposition A.5 to obtaindd λ M λ f ( ξ ) = ( ξ − prox λ f ( ξ )) ≤ ξ u ξ l ) , (A.73)where the last inequality is due to the fact that ξ, prox λ f ( ξ ) ∈ Ξ = [ ξ l , ξ u ]. Consequently, M λ f is 4 ξ u -Lipschitz, and hence T λ ( ξ ) is 8 ξ u -Lipschitz.sup λ ∈ Λ | T λ ( ξ ) | ≤ max i ∈N ( ε, Λ) | T λ i ( ξ ) | + 8( ξ u − ξ l ) ε. (A.74)Using the union bound yieldsIP (cid:18) max i ∈N ( ε, Λ) | T λ i ( ξ ) | ≥ δ (cid:19) ≤ ∪ N ( ε, Λ) i =1 IP( | T λ i ( ξ ) | ≥ δ ) ≤ R ε exp (cid:18) − N R δ ξ u − ξ l ) (cid:19) (A.75) xl herefore, with the probability of (at least) 1 − ρ , we havemax i ∈N ( ε, Λ) | T λ i ( ξ ) | ≤ √ ξ u − ξ l ) R √ N (cid:115) log (cid:18) εR ρ (cid:19) . (A.76)We plug Eq. (A.76) into (A.74)sup λ ∈ Λ | T λ ( ξ ) | ≤ √ ξ u − ξ l ) R √ N (cid:115) log (cid:18) εR ρ (cid:19) + 8( ξ u − ξ l ) ε. (A.77)Since the size of the net 0 < ε < R is arbitrary, we let ε = √ ξ u − ξ l ) R √ N . Then, fromEq. (A.69) we obtain (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup µ ∈P (cid:98) E ( µ ) − sup (cid:98) µ N ∈P N (cid:98) E ( (cid:98) µ N ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ ξ u − ξ l ) R √ N (cid:118)(cid:117)(cid:117)(cid:116) log (cid:32) √ N ( ξ u − ξ l ) ρ (cid:33) , (A.78)with the probability of at least 1 − ρ . (cid:4) A.6. 
Proof of Lemma A.5. We define the solution of the regularized objective as
    µ̂^N_*(γ) def= arg sup_{µ̂^N ∈ P^N} E_γ(µ̂^N) = arg sup_{µ̂^N ∈ P^N} { E(µ̂^N) − (1/γ) IE_{P⊗x}[ K_{µ̂^N}(x, x̂) ] }.   (A.79)
We also define the solution of the (unregularized) empirical kernel alignment as
    µ̂^N_◦ def= arg sup_{µ̂^N ∈ P^N} E(µ̂^N).   (A.80)
Due to the optimality of the empirical measure µ̂^N_*(γ) for the optimization in Eq. (A.79), the following inequality holds:
    E_γ(µ̂^N_◦) ≤ E_γ(µ̂^N_*) ≤ E(µ̂^N_*).   (A.81)
Expanding E_γ(µ̂^N_◦) yields the inequality
    E(µ̂^N_◦) − (1/γ) IE_{P⊗x}[ K_{µ̂^N_◦}(x, x̂) ] ≤ E(µ̂^N_*).
After rearranging terms, we arrive at
    E(µ̂^N_◦) − E(µ̂^N_*) ≤ (1/γ) IE_{P⊗x}[ K_{µ̂^N_◦}(x, x̂) ] ≤ 1/γ,   (A.82)
where the last step follows from the fact that K(x, x̂) ≤ 1 for radial kernels and all x, x̂ ∈ X. Similarly, due to the optimality of µ̂^N_◦ for the optimization in Eq. (A.80), we have
    E(µ̂^N_*) ≤ E(µ̂^N_◦).   (A.83)
Combining Eqs. (A.82) and (A.83) now yields
    | E(µ̂^N_*) − E(µ̂^N_◦) | ≤ 1/γ.   (A.84)  □

A.7. Proof of Lemma A.6. To establish the proof, we state the following proposition due to [29, Proposition 1]:

Proposition A.6. (Approximation Error of the Sinkhorn Divergence, [29]) For any pair of discrete measures µ̂^N, ν̂^N ∈ M(X), the following inequality holds:
    | W_{1,ε}(µ̂^N, ν̂^N) − W_1(µ̂^N, ν̂^N) | ≤ c e^{−1/ε},   (A.85)
where c is a constant independent of ε that depends on the supports of µ̂^N and ν̂^N.

Using Lagrange multipliers for the distributional constraints yields the following saddle-point problems:
    sup_{µ̂^N ∈ M(Ξ)} inf_{h ∈ IR_+} J_h(µ̂^N) = E_γ(µ̂^N) − h ( W_1(µ̂^N, µ̂_0^N) − R ),   (A.86a)
    sup_{µ̂^N ∈ M(Ξ)} inf_{h ∈ IR_+} J_h^ε(µ̂^N) = E_γ(µ̂^N) − h ( W_{1,ε}(µ̂^N, µ̂_0^N) − R ).   (A.86b)
Let (µ̂^N_⋄, h_⋄) and (µ̂^N_*, h_*) denote the saddle points of (A.86a) and (A.86b), respectively. Due to the approximation error in Eq. (A.85) of Proposition A.6, the following inequality holds for all ε > 0:
    | J_h(µ̂^N) − J_h^ε(µ̂^N) | ≤ h c e^{−1/ε}.   (A.87)
Due to the optimality of (µ̂^N_⋄, h_⋄) for the saddle-point problem (A.86a), we have
    E_γ(µ̂^N_⋄) = J_{h_⋄}(µ̂^N_⋄) ≥ J_{h_⋄}(µ̂^N_*).   (A.88)
Using the upper bound in Eq. (A.87) yields
    J_{h_⋄}(µ̂^N_*) ≥ J^ε_{h_⋄}(µ̂^N_*) − h_⋄ c e^{−1/ε} ≥ J^ε_{h_*}(µ̂^N_*) − h_⋄ c e^{−1/ε} = E_γ(µ̂^N_*) − h_⋄ c e^{−1/ε}.   (A.89)
Combining Eqs. (A.88) and (A.89) yields
    E_γ(µ̂^N_*) − E_γ(µ̂^N_⋄) ≤ h_⋄ c e^{−1/ε}.   (A.90)
Similarly, it can be shown that
    E_γ(µ̂^N_⋄) − E_γ(µ̂^N_*) ≤ h_* c e^{−1/ε}.   (A.91)
Putting together Eqs.
(A.90) and (A.91) yields the following inequality (cid:12)(cid:12) E γ ( (cid:98) µ N ∗ ) − E γ ( (cid:98) µ N (cid:5) ) (cid:12)(cid:12) ≤ 12 max { h ∗ , h (cid:5) } c exp (cid:18) − ε (cid:19) . (A.92)From Eqs. (A.9) and (A.10), we also have (cid:12)(cid:12) E ( (cid:98) µ N ∗ ) − E ( (cid:98) µ N (cid:5) ) (cid:12)(cid:12) ≤ (cid:12)(cid:12) E γ ( (cid:98) µ N ∗ ) − E γ ( (cid:98) µ N (cid:5) ) (cid:12)(cid:12) + 1 γ IE (cid:104) K (cid:98) µ N ∗ ( x , (cid:98) x ) + K (cid:98) µ N (cid:5) ( x , (cid:98) x ) (cid:105) . (A.93) xlii lugging Eq. (A.92) into Eq. (A.93) and using the fact that K µ ( x , (cid:98) x ) ≤ µ ∈ M + (Ξ)yields (cid:12)(cid:12) E ( (cid:98) µ N ∗ ) − E ( (cid:98) µ N (cid:5) ) (cid:12)(cid:12) ≤ 12 max { h ∗ , h (cid:5) } c exp (cid:18) − ε (cid:19) + 2 γ . (A.94)It now remains to show that the optimal Lagrange multipliers h ∗ and h (cid:63) in Equation (A.94)are bounded. To this end, define the following Lagrangian dual function Q ( h ) def = inf (cid:98) µ N ∈M (IR + ) J h ( (cid:98) µ N ) , (A.95a) Q ε ( h ) def = inf (cid:98) µ N ∈M (IR + ) J εh ( (cid:98) µ N ) . (A.95b)Furthermore, let (cid:98) µ Nsl, = 1 N (cid:80) Nk =1 δ ξ ksl and (cid:98) µ Nsl, = 1 N (cid:80) Nk =1 δ ¯ ξ ksl are the empirical measuresin conjunction with the slater vectors ξ sl , ¯ ξ sl ∈ IR N + in Assumption (A.2) . We leverage [34,Lemma 1], to obtain following upper bounds h (cid:5) ≤ R − W ( (cid:98) µ Ns, , (cid:98) µ N ) (cid:0) E γ ( (cid:98) µ Ns, ) − Q (¯ h ) (cid:1) , (A.96a) h ∗ ≤ R − W ( (cid:98) µ Ns, , (cid:98) µ N ) (cid:0) E γ ( (cid:98) µ Ns, ) − Q ε (¯ h ) (cid:1) , (A.96b)where ¯ h ∈ IR + is arbitrary, e.g. ¯ h = 0. (cid:4) A.8. Proof of Proposition A.1. To compute the 2-norm difference between the processes(¯ ξ t ) ≤ t ≤ T and (¯ θ t ) ≤ t ≤ T , we define the following auxiliary vectors¯ ξ ∗ η ( m − 1) def = ¯ ξ η ( m − − η ∇ (cid:98) J N (cid:0) ¯ ξ η ( m − ; z m − , ˜ z m − (cid:1) + (cid:114) β ζ ηm , (A.97a) ¯ θ ∗ η ( m − 1) def = ¯ θ η ( m − − η ∇ J (cid:0) ¯ θ η ( m − ; ρ η ( m − (cid:1) + (cid:114) β ζ ηm . (A.97b)The recursions in Eqs. (A.20)-(A.26) then take the following forms¯ ξ ηm = P Ξ N (cid:16) ξ ∗ η ( m − (cid:17) , (A.98a) ¯ θ ηm = P Ξ N (cid:16) θ ∗ η ( m − (cid:17) . (A.98b)By the non-expansive property of the Euclidean projection onto a non-empty, closed, convexset Ξ N we obtain (see [2]) (cid:13)(cid:13)(cid:13) P Ξ N (cid:0) ξ ∗ η ( m − (cid:1) − P Ξ N (cid:16) θ ∗ η ( m − (cid:17) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13) ¯ ξ η ( m − − ¯ θ η ( m − (cid:13)(cid:13) . (A.99)Using a triangle inequality yields (cid:13)(cid:13) ¯ ξ ηm − ¯ θ ηm (cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ¯ ξ η ( m − − ¯ θ η ( m − (cid:13)(cid:13)(cid:13) + η (cid:13)(cid:13)(cid:13) ∇ (cid:98) J N (cid:16) ¯ ξ η ( m − ; z m − , ˜ z m − (cid:17) − ∇ J (cid:16) ¯ θ η ( m − ; ρ η ( m − (cid:17)(cid:13)(cid:13)(cid:13) . (A.100)Computing Eq. (A.100) recursively yields (cid:13)(cid:13)(cid:13) ¯ ξ ηm − ¯ θ ηm (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ¯ ξ − ¯ θ (cid:13)(cid:13)(cid:13) + η m − (cid:88) (cid:96) =0 (cid:13)(cid:13)(cid:13) ∇ (cid:98) J N (cid:16) ¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) (cid:17) − ∇ J (cid:16) ¯ θ η(cid:96) ; ρ η(cid:96) (cid:17)(cid:13)(cid:13)(cid:13) . (A.101) xliii herefore, using the triangle inequality and based on the initialization ¯ ξ = θ , we canrewrite Eq. 
(A.101) as follows (cid:13)(cid:13)(cid:13) ¯ ξ ηm − ¯ θ ηm (cid:13)(cid:13)(cid:13) ≤ η m − (cid:88) (cid:96) =0 (cid:13)(cid:13)(cid:13) ∇ (cid:98) J N (cid:16) ¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) (cid:17) − ∇ J (cid:16) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17)(cid:13)(cid:13)(cid:13) + η m − (cid:88) (cid:96) =0 (cid:13)(cid:13)(cid:13) ∇ J (cid:16) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17) − ∇ J (cid:16) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17)(cid:13)(cid:13)(cid:13) + η m − (cid:88) (cid:96) =0 (cid:13)(cid:13)(cid:13) ∇ J (cid:16) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17) − ∇ J (cid:0) ¯ θ η(cid:96) ; ρ η(cid:96) (cid:1)(cid:13)(cid:13)(cid:13) = E ( ηm ) + E ( ηm ) + E ( ηm ) . (A.102)In the sequel, we analyze each term separately:A.8.1. Upper Bound on E ( ηm ) . Let F m denotes the σ -algebra generated by the samples( z k , ˜ z k ) k ≤ m , and the initial condition ξ . Let F = ∅ . Then, taking the expectation withrespect to the joint distribution P x ,y yieldsIE P ⊗ x ,y (cid:104) ∇ (cid:98) J N (cid:0) ¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) (cid:1)(cid:12)(cid:12)(cid:12) F (cid:96) − (cid:105) = ∇ J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1) , where ∇ J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1) = (cid:0) ∇ k J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1)(cid:1) ≤ k ≤ N has the following elements ∇ k J (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) = Q ( ¯ ξ kη(cid:96) ) + 1 N N (cid:88) m =1 R ( ¯ ξ kη(cid:96) , ¯ ξ mη(cid:96) ) , (A.103)where Q ( ¯ ξ kη(cid:96) ) def = 1 N IE P ⊗ x ,y (cid:104) (cid:107) x − ˜ x (cid:107) y ˜ ye − ¯ ξ kη(cid:96) (cid:107) x − (cid:101) x (cid:107) (cid:105) , (A.104a) R ( ¯ ξ kη(cid:96) , ¯ ξ mη(cid:96) ) def = 1 N γ IE P ⊗ x (cid:104) (cid:107) x − ˜ x (cid:107) e − (¯ ξ mη(cid:96) +¯ ξ kη(cid:96) ) (cid:107) x − ˜ x (cid:107) (cid:105) . (A.104b)Define the following random vector Z m def = η m (cid:88) (cid:96) =0 (cid:32) ∇ (cid:98) J N (cid:16) ¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) (cid:17) − ∇ J (cid:16) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17)(cid:33) (A.105a) = η m (cid:88) (cid:96) =0 (cid:32) ∇ (cid:98) J N (cid:16) ¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) (cid:17) − IE (cid:104) ∇ (cid:98) J N (¯ ξ η(cid:96) ; z (cid:96) , ˜ z (cid:96) ) (cid:12)(cid:12)(cid:12) F (cid:96) − (cid:105) (cid:33) , (A.105b)with Z = . Clearly, Z m is a martingale IE[ Z m |F m − ] = Z m − . Moreover, it has abounded difference (cid:107) Z m − Z m − (cid:107) ≤ η (cid:13)(cid:13)(cid:13) ∇ (cid:98) J N (cid:16) ¯ ξ η(cid:96) ; z m , ˜ z m (cid:17)(cid:13)(cid:13)(cid:13) + η (cid:13)(cid:13)(cid:13) ∇ J (cid:16) ¯ ξ ηm ; (cid:98) µ Nηm (cid:17)(cid:13)(cid:13)(cid:13) . (A.106)Now, for all k = 1 , , · · · , N , the following inequalities can be established using Assumption (A.1) , (cid:12)(cid:12) ∇ k J (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) (cid:12)(cid:12) ≤ N (1 + γ − ) K (A.107a) (cid:12)(cid:12)(cid:12) ∇ k (cid:98) J N (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) (cid:12)(cid:12)(cid:12) ≤ N (1 + γ − ) K , (A.107b) xliv espectively. From Eq. (A.106) we have (cid:107) Z m − Z m − (cid:107) ≤ √ N (1 + γ − ) K . (A.108)Therefore, Z m − Z m − is (conditionally) zero mean and bounded, and is thus (condition-ally) sub-Gaussian with the Orlicz norm of (cid:107) Z m − Z m − (cid:107) ψ ≤ γ − ) K √ N . , i.e. ,IE (cid:104) e (cid:104) u , Z m − Z m − (cid:105) (cid:12)(cid:12)(cid:12) F m − (cid:105) ≤ exp (cid:18) N (1 + γ − ) K (cid:107) u (cid:107) (cid:19) , ∀ u ∈ IR N . 
(A.109)We thus conclude that (cid:107) Z m − Z m − (cid:107) is sub-Gaussian with the Orlicz norm of √ N (1 + γ − ) K . Now, we note that E ( ηm ) = η m − (cid:88) (cid:96) =0 (cid:107) Z (cid:96) − Z (cid:96) − (cid:107) , Z − = . (A.110)Applying the Bernestein inequality yieldsIP( E ( ηm ) ≥ ε ) ≤ exp (cid:18) − ε N mη (1 + γ − ) K (cid:19) (A.111) ≤ exp (cid:32) − ε N η (1 + γ − ) K (cid:98) Tη (cid:99) (cid:33) , ∀ m ∈ [0 , T η − ] ∩ IN . (A.112)Applying a union bound yieldsIP (cid:32) sup m ∈ [0 , Tη ] ∩ IN E ( ηm ) ≥ ε (cid:33) ≤ (cid:22) Tη (cid:23) exp (cid:32) − ε N η (1 + γ − ) K (cid:98) Tη (cid:99) (cid:33) . (A.113)Therefore, with the probability of at least 1 − ρ , we obtainsup m ∈ [0 , Tη ] ∩ IN E ( ηm ) ≤ (cid:115) (cid:98) Tη (cid:99) η (1 + γ − ) K N log (cid:98) Tη (cid:99) ρ . (A.114)A.8.2. Upper Bound on E ( ηm ) . To characterize the upper bound on E ( ηm ), we write (cid:13)(cid:13) ∇ J (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) − ∇ J (¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) ) (cid:13)(cid:13) ≤ (cid:13)(cid:13) ∇ J (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) − ∇ J (¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) ) (cid:13)(cid:13) (A.115) = N (cid:88) k =1 (cid:12)(cid:12) ∇ k J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1) − ∇ k J (cid:0) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1)(cid:12)(cid:12) . (A.116)Moreover, using Eqs. (A.103)-(A.104) and the triangle inequality yields (cid:12)(cid:12) ∇ k J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1) − ∇ k J (cid:0) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1)(cid:12)(cid:12) ≤ (cid:12)(cid:12) Q ( ¯ ξ kη(cid:96) ) − Q (¯ θ kη(cid:96) ) (cid:12)(cid:12) + 1 N N (cid:88) m =1 (cid:12)(cid:12) R ( ¯ ξ kη(cid:96) , ¯ ξ mη(cid:96) ) − R (¯ θ kη(cid:96) , ¯ ξ mη(cid:96) ) (cid:12)(cid:12) . (A.117) xlv he first term on the right hand side of Eq.(A.117) has the following upper bound (cid:12)(cid:12) Q ( ¯ ξ kη(cid:96) ) − Q (¯ θ kη(cid:96) ) (cid:12)(cid:12) (a) ≤ N IE P ⊗ x ,y (cid:104) (cid:107) x − ˜ x (cid:107) | y ˜ y | (cid:12)(cid:12)(cid:12) e − ¯ ξ kη(cid:96) (cid:107) x − ˜ x (cid:107) − e − ¯ θ kη(cid:96) (cid:107) x − ˜ x (cid:107) (cid:12)(cid:12)(cid:12)(cid:105) (b) ≤ K | ¯ ξ kη(cid:96) − ¯ θ kη(cid:96) | N IE P ⊗ x ,y (cid:2) (cid:107) x − ˜ x (cid:107) | y ˜ y | (cid:3) ≤ K | ¯ ξ kη(cid:96) − ¯ θ kη(cid:96) | N , (A.118)where (a) we used the fact that the mapping ξ (cid:55)→ exp( − ξ (cid:107) x − ˜ x (cid:107) ) is K -Lipschitz, and (b)follows by using Assumption (A.2) .The second term on the right hand side of Eq. (A.117) has the following upper bound (cid:12)(cid:12) R ( ¯ ξ kη(cid:96) , ¯ ξ mη(cid:96) ) − R (¯ θ kη(cid:96) , ¯ ξ mη(cid:96) ) (cid:12)(cid:12) ≤ N γ IE P ⊗ x (cid:104) (cid:107) x − ˜ x (cid:107) e − ¯ ξ mη(cid:96) (cid:107) x − ˜ x (cid:107) (cid:12)(cid:12)(cid:12) e − ¯ ξ kη(cid:96) (cid:107) x − ˜ x (cid:107) − e − ¯ θ kη(cid:96) (cid:107) x − ˜ x (cid:107) (cid:12)(cid:12)(cid:12)(cid:105) ≤ K | ¯ ξ kη(cid:96) − ¯ θ kη(cid:96) | N γ IE P ⊗ x (cid:104) (cid:107) x − ˜ x (cid:107) e − ¯ ξ mη(cid:96) (cid:107) x − ˜ x (cid:107) (cid:105) ≤ K | ¯ ξ kη(cid:96) − ¯ θ kη(cid:96) | N γ . (A.119)Plugging Eqs. (A.118), (A.119) into Eq. 
(A.117) yields (cid:12)(cid:12) ∇ k J (cid:0) ¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1) − ∇ k J (cid:0) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:1)(cid:12)(cid:12) ≤ K (1 + γ − ) N | ¯ ξ kη(cid:96) − ¯ θ kη(cid:96) | , where the last inequality follows from the fact that | y ˜ y | ≤ (cid:107) x − ˜ x (cid:107) ≤ K , and exp( − ξ mη(cid:96) (cid:107) x − ˜ x (cid:107) ) ≤ 1. Now, plugging Eq. (A.120) into Eq. (A.116) yields (cid:13)(cid:13) ∇ J (¯ ξ η(cid:96) ; (cid:98) µ Nη(cid:96) ) − ∇ J (¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) ) (cid:13)(cid:13) ≤ K (1 + γ − ) √ N (cid:107) ¯ ξ η(cid:96) − ¯ θ η(cid:96) (cid:107) . (A.120)Therefore, the error E ( ηm ) has the following upper bound E ( ηm ) ≤ K (1 + γ − ) η √ N m − (cid:88) (cid:96) =0 (cid:107) ¯ ξ η(cid:96) − ¯ θ η(cid:96) (cid:107) , (A.121)for all m ∈ [0 , T η − ] ∩ IN. Therefore,sup m ∈ [0 ,T η − ] ∩ IN E ( ηm ) ≤ K (1 + γ − ) η √ N (cid:98) Tη (cid:99)− (cid:88) (cid:96) =0 (cid:107) ¯ ξ η(cid:96) − ¯ θ η(cid:96) (cid:107) , (A.122)A.8.3. Upper Bound on E ( m ) . To upper bound on E ( ηm ), we bound the following norm (cid:13)(cid:13)(cid:13) ∇ J (cid:16) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17) − ∇ J (cid:0) ¯ θ η(cid:96) ; ρ η(cid:96) (cid:1)(cid:13)(cid:13)(cid:13) ≤ N (cid:88) k =1 (cid:12)(cid:12)(cid:12) ∇ k J (cid:16) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17) − ∇ k J (cid:16) ¯ θ η(cid:96) ; ρ η(cid:96) (cid:17)(cid:12)(cid:12)(cid:12) . (A.123) xlvi ach term inside the parenthesis on the right hand side of Eq. (A.123) has the followingupper bound (cid:12)(cid:12)(cid:12) ∇ k J (cid:16) ¯ θ η(cid:96) ; (cid:98) µ Nη(cid:96) (cid:17) − ∇ k J (cid:16) ¯ θ η(cid:96) ; ρ η(cid:96) (cid:17)(cid:12)(cid:12)(cid:12) ≤ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ ξ mη(cid:96) ) − (cid:90) Ξ R (¯ θ kη(cid:96) , ¯ θ ) ρ η(cid:96) (d¯ θ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ ξ mη(cid:96) ) − N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ θ mη(cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + 1 N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ θ mη(cid:96) ) − (cid:90) Ξ R (¯ θ kη(cid:96) , ¯ θ ) ρ η(cid:96) (d¯ θ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (A.124)where the last step is due to the triangle inequality. The first term on the right hand sideof the last inequality in (A.124) has the following upper bound (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ ξ mη(cid:96) ) − N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ θ mη(cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ K N N (cid:88) m =1 (cid:12)(cid:12) ¯ ξ mη(cid:96) − ¯ θ mη(cid:96) (cid:12)(cid:12) . (A.125)The second term on the right hand side of the last inequality in (A.124) has the followingupper bound (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) m =1 R (¯ θ kη(cid:96) , ¯ θ mη(cid:96) ) − (cid:90) Ξ R (¯ θ kη(cid:96) , ¯ θ ) ρ η(cid:96) (d¯ θ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) Ξ R (¯ θ kη(cid:96) , ¯ θ ) (cid:0)(cid:98) ρ Nη(cid:96) (d¯ θ ) − ρ η(cid:96) (d¯ θ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ≤ K N D BL ( ρ η(cid:96) , (cid:98) ρ Nη(cid:96) ) , (A.126)where the inequality follows by the fact that | R (¯ θ kη(cid:96) , ¯ θ ) | ≤ K /N , and (cid:98) ρ Nη(cid:96) is the empiricalmeasure associated with the samples ¯ θ kη(cid:96) , i.e. 
We use the upper bounds in Eqs. (A.125) and (A.126) in conjunction with Inequality (A.124) to derive
\[ \Big\|\nabla J\big(\bar{\theta}_{\eta\ell};\widehat{\mu}^N_{\eta\ell}\big) - \nabla J\big(\bar{\theta}_{\eta\ell};\rho_{\eta\ell}\big)\Big\|_2 \le \frac{K^2}{N}\,\|\bar{\xi}_{\eta\ell} - \bar{\theta}_{\eta\ell}\|_1 + \frac{K^2}{N}\, D_{BL}\big(\rho_{\eta\ell},\widehat{\rho}^N_{\eta\ell}\big) \le \frac{K^2}{\sqrt{N}}\,\|\bar{\xi}_{\eta\ell} - \bar{\theta}_{\eta\ell}\|_2 + \frac{K^2}{N}\, D_{BL}\big(\rho_{\eta\ell},\widehat{\rho}^N_{\eta\ell}\big). \tag{A.128} \]
We apply McDiarmid's martingale inequality to obtain the following concentration inequality:
\[ IP\Big(D_{BL}\big(\rho_{\eta\ell},\widehat{\rho}^N_{\eta\ell}\big) \ge \delta\Big) \le \exp\big(-N\delta^2\big). \tag{A.129} \]
Let $Q(m) := \sum_{\ell=0}^{m-1} D_{BL}(\rho_{\eta\ell},\widehat{\rho}^N_{\eta\ell})$. Then we obtain, for all $m = 0,1,\cdots,\lfloor T/\eta\rfloor$, that
\[ IP\big(Q(m) \ge \delta\big) \le \exp\Big(-\frac{N\delta^2}{m}\Big) \tag{A.130} \]
\[ \le \exp\Big(-\frac{N\delta^2}{\lfloor T/\eta\rfloor}\Big). \tag{A.131} \]
Applying a union bound yields
\[ IP\Big(\sup_{m\in[0,T/\eta]\cap IN} Q(m) \ge \delta\Big) \le \Big\lfloor\frac{T}{\eta}\Big\rfloor \exp\Big(-\frac{N\delta^2}{\lfloor T/\eta\rfloor}\Big). \tag{A.132} \]
Therefore, with probability at least $1-\rho$, we have
\[ \sup_{m\in[0,T/\eta]\cap IN} E_3(\eta,m) \le \frac{K^2\eta}{\sqrt{N}} \sum_{\ell=0}^{\lfloor T/\eta\rfloor-1} \|\bar{\xi}_{\eta\ell} - \bar{\theta}_{\eta\ell}\|_2 + \frac{K^2\eta}{N}\sqrt{\frac{\lfloor T/\eta\rfloor}{N}\,\log\frac{\lfloor T/\eta\rfloor}{\rho}}. \tag{A.133} \]

A.9. Combining the upper bounds. We now leverage the upper bounds on $E_1(\eta,m)$, $E_2(\eta,m)$, and $E_3(\eta,m)$ in Eqs. (A.111), (A.122), and (A.133). Define
\[ S_\eta(n) := \sup_{0\le m\le n}\big\|\bar{\xi}_{\eta m} - \bar{\theta}_{\eta m}\big\|_2. \tag{A.134} \]
Applying a union bound yields the following inequality from Eq. (A.102):
\[ S_\eta\big(\lfloor T/\eta\rfloor\big) \le \eta(1+\gamma^{-1})K^2\Big(\frac{1}{\sqrt{N}} + \frac{1}{N\sqrt{N}}\Big)\sqrt{\Big\lfloor\frac{T}{\eta}\Big\rfloor\log\Big(\frac{2\lfloor T/\eta\rfloor}{\rho}\Big)} + \frac{2K^2(1+\gamma^{-1})\,\eta}{\sqrt{N}} \sum_{\ell=0}^{\lfloor T/\eta\rfloor-1} S_\eta(\ell), \tag{A.135} \]
with probability at least $1-\rho$. In deriving the last inequality, we assumed that $K \ge 1$.

Lemma A.13. (Discrete Grönwall's Inequality, [21]) If $\{y_m\}_{m\in IN}$, $\{x_m\}_{m\in IN}$, and $\{z_m\}_{m\in IN}$ are non-negative sequences and
\[ y_m \le x + \sum_{\ell=0}^{m-1} z_\ell\, y_\ell, \qquad m\in IN, \tag{A.136} \]
then $y_m \le x \prod_{0\le \ell \le m-1}(1+z_\ell)$.

Proof of Proposition A.2. Let $(\Omega,\mathcal{F},(\mathcal{F}_t)_{0\le t\le T},P)$ denote a filtered probability space, and let $(W_t)_{0\le t\le T}$ denote a Wiener process that is $\mathcal{F}_t$-adapted. To establish the proof, we consider the mapping $\mathcal{T}: C([0,\eta m],IR^N) \to IR^{N\times m}$ from the space of sample paths such that $\mathcal{T}((W_{\eta k})_{k\le m}) = \theta_{\eta m}$, where $\theta_{\eta m}$ is defined recursively as follows:
\[ \theta_{\eta m} = P_{\Xi^N}\Big(\theta_{\eta(m-1)} - \eta\,\nabla J(\theta_{\eta(m-1)},\varsigma_{\eta(m-1)}) + \sqrt{\tfrac{2}{\beta}}\,\zeta_{\eta m}\Big), \tag{A.149} \]
where $\varsigma^{\otimes N}_{\eta m} = P_{\theta_{\eta m}}$ and $\zeta_{\eta m} = W_{\eta m} - W_{\eta(m-1)}$. Then $\mathcal{T}((W_t)_{t\le \eta m}) = \bar{\theta}_{\eta m}$ and $\mathcal{T}((\widetilde{W}_t)_{t\le \eta m}) = \widetilde{\theta}_{\eta m}$. Let $P^{\widetilde{\theta}_{\eta m}} \overset{\rm law}{=} \widetilde{\theta}_{\eta m}$ and $P^{\bar{\theta}_{\eta m}} \overset{\rm law}{=} \bar{\theta}_{\eta m}$. Furthermore, $P^{\widetilde{W}} \overset{\rm law}{=} \widetilde{W}$ and $P^{W} \overset{\rm law}{=} W$, where $\widetilde{W} = (\widetilde{W}_{\eta k})_{k\le m}$ and $W = (W_{\eta k})_{0\le k\le m}$. Then, for any measurable mapping $\mathcal{T}: (\Omega,\mathcal{F}) \to (\Omega',\mathcal{F}')$, the following (data-processing) inequality holds:
\[ D_{KL}\big(P\circ \mathcal{T}^{-1}\,\|\, Q\circ \mathcal{T}^{-1}\big) \le D_{KL}(P\,\|\,Q). \tag{A.150} \]
Therefore,
\[ D_{KL}\Big(P^{\bar{\theta}_{\eta m}}\,\big\|\, P^{\widetilde{\theta}_{\eta m}}\Big) \le D_{KL}\Big(P^{W}\,\big\|\, P^{\widetilde{W}}\Big). \tag{A.151} \]
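Before bounding the right-hand side of (A.151), it may help to see the recursion (A.149) concretely. The following is a minimal sketch added for exposition: the drift grad_J is a toy placeholder (the paper's drift couples each particle to the empirical measure of all particles), and the interval endpoints xi_l, xi_u and all parameter values are illustrative assumptions; the projection $P_{\Xi^N}$ is realized as coordinatewise clipping onto $\Xi = [\xi_l,\xi_u]$.

import numpy as np

# Sketch of the projected particle Langevin recursion (A.149).
rng = np.random.default_rng(0)
N, eta, beta, T = 50, 1e-2, 10.0, 1.0
xi_l, xi_u = 0.1, 5.0
theta = rng.uniform(xi_l, xi_u, size=N)   # one particle per coordinate

def grad_J(theta):
    # Placeholder mean-field drift, NOT the paper's J; for illustration only.
    return theta - theta.mean()

for m in range(int(T / eta)):
    zeta = rng.normal(scale=np.sqrt(eta), size=N)     # Wiener increment
    theta = theta - eta * grad_J(theta) + np.sqrt(2.0 / beta) * zeta
    theta = np.clip(theta, xi_l, xi_u)                # projection P_{Xi^N}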
Define the stopping times
\[ \tau^k_n := \inf\Big\{t \ge 0 : \int_0^t |G^k_s|^2\,{\rm d}s \ge n\Big\}, \qquad k = 1,2,\cdots,N, \tag{A.152} \]
where $G_s = (G^1_s,\cdots,G^N_s)$ and
\[ G_s := \sqrt{\tfrac{\beta}{2}}\,\Big(\nabla J(\widetilde{\theta}_s,\rho_s) - \nabla J(\theta_s,\mu_s)\Big). \tag{A.153} \]
Furthermore, define the stopped process
\[ U_t(n) := U_{t\wedge\tau_n} := \int_0^{t\wedge\tau_n} G_s\,{\rm d}s. \tag{A.154} \]
The processes $(U_t(n))$ are adapted, locally-$H^2$ processes, i.e., for every $t \ge 0$ the truncated processes $(U_s(n))_{s\le t}$ are in the class $H^2$. Thus, the Itô integrals $\int_0^t G^k_s\,{\rm d}W^k_s$ are well defined for all $k = 1,2,\cdots,N$. For each coordinate $k = 1,2,\cdots,N$, we define the Doléans-Dade exponential as follows:
\[ \mathcal{E}_t\big(U^k(n)\big) := \exp\Bigg(\int_0^{t\wedge\tau^k_n} G^k_s(n)\,{\rm d}W^k_s - \frac{1}{2}\int_0^{t\wedge\tau^k_n} |G^k_s(n)|^2\,{\rm d}s\Bigg), \qquad k = 1,2,\cdots,N. \tag{A.155} \]
The following theorem provides a sufficient condition for the Doléans-Dade exponential to be a martingale:

Theorem A.15. (Novikov's Theorem, [35]) Consider a real-valued adapted process $(Y_t)_{t\ge 0}$ on the probability space $(\Omega,\mathcal{F},P,(\mathcal{F}_t)_{t\le T})$ that is locally $H^2$. Furthermore, $(W_t)_{0\le t\le T}$ is an adapted Wiener process. Suppose the following condition holds:
\[ IE\Big[\exp\Big(\frac{1}{2}\int_0^t |Y_s|^2\,{\rm d}s\Big)\Big] < \infty. \tag{A.156} \]
Then for each $t \ge 0$, the Doléans-Dade exponential defined by
\[ \mathcal{E}_t(Y) = \exp\Big(\int_0^t Y_s\,{\rm d}W_s - \frac{1}{2}\int_0^t |Y_s|^2\,{\rm d}s\Big) \tag{A.157} \]
is a positive martingale under the probability measure $P$.

It is easy to see that the processes $U^k_t(n)$, $k = 1,2,\cdots,N$, defined in Eq. (A.154) satisfy the condition (A.156) of Theorem A.15, and thus the exponential process in (A.155) is a martingale. This in turn allows us to employ the following change of measure argument due to Girsanov [18]:

Theorem A.16. (Girsanov's Change of Measure, [18]) Let $(W_t)_{t\le T}$ be a Wiener process on the Wiener probability space $(\Omega,\mathcal{F},P)$. Let $(X_t)_{t\le T}$ be a measurable process adapted to the natural filtration of the Wiener process $\mathcal{F}_t = \sigma(W_s; s\le t)$ with $X_0 = 0$. If the Doléans-Dade exponential $\mathcal{E}_t(X)$ is a strictly positive martingale, a probability measure $Q$ can be defined on $(\Omega,\mathcal{F})$ via the Radon-Nikodym derivative
\[ \frac{{\rm d}Q}{{\rm d}P}\Big|_{\mathcal{F}_t} = \mathcal{E}_t(X). \tag{A.158} \]
Then for each $t \le T$, the measure $Q$ restricted to the unaugmented sigma field $\mathcal{F}_t$ is equivalent to $P$ restricted to $\mathcal{F}_t$.

Now, consider the following change of measure:
\[ \frac{{\rm d}P^{U^k(n)+W^k}}{{\rm d}P^{W^k}}\Big|_{\mathcal{F}_t} = \mathcal{E}_t\big(G^k(n)\big), \qquad k = 1,2,\cdots,N, \tag{A.159} \]
which defines a new probability measure on $(\Omega,\mathcal{F})$. Furthermore, define the following Wiener process:
\[ \widetilde{W}^k_t(n) = U^k_t(n) + W^k_t. \tag{A.160} \]
Let $P^{U(n)+W} \overset{\rm law}{=} U(n)+W$. As $n\to\infty$, $\tau^k_n \to \infty$ for all $k = 1,2,\cdots,N$, and $P^{U(n)+W} \to P^{\widetilde{W}}$ weakly. Due to the lower semi-continuity of the KL divergence, we then have that
\[ D_{KL}\Big(P^{W}\,\big\|\, P^{\widetilde{W}}\Big) = \liminf_{n\to\infty} D_{KL}\Big(P^{W}\,\big\|\, P^{U(n)+W}\Big). \tag{A.161} \]
Furthermore, it is easy to see that $P^{U(n)+W} = \bigotimes_{k=1}^N P^{U^k(n)+W^k}$ and $P^{W} = \bigotimes_{k=1}^N P^{W^k}$. Due to the tensorization property of the KL divergence, we have that
\[ D_{KL}\Big(P^{W}\,\big\|\, P^{U(n)+W}\Big) = \sum_{k=1}^N D_{KL}\Big(P^{W^k}\,\big\|\, P^{U^k(n)+W^k}\Big). \tag{A.162} \]
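As a concrete illustration of Theorems A.15 and A.16 (a worked instance added here for exposition, not part of the original argument), consider the constant drift $G_s \equiv \lambda$. Then
\[ \mathcal{E}_t(\lambda) = \exp\Big(\lambda W_t - \tfrac{1}{2}\lambda^2 t\Big), \qquad IE\Big[\exp\Big(\tfrac{1}{2}\int_0^t \lambda^2\,{\rm d}s\Big)\Big] = e^{\lambda^2 t/2} < \infty, \]
so Novikov's condition (A.156) holds trivially and $\mathcal{E}_t(\lambda)$ is a positive martingale. Girsanov's theorem then says that under ${\rm d}Q/{\rm d}P|_{\mathcal{F}_t} = \mathcal{E}_t(\lambda)$, the shifted process $W_t - \lambda t$ is a Wiener process, and repeating the computation carried out below with $G \equiv \lambda$ gives
\[ D_{KL}\big(P^{W}\,\|\, P^{W+\lambda t}\big) = \frac{1}{2}\int_0^T \lambda^2\,{\rm d}s = \frac{\lambda^2 T}{2}, \]
which is the same quadratic functional of the drift that appears in (A.166).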
Therefore, we proceed from Eq. (A.161) using (A.162):
\[ D_{KL}\Big(P^{W}\,\big\|\, P^{\widetilde{W}}\Big) = \liminf_{n\to\infty} \sum_{k=1}^N D_{KL}\Big(P^{W^k}\,\big\|\, P^{U^k(n)+W^k}\Big) = -\liminf_{n\to\infty} \sum_{k=1}^N IE\Bigg[\log\Bigg(\frac{{\rm d}P^{U^k(n)+W^k}}{{\rm d}P^{W^k}}\Bigg)\Bigg] \]
\[ = -\liminf_{n\to\infty} \sum_{k=1}^N IE\Big[\log\Big(\mathcal{E}_{\eta m}\big(U^k(n)\big)\Big)\Big] = -\liminf_{n\to\infty} \sum_{k=1}^N IE\Bigg[\int_0^{\eta m\wedge\tau_n} G^k_s(n)\,{\rm d}W^k_s - \frac{1}{2}\int_0^{\eta m\wedge\tau_n} |G^k_s(n)|^2\,{\rm d}s\Bigg]. \tag{A.163} \]
The integral term $\int_0^{t\wedge\tau_n} G^k_s(n)\,{\rm d}W^k_s$ is a martingale and its expectation vanishes. Therefore, Eq. (A.163) reduces to
\[ D_{KL}\Big(P^{W}\,\big\|\, P^{\widetilde{W}}\Big) = \liminf_{n\to\infty} \sum_{k=1}^N IE\Bigg[\frac{1}{2}\int_0^{\eta m\wedge\tau_n} |G^k_s(n)|^2\,{\rm d}s\Bigg] \tag{A.164} \]
\[ \overset{(a)}{=} \sum_{k=1}^N IE\Bigg[\liminf_{n\to\infty}\frac{1}{2}\int_0^{\eta m\wedge\tau_n} |G^k_s(n)|^2\,{\rm d}s\Bigg] \tag{A.165} \]
\[ = \frac{\beta}{4}\, IE\Bigg[\int_0^{\eta m}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg], \tag{A.166} \]
where (a) follows by the monotone convergence theorem. Plugging Inequality (A.166) into (A.151) yields
\[ D_{KL}\Big(P^{\bar{\theta}_{\eta m}}\,\big\|\, P^{\widetilde{\theta}_{\eta m}}\Big) \le \frac{\beta}{4}\, IE\Bigg[\int_0^{\eta m}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg], \]
for all $m\in[0,T/\eta-1]\cap IN$. Taking the supremum on both sides yields
\[ \sup_{m\in[0,T/\eta]\cap IN} D_{KL}\Big(P^{\bar{\theta}_{\eta m}}\,\big\|\, P^{\widetilde{\theta}_{\eta m}}\Big) \le \frac{\beta}{4}\sup_{m\in[0,T/\eta-1]\cap IN} IE\Bigg[\int_0^{\eta m}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg] \le \frac{\beta}{4}\, IE\Bigg[\sup_{m\in[0,T/\eta-1]\cap IN}\int_0^{\eta m}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg]. \tag{A.167} \]
The discrete-time process $m\mapsto \int_0^{\eta m}\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\|_2^2\,{\rm d}s$ is a submartingale. Therefore, invoking the discrete-time version of Doob's maximal submartingale inequality in Eq. (A.189b) of Theorem A.20 in Section A.11 (the initial term vanishes since the process is zero at $m=0$) yields
\[ \sup_{m\in[0,T/\eta]\cap IN} D_{KL}\Big(P^{\bar{\theta}_{\eta m}}\,\big\|\, P^{\widetilde{\theta}_{\eta m}}\Big) \le \frac{\beta}{4}\cdot\frac{e}{e-1}\, IE\Bigg[\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s \times \log\Bigg(\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg)\Bigg] \tag{A.168} \]
\[ \le \frac{\beta}{2}\, IE\Bigg[\Bigg(\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg)^2\Bigg], \tag{A.169} \]
where the last step is due to the basic inequality $x\log(x) \le x^2 - x \le x^2$, and we used the fact that $e/(e-1) \le 2$. By the Cauchy-Schwarz inequality, we obtain
\[ \Bigg(\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^2\,{\rm d}s\Bigg)^2 \le T\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^4\,{\rm d}s. \tag{A.170} \]
Combining (A.169) and (A.170) yields
\[ \sup_{m\in[0,T/\eta]\cap IN} D_{KL}\Big(P^{\bar{\theta}_{\eta m}}\,\big\|\, P^{\widetilde{\theta}_{\eta m}}\Big) \le \frac{\beta T}{2}\, IE\Bigg[\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^4\,{\rm d}s\Bigg] = \frac{\beta T}{2}\int_0^{\eta\lfloor T/\eta\rfloor} IE\Big[\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^4\Big]\,{\rm d}s. \tag{A.171} \]
We compute the following upper bound for the integrand using the triangle inequality:
\[ \big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2 \le \big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\mu_s)\big\|_2 + \big\|\nabla J(\widetilde{\theta}_s,\mu_s) - \nabla J(\theta_s,\mu_s)\big\|_2. \tag{A.172} \]
The first term on the right-hand side of Eq. (A.172) has the following upper bound:
\[ \big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\mu_s)\big\|_2 \le \sum_{k=1}^N \Big|\nabla_k J(\widetilde{\theta}_s,\nu_s) - \nabla_k J(\widetilde{\theta}_s,\mu_s)\Big| = \frac{1}{N}\sum_{k=1}^N \Bigg|\int_\Xi R(\widetilde{\theta}^k_s,\theta)\,\nu_s({\rm d}\theta) - \int_\Xi R(\widetilde{\theta}^k_s,\theta)\,\mu_s({\rm d}\theta)\Bigg| \le \frac{K^2}{N}\, D_{BL}(\nu_s,\mu_s). \tag{A.173} \]
Now, recall that for any two probability measures $\mu,\nu \in \mathcal{M}(\mathcal{X})$ we have $D_{BL}(\nu,\mu) \le W_1(\nu,\mu) \le W_2(\nu,\mu)$. From Eq. (A.173) we proceed:
\[ \big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\mu_s)\big\|_2 \le \frac{K^2}{N}\, W_2(\nu_s,\mu_s) \tag{A.174} \]
\[ \overset{(a)}{=} \frac{K^2}{N^{3/2}}\, W_2\big(\nu^{\otimes N}_s,\mu^{\otimes N}_s\big) \tag{A.175} \]
\[ \le \frac{K^2}{N^{3/2}}\Big(IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^2\big]\Big)^{1/2}, \tag{A.176} \]
where (a) follows by the tensorization property of the Wasserstein distance in Theorem A.14. Similarly, for the second term on the right-hand side of Eq. (A.172), the following upper bound holds:
\[ \big\|\nabla J(\widetilde{\theta}_s,\mu_s) - \nabla J(\theta_s,\mu_s)\big\|_2 \le \frac{K^2(1+\gamma^{-1})}{\sqrt{N}}\,\|\widetilde{\theta}_s - \theta_s\|_2. \tag{A.177} \]
Plugging Eqs. (A.176) and (A.177) into Eq. (A.172), raising to the fourth power, and taking the expectation yields
\[ IE\Big[\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^4\Big] \le \frac{8K^8}{N^6}\Big(IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^2\big]\Big)^2 + \frac{8K^8(1+\gamma^{-1})^4}{N^2}\, IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big], \tag{A.178} \]
where we used the basic inequality $(a+b)^n \le 2^{n-1}a^n + 2^{n-1}b^n$. Due to Jensen's inequality, $(IE[\|\widetilde{\theta}_s - \theta_s\|_2^2])^2 \le IE[\|\widetilde{\theta}_s - \theta_s\|_2^4]$, and using the fact that $1/N^6 \le (1+\gamma^{-1})^4/N^2$ yields
\[ IE\Big[\big\|\nabla J(\widetilde{\theta}_s,\nu_s) - \nabla J(\theta_s,\mu_s)\big\|_2^4\Big] \le \frac{16K^8(1+\gamma^{-1})^4}{N^2}\, IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big]. \tag{A.179} \]
We substitute Inequality (A.179) into Eq. (A.171):
\[ \sup_{m\in[0,T/\eta]\cap IN} D_{KL}\Big(P^{\widetilde{\theta}_{\eta m}}\,\big\|\, P^{\bar{\theta}_{\eta m}}\Big) \le \frac{8\beta T K^8(1+\gamma^{-1})^4}{N^2}\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big]\,{\rm d}s. \tag{A.180} \]
From the Csiszár-Kullback-Pinsker inequality (see, e.g., [46]), we have
\[ \Big\|P^{\widetilde{\theta}_{\eta m}} - P^{\bar{\theta}_{\eta m}}\Big\|_{TV}^2 \le \frac{1}{2}\, D_{KL}\Big(P^{\widetilde{\theta}_{\eta m}}\,\big\|\, P^{\bar{\theta}_{\eta m}}\Big). \tag{A.181} \]
Hence, we obtain that
\[ \sup_{m\in[0,T/\eta]\cap IN}\Big\|P^{\widetilde{\theta}_{\eta m}} - P^{\bar{\theta}_{\eta m}}\Big\|_{TV}^2 \le \frac{4\beta T K^8(1+\gamma^{-1})^4}{N^2}\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big]\,{\rm d}s. \tag{A.182} \]
To obtain an upper bound on the Wasserstein distance $W_2(P^{\widetilde{\theta}_{\eta m}},P^{\bar{\theta}_{\eta m}})$, we leverage the following result due to Villani [56, Theorem 6.13]:

Theorem A.17. (An Inequality for the Wasserstein Distance, [56, Theorem 6.13]) Let $\mu$ and $\nu$ be two probability measures on a Polish space $(\mathcal{X},d)$. Let $p\in[1,\infty)$ and $x_0 \in \mathcal{X}$. Then,
\[ W_p(\mu,\nu) \le 2^{1/q}\Bigg(\int_{\mathcal{X}} d^p(x_0,x)\,{\rm d}|\mu-\nu|(x)\Bigg)^{1/p}, \qquad \frac{1}{p} + \frac{1}{q} = 1. \tag{A.183} \]

The following corollary is immediate from Theorem A.17:

Corollary A.17.1. Suppose the Polish space $(\mathcal{X},d)$ has finite diameter. Then, for any two probability measures $\mu$ and $\nu$ on $(\mathcal{X},d)$, we have
\[ W_p^p(\mu,\nu) \le 2^{p/q}\,\big({\rm diam}(\mathcal{X})\big)^p\,\|\mu-\nu\|_{TV}, \tag{A.184} \]
where ${\rm diam}(\mathcal{X}) = \sup_{x_1,x_2\in\mathcal{X}} d(x_1,x_2) < \infty$.
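A quick numerical check of Corollary A.17.1 for $p=1$ on a discrete one-dimensional space follows; this sketch is added for illustration and adopts the convention $\|\mu-\nu\|_{TV} = |\mu-\nu|(\mathcal{X}) = \sum_i |p_i - q_i|$, matching the total-variation measure used in (A.183).

import numpy as np
from scipy.stats import wasserstein_distance

# Check W_1(mu, nu) <= diam(X) * ||mu - nu||_TV on a random discrete space.
rng = np.random.default_rng(1)
support = rng.uniform(0.0, 1.0, size=20)
p = rng.dirichlet(np.ones(20))            # weights of mu
q = rng.dirichlet(np.ones(20))            # weights of nu

w1 = wasserstein_distance(support, support, u_weights=p, v_weights=q)
tv = np.abs(p - q).sum()
diam = support.max() - support.min()
assert w1 <= diam * tv + 1e-12
print(f"W1 = {w1:.4f} <= diam * TV = {diam * tv:.4f}")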
Consider the metric space $(\Xi^N,\|\cdot\|_2)$. By Assumption (A.1), $\Xi = [\xi_l,\xi_u]$. Therefore, ${\rm diam}(\Xi^N) = {\rm diam}(\Xi)\sqrt{N}$. Using Inequality (A.184) of Corollary A.17.1 with $p = q = 2$ in conjunction with Eq. (A.182) yields
\[ \sup_{m\in[0,T/\eta]\cap IN} W_2^2\Big(P^{\widetilde{\theta}_{\eta m}},P^{\bar{\theta}_{\eta m}}\Big) \le 4(\xi_u-\xi_l)^2 K^4(1+\gamma^{-1})^2\sqrt{\beta T}\Bigg(\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big]\,{\rm d}s\Bigg)^{1/2}. \tag{A.185} \]
Recall that $\nu^{\otimes N}_{\eta m} = P^{\widetilde{\theta}_{\eta m}}$ and $\rho^{\otimes N}_{\eta m} = P^{\bar{\theta}_{\eta m}}$. Using the tensorization property of the 2-Wasserstein distance (cf. Eq. (A.147) of Thm. A.14) yields
\[ W_2^2\Big(P^{\widetilde{\theta}_{\eta m}},P^{\bar{\theta}_{\eta m}}\Big) = N\, W_2^2\big(\nu_{\eta m},\rho_{\eta m}\big). \tag{A.186} \]
Plugging Eq. (A.186) into (A.185) gives
\[ \sup_{m\in[0,T/\eta]\cap IN} W_2^2\big(\nu_{\eta m},\rho_{\eta m}\big) \le \frac{4(\xi_u-\xi_l)^2 K^4(1+\gamma^{-1})^2}{N}\sqrt{\beta T}\Bigg(\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^4\big]\,{\rm d}s\Bigg)^{1/2}. \]
Taking the square root of both sides and using the fact that
\[ \sup_{m\in[0,T/\eta]\cap IN} W_2\big(\nu_{\eta m},\rho_{\eta m}\big) \le \Bigg(\sup_{m\in[0,T/\eta]\cap IN} W_2^2\big(\nu_{\eta m},\rho_{\eta m}\big)\Bigg)^{1/2} \tag{A.187} \]
completes the proof. $\square$

A.11. Proof of Proposition A.3. Before proving our results, we collect a few technical results to which we refer in the sequel, and we also give a definition. The following definition is concerned with the Skorokhod problem:

Definition A.18. (Skorokhod Problem, [51]) Given a $d$-dimensional reflection matrix $R$, the Skorokhod problem is the problem of constructing a map $\Psi: C([0,T],IR^d) \to C([0,T],IR^d)\times C([0,T],IR^d)$ such that for every process $(Y_t)_{0\le t\le T} \in C([0,T],IR^d)$, the image $(X_t,Z_t)_{0\le t\le T} = \Psi((Y_t)_{0\le t\le T})$ satisfies the following properties:
• $X_t = Y_t + RZ_t$, $t\in[0,T]$;
• $Z_0 = 0$, and $Z^j_t$ is non-decreasing for all $j = 1,2,\cdots,d$;
• $\int_0^\infty X^j_t\,{\rm d}Z^j_t = 0$ for all $j = 1,2,\cdots,d$, where the integral is in the Stieltjes sense, which is well defined since the processes $Z^j_t$ are non-decreasing.

We leverage the following lemma due to Tanaka [55], which establishes an upper bound on the norm of the difference between two reflected processes:

Lemma A.19. (Tanaka, [55, Lemma 2.2]) Let $(Y_t)_{0\le t\le T}$ and $(\widetilde{Y}_t)_{0\le t\le T}$ denote two càdlàg processes. Furthermore, let $(X_t,Z_t)_{0\le t\le T}$ and $(\widetilde{X}_t,\widetilde{Z}_t)_{0\le t\le T}$ denote the corresponding solutions to the Skorokhod problem with the reflection matrix $R = I_{d\times d}$ in Definition A.18. Then the following inequality holds:
\[ \|X_t - \widetilde{X}_t\|_2^2 \le \|Y_t - \widetilde{Y}_t\|_2^2 + 2\int_0^t \big\langle Y_t - \widetilde{Y}_t - Y_s + \widetilde{Y}_s,\, {\rm d}Z_s - {\rm d}\widetilde{Z}_s\big\rangle, \tag{A.188} \]
for all $t\in[0,T]$.

To establish our results, we also need the following maximal inequality for sub-martingales:

Theorem A.20. (Doob's Sub-martingale Maximal Inequality, [13, Thm. 3.4]) Consider the filtered probability space $(\Omega,\mathcal{F},(\mathcal{F}_t)_{t\ge 0},IP)$ and let $(M_t)_{t\ge 0}$ be a continuous $\mathcal{F}_t$-adapted non-negative sub-martingale. Let $p\ge 1$ and $T > 0$. If $IE[M_T^p] < +\infty$, then we have
\[ IE\Big[\Big(\sup_{0\le t\le T} M_t\Big)^p\Big] \le \Big(\frac{p}{p-1}\Big)^p IE\big[M_T^p\big], \qquad p > 1, \tag{A.189a} \]
\[ IE\Big[\sup_{0\le t\le T} M_t\Big] \le \Big(\frac{e}{e-1}\Big)\Big(IE\big[M_T\log M_T\big] + IE\big[M_0(1-\log M_0)\big]\Big). \tag{A.189b} \]
We remark that Inequality (A.189a) is the classical Doob $L^p$-inequality, $p\in(1,\infty)$ [13, Theorem 3.4]. The second result (A.189b) is the Doob $L^1$-inequality in the sharp form derived by Gilat [17] from the $L\log L$ Hardy-Littlewood inequality.

The following result is due to Słomiński [52]:

Theorem A.21. (Non-Central Moments of Local Time of Semi-martingales, [52, Thm. 2.2])
Consider the filtered probability space $(\Omega,\mathcal{F},(\mathcal{F}_t)_{t\ge 0},IP)$, and let $(Y_t)_{0\le t\le T}$ denote an $\mathcal{F}_t$-adapted $IR^d$-valued semi-martingale with the following Doob-Meyer decomposition:
\[ Y_t = Y_0 + M_t + A_t, \tag{A.190} \]
where $(M_t)_{0\le t\le T}$ is an $\mathcal{F}_t$-adapted local martingale, $(A_t)_{0\le t\le T}$ is an $\mathcal{F}_t$-adapted process of locally bounded variation, and $Y_0 \in B \subset IR^d$ is the initial condition, confined to the convex region $B$. Let $(X_t,Z_t)_{0\le t\le T}$ denote the solution of the Skorokhod problem for the process $(Y_t)_{0\le t\le T}$ with $R = I_{d\times d}$ in Definition A.18. Further, $Z_t = \int_0^t n_s\,L({\rm d}s)$ is the regulator process associated with the feasible set $B$, where $n_t$ and $L(s)$ are the normal vector and the local time at the boundary $\partial B$. Then, for every $p\in IN$, every stopping time $\tau$ on $\mathcal{F}_t$, and any $a \in \bar{B}\setminus\partial B$, there exists $c_p > 0$ such that
\[ IE\big[L^p(\tau)\big] \le c_p\big({\rm dist}(a,\partial B)\big)^{-p}\, IE\Big[\sup_{0\le t\le \tau}\|X_t - a\|_2^p\Big] \tag{A.191} \]
\[ \le c_p\big({\rm dist}(a,\partial B)\big)^{-p}\Big(\|a - Y_0\|_2^p + IE\big[\langle M,M\rangle_\tau^{p/2} + \|A_\tau\|_2^p\big]\Big), \tag{A.192} \]
where $\langle M,M\rangle_\tau := \sum_{i=1}^d \langle M^i,M^i\rangle_\tau$ is the quadratic variation, and ${\rm dist}(a,\partial B) := \min_{b\in\partial B}\|a - b\|_2$.

In the next lemma, we present a maximal inequality for the Euclidean norm of the Wiener process:

Lemma A.22. (A Maximal Inequality for the Wiener Processes) Suppose $(W_t)_{0\le t\le T}$ is the standard Wiener process in $IR^N$ with $W_0 = 0$ and $W_{t+u} - W_t \sim {\sf N}(0, uI_{N\times N})$. Then,
\[ IE\Big[\sup_{0\le t\le T}\big\|W_t - W_{\eta\lfloor t/\eta\rfloor}\big\|_2^p\Big] \le T\Big(\frac{p}{p-1}\Big)^p 2^{\frac{p}{2}-1}\,\eta^{\frac{p}{2}-1}\, N\,\frac{\Gamma\big(\frac{N+p}{2}\big)}{\Gamma\big(\frac{N+2}{2}\big)} =: A_p. \tag{A.193} \]
The proof of Lemma A.22 is due to [3] and is presented in Appendix B.2 for completeness.

Equipped with these technical results, we are in a position to prove the main result of Proposition A.3. Consider the càdlàg process $(\widetilde{\theta}_t)_{0\le t\le T}$ in Eq. (A.27). It can be readily verified that the process can be reformulated as follows:
\[ \widetilde{\theta}_{\eta m} = P_{\Xi^N}\Big(\widetilde{\theta}_{\eta(m-1)} + \sqrt{\tfrac{2}{\beta}}\,\widetilde{\zeta}_{\eta m}\Big), \tag{A.194} \]
where $\widetilde{\zeta}_{\eta m} = \widetilde{W}_{\eta m} - \widetilde{W}_{\eta(m-1)}$, and
\[ \widetilde{W}_{\eta m} = W_{\eta m} - \sqrt{\tfrac{\beta}{2}}\int_0^{\eta m}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s. \tag{A.195} \]
Alternatively, the embedded continuous-time dynamics in Eq. (A.194) can be written as follows:
\[ \widetilde{\theta}_t = \widetilde{\theta}_0 + \sqrt{\tfrac{2}{\beta}}\sum_{\ell=0}^{\lfloor t/\eta\rfloor}\widetilde{\zeta}_{\eta\ell} + \int_0^t \widetilde{n}_s\,\widetilde{L}({\rm d}s) \tag{A.196} \]
\[ = \widetilde{\theta}_0 + \sqrt{\tfrac{2}{\beta}}\,W_{\eta\lfloor t/\eta\rfloor} - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s + \int_0^t \widetilde{n}_s\,\widetilde{L}({\rm d}s), \tag{A.197} \]
where $\widetilde{\theta}_0 = \theta_0$. Furthermore, the local time $\widetilde{L}$ is defined as below:
\[ \widetilde{L}(s) := \sum_{m=0}^{\lfloor T/\eta\rfloor}\|\Delta_m\|_2\,\delta_{\eta m}(s). \tag{A.198} \]
Above, $\Delta_m$ is defined as follows:
\[ \Delta_m := \widetilde{\theta}_{\eta(m-1)} + \sqrt{\tfrac{2}{\beta}}\,\widetilde{\zeta}_{\eta m} - P_{\Xi^N}\Big(\widetilde{\theta}_{\eta(m-1)} + \sqrt{\tfrac{2}{\beta}}\,\widetilde{\zeta}_{\eta m}\Big). \tag{A.199} \]
Furthermore, the normal vector $(\widetilde{n}_t)_{0\le t\le T}$ is a piecewise-constant process, where $\widetilde{n}_t = \widetilde{n}_{\eta m}$ for $t\in[\eta m,\eta(m+1))$, and
\[ \widetilde{n}_{\eta m} := \frac{\Delta_m}{\|\Delta_m\|_2}, \qquad m\in[0,T/\eta]\cap IN. \tag{A.200} \]
We employ Inequality (A.188) from Lemma A.19 to compute an upper bound on the norm of the difference between the processes $(\theta_t)_{0\le t\le T}$ and $(\widetilde{\theta}_t)_{0\le t\le T}$ in Eqs. (A.22) and (A.196), respectively.
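As an aside, the reflection bookkeeping in (A.196)-(A.200) can be visualized in one dimension: each projection step contributes a residual $\|\Delta_m\|$ to the regulator (local time). The following minimal sketch is for illustration only; the interval, step size, and temperature are assumed values.

import numpy as np

# Discrete Skorokhod decomposition: free motion + accumulated reflection.
rng = np.random.default_rng(2)
eta, beta, T = 1e-2, 10.0, 1.0
xi_l, xi_u = 0.0, 1.0                    # feasible interval Xi (assumption)
theta, L = 0.5, 0.0

for m in range(int(T / eta)):
    zeta = rng.normal(scale=np.sqrt(eta))
    free = theta + np.sqrt(2.0 / beta) * zeta   # unconstrained move
    proj = min(max(free, xi_l), xi_u)           # P_Xi(free)
    L += abs(free - proj)                       # ||Delta_m||, Eq. (A.199)
    theta = proj

print(f"accumulated local time L(T) ~ {L:.4f}")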
In particular, we define the following càdlàg processes:
\[ Y_t = \theta_0 + \sqrt{\tfrac{2}{\beta}}\,W_t - \int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s, \tag{A.201a} \]
\[ \widetilde{Y}_t = \widetilde{\theta}_0 + \sqrt{\tfrac{2}{\beta}}\,W_{\eta\lfloor t/\eta\rfloor} - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s. \tag{A.201b} \]
Furthermore,
\[ Z_t = \int_0^t n_s\,L({\rm d}s), \qquad \widetilde{Z}_t = \int_0^t \widetilde{n}_s\,\widetilde{L}({\rm d}s). \tag{A.202} \]
Using Tanaka's inequality in Eq. (A.188) of Lemma A.19, we derive
\[ \sup_{0\le t\le T}\big\|\widetilde{\theta}_t - \theta_t\big\|_2^2 \le \sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2^2 + 4\sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2\,\big(L(T) + \widetilde{L}(T)\big). \tag{A.203} \]
By the Schwarz inequality, for any $p\in IN$ we obtain
\[ IE\Big[\sup_{0\le t\le T}\big\|\widetilde{\theta}_t - \theta_t\big\|_2^p\Big] \le 2^{p-1}\, IE\Big[\sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2^p\Big] + 2^{p+1}\Bigg(IE\Big[\sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2^p\Big]\Bigg)^{1/2}\Big(IE\big[L^p(T)\big] + IE\big[\widetilde{L}^p(T)\big]\Big)^{1/2}. \tag{A.204-A.205} \]
We invoke Inequality (A.192) of Theorem A.21 for the solutions of the Skorokhod problem associated with the processes in Eq. (A.201), where
\[ M_t = \sqrt{\tfrac{2}{\beta}}\,W_t, \qquad \widetilde{M}_t = \sqrt{\tfrac{2}{\beta}}\,W_{\eta\lfloor t/\eta\rfloor}, \tag{A.206a} \]
\[ A_t = -\int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s, \qquad \widetilde{A}_t = -\int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s. \tag{A.206b} \]
Then $\langle M,M\rangle_t = \frac{2}{\beta}Nt$ and $\langle\widetilde{M},\widetilde{M}\rangle_t = \frac{2}{\beta}\eta N\lfloor t/\eta\rfloor$. Furthermore,
\[ \|A_t\|_2 \le \int_0^t \|\nabla J(\theta_s,\mu_s)\|_2\,{\rm d}s \le \frac{(1+\gamma^{-1})K^2\,t}{\sqrt{N}}. \tag{A.207} \]
Similarly, we have $\|\widetilde{A}_t\|_2 \le (1+\gamma^{-1})K^2\,\eta\lfloor t/\eta\rfloor/\sqrt{N}$. For simplicity, we suppose $\theta_0 = \widetilde{\theta}_0 \in \bar{\Xi}^N\setminus\partial\Xi^N$, and let $a = \theta_0 = \widetilde{\theta}_0$ in Eq. (A.192). We then obtain that
\[ IE\big[L^p(T)\big] \le B_p, \qquad IE\big[\widetilde{L}^p(T)\big] \le B_p, \tag{A.208a} \]
where
\[ B_p := c_p\big({\rm dist}(\theta_0,\partial\Xi^N)\big)^{-p}\Bigg(\Big(\frac{2}{\beta}\Big)^{p/2} N^{p/2}\,T^{p/2} + \frac{K^{2p}(1+\gamma^{-1})^p\,T^p}{N^{p/2}}\Bigg), \tag{A.208b} \]
for some constant $c_p > 0$ that is independent of $N$ and $T$, where in the last inequality we used the fact that $\eta\lfloor T/\eta\rfloor \le T$. Due to the fact that $\theta_0 = \widetilde{\theta}_0$, we get
\[ IE\Big[\sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2^p\Big] \le 2^{p-1}\, IE\Bigg[\sup_{0\le t\le T}\Bigg\|\int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s\Bigg\|_2^p\Bigg] + 2^{p-1}\Big(\frac{2}{\beta}\Big)^{p/2} IE\Big[\sup_{0\le t\le T}\big\|W_t - W_{\eta\lfloor t/\eta\rfloor}\big\|_2^p\Big], \tag{A.209} \]
where once again we used the geometric inequality $(a+b)^p \le 2^{p-1}a^p + 2^{p-1}b^p$. For the first term in Eq. (A.209) we obtain
\[ \Bigg\|\int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s\Bigg\|_2^p \le 2^{p-1}\int_{\eta\lfloor t/\eta\rfloor}^{t}\|\nabla J(\theta_s,\mu_s)\|_2^p\,{\rm d}s + 2^{p-1}\int_0^{\eta\lfloor t/\eta\rfloor}\big\|\nabla J(\theta_s,\mu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s. \tag{A.210} \]
From Eq. (A.107) we recall
\[ \|\nabla J(\theta_s,\mu_s)\|_2^p \le \frac{K^{2p}(1+\gamma^{-1})^p}{N^{p/2}}. \tag{A.211} \]
Furthermore, from the derivations leading to Eq. (A.178), we obtain
\[ \big\|\nabla J(\theta_s,\mu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p \le 2^{p-1}\big\|\nabla J(\theta_s,\mu_s) - \nabla J(\theta_s,\nu_s)\big\|_2^p + 2^{p-1}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p \tag{A.212a} \]
\[ \le 2^{p-1}\frac{K^{2p}}{N^{3p/2}}\Big(IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^2\big]\Big)^{p/2} + 2^{p-1}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p. \tag{A.212b} \]
We plug Eqs. (A.211) and (A.212b) into Eq. (A.210), take the supremum, and subsequently take the expectation:
\[ IE\Bigg[\sup_{0\le t\le T}\Bigg\|\int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s\Bigg\|_2^p\Bigg] \le 2^{p-1}\eta\,\frac{K^{2p}(1+\gamma^{-1})^p}{N^{p/2}} + 2^{2(p-1)}\frac{K^{2p}}{N^{3p/2}}\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^p\big]\,{\rm d}s \]
\[ + 2^{2(p-1)}\, IE\Bigg[\sup_{0\le t\le T}\int_0^{\eta\lfloor t/\eta\rfloor}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s\Bigg], \tag{A.213} \]
where in writing the upper bound we leveraged Jensen's inequality, $(IE[\|\widetilde{\theta}_s - \theta_s\|_2^2])^{p/2} \le IE[\|\widetilde{\theta}_s - \theta_s\|_2^p]$. The process $t\mapsto \int_0^{\eta\lfloor t/\eta\rfloor}\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\|_2^p\,{\rm d}s$ is a sub-martingale. Therefore, invoking Doob's sub-martingale maximal inequality in Theorem A.20 (the process vanishes at $t = 0$) yields
\[ IE\Bigg[\sup_{0\le t\le T}\int_0^{\eta\lfloor t/\eta\rfloor}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s\Bigg] \le 2\, IE\Bigg[\log\Bigg(\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s\Bigg)\int_0^{\eta\lfloor T/\eta\rfloor}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s\Bigg]. \tag{A.214} \]
Let us recall the following two upper bounds:
\[ \big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p \le 2^{p-1}\|\nabla J(\theta_s,\nu_s)\|_2^p + 2^{p-1}\|\nabla J(\widetilde{\theta}_s,\nu_s)\|_2^p \le \frac{2^p K^{2p}(1+\gamma^{-1})^p}{N^{p/2}}, \tag{A.215} \]
\[ \big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p \le \frac{K^{2p}(1+\gamma^{-1})^p}{N^{p/2}}\,\|\theta_s - \widetilde{\theta}_s\|_2^p. \tag{A.216} \]
We use the first inequality (A.215) to bound the logarithm term in Eq. (A.214), and the second inequality (A.216) for the term outside of the logarithm:
\[ IE\Bigg[\sup_{0\le t\le T}\int_0^{\eta\lfloor t/\eta\rfloor}\big\|\nabla J(\theta_s,\nu_s) - \nabla J(\widetilde{\theta}_s,\nu_s)\big\|_2^p\,{\rm d}s\Bigg] \le \frac{2K^{2p}(1+\gamma^{-1})^p}{N^{p/2}}\log\Bigg(\frac{2^p K^{2p} T(1+\gamma^{-1})^p}{N^{p/2}}\Bigg)\, IE\Bigg[\int_0^{\eta\lfloor T/\eta\rfloor}\|\theta_s - \widetilde{\theta}_s\|_2^p\,{\rm d}s\Bigg]. \tag{A.217} \]
By the Fubini-Tonelli theorem, the expectation on the right-hand side of Eq. (A.217) can be moved inside the integral. Plugging the result into Eq. (A.213) yields
\[ IE\Bigg[\sup_{0\le t\le T}\Bigg\|\int_0^t \nabla J(\theta_s,\mu_s)\,{\rm d}s - \int_0^{\eta\lfloor t/\eta\rfloor}\nabla J(\widetilde{\theta}_s,\nu_s)\,{\rm d}s\Bigg\|_2^p\Bigg] \le C_p + D_p\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^p\big]\,{\rm d}s, \tag{A.218} \]
where $C_p$ and $D_p$ are defined in Eqs. (A.32c) and (A.32d). We now return to Equation (A.209). By plugging Eqs. (A.218) and (A.193) from Lemma A.22 into Eq. (A.209), we obtain
\[ IE\Big[\sup_{0\le t\le T}\big\|Y_t - \widetilde{Y}_t\big\|_2^p\Big] \le 2^{p-1}\Big(\frac{2}{\beta}\Big)^{p/2} A_p + 2^{p-1} C_p + 2^{p-1} D_p\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^p\big]\,{\rm d}s. \tag{A.219} \]
Substituting Eqs. (A.208) and (A.219) into Eq. (A.204) yields
\[ IE\Big[\sup_{0\le t\le T}\big\|\theta_t - \widetilde{\theta}_t\big\|_2^p\Big] \le \Big(2^{p-1}\Big(\frac{2}{\beta}\Big)^{p/2} A_p + 2^{p-1} C_p\Big)\sqrt{B_p} + 2^{p-1} D_p\sqrt{B_p}\int_0^{\eta\lfloor T/\eta\rfloor} IE\big[\|\widetilde{\theta}_s - \theta_s\|_2^p\big]\,{\rm d}s. \tag{A.220} \]
Now, define
\[ F(t) := IE\Big[\sup_{0\le s\le t}\big\|\theta_s - \widetilde{\theta}_s\big\|_2^p\Big]. \tag{A.221} \]
Equation (A.220) implies
\[ F(T) \le \Big(2^{p-1}\Big(\frac{2}{\beta}\Big)^{p/2} A_p + 2^{p-1} C_p\Big)\sqrt{B_p} + 2^{p-1} D_p\sqrt{B_p}\int_0^T F(s)\,{\rm d}s. \tag{A.222} \]
We apply Grönwall's inequality to obtain
\[ F(T) \le \sqrt{B_p}\Big(2^{p-1}\Big(\frac{2}{\beta}\Big)^{p/2} A_p + 2^{p-1} C_p\Big)\, e^{2^{p-1} T D_p\sqrt{B_p}}. \tag{A.223} \]
This inequality completes the proof of the first part. To establish the proof of the second part, we use the fact that $\sup_{0\le s\le T} W_2^2(\nu_s,\mu_s) \le IE[\sup_{0\le s\le T}\|\theta_s - \widetilde{\theta}_s\|_2^2]$. Since $\nu^{\otimes N}_s = P^{\widetilde{\theta}_s}$ and $\widetilde{\theta}_s = \widetilde{\theta}_{\eta\lfloor s/\eta\rfloor}$ is càdlàg, we have $\nu_s = \nu_{\eta\lfloor s/\eta\rfloor}$ for all $0\le s\le T$. Hence,
\[ \sup_{0\le s\le T} W_2^2\big(\nu_{\eta\lfloor s/\eta\rfloor},\mu_s\big) = \sup_{0\le s\le T} W_2^2(\nu_s,\mu_s) \tag{A.224} \]
\[ \le \sqrt{B_2}\Big(\frac{4}{\beta}A_2 + 2C_2\Big)\, e^{2TD_2\sqrt{B_2}}. \tag{A.225} \]
$\square$

A.12. Proof of Proposition A.12. Recall that a kernel function is defined via the inner product $K(x,\tilde{x}) = \langle\Phi(x),\Phi(\tilde{x})\rangle_{\mathcal{H}_K}$, where $\Phi(x): \mathcal{X}\to\mathcal{H}$ is the implicit feature map, which is an element of the RKHS. The random feature model of Rahimi and Recht [41, 42] relies on the following embedding of the RKHS into the $L^2(\Omega\otimes IR,\nu\times\nu_0)$ space:
\[ K(x,\tilde{x}) = \big\langle \varphi(x;\omega,b),\varphi(\tilde{x};\omega,b)\big\rangle_{L^2(\Omega\otimes IR,\nu\times\nu_0)}, \tag{A.226} \]
where $\varphi(x;\omega,b) = \cos(\langle\omega,x\rangle + b)$. Alternatively, since $K(x,x) = K(\tilde{x},\tilde{x}) = \phi(0) = 1$, we have
\[ \Delta(x,\tilde{x}) := \big\|\varphi(x;\omega,b) - \varphi(\tilde{x};\omega,b)\big\|_{L^2(\Omega\otimes IR,\nu\times\nu_0)} = \sqrt{2\big(1 - K(x,\tilde{x})\big)}. \tag{A.227} \]
Now, consider the following embedding of the kernel from $L^2(\Omega\otimes IR,\nu\times\nu_0)$ into the $\ell^2$ space:
\[ \widehat{\Delta}_N(x,\tilde{x}) := \big\|\varphi_N(x) - \varphi_N(\tilde{x})\big\|_2, \tag{A.228} \]
where we recall $\varphi_N(x) := \sqrt{2/N}\,\big(\cos(\langle\omega_k,x\rangle + b_k)\big)_{1\le k\le N}$, with $b_1,\cdots,b_N \sim$ i.i.d. $\nu_0 = {\rm Uniform}[-\pi,\pi]$ and $\omega_1,\cdots,\omega_N \sim$ i.i.d. $\nu$. Based on [41, Claim 1], the following concentration inequality holds:
\[ IP\Bigg(\sup_{x,\tilde{x}\in\mathcal{X}}\big|K(x,\tilde{x}) - \langle\varphi_N(x),\varphi_N(\tilde{x})\rangle\big| \ge \delta\Bigg) \le 2^8\Bigg(\frac{\sigma\,{\rm diam}(\mathcal{X})}{\delta}\Bigg)^2\exp\Bigg(-\frac{N\delta^2}{4(d+2)}\Bigg), \]
where $\sigma^2 = IE[\|\omega\|_2^2]$. Applying a union bound yields the following inequality:
\[ \sup_{x,\tilde{x}\in\mathcal{X}}\Big|\Delta^2(x,\tilde{x}) - \widehat{\Delta}_N^2(x,\tilde{x})\Big| \le \sqrt{\frac{4(d+2)}{N}\,\ln\Bigg(\frac{\sigma^2\,{\rm diam}^2(\mathcal{X})\,N}{4(d+2)}\Bigg)} =: e_N, \tag{A.229} \]
with probability at least $1-\rho$. Due to the stability of the Gaussian distribution and the result of [11], we obtain
\[ IP\big[G_{t,w}(x) = G_{t,w}(\tilde{x})\big] = \int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\widehat{\Delta}_N(x,\tilde{x})}\exp\Bigg(-\frac{s^2}{2\widehat{\Delta}_N^2(x,\tilde{x})}\Bigg)\Big(1 - \frac{s}{W}\Big)\,{\rm d}s. \tag{A.230} \]
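Before bounding (A.230), here is a small Monte Carlo sketch of the random feature embedding (A.226)-(A.228). The $\sqrt{2/N}$ feature scaling is the usual Rahimi-Recht normalization, and the Gaussian base measure $\nu = {\sf N}(0,I_d)$ (so that the target kernel is $K(x,\tilde{x}) = e^{-\|x-\tilde{x}\|^2/2}$) is an assumption of the sketch.

import numpy as np

# Random Fourier features approximating a Gaussian kernel.
rng = np.random.default_rng(3)
d, N = 5, 4000
x, x_t = rng.normal(size=d), rng.normal(size=d)

omega = rng.normal(size=(N, d))               # omega_k ~ nu = N(0, I_d)
b = rng.uniform(-np.pi, np.pi, size=N)        # b_k ~ Uniform[-pi, pi]
phi = lambda u: np.sqrt(2.0 / N) * np.cos(omega @ u + b)

K_exact = np.exp(-np.linalg.norm(x - x_t) ** 2 / 2.0)
K_rff = phi(x) @ phi(x_t)
print(f"K = {K_exact:.4f}, random-feature estimate = {K_rff:.4f}")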
A.12.1. The Upper Bound. The integral on the right-hand side of Eq. (A.230) is monotone decreasing in $\widehat{\Delta}_N(x,\tilde{x})$. Therefore, using the inequality $\widehat{\Delta}_N^2(x,\tilde{x}) \ge \Delta^2(x,\tilde{x}) - e_N$ from Eq. (A.229) yields
\[ IP\big[G_{t,w}(x) = G_{t,w}(\tilde{x})\big] \le \int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\sqrt{\Delta^2(x,\tilde{x}) - e_N}}\, e^{-\frac{s^2}{2(\Delta^2(x,\tilde{x}) - e_N)}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s \le \int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\sqrt{\Delta^2(x,\tilde{x}) - e_N}}\, e^{-\frac{s^2}{2\Delta^2(x,\tilde{x})}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s. \]
We proceed by expanding $(1 - e_N/\Delta^2(x,\tilde{x}))^{-1/2} = 1 + O(e_N)$:
\[ IP\big[G_{t,w}(x) = G_{t,w}(\tilde{x})\big] \le \big(1 + O(e_N)\big)\int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\Delta(x,\tilde{x})}\, e^{-\frac{s^2}{2\Delta^2(x,\tilde{x})}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s, \]
provided that $e_N$ is sufficiently small ($N$ is sufficiently large).

A.12.2. The Lower Bound. To derive a lower bound, we substitute $\widehat{\Delta}_N^2(x,\tilde{x}) \le \Delta^2(x,\tilde{x}) + e_N$, which results in
\[ IP\big[G_{t,w}(x) = G_{t,w}(\tilde{x})\big] \ge \int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\sqrt{\Delta^2(x,\tilde{x}) + e_N}}\, e^{-\frac{s^2}{2(\Delta^2(x,\tilde{x}) + e_N)}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s \ge \int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\sqrt{\Delta^2(x,\tilde{x}) + e_N}}\, e^{-\frac{s^2}{2\Delta^2(x,\tilde{x})}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s \]
\[ \ge \Bigg(\frac{1}{1 + e_N/\Delta^2(x,\tilde{x})}\Bigg)^{1/2}\int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\Delta(x,\tilde{x})}\, e^{-\frac{s^2}{2\Delta^2(x,\tilde{x})}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s. \]
Now, using the basic inequality $(1+x)^{-r} \ge 1 - rx$ for $x \ge 0$ and $r\in(0,1)$ yields
\[ IP\big[G_{t,w}(x) = G_{t,w}(\tilde{x})\big] \ge \big(1 - O(e_N)\big)\int_0^W \sqrt{\frac{2}{\pi}}\,\frac{1}{\Delta(x,\tilde{x})}\, e^{-\frac{s^2}{2\Delta^2(x,\tilde{x})}}\Big(1 - \frac{s}{W}\Big)\,{\rm d}s. \]
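To complement the bounds above, the collision probability formula (A.230) can be checked empirically. The sketch below, added for illustration, uses the Gaussian (2-stable) hash family $h(v) = \lfloor(\langle w,v\rangle + t)/W\rfloor$ of Datar et al. [11], with the raw Euclidean distance standing in for the feature-space distance $\widehat{\Delta}_N$; all parameter values are assumptions.

import numpy as np
from scipy import integrate

# Monte Carlo check of the LSH collision probability integral (A.230).
rng = np.random.default_rng(4)
d, W, trials = 8, 4.0, 200_000
x, x_t = rng.normal(size=d), rng.normal(size=d)
delta = np.linalg.norm(x - x_t)           # stands in for Delta(x, x~)

w = rng.normal(size=(trials, d))          # w ~ N(0, I_d), 2-stable
t = rng.uniform(0.0, W, size=trials)
collide = np.floor((w @ x + t) / W) == np.floor((w @ x_t + t) / W)

integrand = lambda s: (np.sqrt(2 / np.pi) / delta) \
    * np.exp(-s**2 / (2 * delta**2)) * (1 - s / W)
p_formula, _ = integrate.quad(integrand, 0.0, W)
print(f"empirical {collide.mean():.4f} vs formula {p_formula:.4f}")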
Appendix B. Proofs of auxiliary results

B.1. Proof of Lemma A.12. Let $z\in S^{d-1}$ denote an arbitrary vector on the unit sphere. We define the following function:
\[ Q_z\big((y_1,x_1),\cdots,(y_n,x_n)\big) := \big\langle z,\nabla e_n(\xi)\big\rangle. \tag{B.1} \]
Now, for every $p\in IN$, the $2p$-th moment of the random variable $Q_z$ is given by
\[ IE\big[Q_z^{2p}\big] = \int_0^\infty 2p\,u^{2p-1}\, IP\big(Q_z \ge u\big)\,{\rm d}u \overset{(a)}{\le} \int_0^\infty 2p\,u^{2p-1}\exp\Big(-\frac{nu^2}{K^2}\Big)\,{\rm d}u = 2\Big(\frac{K^2}{n}\Big)^p p!, \tag{B.4} \]
where (a) is due to the concentration bound in Eq. (B.3). Therefore,
\[ IE\big[\exp\big(Q_z^2/\sigma^2\big)\big] = \sum_{p=0}^{\infty}\frac{1}{p!\,\sigma^{2p}}\, IE\big[Q_z^{2p}\big] \le 1 + 2\sum_{p\ge 1}\Big(\frac{K^2}{n\sigma^2}\Big)^p = \frac{2}{1 - K^2/(n\sigma^2)} - 1. \]
For $\sigma = 2K/\sqrt{n}$, we obtain $IE[\exp(Q_z^2/\sigma^2)] \le 2$. Therefore, $\|Q_z\|_{\psi_2} = \|\langle z,\nabla e_n(\xi)\rangle\|_{\psi_2} \le 2K/\sqrt{n}$ for all $z\in S^{d-1}$ and $\xi\in IR_+$. Consequently, by the definition of the sub-Gaussian random vector in Eq. (A.6) of Definition A.1, we have $\|\nabla e_n(\xi)\|_{\psi_2} \le 2K/\sqrt{n}$ for every $\xi\in IR_+$. We invoke the following lemma due to [25, Lemma 16]:

Lemma B.1. (The Orlicz Norm of the Squared Vector Norms, [25, Lemma 16]) Consider the zero-mean random vector $Z$ satisfying $\|Z\|_{\psi_\nu} \le \beta$ for every $\nu \ge 1$. Then, $\big\|\|Z\|_2^2\big\|_{\psi_{\nu/2}} \le 2\nu\beta^2$.

Using Lemma B.1, we now have that $\big\|\|\nabla e_n(\xi)\|_2^2\big\|_{\psi_1} \le 16K^2/n$ for every $\xi\in IR_+$. Applying the exponential Chebyshev inequality with the Orlicz bound $16K^2/n$ yields
\[ IP\Bigg(\int_\Xi\int_0^1\Big|\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2 - IE_{x,y}\big[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\big]\Big|\,{\rm d}s\,\mu({\rm d}\xi) \ge \delta\Bigg) \]
\[ \le e^{-\frac{n\delta}{16K^2}}\, IE_{x,y}\Bigg[\exp\Bigg(\frac{n}{16K^2}\int_\Xi\int_0^1\Big|\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2 - IE_{x,y}\big[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\big]\Big|\,{\rm d}s\,\mu({\rm d}\xi)\Bigg)\Bigg] \]
\[ \overset{(a)}{\le} e^{-\frac{n\delta}{16K^2}}\int_\Xi\int_0^1 IE_{x,y}\Bigg[\exp\Bigg(\frac{n}{16K^2}\Big|\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2 - IE_{x,y}\big[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\big]\Big|\Bigg)\Bigg]\,{\rm d}s\,\mu({\rm d}\xi) \overset{(b)}{\le} 2\,e^{-\frac{n\delta}{16K^2}}, \]
where (a) follows by Jensen's inequality, and (b) follows from the fact that
\[ IE_{x,y}\Bigg[\exp\Bigg(\frac{n}{16K^2}\Big|\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2 - IE_{x,y}\big[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\big]\Big|\Bigg)\Bigg] \le 2, \tag{B.5} \]
by the definition of a sub-Gaussian random variable. Therefore,
\[ IP\Bigg(\int_\Xi\int_0^1\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\,{\rm d}s\,\mu({\rm d}\xi) \ge \delta\Bigg) \le 2\exp\Bigg(-\frac{n\big(\delta - \int_0^1\int_\Xi IE_{x,y}[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2]\,\mu({\rm d}\xi)\,{\rm d}s\big)}{16K^2}\Bigg). \tag{B.6} \]
It now remains to compute an upper bound on the expectation $IE_{x,y}[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2]$. But this readily follows from Eq. (B.4) by letting $p = 1$ and $z = \nabla e_n((1-s)\xi + s\zeta_*)/\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2$, as follows:
\[ IE_{x,y}\big[\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2^2\big] = IE_{x,y}\Bigg[\Bigg\langle\frac{\nabla e_n((1-s)\xi + s\zeta_*)}{\|\nabla e_n((1-s)\xi + s\zeta_*)\|_2},\nabla e_n((1-s)\xi + s\zeta_*)\Bigg\rangle^2\Bigg] \le \frac{2K^2}{n}. \tag{B.7} \]
Plugging the expectation upper bound of Eq. (B.7) into Eq. (B.6) completes the proof of the first part of Lemma A.12. The second part of Lemma A.12 follows by a similar approach, and we thus omit the proof. $\square$

B.2. Proof of Lemma A.22. The proof is a minor modification of the proof due to Bubeck et al. [3]. We present it for completeness. By the definition of the Wiener process, $W_t - W_{\eta\lfloor t/\eta\rfloor} \overset{\rm law}{=} W_{t - \eta\lfloor t/\eta\rfloor} \sim {\sf N}\big(0,(t - \eta\lfloor t/\eta\rfloor)I_{N\times N}\big)$. Therefore,
\[ IE\Big[\sup_{0\le t\le T}\big\|W_t - W_{\eta\lfloor t/\eta\rfloor}\big\|_2^p\Big] = IE\Big[\sup_{0\le t\le T}\big\|W_{t - \eta\lfloor t/\eta\rfloor}\big\|_2^p\Big] \tag{B.8} \]
\[ = IE\Bigg[\max_{m\in[0,T/\eta]\cap IN}\ \sup_{t\in[m\eta,(m+1)\eta)}\|W_{t - \eta m}\|_2^p\Bigg] \tag{B.9} \]
\[ \le \sum_{m=0}^{\lfloor T/\eta\rfloor} IE\Bigg[\sup_{t\in[m\eta,(m+1)\eta)}\|W_{t - \eta m}\|_2^p\Bigg]. \tag{B.10} \]
The process $t\mapsto \|W_{t - \eta m}\|_2^p$ is a sub-martingale on the interval $[m\eta,(m+1)\eta)$. Therefore, by Doob's sub-martingale maximal inequality (Theorem A.20), we obtain that
\[ IE\Bigg[\sup_{t\in[m\eta,(m+1)\eta)}\|W_{t - \eta m}\|_2^p\Bigg] \le \Big(\frac{p}{p-1}\Big)^p IE\big[\|W_\eta\|_2^p\big], \tag{B.11} \]
for all $p\in IN$, $p > 1$, and $m\in[0,T/\eta - 1]\cap IN$.
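Before completing the computation, a quick Monte Carlo illustration of the discretization gap that Lemma A.22 controls; the dimension, horizon, and step sizes below are assumed values for the sketch, and the expected supremum of the squared gap should shrink as $\eta$ decreases (roughly like $\eta\log(T/\eta)$).

import numpy as np

# Monte Carlo estimate of E[ sup_t ||W_t - W_{eta*floor(t/eta)}||^2 ].
rng = np.random.default_rng(5)
N_dim, T, reps = 3, 1.0, 500

def sup_gap_sq(eta, steps_per_block=20):
    dt = eta / steps_per_block
    n = int(T / eta) * steps_per_block
    dW = rng.normal(scale=np.sqrt(dt), size=(n, N_dim))
    W = np.cumsum(dW, axis=0)
    blocks = W.reshape(int(T / eta), steps_per_block, N_dim)
    # anchor of block m is W_{m*eta}: zero for m = 0, then block endpoints
    anchors = np.vstack([np.zeros((1, N_dim)),
                         W[steps_per_block - 1::steps_per_block][:-1]])
    gaps = blocks - anchors[:, None, :]
    return (np.linalg.norm(gaps, axis=2) ** 2).max()

for eta in (0.1, 0.05, 0.025):
    est = np.mean([sup_gap_sq(eta) for _ in range(reps)])
    print(f"eta = {eta:.3f}: E[sup ||gap||^2] ~ {est:.4f}")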
Plugging Eq. (B.11) into Eq. (B.10) yields
\[ IE\Big[\sup_{0\le t\le T}\big\|W_t - W_{\eta\lfloor t/\eta\rfloor}\big\|_2^p\Big] \le \Big\lfloor\frac{T}{\eta}\Big\rfloor\Big(\frac{p}{p-1}\Big)^p IE\big[\|W_\eta\|_2^p\big]. \tag{B.12} \]
The expectation on the right-hand side of Eq. (B.12) can be evaluated using integration in spherical coordinates:
\[ IE\big[\|W_\eta\|_2^p\big] = \frac{1}{(2\pi\eta)^{N/2}}\cdot\frac{N\pi^{N/2}}{\Gamma\big(\frac{N}{2}+1\big)}\int_0^\infty e^{-\frac{r^2}{2\eta}}\,r^{N+p-1}\,{\rm d}r = 2^{\frac{p}{2}-1}\,\eta^{\frac{p}{2}}\,N\,\frac{\Gamma\big(\frac{N+p}{2}\big)}{\Gamma\big(\frac{N+2}{2}\big)}, \tag{B.13} \]
where $\Gamma(\cdot)$ is Euler's Gamma function. Combining Eqs. (B.13) and (B.12) gives us
\[ IE\Big[\sup_{0\le t\le T}\big\|W_t - W_{\eta\lfloor t/\eta\rfloor}\big\|_2^p\Big] \le T\Big(\frac{p}{p-1}\Big)^p 2^{\frac{p}{2}-1}\,\eta^{\frac{p}{2}-1}\,N\,\frac{\Gamma\big(\frac{N+p}{2}\big)}{\Gamma\big(\frac{N+2}{2}\big)}, \tag{B.14} \]
where we used the fact that $\lfloor T/\eta\rfloor \le T/\eta$. $\square$

References

1. Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 6.
2. Dimitri Bertsekas and Angelia Nedic, Convex analysis and optimization, (2003).
3. Sébastien Bubeck, Ronen Eldan, and Joseph Lehec, Sampling from a log-concave distribution with projected Langevin Monte Carlo, Discrete & Computational Geometry (2018), no. 4, 757–783.
4. Martin Burger, Ina Humpert, and Jan-Frederik Pietschmann, On Fokker-Planck equations with in- and outflow of mass, arXiv preprint arXiv:1812.07064 (2018).
5. Martin Burger and Jan-Frederik Pietschmann, Flow characteristics in a crowded transport model, Nonlinearity (2016), no. 11, 3528.
6. Terence Chan et al., Dynamics of the McKean-Vlasov equation, The Annals of Probability (1994), no. 1, 431–441.
7. Guangliang Chen, Wilson Florero-Salinas, and Dan Li, Simple, fast and accurate hyper-parameter tuning in Gaussian-kernel SVM, 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 348–355.
8. Elliott Ward Cheney, Introduction to approximation theory, (1966).
9. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh, Two-stage learning kernel algorithms, ICML, 2010, pp. 239–246.
10. Marco Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in Neural Information Processing Systems, 2013, pp. 2292–2300.
11. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the Twentieth Annual Symposium on Computational Geometry, 2004, pp. 253–262.
12. Erick Delage and Yinyu Ye, Distributionally robust optimization under moment uncertainty with application to data-driven problems, Operations Research (2010), no. 3, 595–612.
13. Joseph Leo Doob, Stochastic processes, vol. 101, New York Wiley, 1953.
14. Bradley Efron and Charles Stein, The jackknife estimate of variance, The Annals of Statistics (1981), 586–596.
15. Lawrence C. Evans, Partial differential equations, Volume 19 of Graduate Studies in Mathematics, American Mathematical Society, 1998.
16. Rui Gao and Anton J. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, arXiv preprint arXiv:1604.02199 (2016).
17. David Gilat, The best bound in the L log L inequality of Hardy and Littlewood and its martingale counterpart, Proceedings of the American Mathematical Society (1986), no. 3, 429–436.
18. Igor Vladimirovich Girsanov, On transforming a certain class of stochastic processes by absolutely continuous substitution of measures, Theory of Probability & Its Applications (1960), no. 3, 285–301.
19. Mehmet Gönen and Ethem Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research (2011), no. Jul, 2211–2268.
20. J. Michael Harrison and Martin I. Reiman, On the distribution of multidimensional reflected Brownian motion, SIAM Journal on Applied Mathematics (1981), no. 2, 345–361.
21. John M. Holte, Discrete Gronwall lemma and applications, MAA-NCS Meeting at the University of North Dakota, vol. 24, 2009, pp. 1–7.
22. Piotr Indyk and Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 1998, pp. 604–613.
23. Adel Javanmard, Marco Mondelli, and Andrea Montanari, Analysis of a two-layer neural network via displacement convexity, arXiv preprint arXiv:1901.01375 (2019).
24. Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker-Planck equation, SIAM Journal on Mathematical Analysis (1998), no. 1, 1–17.
25. Masoud Badiei Khuzani and Na Li, Stochastic primal-dual method on Riemannian manifolds of bounded sectional curvature, 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 133–140.
26. Masoud Badiei Khuzani, Liyue Shen, Shahin Shahrampour, and Lei Xing, A mean-field theory for kernel alignment with random features in generative and discriminative models, arXiv preprint arXiv:1909.11820 (2019).
27. Hiroshi Kunita and Shinzo Watanabe, On square integrable martingales, Nagoya Mathematical Journal (1967), 209–245.
28. Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research (2004), no. Jan, 27–72.
29. Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto, Differential properties of Sinkhorn approximation for learning with Wasserstein distance, Advances in Neural Information Processing Systems, 2018, pp. 5859–5870.
30. Ester Mariucci, Markus Reiß, et al., Wasserstein and total variation distance between marginals of Lévy processes, Electronic Journal of Statistics (2018), no. 2, 2482–2514.
31. Colin McDiarmid, On the method of bounded differences, Surveys in Combinatorics (1989), no. 1, 148–188.
32. Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, arXiv preprint arXiv:1902.06015 (2019).
33. Song Mei, Andrea Montanari, and Phan-Minh Nguyen, A mean field view of the landscape of two-layer neural networks, Proceedings of the National Academy of Sciences (2018), no. 33, E7665–E7671.
34. Angelia Nedić and Asuman Ozdaglar, Approximate primal solutions and rate analysis for dual subgradient methods, SIAM Journal on Optimization (2009), no. 4, 1757–1780.
35. A. A. Novikov, On moment inequalities and identities for stochastic integrals, Proceedings of the Second Japan-USSR Symposium on Probability Theory, Springer, 1973, pp. 333–339.
36. Felix Otto, The geometry of dissipative evolution equations: the porous medium equation, (2001).
37. Victor M. Panaretos and Yoav Zemel, Statistical aspects of Wasserstein distances, Annual Review of Statistics and Its Application (2019), 405–431.
38. Neal Parikh, Stephen Boyd, et al., Proximal algorithms, Foundations and Trends® in Optimization (2014), no. 3, 127–239.
39. M. J. D. Powell, Radial basis function methods for interpolation to functions of many variables, HERCMA, Citeseer, 2001, pp. 2–24.
40. Maxim Raginsky and Svetlana Lazebnik, Locality-sensitive binary codes from shift-invariant kernels, Advances in Neural Information Processing Systems, 2009, pp. 1509–1517.
41. Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
42. Ali Rahimi and Benjamin Recht, Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning, Advances in Neural Information Processing Systems, 2009, pp. 1313–1320.
43. R. Tyrrell Rockafellar and Roger J.-B. Wets, Variational analysis, vol. 317, Springer Science & Business Media, 2009.
44. Roland Duduchava, On Poincaré, Friedrichs and Korn's inequalities on domains and hypersurfaces, arXiv preprint arXiv:1504.01677 (2015).
45. Grant M. Rotskoff and Eric Vanden-Eijnden, Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, arXiv preprint arXiv:1805.00915 (2018).
46. Igal Sason and Sergio Verdú, f-divergence inequalities, IEEE Transactions on Information Theory (2016), no. 11, 5973–6006.
47. Isaac J. Schoenberg, Metric spaces and completely monotone functions, Annals of Mathematics (1938), 811–841.
48. Aman Sinha and John C. Duchi, Learning kernels with random features, Advances in Neural Information Processing Systems, 2016, pp. 1298–1306.
49. Richard Sinkhorn and Paul Knopp, Concerning nonnegative matrices and doubly stochastic matrices, Pacific Journal of Mathematics (1967), no. 2, 343–348.
50. Justin Sirignano and Konstantinos Spiliopoulos, Mean field analysis of neural networks, arXiv preprint arXiv:1805.01053 (2018).
51. Anatoliy V. Skorokhod, Stochastic equations for diffusion processes in a bounded region, Theory of Probability & Its Applications (1961), no. 3, 264–274.
52. Leszek Słomiński, Euler's approximations of solutions of SDEs with reflecting boundary, Stochastic Processes and Their Applications (2001), no. 2, 317–337.
53. Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet, Universality, characteristic kernels and RKHS embedding of measures, Journal of Machine Learning Research (2011), no. Jul, 2389–2410.
54. Alain-Sol Sznitman, Topics in propagation of chaos, École d'été de probabilités de Saint-Flour XIX-1989, Springer, 1991, pp. 165–251.
55. Hiroshi Tanaka, Stochastic differential equations with reflecting boundary condition in convex regions, Stochastic Processes: Selected Papers of Hiroshi Tanaka, World Scientific, 2002, pp. 157–171.
56. Cédric Villani, Optimal transport: old and new, vol. 338, Springer Science & Business Media, 2008.
57. Chuang Wang, Jonathan Mattingly, and Yue Lu, Scaling limit: Exact and tractable analysis of online learning algorithms with applications to regularized regression and PCA, arXiv preprint arXiv:1712.04332 (2017).

Stanford University, 450 Serra Mall, Stanford, CA 94305.
E-mail address: mbadieik,yyye,snapel,[email protected]