A Dimension-free Computational Upper-bound for Smooth Optimal Transport Estimation
Adrien Vacher, Boris Muzellec, Alessandro Rudi, Francis Bach, Francois-Xavier Vialard
AA Dimension-free Computational Upper-bound for SmoothOptimal Transport Estimation
Adrien Vacher ◦∗ Boris Muzellec (cid:63) ∗ Alessandro Rudi (cid:63) ∗ Francis Bach (cid:63) ∗ François-Xavier Vialard ◦ ◦ LIGM, Université Gustave Eiffel, CNRS ∗ INRIA Paris, 2 rue Simone Iff, 75012, Paris, France (cid:63)
ENS - Département d’Informatique de l’École Normale Supérieure, (cid:63)
PSL Research University, 2 rue Simone Iff, 75012, Paris, France [email protected], [email protected], [email protected],[email protected], [email protected]
February 1, 2021
Abstract
It is well-known that plug-in statistical estimation of optimal transport suffers fromthe curse of dimensionality. Despite recent efforts to improve the rate of estimationwith the smoothness of the problem, the computational complexities of these recentlyproposed methods still degrade exponentially with the dimension. In this paper, thanksto an infinite-dimensional sum-of-squares representation, we derive a statistical estimatorof smooth optimal transport which achieves a precision ε from ˜ O ( ε − ) independentand identically distributed samples from the distributions, for a computational cost of ˜ O ( ε − ) when the smoothness increases, hence yielding dimension-free statistical and computational rates, with potentially exponentially dimension-dependent constants. The comparison between probability distributions is a fundamental task and has beenextensively used in machine learning. For this purpose, optimal transport (OT) has recentlygained traction in different subfields of machine learning (ML), such as natural languageprocessing (NLP) (Xu et al., 2018; Chen et al., 2018), generative modeling (Salimans et al.,2018), multi-label classification (Frogner et al., 2015), domain adaptation (Redko et al., 2019),clustering (Ho et al., 2017), and has had an impact in other areas such as imaging sciences(Feydy et al., 2017; Bonneel et al., 2011). Indeed, OT is a tool to compare data distributionswhich has arguably many more geometric properties than other available divergences (Peyréand Cuturi, 2019).In practice, the optimal transport cost is often computed for the squared distance (leadingto the Wasserstein squared distance) on sampled distributions with n observations, and it iswell-known that optimal transport suffers from the curse of dimensionality (Fournier andGuillin, 2015): the plug-in strategy, which simply consists in computing the Wassersteindistance between the sampled distributions, yields an estimation of the Wasserstein squareddistance between a density and its sampled version in O (1 /n /d ) , which degrades rapidly1 a r X i v : . [ m a t h . S T ] J a n n high dimensions; this can only be improved to O (1 /n /d ) in the case of two differentdistributions (Chizat et al., 2020).However, high dimension is the usual setting in machine learning, such as in NLP (Graveet al., 2019), and even if the intrinsic dimensionality of data can be leveraged (Weed andBach, 2019), poor theoretical rates of convergence are a recurrent feature of OT. Weedand Berthet (2019) recently showed that when the measures admit smooth densities withgeometric assumptions on their supports, those theoretical rates can be improved with adedicated non-polynomial-time algorithm to almost O (1 / √ n ) when the smoothness increases.Matching rates were then proved for the transportation maps themselves (Hütter and Rigollet,2019), under smoothness assumptions on those maps. This line of work is deeply related tothe regularity theory of optimal transport, that guarantees the smoothness of the optimalmap in Euclidean spaces under similar assumptions on the source and target distributions,and their supports (Caffarelli, 1992; De Philippis and Figalli, 2014). Yet, to this day nopractically-tractable algorithm (e.g., with polynomial time) matching the bounds of Weedand Berthet (2019) and Hütter and Rigollet (2019) is known. State of the art.
An approach that has first been advocated in the machine learningcommunity as a way to efficiently approximate empirical OT and to make it differentiableconsists in adding an entropic regularization term to the OT problem (Cuturi, 2013). Rateson the sample complexity of entropic optimal transport have then been studied by Genevayet al. (2019) and Mena and Niles-Weed (2019), and were proven to be of the order O ( ε (cid:98) d/ (cid:99) √ n ) ,for small values of ε . Although the dependency in the number of samples is in / √ n , theconstant degrades exponentially with respect to the dimension. Entropic OT and Sinkhorndivergences (Genevay et al., 2018) were then leveraged as a tool to study the sample complexityof the (unregularized) OT problem itself: the most advanced results in this direction werederived by Chizat et al. (2020), who show that with few assumptions on the regularity of theKantorovich potentials, the squared Wasserstein distance can be estimated using O ( ε − d/ ) samples and ˜ O ( ε − ( d (cid:48) +5 . ) operations (with d (cid:48) = 2 (cid:98) d (cid:99) ) with high probability.Our work can be related to a current research direction which consists in developingestimators of the Wasserstein distance for classes of smooth distributions, with smoothnessparameter m , that have better performances than in the general case. So far, two maincontributions in this direction can be found. Hütter and Rigollet (2019) derive minimax ratesfor the estimation of the OT maps and propose an estimator which necessitates, for an L error on the maps of order ε , O ( ε − m − d/ m ) samples. While statistically almost optimal, thisestimator is not computationally feasible as it requires to project the potentials on a spaceof smooth, strongly-convex functions. Instead, Weed and Berthet (2019) derive estimatorsfor the densities requiring O ( ε − d +2 m m ) samples and, under the assumption that an efficientresampler is available, derive an estimator of the OT distance that can be calculated in ˜ O ( ε − (2 d + d/ ) time.While the contributions above do succeed in taking advantage of the smoothness from astatistical point of view, they do not manage to take advantage of the smoothness from acomputational point of view. Actually, statistical-computational gaps are known to exist forsome instances of high-dimensional OT, such as the spiked transport model of Niles-Weedand Rigollet (2019). 2 ontributions. In this paper, we bridge the statistical-computational gap of smooth OTestimation and we provide a positive answer to the question whether smoothness of theoptimal potentials can be computationally beneficial to an efficient statistical estimator.More precisely, we propose an algorithm which, for a given accuracy ε , needs O ( ε − ) samplesand has a computational complexity of ˜ O ( ε − max(4 , dm − d ) ) . Note that the computationalcomplexity improves with the regularity of the distributions and, when m ≥ d , it is ˜ O ( ε − ) ,i.e., independent of the dimension d in the exponent (but not in the constants). We thusshow that smoothness can be leveraged in the computational estimation of optimal transport.Moreover, we consider different scenarios beyond i.i.d. sampling, such as the case wherewe are able to compute exact integrals or where we can evaluate the densities in given points,by representing the problem in terms of kernel mean embeddings (Muandet et al., 2017).This allows to make a unified analysis for all the cases. 
The total error is then the sum ofthe error of approximating via the kernel mean embedding plus the error of subsamplingconstraints. Interestingly, in the other scenarios the computational cost to achieve an error ε can be smaller than ε − , as reported below. This is particularly interesting in the casewe can evaluate the densities in given points and avoids to use expensive Monte-Carlosampling techniques to obtain i.i.d. samples (see Section 6 for more details). Our results aresummarized below. Theorem 1.
Let ε > . Let µ, ν satisfy Assumption 1 for some m > d . Let (cid:100) OT be theproposed estimator defined in Eq. (6) and computed as in Appendix F with the same parametersas in Corollary 3. The cost to achieve | (cid:100) OT − OT ( µ, ν ) | ≤ ε for the three scenarios is:1. (Exact integral) Time: ˜ O ( ε − dm − d ) . Space: ˜ O ( ε − dm − d ) .2. (Evaluation) Time: ˜ O ( ε − dm − d ) . Space: ˜ O ( ε − dm − d ) . evaluations of µ, ν : ˜ O ( ε − dm +1 ) .3. (Sampling) Time: ˜ O ( ε − max(4 , dm − d ) ) . Space: ˜ O ( ε − dm − d ) . samples of µ, ν : ˜ O ( ε − ) . The second key contribution of this paper is to provide a new representation theorem forsolutions of smooth optimal transport. The inequality constraint in the dual OT problem canbe replaced with an equality constraint involving a finite sum-of-squares in a Sobolev space.In comparison with Rudi et al. (2020), it is a non-trivial extension of their representationresult to the case of a continuous set of global minimizers instead of a finite set.
In this paper, we consider the optimal transport problem for the quadratic cost on boundedsubsets
X, Y of the Euclidean space R d . The set of probability measures on X is denotedby P ( X ) . The optimal transport problem with quadratic cost c ( x, y ) = (cid:107) x − y (cid:107) can bestated in its dual formulation asOT ( µ, ν ) = sup u,v ∈ C ( R d ) (cid:90) u ( x ) dµ ( x ) + (cid:90) v ( y ) dν ( y ) subject to c ( x, y ) ≥ u ( x ) + v ( y ) , ∀ ( x, y ) ∈ X × Y, (1)As a standard result in optimal transport theory, the supremum is attained and the functions u (cid:63) , v (cid:63) are referred to as the Kantorovich potentials (see Santambrogio, 2015).3he proposed approach to approximate OT ( µ, ν ) is the result of two main ingredients:(1) a suitable way to represent smooth functions and to approximate their integral in µ, ν ,(2) a way to enforce efficiently the dense set of constraints on u, v . Preliminary step: Representing smooth functions and integrals.
We representsmooth functions via a reproducing Kernel Hilbert space (RKHS) (Aronszajn, 1950; Steinwartand Christmann, 2008), for which functions can be represented as linear forms. In Section 4we show that under smoothness assumptions on µ and ν (Assumption 1) we have u ∈ H X and v ∈ H Y where H X and H Y are two suitable RKHS on X and Y , associated with twobounded continuous feature maps φ X : X → H X and φ Y : Y → H Y . Note that RKHSsoffer several advantages. First, leveraging the reproducing property, we can represent theintegrals in the functional of Eq. (1) as inner products in terms of the kernel mean embeddings w µ ∈ H X and w ν ∈ H Y where w µ = (cid:82) X φ X ( x ) dµ ( x ) and w ν = (cid:82) Y φ Y ( y ) dν ( y ) . Indeed, bythe reproducing property, for all u ∈ H X , we have: (cid:90) X u ( x ) dµ ( x ) = (cid:90) X (cid:104) u, φ X ( x ) (cid:105) H X dµ ( x ) = (cid:10) u, (cid:90) X φ X ( x ) dµ ( x ) (cid:11) H X = (cid:104) u, w µ (cid:105) H X , and the same reasoning holds for the integral on ν , i.e., (cid:82) Y v ( y ) dν ( y ) = (cid:104) v, w ν (cid:105) H Y , for all v ∈ H Y . This construction is known as kernel mean embedding (Muandet et al., 2017).Moreover, RKHSs allow the so-called kernel trick (Steinwart and Christmann, 2008), i.e., toexpress the resulting algorithm in terms of kernel functions that in our case correspond to k X ( x, x (cid:48) ) = (cid:104) φ X ( x ) , φ X ( x (cid:48) ) (cid:105) H X and k Y ( y, y (cid:48) ) = (cid:104) φ Y ( y ) , φ Y ( y (cid:48) ) (cid:105) H Y , that are known explicitlyand are easily computable in O ( d ) . The main step: Dealing with a dense set of inequalities.
Even assuming that weare able to compute integrals in closed form and restricting to m -times differentiable u, v ,the main challenge is to deal with the dense set of inequalities c ( x, y ) ≥ u ( x ) + v ( y ) that u, v must satisfy, for any ( x, y ) ∈ X × Y . Indeed, an intuitive approach would be to subsamplethe set, i.e., to take (cid:96) points (˜ x , ˜ y ) , . . . , (˜ x (cid:96) , ˜ y (cid:96) ) in X × Y and consider only the constraints c (˜ x j , ˜ y j ) ≥ u (˜ x j ) + v (˜ y j ) for j = 1 , . . . , (cid:96) . This approach, however, is only able to leveragethe Lipschitzianity of u, v (Aubin-Frankowski and Szabo, 2020) and leads to an error in theorder of (cid:96) − /d that does not allow to break the dependence in d in the exponent, and yieldsrates that are equivalent to the plugin estimator.In this paper, we leverage a more refined technique to approximate the dense set ofinequalities, introduced by Rudi et al. (2020) for the problem of non-convex optimization, andthat allows to break the curse of dimensionality for smooth problems. The idea behind thistechnique is the consideration that, while a dense set of inequalities is poorly approximatedby subsampling, the situation is different in the case of a dense set of equality constraints,for which an optimal rate of O ( (cid:96) − m/d ) is achievable for m -times differentiable constraints(Wendland and Rieger, 2005). The construction works in two steps: first, substitute theinequality constraints with equality constraints that are equivalent, and then subsample. Inthe next two paragraphs we explain how to apply this approach to the problem of OT. Removing the inequalities: positive definite operator characterization.
To applythe construction recalled above to our scenario, we first consider the following problem. Let4 XY be a Hilbert space on X × Y and φ : X × Y → H XY . Denote by k XY the kernel k XY ( x, y, x (cid:48) , y (cid:48) ) = (cid:104) φ ( x, y ) , φ ( x (cid:48) , y (cid:48) ) (cid:105) H XY for any ( x, y ) , ( x (cid:48) , y (cid:48) ) ∈ X × Y and by S + ( H XY ) the space of positive operators on H XY . We define max u ∈H X ,v ∈H Y ,A ∈ S + ( H XY ) (cid:104) u, w µ (cid:105) H X + (cid:104) v, w ν (cid:105) H Y subject to ∀ ( x, y ) ∈ X × Y c ( x, y ) − u ( x ) − v ( y ) = (cid:104) φ ( x, y ) , Aφ ( x, y ) (cid:105) H XY , (2)where the inequality in (1) is substituted with an equality w.r.t. a positive definite operator A on H XY . Note that Problem (1) is a relaxation of Problem (2): indeed, if for a given pair u ∈ H X , v ∈ H Y there exists a positive definite A satisfying the equality above, then c ( x, y ) − u ( x ) − v ( y ) = (cid:104) φ ( x, y ) , Aφ ( x, y ) (cid:105) H XY ≥ , ∀ ( x, y ) ∈ X × Y, so the couple ( u, v ) is admissible for (1). However, even for an admissible couple in (1)satisfying u ∈ H X , v ∈ H Y , a positive operator A may not exist. Indeed, note that thetechnique of representing a positivity constraint in terms of a positive matrix has a longhistory in the community of polynomial optimization (Lasserre, 2001; Parrilo, 2003; Lasserre,2009), which shows that in general the resulting problem is not equivalent to the originalone, for any chosen degree of polynomial approximation. This fact leads to the so-called sum of squares hierarchies , also used for optimal transport (Henrion and Lasserre, 2020).Instead, using kernels, Rudi et al. (2020) showed that there exists a positive operator withfinite rank that matches the constraints and makes the two problems equivalent, when theconstraint is attained on a finite set of points. However, such existence results cannot be usedfor the problem in (2), since in the case of optimal transport the set of zeros corresponds tothe graph of the optimal transport map and is a smooth manifold, when µ, ν are smooth(De Philippis and Figalli, 2014).A crucial point of our contribution is then to prove that, with a quadratic cost c ( x, y ) = (cid:107) x − y (cid:107) and under the same assumptions on the densities and their supports, or undersmoothness assumptions on the Kantorovich potentials, there exists a positive operator fora suitable Hilbert space that represents the function c ( x, y ) − u ( x ) − v ( y ) for a pair u, v maximizing (1), making the two problems equivalent. The result is reported in Theorem 5.The proof is derived using the Fenchel dual characterization of u (cid:63) , v (cid:63) and gives a sharpcontrol of the rank of A . Subsampling the constraints and approximating the integrals.
We restrict theconstraint of (2) to (˜ x , ˜ y ) , . . . , (˜ x (cid:96) , ˜ y (cid:96) ) ⊂ X × Y for (cid:96) ∈ N . However, we need to add apenalization for u, v and A to avoid overfitting, since the error induced by subsampling theconstraints is proportional to the trace of A (Rudi et al., 2020) and, in our case, also to thenorms of u, v , as derived in Theorem 9 in Section 5. Finally, in two of the three scenariosof interest for the paper, i.e., (i) when we can only evaluate µ, ν pointwise, or (ii) when wehave only i.i.d. samples from µ, ν , we do not have access to the kernel mean embeddings w µ ∈ H X , w ν ∈ H Y . Therefore, we need to use some estimators ˆ w µ ∈ H X , ˆ w ν ∈ H Y thatare derived in Section 6. The resulting problem is the following, for some regularization5arameters λ , λ > : max u ∈H X ,v ∈H Y ,A ∈ S + ( H XY ) (cid:104) u, ˆ w µ (cid:105) H X + (cid:104) v, ˆ w ν (cid:105) H Y − λ Tr( A ) − λ ( (cid:107) u (cid:107) H X + (cid:107) v (cid:107) H Y ) subject to ∀ j ∈ [ (cid:96) ] c (˜ x j , ˜ y j ) − u (˜ x j ) − v (˜ y j ) = (cid:104) φ (˜ x j , ˜ y j ) , Aφ (˜ x j , ˜ y j ) (cid:105) H XY . (3)Let ˆ u, ˆ v be the maximizers of the problem above (unique since the problem is strongly concavein u, v ). The estimator for OT we consider corresponds to (cid:100) OT = (cid:104) ˆ u, ˆ w µ (cid:105) H X + (cid:104) ˆ v, ˆ w ν (cid:105) H Y . (4) Finite-dimensional characterization.
In Appendix F, following Marteau-Ferey et al.(2020), we derive the dual problem of Eq. (3). Define Q ∈ R (cid:96) × (cid:96) as Q i,j = k X (˜ x i , ˜ x j )+ k Y (˜ y i , ˜ y j ) and z j = ˆ w µ (˜ x j ) + ˆ w ν (˜ y j ) − λ c (˜ x j , ˜ y j ) for i, j ∈ [ (cid:96) ] and q = (cid:107) ˆ w µ (cid:107) H X + (cid:107) ˆ w ν (cid:107) H Y , and let I (cid:96) ∈ R (cid:96) × (cid:96) be the identity matrix. Let K ∈ R (cid:96) × (cid:96) be defined as K i,j = k XY (˜ x i , ˜ y i , ˜ x j , ˜ y j ) anddefine Φ i ∈ R (cid:96) as the i -th column of R , the upper triangular matrix corresponding to theCholesky decomposition of K (i.e., R satisfies K = R (cid:62) R ). The dual problem writes: min γ ∈ R (cid:96) λ γ (cid:62) Q γ − λ (cid:80) (cid:96)j =1 γ j z j + q λ such that (cid:80) (cid:96)j =1 γ j Φ j Φ (cid:62) j + λ I (cid:96) (cid:23) . (5)In the same section in Appendix F, we derive an explicit characterization of ˆ u, ˆ w, ˆ A in termsof ˆ γ , the solution of the problem above and we characterize (cid:100) OT as follows: (cid:100) OT = q λ − λ (cid:80) (cid:96)j =1 ˆ γ j ( ˆ w µ (˜ x j ) + ˆ w ν (˜ y j )) . (6)As it is possible to observe from the problem above and the characterization of (cid:100) OT, the onlyquantities necessary to compute ˆ γ and (cid:100) OT are the kernels k X , k Y , k XY and the evaluationof the functions ˆ w µ ∈ H X , ˆ w ν ∈ H Y at the points ˜ x j and ˜ y j respectively for j ∈ [ (cid:96) ] . InAppendix F, we consider a Newton method with self-concordant barrier to solve the problemabove (Nesterov and Nemirovskii, 1994). To illustrate that this algorithm can indeed beimplemented in practice, we run simulations on toy data in Appendix G. The total cost ofthe procedure to achieve error ε for the computation of (cid:100) OT is the following (see Theorem 17,Page 26 in the appendix): O (cid:0) C + E(cid:96) + (cid:96) . log (cid:96)ε (cid:1) time , O ( (cid:96) ) memory , (7)where C is the cost for computing q and E is the cost to compute one z j . Depending onthe operations that we are able to perform on µ, ν and k X , k Y , we have three scenarios. InSection 6 we specify how to compute the vectors ˆ w µ , ˆ w ν , in Corollary 3 we report only theconditions of applicability and the resulting cost. In the next section and then in Appendix Fwe quantify instead how to choose (cid:96), λ , λ to achieve | (cid:100) OT − OT ( µ, ν ) | ≤ ε and we provide acomplete computational complexity in ε . Here we quantify the convergence rate of (cid:100)
OT to OT. To simplify the exposition, in thissection we will make a classical assumption on the smoothness of the densities (De Philippisand Figalli, 2014). Note however that the results of the paper hold under a more generalassumption on the smoothness of the potentials (see Theorem 5).6 ssumption 1 ( m -times differentiable densities) . Let m, d ∈ N . Let µ, ν ∈ P ( R d ) .a) µ, ν have densities. Their supports, resp. X, Y ⊂ R d are convex, bounded and open;b) the densities are finite, bounded away from zero, with Lipschitz derivatives up to order m . Assumption 1 is particularly adapted to our context since it guarantees that the Kan-torovich potentials have a similar order of differentiability (De Philippis and Figalli, 2014).The main result, Theorem 9, is expressed for a general set of couples (˜ x j , ˜ y j ) , j ∈ [ (cid:96) ] . Here,we specify it for the case where the couples are sampled independently and uniformly atrandom. Theorem 2.
Let µ, ν satisfy Assumption 1 for m > d and m ≥ . Let (cid:96) ∈ N and δ ∈ (0 , .Let (˜ x j , ˜ y j ) be independently sampled from the uniform distribution on X × Y . Let (cid:100) OT becomputed with k X = k m +1 , k Y = k m +1 and k XY = k m where k s for s > is the Sobolevkernel in Eq. (9) . Then, there exists (cid:96) depending only on d, m, R and C , C depending onlyon u (cid:63) , v (cid:63) , such that when (cid:96) ≥ (cid:96) and λ , λ are chosen to satisfy λ ≥ C (cid:96) − m/ d +1 / log (cid:96)δ , λ ≥ (cid:107) w µ − ˆ w µ (cid:107) H X + (cid:107) w ν − ˆ w ν (cid:107) H Y + λ , (8) then, with probability − δ , we have | (cid:100) OT − OT ( µ, ν ) | ≤ C λ . Note that while the rate does not depend exponentially in d as we are going to see in therest of the section, the constants (cid:96) , C , C depend exponentially in d in the worst case, asRudi et al. (2020) for the case of global optimization. From the theorem above it is clearthat the approximation error of (cid:100) OT is the sum of the error induced by the kernel meanembeddings plus the error induced by the subsampling of the inequality. Note here thatthe result of the theorem above holds also if the (cid:96) couples are i.i.d. from ρ = µ ⊗ ν , asdiscussed in Remark 10. This can be beneficial if we do not know X, Y or we do not knowhow to sample from them. In the next corollary we will specialize the result depending onthe considered scenarios.
Corollary 3.
Under the same assumptions as Theorem 2, let k X = k m +1 , k Y = k m +1 , k XY = k m where k s for s > is in Eq. (9) and λ ≥ C (cid:96) − ( m − d ) / d log (cid:96)δ . Compute (cid:100) OT with ˆ w µ , ˆ w ν chosen according to one of the three scenarios below, as in Section 6. There exist C, C (cid:48) , C (cid:48) , C (cid:48)(cid:48) s.t.1. (Exact integral) When we are able to compute exactly (cid:82) k X ( x, x (cid:48) ) dµ ( x (cid:48) ) dµ ( x ) and also (cid:82) X k X ( x, x (cid:48) ) dµ ( x ) for any x (cid:48) ∈ X (and analogously for ν ). Choose λ = λ . Then, | (cid:100) OT − OT ( µ, ν ) | ≤ C (cid:96) − ( m − d ) / d log (cid:96)δ .
2. (Evaluation) When we are only able to evaluate µ, ν on given points and to compute (cid:82) X k X ( x, z ) k X ( x (cid:48) , z ) dz, (cid:82) X (cid:82) X k X ( x, z ) k X ( z, z (cid:48) ) k X ( x (cid:48) , z (cid:48) ) dzdz (cid:48) . Evaluate µ in n µ pointssampled uniformly from X (and n ν for ν ). Let λ = λ + C ( n µ + n ν ) − ( m +1) /d log n µ + n ν δ , | (cid:100) OT − OT ( µ, ν ) | ≤ C (cid:48) ( n − ( m +1) /dµ log n µ δ + n − ( m +1) /dν log n ν δ + (cid:96) − ( m − d ) / d log (cid:96)δ ) .
3. (Sampling) When we are only able to sample from µ, ν , by using n µ i.i.d. samples from µ and n ν from ν . Choose λ = λ + C (cid:48) ( n µ + n ν ) − / log n µ + n ν δ . Then, | (cid:100) OT − OT ( µ, ν ) | ≤ C (cid:48)(cid:48) ( n − / µ log n µ δ + n − / ν log n ν δ + (cid:96) − ( m − d ) / d log (cid:96)δ ) . Notations and background
Let n ∈ N , we denote by [ n ] the set { , . . . , n } . For a set Z , and a positive definite kernel k : Z × Z → R (i.e., so that all matrices of pairwise kernel evaluations are positive semi-definite), we can define a reproducing kernel Hilbert spaces (Aronszajn, 1950) H of realfunctions on Z , endowed with an inner product (cid:104)· , ·(cid:105) H , and a norm (cid:107) · (cid:107) H . It satisfies: (1) k ( z, · ) ∈ H for any z ∈ Z and (2) it has the reproducing property , i.e., for any f ∈ H , z ∈ Z it holds that f ( z ) = (cid:104) f, k ( z, · ) (cid:105) H . The canonical feature map associated to H is the map φ : Z → H corresponding to z (cid:55)→ k ( z, · ) , so that k ( z, z (cid:48) ) = (cid:104) φ ( z ) , φ ( z (cid:48) ) (cid:105) H (Aronszajn, 1950).In this paper we use Sobolev spaces, defined on Z ⊆ R q , with q ∈ N , an open set.For s ∈ N , denote by H s ( Z ) the Sobolev space of functions whose weak-derivatives up toorder s are square-integrable, i.e., H s ( Z ) := { f ∈ L ( Z ) | (cid:107) f (cid:107) H s ( Z ) < ∞} and (cid:107) f (cid:107) H s ( Z ) := (cid:80) | α |≤ s (cid:107) D α f (cid:107) L ( Z ) (Adams and Fournier, 2003). A remarkable property of H s ( Z ) that weare going to use in the proofs is that H s ( Z ) ⊂ C k ( Z ) for any s > q/ k and k ∈ N .Moreover H m +1 ( Z ) ⊂ H m ( Z ) , ∀ m ∈ N . Proposition 4 (Sobolev kernel, Wendland (2004)) . Let Z ⊂ R q , q ∈ N , be an open boundedset. Let s > q/ . Define k s ( z, z (cid:48) ) = c s (cid:107) z − z (cid:48) (cid:107) s − q/ K s − q/ ( (cid:107) z − z (cid:48) (cid:107) ) , ∀ z, z (cid:48) ∈ Z, (9) where K : R → R is the Bessel function of the second kind (see, e.g., Eq. 5.10 of the samebook) and c s = q/ − s Γ( s − q/ . Then the function k s is a kernel. Denoting by H Z the associatedRKHS, when Z has Lipschitz boundary, then H Z = H s ( Z ) and the norms are equivalent. In the particular case of s = q/ / , we have k s ( z, z (cid:48) ) = exp( −(cid:107) z − z (cid:48) (cid:107) ) . Note thatthe constant c s is chosen such that k s ( z, z ) = 1 for any z ∈ Z . We start with the following representation result on the structure of the optimal potentials,which is one of our main contributions in this paper.
Theorem 5.
Let µ ∈ P ( X ) , ν ∈ P ( Y ) satisfying Assumption 1a and let ( u (cid:63) , v (cid:63) ) be Kan-torovich potentials such that u (cid:63) ∈ H s +2 ( X ) and v (cid:63) ∈ H s +2 ( Y ) for s > d + 1 . There existfunctions w , ..., w d ∈ H s ( X × Y ) such that (cid:107) x − y (cid:107) − u (cid:63) ( x ) − v (cid:63) ( y ) = (cid:80) di =1 w i ( x, y ) , ∀ ( x, y ) ∈ X × Y. Proof
Denote by h the function h ( x, y ) = c ( x, y ) − u (cid:63) ( x ) − v (cid:63) ( y ) for all ( x, y ) ∈ X × Y .Let f ( x ) = (cid:107) x (cid:107) − u (cid:63) ( x ) , x ∈ X . By Brenier’s theorem (Brenier, 1987; Santambrogio, 2015,Theorem 1.22) for quadratic optimal transport,(i) T = ∇ f , where T is the optimal transport map from µ to ν ,(ii) f is convex on X ,(iii) h is characterized by h ( x, y ) = f ( x ) + f (cid:63) ( y ) − x (cid:62) y , where f (cid:63) : y ∈ Y (cid:55)→ sup x ∈ X x (cid:62) y − f ( x ) is the Fenchel-Legendre conjugate of f . Moreover, f (cid:63) ( y ) = (cid:107) y (cid:107) − v (cid:63) ( y ) .8 .0 0.2 0.4 0.6 0.8 1.0 x y c ( x , y ) u * ( x ) v * ( y ) T ( x ) x z c ( x , T ( z )) u * ( x ) v * ( T ( z )) Figure 1:
Left : dual constraint function h ( x, y ) corresponding to the measures in Figure 2 inAppendix G. Note that h attains its minimum on the graph of transportation map T , and iselsewhere positive. Right : changing parameterization flattens h in the neighborhood of thegraph of T . The red dotted line represent the zeros of both functions, and coincides with thegraph of T in the original parameterization ( left ).Further, from the properties of Fenchel-Legendre conjugacy (Rockafellar, 1970, Section 26),we have T − = ∇ f (cid:63) . Hence, since u (cid:63) ∈ H s +2 ( X ) and v (cid:63) ∈ H s +2 ( Y ) , we have(iv) T = ∇ f (resp. T − = ∇ f (cid:63) ) is a H s +1 -diffeomorphism from X to Y (resp. Y to X ).Since f ∈ H s +2 ( X ) and s > d/ and X is a bounded open set with locally Lipschitz boundary(see Lemma 11), then H s +2 ( X ) ⊂ C ( X ) (Adams and Fournier, 2003) and the Hessian H f ( x ) is well defined for any x ∈ X . Since, by (iv) , T = ∇ f is a diffeomorphism, then, byFenchel-Legendre conjugacy, f is strictly convex (Rockafellar, 1970). Hence by compactnessof X , f has a Hessian H f ( x ) which is bounded away from . This implies:(v) There exists ρ > such that H f ( x ) (cid:23) ρ Id for all x ∈ X .We will now use the decomposition (iii) along a with a reparameterization of h to obtain adecomposition as a sum of squares. Let ˜ h ( x, z ) def = h ( x, T ( z )) , ∀ ( x, z ) ∈ X × X. (10)The effect of this change of coordinates is illustrated in Figure 1. Since f is differentiable, bythe properties of the Fenchel-Legendre conjugate it holds that f (cid:63) ( ∇ f ( z )) = ∇ f ( z ) (cid:62) z − f ( z ) for any z ∈ X (Rockafellar, 1970). Therefore, from (i) we have that ˜ h ( x, z ) = f ( x ) − f ( z ) − ∇ f ( z ) (cid:62) ( x − z ) . (11)Now, since X is convex, we can characterize f in terms of its Taylor expansion: f ( x ) = f ( z ) + ∇ f ( z ) (cid:62) ( x − z ) + ( x − z ) (cid:62) R ( x, z )( x − z ) , ∀ x, z ∈ X, where R is the integral reminder R ( x, z ) def = (cid:82) (1 − t ) H f ((1 − t ) x + tz )d t . This implies ˜ h ( x, z ) = ( x − z ) (cid:62) R ( x, z )( x − z ) , ∀ x, z ∈ X. (12)9rom (v), we get ∀ x, z, R ( x, z ) (cid:23) ρ Id . In particular, for all x, z ∈ X , the matrix R ( x, z ) admits a positive square root (cid:112) R ( x, z ) . Further, since √· is C ∞ on the closed set { A ∈ S + ( R d ) : A (cid:23) ρ Id } and ∂ ∂x i ∂x j f ∈ H s ( X ) for all i, j ∈ [ d ] , the functions r i,j : ( x, z ) (cid:55)→ e (cid:62) i (cid:112) R ( x, z ) e j are in H s ( X × X ) for all i, j ∈ [ d ] (see Proposition 1 and Assumption 2b ofRudi et al., 2020), where ( e , ..., e d ) is the canonical ONB of R d . Define now the functions ˜ w i ( x, z ) def = d (cid:88) j =1 r i,j ( x, z )( e (cid:62) j ( x − z )) , ∀ x, z ∈ X, i ∈ [ d ] . From the above arguments, it holds that ˜ w i ∈ H s ( X × X ) , i ∈ [ d ] , and ˜ h ( x, z ) = (cid:80) di =1 ˜ w i ( x, z ) .Now, since T is a H s +1 -diffeomorphism from X to Y , changing parameterization again we have h ( x, y ) = (cid:80) di =1 w i ( x, y ) , ∀ ( x, y ) ∈ X × Y , with w i ( x, y ) = ˜ w i ( x, T − ( y )) , ∀ ( x, y ) ∈ X × Y .We conclude by proving that w i ∈ H s ( X × Y ) for all i ∈ [ d ] . Indeed, from (iv) T − is a H s +1 -diffeomorphism from ¯ Y to ¯ X and it is bi-Lipschitz (since T and T − are Lipschitzdue to the continuity of their Hessian and the boundedness of X, Y ). Define the map Q as ( x, y ) (cid:55)→ ( x, T − ( y )) and note that Q ∈ H s +1 ( X × Y, R d ) , by construction. 
Note that ˜ w i ∈ H s ( X × X ) and has bounded weak derivatives of order , since s > d + 1 and H s ( X × X ) is bounded (Adams and Fournier, 2003). The conditions above on ˜ w i , Q guarantee that w i = ˜ w i ◦ Q belongs to H s ( X × Y ) (see, e.g., Theorem 1.2 of Campbell et al., 2015).Theorem 5 implies the existence of A (cid:63) ∈ S + ( H XY ) by effect of the reproducing property,when we consider a RKHS H XY containing H s ( X × Y ) . The proof is in Appendix A, Page 16. Corollary 6.
Let H XY be a RKHS such that H s ( X × Y ) ⊆ H XY . Under the hypothesis ofTheorem 5, there exists a positive operator A (cid:63) ∈ S + ( H XY ) with rank at most d , such that c ( x, y ) − u (cid:63) ( x ) − v (cid:63) ( y ) = (cid:104) φ ( x, y ) , A ∗ φ ( x, y ) (cid:105) H XY . (13)Finally, the following corollary shows the effect of Assumption 1 on the existence of A (cid:63) . Note indeed that such an assumption implies smoothness of the Kantorovich potentials(De Philippis and Figalli, 2014). If m > d they are smooth enough to apply Theorem 5 andthe corollary above. The proof of the following corollary is in Appendix A, Page 16. Corollary 7 (Effects of Asm. 1) . Let µ, ν satisfy Assumption 1 for m > d . Let ( u (cid:63) , v (cid:63) ) beKantorovich potentials for µ, ν . Let H X = H m +3 ( X ) , H Y = H m +3 ( Y ) , H XY = H m ( X × Y ) .Then u (cid:63) ∈ H X , v (cid:63) ∈ H Y and there exists A (cid:63) ∈ S + ( H XY ) satifying Eq. (13) and rank A (cid:63) ≤ d . Given (cid:96) ∈ N and a set of points ˜ Z (cid:96) = { (˜ x , ˜ y ) , . . . , (˜ x (cid:96) , ˜ y (cid:96) ) } , define the fill distance (Wendland,2004), h (cid:96) = sup x ∈ X,y ∈ Y min j ∈ [ (cid:96) ] (cid:107) ( x, y ) − (˜ x j , ˜ y j ) (cid:107) . (14)The following theorem, that is an adaptation of Theorem 4 from Rudi et al. (2020), quantifiesthe error of subsampling the constraints in terms of the fill distance.10 heorem 8. Let
X, Y satisfy Assumption 1a. Let s ≥ and s > d . Let H X ⊆ H s ( X ) , H Y ⊆ H s ( Y ) , H XY = H s ( X × Y ) . Let ˜ Z (cid:96) ⊂ X × Y be a set of points of cardinality (cid:96) and filldistance h (cid:96) and let u ∈ H X , v ∈ H Y , A ∈ S + ( H XY ) satisfy c (˜ x j , ˜ y j ) − u (˜ x j ) − v (˜ y j ) = (cid:104) φ (˜ x j , ˜ y j ) , Aφ (˜ x j , ˜ y j ) (cid:105) H XY , ∀ j ∈ [ (cid:96) ] . (15) There exist h , C depending only on s, d, X, Y such that, when h (cid:96) ≤ h , then c ( x, y ) ≥ a ( x ) + b ( y ) − ε, ∀ x, y ∈ X × Y, where ε = Qh s − d(cid:96) , where Q = C ( (cid:107) u (cid:107) H X + (cid:107) v (cid:107) H Y + Tr( A )) . Note that h , C depend only on d, m, X, Y . The proof of the theorem above is reported in Appendix B, Page 17. Using the theoremabove we are able to show that, given a maximizer (ˆ u, ˆ v, ˆ A ) of Problem 3 the couple (ˆ u − ε/ , ˆ v − ε/ is admissible for Problem 1. This is a crucial step in the bound of | (cid:100) OT − OT | in terms of the fill distance h (cid:96) and it is stated next. Theorem 9.
Let µ, ν and
X, Y ⊂ R d satisfy Assumption 1a. Let s > d and let H X , H Y , H XY be RKHS on X, Y, X × Y such that H X ⊆ H s ( X ) , H Y ⊆ H s ( Y ) , H XY = H s ( X × Y ) . Assumethat there exist two Kantorovich potentials u (cid:63) , v (cid:63) such that u (cid:63) ∈ H X , v (cid:63) ∈ H Y and there exists A (cid:63) ∈ S + ( X × Y ) that satisfies Eq. (13) . Let (cid:100) OT be computed according to Eq. (6) using aset of (cid:96) ∈ N points ˜ Z (cid:96) ⊂ X × Y with fill distance h (cid:96) . Let h , C as in Theorem 8. Let η = C h s − d(cid:96) , γ = (cid:107) w µ − ˆ w µ (cid:107) H X + (cid:107) w ν − ˆ w ν (cid:107) H Y . When h (cid:96) ≤ h , λ > and λ is chosen such that λ ≥ η , then | (cid:100) OT − OT | ≤ λ Tr( A (cid:63) ) + 6 ( γ + η ) λ + 6 λ (cid:0) (cid:107) u (cid:63) (cid:107) H X + (cid:107) v (cid:63) (cid:107) H Y ) . The proof of the theorem above is in Appendix B, Page 17. Theorem 9 together withTheorem 5 (in particular Corollary7) allow to prove Theorem 2. To bound h (cid:96) in terms of (cid:96) ,we used a result recalled in Lemma 12. Proof of Theorem 2.
First note that k X = k m +1 , k Y = k m +1 , k XY = k m imply that H X = H m +1 ( X ) , H Y = H m +1 ( X ) , H XY = H m ( X × Y ) . Then, by Corollary 7 we havethat under Assumption 1 for any Kantorovich potentials u (cid:63) , v (cid:63) there exists a finite rank A (cid:63) ∈ S + ( H XY ) such that c ( x, y ) − u (cid:63) ( x ) − v (cid:63) ( y ) = (cid:104) φ ( x, y ) , A ∗ φ ( x, y ) (cid:105) H XY , and that u (cid:63) ∈ H m +3 ( X ) ⊆ H m +1 ( X ) = H X and analogously v (cid:63) ∈ H Y . Among them, we select thetriplet minimizing (cid:107) u (cid:63) (cid:107) H X + (cid:107) v (cid:63) (cid:107) H Y + Tr( A (cid:63) ) and we denote by Q µ,ν the resulting minimum.The proof is obtained by using this triplet in Theorem 9 and bounding h (cid:96) as follows. Firstnote that X × Y is a convex bounded set, since X, Y have the same property. As recalledin Lemma 11, Page 19 of the appendix, convex bounded sets have the so-called uniforminterior cone condition . This guarantees that (cid:96) i.i.d. points sampled from a distribution ρ ,that has a density bounded away from zero, achieve the following bound on the fill distance: h (cid:96) ≤ ( C(cid:96) − log( C (cid:48) (cid:96)/δ )) − / (2 d ) , with probability at least − δ , when (cid:96) ≥ (cid:96) . Here (cid:96) , C, C (cid:48) are constants that depend only on d, X × Y and the constant c for which the density of ρ isbounded away from zero. The final constants C , C depend also on Q µ,ν .We conclude with the following remark that is useful when we do not know the supports X, Y or we are not able to sample i.i.d. points from the uniform distribution on them.11 emark 10 (Sampling from X × Y using µ ⊗ ν ) . Since, under Assumption 1, we have µ and ν bounded away from , we can compute (cid:100) OT by using (cid:96) i.i.d. samples from ρ = µ ⊗ ν ,obtaining the same guarantees as Theorem 2. Indeed, by inspecting the proof above, it isrequired only a ρ that has support X × Y , has density and it is bounded away from . However,the constants (cid:96) , C , C will depend also on how far the density of ρ is bounded away from . µ, ν In this section we consider three classes of estimators ( ˆ w µ , ˆ w ν ) for the kernel mean embeddings w µ ∈ H X and w ν ∈ H Y defined as w µ = (cid:82) φ X ( x ) dµ ( x ) and w ν = (cid:82) φ Y ( y ) dν ( y ) . As weobserved in the introduction to Eq. (5), the operations we need to perform on ˆ w µ , ˆ w ν tocompute the algorithm are the evaluation of the norm (cid:107) ˆ w µ (cid:107) H X and the evaluation of ˆ w µ (˜ x i ) for i ∈ [ (cid:96) ] (and the same for ˆ w ν ). For each class we will specify such operations. Here,we assume that φ X , φ Y are uniformly bounded maps (as for the Sobolev kernel, where sup x ∈ X (cid:107) φ X ( x ) (cid:107) H X = sup x ∈ X k ( x, x ) = 1 , see Proposition 4). Clearly a class of estimatorsmust be chosen if we are able to perform the required operations. Exact integral.
Here we take ˆ w µ := w µ and ˆ w ν = w ν and we report only the operationsfor µ since the ones for ν are the same. This is the estimator that leads to the best rates aswe are going to see later. However it requires to perform the most difficult operations, i.e.,(1) ˆ w µ (˜ x i ) = (cid:104) w µ , φ X (˜ x i ) (cid:105) H X = (cid:82) X (cid:104) φ X (˜ x i ) , φ X ( x ) (cid:105) dµ ( x ) = (cid:82) X k X (˜ x i , x ) dµ ( x ) for all i ∈ [ (cid:96) ] ;(2) (cid:107) ˆ w µ (cid:107) H X = (cid:104) w µ , w µ (cid:105) H X = (cid:82) X (cid:104) φ X ( x ) , φ X ( x (cid:48) ) (cid:105) H X dµ ( x ) dµ ( x (cid:48) ) = (cid:82) X k X ( x, x (cid:48) ) dµ ( x ) dµ ( x (cid:48) ) .Moreover the costs C, E in Eq. (7) are C = O (1) , E = O (1) . Evaluation estimator.
Here we assume to be able to evaluate µ ( x j ) in a given setof points x , . . . , x n µ with n µ ∈ N (analogously for ν on y , . . . , y n ν with n ν ∈ N ). Wedefine the estimators as ˆ w µ = (cid:82) X φ X ( x )ˆ g µ ( x ) dx (analogously for ˆ w ν ). Here ˆ g µ ∈ H X is the kernel least squares estimator (Narcowich et al., 2005) of the density µ defined as ˆ g µ ( x ) = (cid:80) j ∈ [ n µ ] α j k X ( x j , x ) where α = K − X c µ and K X ∈ R n µ × n µ , ( K X ) i,j = k X ( x i , x j ) for all i, j ∈ [ n µ ] , while c µ ∈ R n µ , c µ = ( µ ( x ) , . . . , µ ( x n )) . In this case the steps are: (1) ˆ w µ (˜ x i ) = (cid:82) X (cid:104) φ X (˜ x i ) , φ X ( x ) (cid:105) ˆ g µ ( x ) dx = (cid:80) j ∈ [ n µ ] α j (cid:82) X k X (˜ x i , x ) k X ( x j , x ) dx, i ∈ [ (cid:96) ] , and (2) (cid:107) ˆ w µ (cid:107) H X = (cid:80) i,t ∈ [ n µ ] α i α t (cid:82) X (cid:82) X k X ( x i , x ) k ( x, x (cid:48) ) k ( x t , x (cid:48) ) dxdx (cid:48) . Moreover the costs C, E inEq. (7) are C = O ( n µ + n ν ) , E = O ( n µ + n ν ) . Sample estimator.
Here we assume to be able to sample from µ, ν . Let x , . . . , x n µ , n µ ∈ N , be sampled i.i.d. from µ and y , . . . , y n ν , n ν ∈ N be sampled i.i.d. from ν . The esti-mators are ˆ w µ = n µ (cid:80) j ∈ [ n µ ] φ X ( x j ) , and analogously for ˆ w ν . In this case the operations are:(1) ˆ w µ (˜ x j ) = n µ (cid:80) j ∈ [ n µ ] k X ( x j , ˜ x i ) , and (2) (cid:107) ˆ w µ (cid:107) H X = n µ (cid:80) i,j ∈ [ n µ ] k X ( x i , x j ) . Moreover thecosts C, E in Eq. (7) are C = O ( n µ + n ν ) , E = O ( n µ + n ν ) .The effects of the estimators above are studied in Theorem 1 and Corollary 3, reported inSection 1 and proven in Appendix E, Page 21, using standard tools from approximationtheory and machine learning with kernel methods (Wendland, 2004; Caponnetto and De Vito,2007; Muandet et al., 2017). 12 cknowledgements. This work was funded in part by the French government undermanagement of Agence Nationale de la Recherche as part of the “Investissements d’avenir”program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledgesupport from the European Research Council (grant SEQUOIA 724063), and the RégionIle-de-France.
References
Robert A. Adams and John J. F. Fournier.
Sobolev Spaces . Elsevier, 2003.Nachman Aronszajn. Theory of reproducing kernels.
Transactions of the American Mathe-matical Society , 68(3):337–404, 1950.Pierre-Cyril Aubin-Frankowski and Zoltan Szabo. Hard shape-constrained kernel machines.
Advances in Neural Information Processing Systems , 33, 2020.Dimitri P. Bertsekas.
Convex Optimization Theory . Athena Scientific Belmont, 2009.Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacementinterpolation using Lagrangian mass transport. In
Proceedings of the 2011 SIGGRAPHAsia Conference , pages 1–12, 2011.Stephen P. Boyd and Lieven Vandenberghe.
Convex Optimization . Cambridge UniversityPress, 2004.Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs.
CR Acad. Sci. Paris Sér. I Math. , 305:805–808, 1987.Luis A. Caffarelli. The regularity of mappings with a convex potential.
Journal of theAmerican Mathematical Society , 5(1):99–104, 1992.Daniel Campbell, Stanislav Hencl, and František Konopeck`y. The weak inverse mappingtheorem.
Zeitschrift für Analysis und ihre Anwendungen , 34(3):321–342, 2015.Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squaresalgorithm.
Foundations of Computational Mathematics , 7(3):331–368, 2007.Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, YizheZhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. Adversarial text generation viafeature-mover’s distance. In
Advances in Neural Information Processing Systems 31 , pages4666–4677. 2018.Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré.Faster Wasserstein distance estimation with the Sinkhorn divergence.
Advances in NeuralInformation Processing Systems , 33, 2020.Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In
Advancesin Neural Information Processing Systems , 2013.Guido De Philippis and Alessio Figalli. The Monge–Ampère equation and its link to optimaltransportation.
Bulletin of the American Mathematical Society , 51(4):527–580, 2014.13hai Dekel and Dany Leviatan. Whitney estimates for convex domains with applications tomultivariate piecewise polynomial approximation.
Foundations of Computational Mathe-matics , 4(4):345–368, 2004.Jean Feydy, Benjamin Charlier, François-Xavier Vialard, and Gabriel Peyré. OptimalTransport for Diffeomorphic Registration. In
MICCAI 2017 , September 2017.Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance ofthe empirical measure.
Probability Theory and Related Fields , 162(3-4):707–738, 2015.Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio.Learning with a Wasserstein loss. In
Advances in Neural Information Processing Systems ,pages 2053–2061, 2015.Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorndivergences. In
International Conference on Artificial Intelligence and Statistics , pages1608–1617, 2018.Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Samplecomplexity of Sinkhorn divergences. In
The 22nd International Conference on ArtificialIntelligence and Statistics , pages 1574–1583, 2019.Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddingswith Wasserstein Procrustes. In
International Conference on Artificial Intelligence andStatistics (AISTATS) , pages 1880–1890. PMLR, 2019.Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and AlexanderSmola. A kernel two-sample test.
The Journal of Machine Learning Research , 13(1):723–773, 2012.Pierre Grisvard.
Elliptic Problems in Nonsmooth Domains . SIAM, 1985.Didier Henrion and Jean-Bernard Lasserre. Graph recovery from incomplete moment infor-mation, 2020.Nhat Ho, XuanLong Nguyen, Mikhail Yurochkin, Hung Hai Bui, Viet Huynh, and DinhPhung. Multilevel clustering via Wasserstein means. In
International Conference onMachine Learning , pages 1501–1509. PMLR, 2017.Jan-Christian Hütter and Philippe Rigollet. Minimax rates of estimation for smooth optimaltransport maps. arXiv preprint arXiv:1905.05828 , 2019.David Krieg and Mathias Sonnleitner. Random points are optimal for the approximation ofsobolev functions. arXiv preprint arXiv:2009.11275 , 2020.Jean Bernard Lasserre. Global optimization with polynomials and the problem of moments.
SIAM Journal on Optimization , 11(3):796–817, 2001.Jean Bernard Lasserre.
Moments, Positive Polynomials and Their Applications . ImperialCollege Press, 2009. 14lysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models fornon-negative functions.
Advances in Neural Information Processing Systems , 2020.Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport:sample complexity and the central limit theorem. In
Advances in Neural InformationProcessing Systems , pages 4543–4553, 2019.Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf.
Ker-nel Mean Embedding of Distributions: A Review and Beyond , volume 10. Now Foundationsand Trends, 2017.Francis Narcowich, Joseph Ward, and Holger Wendland. Sobolev bounds on functions withscattered zeros, with applications to radial basis function surface fitting.
Mathematics ofComputation , 74(250):743–763, 2005.Arkadi Nemirovski. Interior point polynomial time methods in convex programming.
Lecturenotes , 2004.Yurii Nesterov and Arkadii Nemirovskii.
Interior-Point Polynomial Algorithms in ConvexProgramming . Society for Industrial and Applied Mathematics, 1994.Jonathan Niles-Weed and Philippe Rigollet. Estimation of Wasserstein distances in thespiked transport model. arXiv preprint 1909.07513 , 2019.Pablo A Parrilo. Semidefinite programming relaxations for semialgebraic problems.
Mathe-matical programming , 96(2):293–320, 2003.Gabriel Peyré and Marco Cuturi. Computational optimal transport.
Foundations andTrends® in Machine Learning , 11(5-6):355–607, 2019.Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. Optimal transport for multi-source domain adaptation under target shift. In
The 22nd International Conference onArtificial Intelligence and Statistics , pages 849–858. PMLR, 2019.A. Reznikov and E. B. Saff. The covering radius of randomly distributed points on a manifold.
International Mathematics Research Notices , 2016(19):6065–6094, 2016.R. Tyrrell Rockafellar.
Convex Analysis , volume 36. Princeton University Press, 1970.Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators.
Journal of Machine Learning Research , 11(2), 2010.Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernelapproximations. In
Arxiv preprint arXiv:2012.11978 , 2020.Tim Salimans, Dimitris Metaxas, Han Zhang, and Alec Radford. Improving GANs usingoptimal transport. In , 2018.Filippo Santambrogio.
Optimal Transport for Applied Mathematicians: Calculus of Variations,PDEs, and Modeling . Progress in Nonlinear Differential Equations and Their Applications.Springer International Publishing, 2015. 15lya Meerovich Sobol. On the distribution of points in a cube and the approximate evaluationof integrals.
Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki , 7(4):784–802,1967.Ingo Steinwart and Andreas Christmann.
Support Vector Machines . Springer Science &Business Media, 2008.Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergenceof empirical measures in Wasserstein distance.
Bernoulli , 25(4A):2620–2648, 2019.Jonathan Weed and Quentin Berthet. Estimation of smooth densities in Wasserstein distance.In
Conference on Learning Theory , pages 3118–3119, 2019.Holger Wendland.
Scattered Data Approximation , volume 17. Cambridge University Press,2004.Holger Wendland and Christian Rieger. Approximate interpolation with applications toselecting smoothing parameters.
Numer. Math. , 101(4):729–748, October 2005.Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learningfor word embedding and topic modeling. In
Advances in Neural Information ProcessingSystems 31 , pages 1716–1725. 2018.
A Proofs of Corollary 6 and Corollary 7
Proof of Corollary 6.
Define h ( x, y ) := c ( x, y ) − u (cid:63) ( x ) − v (cid:63) ( y ) , ∀ ( x, y ) ∈ X × Y . By Theo-rem 5, there exist w , ..., w d ∈ H s ( X × Y ) such that h = (cid:80) di =1 w i . Since H s ( X × Y ) ⊆ H XY ,then w , . . . , w d ∈ H XY . Hence A (cid:63) = (cid:80) di =1 w i ⊗ w i ∈ S + ( H XY ) . Moreover, by the repro-ducing property of H XY , A (cid:63) satisfies (cid:104) φ ( x, y ) , A (cid:63) φ ( x, y ) (cid:105) H XY = (cid:80) di =1 (cid:104) w j , φ ( x, y ) (cid:105) H XY = (cid:80) di =1 w ( x, y ) = h ( x, y ) , for all ( x, y ) ∈ X × Y . Finally, note that rank A (cid:63) ≤ d by construc-tion. Proof of Corollary 7.
Theorem 3.3 in De Philippis and Figalli (2014) implies thatKantorovich potentials satisfy u (cid:63) ∈ C m +2 , ( X ) , v (cid:63) ∈ C m +2 , ( Y ) , where C m +2 , ( Z ) for anopen set Z is the space of real functions over Z that are m + 2 -times differentiable, with allthe derivatives of order m + 2 that are Lipschitz continuous (Adams and Fournier, 2003, 1.29,Page 10). Since X is convex and bounded, then Lipschitz continuity of all the derivatives oforder m + 2 implies that all the weak derivatives up to order m + 3 are in L ∞ ( X ) . Since L ∞ ( X ) ⊂ L ( X ) , by the boundedness of X , we have C m +2 , ( X ) ⊂ H m +32 ( X ) . Analogously C m +2 , ( Y ) ⊂ H m +32 ( Y ) . Then u (cid:63) ∈ H m +3 ( X ) , v (cid:63) ∈ H m +3 ( Y ) , and we can apply Theorem 5and Corollary 6 with s = m +1 , when m > d , which guarantee the existence of A (cid:63) ∈ S + ( H XY ) ,since H XY = H s ( X × Y ) . 16 Proofs of Theorem 8 and Theorem 9
Proof of Theorem 8.
Let f ( x, y ) = c ( x, y ) − u ( x ) X ( y ) − v ( y ) Y ( x ) − (cid:104) φ ( x, y ) , Aφ ( x, y ) (cid:105) for any x, y ∈ X × Y , where the function X ( x ) and Y ( y ) are respectively the constantfunction over X and over Y . By construction f ∈ H s ( X × Y ) . Since X, Y are openbounded convex sets, then X × Y has the same property which, in turn, implies the so-called uniform interior cone condition (see Lemma 11, Page 19 in the appendix). Then we canapply Theorem 1 of Narcowich et al. (2005) for which there exist h , C depending only on s, d, X, Y such that for any h (cid:96) ≤ h the following holds sup ( x,y ) ∈ X × Y | f ( x, y ) | ≤ ε, ε := C h s − d(cid:96) | f | H s ( X × Y ) , where | g | H s ( X × Y ) := (cid:80) | α | = s (cid:107) D α g (cid:107) L ( X × Y ) ≤ (cid:107) g (cid:107) H s ( X × Y ) for any g ∈ H s ( X × Y ) . Let r A ( x, y ) = (cid:104) φ (˜ x j , ˜ y j ) , Aφ (˜ x j , ˜ y j ) (cid:105) H XY , we have | f | H s ( X × Y ) ≤ | c | H s ( X × Y ) + | u X | H s ( X × Y ) + | v Y | H s ( X × Y ) + | r A | H s ( X × Y ) . Now | c | H s ( X × Y ) = 0 since c is the quadratic cost and s ≥ and | u X | H s ( X × Y ) = | u | H s ( X ) ≤(cid:107) u (cid:107) H s ( X ) , | v Y | H s ( X × Y ) = | v | H s ( Y ) ≤ (cid:107) v (cid:107) H s ( Y ) . We recall now that for any two Banachspaces satisfying B ⊆ A , we have (cid:107) w (cid:107) A ≤ (cid:107) w (cid:107) B for any w ∈ B . Then (cid:107) u (cid:107) H s ( X ) ≤ (cid:107) u (cid:107) H X , (cid:107) v (cid:107) H s ( Y ) ≤ (cid:107) v (cid:107) H Y . Finally | r A | H s ( X × Y ) ≤ (cid:107) r A (cid:107) H s ( X × Y ) ≤ C Tr( A ) via Lemma 9 page41 from Rudi et al. (2020) and Proposition 1 of the same paper, where C depends onlyon s, X, Y . Now since | f ( x, y ) | ≤ ε for all ( x, y ) ∈ X × Y , and since r A ( x, y ) ≥ for any x, y ∈ X × Y since A is a positive operator, we have − ε ≤ f ( x, y ) ≤ f ( x, y ) + r A ( x, y ) = c ( x, y ) − u ( x ) − v ( y ) , ∀ ( x, y ) ∈ X × Y. The final result is obtained by defining C = C (1 + C ) . Proof of Theorem 9.
Denote by V ( u, v ) the functional of Problem 1 and by ˆ V λ ,λ ( u, v, A ) the functional of Problem 3. Then OT = V ( u (cid:63) , v (cid:63) ) . Denote by ∆( u, v ) the quantity ∆( u, v ) = |(cid:104) u, ˆ w µ − w µ (cid:105) H X + (cid:104) v, ˆ w ν − w ν (cid:105) H Y | . Denote by ˆ R ( u, v ) the quantity ˆ R ( u, v ) = (cid:107) u (cid:107) H X + (cid:107) v (cid:107) H Y for any u ∈ H X , v ∈ H Y . Step 0. Admissibility of u (cid:63) , v (cid:63) , A (cid:63) and existence of a maximizer. Note that ( u (cid:63) , v (cid:63) , A (cid:63) ) is an admissible point for Problem 3, since the triple satisfies c (cid:63) ( x, y ) − u (cid:63) ( x ) − v (cid:63) ( y ) = (cid:104) φ ( x, y ) , A (cid:63) φ ( x, y ) (cid:105) ∀ ( x, y ) ∈ X × Y , and Problem 3 applies the same constraints but ona subset of X × Y . Moreover λ , λ > , this is enough to guarantee the existence of amaximizer for Problem 3. Indeed, as we recall in Lemma 14, Page 21, in the appendix, aform of representer theorem holds for Problem 3 (see Lemma 13, Page 20 in the appendix),moreover the functional is coercive on the finite-dimensional space induced by the data,it has an upper bound and the problem has an admissible point. Now denote by (ˆ u, ˆ v, ˆ A ) a maximizer of Problem 3 and define OT := ˆ V λ ,λ (ˆ u, ˆ v, ˆ A ) . By construction OT is themaximum of ˆ V λ ,λ . Then by definition of (cid:100) OT in Eq. (6), (cid:100) OT = OT + λ Tr( ˆ A ) + λ R (ˆ u, ˆ v ) . tep 1. Subsampling the inequality. The assumption on h (cid:96) and the fact that theconstraints of Eq. (3) satisfy Eq. (15) allow to apply Theorem 8, from which there exists acouple (ˆ u ε , ˆ v ε ) that is admissible for Problem 1, where ˆ u ε = ˆ u − ε and ˆ v ε = ˆ v − ε , for any ε ≥ C h s − d(cid:96) ( (cid:107) ˆ u (cid:107) H X + (cid:107) ˆ v (cid:107) H Y + Tr( ˆ A )) . In particular, since (cid:107) u (cid:107) H X + (cid:107) v (cid:107) H Y ≤ R (ˆ u, ˆ v ) , wechoose ε = η ( R (ˆ u, ˆ v ) + Tr( ˆ A )) , with η = C h s − d(cid:96) . Step 2. Bounding (cid:100) OT − OT.
Since OT is the maximum of Problem 1 thenOT = V ( u (cid:63) , v (cid:63) ) ≥ V (ˆ u ε , ˆ v ε ) = V (ˆ u, ˆ v ) − ε ≥ (cid:100) OT − ∆(ˆ u, ˆ v ) − ε. (16)Analogously, since (ˆ u, ˆ v, ˆ A ) maximize Problem 3, thenOT = ˆ V λ ,λ (ˆ u, ˆ v, ˆ A ) ≥ ˆ V λ ,λ ( u (cid:63) , v (cid:63) , A (cid:63) ) (17) = V ( u (cid:63) , v (cid:63) ) − (cid:104) V ( u (cid:63) , v (cid:63) ) − ˆ V λ ,λ ( u (cid:63) , v (cid:63) , A (cid:63) ) (cid:105) (18) ≥ OT − ∆( u (cid:63) , v (cid:63) ) − λ Tr( A (cid:63) ) + λ R ( u (cid:63) , v (cid:63) ) . (19)By expressing OT in terms of (cid:100) OT in the inequality above and combining it with Eq. (16),we have λ Tr( ˆ A − A (cid:63) ) + λ ( R (ˆ u, ˆ v ) − R ( u (cid:63) , v (cid:63) )) − ∆( u (cid:63) , v (cid:63) ) ≤ (cid:100) OT − OT ≤ ∆(ˆ u, ˆ v ) + ε. (20) Step 3. Bound on ∆ . Note that for two RKHS H , K and any u, v ∈ H , w, z ∈ K ,the following identity holds: (cid:104) u, v (cid:105) H + (cid:104) w, z (cid:105) K = (cid:104) ( u, w ) , ( v, z ) (cid:105) H⊕K , where
H ⊕ K is aRKHS (Aronszajn, 1950). This implies ( (cid:104) u, v (cid:105) H + (cid:104) w, z (cid:105) K ) = (cid:107) ( u, w ) (cid:107) H⊕K (cid:107) ( v, z ) (cid:107) H⊕K =( (cid:107) u (cid:107) H + (cid:107) v (cid:107) K )( (cid:107) w (cid:107) H + (cid:107) z (cid:107) K ) . Applying this inequality to ∆ leads to the following result: ∆( a, b ) ≤ R ( a, b ) γ. (21) Step 4. Bounding
Tr( ˆ A ) , R (ˆ u, ˆ v ) . The bound Eq. (20) implies λ Tr( ˆ A ) + λ R (ˆ u, ˆ v ) ≤ λ Tr( A (cid:63) ) + λ R ( u (cid:63) , v (cid:63) ) + ∆( u (cid:63) , v (cid:63) ) + ∆(ˆ u, ˆ v ) + ε. (22)By bounding ∆ via Eq. (21), expanding the definition of ε and reordering the terms in theinequality above, we obtain α Tr( ˆ A ) + λ R (ˆ u, ˆ v ) − βR (ˆ u, ˆ v ) ≤ λ Tr( A (cid:63) ) + λ R ( u (cid:63) , v (cid:63) ) + γR ( u (cid:63) , v (cid:63) ) , (23)with α = λ − η and β = γ + η . By completing the square in R (ˆ u, ˆ v ) , the inequality aboveis rewritten as α Tr( ˆ A ) + λ ( R (ˆ u, ˆ v ) − β λ ) ≤ λ Tr( A (cid:63) ) + λ R ( u (cid:63) , v (cid:63) ) + γR ( u (cid:63) , v (cid:63) ) + β λ . (24)Since α ≥ λ / , by the assumption λ ≥ η , the inequality above implies λ Tr( ˆ A ) ≤ λ S, R (ˆ u, ˆ v ) ≤ β λ + √ S. (25)with S := λ λ Tr( A (cid:63) ) + R ( u (cid:63) , v (cid:63) ) + γλ R ( u (cid:63) , v (cid:63) ) + β λ . onclusion. From the lower bound in Eq. (20) and the bound Eq. (21) on ∆ , we havethat (cid:100) OT − OT ≥ − λ Tr( A (cid:63) ) − λ R ( u (cid:63) , v (cid:63) ) − γR ( u (cid:63) , v (cid:63) ) ≥ − λ S. From the upper bound in Eq. (20), the bound for ∆ in Eq. (21), the bound for Tr( ˆ A ) , R (ˆ u, ˆ v ) in Eq. (25), the definition of ε , and the fact that λ ≥ η , we have (cid:100) OT − OT ≤ βR (ˆ u, ˆ v ) + λ Tr( ˆ A ) ≤ β λ + β √ S + λ S. Then | (cid:100) OT − OT | ≤ β λ + β √ S + λ S. To conclude, by noting that S ≤ λ λ Tr( A (cid:63) )+( R ( u (cid:63) , v (cid:63) )+ β λ ) , since γ ≤ β , and that ( a + b + c ) ≤ a + 3 b + 3 c for any a, b, c ≥ , we have β λ + β √ S + λ S ≤ λ ( β λ + √ S ) ≤ λ (cid:18) R ( u (cid:63) , v (cid:63) ) + βλ + (cid:113) λ λ Tr( A (cid:63) ) (cid:19) ≤ λ ( R ( u (cid:63) , v (cid:63) ) + β λ + λ λ Tr( A (cid:63) )) . C Additional results on convex sets and random points
We first recall the following properties of bounded sets in Euclidean spaces.
Lemma 11 (Krieg and Sonnleitner (2020)). Let $Z \subset \mathbb R^q$, $q \in \mathbb N$, be a non-empty open set. The following holds:
1. If $Z$ is a bounded Lipschitz domain (Grisvard, 1985), then it satisfies the uniform interior cone condition (Wendland, 2004).
2. If $Z$ is a convex bounded set, then it is a bounded Lipschitz domain.

Proof
For the first point, see Lemma 5 of Krieg and Sonnleitner (2020) or Theorem 1.2.2.2 of Grisvard (1985). For the second point, see Lemma 4 of Krieg and Sonnleitner (2020) or Lemma 7 in Dekel and Leviatan (2004). Also, the fact that a convex bounded open set satisfies the uniform interior cone condition is implied by Proposition 11.26 of Wendland (2004), which proves it for the more general class of sets that are star shaped with respect to a ball.

Lemma 12 (Fill distance of i.i.d. points on a u.i.c. set (Reznikov and Saff, 2016)). Let $Z \subset \mathbb R^q$, $q \in \mathbb N$, be a non-empty open set satisfying the uniform interior cone condition (see Lemma 11). Let $\tilde Z_\ell$ be a collection of $\ell$ points sampled independently at random from a probability distribution $\rho$ that admits a density (denote it by $p$) such that $p(z) \ge c_0 > 0$ for any $z \in Z$. Let $\delta \in (0,1]$. Then there exist $\ell_0, C_1, C_2$ depending on $c_0, Z, \rho, q, r_0$ such that, for $\ell \ge \ell_0$, the following holds with probability at least $1 - \delta$: $h_\ell \le \big(C_1 \ell^{-1} \log(C_2 \ell/\delta)\big)^{1/q}$.

Proof The uniform interior cone condition guarantees that there exists a cone $\mathcal C$ such that for any $z \in Z$ there exists a spherical cone $\mathcal C_z \subseteq Z$ of radius $r_0$ that is congruent to $\mathcal C$ and has vertex in $z$. Then, for any $r \le r_0$ there exists $c_1$ such that
\[
\rho(B(z,r) \cap Z) \ge \rho(B(z,r) \cap \mathcal C_z) \ge c_0\, \frac{\mathrm{vol}(B(z,r) \cap \mathcal C_z)}{\mathrm{vol}(Z)} \ge \frac{c_0 c_1}{\mathrm{vol}(Z)}\, r^q.
\]
Then we can apply Theorem 2.1 of Reznikov and Saff (2016) with $\Phi(r) = \frac{c_0 c_1}{\mathrm{vol}(Z)} r^q$, obtaining that there exist $\ell_0, C_1, C_2$ depending on $c_0, c_1, q, r_0, Z$ such that, with probability at least $1-\delta$, $h_\ell \le \big(C_1 \ell^{-1}\log(C_2\ell/\delta)\big)^{1/q}$.
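As a quick numerical illustration of Lemma 12 (not part of the original analysis; the uniform samples on the unit cube and the Monte-Carlo estimation of the fill distance are our own simplifying choices), the following sketch checks that the fill distance of $\ell$ i.i.d. points decays at the predicted $(\log \ell/\ell)^{1/q}$ rate.

```python
import numpy as np
from scipy.spatial import cKDTree

def fill_distance(points, n_probe=20000, seed=0):
    """Monte-Carlo estimate of h = sup_z min_i ||z - z_i|| over the unit cube,
    i.e. the fill distance of the sample with respect to Z = (0, 1)^q."""
    rng = np.random.default_rng(seed)
    probes = rng.uniform(size=(n_probe, points.shape[1]))
    dists, _ = cKDTree(points).query(probes)
    return dists.max()

rng = np.random.default_rng(1)
q = 2
for l in [100, 400, 1600, 6400]:
    pts = rng.uniform(size=(l, q))
    # Lemma 12 predicts h_l = O((log(l) / l)^{1/q}) with high probability
    print(l, round(fill_distance(pts), 4), round((np.log(l) / l) ** (1 / q), 4))
```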
D Representer theorem and coercivity for Problem (3)

We first adapt the proof of the representer theorem from Marteau-Ferey et al. (2020), and we use it to prove coercivity of the functional. Given the RKHS $\mathcal H_X, \mathcal H_Y, \mathcal H_{XY}$ and with the same notation as Problem (3), define $\hat{\mathcal H}_X = \mathrm{span}\{\hat w_\mu, \phi_X(\tilde x_1), \ldots, \phi_X(\tilde x_\ell)\}$, $\hat{\mathcal H}_Y = \mathrm{span}\{\hat w_\nu, \phi_Y(\tilde y_1), \ldots, \phi_Y(\tilde y_\ell)\}$ and $\hat{\mathcal H}_{XY} = \mathrm{span}\{\phi(\tilde x_1, \tilde y_1), \ldots, \phi(\tilde x_\ell, \tilde y_\ell)\}$.

Lemma 13 (Representer theorem, Marteau-Ferey et al. (2020)). Let $\lambda_1, \lambda_2 > 0$. Denote by $\hat V_{\lambda_1,\lambda_2}(u,v,A)$ the objective function of Problem (3). Let $u_1 \in \hat{\mathcal H}_X$, $u_2 \in \hat{\mathcal H}_X^\perp$, $v_1 \in \hat{\mathcal H}_Y$, $v_2 \in \hat{\mathcal H}_Y^\perp$ and $A_1 \in \hat{\mathcal H}_{XY} \otimes \hat{\mathcal H}_{XY}$, $A_2 \in \hat{\mathcal H}_{XY}^\perp \otimes \hat{\mathcal H}_{XY}$, $A_3 \in \hat{\mathcal H}_{XY}^\perp \otimes \hat{\mathcal H}_{XY}^\perp$. Assume that $u_2$ or $v_2$ or $A_2$ or $A_3$ is different from $0$. Set $u = u_1 + u_2$, $v = v_1 + v_2$ and $A = A_1 + A_2 + A_2^* + A_3$, and assume that $(u,v,A)$ is an admissible point for Problem (3). Then
1. $(u_1, v_1, A_1)$ is an admissible point for Problem (3),
2. $\hat V_{\lambda_1,\lambda_2}(u_1,v_1,A_1) > \hat V_{\lambda_1,\lambda_2}(u,v,A)$.

Proof
We can decompose any $u \in \mathcal H_X$, $v \in \mathcal H_Y$, $A \in S_+(\mathcal H_{XY})$ as $u = u_1 + u_2$, $v = v_1 + v_2$, $A = A_1 + A_2 + A_2^* + A_3$, with $u_1 \in \hat{\mathcal H}_X$, $u_2 \in \hat{\mathcal H}_X^\perp$, $v_1 \in \hat{\mathcal H}_Y$, $v_2 \in \hat{\mathcal H}_Y^\perp$ and $A_1 \in \hat{\mathcal H}_{XY}\otimes\hat{\mathcal H}_{XY}$, $A_2 \in \hat{\mathcal H}_{XY}^\perp\otimes\hat{\mathcal H}_{XY}$, $A_3 \in \hat{\mathcal H}_{XY}^\perp\otimes\hat{\mathcal H}_{XY}^\perp$. Note that the components $u_2, v_2, A_2, A_3$ do not impact the constraints or the functional but are only penalized by the regularizer. Indeed, $\langle u_2, \phi_X(\tilde x_i)\rangle_{\mathcal H_X} = 0$ since $\phi_X(\tilde x_i) \in \hat{\mathcal H}_X$ while $u_2 \in \hat{\mathcal H}_X^\perp$, so
\[
u(\tilde x_i) = \langle u, \phi_X(\tilde x_i)\rangle_{\mathcal H_X} = \langle u_1, \phi_X(\tilde x_i)\rangle_{\mathcal H_X} + \langle u_2, \phi_X(\tilde x_i)\rangle_{\mathcal H_X} = \langle u_1, \phi_X(\tilde x_i)\rangle_{\mathcal H_X} = u_1(\tilde x_i).
\]
The same reasoning holds for $v(\tilde y_i)$ and for $\langle u, \hat w_\mu\rangle_{\mathcal H_X}$, $\langle v, \hat w_\nu\rangle_{\mathcal H_Y}$. For $A_2$, analogously, we have that $t = A_2\phi(\tilde x_i,\tilde y_i) \in \hat{\mathcal H}_{XY}^\perp$, so $\langle \phi(\tilde x_i,\tilde y_i), A_2\phi(\tilde x_i,\tilde y_i)\rangle_{\mathcal H_{XY}} = \langle \phi(\tilde x_i,\tilde y_i), t\rangle_{\mathcal H_{XY}} = 0$, and similarly for $A_3$ we have $\langle \phi(\tilde x_i,\tilde y_i), A_3\phi(\tilde x_i,\tilde y_i)\rangle_{\mathcal H_{XY}} = 0$. Let us see what happens to the penalization terms. For the quadratic term we have $-\|u\|^2_{\mathcal H_X} = -\|u_1\|^2_{\mathcal H_X} - \|u_2\|^2_{\mathcal H_X} < -\|u_1\|^2_{\mathcal H_X}$ whenever $u_2 \ne 0$, and analogously for $v$. For the trace term we have $\operatorname{Tr}(A) = \operatorname{Tr}(A_1) + \operatorname{Tr}(A_3)$. Moreover, $A \in S_+(\mathcal H_{XY})$ implies that $A_3 \succeq 0$ and, by the Schur complement property, $A_3 \succeq A_2 A_1^{-1} A_2^* \succeq 0$. Then, if $A_3 \succeq 0$ and different from zero, we have $-\operatorname{Tr}(A) < -\operatorname{Tr}(A_1)$. If $A_2$ is different from zero, this implies by the Schur complement property that $A_3$ is a positive semi-definite operator different from zero, and so again $-\operatorname{Tr}(A) < -\operatorname{Tr}(A_1)$.

Lemma 14 (Problem (3) has a maximizer). When $\lambda_1, \lambda_2 > 0$ and when there exists an admissible point $(\bar u, \bar v, \bar A)$ with $\bar u \in \mathcal H_X$, $\bar v \in \mathcal H_Y$, $\bar A \in S_+(\mathcal H_{XY})$, Problem (3) admits a maximizer.

Proof
First define $\mathcal S = \mathcal H_X \times \mathcal H_Y \times (\mathcal H_{XY} \otimes \mathcal H_{XY})$ and $\hat{\mathcal S} = \hat{\mathcal H}_X \times \hat{\mathcal H}_Y \times (\hat{\mathcal H}_{XY} \otimes \hat{\mathcal H}_{XY})$. From the lemma above, for any admissible point in $\mathcal S \setminus \hat{\mathcal S}$ there exists another admissible point in $\hat{\mathcal S}$ with a strictly larger value. Note moreover that $\hat{\mathcal S}$ is a finite-dimensional Hilbert space (of dimension at most $\ell^2 + 2(\ell+1)$) and that (minus) the functional of Problem (3) is coercive on it. Note moreover that Eq. (3) is bounded from above, since
\[
\hat V_{\lambda_1,\lambda_2}(u,v,A) \le \|u\|_{\mathcal H_X}\|\hat w_\mu\|_{\mathcal H_X} + \|v\|_{\mathcal H_Y}\|\hat w_\nu\|_{\mathcal H_Y} - \lambda_2\big(\|u\|^2_{\mathcal H_X} + \|v\|^2_{\mathcal H_Y}\big),
\]
and the right-hand side has a maximum in $u, v$. Since Problem (3) also has an admissible point by assumption, it admits at least one maximizer (see, e.g., Proposition 3.2.1, Page 119 of Bertsekas, 2009).

E Proof of Corollary 3 and Theorem 1
Proof of Corollary 3.
The result is obtained by plugging into Theorem 2 the bounds on $\gamma = \|w_\mu - \hat w_\mu\|_{\mathcal H_X} + \|w_\nu - \hat w_\nu\|_{\mathcal H_Y}$. We now deal with the three scenarios.

(Exact integral) In this case, since $\hat w_\mu := w_\mu$ and $\hat w_\nu := w_\nu$, we have $\gamma = 0$.

(Evaluation) We use here the construction used to analyze kernel least squares, e.g., from Caponnetto and De Vito (2007); Rosasco et al. (2010) (see also Steinwart and Christmann, 2008). We carry out the construction for $w_\mu$, since the case of $w_\nu$ is analogous. We recall that $\phi_X : X \to \mathcal H_X$ is uniformly bounded and continuous on $X$, since we are considering the Sobolev kernel (see Proposition 4). Denote by $T \in S_+(\mathcal H_X)$ the operator defined as $T = \int \phi_X(x) \otimes \phi_X(x)\, dx$, where $dx$ is the Lebesgue measure. Note that $T$ is trace class; indeed,
\[
\operatorname{Tr}(T) = \int \operatorname{Tr}\big(\phi_X(x)\otimes\phi_X(x)\big)\,dx = \int \|\phi_X(x)\|^2_{\mathcal H_X}\,dx \le \mathrm{vol}(X)\,\sup_{x\in X}\|\phi_X(x)\|^2_{\mathcal H_X} < \infty.
\]
For any $f,g \in \mathcal H_X$, by the reproducing property,
\[
\langle f, Tg\rangle_{\mathcal H_X} = \int \langle f, (\phi_X(x)\otimes\phi_X(x))g\rangle_{\mathcal H_X}\,dx = \int \langle f, \phi_X(x)\rangle_{\mathcal H_X}\langle g, \phi_X(x)\rangle_{\mathcal H_X}\,dx = \int f(x)g(x)\,dx.
\]
In particular, the equation above implies that $\|f\|_{L^2(X)} = \|T^{1/2}f\|_{\mathcal H_X}$ for any $f \in \mathcal H_X$. Now, note that by assumption in this corollary we have chosen the kernel $k_X = k_{m+1}$, so $\mathcal H_X$ corresponds to the Sobolev space $\mathcal H_X = H^{m+1}(X)$ (see Proposition 4). By Assumption 1, $\mu$ has a density, which we denote $g_\mu$, that is differentiable up to order $m$ and such that all the derivatives of order $m$ are Lipschitz continuous. Since $X$ is convex and bounded, the Lipschitz continuity of all the derivatives of order $m$ implies that all the weak derivatives up to order $m+1$ are in $L^\infty(X)$. Since $L^\infty(X) \subset L^2(X)$ by the boundedness of $X$, all the weak derivatives of $g_\mu$ up to order $m+1$ belong to $L^2(X)$, i.e., $\|g_\mu\|_{H^{m+1}(X)} < \infty$ and $g_\mu \in H^{m+1}(X) = \mathcal H_X$. So, by the reproducing property,
\[
w_\mu = \int \phi_X(x)\,d\mu(x) = \int \phi_X(x)\,g_\mu(x)\,dx = \int \phi_X(x)\langle \phi_X(x), g_\mu\rangle_{\mathcal H_X}\,dx = \int \big(\phi_X(x)\otimes\phi_X(x)\big)\,g_\mu\,dx = T g_\mu.
\]
Now, the estimator $\hat w_\mu$ is defined as $\hat w_\mu = \int_X \phi_X(x)\hat g_\mu(x)\,dx$, where $\hat g_\mu \in \mathcal H_X$ is the kernel least squares estimator (see Narcowich et al., 2005; Caponnetto and De Vito, 2007) of the density $g_\mu$ of $\mu$. With the same reasoning as above, we see that $\hat w_\mu = T\hat g_\mu$. Now note that
\[
\|w_\mu - \hat w_\mu\|_{\mathcal H_X} = \|T(g_\mu - \hat g_\mu)\|_{\mathcal H_X} \le \|T^{1/2}\|_{\mathrm{op}}\,\|T^{1/2}(g_\mu - \hat g_\mu)\|_{\mathcal H_X} = \|T^{1/2}\|_{\mathrm{op}}\,\|g_\mu - \hat g_\mu\|_{L^2(X)}.
\]
Now $\|T^{1/2}\|^2_{\mathrm{op}} = \|T\|_{\mathrm{op}} \le \operatorname{Tr}(T) < \infty$, as we have seen above. Moreover, $\|g_\mu - \hat g_\mu\|_{L^2(X)}$ is controlled by classical results in approximation theory, e.g., Proposition 3.2 of Narcowich et al. (2005) (applied with $\alpha = 0$ and $q = 2$). It is possible to apply such results since the set $X$ is convex and bounded, and so it satisfies the required uniform interior cone condition (Wendland, 2004) (see Lemma 11, Page 19). The result guarantees that there exist two constants $C, h_0$ depending only on $X$ such that, for $h \le h_0$, $\|g_\mu - \hat g_\mu\|_{L^2(X)} \le C h^{m+1}\|g_\mu\|_{\mathcal H_X}$, where $h$ is the fill distance (see Eq. (14)) of the sampled $n_\mu$ points with respect to $X$.
Now, by Lemma 12, Page 19, we have that $h \le \big(C' n_\mu^{-1}\log(C'' n_\mu/\delta)\big)^{1/d}$ with probability at least $1-\delta$, for some constants $C', C''$ depending on $d, X$. Then, finally, for some constant $C'''$ we have
\[
\|w_\mu - \hat w_\mu\|_{\mathcal H_X} \le \operatorname{Tr}(T)^{1/2}\,\|g_\mu - \hat g_\mu\|_{L^2(X)} \le \operatorname{Tr}(T)^{1/2}\, C\,\|g_\mu\|_{\mathcal H_X}\,\big(C' n_\mu^{-1}\log(C'' n_\mu/\delta)\big)^{(m+1)/d} \le C'''\, n_\mu^{-(m+1)/d}\log^{(m+1)/d}(n_\mu/\delta).
\]
(Sampling) In this case we have $\hat w_\mu = \frac{1}{n_\mu}\sum_{i\in[n_\mu]} \phi_X(x_i)$, where $x_1,\ldots,x_{n_\mu}$ are independently and identically distributed according to $\mu$. Then $\xi_i = \phi_X(x_i)$ for $i \in [n_\mu]$ are i.i.d. random vectors and $w_\mu = \mathbb E\,\xi_1$. Since $\phi_X : X \to \mathcal H_X$ is uniformly bounded on $X$ (see Proposition 4), denote by $c$ that bound. We have that $\|\xi_i\|_{\mathcal H_X} \le c$ almost surely and $\mathbb E\|\xi_i - \mathbb E\xi_i\|^2_{\mathcal H_X} \le c^2$ for all $i \in [n_\mu]$. Then we can apply the Pinelis inequality (see Proposition 2 of Caponnetto and De Vito, 2007) to control $\hat w_\mu = \frac{1}{n_\mu}\sum_{i=1}^{n_\mu}\xi_i$, for which the following holds with probability $1-\delta$:
\[
\|w_\mu - \hat w_\mu\|_{\mathcal H_X} = \Big\|\mathbb E\,\xi_1 - \frac{1}{n_\mu}\sum_{i=1}^{n_\mu}\xi_i\Big\|_{\mathcal H_X} \le c\, n_\mu^{-1/2}\log\frac{6 n_\mu}{\delta}.
\]
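For the sampling scenario, the quantities above are easy to compute in practice. The following sketch (our own toy illustration, not code from the paper: the uniform samples, the Gaussian kernel and its bandwidth are arbitrary choices) evaluates the RKHS distance between two empirical kernel mean embeddings with the kernel trick; its decay as the sample size grows is consistent with the $n^{-1/2}$ rate given by the Pinelis inequality.

```python
import numpy as np

def gauss_kernel(X, Y, sigma=0.5):
    # k(x, y) = exp(-||x - y||^2 / sigma^2), evaluated pairwise
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def embedding_distance(X1, X2, sigma=0.5):
    """Squared RKHS distance between the empirical mean embeddings of two samples:
    ||w1 - w2||_H^2 = mean(K11) - 2 * mean(K12) + mean(K22)."""
    K11 = gauss_kernel(X1, X1, sigma)
    K12 = gauss_kernel(X1, X2, sigma)
    K22 = gauss_kernel(X2, X2, sigma)
    return K11.mean() - 2 * K12.mean() + K22.mean()

rng = np.random.default_rng(0)
reference = rng.uniform(size=(2000, 2))   # large sample, proxy for the population embedding
for n in [50, 200, 800]:
    sample = rng.uniform(size=(n, 2))
    err = np.sqrt(max(embedding_distance(sample, reference), 0.0))
    print(n, round(err, 4))                # roughly halves when n is multiplied by 4
```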
Proof of Theorem 1. This theorem is a direct consequence of Corollary 3. Let $\varepsilon > 0$ and $\ell, n_\mu, n_\nu \in \mathbb N$. We denote by $f(x) \asymp g(x)$ the fact that there exist two constants $0 < C_1 \le C_2$ not depending on $x$ such that $C_1 g(x) \le f(x) \le C_2 g(x)$ (in our particular case this means that the constants do not depend on $\varepsilon, \ell, n_\mu, n_\nu$). For the rest of the proof we fix $\ell \asymp \varepsilon^{-d/(m-d)}$. Indeed, this choice implies that $\ell^{-(m-d)/d} = O(\varepsilon)$. We recall, moreover, from Eq. (7) (which is proven in Appendix F) that the computational time to achieve the estimator is $\tilde O(C + E\ell + \ell^{3.5})$ and the required memory is $O(\ell^2)$, where $E, C$ are specified for each scenario in Section 6. We now quantify the complexity for the three scenarios.
(Exact integral)
In this case, by Corollary 3, $|\widehat{\mathrm{OT}} - \mathrm{OT}| = \tilde O(\ell^{-(m-d)/d}) = \tilde O(\varepsilon)$. Note that, since for this scenario $C = O(1)$ and $E = O(1)$, the complexity in time is $\tilde O(C + E\ell + \ell^{3.5}) = \tilde O(\ell^{3.5}) = \tilde O(\varepsilon^{-3.5 d/(m-d)})$. In space, analogously, $O(\ell^2) = O(\varepsilon^{-2d/(m-d)})$.

(Evaluation) In this case we choose $n_\nu \asymp n_\mu \asymp \varepsilon^{-d/(m+1)}$. Indeed, with this choice $n_\mu^{-(m+1)/d} = O(\varepsilon)$ (and analogously for $n_\nu$). Then, by Corollary 3,
\[
|\widehat{\mathrm{OT}} - \mathrm{OT}| = \tilde O\big(n_\mu^{-(m+1)/d} + n_\nu^{-(m+1)/d} + \ell^{-(m-d)/d}\big) = \tilde O(\varepsilon).
\]
From a computational viewpoint, $C$ and $E$ are the costs specified for this scenario in Section 6, and both are polynomial in $n_\mu + n_\nu$. With the choices of $\ell, n_\mu, n_\nu$ above, the term $\ell^{3.5}$ dominates, so the time complexity is $\tilde O(C + E\ell + \ell^{3.5}) = \tilde O(\varepsilon^{-3.5 d/(m-d)})$, while the space complexity is $O(\ell^2) = O(\varepsilon^{-2d/(m-d)})$.

(Sampling) In this case we choose $n_\nu \asymp n_\mu \asymp \varepsilon^{-2}$. Indeed, with this choice $n_\mu^{-1/2} = O(\varepsilon)$ (and analogously for $n_\nu$). Then, by Corollary 3,
\[
|\widehat{\mathrm{OT}} - \mathrm{OT}| = \tilde O\big(n_\mu^{-1/2} + n_\nu^{-1/2} + \ell^{-(m-d)/d}\big) = \tilde O(\varepsilon).
\]
From a computational viewpoint, $C = O(n_\mu^2 + n_\nu^2)$ and $E = O(n_\mu + n_\nu)$. Then the time complexity is
\[
\tilde O(C + E\ell + \ell^{3.5}) = \tilde O\big(n_\mu^2 + n_\nu^2 + (n_\mu + n_\nu)\ell + \ell^{3.5}\big) = \tilde O\big(\varepsilon^{-4} + \varepsilon^{-2 - d/(m-d)} + \varepsilon^{-3.5 d/(m-d)}\big) = \tilde O\big(\varepsilon^{-\max(4,\, 3.5 d/(m-d))}\big),
\]
while also in this case the space complexity is $O(\ell^2) = O(\varepsilon^{-2d/(m-d)})$.
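The exponent bookkeeping of the sampling scenario can be checked numerically; the snippet below is only an illustration (the values $d = 2$, $m = 10$ are arbitrary, subject to $m > d$) of how the three cost terms compare once $n \asymp \varepsilon^{-2}$ and $\ell \asymp \varepsilon^{-d/(m-d)}$ are plugged in.

```python
import numpy as np

def sampling_scenario_costs(eps, d, m):
    """Cost terms of the sampling scenario: n ~ eps^-2, l ~ eps^{-d/(m-d)},
    time = O(C + E*l + l^3.5) with C = O(n^2) and E = O(n)."""
    n = eps ** -2.0
    l = eps ** (-d / (m - d))
    return {"C ~ n^2": n ** 2, "E*l ~ n*l": n * l, "l^3.5": l ** 3.5}

for eps in [1e-1, 1e-2]:
    costs = sampling_scenario_costs(eps, d=2, m=10)
    dominant = max(costs.values())
    # the measured exponent should approach max(4, 3.5 d / (m - d)), here 4
    print(eps, costs, round(np.log(dominant) / np.log(1.0 / eps), 2))
```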
F Dual Algorithm and Computational Bounds

In this section, we describe a dual algorithmic procedure to compute Eq. (4), and bound the computational complexity and memory footprint it requires to achieve a given precision. We start by deriving Eq. (5), the dual formulation of Eq. (3), and express Eq. (4) as a function of the dual solution $\hat\gamma$.

Theorem 15.
The dual problem of Eq. (3) is Eq. (5). Further, the estimator $\widehat{\mathrm{OT}}$ can be
expressed as a function of the solution $\hat\gamma$ of Eq. (5):
\[
\widehat{\mathrm{OT}} = \frac{q}{2\lambda_2} - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\hat\gamma_j\big(\hat w_\mu(\tilde x_j) + \hat w_\nu(\tilde y_j)\big). \tag{26}
\]

Proof of Theorem 15. First, let us observe that problem Eq. (3) is convex, admits a feasible point by Corollary 7 and a maximizer by Lemma 14. Therefore, strong duality holds. Next, the Lagrangian of Eq. (3) reads
\[
L(u,v,B,\gamma) = \langle u, \hat w_\mu\rangle + \langle v, \hat w_\nu\rangle - \lambda_1\operatorname{Tr} B - \lambda_2\|u\|^2_{\mathcal H_X} - \lambda_2\|v\|^2_{\mathcal H_Y} + \sum_{i=1}^{\ell}\gamma_i\big(c(\tilde x_i,\tilde y_i) - \langle u, \phi_X(\tilde x_i)\rangle - \langle v, \phi_Y(\tilde y_i)\rangle - \Phi_i^\top B\,\Phi_i\big). \tag{27}
\]
At the optimum, we have $\nabla_u L(u,v,B,\gamma) = 0$ and $\nabla_v L(u,v,B,\gamma) = 0$, which yields
\[
u = \frac{1}{2\lambda_2}\Big(\hat w_\mu - \sum_{i=1}^{\ell}\gamma_i\phi_X(\tilde x_i)\Big), \qquad v = \frac{1}{2\lambda_2}\Big(\hat w_\nu - \sum_{i=1}^{\ell}\gamma_i\phi_Y(\tilde y_i)\Big). \tag{28}
\]
Let us now derive the optimality condition on $B$: we have
\[
\sup_{B \in S_+(\mathbb R^\ell)} -\sum_{i=1}^{\ell}\gamma_i\Phi_i^\top B\,\Phi_i - \lambda_1\operatorname{Tr} B = \sup_{B\in S_+(\mathbb R^\ell)} \Big\langle B,\, -\Big(\sum_{i=1}^{\ell}\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell\Big)\Big\rangle = \begin{cases} 0 & \text{if } \sum_{i=1}^{\ell}\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell \succeq 0,\\ -\infty & \text{otherwise.}\end{cases} \tag{29}
\]
Plugging Eq. (28) and Eq. (29) into Eq. (27), we get (5). Finally, using Eq. (28), we have
\[
\widehat{\mathrm{OT}} = \langle \hat u, \hat w_\mu\rangle_{\mathcal H_X} + \langle \hat v, \hat w_\nu\rangle_{\mathcal H_Y} = \frac{q}{2\lambda_2} - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\hat\gamma_j\big(\hat w_\mu(\tilde x_j) + \hat w_\nu(\tilde y_j)\big). \tag{30}
\]

To solve (5), it is possible to use standard software packages (Boyd and Vandenberghe, 2004). Alternatively, the procedure can be made more scalable by adding a self-concordant barrier term to (5) and using interior point methods with Newton steps. For a given barrier penalization $\delta > 0$, we thus aim to solve
\[
\min_{\gamma\in\mathbb R^\ell}\; \frac{1}{4\lambda_2}\gamma^\top Q\gamma - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\gamma_j z_j + \frac{q}{4\lambda_2} - \frac{\delta}{\ell}\log\det\Big(\sum_{i=1}^{\ell}\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell\Big) \quad\text{such that}\quad \sum_{j=1}^{\ell}\gamma_j\Phi_j\Phi_j^\top + \lambda_1 I_\ell \succeq 0. \tag{31}
\]
Starting from an initial value $\delta_0$, the barrier method (Nemirovski, 2004) consists in iteratively solving Eq. (31) (using Newton iterations) and progressively decreasing $\delta$. In Theorems 16 and 17, we precisely analyze the complexity of the barrier method applied to Eq. (5), and bound the number of operations required to obtain an estimator of OT with a desired accuracy.

Theorem 16. Using a dual interior point method, a solution of problem (3) with value precision $O(\tau)$ can be obtained in $O(C + E\ell + \ell^{3.5}\log(\ell/\tau))$ operations and $O(\ell^2)$ memory, where $E$ is the cost of querying $\hat w_\mu$ and $\hat w_\nu$, and $C$ is the cost of computing $q$.

Proof of Theorem 16.
Removing terms that are constant in $\gamma$, problem (31) is equivalent to minimizing the dual functional
\[
J(\gamma) \stackrel{\text{def}}{=} \frac{1}{4\lambda_2}\gamma^\top Q\gamma - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\gamma_j z_j - \frac{\delta}{\ell}\log\det\Big(\sum_{i=1}^{\ell}\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell\Big). \tag{32}
\]
Its gradient is
\[
J'(\gamma)_i = \frac{1}{2\lambda_2}(Q\gamma)_i - \frac{1}{2\lambda_2}z_i - \frac{\delta}{\ell}\,\Phi_i^\top\big(\Phi\,\mathrm{diag}(\gamma)\,\Phi^\top + \lambda_1 I_\ell\big)^{-1}\Phi_i, \tag{33}
\]
and its Hessian is
\[
J''(\gamma)_{ij} = \frac{1}{2\lambda_2}Q_{ij} + \frac{\delta}{\ell}\,\big[\Phi_i^\top\big(\Phi\,\mathrm{diag}(\gamma)\,\Phi^\top + \lambda_1 I_\ell\big)^{-1}\Phi_j\big]^2. \tag{34}
\]
From there, we may minimize $J(\gamma)$ using damped Newton iterations
\[
\gamma' = \gamma - \frac{[J''(\gamma)]^{-1}J'(\gamma)}{1 + \sqrt{\tfrac{\ell}{\delta}\,\lambda(\gamma)}}, \tag{35}
\]
where $\lambda(\gamma) = J'(\gamma)^\top[J''(\gamma)]^{-1}J'(\gamma)$ is the Newton decrement, or using backtracking line-search Newton iterations (Boyd and Vandenberghe, 2004).

Number of iterations. $J''(\gamma)$ can be computed and inverted in $O(\ell^3)$ operations and, assuming $\hat w_\mu(\tilde x_i)$, $i \in [\ell]$, and $\hat w_\nu(\tilde y_i)$, $i \in [\ell]$, are precomputed, $J'(\gamma)$ can be computed in $O(\ell^3)$ operations; hence the complexity per iteration is $O(\ell^3)$. Let $F(\gamma) \stackrel{\text{def}}{=} \frac{1}{4\lambda_2}\gamma^\top Q\gamma - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\gamma_j z_j + \frac{q}{4\lambda_2}$ be the objective function of (5). Since $H(\gamma) \stackrel{\text{def}}{=} -\log\det\big(\sum_{i=1}^{\ell}\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell\big)$ is a self-concordant barrier function of concordance parameter $\ell$, standard results on barrier methods imply that $\delta$ controls the deviation (in value) to the optimum of $F$ (Nemirovski, 2004). Moreover, a solution $\tilde\gamma$ to (5) of precision $\tau > 0$, i.e., satisfying $F(\tilde\gamma) - F(\hat\gamma) \le \tau$ where $\hat\gamma$ is the optimum of (5), can be computed in $O(\sqrt\ell\,\log(\ell/\tau))$ Newton iterations using an interior point method, by progressively decreasing $\delta$ with a suitable scheme until $\delta \le \tau$: see (Nemirovski, 2004). Hence, taking into account the $O(E\ell)$ cost of computing $z_j$, $j = 1,\ldots,\ell$, and the $O(C)$ cost required to compute $q$, achieving a precision $\tau > 0$ requires a total of $O(C + E\ell + \ell^{3.5}\log(\ell/\tau))$ operations and $O(\ell^2)$ memory.

In Theorem 16, we may use any of the kernel mean estimators presented in Section 6 and apply the corresponding computational costs $C$ and $E$. However, the given bounds only apply to the precision in value, i.e., on $\hat V_{\lambda_1,\lambda_2}(\hat u,\hat v,\hat A) - \hat V_{\lambda_1,\lambda_2}(u,v,A)$, and not on the solutions $(u,v,A)$ themselves. In Theorem 17, we derive bounds on the algorithmic approximation of the estimator in (6) obtained by minimizing Eq. (31) to a precision $\tau > 0$, as a function of its computational complexity.
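For concreteness, here is a minimal NumPy sketch of one damped Newton step on the barrier objective of Eqs. (32)-(35). It is only an illustration under our own conventions (the rows of `Phi` stand for the vectors $\Phi_i$, and `Q`, `z`, `lam1`, `lam2`, `delta` are assumed to be given), not the implementation used for the experiments.

```python
import numpy as np

def damped_newton_step(gamma, Q, z, Phi, lam1, lam2, delta):
    """One damped Newton step on J(gamma) of Eq. (32), with the gradient and
    Hessian of Eqs. (33)-(34) and the damping of Eq. (35)."""
    l, k = Phi.shape
    M = Phi.T @ (gamma[:, None] * Phi) + lam1 * np.eye(k)   # sum_i gamma_i Phi_i Phi_i^T + lam1 * I
    W = Phi @ np.linalg.solve(M, Phi.T)                      # W_ij = Phi_i^T M^{-1} Phi_j
    grad = (Q @ gamma - z) / (2 * lam2) - delta / l * np.diag(W)
    hess = Q / (2 * lam2) + delta / l * W ** 2               # elementwise square of W
    step = np.linalg.solve(hess, grad)
    decrement = max(float(grad @ step), 0.0)                 # Newton decrement lambda(gamma)
    gamma_next = gamma - step / (1.0 + np.sqrt(l / delta * decrement))
    return gamma_next, decrement
```

Iterating this step until the decrement is small, and then shrinking $\delta$, mirrors the barrier scheme analyzed above.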
Theorem 17. Under the same notation and assumptions as in Theorem 9, let $\tilde\gamma$ be obtained by running a barrier method on (5) to precision $\tau > 0$, i.e., by iteratively solving (31) and progressively decreasing $\delta$ until $\delta \le \tau$, as described by Nemirovski (2004). Define $\widetilde{\mathrm{OT}}$ as
\[
\widetilde{\mathrm{OT}} = \frac{q}{2\lambda_2} - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\tilde\gamma_j\big(\hat w_\mu(\tilde x_j) + \hat w_\nu(\tilde y_j)\big).
\]
Then $\widetilde{\mathrm{OT}}$ satisfies the following bound:
\[
|\widetilde{\mathrm{OT}} - \mathrm{OT}| \le 6\lambda_2\big(\|u^\star\|^2_{\mathcal H_X} + \|v^\star\|^2_{\mathcal H_Y}\big) + \frac{6\beta^2}{\lambda_2} + 6\lambda_1\operatorname{Tr} A^\star + 6\lambda_2\tau. \tag{36}
\]
Further, the considered algorithm to compute $\widetilde{\mathrm{OT}}$ has a cost of $O(C + E\ell + \ell^{3.5}\log(\ell/\tau))$ in time and $O(\ell^2)$ in memory.
Proof of Theorem 17.
Let $\tilde\gamma$ be obtained by running a barrier method on Eq. (5) to precision $\tau$, i.e., by solving (31) and decreasing $\delta$ until $\delta \le \tau$. Then, from properties of barrier methods (Nesterov and Nemirovskii, 1994), we can associate to $\tilde\gamma$ the following primal feasible points, obtained by nullifying the Lagrangian of (31) at $\tilde\gamma$:
\[
\tilde u = \frac{1}{2\lambda_2}\Big(\hat w_\mu - \sum_{i=1}^{\ell}\tilde\gamma_i\phi_X(\tilde x_i)\Big), \qquad \tilde v = \frac{1}{2\lambda_2}\Big(\hat w_\nu - \sum_{i=1}^{\ell}\tilde\gamma_i\phi_Y(\tilde y_i)\Big), \qquad \tilde A = \sum_{i,j=1}^{\ell} B_{ij}\,\phi(\tilde x_i,\tilde y_i)\otimes\phi(\tilde x_j,\tilde y_j), \tag{37}
\]
with $B = \frac{\delta}{\ell}\big(\sum_{i=1}^{\ell}\tilde\gamma_i\Phi_i\Phi_i^\top + \lambda_1 I_\ell\big)^{-1}$. In particular, from properties of interior point methods (Nemirovski, 2004; Nesterov and Nemirovskii, 1994), we have

(i) $(\tilde u,\tilde v,\tilde A)$ is a feasible point of (3);

(ii) the duality gap between the objective $\hat V_{\lambda_1,\lambda_2}$ of (3) evaluated at $(\tilde u,\tilde v,\tilde A)$ and the objective $F$ of (5) at $\tilde\gamma$ is equal to $\delta$, i.e., $F(\tilde\gamma) - \hat V_{\lambda_1,\lambda_2}(\tilde u,\tilde v,\tilde A) = \delta$.

Further, note that we have
\[
\widetilde{\mathrm{OT}} \stackrel{\text{def}}{=} \frac{q}{2\lambda_2} - \frac{1}{2\lambda_2}\sum_{j=1}^{\ell}\tilde\gamma_j\big(\hat w_\mu(\tilde x_j) + \hat w_\nu(\tilde y_j)\big) = \langle \tilde u, \hat w_\mu\rangle_{\mathcal H_X} + \langle \tilde v, \hat w_\nu\rangle_{\mathcal H_Y}.
\]
Let us now bound $\widetilde{\mathrm{OT}} - \mathrm{OT}$. We follow arguments similar to those in the proof of Theorem 9, with an additional $\tau$ precision term. As in the proof of Theorem 9, let $(\tilde u_\varepsilon, \tilde v_\varepsilon) = (\tilde u - \varepsilon/2, \tilde v - \varepsilon/2)$. Since $(\tilde u,\tilde v,\tilde A)$ satisfies the constraints of (3), by the same arguments $(\tilde u_\varepsilon,\tilde v_\varepsilon)$ defines a feasible point for (1). Hence, we have the equivalent of Eq. (16):
\[
\mathrm{OT} = V(u^\star,v^\star) \ge V(\tilde u_\varepsilon,\tilde v_\varepsilon) = V(\tilde u,\tilde v) - \varepsilon \ge \widetilde{\mathrm{OT}} - \Delta(\tilde u,\tilde v) - \varepsilon. \tag{38}
\]
Next, let $\hat\gamma$ be the optimum of (5), and $F$ the objective function of (5). By strong duality (see Theorem 15), we have $F(\hat\gamma) = \hat V_{\lambda_1,\lambda_2}(\hat u,\hat v,\hat A)$. Further, by optimality of $\hat\gamma$, we have $F(\hat\gamma) \le F(\tilde\gamma)$. Hence, using (ii) and $\delta \le \tau$, we have
\begin{align*}
\hat V_{\lambda_1,\lambda_2}(\tilde u,\tilde v,\tilde A) = F(\tilde\gamma) - \delta \ge F(\hat\gamma) - \delta = \hat V_{\lambda_1,\lambda_2}(\hat u,\hat v,\hat A) - \delta &\ge \hat V_{\lambda_1,\lambda_2}(u^\star,v^\star,A^\star) - \tau\\
&= V(u^\star,v^\star) - \big[V(u^\star,v^\star) - \hat V_{\lambda_1,\lambda_2}(u^\star,v^\star,A^\star)\big] - \tau\\
&\ge \mathrm{OT} - \Delta(u^\star,v^\star) - \big[\lambda_1\operatorname{Tr}(A^\star) + \lambda_2 R^2(u^\star,v^\star)\big] - \tau.
\end{align*}
From there, the rest of the proof of Theorem 9 follows identically: developing $\hat V_{\lambda_1,\lambda_2}(\tilde u,\tilde v,\tilde A)$ and combining with Eq. (38), we have
\[
\lambda_1\operatorname{Tr}(\tilde A - A^\star) + \lambda_2\big(R^2(\tilde u,\tilde v) - R^2(u^\star,v^\star)\big) - \Delta(u^\star,v^\star) - \tau \le \widetilde{\mathrm{OT}} - \mathrm{OT} \le \Delta(\tilde u,\tilde v) + \varepsilon.
\]
Hence
\[
\lambda_1\operatorname{Tr}(\tilde A) + \lambda_2 R^2(\tilde u,\tilde v) \le \lambda_1\operatorname{Tr}(A^\star) + \lambda_2 R^2(u^\star,v^\star) + \Delta(u^\star,v^\star) + \Delta(\tilde u,\tilde v) + \varepsilon + \tau.
\]
Therefore, replacing $S$ from the proof of Theorem 9 with $S' := S + \tau/\lambda_2$, and $\hat u,\hat v,\hat A$ with $\tilde u,\tilde v,\tilde A$, the rest of the proof follows and eventually yields the bound of Eq. (36). To conclude, note that, as a consequence of Theorem 16,
$\widetilde{\mathrm{OT}}$ can be computed in $O(C + E\ell + \ell^{3.5}\log(\ell/\tau))$ operations and $O(\ell^2)$ memory.

Recovering unregularized optimal transport.
In Eq. (3) and in the case of empirical estimators $\hat w_\mu$ and $\hat w_\nu$ (see Section 6), we considered the case where the sample pairs $(\tilde x_i, \tilde y_i)$, $i \in [\ell]$, covering $X \times Y$ are distinct from the samples $x_i \sim \mu$, $i \in [n_\mu]$, and $y_j \sim \nu$, $j \in [n_\nu]$. However, covering $X \times Y$ with the $n_\mu n_\nu$ pairs given by the $\mu$ and $\nu$ samples $(x_i, y_j)$, $i \in [n_\mu]$, $j \in [n_\nu]$, we may rewrite (5) as a regularized optimal transport problem:
\[
\begin{aligned}
\min_{\Gamma \in \mathbb R^{n_\mu\times n_\nu}} \;&\sum_{ij}\Gamma_{ij}\,c(x_i,y_j) + \frac{1}{2\lambda_2}\,r^\top K_X r + \frac{1}{2\lambda_2}\,c^\top K_Y c\\
\text{s.t.}\;& \sum_{\substack{i=1,\ldots,n_\mu\\ j=1,\ldots,n_\nu}}\Gamma_{ij}\,\Phi_{ij}\Phi_{ij}^\top + \lambda_1 I_{n_\mu n_\nu} \succeq 0,\\
& r_i = \frac{1}{n_\mu} - \sum_{j=1}^{n_\nu}\Gamma_{ij}, \quad i \in [n_\mu],\\
& c_j = \frac{1}{n_\nu} - \sum_{i=1}^{n_\mu}\Gamma_{ij}, \quad j \in [n_\nu],
\end{aligned} \tag{39}
\]
where we reindexed $\Phi_p$, $p \in [n_\mu n_\nu]$, as $\Phi_{ij}$, $i \in [n_\mu]$, $j \in [n_\nu]$, and where $(K_X)_{ij} = k_X(x_i,x_j)$, $i,j \in [n_\mu]$, and $(K_Y)_{ij} = k_Y(y_i,y_j)$, $i,j \in [n_\nu]$. Hence, (39) can be interpreted as a regularized optimal transport problem, where $\Gamma \in \mathbb R^{n_\mu\times n_\nu}$ plays the role of the transportation plan, and the marginal violations are penalized with maximum mean discrepancy (MMD) terms (Gretton et al., 2012). When $\lambda_1$ goes to $0$, the SDP constraint becomes a positivity constraint $\Gamma \in \mathbb R^{n_\mu\times n_\nu}_+$, and when $\lambda_2$ goes to $0$, the MMD penalization terms enforce the hard constraints $\Gamma \mathbf 1_{n_\nu} = \frac{1}{n_\mu}\mathbf 1_{n_\mu}$ and $\Gamma^\top \mathbf 1_{n_\mu} = \frac{1}{n_\nu}\mathbf 1_{n_\nu}$. In particular, when $(\lambda_1,\lambda_2) \to (0,0)$, we recover the unregularized OT problem between the empirical measures $\hat\mu = \frac{1}{n_\mu}\sum_{i=1}^{n_\mu}\delta_{x_i}$ and $\hat\nu = \frac{1}{n_\nu}\sum_{j=1}^{n_\nu}\delta_{y_j}$, which can be formally verified by deriving the dual of (3) with $\lambda_1$ and/or $\lambda_2$ equal to $0$.
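As a small illustration of this MMD reading of Eq. (39) (our own toy example; the identity matrices stand in for the kernel Gram matrices $K_X$, $K_Y$), the snippet below evaluates the two marginal penalty terms for a candidate plan $\Gamma$.

```python
import numpy as np

def mmd_marginal_penalties(Gamma, K_X, K_Y):
    """Marginal-violation penalties of Eq. (39): with r_i = 1/n_mu - sum_j Gamma_ij
    and c_j = 1/n_nu - sum_i Gamma_ij, the penalties are r^T K_X r and c^T K_Y c."""
    n_mu, n_nu = Gamma.shape
    r = 1.0 / n_mu - Gamma.sum(axis=1)
    c = 1.0 / n_nu - Gamma.sum(axis=0)
    return float(r @ K_X @ r), float(c @ K_Y @ c)

# For a plan with exact marginals (e.g. the independent coupling) both penalties vanish.
n_mu, n_nu = 4, 5
Gamma = np.full((n_mu, n_nu), 1.0 / (n_mu * n_nu))
print(mmd_marginal_penalties(Gamma, np.eye(n_mu), np.eye(n_nu)))  # -> (0.0, 0.0)
```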
G Numerical Experiments
Figure 2: Effect of increasing the number of filling samples $\ell$ on the transportation map (true map, inferred map and filling samples). (left): $\ell = 50$, $n_\mu = n_\nu = 25$; (middle): $\ell = 100$, $n_\mu = n_\nu = 25$; (right): $\ell = 200$, $n_\mu = n_\nu = 25$.

Figure 3: Effect of increasing the number of filling samples $\ell$ on the constraint model (cost $c(x,y)$, potentials $u(x)$, $v(y)$ and inferred map). (left): $\ell = 50$, $n_\mu = n_\nu = 25$; (middle): $\ell = 100$, $n_\mu = n_\nu = 25$; (right): true function.
1D transportation maps and dual constraint functions.
We illustrate the algorithm described in Theorem 16 in a 1D setting in Figures 2 to 5, by representing the inferred transportation map $\hat T$ obtained from $\hat u$, defined as $\hat T(x) = x - \nabla_x \hat u(x)$, and the corresponding constraint function $\hat h(x,y) = \|x-y\|^2 - \hat u(x) - \hat v(y)$. We sample $x_1,\ldots,x_{n_\mu}$ i.i.d. from $\mu$ and $y_1,\ldots,y_{n_\nu}$ i.i.d. from $\nu$, use quasi-random samples $(\tilde x_1,\tilde y_1),\ldots,(\tilde x_\ell,\tilde y_\ell)$ from a 2D Sobol sequence (Sobol, 1967), and illustrate the effect of varying $n$, the number of $\mu$ and $\nu$ samples, and $\ell$, the number of space-filling samples. For $k_X$, $k_Y$ and $k_{XY}$, we use Gaussian kernels $\exp(-\|x-y\|^2/\sigma^2)$ of fixed bandwidth $\sigma$, and scale the regularization parameters as $\lambda_1 \propto 1/\ell$ and $\lambda_2 \propto 1/\sqrt n$.

Figure 4: Effect of increasing the number of $\mu$ and $\nu$ samples on the transportation map (true map, inferred map and filling samples). (left): $\ell = 100$, $n_\mu = n_\nu = 10$; (middle): $\ell = 100$, $n_\mu = n_\nu = 25$; (right): $\ell = 100$, $n_\mu = n_\nu = 50$.

Figure 5: Effect of increasing the number of $\mu$ and $\nu$ samples on the constraint model. (left): $\ell = 100$, $n_\mu = n_\nu = 10$; (middle): $\ell = 100$, $n_\mu = n_\nu = 50$; (right): true function.
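To make the plotted quantities concrete, the following sketch (our own illustration; the arrays `x_mu`, `x_fill`, `gamma` and the value of `lam2` are placeholders, and we assume the one-dimensional Gaussian kernel of the experiments) shows how the inferred map $\hat T(x) = x - \hat u'(x)$ is evaluated from a dual solution through Eq. (28).

```python
import numpy as np

def gauss_k_dx(a, b, sigma):
    # derivative of k(a, b) = exp(-(a - b)^2 / sigma^2) with respect to a
    return -2.0 * (a - b) / sigma ** 2 * np.exp(-(a - b) ** 2 / sigma ** 2)

def inferred_map(x, x_mu, x_fill, gamma, lam2, sigma=0.5):
    """T_hat(x) = x - u_hat'(x), where by Eq. (28)
    u_hat = (w_mu_hat - sum_i gamma_i k(., x_fill_i)) / (2 * lam2)
    and w_mu_hat is the empirical mean embedding of the mu-samples x_mu."""
    w_mu_dx = gauss_k_dx(x[:, None], x_mu[None, :], sigma).mean(axis=1)
    expansion_dx = gauss_k_dx(x[:, None], x_fill[None, :], sigma) @ gamma
    u_dx = (w_mu_dx - expansion_dx) / (2.0 * lam2)
    return x - u_dx
```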
Convergence of $\widehat{\mathrm{OT}}$ to OT.
In Figure 6, we compare $\widehat{\mathrm{OT}}$ to the sampled optimal transport estimator on two 4D truncated Gaussian distributions $\mu$ and $\nu$ such that the optimal transportation map from one to the other is linear. We progressively increase the number of $\mu$ and $\nu$ samples, averaging over random draws for each number of samples. The number of filling sample pairs $(\tilde x_i, \tilde y_i)$ is $\ell = 100 + n$, where $n = n_\mu = n_\nu$ is the number of samples from $\mu$ and $\nu$. We select the best estimator $\widehat{\mathrm{OT}}$ using a grid search on $(\lambda_1, \lambda_2)$. As such, this simulation does not provide a method for selecting those parameters, but rather illustrates that a good pair of parameters exists.
[Figure 6: absolute error of the Kernel SoS OT estimator and of the sampled OT estimator, plotted against the number of samples (20 to 140).]
Figure 6: Convergence on 4D truncated Gaussian data with an increasing number of samples (left). Full lines correspond to the average mean absolute error (MAE), shaded areas to the lower and upper MAE quantiles. The parameters $\lambda_1$, $\lambda_2$