Efficient Discretizations of Optimal Transport
Junqi Wang, Pei Wang, Patrick Shafto
Department of Mathematics & Computer Science, Rutgers University, Newark, US. Correspondence to: Pei Wang <[email protected]>.

Abstract
Obtaining solutions to Optimal Transportation (OT) problems is typically intractable when the marginal spaces are continuous. Recent research has focused on approximating continuous solutions with discretization methods based on i.i.d. sampling, and has proven convergence as the sample size increases. However, obtaining OT solutions with large sample sizes requires intensive computational effort that can be prohibitive in practice. In this paper, we propose an algorithm for calculating discretizations with a given number of points for the marginal distributions, obtained by minimizing the (entropy-regularized) Wasserstein distance, which results in plans that are comparable to those obtained with much larger numbers of i.i.d. samples. Moreover, we propose a local version of such discretizations that is parallelizable for large-scale applications. We prove bounds for our approximation and demonstrate performance on a wide range of problems.
1. Introduction
Optimal transport is the problem of finding a coupling of probability distributions that minimizes cost (Kantorovich, 2006), and is a technique applied across various fields and literatures (Peyré & Cuturi, 2019; Villani, 2008). Although many methods exist for obtaining optimal transference plans for distributions on discrete spaces, computing the plans is not generally possible for continuous spaces (Janati et al., 2020). Given the prevalence of continuous spaces in machine learning, this is a significant limitation for theoretical and practical applications.

One strategy for approximating continuous OT plans is based on discrete approximation, e.g. via samples. Recent research has provided guarantees on the fidelity of discrete, sample-based approximations to continuous OT as N → ∞ (Aude et al., 2016). Specifically, by sampling large numbers of points S_i from each marginal, one may compute a discrete optimal transference plan on S_1 × S_2, with the cost matrix derived from pointwise evaluation of the cost function on S_1 × S_2.

Even in the discrete case, obtaining minimal cost plans is computationally challenging. For example, Sinkhorn scaling, which computes an entropy-regularized approximation to OT plans, has a complexity that scales with |S_1 × S_2| (Allen-Zhu et al., 2017). Though many comparable methods exist (Lin et al., 2019), all have complexity that scales with the product of sample sizes, and all require construction of the cost matrix, which also scales with |S_1 × S_2|.

Building on previous sample-based approaches, we develop methods for optimizing small-N approximations of OT plans. In Sec. 2, we formulate the problem of fixed size approximation and reduce it to discretization problems on the marginals, with theoretical guarantees. In Sec. 3, the gradient of the entropy-regularized Wasserstein distance between a continuous distribution and its discretization is calculated. In Sec. 4, we present a stochastic gradient descent algorithm based on optimization of the locations and weights of the points, with empirical test results. Sec. 5 introduces a parallelizable algorithm via decompositions of the marginal spaces, which can exploit the intrinsic geometry. In Sec. 6, we analyze time and space complexity and provide a comparison with naive sampling.
2. Optimal Discretizations
Let (X, d_X), (Y, d_Y) be compact Polish spaces (complete separable metric spaces), let µ ∈ P(X), ν ∈ P(Y) be probability distributions on their Borel algebras, and let c : X × Y → R be a cost function. Denote the set of all joint probability measures on X × Y with marginals µ and ν by Π(µ, ν). For a given cost function c, the optimal transference (OT) plan between µ and ν is defined as (Kantorovich, 2006):

γ(µ, ν) := argmin_{π ∈ Π(µ,ν)} ⟨c, π⟩,

where ⟨c, π⟩ := ∫_{X×Y} c(x, y) dπ(x, y). When X = Y, if the cost is c(x, y) = d_X^k(x, y), then W_k(µ, ν) = ⟨c, γ(µ, ν)⟩^{1/k} defines the k-Wasserstein distance between µ and ν for k ≥ 1. Here d_X^k(x, y) is the k-th power of the metric d_X on X.

Entropy-regularized optimal transport (EOT) (Cuturi, 2013; Aude et al., 2016) was introduced to ease the computational burden of obtaining OT plans:

γ_λ(µ, ν) := argmin_{π ∈ Π(µ,ν)} ⟨c, π⟩ + λ KL(π ‖ µ ⊗ ν),    (1)

where λ > 0 is a regularization parameter and KL(π ‖ µ ⊗ ν) := ∫ log(dπ / d(µ ⊗ ν)) dπ is the Kullback-Leibler divergence. EOT smooths the classical OT into a strictly convex problem; hence, given (µ, ν, c), there exists a unique solution to (1). After X and Y are discretized, EOT problems can be solved by Sinkhorn iteration (Sinkhorn & Knopp, 1967), which can be easily parallelized. However, for large-scale discrete spaces, the computational cost can still be unfeasible (Allen-Zhu et al., 2017). Even worse, to apply Sinkhorn iteration at all, one must have a cost matrix, which can itself be a non-trivial computational burden to obtain; in some cases, for example where the cost is derived from a probability model (Wang et al., 2020a), it may require intractable computations (Tran et al., 2017; Overstall et al., 2020).

We propose to efficiently estimate continuous OT with a fixed size discrete approximation. In more detail, let m, n be positive integers, and let µ_m ∈ P(X), ν_n ∈ P(Y) be discrete approximations of µ, ν respectively, with µ_m = Σ_{i=1}^m w_i δ_{x_i} and ν_n = Σ_{j=1}^n u_j δ_{y_j}, where x_i ∈ X, y_j ∈ Y and w_i, u_j ∈ R^+. Then the EOT plan γ_λ(µ, ν) ∈ Π(µ, ν) for the OT problem (µ, ν, c) can be approximated by the EOT plan γ_λ(µ_m, ν_n) ∈ Π(µ_m, ν_n) for the OT problem (µ_m, ν_n, c).

To obtain the best estimate of the OT problem (µ, ν, c) with fixed sizes m, n, we introduce the objective:

Ω_{k,ρ}(µ_m, ν_n) = W_k^k(µ, µ_m) + W_k^k(ν, ν_n) + ρ W_k^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)),    (2)

where W_k^k(µ, µ_m), W_k^k(ν, ν_n) and W_k^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)) are the k-th powers of the k-Wasserstein distances between µ, ν, γ_λ(µ, ν) and their approximations µ_m, ν_n, γ_λ(µ_m, ν_n). The hyperparameter ρ > 0 balances the estimation accuracy over the marginals against that of the transference plan.

To properly compute W_k^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)), a metric d_{X×Y} on X × Y is needed. Without loss of generality, we assume there exists a constant A such that

max{d_X^k(x_1, x_2), d_Y^k(y_1, y_2)} ≤^{(i)} d_{X×Y}^k((x_1, y_1), (x_2, y_2)) ≤^{(ii)} A (d_X^k(x_1, x_2) + d_Y^k(y_1, y_2)).    (3)

For instance, (3) holds when d_{X×Y} is the p-product metric for 1 ≤ p ≤ ∞.
In particular, when p = k = 1, equality holds at (ii) with A = 1; when d_{X×Y} is the ∞-product metric, equality holds at (i).

To efficiently compute (2), the three Wasserstein distances are estimated by their entropy-regularized approximations (Luise et al., 2018):

Ω_{k,ζ,ρ}(µ_m, ν_n) = W_{k,ζ}^k(µ, µ_m) + W_{k,ζ}^k(ν, ν_n) + ρ W_{k,ζ}^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)).    (4)

For instance, W_k^k(µ, µ_m) = ⟨d_X^k, γ(µ, µ_m)⟩ is estimated by W_{k,ζ}^k(µ, µ_m) = ⟨d_X^k, γ_ζ(µ, µ_m)⟩ with regularizer ζ. Here γ_ζ(µ, µ_m) is computed by optimizing

Ŵ_{k,ζ}^k(µ, µ_m) = ⟨d_X^k, γ_ζ(µ, µ_m)⟩ + ζ KL(γ_ζ(µ, µ_m) ‖ µ ⊗ µ_m).

A major drawback of optimizing Ω_{k,ζ,ρ}(µ_m, ν_n) is evaluating the Wasserstein distance between γ_λ(µ, ν) and γ_λ(µ_m, ν_n). In fact, calculating γ_λ(µ, ν) is intractable, which is the original motivation to compute γ_λ(µ_m, ν_n). To overcome this difficulty, utilizing the dual formulation of entropy-regularized OT, we show:

Proposition 2.1.
When X and Y are two compact spaces and c is C^∞, there exists a constant C ∈ R^+ such that

max{W_k^k(µ, µ_m), W_k^k(ν, ν_n)} ≤ W_{k,ζ}^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)) ≤ C [W_{k,ζ}^k(µ, µ_m) + W_{k,ζ}^k(ν, ν_n)].    (5)

Proposition 2.1 indicates that Ω_{k,ζ,ρ}(µ_m, ν_n) can be approximated by W_{k,ζ}^k(µ, µ_m) + W_{k,ζ}^k(ν, ν_n). Thus when the continuous marginals µ and ν are properly approximated, so is the optimal transference plan between them. Therefore, in the next sections we focus on developing algorithms to obtain µ*_m, ν*_n that minimize W_{k,ζ}^k(µ, µ_m) and W_{k,ζ}^k(ν, ν_n).
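After discretization, the plans γ_ζ and the values W_{k,ζ}^k used above are computed by Sinkhorn iteration. The following is a minimal NumPy sketch of that computation for two weighted point sets; the function name sinkhorn_plan and the Euclidean ground cost are our illustrative assumptions, not the paper's code.

import numpy as np

def sinkhorn_plan(x, w, y, u, k=2, zeta=0.1, iters=1000, tol=1e-9):
    """Entropy-regularized OT plan between sum_i w_i d_{x_i} and sum_j u_j d_{y_j}.
    x: (N, d) support of the first measure, w: (N,) weights summing to 1.
    y: (m, d) support of the second measure, u: (m,) weights summing to 1.
    Returns the plan P (N x m) and the cost <d^k, P> = W_{k,zeta}^k estimate."""
    g = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** k  # cost matrix d_X^k
    K = np.exp(-g / zeta)                      # Gibbs kernel
    a, b = np.ones(len(w)), np.ones(len(u))    # Sinkhorn scaling vectors
    for _ in range(iters):
        a_new = w / (K @ b)
        b = u / (K.T @ a_new)
        if np.max(np.abs(a_new - a)) < tol:    # stop when the scalings stabilize
            a = a_new
            break
        a = a_new
    P = a[:, None] * K * b[None, :]            # plan with marginals (w, u)
    return P, np.sum(P * g)

For W_{k,ζ}^k(µ, µ_m) with continuous µ, the pair (x, w) is replaced by a large i.i.d. sample with uniform weights, as in Sec. 4.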
3. Gradient of the Objective Function
Let ν = Σ_{i=1}^m w_i δ_{y_i} be the discrete probability measure playing the role of "µ_m" from the last section. The task now is to minimize W_{k,ζ}^k(µ, ν) over the target discrete probability measure ν on X, where µ is a fixed continuous probability measure on X. The entropy-approximated Wasserstein W_{k,ζ}^k(µ, ν) is convex in the w_i's, while its "convexity" in the y_i's is not defined for a general X.

In this section, we derive the gradient of W_{k,ζ}^k(µ, ν) with respect to ν, following the discrete discussions of (Wang et al., 2020b; Luise et al., 2018), in order to apply stochastic gradient descent (SGD) on both the positions {y_i}_{i=1}^m and their weights {w_i}_{i=1}^m. The SGD on X is carried out either through an exponential map, or by treating X as (part of) a Euclidean space. Proofs are included in the Supplementary Material.
Let g(x, y) := d_X^k(x, y), and denote the joint distribution minimizing Ŵ_{k,ζ}^k by π, with differential form at (x, y_i) written dπ_i(x); this is the plan used to define W_{k,ζ}^k in Section 2. By introducing Lagrange multipliers α ∈ L^∞(X), β ∈ R^m, we have Ŵ_{k,ζ}^k(µ, ν) = max_{α,β} L(µ, ν; α, β), where

L(µ, ν; α, β) = ∫_X α(x) dµ(x) + Σ_{i=1}^m β_i w_i − ζ ∫_X Σ_{i=1}^m w_i E_i(x) dµ(x)

with E_i(x) = e^{(α(x) + β_i − g(x, y_i))/ζ} (see (Aude et al., 2016) or the Supplementary). Letting α*, β* be the argmax, we have

W_{k,ζ}^k(µ, ν) = ∫_X Σ_{i=1}^m g(x, y_i) E*_i(x) w_i dµ(x)

with E*_i(x) = e^{(α*(x) + β*_i − g(x, y_i))/ζ}. Since α′(x) := α(x) + t and β′_i := β_i − t produce the same E_i(x) for any t ∈ R, we fix the representative with β_m = 0; this representative of β (and likewise of β*) is denoted β̄ (similarly β̄*) below, in order to obtain uniqueness and make differentiation possible.

By direct differentiation of W_{k,ζ}^k, we have

∂W_{k,ζ}^k/∂w_i = ∫_X g(x, y_i) E*_i(x) dµ(x) + (1/ζ) ∫_X Σ_{j=1}^m g(x, y_j) (∂α*(x)/∂w_i + ∂β*_j/∂w_i) w_j E*_j(x) dµ(x),    (6)

∇_{y_i} W_{k,ζ}^k = ∫_X ∇_{y_i} g(x, y_i) (1 − g(x, y_i)/ζ) E*_i(x) w_i dµ(x) + (1/ζ) ∫_X Σ_{j=1}^m g(x, y_j) (∇_{y_i} α*(x) + ∇_{y_i} β*_j) w_j E*_j(x) dµ(x).    (7)

With the transference plan dπ_i(x) = w_i E*_i(x) dµ(x) and the derivatives of α*, β*, g(x, y_i) calculated, the gradient of W_{k,ζ}^k can be assembled.

Assume that g is Lipschitz and differentiable almost everywhere (for k ≥ 1 and d_X the Euclidean distance in R^d, differentiability fails to hold only when k = 1 and y_i = x), and that ∇_y g(x, y) is available. The derivatives of α* and β̄* can then be calculated thanks to the Implicit Function Theorem for Banach spaces (see (Accinelli, 2009)).

The maximality of L at α*, β̄* implies that N := ∇_{α,β̄} L |_{(α*, β̄*)} = 0 ∈ (L^∞(X) ⊗ R^{m−1})^∨, i.e., the Fréchet derivative vanishes. Differentiating (in the sense of Fréchet) again with respect to (α, β̄) and y_i, w_i respectively, we get

∇_{(α,β̄)} N = −(1/ζ) [ dµ(x) δ(x, x′)   dπ_j(x′) ; dπ_i(x)   w_i δ_ij ]    (8)

as a bilinear functional on L^∞(X) × R^{m−1} (note that in Eq. (8), the index i of dπ_i cannot be m). The bilinear functional ∇_{(α,β̄)} N is invertible; denote its inverse by M, a bilinear form on (L^∞(X) ⊗ R^{m−1})^∨. The last ingredient for the Implicit Function Theorem is ∇_{w_i, y_i} N:

∇_{w_i} N = ( −(1/w_i) ∫_X (·) dπ_i(x), 0⃗ ),    (9)

∇_{y_i} N = ( (1/ζ) ∫_X (·) ∇_{y_i} g(x, y_i) dπ_i(x), δ_ij (1/ζ) ∫_X ∇_{y_i} g(x, y_i) dπ_i(x) ).    (10)

Then ∇_{w_i, y_i}(α*, β̄*) = −M(∇_{w_i, y_i} N), and the gradient ∇_{w_i, y_i} W_{k,ζ}^k follows. Moreover, we can differentiate Eqs. (6), (7), (8), (9), (10) to get a Hessian of W_{k,ζ}^k in the w_i's and y_i's, provided better differentiability of g(x, y) (which may enable Newton's method, or a mixture of Newton's method and minibatch SGD, to accelerate convergence). More details about the claims, calculations and proofs are provided in the Supplementary Text.
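Eqs. (6)-(7) each consist of a leading term plus correction terms built from the implicit derivatives of α*, β*. As a hedged illustration only, the sketch below evaluates just the leading terms from the discrete plan dπ_i(x) = w_i E*_i(x) dµ(x), for g(x, y) = ‖x − y‖² (k = 2) and a sample-based µ, reusing sinkhorn_plan from the Section 2 sketch; the function name and the omission of the correction terms are our simplifications, not the paper's full method.

import numpy as np

def edot_gradients_first_order(xs, y, w, zeta=0.1):
    """Leading terms of Eqs. (6)-(7) for g(x, y) = ||x - y||^2.
    xs: (N, d) samples approximating mu (uniform weights 1/N).
    y:  (m, d) positions of the discretization, w: (m,) weights.
    Uses the plan P from sinkhorn_plan (Sec. 2 sketch); P[n, i] plays dpi_i(x_n)."""
    N = len(xs)
    P, _ = sinkhorn_plan(xs, np.full(N, 1.0 / N), y, w, k=2, zeta=zeta)
    diff = xs[:, None, :] - y[None, :, :]          # (N, m, d)
    g = np.sum(diff ** 2, axis=-1)                 # (N, m) cost entries
    # First term of Eq. (6): int g(x, y_i) E*_i(x) dmu(x) = <g_i, dpi_i> / w_i
    grad_w = np.sum(g * P, axis=0) / w
    # First term of Eq. (7): grad_{y_i} g = -2 (x - y_i), times (1 - g / zeta)
    coef = (1.0 - g / zeta) * P                    # (N, m)
    grad_y = -2.0 * np.sum(coef[:, :, None] * diff, axis=0)
    return grad_w, grad_y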
4. The Discretization Algorithm
Here we describe an algorithm for Efficient Discretization of Optimal Transport (EDOT) from a distribution µ to µ_m, with integer m a given cardinality of support. In general, µ need not be explicitly accessible, and even if it is accessible, exactly computing the transference plan is not feasible. Therefore, in this construction, we assume µ is given in terms of a random sampler, and apply minibatch stochastic gradient descent (SGD) through a set of N samples independently drawn from µ on each step to approximate µ.

Recall that to calculate the gradient ∇_{µ_m} W_{k,ζ}^k(µ, µ_m) = (∇_{x_i} W_{k,ζ}^k(µ, µ_m), ∇_{w_i} W_{k,ζ}^k(µ, µ_m))_{i=1}^m, we need π_{X,ζ}, the EOT transference plan between µ and µ_m, the cost g = d_X^k on X, and its gradient in the second variable, ∇_{x′} d_X^k(x, x′). From N samples {y_i}_{i=1}^N, we construct µ_N = (1/N) Σ_{i=1}^N δ_{y_i} and calculate the gradients with µ replaced by µ_N as an estimate, whose effectiveness (convergence as N → ∞) follows from (Aude et al., 2016). The procedure is stated in Algorithm 1 (EDOT).

In simulations, we choose k = 2 to reduce complexity in calculating the distance function (the square of the Euclidean distance is quadratic) and its derivatives. The regularizer ζ should be small enough to reduce the error of the EOT estimate W_{k,ζ} of the Wasserstein distance W_k. However, setting ζ too small will make e^{−g(x,y)/ζ}, or its byproducts in the Sinkhorn algorithm, indistinguishable from 0 in double format. Our strategy is to fix a small ζ for X of unit diameter and scale it proportionally to diam(X)^k (see next section).

Algorithm 1
Simple EDOT using minibatch SGD
input: µ, k, m, ζ, batch size N, stopping threshold ε, momentum α, learning-rate decay β.
output: x_i ∈ X, w_i > 0 such that µ_m = Σ_{i=1}^m w_i δ_{x_i}.
initialize: randomly choose Σ_{i=1}^m w_i^{(0)} δ_{x_i^{(0)}}; set t = 0; set learning rate η_t ∝ (1 + βt)^{−1/2} for t > 0.
repeat
  Set t ← t + 1;
  Sample N points {y_j}_{j=1}^N ⊆ X from µ independently;
  Construct µ_N^{(t)} = (1/N) Σ_{j=1}^N δ_{y_j};
  Calculate gradients ∇_x Ŵ_{k,ζ}^k(µ_N^{(t)}, µ_m^{(t)}) and ∇_w Ŵ_{k,ζ}^k(µ_N^{(t)}, µ_m^{(t)});
  Update D_t^x ← α D_{t−1}^x + ∇_x Ŵ_{k,ζ}^k(µ_N^{(t)}, µ_m^{(t)}), D_t^w ← α D_{t−1}^w + ∇_w Ŵ_{k,ζ}^k(µ_N^{(t)}, µ_m^{(t)});
  Update x^{(t)} ← x^{(t−1)} − η_t D_t^x, w^{(t)} ← w^{(t−1)} − η_t D_t^w;
until |∇_x Ŵ_{k,ζ}^k| + |∇_w Ŵ_{k,ζ}^k| < ε;
Set output x_i ← x_i^{(t)}, w_i ← w_i^{(t)}.

We demonstrate the algorithm on three examples: (1) µ_1 is the uniform distribution on X = [0, 1]. (2) µ_2 is a mixture of two truncated normal distributions on X = [0, 1], with PDF f(x) = a φ(x; ξ_1, σ_1) + (1 − a) φ(x; ξ_2, σ_2), where φ(x; ξ, σ) is the density of the truncated normal distribution on [0, 1] with expectation ξ and standard deviation σ (parameter values as shown in Fig. 1). (3) µ_3 is a mixture of two truncated normal distributions on X = [0, 1]², whose two components are products φ(x; ξ, σ) φ(y; ξ′, σ′), with mixture weights as shown in Fig. 1. Let k = 2 and N = 100 for all plots in this section, with ζ fixed to a small value. Fig. 1 (1-3) plots the resulting discretizations µ_m for examples (1-3) (m = 5 in the 1D examples).
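A compact end-to-end sketch of this loop, combining the earlier sinkhorn_plan and edot_gradients_first_order sketches, is given below; the simplex projection of the weights, the concrete step sizes, and the use of only the leading gradient terms are our illustrative assumptions, not the paper's exact implementation.

import numpy as np

def edot(sampler, m, steps=3000, N=100, zeta=0.1, alpha=0.9, beta=0.5):
    """Minibatch-SGD EDOT sketch: returns positions y (m, d) and weights w (m,).
    sampler(n) must return an (n, d) array of i.i.d. draws from mu."""
    y = sampler(m)                                  # initialize positions from mu
    w = np.full(m, 1.0 / m)
    Dy, Dw = np.zeros_like(y), np.zeros_like(w)
    for t in range(1, steps + 1):
        xs = sampler(N)                             # minibatch mu_N
        gw, gy = edot_gradients_first_order(xs, y, w, zeta=zeta)  # Sec. 3 sketch
        eta = 0.1 * (1.0 + beta * t) ** -0.5        # decaying learning rate
        Dy, Dw = alpha * Dy + gy, alpha * Dw + gw   # momentum buffers
        y -= eta * Dy
        w -= eta * Dw
        w = np.clip(w, 1e-8, None); w /= w.sum()    # keep weights a probability vector
    return y, w

# Example: 5-point EDOT for the uniform distribution on [0, 1]
y, w = edot(lambda n: np.random.rand(n, 1), m=5)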
Convergence. There is no guarantee that Algorithm 1 converges to the globally optimal positions (though if it stabilizes at some positions, the weights are optimized). On these examples we usually iterate 5000 to 6000 steps to get a result (the gradients become small, but are still affected by the fluctuation caused by the sample size N). We estimate the entropy-regularized Wasserstein distance W_{k,ζ}^k between µ and µ_m by Richardson extrapolation (see Supplementary for details) and use it to score the convergence rate. Fig. 1 (6) illustrates the convergence of W_{k,ζ}^k(µ, µ_m) versus SGD steps on example 2 (1D truncated normal mixture), with µ_m obtained by 5-point EDOT. Generally, the Wasserstein distance decays and gets close to its final value after tens of steps, while slight changes can still be observed (in the 1D examples) even after 3000 steps. Sometimes an ill-positioned initial distribution can slow the approach to a stationary state.

Figure 1. (1-3) Sample discretizations of examples (1-3). In (3), the x-axis and y-axis are the 2D coordinates, and the probability density of µ and the weights of µ_m are encoded by color. (4, 5) Comparisons with i.i.d. sampling on Wasserstein distance to the continuous distribution; EDOT is calculated for m ranging from 3 to 7 (up to 8 in example 2), the four boundary curves of the shaded region are percentile curves, and the orange line marks the level of m = 5. (6) The entropy-regularized Wasserstein distance W_{k,ζ}^k(µ, µ_m) versus SGD steps on example 2 (1D truncated normal mixture), with µ_m optimized by 5-point EDOT.
Comparison with Naive Sampling. Fig. 1 (4-5) plots the entropy-regularized Wasserstein distance W_{k,ζ}^k(µ, µ_m) for different choices of m on Examples 1 and 2. Here the µ_m's are generated by two methods: (a) Algorithm 1, with m ranging from 3 to 7 in Example 1 and up to 8 in Example 2, shown by ×'s; (b) naive sampling (i.i.d. with equal weights), simulated with 20000 Monte Carlo runs for each size from 3 to 200. Fig. 1 (4-5) demonstrates the effectiveness of EDOT: as indicated by the orange horizontal dashed line, even 5 EDOT points in these two examples beat 95% of naive samplings of size 40 and 75% of naive samplings of size over 100.

In Fig. 2, we illustrate an application of EDOT to an optimal transport task as proposed in Section 2. The optimal transport problem has X = Y = [0, 1], and the marginal distributions µ, ν are truncated normal (mixtures), where µ has 2 components (shown on the left) and ν has only 1 component (shown on the top). The cost function is taken as the square of the Euclidean distance on an interval.
Figure 2. Approximation of the transference plan. Left: EDOT on a small grid; Right: naive sampling on a larger grid; the entropy-regularized distances W_{k,ζ}^k(µ, µ_m), W_{k,ζ}^k(ν, ν_n) and W_{k,ζ}^k(γ, γ_{m,n}) are smaller for EDOT in all three cases. In both figures, the background grayscale density represents the true EOT density.

We can see that in the EDOT example, the high-density area of the transference plan is correctly covered by lattice points with high weights, while for naive sampling, even on a larger grid, the lattice points with the highest weights miss the region where the true transference plan has most of its density. Comparison of the Wasserstein distances also shows that EDOT works better with a smaller grid.
5. Methods of Improvement
The computational cost of simple EDOT increases with the dimensionality and diameter of the underlying space. A larger-m discretization µ_m = Σ_{i=1}^m w_i δ_{y_i} is needed to capture higher dimensional distributions. This increases the number of SGD parameters for the gradient of W_{k,ζ}^k: md for the positions y_i and m − 1 for the weights w_i. Such an increase both raises the complexity of each step and requires more steps for SGD to converge. Furthermore, the calculation has higher complexity (O(mN) for each normalization in Sinkhorn).

We propose to reduce the computational complexity by "divide and conquer". The Wasserstein distance takes the k-th power of the distance function, d_X^k, as its cost function. The locality of the distance d_X makes the solution to the OT / EOT problem local, meaning probability mass is more likely to be transported to a close destination than to a remote one. Thus we can divide and conquer: cut the space X into small cells and solve the discretization problem separately on each. We require an adaptive dividing procedure to balance accuracy and computational intensity among the cells. Therefore, after fixing a partition X = X_1 ⊔ · · · ⊔ X_I, we must determine the size of the discretization and a proper regularizer for each cell. We first sample a large sample set S ∼ µ (with |S| ≫ N) and partition S into S_1 ⊔ · · · ⊔ S_I according to the partition of X. In other words, to construct an EDOT problem for each X_i and the sample set S_i located in X_i, we must choose the parameters m_i and ζ_i.

Choosing Size m_i. An appropriate choice of m_i balances the contributions to the Wasserstein distance among the subproblems, as follows. Let X_i be (part of) a manifold of dimension d, let diam(X_i) be its diameter, and let p_i = µ(X_i) be the probability of X_i. The entropy-regularized Wasserstein distance can be estimated as W_{k,ζ}^k = O(p_i m_i^{−k/d} diam(X_i)^k) (Weed et al., 2019; Dudley, 1969). The contribution to W_{k,ζ}^k(µ, µ_m) per point in the support of µ_m is then O(p_i m_i^{−(k+d)/d} diam(X_i)^k). Therefore, to balance each point's contribution to the Wasserstein distance among the divided subproblems, we set (see the allocation sketch below)

m_i ≈ m · (p_i diam(X_i)^k)^{d/(k+d)} / Σ_{j=1}^I (p_j diam(X_j)^k)^{d/(k+d)}.

Since m_i must be an integer, it is rounded properly. If some cell has m_i = 0, its probability is added to adjacent cells.

Adjusting Regularizer ζ_i. In the calculation of W_{k,ζ}^k, Sinkhorn iterations on e^{−g(x,y)/ζ} are performed. Therefore, ζ should scale with g(x, y) (i.e., d_X^k) so that the transference plan is not affected by a rescaling of d_X. Precisely, we may choose ζ_i = diam(X_i)^k ζ_0 for some constant ζ_0.
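As a concrete illustration of the allocation rule, the sketch below distributes a budget of m points across cells from (p_i, diam(X_i)); the function name and the largest-remainder rounding are our choices, and redistributing the probability of empty cells to neighbors is omitted.

import numpy as np

def allocate_points(m, p, diam, k=2, d=2):
    """Split a budget of m support points among I cells.
    p: (I,) cell probabilities mu(X_i); diam: (I,) cell diameters.
    Returns integer allocations m_i with sum(m_i) = m."""
    score = (p * diam ** k) ** (d / (k + d))      # balancing rule from Sec. 5
    target = m * score / score.sum()              # real-valued allocation
    m_i = np.floor(target).astype(int)
    remainder = target - m_i                      # largest-remainder rounding
    for idx in np.argsort(-remainder)[: m - m_i.sum()]:
        m_i[idx] += 1
    return m_i

# Example: 30 points over 4 cells
print(allocate_points(30, p=np.array([0.4, 0.3, 0.2, 0.1]),
                      diam=np.array([0.5, 0.5, 0.25, 0.25])))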
The Construction. Theoretically, the idea of division can be applied with any refinement procedure that can be applied iteratively and eventually makes the diameters of the cells approach 0. In our simulations, we use an adaptive kd-tree style cell refinement in a Euclidean space R^d. Let X be embedded in R^d within an axis-aligned rectangular region. We choose an axis x_l in R^d and evenly split the region along a hyperplane orthogonal to x_l (e.g., cut the square [0, 1]² along the line x_1 = 0.5), constructing X_1 and X_2. Given a sample set S, we split it into two sample sets S_1 and S_2 according to which subregion each sample is located in. The corresponding m_i and ζ_i can then be calculated as discussed above, and thus two cells and corresponding subproblems are constructed. If some m_i is still too large, that cell is cut along another axis into two further cells. The full list of cells and subproblems is constructed recursively; see Algorithm 2.

After obtaining the set of subproblems, we apply Algorithm 1 in each cell (the samples used there are redrawn from the sample set S_i of the subproblem), then combine the solutions µ_{m_i}^{(i)} = Σ_{j=1}^{m_i} w_j^{(i)} δ_{y_j^{(i)}} into the final result µ_m := Σ_{i=1}^I Σ_{j=1}^{m_i} p_i w_j^{(i)} δ_{y_j^{(i)}}.

Fig. 3 shows the optimal discretization for the example of Fig. 1 (3) with m = 30, obtained by applying EDOT with adaptive cell refinement. (In general, there are various ways of distributing the probability of empty cells to neighbors; in Algorithm 2, the choice is unique.)
Algorithm 2
Adaptive Refinement via Depth First Search
input: m, µ, sample size N, max number of points in a cell m*, lower/upper bounds a = (a_0, a_1, ..., a_{d−1}) and b = (b_0, b_1, ..., b_{d−1}) of the region;
output: out: stack of subproblems (S_i, m_i, p_i) with Σ p_i = 1, Σ m_i = m, Σ |S_i| = N;
initialization: p ← 1; let T, out be empty stacks; sample N points S ∼ µ; T.push((S, m, p, a, b));
while T is not empty do
  (S, m, p, a, b) ← T.pop();
  l ← argmax_i {b_i − a_i}, mid ← (a_l + b_l)/2;
  a^(1) ← a, a^(2) ← (a_0, ..., mid, ..., a_{d−1});
  b^(1) ← (b_0, ..., mid, ..., b_{d−1}), b^(2) ← b;
  S_1 ← {x ∈ S : x_l ≤ mid}, S_2 ← {x ∈ S : x_l > mid}, N_1 ← |S_1|, N_2 ← |S_2|;
  m_1 ← Round↑( m · N_1^{d/(k+d)} / (N_1^{d/(k+d)} + N_2^{d/(k+d)}) );
  m_2 ← Round↓( m · N_2^{d/(k+d)} / (N_1^{d/(k+d)} + N_2^{d/(k+d)}) );
  if m_1 = 0 then p_1 ← 0, p_2 ← (N_1 + N_2)/N;
  else if m_2 = 0 then p_1 ← (N_1 + N_2)/N, p_2 ← 0;
  else p_1 ← N_1/N, p_2 ← N_2/N;
  end if
  for i ← 1, 2 do
    if m_i > m* then T.push((S_i, m_i, p_i, a^(i), b^(i)));
    else if m_i > 0 then out.push((S_i, m_i, p_i));
    end if
  end for
end while
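A runnable counterpart of Algorithm 2, under simplifying assumptions of ours (recursion instead of an explicit stack, and cells receiving m_i = 0 have their probability dropped rather than redistributed):

import numpy as np

def adaptive_refine(S, m, a, b, k=2, m_star=6, out=None, n_total=None):
    """Recursive kd-style refinement (Algorithm 2 sketch).
    S: (n, d) samples in the box [a, b]; m: point budget for this box.
    Returns a list of subproblems (S_i, m_i, p_i), p_i from the global count."""
    if out is None:
        out, n_total = [], len(S)
    if len(S) == 0:
        return out
    if m <= m_star:                      # small enough: emit a leaf subproblem
        if m > 0:
            out.append((S, m, len(S) / n_total))
        return out
    d = len(a)
    l = int(np.argmax(b - a))            # split the longest axis at its midpoint
    mid = 0.5 * (a[l] + b[l])
    mask = S[:, l] <= mid
    S1, S2 = S[mask], S[~mask]
    e = d / (k + d)                      # balancing exponent from Sec. 5
    s1, s2 = len(S1) ** e, len(S2) ** e
    m1 = int(np.ceil(m * s1 / (s1 + s2)))
    m2 = m - m1                          # complementary (rounded-down) share
    b1, a2 = b.copy(), a.copy()
    b1[l] = mid; a2[l] = mid
    adaptive_refine(S1, m1, a, b1, k, m_star, out, n_total)
    adaptive_refine(S2, m2, a2, b, k, m_star, out, n_total)
    return out

# Example: refine 10000 uniform samples on the unit square with m = 30
S = np.random.rand(10000, 2)
cells = adaptive_refine(S, 30, a=np.zeros(2), b=np.ones(2))
print([(len(Si), mi, round(pi, 3)) for Si, mi, pi in cells])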
Figure 3. An example of adaptive refinement on a unit square. Left: division of a 10000-point sample S approximating a mixture of two truncated normal distributions, and the refinement for 30 discretization points. Right: the discretization optimized locally and combined into a probability measure, with k = 2.

Although the samples on a space X are usually represented as vectors in R^d, inducing an embedding X ↪ R^d, the space X often has its own structure as a CW-complex (or simply a manifold) with a metric. As the metric on the CW-complex usually reflects the intrinsic structure of X better than the one induced by the embedding, if the CW-complex structure and the metric are known, even piecewise, we may apply Algorithm 1 or Algorithm 2 on each cell or piece to get a precise discretization of X with respect to its own metric. Direct discretization as a subset of R^d may instead yield low expressive efficiency, e.g., some points may drift off X.

We show two examples: a truncated normal mixture distribution on a Swiss roll, and a mixture of normal distributions on a sphere mapped through stereographic projection.

On Swiss Roll.
In this case, the underlying space X_swiss is the Swiss roll, a 2D rectangular strip embedded in R³ via (θ, z) ↦ (θ, θ, z) in cylindrical coordinates, and µ_swiss is a truncated normal mixture on the (θ, z)-plane. The sample set S_swiss ∼ µ_swiss over X_swiss is shown in Fig. 4 (left), embedded in 3D, and in Fig. 5 (1) in the intrinsic 2D coordinates.

Following the Euclidean metric in R³, Fig. 4 (right) plots the EDOT solution µ_m obtained through adaptive cell refinement (Algorithm 2) with m = 50. The resulting cell structure is shown as colored boxes. The corresponding partition of S_swiss is shown in Fig. 5 (1), with samples contained in the same cell marked by the same color. According to Fig. 4 (right), the points of µ_m are mainly located on the strip, with only one point off it, in the sparsest cell (the yellow cell at the bottom of the figure).
Figure 4. On the Swiss roll. Left: 15000 samples from the truncated normal mixture distribution µ_swiss over X_swiss. Right: 50-point 3D discretization using Algorithm 2; the refinement cells are shown as colored boxes.

On the other hand, consider the metric on X_swiss induced by the isometry from the Swiss roll, as a manifold, to a strip in R². A more intrinsic discretization of µ_swiss can be obtained by applying EDOT through a refinement of the coordinate space, the (2D) strip. The resulting partition of S_swiss is shown in Fig. 5 (2), and the resulting discretization µ_m in Fig. 5 (3). Notice that all points are located in the (locally) high-density region of the Swiss roll. We observe from Fig. 5 (1) and (2) that the 3D partition may pull together disconnected and intrinsically remote regions, while the 2D partition maintains the intrinsic structure.
On Sphere. The underlying space X_sphere is the unit sphere in R³. µ_sphere is the pushforward of a normal mixture distribution on R² by stereographic projection. The sample set S_sphere ∼ µ_sphere over X_sphere is shown in Fig. 6 (left).
Figure 5. Swiss roll under isometry. (1) Refinement cells under the 3D Euclidean metric (one color per refinement cell's samples); (2) refinement cells under the 2D metric induced by the isometry; (3) EDOT of 50 points with respect to the 2D refinement.
Consider first the (3D) Euclidean metric on X_sphere induced by the embedding. Fig. 6 (right) plots the EDOT solution with refinement for µ_m with m = 40; the resulting cell structure is shown as colored boxes.

To use the intrinsic metric, a CW-complex is constructed. The structure is built with one point on the equator as a 0-cell, the rest of the equator as a 1-cell, and the upper and lower hemispheres as two (open) 2-cells. We take the upper and lower hemispheres and map them onto the unit disc through stereographic projection with respect to the south pole and north pole, respectively. We then take the metric from spherical geometry and rewrite the distance function and its gradient using the natural coordinates on the unit disc (see Supplementary for details). Fig. 7 shows the refinement of EDOT on the samples (in red) and the corresponding discretizations as colored points. More figures can be found in the Supplementary.

We now illustrate the performance of Adaptive EDOT on a 2D optimal transport task. Let X = Y = [0, 1]², let c = d_X be the Euclidean distance and g = d_X^k, and let the marginals µ, ν be truncated normal mixtures, where µ has 2 components and ν has 5 components. Fig. 8 plots the McCann interpolation of the OT plan between µ and ν (shown as red dots) and its discrete approximations (weights are color-coded) with m = n = 30.
Figure 6. Left: 30000 samples on the sphere. Right: EDOT of 40 points in 3D.
Figure 7. CW-EDOT of 40 points in 2D. Left: upper hemisphere; Right: lower hemisphere, via stereographic projections about the poles.

With m = n = 10, Adaptive EDOT already attains smaller values of W_{k,ζ}^k(µ, µ_m), W_{k,ζ}^k(ν, ν_n) and W_{k,ζ}^k(γ, γ_{m,n}) than naive sampling with m = n = 30, and with m = n = 30 Adaptive EDOT improves on all three further. Adaptive EDOT thus matches the quality of 900 naive samples with only 100 points on a 4-dimensional transference plan.
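Returning to the sphere example, the CW-coordinate computations can be sketched as follows, using the stereographic parametrization of Eq. (43) in the Supplementary and the geodesic distance d_{S²}(x, y) = arccos⟨x, y⟩; the function names are ours.

import numpy as np

def stereo_to_sphere(X, Y, upper=True):
    """Map unit-disc coordinates (X, Y) to the upper/lower hemisphere of S^2
    via stereographic projection (Eq. (43) of the Supplementary)."""
    s = 1.0 + X ** 2 + Y ** 2
    z = (X ** 2 + Y ** 2 - 1.0) / s
    return np.array([2 * X / s, 2 * Y / s, z if upper else -z])

def sphere_dist(p, q):
    """Geodesic distance on the unit sphere: the shorter great-circle arc."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # clip guards arccos at +/-1

# Distance between two points given in CW (disc) coordinates, upper hemisphere
p = stereo_to_sphere(0.2, 0.1, upper=True)
q = stereo_to_sphere(-0.3, 0.4, upper=True)
print(sphere_dist(p, q))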
6. Analysis of the Algorithms
First, consider one iteration of the minibatch SGD. Let N be the sample (minibatch) size of µ_N approximating µ, let m be the size of the target discretization µ_m (the output), let d be the dimension of X, and let ε be the error bound in the Sinkhorn calculation of the entropy-regularized optimal transference plan between µ_N and µ_m. The Sinkhorn algorithm for the positive matrix e^{−g(x,y)/ζ} (of size N × m) converges linearly, so it takes O(log(1/ε)) steps to fall into a region of radius ε, contributing O(Nm log(1/ε)) to the time complexity.

The inverse matrix M of ∇_{(α,β̄)} N = −(1/ζ) [ A  B ; Bᵀ  D ] (Eq. (8)) is computed block-wise (see Supplementary for details):

M = −ζ [ A⁻¹ + A⁻¹ B E⁻¹ Bᵀ A⁻¹   −A⁻¹ B E⁻¹ ; −E⁻¹ Bᵀ A⁻¹   E⁻¹ ],

where E = D − Bᵀ A⁻¹ B. Block E is constructed in O(Nm²) and inverted in O(m³); the block A⁻¹ B E⁻¹ takes O(Nm²) since A is diagonal; and the block A⁻¹ + A⁻¹ B E⁻¹ Bᵀ A⁻¹ takes O(N²m) to construct. When m ≪ N, the time complexity of constructing M is O(N²m). From M to the gradients of the dual variables, the tensor contractions have complexity O((N + m)²md). Finally, to get the gradient, the complexity is dominated by the second term of Eq. (7), which is a contraction of an N × m matrix (i.e., g dπ(x)) with tensors of sizes N × m × d and m² × d (the two gradients of the dual variables α, β) along N and m respectively; thus the final step contributes O((N + m)md).

The time complexity of the increment steps in SGD is O(md). Therefore, for L steps of minibatch SGD, the time complexity is O((N + m)²mdL + NmL log(1/ε)).

For space complexity, the Sinkhorn algorithm (which can be done in place in O(Nm)) is the only iterative computation within a single SGD step, and between two SGD steps only the resulting distribution is passed on. Therefore, the space complexity is O((N + m)²), coming from M; everything else is at most of size O(m(N + m)).

From the above discussion, we can further see that when Algorithm 2 is applied, the total complexities (both in time and space) are reduced, as the magnitudes of both N and m are much smaller in each refined cell.

Finally, the convergence rate affects the time complexity. Since minibatch SGD is applied, the convergence rate depends on the choice of the initial distribution, and is also affected by the sizes m, N and the diameter of the cell.
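A minimal NumPy check of this block-wise (Schur complement) inversion; block_inverse is our name, and the small dense matrices stand in for the discretized operator of Eq. (8), with A diagonal as in the text.

import numpy as np

def block_inverse(A_diag, B, D):
    """Invert [[A, B], [B^T, D]] with A = diag(A_diag), via the Schur
    complement E = D - B^T A^{-1} B, as in Sec. 6."""
    Ainv_B = B / A_diag[:, None]                 # A^{-1} B, cheap since A diagonal
    E = D - B.T @ Ainv_B                         # O(N m^2)
    Einv = np.linalg.inv(E)                      # O(m^3)
    top_left = np.diag(1.0 / A_diag) + Ainv_B @ Einv @ Ainv_B.T   # O(N^2 m)
    top_right = -Ainv_B @ Einv
    return np.block([[top_left, top_right], [top_right.T, Einv]])

# Sanity check against a direct inverse on a small random instance
rng = np.random.default_rng(1)
N, m = 6, 3
A_diag = rng.uniform(1.0, 2.0, N)
B = rng.normal(size=(N, m))
D = B.T @ B + np.eye(m)                          # ensures E is well-conditioned
M_full = np.block([[np.diag(A_diag), B], [B.T, D]])
assert np.allclose(block_inverse(A_diag, B, D), np.linalg.inv(M_full), atol=1e-8)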
Figure 8. McCann interpolation (see Supplementary for more slices and an animation) of the 30 × 30 transference plan.

The procedure of dividing the sample set S into subsets in Algorithm 2 is similar to Quicksort, so the space and time complexities are similar. The similarity comes from the binary divide-and-conquer structure, and from the fact that each split is based on comparing each sample with a target (the "mid" in Algorithm 2). The time complexity of Algorithm 2 is O(N log N) in the best and average cases, and O(N²) in the worst case (which would happen when each cell contains only one sample). The time complexity depends only on the distribution, not on the order in which the samples are stored.

The space complexity is O(Nd + m), or simply O(Nd) as m ≪ N. Furthermore, if we use swaps to make the splitting of the sample set "in place" (as in Quicksort), we only need space for one copy of the set S (of size Nd), the DFS stack and output stack (proportional to m), and some fixed-size working space.

The applications of Algorithm 1 to the cells are independent of each other, making the computation parallelizable, and the post-processing of such parallelization takes only m multiplications and a concatenation to construct µ_m.

After having a size-m discretization on X and a size-n discretization on Y, the EOT solution (Sinkhorn algorithm) has time complexity O(mn log(1/ε)). In EDOT, two discretization problems must be solved before applying Sinkhorn, while naive sampling requires nothing but sampling. According to the previous analysis, solving a single continuous EOT problem using the size-m EDOT method may therefore have higher time complexity than naive sampling with an even larger sample size (than m), when the cost function c : X × Y → R is completely known.
However, in real applications, the cost function may come from real-world experiments (or from extra computations) performed for each pair (x, y) in the discretization; in that case the size of the discretized distribution is critical for cost control, while the distance functions d_X and d_Y usually come along with the spaces X and Y and are easy to calculate. Another scenario for applying EDOT is when the marginal distributions µ and ν are fixed across different cost functions: the discretizations can then be reused, so the cost of discretization is paid once while the improvement it brings accumulates with each reuse.
7. Conclusion
We developed methods for efficiently approximating OT couplings with fixed size m × n approximations. We provided bounds on the relationship between the discrete approximation and the original continuous problem. We implemented two algorithms, demonstrated their efficacy compared to naive sampling, and analyzed their computational complexity. Our approach provides a new path to efficiently computing OT plans.

A. Proof of Proposition 2.1
Proposition 2.1.
When X and Y are two compact spaces and c is C^∞, there exists a constant C ∈ R^+ such that

max{W_k^k(µ, µ_m), W_k^k(ν, ν_n)} ≤^{(i)} W_{k,ζ}^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)) ≤^{(ii)} C [W_{k,ζ}^k(µ, µ_m) + W_{k,ζ}^k(ν, ν_n)].    (11)
Proof. We adopt the notation Z = X × Y and z_i = (x_i, y_i) ∈ Z.

For inequality (i), assume without loss of generality that max{W_k^k(µ, µ_m), W_k^k(ν, ν_n)} = W_k^k(µ, µ_m). Denote by π*_Z the optimal π_Z ∈ Π(γ_λ(µ, ν), γ_λ(µ_m, ν_n)) that achieves W_k^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)), and similarly π*_X for W_k^k(µ, µ_m). Then we have:

W_k^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n))
 = ∫_{Z×Z} d_Z^k(z_1, z_2) dπ*_Z
 ≥ ∫_{Z×Z} d_X^k(x_1, x_2) dπ*_Z        [by (i) of Eq. (3)]
 = ∫_{X×X} d_X^k(x_1, x_2) ∫_{Y×Y} dπ*_Z
 =^{(a)} ∫_{X×X} d_X^k(x_1, x_2) dπ′_X
 ≥^{(b)} ∫_{X×X} d_X^k(x_1, x_2) dπ*_X = W_k^k(µ, µ_m).

Here π′_X ∈ Π(µ, µ_m) is the marginal of π*_Z on X × X; equality (a) holds since π*_Z ∈ Π(γ_λ(µ, ν), γ_λ(µ_m, ν_n)), and inequality (b) holds since π*_X is the optimal choice.

For inequality (ii), we use the following shorthand:

dγ ⊗ dγ_mn := dγ_λ(µ, ν) ⊗ dγ_λ(µ_m, ν_n),
d_Z^k := d_Z^k(z_1, z_2), d_X^k := d_X^k(x_1, x_2), d_Y^k := d_Y^k(y_1, y_2).

W_{k,ζ}^k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)) = ∫_{Z×Z} d_Z^k dπ*_{z,ζ}
 =^{(a)} ∫_{Z×Z} d_Z^k · exp((α(z_1) + β(z_2) − d_Z^k)/ζ) dγ ⊗ dγ_mn
 ≤^{(b)} A ∫_{Z×Z} (d_X^k + d_Y^k) · exp((α(z_1) + β(z_2) − d_Z^k)/ζ) dγ ⊗ dγ_mn
 ≤^{(c)} C ∫_{Z×Z} (d_X^k + d_Y^k) · exp(−d_Z^k/ζ) dγ ⊗ dγ_mn
 ≤^{(d)} C ∫_{Z×Z} [d_X^k · exp(−d_X^k/ζ) + d_Y^k · exp(−d_Y^k/ζ)] dγ ⊗ dγ_mn
 =^{(e)} C [∫_{X×X} d_X^k · exp(−d_X^k/ζ) dµ ⊗ dµ_m + ∫_{Y×Y} d_Y^k · exp(−d_Y^k/ζ) dν ⊗ dν_n]
 ≤^{(f)} C ∫_{X×X} d_X^k · exp((s(x_1) + t(x_2) − d_X^k)/ζ) dµ ⊗ dµ_m + C ∫_{Y×Y} d_Y^k · exp((s′(y_1) + t′(y_2) − d_Y^k)/ζ) dν ⊗ dν_n
 = C [W_{k,ζ}^k(µ, µ_m) + W_{k,ζ}^k(ν, ν_n)].

Justifications for the derivations:
(a) Based on the dual formulation, it is shown in (Aude et al., 2016)[Proposition 2.1] that for ζ > 0 there exist α(z), β(z) ∈ C(Z) such that dπ*_{z,ζ} = exp((α(z_1) + β(z_2) − d_Z^k(z_1, z_2))/ζ) dγ ⊗ dγ_mn;
(b) inequality (ii) of Eq. (3);
(c) according to (Genevay et al., 2019)[Theorem 2], when X, Y are compact and c is smooth, α, β are uniformly bounded; moreover both d_X^k and d_Y^k are uniformly bounded by the diameters of X and Y respectively, hence such a constant C exists;
(d) inequality (i) of Eq. (3), as d_Z^k ≥ max{d_X^k, d_Y^k};
(e) γ_λ(µ, ν) ∈ Π(µ, ν) and γ_λ(µ_m, ν_n) ∈ Π(µ_m, ν_n);
(f) similarly to (a), for ζ > 0 there exist s(x), t(x) ∈ C(X) and s′(y), t′(y) ∈ C(Y) such that exp((s + t − d_X^k)/ζ) dµ ⊗ dµ_m = dπ*_X and exp((s′ + t′ − d_Y^k)/ζ) dν ⊗ dν_n = dπ*_Y; moreover ∫_{X×X} d_X^k · exp((s(x_1) + t(x_2))/ζ) dµ ⊗ dµ_m ≥ 0 and ∫_{Y×Y} d_Y^k · exp((s′(y_1) + t′(y_2))/ζ) dν ⊗ dν_n ≥ 0. ∎

B. Gradient of W_{k,ζ}^k

B.1. The Gradient
Following the definitions and notation of Sections 2 and 3 of the paper, we calculate in detail the gradient of W_{k,ζ}^k(µ, µ_m) with respect to the parameters of µ_m := Σ_{i=1}^m w_i δ_{y_i}. We have W_{k,ζ}^k(µ, µ_m) = ∫_{X×X} g(x, y) dγ_ζ(x, y), where

γ_ζ = argmin_{γ ∈ Π(µ, µ_m)} ∫_{X×X} g(x, y) dγ(x, y) + ζ KL(γ ‖ µ ⊗ µ_m).    (12)

Let α ∈ L^∞(X) and β ∈ R^m, denote β = Σ_{i=1}^m β_i δ_{y_i}, and let

F(γ; µ, µ_m, α, β) := ∫_{X×X} g(x, y) dγ(x, y) + ζ KL(γ ‖ µ ⊗ µ_m) + ∫_X α(x) (Σ_{i=1}^m dγ(x, y_i) − dµ(x)) + Σ_{i=1}^m β_i (∫_X dγ(x, y_i) − w_i).    (13)

Since γ is discrete in the second component, supported on {y_i}_{i=1}^m, we may denote dγ(x, y_i) by dπ_i(x); thus

F(γ; µ, µ_m, α, β) = ∫_X Σ_{i=1}^m g(x, y_i) dπ_i(x) + ζ Σ_{i=1}^m ∫_X (ln(dπ_i(x)/dµ(x)) − ln(w_i)) dπ_i(x) + ∫_X α(x) (Σ_{i=1}^m dπ_i(x) − dµ(x)) + Σ_{i=1}^m β_i (∫_X dπ_i(x) − w_i).    (14)

Then the Fenchel dual of Problem (12) is

L(µ, µ_m; α, β) = ∫_X α(x) dµ(x) + Σ_{i=1}^m β_i w_i − ζ ∫_X Σ_{i=1}^m e^{(α(x) + β_i − g(x, y_i))/ζ} w_i dµ(x).    (15)

Let α*, β* be the argmax of the Fenchel dual (15); the primal is solved by dγ*(x, y_i) = e^{(α*(x) + β*_i − g(x, y_i))/ζ} w_i dµ(x). To make the solution unique, we restrict the freedom of the solution (noting that L(µ, µ_m; α, β) = L(µ, µ_m; α + t, β − t) for any t ∈ R): we use the condition β_m = 0 to narrow the choices down to one, and denote the dual variable with this property by β̄ (and β̄*) below.

We first calculate ∇_{w_i, y_i} L(µ, µ_m; α*, β*), with α* and β* regarded as functions of µ_m. From the paper:

∂L/∂w_i = ∫_X g(x, y_i) E*_i(x) dµ(x) + (1/ζ) ∫_X Σ_{j=1}^m g(x, y_j) (∂α*(x)/∂w_i + ∂β*_j/∂w_i) w_j E*_j(x) dµ(x),    (16)

∇_{y_i} L = ∫_X ∇_{y_i} g(x, y_i) (1 − g(x, y_i)/ζ) E*_i(x) w_i dµ(x) + (1/ζ) ∫_X Σ_{j=1}^m g(x, y_j) (∇_{y_i} α*(x) + ∇_{y_i} β*_j) w_j E*_j(x) dµ(x).    (17)

Next, we calculate the derivatives of α* and β*, by finding their defining equation and using the Implicit Function Theorem. The optimal dual variables α*, β* are obtained by solving the stationarity equation ∇_{α,β} L = 0, with derivatives taken in the sense of Fréchet derivatives. The Fenchel dual as a function of α, β has domain and codomain L(µ, µ_m; ·, ·) : L^∞(X) × R^m → R, and its derivatives are

∇_α L(µ, µ_m; α, β) = ∫_X (1 − Σ_{i=1}^m w_i E_i(x)) (·) dµ(x),    (18)

∂L/∂β_i (µ, µ_m; α, β) = w_i (1 − ∫_X E_i(x) dµ(x)),    (19)

where E_i(x) = e^{(α(x) + β_i − g(x, y_i))/ζ} is defined as in the paper, ∇_α L(µ, µ_m; α, β) ∈ (L^∞(X))^∨ (as a linear functional), and ∂L/∂β_i (µ, µ_m; α, β) ∈ R. Next, we need to show L is differentiable in the sense of the Fréchet derivative, i.e.,

lim_{‖h‖→0} (1/‖h‖) (L(µ, µ_m; α + h, β) − L(µ, µ_m; α, β) − ∇_α L(µ, µ_m; α, β)(h)) = 0.    (20)
By definition of L (writing L(α) for L(µ, µ_m; α, β)),

L(α + h) − L(α) − ∇_α L(α)(h)
 = ∫_X h(x) dµ(x) − ζ ∫_X Σ_{i=1}^m (e^{h(x)/ζ} − 1) w_i E_i(x) dµ(x) − ∫_X (1 − Σ_{i=1}^m w_i E_i(x)) h(x) dµ(x)
 = ζ ∫_X Σ_{i=1}^m (h(x)/ζ + 1 − e^{h(x)/ζ}) E_i(x) w_i dµ(x)
 = −ζ ∫_X (Σ_{k=2}^∞ (1/k!) h(x)^k/ζ^k) Σ_{i=1}^m E_i(x) w_i dµ(x),    (21)

where the last equality comes from the Taylor expansion of the exponential function. Consider ‖h‖_∞ = ess sup_{x∈X} |h(x)|, the essential supremum of |h(x)| for x ∈ X given the measure µ. Then

(1/‖h‖) |L(α + h) − L(α) − ∇_α L(α)(h)|
 ≤ (ζ/‖h‖) ∫_X (Σ_{k=2}^∞ (1/k!) |h(x)|^k/ζ^k) Σ_{i=1}^m E_i(x) w_i dµ(x)
 ≤ (ζ/‖h‖) ∫_X (Σ_{k=2}^∞ (1/k!) ‖h‖^k/ζ^k) Σ_{i=1}^m E_i(x) w_i dµ(x)
 = ζ (Σ_{k=2}^∞ (1/k!) ‖h‖^{k−1}/ζ^k) ∫_X Σ_{i=1}^m E_i(x) w_i dµ(x)
 = ζ Σ_{k=2}^∞ (1/k!) ‖h‖^{k−1}/ζ^k.    (22)

Therefore,

lim_{‖h‖→0} (1/‖h‖)(L(α + h) − L(α) − ∇_α L(α)(h)) = 0,    (23)

which shows that the expression of ∇_α L(α) in Eq. (18) is the correct Fréchet derivative. Note that α ∈ L^∞(X) is critical in Eq. (22).

Let N := ∇_{α,β̄} L, taking values in (L^∞(X))^∨ × R^{m−1}; then N = 0 defines α* and β̄*, which makes it possible to differentiate them with respect to µ_m using the Implicit Function Theorem for Banach spaces. From now on, N is evaluated at α = α*, β̄ = β̄*, i.e., the marginal conditions on dπ_i(x) = w_i E_i(x) dµ(x) hold. Thus we need ∇_{α,β̄} N and ∇_{w_i, y_i} N calculated, and we must prove that ∇_{α,β̄} N is invertible (and give the inverse).

It is necessary to determine the form ∇_{α,β̄} N takes as a Fréchet derivative. Start from the map N(µ, µ_m; ·, ·) : L^∞(X) × R^{m−1} → (L^∞(X))^∨ × (R^{m−1})^∨, where R^{m−1} is isomorphic to its dual Banach space (R^{m−1})^∨. Then ∇_{α,β̄} N ∈ Hom_R^b(L^∞(X) × R^{m−1}, (L^∞(X))^∨ × (R^{m−1})^∨), where Hom^b denotes the set of bounded linear operators. Moreover, recalling that (·) ⊗ A is the left adjoint functor of Hom_R^b(A, ·), for R-vector spaces Hom_R^b(L^∞(X) × R^{m−1}, (L^∞(X))^∨ × (R^{m−1})^∨) ≅ Hom_R^b((L^∞(X) × R^{m−1})^{⊗2}, R). Thus we can write ∇_{α,β̄} N as a bilinear form on the vector space L^∞(X) × R^{m−1}.

From the expression of N, we may differentiate (similarly to the calculations (21) to (23)):

∇_α N = ( −(1/ζ) ∫_X (·)(·) Σ_{i=1}^m w_i E_i(x) dµ(x), −(1/ζ) ∫_X (·) w_i E_i(x) dµ(x) ),    (24)

∇_β N = ( −(1/ζ) ∫_X (·) w_i E_i(x) dµ(x), −(δ_ij/ζ) ∫_X w_i E_i(x) dµ(x) ).    (25)

Considering the boundary conditions Σ_{i=1}^m w_i E_i(x) dµ(x) = Σ_{i=1}^m dπ_i(x) = dµ(x) and ∫_X w_i E_i(x) dµ(x) = ∫_X dπ_i(x) = w_i, the operator ∇_{α,β̄} N, as the Hessian form of L, can be written as

∇_{α,β̄} N = −(1/ζ) [ ⟨−, ·⟩   ⟨π_j, ·⟩ ; ⟨−, π_i⟩   w_i δ_ij ]    (26)

with ⟨φ_1, φ_2⟩ = ∫_X φ_1(x) φ_2(x) dµ(x), or further

∇_{α,β̄} N = −(1/ζ) [ dµ(x) δ(x, x′)   dπ_j(x) ; dπ_i(x′)   w_i δ_ij ]    (27)

over the basis {δ(x), e_i}_{x∈X, i}.

B.2. Second Derivatives

In this part, we calculate the second derivatives of W_{k,ζ}^k(µ, µ_m) with respect to the ingredients of µ_m, i.e., the w_i's and y_i's, for the potential application of Newton's method in EDOT (which we have not yet implemented). Using the previous results, we can further calculate the second derivatives of W_{k,ζ}^k in the w_i's and y_i's. Differentiating (16) and (17) yields

∂²L/∂w_i∂w_j = (1/ζ) ∫_X g(x, y_j) Σ_{k=i,j} (∂α(x)/∂w_k + ∂β_k/∂w_k) E_k(x) dµ(x) + (1/ζ) ∫_X Σ_{k=1}^m g(x, y_k) (∂²α(x)/∂w_i∂w_j + ∂²β_k/∂w_i∂w_j) w_k E_k(x) dµ(x) + (1/ζ²) ∫_X Σ_{k=1}^m g(x, y_k) Π_{l=i,j} (∂α(x)/∂w_l + ∂β_k/∂w_l) w_k E_k(x) dµ(x),    (33)

∇_{y_j} ∂L/∂w_i = δ_ij [∫_X (1 − g(x, y_i)/ζ) ∇_{y_i} E_i(x) dµ(x)] + (1/ζ) ∫_X ∇_{y_j} g(x, y_j) (∂α(x)/∂w_i + ∂β_j/∂w_i) w_j E_j(x) dµ(x) + (1/ζ) ∫_X g(x, y_j) ∇_{y_j} (∂α(x)/∂w_i + ∂β_j/∂w_i) w_j E_j(x) dµ(x) + (1/ζ²) ∫_X Σ_{k=1}^m g(x, y_k) (∂α(x)/∂w_i + ∂β_k/∂w_i) (∇_{y_j} α(x) + ∇_{y_j} β_k) w_k E_k(x) dµ(x),    (34)

∇_{y_j} ∇_{y_i} L = δ_ij [∫_X ∇²_{y_i} g(x, y_i) (1 − g(x, y_i)/ζ) w_i E_i(x) dµ(x) − (1/ζ) ∫_X (∇_{y_i} g(x, y_i))^⊗2 (2 − g(x, y_i)/ζ) w_i E_i(x) dµ(x)] + (1/ζ) ∫_X (1 − g(x, y_i)/ζ) ∇_{y_i} g(x, y_i) (∇_{y_j} α(x) + ∇_{y_j} β_i) w_i E_i(x) dµ(x) + (1/ζ) ∫_X (1 − g(x, y_j)/ζ) ∇_{y_j} g(x, y_j) (∇_{y_i} α(x) + ∇_{y_i} β_j) w_j E_j(x) dµ(x) + (1/ζ) ∫_X Σ_{k=1}^m g(x, y_k) (∇_{y_i} ∇_{y_j} α(x) + ∇_{y_i} ∇_{y_j} β_k) w_k E_k(x) dµ(x) + (1/ζ²) ∫_X Σ_{k=1}^m g(x, y_k) Π_{l=i,j} (∇_l α(x) + ∇_l β_k) w_k E_k(x) dµ(x).    (35)

Once we have the second derivatives of g(x, y) in the y_i's, we need the second derivatives of α* and β* to build the above second derivatives.
From the formula ∇_{w_i, y_j}(α*, β*) = −(∇_{α,β} N |_{α*,β*})^{−1} (∇_{w_i, y_j} N |_{α*,β*}), we can differentiate:

∇_{w_k, y_l} ∇_{w_i, y_j}(α*, β*) = −∇_{w_k, y_l}((∇_{α,β} N |_{α*,β*})^{−1}) (∇_{w_i, y_j} N |_{α*,β*}) − (∇_{α,β} N |_{α*,β*})^{−1} (∇_{w_k, y_l} ∇_{w_i, y_j} N |_{α*,β*}).    (36)

Here, from the formula ∇A^{−1} = −A^{−1} (∇A) A^{−1} (the product rule applied to AA^{−1} = I), we have

∇_{w_k, y_l}((∇_{α,β} N |_{α*,β*})^{−1}) = −(∇_{α,β} N |_{α*,β*})^{−1} ∇_{w_k, y_l}(∇_{α,β} N |_{α*,β*}) (∇_{α,β} N |_{α*,β*})^{−1}    (37)

and

∇_{w_k}(∇_{α,β} N |_{α*,β*}) = −(1/ζ) [ 0   δ_jk E_k(x) dµ(x) ; δ_ik E_k(x′) dµ(x′)   δ_ij δ_jk ],    (38)

∇_{y_k}(∇_{α,β} N |_{α*,β*}) = (1/ζ²) [ 0   δ_jk ∇_{y_k} g(x, y_k) dπ_k(x) ; δ_ik ∇_{y_k} g(x′, y_k) dπ_k(x′)   0 ].    (39)

The last piece we need is ∇_{w_k, y_l} ∇_{w_i, y_j} N |_{α*,β*}:

∂²N/∂w_j∂w_i = (0, 0),    (40)

∇_{y_j} ∂N/∂w_i = ( δ_ij (1/ζ) ∫_X (·) ∇_{y_i} g(x, y_i) E_i(x) dµ(x), 0 ),    (41)

∇_{y_j} ∇_{y_i} N = δ_ij (1/ζ) ( ∫_X (·) ∇²_{y_i} g(x, y_i) w_i E_i(x) dµ(x) − (1/ζ) ∫_X (·) (∇_{y_i} g(x, y_i))^⊗2 w_i E_i(x) dµ(x), δ_ik [∫_X ∇²_{y_i} g(x, y_i) w_i E_i(x) dµ(x) − (1/ζ) ∫_X (∇_{y_i} g(x, y_i))^⊗2 w_i E_i(x) dµ(x)] ),    (42)

where in the last equation, k represents the k-th component in N's second part (about β).

C. Empirical Parts

C.1. Estimating W_{k,ζ}^k: Richardson Extrapolation and Others

In the analysis, we may need W_{k,ζ}^k(µ, µ_m) to compare how discretization methods behave. However, when µ is not discrete, we are generally unable to obtain an analytical solution for the Wasserstein distance. In certain cases, including all the examples in this paper, the Wasserstein distance can be estimated from finite samples (of large size). According to (Mensch & Peyré, 2020), for µ ∈ P(X) in our setup (a probability measure on a compact Polish space with Borel algebra) and g = d_X^k a continuous function, Online Sinkhorn methods can be used to estimate W_{k,ζ}^k. Online Sinkhorn needs a large number of samples from µ (in batches) to be accurate.

In this paper, as the X's are compact subsets of R^n and µ has a continuous probability density function, we may use the Richardson extrapolation method to estimate the Wasserstein distance between µ and µ_m, which may require fewer samples and fewer computations (Sinkhorn twice, with different sizes).

Our examples are on intervals or rectangles, on which two grids Λ_1 of N points and Λ_2 of rN points (N and rN both integers) can be constructed naturally. With µ determined by a smooth probability density function ρ, let µ^(N) be the normalization of Σ_{i=1}^N ρ(Λ_i) δ_{Λ_i} (this may not be a probability distribution, so we use its normalization). From the continuity of ρ and the boundedness of the dual variables α, β, we can conclude that lim_{N→∞} W_{k,ζ}^k(µ^(N), µ_m) = W_{k,ζ}^k(µ, µ_m).
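Given estimates on two such grid sizes, the Richardson step eliminates the leading error term; as derived next, the leading exponent is h = k/d. The two-point form below is our illustration.

import numpy as np

def richardson(W_N, W_rN, r, h):
    """Two-point Richardson extrapolation.
    W_N, W_rN: estimates of W_{k,zeta}^k from grids of N and r*N points,
    assuming W(N) = W* + c * N^{-h} + higher-order terms.
    Solving the two equations for W* gives the formula below."""
    return (r ** h * W_rN - W_N) / (r ** h - 1.0)

# Example (k = 2, d = 1, so h = 2): doubling the grid, r = 2
# (illustrative placeholder values for the two estimates)
print(richardson(W_N=0.01310, W_rN=0.01252, r=2, h=2.0))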
To apply Richardson extrapolation, treat W_{k,ζ}^k(µ^(N), µ_m) as a function of 1/N; we need the exponent of the lowest-order term in 1/N in the expansion W_{k,ζ}^k(µ^(N), µ_m) = W* + O(1/N^h) + O((1/N)^{h+1}), where W* = W_{k,ζ}^k(µ, µ_m). Considering that

|W_{k,ζ}^k(µ, µ_m) − W_{k,ζ}^k(µ^(N), µ_m)| ≤ W_{k,ζ}^k(µ, µ^(N))

and W_{k,ζ}(µ^(N), µ) ∝ N^{−1/d}, we may conclude that h = k/d, where d is the dimension of X. Figure 9 shows an empirical example in the d = 1, k = 2 situation.

Figure 9. Richardson extrapolation: the power of the expansion in N^{−1}. We take the EDOT µ_5 of example 2 (1-dim truncated normal mixture) as the target µ_m, and use evenly positioned µ^(N) for different N's for the estimate. The y-axis is ln(W(µ^(rN), µ_5) − W(µ^(N), µ_5)), computed for both r = 2 and r = 3. With ln N as the x-axis, linearity can be observed. The slopes are both about −2, representing the exponent of the leading non-constant term of W(µ^(N), µ_5) in N, while the theoretical result is −k/d = −2. The differences come from higher-order terms in N.

C.2. Example: The Sphere

The CW-complex structure of the unit sphere S² is constructed as follows: let (1, 0, 0), a point on the equator, be the only dimension-0 cell, and let the equator be the dimension-1 cell (a line segment [0, 1] attached to the 0-cell by identifying both end points with the point (1, 0, 0)). The dimension-2 structure is the union of two unit discs, identified with the southern / northern hemispheres of S² by the stereographic projections

π_N^±: (X, Y) ↦ (1/(1 + X² + Y²)) (2X, 2Y, ±(X² + Y² − 1))    (43)

with respect to the north / south pole.

Spherical Geometry. Spherical geometry is the Riemannian manifold structure induced by the embedding onto the unit sphere in R³. The geodesic between two points is the shorter arc along the great circle determined by the two points. In R³ coordinates, d_{S²}(x, y) = arccos(⟨x, y⟩). Composed with the stereographic projections, the distance in terms of the CW-complex coordinates can be calculated (and differentiated). The gradient with respect to y (or its CW-coordinate) can be calculated via the above formulas. In practice, the only problem is that when x = ±y, the function arccos is singular at ±1. By the symmetry of the sphere under rotation about the axis through x, the derivatives of the distance along all directions there agree, so we may choose the radial direction in the CW-coordinate (the unit disc), for which the differentiation is elementary.

C.3. A Note on the Implementation of SGD with Momentum

There is a slight difference between our implementation of SGD and the algorithm provided in the paper. In the implementation, we give two different learning rates to the positions (y_i's) and the weights (w_i's), as movement in the positions is usually observed to be much slower than movement in the weights. Empirically, we make the learning rate on positions exactly 3 times the learning rate on weights at each SGD iteration. With this change, the convergence is faster, but we have neither a theory nor empirical evidence showing that the fixed ratio 3 is the best choice. Implementing and testing Newton's method (hybridized with SGD) and other improved SGD methods could be good problems to work on.

C.4. Some Figures from Empirical Results

In this part, we show some figures illustrating the results of the simulations, mainly the 3D figures from different view directions.
Figure 10. The Swiss roll example: the 3D discretization (same as in the paper) from different view directions.

Figure 11. The Swiss roll example: the 3D discretization (same as in the paper) as an autostereogram.

Figure 12. The sphere example: the 3D discretization (same as in the paper) from two view directions.

Figure 13. The sphere example: the 3D discretization (same as in the paper) as an autostereogram.

References

Accinelli, E. A generalization of the implicit function theorems. Available at SSRN 1512763, 2009.

Allen-Zhu, Z., Li, Y., Oliveira, R., and Wigderson, A. Much faster algorithms for matrix scaling. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 890-901. IEEE, 2017.

Aude, G., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. arXiv preprint arXiv:1605.08527, 2016.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292-2300, 2013.

Dudley, R. M. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40-50, 1969.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1574-1583. PMLR, 2019.

Janati, H., Muzellec, B., Peyré, G., and Cuturi, M. Entropic optimal transport between unbalanced Gaussian measures has a closed form. Advances in Neural Information Processing Systems, 33, 2020.

Kantorovich, L. V. On the translocation of masses. Journal of Mathematical Sciences, 133(4):1381-1382, 2006.

Lin, T., Ho, N., and Jordan, M. I. On the efficiency of the Sinkhorn and Greenkhorn algorithms and their acceleration for optimal transport. arXiv preprint arXiv:1906.01437, 2019.

Luise, G., Rudi, A., Pontil, M., and Ciliberto, C. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In Advances in Neural Information Processing Systems, pp. 5859-5870, 2018.

Mensch, A. and Peyré, G. Online Sinkhorn: Optimal transport distances from sample streams. arXiv e-prints, pp. arXiv-2003, 2020.

Overstall, A., McGree, J., et al. Bayesian design of experiments for intractable likelihood models using coupled auxiliary models and multivariate emulation. Bayesian Analysis, 15(1):103-131, 2020.

Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343-348, 1967.

Tran, M.-N., Nott, D. J., and Kohn, R. Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873-882, 2017.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wang, J., Wang, P., and Shafto, P. Sequential cooperative Bayesian inference. In International Conference on Machine Learning, pp. 10039-10049. PMLR, 2020a.

Wang, P., Wang, J., Paranamana, P., and Shafto, P. A mathematical theory of cooperative communication. Advances in Neural Information Processing Systems, 33, 2020b.

Weed, J., Bach, F., et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620-2648, 2019.