Adaptive Optimal Transport
Montacer Essid, Debra Laefer, and Esteban G. Tabak
Courant Institute of Mathematical Sciences, NYU, 251 Mercer St, New York, NY 10012
NYU Center of Urban Science and Progress, 370 Jay St, Brooklyn, NY 11201
February 20, 2019
Abstract
An adaptive, adversarial methodology is developed for the optimal transport problem between two distributions $\mu$ and $\nu$, known only through a finite set of independent samples $(x_i)_{i=1..N}$ and $(y_j)_{j=1..M}$. The methodology automatically creates features that adapt to the data, thus avoiding reliance on a priori knowledge of the data distribution. Specifically, instead of a discrete point-by-point assignment, the new procedure seeks an optimal map $T(x)$ defined for all $x$, minimizing the Kullback-Leibler divergence between $(T(x_i))$ and the target $(y_j)$. The relative entropy is given a sample-based, variational characterization, thereby creating an adversarial setting: as one player seeks to push forward one distribution to the other, the second player develops features that focus on those areas where the two distributions fail to match. The procedure solves local problems matching consecutive, intermediate distributions between $\mu$ and $\nu$. As a result, maps of arbitrary complexity can be built by composing the simple maps used for each local problem. Displaced interpolation is used to guarantee global from local optimality. The procedure is illustrated through synthetic examples in one and two dimensions.

1 Introduction

The optimal transport problem consists of finding, from among all transformations $y = T(x)$ that push forward a source distribution $\mu(x)$ to a target $\nu(y)$, the map that minimizes the expected transportation cost:
$$\min_T \int c(x, T(x))\, \mu(x)\, dx, \qquad T_\#\mu = \nu, \qquad (1.1)$$
where $c(x, y)$ is the externally provided cost of moving a unit of mass from $x$ to $y$ [21]. The application for which Monge formulated the optimal transport problem was the actual transportation of material between two sites at minimal cost [10]. Two centuries later, starting with Kantorovich and Koopmans [6], the problem was relaxed from maps to couplings, and applied to more general matching problems, such as matching supply and demand or positions and employees.
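As a minimal concrete illustration (our own toy example, with made-up coordinates, not data from the paper): for two equally weighted point clouds of the same small size under the quadratic cost, the Monge problem reduces to an assignment problem, which can be solved by brute force over permutations:

```python
import itertools

# Two tiny point clouds in R^2 (made-up data); quadratic cost c(x, y) = |x - y|^2.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [(2.0, 0.1), (0.1, 2.0), (2.1, 2.0)]

def cost(x, y):
    return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2

# With equal weights, a transport map is a permutation: x_i -> y_{perm[i]}.
best_perm, best_cost = None, float("inf")
for perm in itertools.permutations(range(len(ys))):
    total = sum(cost(xs[i], ys[perm[i]]) for i in range(len(xs)))
    if total < best_cost:
        best_perm, best_cost = perm, total

print(best_perm, best_cost)
```

Brute force scales factorially in the number of points, which is why the Kantorovich linear-programming relaxation, entropic/Sinkhorn solvers, and ultimately sample-based methods are needed for realistic data.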
More recently, the optimal transport problem has become a central tool in many computer and data science applications, as well as in analysis and partial differential equations. Among the many applications for which optimal transport could be used, the particular one that drove the methodology proposed in this article is change detection, for which one seeks a correspondence between two point clouds (from remote sensing data – either imagery or laser scanning) in order to identify differences between them.

The numerical solution of optimal transportation problems has been an active area of research for some years. When the two measures $\mu$ and $\nu$ have discrete support, the relaxation of optimal transport due to Kantorovich [6] becomes a linear programming problem, which can be solved effectively for problems of small and medium size. When the size of the problem grows, its solution can be accelerated significantly through the addition of an entropic regularization and a Sinkhorn-type iterative algorithm [2, 14]. This regularized problem, in both the discrete and the continuous versions, is equivalent to the Schrödinger bridge [8, 1]. When the space underlying the two measures $\mu$ and $\nu$ is continuous and the distributions are known in closed form, one can – in small dimensional problems – discretize them on a grid or a graph before applying these techniques. Then their solution provides a point-by-point assignment between the source and the target measures.

However, in most data science applications, the distributions underlying the source and/or target samples are unknown. Moreover, those samples are often embedded in a high dimensional space, and the data are relatively scarce. Density estimation techniques using this scarce data will yield a poor representation of the source and target measures.
Hence the transport map or transference plan provided by these techniques will be either inaccurate or highly over-fitted, which leads to very poor predictive power for the target of new sample points from the source.

In order to provide a more flexible framework for data science applications, sample-based techniques to solve the OT problem were developed in [19, 7, 20]. A central question to address when posing sample-based OT problems is the meaning of the push-forward condition $T_\#\mu = \nu$ when $\mu$ and $\nu$ are only known through samples $\{x_i\}$, $\{y_j\}$. In the formulations in [19, 7, 20], this condition was relaxed to the equality of the empirical means of a pre-determined set of functions or "features" over the two sample sets; a relaxation that appears naturally in the dual formulation of the problem. This raises the feature selection problem of finding the set of features best suited to each application. The associated challenges are particularly apparent in the change detection problem, where elements in two point clouds may differ for instance in size, color, shape, data distribution or location; may be large or small; may have appeared, disappeared, been displaced, deformed, broken, consolidated. . . Thus the development of a robust, application-independent feature-selection methodology is far from trivial.

The methodology proposed in this article incorporates feature selection into the formulation of the optimal transport problem itself, through an adversarial approach. This involves three main steps:

1. Borrowing from the methodology developed in [7], we subdivide the transportation problem between $\mu$ and $\nu$ into finding $N$ local maps $T_t$ pushing forward $\rho_{t-1}$ to $\rho_t$, with $\rho_0 = \mu$ and $\rho_N = \nu$. The global map $T$ results from the composition of these local maps: $T = T_N \circ T_{N-1} \circ \dots \circ T_1$, and global optimality is guaranteed by requiring that the $\rho_t$ are McCann's displacement interpolants [9] between $\mu$ and $\nu$.
This decomposition achieves two goals:

• Because every pair of successive $\rho_t$ are close to each other, the corresponding maps $T_t$ are close to the identity, which is the gradient of the strictly convex function $\|x\|^2/2$. This permits relaxing the requirement that $\phi_t$ be convex in the optimality condition $T_t = \nabla\phi_t$ for the standard quadratic transportation cost.

• Arbitrarily complex maps $T$ can be built through the composition of quite simple maps $T_t$. Thus, the maps over which to optimize each local problem can be reduced to a suitable family depending on just a handful of parameters.

2. We formulate the push-forward condition $T_{t\#}\rho_{t-1} = \rho_t$ not in terms of the empirical expectation of features but as the minimization of the relative entropy between $T_{t\#}\rho_{t-1}$ and $\rho_t$. One advantage of this formulation is that it is a natural relaxation of the push-forward condition when $T_t$ is restricted to a small family of maps, which renders impossible the achievement of a perfect match between $T_{t\#}\rho_{t-1}$ and $\rho_t$.

3. We use a variational characterization of the relative entropy, as the maximizer of a suitable functional over functions $g(x)$. This formulation has three critical properties:

(a) Since the variational characterization involves expected values of functions over $\rho_{t-1}$ and $\rho_t$, it can be immediately extended to a sample-based scenario, replacing those expected values by empirical means.

(b) Replacing "all" functions $g(x)$ by a suitable family of functions provides a natural relaxation in the presence of finite sample sets. We show that, unlike the maps $T_t$, which produce the global map $T$ via composition, it is the sum of the functions $g_t$ that approximates the global $g$. Moreover, we prove that, if the families of $T_t$ and $g_t$ are built through the linear superposition of a predetermined set of functions, we recover the solution in [7].

(c) Each local problem has now been given a minimax formulation (minimize over $T$, maximize over $g$).
This has a natural adversarial interpretation: while the "player" with strategy $T$ seeks to minimize the discrepancies between $T_\#\rho_{t-1}$ and $\rho_t$, its adversary with strategy $g$ develops features to prove that the two distributions have not been perfectly matched. This provides the desired adaptability: the user does not need to provide features adapted to the problem at hand, as these emerge automatically from its solution. This facilitates applications across a broad range of problems, including problems with significant features at various, possibly unknown scales.

This paper is organized as follows. After this introduction, section 2 describes the methodology and its theoretical underpinning. Subsection 2.1 introduces the variational characterization of the relative entropy that the algorithm uses and concludes with the sample-based minimax formulation of the local optimal transport problem. Subsection 2.2 shows that, when the functions $g$ and potentials $\phi$ are drawn from finite-dimensional linear functional spaces, the solution to the problem agrees with the one obtained in [7] with pre-determined features. Subsection 2.3 proves that the order of minimization and maximization does not matter – that is, that there is no duality gap – and explains the intuition behind the adversarial nature of the game, by detailing how each player reacts to the other's strategy. Subsection 2.4 integrates the local algorithm just described into a global algorithm for the full optimal transport between $\mu$ and $\nu$.

Section 3 details the algorithm further. Subsection 3.1 specifies the functional spaces chosen for $g$ and $\phi$, subsection 3.2 the procedure used for solving the minimax problem, and subsection 3.3 the additional penalization terms required for the non-linear components of the functional spaces. Finally, section 4 performs some illustrative numerical experiments, applying the new methodology to synthetic low-dimensional data.
The focus of these experiments is to display in action, in easy-to-visualize scenarios, the adversarial nature of the formulation.

We are given two sample sets $(x_i)_{i=1,..,n}$, $(y_j)_{j=1,..,m} \subset R^d$, with $n$ and $m$ sample points respectively, independent realizations of two random variables with unknown distributions $\mu$ and $\nu$. Both distributions are assumed to be absolutely continuous with respect to the Lebesgue measure on $R^d$ and to have finite second order moments. By a slight abuse of notation, we will identify the measures and their densities.

In this case, Brenier's theorem [21, p. 66] guarantees the existence of a map $T$ pushing forward $\mu$ to $\nu$ and minimizing the transportation cost
$$\int \|T(x) - x\|^2\, \mu(x)\, dx. \qquad (2.1)$$
From the samples provided, we seek a map $T$ that would perform the transport well when applied to other independent realizations of the unknown distributions $\mu, \nu$. We can assume that the source and target distributions are close:

Remark.
Solving the problem for nearby distributions is the building block of a general procedure for arbitrary distributions and for finding the Wasserstein barycenter of distributions [7]. This more general procedure is presented in Section 2.4.

The problem then combines two requirements: that $(T(x_i))$ and $(y_j)$ have the same distribution, and the minimization of the cost.

Remark.
For the quadratic cost, the optimal solution is the gradient of a convex function $\phi(x)$, $y = T(x) = \nabla\phi(x)$, a convenient characterization. More general cost functions of the type $\ell(x - y)$ would only require modifying $\nabla\phi$ into $x - \nabla\ell^*(\nabla\phi)$ in the algorithm presented below, where $\ell^*$ represents the Legendre-Fenchel transform of the strictly convex function $\ell$.

In [7], the push-forward condition was formulated in terms of the equality of the empirical expected values of a pre-determined set of feature functions. Instead, we propose a broader and adaptive formulation, in terms of the relative entropy between the two distributions. This introduces some significant improvements:

1. Of the two characterizations of equality of distributions – that all test functions within a broad enough class agree, and that the relative entropy vanish – the latter is far more succinct and easier to enforce.

2. Replacing "all" test functions by a finite set, though a sensible approximation in the presence of finite sample sizes, leads to questions of robustness and feature selection. To address this, we will use a variational characterization of the relative entropy, which automatically selects the "best" features within a given class.

3. For finite sample sets, one would expect the empirical expected values of test functions on the two distributions to agree only in a statistical sense, so requiring their strict equality is somewhat artificial. By contrast, in the new formulation, rather than requiring the relative entropy to vanish, which may be unrealistic for finite sample sizes and a limited family of maps $T$, we seek to minimize it.

Definition 1.
For two probability measures $\rho, \nu \in P(R^d)$, the Kullback-Leibler divergence of $\rho$ with respect to $\nu$ – also called their relative entropy – is defined as
$$D_{KL}(\rho \| \nu) = \int \log\left(\frac{d\rho}{d\nu}\right) d\rho \qquad (2.2)$$
if $\rho$ is absolutely continuous with respect to $\nu$ ($\rho \ll \nu$), and $+\infty$ otherwise.

Solving the optimal transport problem is equivalent to minimizing a Kullback-Leibler divergence, as the following proposition shows:
Proposition 1.
Let $\mu, \nu \in P(R^d)$, with $\mu$ absolutely continuous with respect to the Lebesgue measure $m$ on $R^d$. Let $C$ be the set of convex functions from $R^d \to R$. Define the minimization problem
$$\inf_{\phi \in C} D_{KL}(\nabla\phi_\#\mu \| \nu) \qquad \text{(KLopt)}$$
where $\nabla\phi_\#\mu(A) = \mu((\nabla\phi)^{-1}(A))$ for any Borel measurable set $A$. Then there exists a unique minimizer $\phi$ (up to zero measure sets), which coincides with the minimizer of the 2-Wasserstein distance between $\mu$ and $\nu$:
$$\phi = \arg\inf_{\psi \in C} D_{KL}(\nabla\psi_\#\mu \| \nu) \quad \text{and} \quad W_2^2(\mu, \nu) = \int \|\nabla\phi(x) - x\|^2\, d\mu(x).$$
Note that $\nabla\phi$ is well defined $m$-a.e. by Theorem 25.5 of [15], and hence $\mu$-a.e.

Proof. By Brenier's theorem, there exists a unique minimizer $\phi$ (up to zero measure sets) for the 2-Wasserstein problem. The potential $\phi$ is a proper lower semi-continuous convex function and $\nabla\phi_\#\mu = \nu$. One easily sees that $\phi$ is the minimizer for the Kullback-Leibler divergence optimization problem (KLopt), since for any measure $\rho \in P(R^d)$ one has
$$D_{KL}(\rho \| \nu) \ge 0, \qquad (2.3)$$
with equality if and only if $\rho = \nu$ almost everywhere (in $\nu$). Inequality (2.3) is easy to prove: if $\rho$ is not absolutely continuous w.r.t. $\nu$, the Kullback-Leibler divergence is infinite, so the statement is true. Otherwise, we have
$$D_{KL}(\rho \| \nu) = \int \log\left(\frac{d\rho}{d\nu}\right) d\rho = \int \log\left(\frac{d\rho}{d\nu}\right) \frac{d\rho}{d\nu}\, d\nu \ge \left(\int \frac{d\rho}{d\nu}\, d\nu\right) \log\left(\int \frac{d\rho}{d\nu}\, d\nu\right) = 0,$$
where we used Jensen's inequality and the convexity of $x \mapsto x\log(x)$. The equality $D_{KL}(\rho \| \nu) = 0$ holds if and only if Jensen's inequality becomes an equality, i.e. if and only if $\frac{d\rho}{d\nu} \equiv 1$, or $\rho = \nu$.

In particular, the solution to the optimal transport problem satisfies $\nabla\phi_\#\mu = \nu$. Hence $D_{KL}(\nabla\phi_\#\mu \| \nu) = 0$, which shows that $\phi$ is a minimizer of both the optimal transport problem and (KLopt). As for uniqueness, let $\phi_1, \phi_2 \in C$ be two minimizers. Then $\nabla\phi_{1\#}\mu = \nabla\phi_{2\#}\mu = \nu$ from the statement above. By Brenier's theorem, they both solve the quadratic cost optimal transportation problem, which has a unique solution up to zero measure sets.

Recently there has been a push in machine learning to replace the Kullback-Leibler divergence by Wasserstein distances in order to penalize differences in data sets [5, 14]. Unlike the Kullback-Leibler divergence, the Wasserstein distance defines a proper distance, enjoys regularity and symmetry properties, and is computationally tractable. Nonetheless, the Kullback-Leibler divergence is well suited to measure the dissimilarities between measures that we are trying to detect. In particular, the asymmetry between the two measures under the Kullback-Leibler divergence is well within the spirit of the problem, as we seek a convex function $\phi$ that makes the transported distribution $\nabla\phi_\#\mu$ indistinguishable from the target reference $\nu$. Also, as we shall see, the minimization of the relative entropy captures the differences between the two sample sets far more deftly than does a predefined finite set of test functions.

Thus, the biggest drawback in using the Kullback-Leibler divergence appears to be the difficulty of its numerical evaluation, particularly when we do not have access to a closed form expression for $\mu$ and $\nu$, but merely to a finite set of independent samples from each of these distributions. One could resort to density estimation techniques [16, 17] to approximate $\mu$ and $\nu$ and then proceed to numerical integration. Instead, we use a variational characterization of the Kullback-Leibler divergence of $\rho$ with respect to $\nu$, in the form of a sample-friendly expression:

Proposition 2.
Let $\rho, \nu \in P(R^d)$. Then
$$D_{KL}(\rho \| \nu) = 1 + \sup_g \left\{ \int g\, d\rho - \int e^g\, d\nu \right\}$$
over all Borel measurable functions $g : R^d \to R$.

Proof. If we do not have $\rho \ll \nu$, there exists a set $A \subset R^d$ such that $\rho(A) > \nu(A) = 0$. Then $1 + \sup_g \left\{ \int g\, d\rho - \int e^g\, d\nu \right\}$ is infinite, as it can be made arbitrarily large by picking functions of the type $g = c\,1_A$, $c \in R$. $D_{KL}(\rho \| \nu)$ is also infinite in this case, hence the two values agree.

When $\rho \ll \nu$, notice that for $\nu$-almost every $x \in R^d$, the map
$$g \in R \mapsto g\,\frac{d\rho}{d\nu}(x) - e^g$$
is concave and maximized for $g(x) = \log\left(\frac{d\rho}{d\nu}(x)\right)$ (note that the Radon-Nikodym derivative $\frac{d\rho}{d\nu}$ is non-negative, $\nu$-a.e.). Thus, for almost every $x \in R^d$ and any choice of $g(x) \in R$, we have
$$g(x)\,\frac{d\rho}{d\nu}(x) - e^{g(x)} \le \frac{d\rho}{d\nu}(x)\left[\log\left(\frac{d\rho}{d\nu}(x)\right) - 1\right],$$
with equality if and only if $g(x) = \log\left(\frac{d\rho}{d\nu}(x)\right)$. Integrating over the measure $\nu$ yields
$$1 + \int_{R^d} \left( g(x)\,\frac{d\rho}{d\nu}(x) - e^{g(x)} \right) d\nu(x) \le 1 + \int_{R^d} \log\left(\frac{d\rho}{d\nu}(x)\right) d\rho(x) - 1 = D_{KL}(\rho \| \nu)$$
and, thus, one has
$$1 + \sup_g \left\{ \int_{R^d} g(x)\, d\rho(x) - \int_{R^d} e^{g(y)}\, d\nu(y) \right\} = D_{KL}(\rho \| \nu)$$
since we have equality for $g = \log\left(\frac{d\rho}{d\nu}\right)$ on the support of $\nu$.

Remark.
1. The variational reformulation of the Kullback-Leibler divergence is a consequence of the convexity of $x \mapsto -\log(x)$. Indeed, computing its Legendre-Fenchel transform twice yields
$$-\log(x) = \sup_{y<0}\left\{ xy + 1 - \log\left(-\frac{1}{y}\right)\right\} = \sup_{g \in R}\left\{ g - x e^g \right\} + 1.$$
This approach extends to a broader set of f-divergences, yielding similar variational formulations; see [11] and [12].

2. A very similar variational formulation was developed in [18] to estimate the likelihood of two samples being generated from independent sources.

3. Note that the variational formulation above is very similar to the Donsker-Varadhan [3] formula:
$$\sup_g \left\{ \int_{R^d} g(x)\, d\rho(x) - \log\left(\int_{R^d} e^{g(y)}\, d\nu(y)\right)\right\}$$
Indeed, $\log(x) \le x - 1$ yields
$$\sup_g \left\{ \int_{R^d} g(x)\, d\rho(x) - \int_{R^d} e^{g(y)}\, d\nu(y)\right\} + 1 \le \sup_g \left\{ \int_{R^d} g(x)\, d\rho(x) - \log\left(\int_{R^d} e^{g(y)}\, d\nu(y)\right)\right\}$$
and equality is achieved for the same maximizer $g = \log\left(\frac{d\rho}{d\nu}\right)$, if $\rho \ll \nu$ (otherwise both sides are infinite). The formula in Proposition 2 can be considered a linearization of the Donsker-Varadhan formula, easier to implement numerically.

Given random variables $Z \sim \rho$ and $Y \sim \nu$ with $\rho \ll \nu$, we can equivalently express the formula in Proposition 2 as
$$D_{KL}(\rho \| \nu) = 1 + \max_g \left\{ E[g(Z)] - E\left[e^{g(Y)}\right] \right\}$$
If instead we are given independent samples $z_1, ..., z_n$ of $Z$ and $y_1, .., y_m$ of $Y$, we can approximate the above reformulation by its empirical counterpart:
$$D_{KL}(\rho \| \nu) \approx 1 + \max_g \left\{ \frac{1}{n}\sum_i g(z_i) - \frac{1}{m}\sum_j e^{g(y_j)} \right\}$$
where the maximization is sought over a suitable class of functions $g$. Theorem 1 of [11] shows that if this class of functions

1. contains the optimizer $g^* = \log\left(\frac{d\rho}{d\nu}\right)$,
2. satisfies the envelope conditions [11, 16a, 16b] (e.g. $g$ is bounded),
3. satisfies the entropy conditions [11, 17a, 17b] (e.g.
Sobolev spaces $W^{k,2}$ on a compact space), then we have Hellinger consistency of this estimator, that is
$$\int \left( \sqrt{\exp(g^*)} - \sqrt{\exp(g_{n,m})} \right)^2 d\nu \xrightarrow[n,m \to +\infty]{} 0$$
where $g_{n,m} = \arg\max \left\{ \frac{1}{n}\sum_i g(z_i) - \frac{1}{m}\sum_j e^{g(y_j)} \right\}$.

We deduce from Propositions 1 and 2 the following reformulation of the optimal transport problem between $\mu$ and $\nu$, under a quadratic cost, expressed as a minimax problem:

Problem 1 (Minimax reformulation).
$$\min_\phi \max_g L[\phi, g] \equiv E[g(\nabla\phi(X))] - E\left[e^{g(Y)}\right]$$

Note that the
Lagrangian $L$ is concave in the maximization variable $g$, but not necessarily convex in the minimization variable $\phi$. The sample-based version of Problem 1 is given by:

Problem 2 (Sample-based minimax reformulation).
$$\min_\phi \max_g L[\phi, g] \approx \min_\phi \max_g \frac{1}{n}\sum_i g(\nabla\phi(x_i)) - \frac{1}{m}\sum_j e^{g(y_j)}$$
over suitable function spaces for $\phi(x)$ and $g(y)$, as detailed in Section 3.

This is an adversarial setting, in which the player with strategy $\phi$ attempts to minimize the discrepancies between the distributions underlying the sample sets $\{\nabla\phi(x_i)\}$ and $\{y_j\}$, while the player with strategy $g$ attempts to show that the two distributions are in fact different. Thus $g$ points to those areas where the two distributions differ the most, and $\phi$ corrects those discrepancies. We will see this competition in action in the examples in section 4.

This saddle point optimization problem is reminiscent of the ones encountered in the Generative Adversarial Networks (GAN) literature [12]. Broadly speaking, a GAN learns how to generate a sample from an unknown distribution. To do so, a two-player game is introduced: a parameterized generator $Q$ aims to produce samples as 'close' as possible to the samples in the training set. This is quantified by the use of an f-divergence (e.g. Kullback-Leibler, Jensen-Shannon, or 'GAN' divergence), which is given a variational formulation in the exact same way as it is done in Proposition 2. This in turn introduces a discriminator, whose role is to prove that the generator has not done the right job. Formulated as such, our optimization problem is quite similar to a GAN. Indeed, the generator $Q$ is a distribution which is usually induced by the pushforward of a generic distribution (e.g. a standard Gaussian) by a map $T$.
This map, as well as the discriminator, are calibrated using neural networks. This is well within the spirit of the method we use to generate the optimal transport map, as well as the function $g$ (see Section 2.4). The main differences between the algorithm presented in [12] and ours are:

1. Our map $T$ is restricted to the form $\nabla\phi$ where $\phi$ is convex, in order to solve the quadratic Wasserstein problem. To our knowledge, there are no restrictions on the map in the GAN problem.

2. We use a variational formulation of the Kullback-Leibler divergence instead of the 'GAN' divergence.

3. Instead of using a batch gradient descent for the optimization algorithm, we proceed to what we call 'implicit gradient descent', which is described in Section 3.2.

4. Although our methods of generating the map $T$ and 'discriminant' $g$ proceed through a sum or composition of many non-linear maps, we do not directly use neural networks.

In [7], a set of 'features' $f_1, .., f_K$ serve as test functions to evaluate the statement $\rho = \nu$ for $\rho, \nu \in P(R^d)$, when we only have sample points $(z_i)_{i=1,..,n}$ and $(y_j)_{j=1,..,m}$ generated from $Z \sim \rho$, $Y \sim \nu$. As in [7], we will assume that $\mu, \nu$ are 'close'. The general case with more distant measures can be reduced to the solution of many local problems, as shown in Algorithm 1 below, also borrowed from [7].

Definition 2.
The samples $(z_i)_{i=1,..,n}$ and $(y_j)_{j=1,..,m}$, generated from random variables $Z \sim \rho$, $Y \sim \nu$, are equivalent for the set of features $f_1, .., f_K$ if
$$\frac{1}{n}\sum_{i=1}^n f_k(z_i) = \frac{1}{m}\sum_{j=1}^m f_k(y_j), \qquad \forall k = 1, .., K$$

The definition above is a relaxation of the equivalence $\rho = \nu \Leftrightarrow E[f(Z)] = E[f(Y)]$ for all test functions $f \in C_b(R^d)$. Then solving the transport problem between the samples $(x_i)$ and $(y_j)$ is reduced to finding a map $T$ such that $(T(x_i))_i$ is equivalent to $(y_j)$ for the features $f_1, .., f_K$. In [7], $T$ is chosen to be of the type
$$T(x) = \nabla\phi(x) = x + \sum_k \alpha_k \nabla\phi_k(x)$$
for some pre-determined functions $\phi_1, .., \phi_K$ and constants $\alpha_1, ..., \alpha_K$. In fact, the potentials $\phi_k$ adopted in [7] agree with the features $f_k$, but our proposition below applies to more general choices. It shows that the procedure to solve the sample-based optimal transport problem with pre-determined features is a particular instance of Problem 2. A specific choice of functional space for $g$ will yield this result. Before introducing it, we need a set of compatibility conditions for the choices of possible $\phi$ and $g$.

Definition 3.
The features $f_k$, $k = 1, .., K$ are said to be compatible with the potentials $\phi_k$, $k = 1, .., K$ for the sample $(x_i)_{i=1,..,n}$ if the matrix $C \in R^{K \times K}$ defined as
$$C_{kk'} = \frac{1}{n}\sum_{i=1}^n \nabla\phi_k(x_i) \cdot \nabla f_{k'}(x_i)$$
is non-singular.

This compatibility assumption essentially guarantees the non-degeneracy of the choice of functions, as it restricts the average displacement to affect the features in an independent fashion. It can be summarized by the requirement that $C = E[J_\phi J_f^\top]$ is non-singular, where $J_\phi, J_f$ are the Jacobian matrices of $\phi, f$.

Proposition 3.
Given a compatible set of features $f_1, .., f_K$ and potentials $\phi_1, .., \phi_K$ for the sample $(x_i)_{i=1,..,n}$, consider Problem 2 using the functional spaces
$$g(z) = \sum_{k=1}^K \beta_k f_k(z), \qquad \phi(x) = \frac{|x|^2}{2} + \sum_{k=1}^K \alpha_k \phi_k(x)$$
for $\beta \in R^K$ and $\alpha \in R^K$ in a small-enough neighborhood of zero. Then the optimizer $\phi$ of Problem 2 for two sample sets close to each other solves the sample-based optimal transport problem with predetermined features, meaning that $(\nabla\phi(x_i))$ is equivalent to $(y_j)$ for the features $f_1, .., f_K$:
$$\frac{1}{n}\sum_{i=1}^n f_k(\nabla\phi(x_i)) = \frac{1}{m}\sum_{j=1}^m f_k(y_j), \qquad \forall k = 1, .., K$$

Proof.
The Lagrangian $L$ as a function of $\alpha, \beta$ is given by
$$\frac{1}{n}\sum_i \left[ \sum_{k=1}^K \beta_k f_k\left( x_i + \sum_{l=1}^K \alpha_l \nabla\phi_l(x_i) \right) \right] - \frac{1}{m}\sum_j e^{\sum_{k=1}^K \beta_k f_k(y_j)}$$
Taking the first order conditions at optimality yields
$$\nabla_\alpha L = C(\alpha)\,\beta, \quad \text{where} \quad C(\alpha)_{kk'} = \frac{1}{n}\sum_i \left[ \nabla\phi_k(x_i) \cdot \nabla f_{k'}\left( x_i + \sum_{l=1}^K \alpha_l \nabla\phi_l(x_i) \right) \right].$$
Since features and potentials are compatible, the matrix $C$ is non-singular; since $\alpha$ is in a neighborhood of zero, $C(\alpha)$ is a small perturbation of $C$ and is thus non-singular itself. Hence
$$\nabla_\alpha L = 0 \Rightarrow \beta = 0.$$
Moreover, the second optimality condition evaluated at $\beta = 0$ yields, for all $k$,
$$\partial_{\beta_k} L = \frac{1}{n}\sum_i f_k\left( x_i + \sum_{l=1}^K \alpha_l \nabla\phi_l(x_i) \right) - \frac{1}{m}\sum_j f_k(y_j)$$
Hence $\nabla_\beta L = 0$ at $\beta = 0$ implies that
$$\frac{1}{n}\sum_i f_k\left( x_i + \sum_{l=1}^K \alpha_l \nabla\phi_l(x_i) \right) = \frac{1}{m}\sum_j f_k(y_j)$$
Notice that the closeness of the two sample sets and the compatibility between the potentials and features guarantee that this problem has a solution with a small $\alpha$ (in fact, this can be taken as a feature-dependent characterization of what it means for two sample sets to be close to each other). This result means that the empirical expected values of the $f_k$ agree on $\{T(x_i)\}$ and $\{y_j\}$, i.e. the samples are equivalent for the features $f_1, ..., f_K$. Hence $T = \nabla\phi$ solves the sample-based optimal transport problem with pre-determined features.

Note that we are restricting the maps $\nabla\phi$ to be 'small' perturbations of the identity, by choosing $\alpha$ in a neighborhood of 0. This is because the optimal transport procedure will only be applied to measures or samples that are 'close' to each other. In this paper, we will allow $g$ to be more general than a simple linear combination of features, thus greatly expanding the procedure in [7].
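To make the role of an adaptive $g$ concrete, here is a self-contained toy sketch (our own illustration with hypothetical Gaussian data, not the paper's algorithm): the empirical functional of Proposition 2 is maximized by plain gradient ascent over the small quadratic family $g(x) = a + bx + cx^2$, which happens to contain the true optimizer $g^*$; the resulting maximum estimates $D_{KL}$ and the fitted $g$ flags where the two samples differ.

```python
import math
import random

# Toy sketch (hypothetical data): estimate D_KL(rho || nu) from samples via the
# variational formula, maximizing over g(x) = a + b*x + c*x^2.
# Here rho = N(0.5, 1) and nu = N(0, 1), so the true divergence is 0.5^2/2 = 0.125
# and the true optimizer is g*(x) = log(d rho/d nu)(x) = 0.5*x - 0.125.
random.seed(0)
z = [random.gauss(0.5, 1.0) for _ in range(1500)]  # samples of Z ~ rho
y = [random.gauss(0.0, 1.0) for _ in range(1500)]  # samples of Y ~ nu

n, m = len(z), len(y)
mz1 = sum(z) / n                    # empirical E[Z]
mz2 = sum(zi * zi for zi in z) / n  # empirical E[Z^2]

# The objective (1/n) sum g(z_i) - (1/m) sum e^{g(y_j)} is concave in (a, b, c),
# so plain gradient ascent converges.
a = b = c = 0.0
lr = 0.1
for _ in range(600):
    w = [math.exp(a + b * yj + c * yj * yj) for yj in y]
    grad_a = 1.0 - sum(w) / m
    grad_b = mz1 - sum(wi * yj for wi, yj in zip(w, y)) / m
    grad_c = mz2 - sum(wi * yj * yj for wi, yj in zip(w, y)) / m
    a += lr * grad_a
    b += lr * grad_b
    c += lr * grad_c

kl_est = 1.0 + (a + b * mz1 + c * mz2) \
             - sum(math.exp(a + b * yj + c * yj * yj) for yj in y) / m
print(b, kl_est)  # b should land near 0.5 and kl_est near 0.125
```

The fitted coefficients recover the shift between the samples without the shift having been designated as a feature in advance; a pre-determined feature set that missed the relevant direction would leave part of the discrepancy undetected.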
This added flexibility yields better adaptability to the most important characteristics of the data.

Given the Lagrangian $L$ introduced in Problem 1, the primal objective functional to minimize is, according to Proposition 2,
$$D[\phi] = \max_g L[\phi, g] = D_{KL}(\nabla\phi_\#\mu \| \nu) - 1 \qquad (2.5)$$
and the dual objective functional to maximize is
$$d[g] = \min_\phi L[\phi, g] = \left( \min_{y \in R^d} g(y) \right) - E\left[e^{g(Y)}\right] \qquad (2.6)$$
A desired property of the adversarial game defined by the formulation in Problem 1 is the absence of an irreversible advantage or penalty a player gets from playing first. In other words, we do not want a duality gap. This is the content of the following proposition:

Proposition 4 (Absence of duality gap).
$$\min_\phi \max_g L[\phi, g] = \min_\phi D[\phi] = \max_g d[g] = \max_g \min_\phi L[\phi, g]$$

Proof.
From Proposition 1, we know that
$$\min_\phi D_{KL}(\nabla\phi_\#\mu \| \nu) = 0,$$
with the minimum reached for the solution of the transport problem. Hence we get in Equation (2.5)
$$\min_\phi \max_g L[\phi, g] = \min_\phi D[\phi] = -1.$$
For the dual problem,
$$\max_g \min_\phi L[\phi, g] = \max_g \left\{ \min_\phi E[g(\nabla\phi(X))] - E\left[e^{g(Y)}\right] \right\}$$
Note that the inner minimum is reached for the convex function $\phi(x) = y_{min} \cdot x$, where $\min_y g(y) = g(y_{min}) \equiv g_{min}$. In the case where the minimum of $g$ is not reached, take a minimizing sequence $y^n_{min}$ such that $g(y^n_{min}) \to \inf_{y \in R^d} g(y) \equiv g_{min}$; then a minimizing sequence for the inner minimum in $\phi$ is given by $\phi_n(x) = y^n_{min} \cdot x$. In both cases,
$$\min_\phi E[g(\nabla\phi(X))] = g_{min}.$$
We are thus left with maximizing the dual problem
$$\max_g d[g] = \max_g \left\{ g_{min} - E\left[e^{g(Y)}\right] \right\}$$
Since $E[e^{g(Y)}] \ge e^{g_{min}}$, we can always choose $g$ to be the constant function $g_{min}$. We are then left with maximizing
$$\max_{g_{min}} \left\{ g_{min} - e^{g_{min}} \right\},$$
which is achieved for $g \equiv g_{min} = 0$. Hence we also have
$$\max_g d[g] = \max_g \min_\phi L(\phi, g) = -1.$$

The optimality conditions for the minimax problem are given by
$$\begin{cases} \nabla\phi \text{ moves mass to where } g \text{ is smallest} \\ g(y) = \log\left( \frac{\nabla\phi_\#\mu(y)}{\nu(y)} \right) \end{cases}$$
Examining the primal and dual problems in light of these conditions explains the behavior of the competing players $\phi$ and $g$:

• Given a function $g$, $\phi$ will try to move mass from the areas where $g$ is large (i.e. $\nabla\phi_\#\mu(y) \ge \nu(y)$) to those where $g$ is small (i.e. $\nabla\phi_\#\mu(y) \le \nu(y)$). Following this strategy allows this player to minimize the impact of $g$ on the Lagrangian.

• Given a function $\phi$, $g$ will adapt to get closer to the function $\log\left( \frac{\nabla\phi_\#\mu(y)}{\nu(y)} \right)$, which is large where mass is in excess ($\nabla\phi_\#\mu(y) \ge \nu(y)$) and vice-versa.
Following this strategy allows the second player to increase the Lagrangian by focusing on those areas where the push-forward condition has not been fully achieved.

The game concludes when $g$ becomes constant (necessarily 0) on the support of the distributions. Then $\phi$ does not need to move mass anymore, as it receives no new directive from $g$.

One could attempt to directly use a procedure based on Problem 2 to solve the OT problem for any samples $(x_i)$ and $(y_j)$. Such a direct approach, however, would not be universally efficient, for the following reasons:

• If the distributions underlying $(x_i)$ and $(y_j)$ are considerably different, one would require a very rich family of potentials to build a $\phi$ that can perform an accurate transfer.

• One would also require a rich functional space from which to draw $g$ in order to properly characterize all significant differences in the two data samples.

• Depending on the parametrization of $\phi$ and $g$, the Lagrangian can be non-convex in the variables parametrizing $\phi$, and non-concave in the variables parametrizing $g$. With distributions that are far apart, this could make the numerical solution depend on the initialization of those parameters.

• The condition that $\phi$ is a convex function is typically hard to enforce. For nearby distributions, on the other hand, it is satisfied automatically, as $\phi(x)$ is close to the convex potential $\|x\|^2/2$ corresponding to the identity map.

Algorithm 1 Theoretical Global Optimal Transport Algorithm (TGOT)
procedure
procedure TGOT($\mu$, $\nu$)
    ▷ Step 1: Initialize intermediate nodes
    $N \leftarrow$ number of intermediary steps
    $\rho_0 \leftarrow \mu$, $\rho_N \leftarrow \nu$
    for $t = 1, \ldots, N-1$ do
        $\rho_t \leftarrow \frac{N-t}{N}\mu + \frac{t}{N}\nu$    ▷ or any arbitrary measure
    while not converged do
        ▷ Step 2: Forward step
        for $t = 1, \ldots, N$ do
            Solve the optimal transport problem between $\rho_{t-1}$ and $\rho_t$, as defined in Problem 1. It yields a 'local' optimal map $\nabla\phi_t$.
        $\nabla\phi \leftarrow \nabla\phi_N \circ \nabla\phi_{N-1} \circ \cdots \circ \nabla\phi_1$
        ▷ Step 3: Backward step
        for $t = 1, \ldots, N-1$ do
            $\rho_t \leftarrow \left(\frac{N-t}{N}\,\mathrm{Id} + \frac{t}{N}\nabla\phi\right)_\sharp \mu$
    return $\nabla\phi$

For these reasons, we will solve multiple local optimal transport problems instead of one global one. More precisely, we will apply Algorithm 1, adapted from Algorithms 2 and 7 in [7]. Theorem 2.4 in [7] proves the convergence of Algorithm 1 to the solution of the OT problem.

In Algorithm 1, the forward step consists of solving multiple small optimal transport problems, addressed in Section 3.2. The backward step back-propagates the final sample computed in the forward pass to all the intermediate samples using McCann's displacement interpolants.

This procedure, reminiscent of the neural networks of machine learning with their "hidden layers" replaced by local optimal transport problems, introduces several advantages:
• The global solution will be obtained by composition of the local maps:
$$\nabla\phi = \nabla\phi_N \circ \nabla\phi_{N-1} \circ \cdots \circ \nabla\phi_1 \qquad (2.7)$$
Hence one can choose a small family of maps to solve each local optimal transport problem, and still span a rich family of maps for the global displacement. Note that in our two-player game, we would theoretically have $T_\sharp\mu = \nu$ at optimality, and hence the optimal $g$ would equal $\log(T_\sharp\mu/\nu) = 0$.
• If $\rho_t$ and $\rho_{t+1}$ are close, the local OT problem has a solution $\nabla\phi$ that is a small perturbation of the identity, i.e. the gradient of a strictly convex potential. Starting from the identity, the numerical algorithm will explore a small neighborhood around it.
If the solution that we seek is in this neighborhood, convexity will be preserved.

The global algorithm for finding the optimal map between two distributions known through the samples $(x_i)$ and $(y_j)$ is summarized in Algorithm 2. Algorithm 3 in Section 3 further details the procedure to solve the sample-based local optimal transport problem.

In order to complete the description of the proposed algorithm, we need to specify the functional spaces from which $g$ and $\phi$ are drawn, and the procedure used for solving the minimax problem for the Lagrangian $L(g, \phi)$.

Algorithm 2 Sample Based Global Optimal Transport Algorithm (SBGOT)
procedure SBGOT($(x_i), (y_j)$)
    ▷ Step 1: Initialize intermediate nodes
    $N \leftarrow$ number of intermediary steps
    $z_0 \leftarrow x$, $z_N \leftarrow y$
    for $t = 1, \ldots, N-1$ do
        $z_{t,i} \leftarrow \frac{N-t}{N}x_i + \frac{t}{N}y_{\sigma(i)}$    ▷ for some $\sigma: \{1,\ldots,n\} \to \{1,\ldots,m\}$ (or any arbitrary samples)
    while not converged do
        ▷ Step 2: Forward step
        for $t = 1, \ldots, N$ do
            $z_t \leftarrow \mathrm{SBLOT}(z_{t-1}, z_t)$    ▷ see Algorithm 3
        ▷ Step 3: Backward step
        for $t = 1, \ldots, N-1$ do
            $z_t \leftarrow \frac{N-t}{N}x + \frac{t}{N}z_N$
    return $z_N$

Since any two consecutive distributions $\mu, \nu$ in the procedure are close to each other, the optimal map is a perturbation of the identity. The potential $\phi$ will, thus, be chosen in the form
$$\phi(x) = \frac{1}{2}\|x\|^2 + \psi(x), \qquad (3.1)$$
where $\psi$ has a Hessian with a spectral radius less than 1. No such centering is required for $g(x)$, as at optimality $g(x) = \log(1) = 0$.

One basic capability that one should require of the functional spaces for $g$ and $\phi$ is that of detecting and correcting global displacements and scalings (not necessarily isotropic) between two distributions. Thus one should have
$$\phi(x) = \frac{1}{2}x^\top(I + A)x + a\cdot x + \phi_{nl}(x)$$
and
$$g(z) = \frac{1}{2}z^\top B z + b\cdot z + b_0 + g_{nl}(z),$$
where $A$, $B$ are symmetric matrices in $\mathbb{R}^{d\times d}$, $a$, $b$ are vectors in $\mathbb{R}^d$, $b_0 \in \mathbb{R}$ is a scalar, and $\phi_{nl}$ and $g_{nl}$ stand for additional non-linear features discussed below. The quadratic polynomial in $\phi$ allows for global translations and dilations. Correspondingly, the quadratic polynomial in $g$ allows for the detection of any mismatch in the mean and covariance of the two distributions. One can easily check that, with these basic functions available, the procedure yields the exact solution to the optimal transport problem between arbitrary Gaussians.

If these are the only features available, then there is no advantage in dividing the global problem into local ones, as the composition of linear maps is also linear, thereby providing no additional richness to the single-step scenario.
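The Gaussian-to-Gaussian claim can be checked numerically: the optimal map between two Gaussians is linear, $T(x) = \hat y + A(x - \hat x)$, and hence exactly representable by the quadratic and linear features above. The following minimal sketch (our illustration, not the paper's implementation) computes the closed-form matrix $A = \Sigma_x^{-1/2}\big(\Sigma_x^{1/2}\Sigma_y\Sigma_x^{1/2}\big)^{1/2}\Sigma_x^{-1/2}$:

```python
import numpy as np

def sqrtm_spd(M):
    """Matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def gaussian_ot_matrix(cov_x, cov_y):
    """Matrix A such that T(x) = mean_y + A (x - mean_x) is the optimal
    (Monge) map between N(mean_x, cov_x) and N(mean_y, cov_y) for the
    quadratic cost."""
    sx = sqrtm_spd(cov_x)
    sx_inv = np.linalg.inv(sx)
    return sx_inv @ sqrtm_spd(sx @ cov_y @ sx) @ sx_inv

# example covariances: the push-forward covariance A cov_x A^T equals cov_y
cov_x = np.array([[2.0, 0.5], [0.5, 1.0]])
cov_y = np.array([[1.0, -0.3], [-0.3, 2.0]])
A = gaussian_ot_matrix(cov_x, cov_y)
```

Since $A$ is symmetric positive definite, $T$ is the gradient of a convex quadratic potential, consistent with the requirement that each local map be the gradient of a convex function.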
The natural element to add is an adaptive feature that could perform (and detect the need of) local mass displacements. In one dimension, a natural choice is provided by one or more Gaussians of the form
$$\phi^k_{nl} = \alpha_k \exp\left(-\frac{1}{2}\left[v_k(x - \bar x_k)\right]^2\right), \qquad g^k_{nl} = \beta_k \exp\left(-\frac{1}{2}\left[s_k(z - \bar z_k)\right]^2\right),$$
where the index $k$ labels the Gaussian feature when more than one is used. The Gaussians in $\phi$ allow for local stretching/compression around $\bar x$ with scale $|v|^{-1}$ and amplitude $\alpha$, while each Gaussian in $g$ detects local discrepancies between the two distributions, as opposed to the global scale and positioning provided by its quadratic component. The parameters $v$, $\bar x$, $s$ and $\bar z$ appear nonlinearly in $\phi$ and $g$, moving us away from the linear feature spaces of [7] and into the realm of adaptability, as the parameters automatically select the location and scale of the changes required by the data.

There are at least four alternative ways to bring these Gaussian features to higher dimensions:
1. Adopt general Gaussians of the form
$$\phi_{nl} = \alpha \exp\left(-\frac{1}{2}\|V(x - \bar x)\|^2\right),$$
with $\bar x$ a vector and $V$ a matrix (it is more convenient to write the Gaussian in terms of a general matrix $V$ in this way, rather than in terms of the inverse covariance matrix $C^{-1} = V^\top V$, as we would need to require the latter to be positive definite);
2. adopt isotropic Gaussians
$$\phi_{nl} = \alpha \exp\left(-\frac{1}{2}v^2\|x - \bar x\|^2\right),$$
with $v$ a scalar;
3. adopt one-dimensional Gaussians along arbitrary directions
$$\phi_{nl} = \alpha \exp\left(-\frac{1}{2}\left[v\cdot(x - \bar x)\right]^2\right),$$
with $v$ a vector; and
4. adopt a Gaussian with diagonal covariance
$$\phi_{nl} = \alpha \exp\left(-\frac{1}{2}\|D(x - \bar x)\|^2\right),$$
with $D$ a diagonal matrix,
and similarly for $g_{nl}$ in all four cases. The first choice has the advantage of generality but may be prone to overfitting in high dimensions, unless it is severely penalized.
The second approximates a general function $\phi$ by the composition of isotropic bumps; an appropriate image is that of hammering a sheet of metal into any desired shape. Yet it would poorly resolve local, one-dimensional changes. The third choice excels at these, but will fare poorly for more isotropic local changes. Finally, the fourth choice is attached to the coordinate axes, which makes sense only if these correspond to variables that are assumed to change independently.

A natural question is how many Gaussians to include in the functional space proposed. We have used two in the examples below, but one Gaussian would have sufficed: in the adversarial multistep method proposed, it is enough that the player with strategy $g(y)$ has a "lens" (the Gaussian) to identify the area where the two distributions least agree, and that the player with strategy $\phi(x)$ has the capability to perform local moves to correct this misfit. Since the center and width of the Gaussian are free parameters, both assertions hold. With a single Gaussian feature, both players can focus on only one local misfit at a time. However, the algorithm has multiple steps, so effectively the total number of features available is the product of the number of features per step times the number of steps.

We will use vectors $\alpha \in \mathbb{R}^a$, $\beta \in \mathbb{R}^b$ to parametrize $\phi(x) = \phi_\alpha(x)$ and $g(y) = g_\beta(y)$. We are seeking to solve the minimax problem in $\alpha \in \mathbb{R}^a$, $\beta \in \mathbb{R}^b$ for the Lagrangian
$$L[\alpha, \beta] = \frac{1}{n}\sum_{i=1}^n g_\beta(\nabla\phi_\alpha(x_i)) - \frac{1}{m}\sum_{j=1}^m e^{g_\beta(y_j)} + P(\alpha, \beta),$$
where $P$ is a penalization function that will be described in Section 3.3.

In practice, one could use any available minimax solver to find a critical point of the above Lagrangian.
Yet, to our knowledge, there is no efficient method available that is suitable for a non-convex/non-concave landscape.

A naive algorithm would simultaneously implement gradient descent in $\alpha$ and gradient ascent in $\beta$, with updates given at each step $s$ by
$$\alpha_{s+1} = \alpha_s - \eta \nabla_\alpha L[\alpha_s, \beta_s], \qquad \beta_{s+1} = \beta_s + \eta \nabla_\beta L[\alpha_s, \beta_s],$$
with a step size $\eta$ that may change at each iteration. From a game-theory perspective, this corresponds to two myopic players that plan their next move based only on their current position, without anticipating what the other player might do.

Instead, more insightful players will choose their next move based on the future position of their opponents. This yields a second-order algorithm, which we will refer to as implicit gradient descent, with updates given by
$$\alpha_{s+1} = \alpha_s - \eta \nabla_\alpha L[\alpha_{s+1}, \beta_{s+1}], \qquad \beta_{s+1} = \beta_s + \eta \nabla_\beta L[\alpha_{s+1}, \beta_{s+1}].$$
A simple Taylor expansion gives
$$\nabla_\alpha L[\alpha_{s+1}, \beta_{s+1}] \approx \nabla_\alpha L_s + \nabla_{\alpha\alpha} L_s \cdot (\alpha_{s+1} - \alpha_s) + \nabla_{\alpha\beta} L_s \cdot (\beta_{s+1} - \beta_s),$$
$$\nabla_\beta L[\alpha_{s+1}, \beta_{s+1}] \approx \nabla_\beta L_s + \nabla_{\beta\alpha} L_s \cdot (\alpha_{s+1} - \alpha_s) + \nabla_{\beta\beta} L_s \cdot (\beta_{s+1} - \beta_s).$$
Defining the twisted gradient $G_s$ and twisted Hessian $H_s$ by
$$G_s = \begin{pmatrix} \nabla_\alpha L_s \\ -\nabla_\beta L_s \end{pmatrix}, \qquad H_s = \begin{pmatrix} \nabla_{\alpha\alpha} L_s & \nabla_{\alpha\beta} L_s \\ -\nabla_{\beta\alpha} L_s & -\nabla_{\beta\beta} L_s \end{pmatrix}, \qquad \gamma_s = \begin{pmatrix} \alpha_s \\ \beta_s \end{pmatrix},$$
one obtains the second-order updating scheme
$$\gamma_{s+1} = \gamma_s - \eta (I + \eta H_s)^{-1} G_s. \qquad (3.2)$$
Notice that as $\eta \to$
0, the scheme is equivalent to a classical gradient descent. On the other hand, as $\eta \to +\infty$, the scheme converges to Newton iterations.

At each iteration, we are allowed to update $\eta$ in order to accelerate convergence. Ongoing research [4] addresses the correct rules to update $\eta$, as well as the convergence of the algorithm to a critical point of the Lagrangian. This minimax solver is robust in two senses: it guarantees both convergence to a local minimax point and constant improvement. The latter has to do with the subtlety of minimax problems: as opposed to regular minimization, enforcing a decrease of the objective function is not enough. In each step of our implicit procedure for $\min_x \max_y L(x, y)$, if $L[x_{s+1}, y_{s+1}]$ is either bigger than $L[x_s, y_{s+1}]$ or smaller than $L[x_{s+1}, y_s]$, we reject the step and adopt a smaller learning rate. Because of this, the solution will always improve over the starting identity map. If computing the twisted Hessian $H$ becomes too costly, one can resort to Hessian approximation techniques such as BFGS or its variations [22, 13].

To conclude, the algorithm for finding the optimal match between two consecutive distributions, which we denote sample-based local optimal transport (SBLOT), is summarized in Algorithm 3.

Algorithm 3 Sample Based Local Optimal Transport Algorithm (SBLOT)
procedure SBLOT($(x_i), (y_j)$)
    Initialize $\gamma$
    Compute the twisted gradient and Hessian $G, H$
    for $n = 1, \ldots, \mathrm{MaxIter}$ do
        if $\|G\| <$ tolerance then break
        $\gamma \leftarrow \gamma - \eta(I + \eta H)^{-1}G$
        Recompute the twisted gradient and Hessian $G, H$ at $\gamma$
        Update $\eta$
    return $\nabla\phi_{\gamma[1:a]}(x)$

Transforming Problem 1 into Problem 2 amounts to replacing the theoretical measures with their empirical estimates:
$$\rho \approx \hat\rho = \frac{1}{n}\sum_{i=1}^n \delta_{\nabla\phi(x_i)}, \qquad \nu \approx \hat\nu = \frac{1}{m}\sum_{j=1}^m \delta_{y_j}.$$
Even if $\rho \ll \nu$, this will not hold for their estimates. Allowing maximum freedom for the function $g$ will result in an infinite Kullback-Leibler divergence. For instance, if one allows functions $g$ with support including some $\nabla\phi(x_i)$ but none of the $y_j$, the Lagrangian will grow unboundedly, since the exponential term that regularly inhibits this growth is now constant. One way to avoid this problem is to use the relative entropy not between $T(X)$ and $Y$, but between $T(X)$ and $(1-\epsilon)Y + \epsilon T(X)$: then the law of $T(X)$ is always absolutely continuous w.r.t. the law of $(1-\epsilon)Y + \epsilon T(X)$, eliminating the possibility of blowup in $g$, while the minimum is still reached when $T(X) = Y$. Another simple, general way to avoid this kind of scenario is the addition to the Lagrangian of terms that penalize overfitting. For our particular choice of functional spaces, it is only the coefficients in the arguments of the exponentials that require penalization, as those are the only ones that involve spatial scales.
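To make the update in scheme (3.2), which SBLOT iterates, concrete, here is a minimal sketch (our illustration, not the paper's implementation) on the toy saddle problem $L(a, b) = ab$ with $\min$ over $a$ and $\max$ over $b$. For this problem the naive simultaneous gradient descent/ascent diverges, while the implicit twisted-gradient update contracts toward the saddle point $(0, 0)$:

```python
import numpy as np

def implicit_step(gamma, G, H, eta):
    """One update of scheme (3.2): gamma <- gamma - eta (I + eta H)^(-1) G."""
    return gamma - eta * np.linalg.solve(np.eye(gamma.size) + eta * H, G)

def twisted(gamma):
    """Twisted gradient and Hessian for L(a, b) = a*b:
    G = (dL/da, -dL/db) = (b, -a), H = [[0, 1], [-1, 0]]."""
    a, b = gamma
    return np.array([b, -a]), np.array([[0.0, 1.0], [-1.0, 0.0]])

gamma = np.array([1.0, 1.0])
for _ in range(100):
    G, H = twisted(gamma)
    gamma = implicit_step(gamma, G, H, eta=0.5)
# the iterates spiral into the saddle point (0, 0); the naive explicit
# update a <- a - eta*b, b <- b + eta*a instead grows the norm every step
```

One can verify by hand that each implicit step shrinks the squared norm by the factor $1/(1+\eta^2)$, whereas the explicit step multiplies it by $1+\eta^2$.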
In particular, for a component of $g$ or $\phi$ of the form $a\, e^{-\frac{1}{2}(b\cdot(x-c))^2}$, we add penalization terms proportional to
• $e^{(\epsilon\|b\|)^2}$, with $\epsilon$ as defined above, to avoid resolving scales smaller than $\epsilon$,
• $\frac{1}{(D\|b\|)^2}$, where $D$ measures the diameter of the support of the data, to avoid having Gaussians so broad that they are indistinguishable from the quadratic components of the functional space,
• $\left\|\frac{c}{D}\right\|^2$, to avoid centering the Gaussian away from the data, and,
• when more than one Gaussian is used, $\left(\frac{\epsilon}{\|c_i - c_j\|}\right)^2$ for each pair $(i, j)$ of Gaussians, to avoid possible degeneracies in the functional space when two Gaussians become nearly indistinguishable.

All these terms are added and multiplied by a tunable parameter $\lambda$. Yet one more consideration is required for the penalization of the parameters of the potential $\phi$: since in the Lagrangian $\phi$ appears only as an argument of $g$, for a fixed $\lambda$ the penalization terms and the core Lagrangian can easily become unbalanced. In particular, at the exact solution, $g$ is zero, so only the penalization terms would remain. To correct for such imbalance, we multiply the corresponding penalization terms by the average value of $\|\nabla g\|$ over all current $\nabla\phi(x_i)$.

This section illustrates the algorithm through some simple examples. First we use a one-dimensional example (the simplest for visualization) and a direct solver between initial and final distributions, to display the way in which the function $g$ adapts, creating features that point to those areas where transport is still deficient, thus guiding $\phi$ to correct them. The two distributions in the first example are relatively close, so that they can be matched without involving intermediate distributions. A second set of one-dimensional examples follows, involving more significant changes and hence requiring the use of interpolated distributions.
Then we perform some two-dimensional examples, involving Gaussians, Gaussian mixtures and a distribution uniform within an annulus. Finally, we use an example built so that we know the exact answer, to perform an empirical analysis of convergence. All the examples presented are intended for illustration and use synthetic data; applications to real data, particularly to change detection, will be presented in field-specific articles currently under development.

φ and g

This section shows, through a simple experiment, the competitive behavior exhibited by the two players $\phi$ and $g$ in the local algorithm (Algorithm 3). To this end, we create data where the initial and final distributions are not very far from each other, so that the local algorithm can be used as a stand-alone routine. More specifically, we map one single Gaussian distribution to a Gaussian mixture, where the two components of the mixture overlap significantly, so that they do not differ too markedly from the source.

Figure 1: Plot at three different iteration times (iteration 0, after 8 iterations, after 17 iterations) of Algorithm 3. Histograms of the source samples and their transforms are in red, and of the target samples in blue. The black curve corresponds to $g(x)$, vertically rescaled for visualization. The green curve represents the displacement $T(x) - x$.

Figure 1 shows steps in the solution to the corresponding sample-based OT problem, with the source samples $(x_i)$ from a Gaussian (and their transforms) in red, and the samples $(y_j)$ from a mixture of two Gaussians in blue. Point samples are represented through histograms. The figure on the left represents the initial configuration, the one in the middle the configuration after 10 iterations of Algorithm 3, and the one on the right the final configuration.

On top of the histograms, we display the function $g(x)$ in black, scaled vertically to be in the interval [−
1, 1] for easier comparison with the data, and the displacement $\nabla\phi(x) - x$ in green, representing the map that sends the initial sample (in red, in the left figure) to the current sample (in red, in the middle or right figure).

The initial displacement, being 0, is not represented at initialization, but we initialize the function $g(z)$ at the purely quadratic function
$$\frac{1}{2}z^\top\left(\hat\Sigma_y^{-1} - \hat\Sigma_x^{-1}\right)z + \left(\hat\Sigma_x^{-1}\hat x - \hat\Sigma_y^{-1}\hat y\right)^\top z + \frac{1}{2}\left(\hat y^\top \hat\Sigma_y^{-1}\hat y - \hat x^\top \hat\Sigma_x^{-1}\hat x\right), \qquad (4.1)$$
where $\hat x, \hat y$ are the empirical means of the samples $(x_i), (y_j)$, and $\hat\Sigma_x, \hat\Sigma_y$ their empirical covariance matrices. Equation (4.1) represents the optimal $g$ for two Gaussian measures. More generally, starting with this expression as the initial guess for $g$ instructs $\phi$ to shift the samples as well as to stretch/compress them, in order to match the first and second moments of the two distributions.

The left image of Figure 1 shows how $g$ highlights the lack of variance in $(x_i)$: its maximum is at 0, and it has smaller values at the edges. This forces $\phi$ to adapt accordingly, by applying a linear map to stretch $(x_i)$. When the variance of the $(\nabla\phi(x_i))$ exceeds the variance of the $(y_j)$, the shape of $g$ is inverted.

In the middle image of Figure 1, we can see that $\nabla\phi$ corrected the mismatches highlighted by $g$, and even started to slightly separate the mass in the middle. However, there is still too much red mass around 0, and too little red mass around the two peaks of the blue Gaussian mixture. This is well detected by $g$, which has a local maximum within the area of red mass excess and two local minima within the areas of red mass deficit. In the right image of Figure 1, we observe that $\nabla\phi$ adapted accordingly and starts yielding satisfactory results. At this point, $g$ is very close to 0 in the $\|\cdot\|_\infty$ norm, although this is not apparent in the figure due to the normalization we applied for plotting.
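The initialization (4.1) is straightforward to compute from samples. The following is a minimal sketch (ours, assuming samples stored as `(n, d)` arrays); note that for $x \sim N(0,1)$ and $y \sim N(1,1)$ in one dimension, (4.1) gives $g(z) \approx \frac{1}{2} - z$, which is exactly the log density ratio prescribed by the optimality condition:

```python
import numpy as np

def initial_g(x, y):
    """Quadratic initialization of g, Eq. (4.1): the optimal g when both
    samples are Gaussian, built from empirical means and covariances.
    x and y are arrays of shape (n, d) and (m, d)."""
    d = x.shape[1]
    mx, my = x.mean(axis=0), y.mean(axis=0)
    Px = np.linalg.inv(np.cov(x, rowvar=False).reshape(d, d))  # Sigma_x^{-1}
    Py = np.linalg.inv(np.cov(y, rowvar=False).reshape(d, d))  # Sigma_y^{-1}

    def g(z):
        quad = 0.5 * z @ (Py - Px) @ z
        lin = (Px @ mx - Py @ my) @ z
        const = 0.5 * (my @ Py @ my - mx @ Px @ mx)
        return quad + lin + const

    return g

# 1-D check: x ~ N(0, 1), y ~ N(1, 1), so g(z) should approach 0.5 - z
rng = np.random.default_rng(0)
g = initial_g(rng.normal(0.0, 1.0, (200_000, 1)),
              rng.normal(1.0, 1.0, (200_000, 1)))
```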
Figures 2 and 3 represent inputs and outputs of Algorithm 2, where $(x_i)$ is sampled from a Gaussian and $(y_j)$ from a mixture of two and three Gaussians, respectively.

Figure 2: Algorithm 2 pushing forward a Gaussian to a mixture of two Gaussians, in 1D; starting and final configurations. The source samples and their transforms are depicted through histograms in red, and the target samples in blue.

These results were obtained by generating ∼
200 samples for the source and target measures, and using the functional spaces defined in Section 3.1 in the local algorithm (Algorithm 3), with a general quadratic form for both $\phi$ and $g$, plus one adaptive Gaussian for $\phi$ and two for $g$. A total of $N = 10$ and $N = 20$ intermediary measures were adopted for the first and second example, respectively. As one can see, even though each local map can only perform one local deformation, the composition of many creates all the complexity required to move one single Gaussian to a mixture of two or three.

Figure 3: Same as Figure 2, but with a mixture of three Gaussians as target.

Switching to two dimensions, Figure 4 represents the results of mapping a Gaussian distribution to a uniform distribution within an annulus. An isotropic Gaussian was used for $\phi_{nl}$ and two for $g_{nl}$ in the functional space of Algorithm 3, and $N = 30$ intermediary distributions were used in Algorithm 2. Figure 5 represents the displacement interpolants at $t = k/6$, $k = 1, \ldots,$
5, obtained from running Algorithm 2 on the example in Figure 4. In addition to mass spreading from the isotropic Gaussian, the linear and quadratic parts of $\phi$ translated and stretched the red sample accordingly.

Figure 4: Algorithm 2 from a displaced Gaussian to an annulus, in 2D; starting and final configurations.

Similarly, Figure 6 represents the initial and final configurations obtained from running Algorithm 2 to transport a two-dimensional Gaussian distribution to a mixture of two Gaussians. A diagonal covariance was used in the non-linearity $\phi_{nl}$ for the functional space in Algorithm 3, and $N = 30$ intermediary steps were used in Algorithm 2. This type of non-linearity is well adapted to separating samples along the horizontal and vertical axes. Figure 7 represents the displacement interpolants at $t = k/6$, $k = 1, \ldots, 5$, obtained from running Algorithm 2 on the example in Figure 6.
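In one dimension with equally many source and target samples, the displacement interpolants used throughout (both for initializing the $z_t$ and for the figures) reduce to linear interpolation along the monotone (sorted) matching, which is the optimal assignment for the quadratic cost. A minimal sketch (ours):

```python
import numpy as np

def displacement_interpolant_1d(x, y, t):
    """McCann interpolant (1 - t) x_i + t y_{sigma(i)} between two equal-size
    1-D samples, with sigma the monotone (sorted) optimal assignment."""
    return (1.0 - t) * np.sort(x) + t * np.sort(y)

# halfway between {0, 1, 2} and a shuffled {10, 11, 12}
z_half = displacement_interpolant_1d(np.array([0.0, 1.0, 2.0]),
                                     np.array([12.0, 10.0, 11.0]), 0.5)
```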
In this subsection, we empirically analyze the convergence of the algorithm in a situation where the generating distributions, as well as the optimal map, are known: $(x_i)_{i=1,\ldots,n}$ are i.i.d. samples of a standard Gaussian distribution, and $(y_i)_{i=1,\ldots,m}$ are obtained through $y_i = \phi'(x_i)$ for $\phi(x) = |x|^{1+\epsilon}$. Since $\phi$ is convex, $\phi'$ is the optimal map for the quadratic Wasserstein problem.

In a first set of experiments, we keep the number of samples constant at $n = m = 500$, and we vary the number of intermediary steps $K$ in the global algorithm, ranging through $K = 1, 2, 3, 5,$
10. In asecond set of experiments, we keep the number of intermediary steps in the global algorithm constantat K = 10, and vary the number of sample points, using n = m = 25 , , , , ∇ φ exp by (2.7), and compare it to the optimal ∇ φ ∗ defined by: ∇ φ ∗ ( x ) = (1 + (cid:15) ) x | x | (cid:15) − . Figure 5: Interpolants given by Algorithm 2 from a Gaussian to an annulus, in 2D. The top left figure(red) corresponds to the original sample. Time flows from left to right, and from top to bottom.Subsequently represented are the interpolants at time t = k/ k = 1 , . . . , t = k/ k = 1 , . . . , L norm (cid:90) |∇ φ exp ( x ) − ∇ φ ∗ ( x ) | µ ( x ) dx ≈ (cid:80) i |∇ φ exp ( x i ) − ∇ φ ∗ ( x i ) | ,2. The L ∞ norm between ∇ φ exp and ∇ φ ∗ ,For illustrative purposes, we show in Figure 8 the differences between ∇ φ exp and ∇ φ ∗ for varioussets of parameters. Tables 1 and 2 summarize the results. K = 1 2 3 5 10 E [ |∇ φ ∗ ( X ) − ∇ φ exp ( X ) | ] 0.74 0.55 8.3 · − · − · − ||∇ φ ∗ − ∇ φ exp || L ∞ · − · − · − Table 1: Convergence as a function of the number K of intermediary steps21 = m = 25 50 100 200 500 E [ |∇ φ ∗ ( X ) − ∇ φ exp ( X ) | ] 1.4 0.35 7.1 · − · − · − ||∇ φ ∗ − ∇ φ exp || L ∞ · − Table 2: Convergence as a function of the number of samples n In practice, setting a number of samples less than 15 in this example leads to poor convergencedue to the extreme sparsity of data. K = 1 K = 3 K = 5 K = 10Figure 8: Comparison between ∇ φ ∗ (blue) and ∇ φ exp (orange) for different values of intermediarysteps K .Figure 8 compares the optimal map ∇ φ ∗ with the computed map ∇ φ exp . Note that the one stepalgorithm does not provide a monotone solution, i.e. it is not the gradient of a convex function:the source and target distributions are not close enough to guaranty that. This is corrected throughthe introduction of intermediate steps, which brings the source and target distributions for each stepcloser to each other via displacement interpolation. 
For the example under consideration, the computed solution is convex for any value of $K$ bigger than 4. Notice also that, for $K = 10$ and $n = 500$, the solution approximates the exact one very accurately in the bulk of the distribution, as captured by the density-weighted $L^2$ norm of their difference. On the other hand, the $L^\infty$ norm is dominated by the behavior at the tails, where little data is present to guide the algorithm.

We have developed an adaptive methodology for the sample-based optimal transport problem under the standard quadratic cost function. The main advantage of the new procedure is that it does not require any external input on the form of the distributions that one seeks to match, or any expert knowledge on the type, location and size of the features in which the source and target distributions may differ.

Even though the map $\nabla\phi$ and test function $g$ used at each step are parametric, by using the composition of many simple maps and having at one's disposal a "lens" within $g$ that can focus on any individual local mismatch at each step, the resulting procedure can be thought of as effectively free of parameters, except for the number of intermediate distributions to use, a stopping criterion, and a couple of constants associated with the penalization of the nonlinear features. Thus, it has the potential to form the basis for a universal tool that can be transferred painlessly across fields.

Two main ingredients allow the procedure to capture arbitrary variability without making use of a huge dictionary of candidate features (in its current version, it uses only three: a linear feature for global displacements, a quadratic feature for global scalings, and a Gaussian feature for localized displacements). One ingredient, borrowed from prior work in [7], is the factorization of the potentially quite complex global map into a sequence of much simpler local maps between nearby distributions.
The optimality of the composed map is guaranteed through the use of displacement interpolation. The second ingredient is the formulation of the local problem as a two-player game, where the first player seeks to push forward one distribution into the other, while the second player develops features that show where the push-forward condition fails. The variational characterization of the relative entropy between distributions that gives rise to this game-theory formulation has the additional advantage of being sample-friendly, as it involves the two distributions only through the expected values of functions, which can be naturally replaced by empirical means. Because the map between any two consecutive distributions is close to the identity, local optimality is guaranteed by requiring this map to be the gradient of a potential.

Topics for future research include the extension of the algorithm to transportation costs different from the squared distance and, for the purpose of more efficient computability, the optimization of the minimax solver and the parallelization of the computation of the local maps. Most of all, we believe, the use of the new methodology in real applications will shed light on the issues that require further work, which may include the development of features and penalizations suitable for efficiently capturing sharp edges or removed objects.

Acknowledgments
The authors would like to thank Yongxin Chen for connecting our variational formulation of the Kullback-Leibler divergence with the Donsker-Varadhan formula. This work has been partially supported by a grant from the Moore-Sloan Foundation. The work of Tabak was partially supported by NSF grant DMS-1715753 and ONR grant N00014-15-1-2355. The work of Essid was partially supported by NSF grant DMS-1311833.
References

[1] Chen, Y., Georgiou, T.T., Pavon, M.: On the relation between optimal transport and Schrödinger bridges: a stochastic control viewpoint. Journal of Optimization Theory and Applications (2), 671–691 (2016)
[2] Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)
[3] Donsker, M.D., Varadhan, S.S.: Asymptotic evaluation of certain Markov process expectations for large time, I–IV. Communications on Pure and Applied Mathematics (1), 1–47 (1975)
[4] Essid, M., Trigila, G., Tabak, E.G.: A minimax algorithm. In preparation
[5] Frogner, C., Zhang, C., Mobahi, H., Araya-Polo, M., Poggio, T.A.: Learning with a Wasserstein loss. CoRR abs/1506.05439 (2015). URL http://arxiv.org/abs/1506.05439
[6] Kantorovich, L.V.: On the translocation of masses. Compt. Rend. Akad. Sci., 199–201 (1942)
[7] Kuang, M., Tabak, E.G.: Sample-based optimal transport and barycenter problems. Comm. Pure Appl. Math., submitted (2017)
[8] Léonard, C.: From the Schrödinger problem to the Monge–Kantorovich problem. Journal of Functional Analysis (4), 1879–1920 (2012)
[9] McCann, R.J.: A convexity principle for interacting gases. Advances in Mathematics (1), 153–179 (1997)
[10] Monge, G.: Mémoire sur la théorie des déblais et des remblais. De l'Imprimerie Royale (1781)
[11] Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory (11), 5847–5861 (2010)
[12] Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems, pp. 271–279 (2016)
[13] Pavon, M.: A variational derivation of a class of BFGS-like methods. arXiv preprint arXiv:1712.00680 (2017)
[14] Peyré, G., Cuturi, M.: Computational optimal transport. arXiv preprint arXiv:1803.00567 (2018)
[15] Rockafellar, R.T.: Convex Analysis.
Princeton Mathematical Series. Princeton University Press, Princeton, N.J. (1970)
[16] Sheather, S.J., Jones, M.C.: A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), pp. 683–690 (1991)
[17] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Routledge (2018)
[18] Suzuki, T., Sugiyama, M., Sese, J., Kanamori, T.: Approximating mutual information by maximum likelihood density ratio estimation. In: New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pp. 5–20 (2008)
[19] Tabak, E., Trigila, G.: Conditional expectation estimation through attributable components. Information and Inference (2017). DOI 10.1093/imaiai/drn000
[20] Tabak, E., Trigila, G.: Explanation of variability and removal of confounding factors from data through optimal transport. Communications on Pure and Applied Mathematics (1), 163–199 (2018)
[21] Villani, C.: Topics in Optimal Transportation. Graduate Studies in Mathematics 58. American Mathematical Society (2003)
[22] Wright, S., Nocedal, J.: Numerical Optimization. Springer Science & Business Media