Entropic Optimal Transport: Geometry and Large Deviations
Espen Bernton, Promit Ghosal, Marcel Nutz
arXiv preprint [math.OC], February 9, 2021
Abstract
We study the convergence of entropically regularized optimal transport to optimal transport. The main result is concerned with the convergence of the associated optimizers and takes the form of a large deviations principle quantifying the local exponential convergence rate as the regularization parameter vanishes. The exact rate function is determined in a general setting and linked to the Kantorovich potential of optimal transport. Our arguments are based on the geometry of the optimizers and inspired by the use of c-cyclical monotonicity in classical transport theory. The results can also be phrased in terms of Schrödinger bridges.

Keywords
Optimal Transport; Entropic Penalization; Schrödinger Bridge; Large Deviations; Cyclical Invariance
AMS 2010 Subject Classification
1 Introduction

Over the last three decades, optimal transport theory has flourished due to its connections with geometry, analysis, probability theory, and other fields in mathematics; see for instance [47, 48, 52]. Following computational advances which have enabled high-dimensional applications, a renewed interest comes from applied fields such as machine learning, image processing and statistics. Popularized in this area by Cuturi [18], entropic regularization is a leading approach: the entropic optimal transport problem provides an approximate optimal transport when solved for small regularization parameter ε > 0 while admitting much more efficient algorithms than the unregularized problem, in addition to having other desirable properties. We defer the discussion of related literature to Section 1.1 below and proceed with a synopsis of the present study.

Given a continuous cost function c : X × Y → R₊ on Polish probability spaces (X, µ) and (Y, ν), we consider the entropic optimal transport problem

    inf_{π ∈ Π(µ,ν)} ∫_{X×Y} c dπ + εH(π | µ ⊗ ν)    (1.1)

where Π(µ, ν) is the set of couplings and H(· | µ ⊗ ν) denotes relative entropy (or Kullback–Leibler divergence) with respect to the product of the marginals. The constant ε > 0 acts as a regularization parameter; ε = 0 recovers the (unregularized) optimal transport problem.

∗ The authors are grateful to Julio Backhoff-Veraguas, Guillaume Carlier, Wilfried Gangbo and Jon Niles-Weed for insightful discussions that greatly contributed to this research.
† Department of Statistics, Columbia University, [email protected].
‡ Department of Mathematics, Massachusetts Institute of Technology, [email protected].
§ Departments of Statistics and Mathematics, Columbia University, [email protected]. Research supported by an Alfred P. Sloan Fellowship and NSF Grant DMS-1812661.
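The structure of (1.1) can be seen already in the smallest possible example. The following sketch is not from the paper: it assumes two-point uniform marginals and the 0-1 cost matrix, so that a coupling is parameterized by a single number t, and compares a grid search for the minimizer of (1.1) with the closed form t(ε) = 1/(2(1 + e^{−1/ε})) obtained from the first-order condition. The off-diagonal mass then decays like e^{−1/ε}, a first glimpse of the exponential rates studied below.

```python
import numpy as np

# Two-point marginals mu = nu = (1/2, 1/2); cost C[0,0] = C[1,1] = 0, C[0,1] = C[1,0] = 1.
# A coupling is [[t, 1/2 - t], [1/2 - t, t]] for t in (0, 1/2).
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = np.full((2, 2), 0.25)                     # product of the marginals

def objective(t, eps):
    # the entropic objective (1.1) restricted to this one-parameter family
    pi = np.array([[t, 0.5 - t], [0.5 - t, t]])
    return (C * pi).sum() + eps * (pi * np.log(pi / P)).sum()

def t_star(eps):
    # first-order condition: -2 + 2*eps*log(t/(0.5 - t)) = 0
    return 0.5 / (1.0 + np.exp(-1.0 / eps))

eps = 0.5
ts = np.linspace(0.001, 0.499, 499)
t_grid = ts[np.argmin([objective(t, eps) for t in ts])]
assert abs(t_grid - t_star(eps)) < 2e-3       # grid search agrees with closed form

# the off-diagonal mass decays exponentially: eps * log(mass) -> -1
mass = 2 * (0.5 - t_star(0.1))
assert abs(0.1 * np.log(mass) + 1.0) < 1e-3
```

The design point here is that the entropy term only shifts the minimizer by an exponentially small amount, which is exactly the local phenomenon the paper quantifies.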
Under mild conditions detailed in Sections 2 and 3, respectively, the entropic optimal transport problem admits a unique solution π_ε ∈ Π(µ, ν) and π_ε converges weakly to a solution π∗ of the unregularized problem. Our main interest is to quantify the speed of this convergence π_ε → π∗.

For finite-dimensional linear programs, including optimal transport problems with marginals supported by finite sets, the solution of the entropic regularization is known to converge exponentially fast to a solution of the original problem (in total variation, say). In transport problems with continuous marginals, the situation is quite different even in the most regular examples. For Gaussian marginals on R and quadratic cost c(x, y) = |x − y|²/2, direct computation shows that π_ε is Gaussian and π∗ is given by a linear transport (Monge) map T. One finds that the transport cost converges only linearly, ∫ c dπ_ε − ∫ c dπ∗ = ε/2 + o(ε). The culprit for this slowdown is easily spotted by inspecting the closed-form solution: the leading term in the cost difference stems from the mass π_ε places at a distance of approximately √ε to the support Γ of π∗ (that is, the graph of T). See Section 1.1 for further discussion on the asymptotics of transport costs and value functions, which have been the main focus of the extant literature on the convergence as ε → 0.

In the present study, we adopt a different, more local perspective, from which the Gaussian example is actually encouraging: the density of π_ε decays exponentially away from Γ. Indeed, it is proportional to e^{−α|y−T(x)|²/ε}, where α > 0 is the quotient of the marginal variances.

The main result of this paper is a comparable statement in a remarkably general setting; it takes the form of a large deviations principle. We define a function I(x, y) through the following optimization. In addition to the given point (x, y) =: (x₁, y₁), choose finitely many points (x₂, y₂), …
, (x_k, y_k) from the support Γ of the limiting optimal transport π∗, as well as a permutation σ ∈ Σ(k). Then, consider the difference

    ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{σ(i)})    (1.2)

between the pointwise transport costs from x_i to y_i and the costs for the permuted destinations y_{σ(i)}. The optimization is to maximize this difference, and we define I(x, y) as the supremum value of (1.2) over all choices of points and permutations. For (x, y) ∈ Γ, the optimality of π∗ implies that I(x, y) = 0, because Γ is c-cyclically monotone. But outside Γ, we may typically expect that I(x, y) > 0. Part (a) of our theorem below shows that I is a lower bound for the rate function in the general Polish setting. The matching upper bound (b) necessitates a condition on the optimal transport problem that is being approximated, but one that still holds for the majority of continuous or semi-discrete transport problems of interest. We mainly discuss the uniqueness of Kantorovich potentials (Assumption 4.4) as a sufficient condition; it also gives rise to an insightful representation of I as the difference between the cost c(x, y) and the solution of the dual optimal transport problem. An alternative condition imposing regularity of the optimal transport (Assumption 4.9) is also considered. The main result reads as follows.

Theorem 1.1.
Let Γ = spt π∗ where π∗ = lim_{ε→0} π_ε is the limiting optimal transport and define I : X × Y → [0, ∞] by (4.3).

(a) For any compact set C ⊂ X × Y,

    lim sup_{ε→0} ε log π_ε(C) ≤ − inf_{(x,y)∈C} I(x, y).

(b) Let Assumption 4.4 or Assumption 4.9 hold, and consider the sets X₀ = proj_X Γ and Y₀ = proj_Y Γ of full marginal measure. For any open set U ⊂ X₀ × Y₀,

    lim inf_{ε→0} ε log π_ε(U) ≥ − inf_{(x,y)∈U} I(x, y).

The theorem shows in particular that the rate depends (only) on the geometry of π∗, which does not seem to be clear a priori. We mention that our result can also be stated in terms of (static) Schrödinger bridges. In this context, it is a large deviations principle for the small-noise (or small-time) limit; cf. Section 1.1.

For finitely supported marginals, the density of π_ε converges exponentially for any cost function; that is, the rate function is strictly positive outside Γ. We shall see that the analogue may fail in the continuous case. Rather, positivity depends on the geometry of the cost. The twist condition (injectivity of ∇_x c(x, ·)) plays an important role, as in many results on optimal transport. We include affirmative positivity results in particular for quadratic costs, which is the most important case for applications. While not pursued in the present paper, our results should also be useful to derive detailed quantitative bounds on the rate in more specific settings. We may also hope to gain insights into how the rate depends on the dimension.

Geometry is a cornerstone in the now-classical theory of optimal transport, where optimality is captured geometrically by the c-cyclical monotonicity of a transport's support. Defined by comparing costs at finitely many points, it yields a powerful tool to derive fundamental results such as stability of optimal transports under weak limits or existence of dual potentials. We are not aware of a comparable technique in the literature on entropic optimal transport (or on Schrödinger bridges).
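Returning to the Gaussian example above: for standard Gaussian marginals N(0,1) and cost c(x, y) = |x − y|²/2, restricting (1.1) to Gaussian couplings with correlation ρ gives the objective (1 − ρ) − (ε/2) log(1 − ρ²), with first-order condition ρ² + ερ − 1 = 0. The following sketch (not from the paper; the standard-Gaussian normalization is an assumption made here for simplicity) checks that condition and the linear cost gap numerically.

```python
import math

def rho(eps):
    # positive root of rho^2 + eps*rho - 1 = 0, the first-order condition for
    # minimizing (1 - rho) - (eps/2) * log(1 - rho^2) over rho in (0, 1)
    return (-eps + math.sqrt(eps * eps + 4.0)) / 2.0

for eps in (1e-2, 1e-3, 1e-4):
    r = rho(eps)
    assert abs((1.0 - r * r) - eps * r) < 1e-12   # optimality condition 1 - rho^2 = eps*rho
    gap = 1.0 - r                                 # transport-cost gap relative to pi*
    print(eps, gap / eps)                         # the ratio tends to 1/2
```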
In this paper, we exploit a cyclical invariance property satisfied by the density of π_ε. The invariance itself can be understood as a reformulation of a classical characterization of π_ε through the solution of the dual problem, the Schrödinger potentials. The novelty here lies in exploiting the geometric aspect and working on the primal side, following the spirit of c-cyclical monotonicity. As in classical optimal transport, the arguments are remarkably simple and general once the correct notions are in place. Our technique is a departure from the control-theoretic methods in the related literature. Case in point: the geometric proof that a weak limit π = lim_{ε→0} π_ε is an optimal transport (cf. Proposition 3.2) is nearly trivial compared to the Gamma-convergence technique, even in the general Polish context.

We also emphasize another benefit which may illustrate that cyclical invariance is in fact more than just a reformulation of control theory or convex analysis: the geometry singles out a unique coupling π_ε even if the value function (1.1) is infinite and hence the usual notion of solution as a minimizer is not meaningful. This is crucial for instance if costs are quadratic but one of the marginal distributions does not have a finite second moment. Our arguments for the large deviations result apply in that setting without any added difficulty, paralleling the geometric insights in classical optimal transport. (On the other hand, the existence of π_ε in the case of infinite value functions is not immediate. It will be reported separately, together with a stability theorem, using the same geometric standpoint.) Indeed, we expect the technique to be useful in several other aspects of entropic optimal transport and Schrödinger bridges, and thus the technique may be as important a contribution as the main theorem.

The present paper is organized as follows.
After reviewing motivations for our research and related literature in the remainder of this Introduction, Section 2 details the basic definitions and introduces cyclical invariance. In Section 3, this notion is used to prove that cluster points of π_ε as ε → 0 have c-cyclically monotone support, hence are optimal transports. The main result on large deviations is obtained in Section 4: part (a) of Theorem 1.1 is stated as Corollary 4.3 whereas (b) is split into Corollaries 4.7 and 4.12, each covering one of the two alternative assumptions. Section 5 gives examples of settings where the rate function I is strictly positive outside the support Γ, with a focus on quadratic costs. Appendix A contains facts about Schrödinger bridges and a derivation of the cyclical invariance property. In Appendix B, we detail two general settings where Assumption 4.4 on the uniqueness of Kantorovich potentials is satisfied. Finally, Appendix C shows how to translate the results on the positivity of I in Section 5 from quadratic costs to more general cost functions by means of c-convex analysis.

1.1 Related Literature

In the literature on finite-dimensional linear programs and their entropic regularization, the early work [14] contains a very detailed study of primal and dual convergence, expansion of the value function, and characterizations of the rates. Their setting includes discrete optimal transport problems with marginals supported by finitely many points, and in that case the pointwise results in [14] certainly include the large deviations result for ε → 0. On the other hand, our main theorem is most relevant when at least one marginal support is connected, hence is complementary to the discrete case. More recently, [53] proved an exponential convergence bound for finite-dimensional linear programs. While the bound is not sharp in a pointwise sense, the result is non-asymptotic; i.e., it holds for all ε below a known threshold.
Moreover, the constants are known in terms of the data, which provided valuable intuition for our construction of the rate function I. One may also observe how the constants in [53] blow up as the cardinality of the support increases.

In the last decade, optimal transport has found myriad applications in machine learning, statistics, image processing, language processing, and other areas. The literature in the computational area has expanded very quickly and our account is highly incomplete; see [45] for a recent monograph with extensive references. Exact computation of an optimal transport between marginals with n atoms costs O(n³ log n), prohibitive for modern applications with large data sets. The recent success of applied optimal transport is enabled by the advent of fast approximate solvers, and entropic regularization is among the most influential schemes for high-dimensional problems. Popularized by Cuturi [18] in this domain, it allows for the application of Sinkhorn's algorithm (also called iterative proportional fitting, and also due to Fortet, Knopp and others) where each iteration is a matrix-vector multiplication costing O(n²). Importantly for modern applications, it is highly parallelizable on GPUs; a number of further advantages are highlighted in [7]. The convergence of this algorithm was rigorously discussed in [30, 50], among others. More recently, it was shown that δ-accurate approximations of the transport cost can be obtained in Õ(n²/δ) operations via entropic regularization; cf. [8, 36] and the references therein. In addition to computation accuracy, a second error in practice stems from sampling the marginals. For entropic optimal transport (with ε > 0 fixed), the rate of convergence of the empirical cost towards its population limit does not depend on the dimension, in contrast to the curse of dimensionality suffered by its unregularized counterpart [28, 40].
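The per-iteration cost of Sinkhorn's algorithm is easy to see in code. The following is a minimal sketch for discrete uniform marginals (an illustration with arbitrary data, not the paper's implementation): each iteration performs two O(n²) matrix-vector products with the Gibbs kernel.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=5000):
    """Entropic optimal transport between discrete marginals a and b with cost
    matrix C; each iteration is two O(n^2) matrix-vector products."""
    K = np.exp(-C / eps)                      # Gibbs kernel, cf. (2.5) up to normalization
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]        # pi_eps = diag(u) K diag(v)

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(size=n))
y = np.sort(rng.uniform(size=n))
C = 0.5 * (x[:, None] - y[None, :]) ** 2      # quadratic cost on illustrative points
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)
pi = sinkhorn(a, b, C, eps=0.1)
assert np.abs(pi.sum(axis=0) - b).max() < 1e-12   # column marginals exact after v-step
assert np.abs(pi.sum(axis=1) - a).max() < 1e-5    # row marginals after convergence
```

In practice one works with log-domain updates for small ε to avoid underflow in the kernel; the plain scaling form above is kept for readability.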
In the present study, we focus on a third source of error, the discrepancy between the entropic optimizer π_ε and the optimal transport π∗, and adopt a local point of view.

Continuing with a different branch of related literature, recall that entropic optimal transport can also be phrased as the (static) Schrödinger bridge problem. Informally stated, consider a system of diffusing particles from time t₀ to t₁ in thermal equilibrium, and a given joint "reference" law R for its configuration at those times. If marginals (µ, ν) differing from the ones of R are observed, what is the most likely evolution (joint law with marginals µ, ν) of the system conditional on R? Schrödinger's answer amounts to π∗ = arg min_{Π(µ,ν)} H(· | R); see [25, 33] for extensive surveys including historical accounts. (This is the static formulation. Given the origins in physics, it is natural that much of the literature focuses on the dynamic Schrödinger bridge problem, which asks for the dynamic evolution of the particle system over time t ∈ [t₀, t₁]. The static problem is recovered by projecting to the marginals.)

The minimization of H(· | R) over Π(µ, ν) coincides with the entropic optimal transport problem (1.1) if we introduce the cost function c := −ε log(α⁻¹ dR/d(µ ⊗ ν)), where the parameter ε > 0 is arbitrary and α is a normalizing constant (we tacitly assume that R ∼ µ ⊗ ν). Conversely, taking (1.1) as the starting point, defining R(ε) by dR(ε)/d(µ ⊗ ν) = αe^{−c/ε} yields the associated Schrödinger bridge problem. Assuming for simplicity that {c = 0} is the graph of a function f : X → Y, Theorem 1.1 is then a large deviations principle as the reference measure R(ε) degenerates to a deterministic coupling (meaning that a particle with given origin x travels to the predetermined destination f(x)). This is also called the small-noise or small-time limit. While not pursued here, it seems plausible that a similar principle could be established for more general sequences R(ε).
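The equivalence in the last paragraph rests on the algebraic identity ∫ c dπ + εH(π | P) = εH(π | R) + ε log α for every π with positive density, so the two problems share their minimizers. A discrete sanity check (not from the paper; the marginals, cost and couplings are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, eps = 6, 0.3
C = rng.uniform(size=(n, n))
a = np.full(n, 1.0 / n)                       # equal uniform marginals mu = nu
P = np.outer(a, a)                            # product measure P = mu x nu
alpha = 1.0 / (np.exp(-C / eps) * P).sum()    # normalizing constant
R = alpha * np.exp(-C / eps) * P              # Gibbs reference measure, cf. (2.5)

def H(p, q):
    # relative entropy H(p | q) for strictly positive p, q
    return (p * np.log(p / q)).sum()

D = np.diag(a)                                # the "diagonal" coupling of (mu, mu)
couplings = [P, 0.9 * P + 0.1 * D, 0.5 * P + 0.5 * D]   # all have marginals (a, a)
vals = [(C * pi).sum() + eps * H(pi, P) - eps * H(pi, R) for pi in couplings]
assert np.allclose(vals, eps * np.log(alpha))            # same constant for every pi
```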
From the point of view of Schrödinger bridges, another interesting follow-up question is whether a comparable large deviations result can be stated for the dynamic problem on path space. (Schrödinger's ideas about the "most likely evolution" are usually presented as a large deviations result in the modern literature. That result is very different from the one just discussed.)

Mikami [41, 42] first highlighted the connection between Schrödinger equations and optimal transport in the small-noise limit. Léonard studied Schrödinger bridges in a series of works starting with [31, 32]; see [34] for further references. In [33], he established convergence of the value function to an optimal transport problem in the sense of Gamma-convergence for a general formulation of the problem. More recently, [15, 44] study the limit in specific settings and determine higher-order terms in the expansion of the Schrödinger (or entropic) value function around the optimal transport cost. These works complement earlier results of [1, 23, 24] showing that the large deviation rate function for the empirical distribution of independent Brownian particles with drift is asymptotically equivalent to the Jordan–Kinderlehrer–Otto functional arising in the Wasserstein gradient flow. We mention that [15] also considers the large-time limit (corresponding to ε → ∞); cf. [13] for recent developments. The setup in [44] is closest to ours in that the entropic penalty and the limit ε → 0 are formulated in the same way, whereas the literature on Schrödinger bridges often formulates the zero-noise limit through a vanishing Laplacian. We also mention [11] where a very accessible proof of the Gamma-convergence is presented for quadratic costs.

While the focus of the aforementioned works is on value functions and global quantities, the present study focuses on the local geometry and convergence. The value functions are not used at all, and so it is quite natural that the results hold even when costs are infinite. We are not aware of a large deviations principle similar to ours in the extant literature.
One concrete example where these aspects are of interest are the multidimensional ranks and quantiles that have been introduced in statistics to extend the usual scalar notions and familiar nonparametric tests; see [12, 20, 21, 29]. Here Brenier's map is fundamental, but as in the scalar case, moment conditions are not natural. McCann's geometric extension [39] of Brenier's map (see also [52, pp. 249–258]) can be used to provide a definition irrespective of the finiteness of the value function. Unlike their scalar counterparts, the ranks defined through optimal transport are computationally expensive. Entropic optimal transport resolves that issue and provides an approximate Brenier map. Leveraging this idea, a notion of "differentiable ranks" based on entropic optimal transport was recently proposed in [19]. We expect that our results can be used to study the local deviations of these differentiable ranks from the unregularized ones.

Related to our technique in a broader sense, there have been recent works successfully using ideas of c-cyclical monotonicity outside the setting of classical optimal transport. Examples include martingale optimal transport [5] and optimal Skorokhod embeddings [4, 6]. Finally, we mention the intriguing "optimal entropy-transport problem" studied in [35]. Here, the usual optimal transport problem is relaxed in that the marginal constraints are replaced by an entropic penalty relative to a given pair of measures. While similar in name, this problem is quite different from ours, where the marginal constraints are strictly enforced and the entropy of the joint distribution is used as penalty.

2 Cyclical Invariance

Let (X, µ) and (Y, ν) be Polish probability spaces endowed with their Borel σ-fields and let c : X × Y → R₊ be a measurable (cost) function.
The associated optimal transport problem is

    inf_{π ∈ Π(µ,ν)} ∫_{X×Y} c dπ    (2.1)

where Π(µ, ν) is the set of all couplings; that is, probability measures π on X × Y with marginals µ = (proj_X)_# π and ν = (proj_Y)_# π. Given a constant ε > 0, the entropic optimal transport problem is

    inf_{π ∈ Π(µ,ν)} ∫_{X×Y} c dπ + εH(π | P),  P := µ ⊗ ν,    (2.2)

where H denotes the relative entropy or Kullback–Leibler divergence,

    H(π | P) := ∫ log(dπ/dP) dπ if π ≪ P, and H(π | P) := ∞ otherwise.

As detailed in Proposition A.1 of Appendix A, this problem admits a unique minimizer π_ε whenever the value (2.2) is finite; moreover, π_ε is equivalent to P as soon as

    there exists π ∈ Π(µ, ν) with π ∼ P and ∫ c dπ + H(π | P) < ∞.    (2.3)

Definition 2.1.
A coupling π ∈ Π(µ, ν) is called (c, ε)-cyclically invariant if π ∼ P and its density admits a version dπ/dP : X × Y → (0, ∞) such that

    ∏_{i=1}^k (dπ/dP)(x_i, y_i) = exp( −(1/ε) [ ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{i+1}) ] ) ∏_{i=1}^k (dπ/dP)(x_i, y_{i+1})    (2.4)

for all k ∈ N and (x_i, y_i)_{i=1}^k ⊂ X × Y, where y_{k+1} := y₁.

We omit the qualifier (c, ε) when there is no ambiguity. Cyclical invariance can be phrased more succinctly using the auxiliary reference measure R = R(ε) defined by the Gibbs kernel

    dR/dP = αe^{−c/ε},    (2.5)

where α = (∫ e^{−c/ε} dP)⁻¹ is the normalizing constant. As R ∼ P, we can state (2.4) as

    ∏_{i=1}^k (dπ/dR)(x_i, y_i) = ∏_{i=1}^k (dπ/dR)(x_i, y_{i+1}).    (2.6)

This condition, in turn, is related to a multiplicative decomposition of the density dπ/dR; cf. Appendix A. For our analysis of the limit ε → 0, the less elegant definition (2.4) will be the more useful one, as it makes explicit the role of ε and links directly to the c-cyclical monotonicity condition of optimal transport.

Proposition 2.2. (a) There is at most one (c, ε)-cyclically invariant coupling π ∈ Π(µ, ν).

(b) Let (2.3) hold. Then π ∈ Π(µ, ν) is (c, ε)-cyclically invariant if and only if it minimizes (2.2). Moreover, there exists a unique such coupling.

The proof is detailed in Appendix A. Under Condition (2.3), Proposition 2.2 shows the equivalence between minimality and cyclical invariance. The notion of minimality is meaningful only when the value function (2.2) is finite: otherwise all couplings have infinite cost.
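Definition 2.1 can be observed directly in a discrete example. The identity (2.4) holds exactly for any coupling of the form diag(u) K diag(v) with K the Gibbs kernel, in particular for the output of the Sinkhorn scaling; the sketch below (illustrative random data, not from the paper) checks it on random cycles.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 20, 0.1
C = rng.uniform(size=(n, n))                  # cost values c(x_i, y_j)
a = np.full(n, 1.0 / n)                       # uniform marginals; P = a x a
K = np.exp(-C / eps)                          # Gibbs kernel, cf. (2.5)
u = np.ones(n)
v = np.ones(n)
for _ in range(500):                          # Sinkhorn scaling toward Pi(a, a)
    u = a / (K @ v)
    v = a / (K.T @ u)
dens = (u[:, None] * K * v[None, :]) / np.outer(a, a)    # a version of dpi/dP > 0

for _ in range(100):                          # check (2.4) on random cycles
    k = int(rng.integers(2, 6))
    ii = rng.integers(0, n, size=k)           # x_1, ..., x_k
    jj = rng.integers(0, n, size=k)           # y_1, ..., y_k
    jn = np.roll(jj, -1)                      # y_{i+1}, with y_{k+1} = y_1
    lhs = np.prod(dens[ii, jj])
    rhs = np.exp(-(C[ii, jj].sum() - C[ii, jn].sum()) / eps) * np.prod(dens[ii, jn])
    assert abs(lhs - rhs) <= 1e-8 * max(lhs, rhs)
```

The check passes even before the scaling converges, which illustrates that cyclical invariance is a property of the multiplicative structure of the density, not of the marginal constraints.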
By contrast, we show in forthcoming work that the notion of cyclical invariance remains meaningful in this context of infinite costs: existence and uniqueness hold under mild regularity conditions; e.g., X, Y are Euclidean spaces and c is continuous. In the remainder of this paper, we simply assume that a (c, ε)-cyclically invariant coupling π_ε ∈ Π(µ, ν) exists for every ε > 0, rather than imposing Condition (2.3) as in much of the literature. One reason is that this condition precludes some applications of interest to us. In any event, the arguments in this paper do not simplify if (2.3) is assumed.

3 Convergence as ε → 0

Denote by π_ε the unique (c, ε)-cyclically invariant coupling. In this section we show that cluster points of π_ε as ε → 0 have c-cyclically monotone support. The estimates leading to that conclusion are obtained by simply integrating the cyclical invariance condition.

Lemma 3.1. Let k ∈ N and 0 ≤ δ ≤ δ′ ≤ ∞. Define

    A_k(δ, δ′) := { (x_i, y_i)_{i=1}^k ∈ (X × Y)^k : δ ≤ ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{i+1}) ≤ δ′ }

and let A ⊂ A_k(δ, δ′) be Borel. Then π_ε^k := ∏_{i=1}^k π_ε(dx_i, dy_i) satisfies

    π_ε^k(A) ≤ e^{−δ/ε} for all ε > 0.    (3.1)

Suppose in addition that Ā := { (x_i, y_{i+1})_{i=1}^k : (x_i, y_i)_{i=1}^k ∈ A } satisfies lim inf_{ε→0} ε log π_ε^k(Ā) = 0. Then

    lim inf_{ε→0} ε log π_ε^k(A) ≥ −δ′.    (3.2)

Proof.
Set Z = dπ_ε/dP. Using (2.4), we have for P^k-a.e. (x_i, y_i)_{i=1}^k ∈ A that

    ∏ Z(x_i, y_i) = exp{ −ε⁻¹ [ ∑ c(x_i, y_i) − ∑ c(x_i, y_{i+1}) ] } ∏ Z(x_i, y_{i+1}) ≤ e^{−δ/ε} ∏ Z(x_i, y_{i+1}).

Integrating over A with respect to P^k = ∏ P(dx_i, dy_i) = ∏ P(dx_i, dy_{i+1}) yields π_ε^k(A) ≤ e^{−δ/ε} π_ε^k(Ā) ≤ e^{−δ/ε}, which is (3.1). Analogously, π_ε^k(A) ≥ e^{−δ′/ε} π_ε^k(Ā) and hence ε log π_ε^k(A) ≥ −δ′ + ε log π_ε^k(Ā), so that (3.2) follows under the stated condition on Ā.

In all that follows, probability measures are considered with weak convergence; i.e., the topology induced by bounded continuous functions. We recall that Π(µ, ν) is weakly compact; cf. [52, p. 45]. As a consequence, any sequence of couplings admits at least one cluster point, and any cluster point is a coupling. A set Γ ⊂ X × Y is called c-cyclically monotone if ∑_{i=1}^k c(x_i, y_i) ≤ ∑_{i=1}^k c(x_i, y_{i+1}) for all k ≥ 1 and (x_i, y_i) ∈ Γ, 1 ≤ i ≤ k, where y_{k+1} := y₁.

Proposition 3.2.
Let c be continuous and let π be a cluster point of (π_ε) as ε → 0. Then spt π is c-cyclically monotone, hence π is an optimal transport as soon as the optimal transport problem (2.1) is finite. If (2.1) admits a unique c-cyclically monotone coupling π∗ ∈ Π(µ, ν), then π_ε → π∗ as ε → 0.

Proof. Let ε_n → 0 and π_{ε_n} → π. Suppose for contradiction that there are (x_i, y_i) ∈ spt π, 1 ≤ i ≤ k, with ∑_i c(x_i, y_i) > ∑_i c(x_i, y_{i+1}). By continuity there exist δ > 0 and open neighborhoods U_i ∋ (x_i, y_i) such that ∑_i c(x̃_i, ỹ_i) ≥ δ + ∑_i c(x̃_i, ỹ_{i+1}) for all (x̃_i, ỹ_i) ∈ U_i. Moreover, π(U_i) > 0 and hence lim inf_n π_{ε_n}(U_i) > 0. On the other hand, U₁ × ⋯ × U_k ⊂ A_k(δ, ∞) implies π_{ε_n}^k(U₁ × ⋯ × U_k) → 0 by Lemma 3.1, a contradiction. This shows that spt π is c-cyclically monotone. It is well known that cyclical monotonicity and optimality are equivalent when (2.1) is finite; cf. [52, Theorem 5.10, p. 57]. As Π(µ, ν) is compact, π_ε must have cluster points as ε → 0, so that uniqueness implies convergence.

Remark 3.3.
For the particular case of quadratic cost on R^d and marginals satisfying certain integrability conditions, the conclusion of Proposition 3.2 is obtained in [11] by (arguably much more involved) Gamma-convergence arguments. That line of argument focuses on the properties of the value function, hence cannot be applied when the value function is infinite. A related but slightly different convergence result, also obtained by Gamma-convergence, is stated in [33, Theorem 2.4] and includes lower semicontinuous cost functions. On the other hand, the convergence in Proposition 3.2 may fail if continuity is relaxed to lower semicontinuity: one example, discussed in more detail in [43, Remark 4.3], is c(x, y) = 1_{{x = y}} and µ = ν = Unif[0, 1].

Uniqueness of c-cyclically monotone transports is known for many examples of continuous or semi-discrete optimal transport problems, arguably for most of the important examples except distance costs, and then Proposition 3.2 shows the convergence of π_ε as ε → 0. See, e.g., [52, Theorem 5.30, p. 84]. When the transport problem admits multiple solutions, it is not obvious whether π_ε converges. If there exists an optimal transport π with H(π | P) < ∞, one can show that π_ε converges to the unique optimal transport π∗ with minimal relative entropy H(· | P); cf. [43, Theorem 5.1]. This includes the discrete case with finitely supported marginals as analyzed in [14], but also the semi-discrete case (where one marginal is continuous) under minor integrability conditions. Convergence is also known for the scalar Monge problem where c(x, y) = |x − y| on X = Y = R and the marginals are absolutely continuous; here a relatively explicit analysis is possible [22]. It has been conjectured that convergence holds in a general setting.
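Both the convergence of Proposition 3.2 and the non-asymptotic bound (3.1) can be observed numerically. The sketch below (illustrative, not from the paper) assumes equal uniform marginals on a grid of the line with quadratic cost, for which the monotone, here diagonal, coupling is the unique optimal transport; the Sinkhorn solutions concentrate on it as ε decreases, and the k = 2 case of (3.1) holds for singleton sets.

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=5000):
    # standard Sinkhorn scaling for the entropic problem (2.2)
    K = np.exp(-C / eps)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

n = 5
x = np.linspace(0.0, 1.0, n)
C = 0.5 * (x[:, None] - x[None, :]) ** 2       # quadratic cost; pi* is the diagonal
a = np.full(n, 1.0 / n)

# mass away from spt(pi*) vanishes as eps -> 0 (Proposition 3.2)
off_diag = [1.0 - np.trace(sinkhorn(a, a, C, eps)) for eps in (0.1, 0.02, 0.005)]
assert off_diag[0] > off_diag[1] > off_diag[2]
assert off_diag[-1] < 0.01

# the bound (3.1) with k = 2 and singleton A = {((x_i1, y_j1), (x_i2, y_j2))}
eps = 0.02
pi = sinkhorn(a, a, C, eps)
for i1 in range(n):
    for j1 in range(n):
        for i2 in range(n):
            for j2 in range(n):
                delta = C[i1, j1] + C[i2, j2] - C[i1, j2] - C[i2, j1]
                if delta >= 0:
                    assert pi[i1, j1] * pi[i2, j2] <= np.exp(-delta / eps) * 1.000001
```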
4 Large Deviations

Throughout this section, the cost function c is assumed to be continuous. For simplicity of exposition, we shall also assume that

    π_ε → π∗ as ε → 0,    (4.1)

for some (necessarily c-cyclically monotone) transport π∗ ∈ Π(µ, ν). However, if it is merely known that π_{ε_n} → π∗ along a specific sequence ε_n → 0, then all of our results hold along that sequence, regardless of whether (π_ε) has other cluster points. In fact, the arguments in this paper are complementary to the question of convergence discussed in the preceding paragraph: given the convergence of a sequence, we describe the large deviations.

4.1 Lower Bound

In this subsection we introduce the function I and show that it provides a lower bound for the large deviations rate. With the definitions in place, the arguments are straightforward and apply in great generality. We write B_r(z) for the open ball of radius r around z, in any metric space.

Lemma 4.1.
Let (x, y) ∈ X × Y. Suppose there exist (x_i, y_i)_{2≤i≤k} ⊂ spt π∗ with k ≥ 2 such that

    δ := ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{i+1}) > 0,

where (x₁, y₁) := (x, y). Given δ₀ < δ, there exist α, r, ε₀ > 0 such that

    π_ε(B_r(x, y)) ≤ αe^{−δ₀/ε} for ε ≤ ε₀.

Proof.
Once again, continuity of c implies that for r > 0 small enough, ∑ c(x̃_i, ỹ_i) − ∑ c(x̃_i, ỹ_{i+1}) ≥ δ₀ for all (x̃_i, ỹ_i) ∈ B_i := B_r(x_i, y_i), and then B₁ × ⋯ × B_k ⊂ A_k(δ₀, ∞) in Lemma 3.1 yields

    π_ε(B₁) ⋯ π_ε(B_k) ≤ e^{−δ₀/ε}.    (4.2)

For i ≥ 2 we have lim inf π_ε(B_i) ≥ π∗(B_i) due to the weak convergence π_ε → π∗, and β_i := π∗(B_i) > 0 as (x_i, y_i) ∈ spt π∗. Let β = min_{i≥2} β_i. Then π_ε(B_i) ≥ β/2 for i ≥ 2 and ε small, and thus (4.2) yields π_ε(B₁) ≤ (β/2)^{1−k} e^{−δ₀/ε}.

Denote by Σ(k) the set of permutations of {1, …, k}. Next, we state the definition of I(x, y); it is designed to capture the rate δ in Lemma 4.1 and optimize it over the choice of (x_i, y_i)_{2≤i≤k}.

Lemma 4.2.
Given a c-cyclically monotone set ∅ ≠ Γ ⊆ X × Y, define

    I(x, y) := sup_{k≥2} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} sup_{σ ∈ Σ(k)} [ ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{σ(i)}) ]    (4.3)

where (x₁, y₁) := (x, y). Then I : X × Y → [0, ∞] is lower semicontinuous and I = 0 on Γ. We have

    I(x, y) ≥ sup_{k≥2} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} [ ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{i+1}) ],    (4.4)

and equality holds as soon as x ∈ X₀ := proj_X Γ or y ∈ Y₀ := proj_Y Γ.

Proof. We have I ≥ 0 as σ = Id is a possible choice in (4.3). For (x, y) ∈ Γ, the difference of sums in (4.3) is nonpositive by cyclical monotonicity. The semicontinuity follows from the continuity of c.

Let I′(x, y) be the right-hand side of (4.4). As the pairs (x_i, y_i)_{i=2}^k can be relabeled arbitrarily, this is the same as (4.3) except that the last supremum in (4.4) is taken over σ ∈ Σ(k) \ {Id}. If I(x, y) > 0, the identity permutation is not optimal for the relevant pairs (x_i, y_i)_{i=2}^k and equality must hold in (4.4). Thus, if equality fails, then I(x, y) = 0 whereas I′(x, y) < 0. Let x ∈ X₀; then we can choose k = 2 and (x₂, y₂) ∈ Γ with x₂ = x, which yields ∑_{i=1}^2 c(x_i, y_i) − ∑_{i=1}^2 c(x_i, y_{i+1}) = 0 and hence I′(x, y) ≥ 0. The argument for y ∈ Y₀ is symmetric.

The reader may ignore the difference between (4.3) and (4.4); it is merely a notational nuisance. We have the following result for the c-cyclically monotone set Γ := spt π∗, also stated as Theorem 1.1 (a) in the Introduction.

Corollary 4.3.
For any compact set C ⊂ X × Y,

    lim sup_{ε→0} ε log π_ε(C) ≤ − inf_{(x,y)∈C} I(x, y).

Proof.
Fix η > 0 and (x, y) ∈ C. By the definition of I(x, y) there are k ≥ 2 and (x_i, y_i)_{i=2}^k ⊂ Γ such that

    ∑_{i=1}^k c(x_i, y_i) − ∑_{i=1}^k c(x_i, y_{i+1}) > I_η(x, y) − η/2,

where (x₁, y₁) := (x, y) and I_η(x, y) := I(x, y) ∧ η⁻¹. (The truncation is needed only if I(x, y) = ∞.) Lemma 4.1 thus yields a ball B_r(x, y) with

    lim sup ε log π_ε(B_r(x, y)) ≤ −I_η(x, y) + η.    (4.5)

This holds for every (x, y) ∈ C, and as C is covered by finitely many such balls, we deduce that

    lim sup ε log π_ε(C) ≤ − inf_{(x,y)∈C} I_η(x, y) + η.

Recalling that η > 0 was arbitrary, the claim follows.

4.2 Upper Bound

Our next aim is to show that I is also an upper bound for the large deviations rate, thus matching the lower bound in Corollary 4.3. This will be accomplished in two slightly different settings and approaches. The dual approach expresses I as the gap (4.6) between the cost c and the solution of the dual optimal transport problem, whereas the primal approach directly uses the definition (4.3) of I and imposes regularity conditions. The results correspond to Theorem 1.1 (b) in the Introduction.

4.2.1 Upper Bound via Kantorovich Potential

We start with the dual approach, first recalling some standard notions of optimal transport; we have tried to consistently use the notation of [52]. A proper function ψ : X → (−∞, ∞] is called c-convex if there exists some ζ : Y → [−∞, ∞] such that ψ(x) = sup_{y∈Y} [ζ(y) − c(x, y)] for all x ∈ X. Its c-conjugate is defined by

    ψ^c(y) := inf_{x∈X} [ψ(x) + c(x, y)] for y ∈ Y,

and its c-subdifferential is

    ∂_c ψ = { (x, y) ∈ X × Y : ψ^c(y) − ψ(x) = c(x, y) }.

Given a c-cyclically monotone set Γ, a c-convex function ψ is called a Kantorovich potential if Γ ⊂ ∂_c ψ; that is, if ψ^c(y) − ψ(x) = c(x, y) on Γ. This implies in particular that ψ, ψ^c are finite on X₀ := proj_X Γ, Y₀ := proj_Y Γ.
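For a finite Γ, both the rate function (4.3) and a Kantorovich potential can be evaluated by brute force. The sketch below (illustrative code, not from the paper) takes Γ as the support of an optimal assignment for a random discrete cost, so that Γ is c-cyclically monotone, builds the Rockafellar-type antiderivative recalled in (4.8) below, and checks that Γ ⊂ ∂_c ψ, that I = 0 on Γ with I ≥ 0 everywhere, and that I ≤ c − ψ^c + ψ pointwise; equality, i.e. (4.6), additionally needs Assumption 4.4, which can fail for discrete marginals.

```python
import itertools
import numpy as np

def rate_I(a_idx, b_idx, C, gamma):
    # brute-force evaluation of (4.3) with (x_1, y_1) = (x_a, y_b); the remaining
    # points run over subsets of gamma and sigma over all permutations
    best = 0.0                                     # sigma = Id always yields 0
    for m in range(len(gamma) + 1):
        for pts in itertools.combinations(gamma, m):
            xs = [a_idx] + [p[0] for p in pts]
            ys = [b_idx] + [p[1] for p in pts]
            base = sum(C[i, j] for i, j in zip(xs, ys))
            for sig in itertools.permutations(range(len(ys))):
                best = max(best, base - sum(C[xs[i], ys[sig[i]]] for i in range(len(ys))))
    return best

def rockafellar_potential(C, gamma, ref):
    # psi(x) = sup over chains (x_1,y_1),...,(x_k,y_k) in gamma of
    # sum_{i=0}^k [c(x_i,y_i) - c(x_{i+1},y_i)], (x_0,y_0) = ref, x_{k+1} := x; cf. (4.8)
    psi = np.full(C.shape[0], -np.inf)
    for k in range(len(gamma) + 1):
        for chain in itertools.permutations(gamma, k):
            pairs = [ref] + list(chain)
            for x in range(C.shape[0]):
                nxt = [p[0] for p in pairs[1:]] + [x]       # x_1, ..., x_k, x
                val = sum(C[xi, yi] - C[xn, yi] for (xi, yi), xn in zip(pairs, nxt))
                psi[x] = max(psi[x], val)
    return psi

rng = np.random.default_rng(3)
n = 4
C = rng.uniform(size=(n, n))
sigma_star = min(itertools.permutations(range(n)),
                 key=lambda s: sum(C[i, s[i]] for i in range(n)))
gamma = [(i, sigma_star[i]) for i in range(n)]     # c-cyclically monotone support
psi = rockafellar_potential(C, gamma, ref=gamma[0])
psi_c = (psi[:, None] + C).min(axis=0)             # c-conjugate on the y-grid
I_vals = np.array([[rate_I(i, j, C, gamma) for j in range(n)] for i in range(n)])

assert all(abs(psi_c[j] - psi[i] - C[i, j]) < 1e-9 for i, j in gamma)  # gamma in d_c psi
assert all(I_vals[i, j] < 1e-9 for i, j in gamma)                      # I = 0 on gamma
assert (I_vals >= 0.0).all()
assert (I_vals <= C - psi_c[None, :] + psi[:, None] + 1e-9).all()      # I <= c - psi^c + psi
```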
In the context of optimal transport, if spt π ⊂ ∂_c ψ for some optimal π ∈ Π(µ, ν), then ∂_c ψ contains the support of any optimal transport; indeed, ∂_c ψ is a maximal c-cyclically monotone set with respect to inclusion. In what follows, the cyclically monotone set of interest is Γ = spt π_*, where π_* is the limiting optimal transport (4.1).

Assumption 4.4.
Uniqueness of Kantorovich potentials holds on X; that is, for any c-convex functions ψ_1, ψ_2 on X with Γ ⊂ ∂_c ψ_i, the difference ψ_1 − ψ_2 is constant on X.

This is often considered a fairly weak assumption, at least for differentiable cost functions, and we detail sufficient conditions in Proposition B.2 of Appendix B. However, we emphasize that connectedness of at least one marginal support is crucial (cf. Example 4.8 below).

As announced, Assumption 4.4 allows us to express I through the Kantorovich potential; see (4.6). For our present purpose, the key consequence is (4.7). It is worth noting that (4.6) also allows us to translate a large body of known results about c-convex functions, such as regularity results, into statements about I. Finally, the gap (4.6) also plays a role in the regularity theory of optimal transport maps (especially in [37]), thus relating to the second approach in Section 4.2.2 below.

Proposition 4.5.
Let Assumption 4.4 hold. Then

I(x, y) = c(x, y) − ψ^c(y) + ψ(x), (x, y) ∈ X × Y (4.6)

for any Kantorovich potential ψ. In particular, I < ∞ on X × Y. If (x, y), (x′, y′) ∈ X × Y are such that (x′, y), (x, y′) ∈ Γ, then

I(x, y) + I(x′, y′) = c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y). (4.7)

Proof. We first elaborate on Assumption 4.4. A particular family of Kantorovich potentials, sometimes called Rockafellar antiderivatives of Γ, is defined as follows (cf. [52, Equation (5.17), p. 65]): fix (x_0, y_0) ∈ Γ and set

ψ_{(x_0,y_0)}(x) := sup_{k≥0} sup_{(x_i,y_i)_{i=1}^k ⊂ Γ} Σ_{i=0}^k [c(x_i, y_i) − c(x_{i+1}, y_i)], where x_{k+1} := x. (4.8)

It then holds that ψ_{(x_0,y_0)}(x) = 0 for x = x_0. Clearly Assumption 4.4 implies that changing the reference point (x_0, y_0) only changes this potential by a constant. In particular,

Ψ_{(x_0,y_0)}(x, y) := ψ^c_{(x_0,y_0)}(y) − ψ_{(x_0,y_0)}(x), (x, y) ∈ X × Y (4.9)

does not depend on (x_0, y_0) ∈ Γ, and we may simply write Ψ := Ψ_{(x_0,y_0)}. Indeed, under Assumption 4.4, Ψ is even the same for any potential ψ. We now use this independence to prove the proposition. To avoid notational conflict, we first rewrite the definition (4.8) as

ψ_{(x̄,ȳ)}(x) = sup_{k≥1} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} [ c(x̄, ȳ) + Σ_{i=2}^k [c(x_i, y_i) − c(x_{i+1}, y_i)] − c(x_2, ȳ) ], (4.10)

where we have avoided the subscript i = 1. Fix (x, y) ∈ X × Y. Writing (x_1, y_1) := (x, y) as in Lemma 4.2, the definition x_{k+1} := x of (4.8) becomes our usual cyclical convention x_{k+1} = x_1. As y ∈ Y, there exists x̄ ∈ X such that (x̄, y) ∈ Γ.
Using (4.10) with ȳ := y then yields

ψ_{(x̄,y)}(x) = sup_{k≥1} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} [ c(x̄, y) + Σ_{i=2}^k c(x_i, y_i) − Σ_{i=2}^k c(x_{i+1}, y_i) − c(x_2, y) ]
= sup_{k≥1} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} [ c(x̄, y) − c(x, y) + Σ_{i=1}^k c(x_i, y_i) − Σ_{i=1}^k c(x_{i+1}, y_i) ]
= c(x̄, y) − c(x, y) + sup_{k≥1} sup_{(x_i,y_i)_{i=2}^k ⊂ Γ} [ Σ_{i=1}^k c(x_i, y_i) − Σ_{i=1}^k c(x_i, y_{i+1}) ]
= c(x̄, y) − c(x, y) + I(x, y),

where we have used the last part of Lemma 4.2. In view of ψ_{(x̄,y)}(x̄) = 0, the fact that Ψ_{(x̄,y)} = c on Γ shows in particular that c(x̄, y) = ψ^c_{(x̄,y)}(y), and hence the preceding display yields

I(x, y) = c(x, y) + ψ_{(x̄,y)}(x) − ψ^c_{(x̄,y)}(y) = c(x, y) − Ψ_{(x̄,y)}(x, y).
By the first part of the proof, Ψ_{(x̄,y)}(·) = Ψ(·) does not depend on (x̄, y), and the above is precisely (4.6).

To see (4.7), let (x′, y), (x, y′) ∈ Γ. Using that I = 0 on Γ by Lemma 4.2 and then (4.6),

I(x, y) + I(x′, y′) = I(x, y) + I(x′, y′) − I(x, y′) − I(x′, y)
= c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y) − Ψ(x, y) − Ψ(x′, y′) + Ψ(x, y′) + Ψ(x′, y),

where the last line vanishes as Ψ is a sum of marginal functions by (4.9).

Remark 4.6.
The proof of Proposition 4.5 is based on the condition that

Ψ_{(x_0,y_0)} of (4.9) does not depend on (x_0, y_0) ∈ Γ, (4.11)

which may seem weaker than Assumption 4.4. However, Assumption 4.4 is in fact equivalent to (4.11); the proof is stated below. As a direct consequence, another equivalent condition is that the Rockafellar antiderivative (4.8) be independent of (x_0, y_0). The symmetry of (4.11) shows that it is further equivalent to impose the analogue of Assumption 4.4 on Y instead of X.

Proof that (4.11) implies Assumption 4.4.
By construction, the Rockafellar antiderivative ψ_1 := ψ_{(x_1,y_1)} of (4.8) has the minimality property ψ_1 ≤ ξ on X whenever ξ is a potential with ξ(x_1) = 0 = ψ_1(x_1). (See [52, p. 62], or [3] for a more general result and further context.) Consider another point (x_2, y_2) ∈ Γ, let ψ_2 := ψ_{(x_2,y_2)} and let ξ be any potential. Using the minimality twice,

ψ_1(x_2) − ψ_1(x_1) ≤ ξ(x_2) − ξ(x_1) ≤ ψ_2(x_2) − ψ_2(x_1).

Given (4.11), the right-hand side can be expressed as

ψ_2(x_2) − ψ_2(x_1) = ψ_2(x_2) − ψ_2^c(y_1) − ψ_2(x_1) + ψ_2^c(y_1)
= ψ_1(x_2) − ψ_1^c(y_1) − ψ_1(x_1) + ψ_1^c(y_1)
= ψ_1(x_2) − ψ_1(x_1),

which is the left-hand side. It follows that ψ_1(x_2) − ψ_1(x_1) = ξ(x_2) − ξ(x_1) for any potential ξ, and as x_1, x_2 ∈ X were arbitrary, Assumption 4.4 holds.

We can now show that I is an upper bound for the rate function.

Corollary 4.7.
Let Assumption 4.4 hold. For any open set U ⊂ X × Y,

lim inf_{ε→0} ε log π_ε(U) ≥ − inf_{(x,y)∈U} I(x, y).

Proof. It suffices to show that given (x, y) ∈ U and η > 0, there exists r_0 > 0 such that for all r < r_0,

lim sup_{ε→0} −ε log π_ε(B_r(x, y)) ≤ I(x, y) + η.

Let η > 0, pick any (x′, y′) ∈ X × Y such that (x′, y), (x, y′) ∈ Γ, and set

α := c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y).

We have I(x, y) < ∞ and I(x′, y′) < ∞ by Proposition 4.5. For r > 0 small enough we may use Lemma 3.1 with δ′ := α + η/2 and B_r(x, y) × B_r(x′, y′) ⊂ A(0, δ′) to obtain

lim sup −ε [log π_ε(B_r(x, y)) + log π_ε(B_r(x′, y′))] = lim sup −ε log (π_ε ⊗ π_ε)(B_r(x, y) × B_r(x′, y′)) ≤ α + η/2. (4.12)

On the other hand, for r small enough, Lemma 4.1 yields as in (4.5) that

lim inf −ε log π_ε(B_r(x′, y′)) ≥ I(x′, y′) − η/2. (4.13)

Using (4.12), then (4.7) and finally (4.13),

lim sup −ε log π_ε(B_r(x, y)) + lim inf −ε log π_ε(B_r(x′, y′))
≤ lim sup −ε [log π_ε(B_r(x, y)) + log π_ε(B_r(x′, y′))]
≤ α + η/2
= I(x, y) + I(x′, y′) + η/2
≤ I(x, y) + lim inf −ε log π_ε(B_r(x′, y′)) + η,

and the claim follows.

The following simple example shows that if both marginal supports are disconnected (and Assumption 4.4 is violated), I may fail to be an upper bound for the rate function.

Example 4.8 (Disconnected Supports). Consider the normalized 2×2 assignment problem: X = Y = {1, 2} and µ = ν = (δ_1 + δ_2)/2. Here Π(µ, ν) is the convex hull of the two couplings

π_* = (δ_{(1,1)} + δ_{(2,2)})/2, π′ = (δ_{(1,2)} + δ_{(2,1)})/2.

In particular, every π ∈ Π(µ, ν) is symmetric: π{(1, 2)} = π{(2, 1)}. Consider a cost function c with c(1, 1) = c(2, 2) = 0 and c(1, 2) + c(2, 1) > 0. Then π_* is the unique optimal transport and we know that π_ε → π_*. Let

r(i, j) := lim_{ε→0} −ε log π_ε({(i, j)}) (4.14)

be the exponential rate of convergence. Using Lemma 3.1 with A = {(1, 2)} × {(2, 1)} ⊂ A(δ, δ) for δ := c(1, 2) + c(2, 1) > 0 shows r(1, 2) + r(2, 1) = δ. As π_ε must be symmetric, we conclude that the true exponential rate is

r(1, 2) = r(2, 1) = δ/2.

(A priori, it may not be obvious that the limit (4.14) exists, but a posteriori, this is justified as every subsequential limit leads to the same value.) On the other hand, the definition (4.3) of I readily yields that I ≡ 0.

In the remainder of the section we present an alternative approach to the upper bound which does not (directly) refer to potentials but instead employs a continuity condition for the limiting optimal transport π_*. We call a subset of a metric space arcwise connected if any two points are connected by a continuous curve of finite length.

Assumption 4.9. (a)
Γ = graph T for a map T: X → Y.

(b) X is arcwise connected.

(c) The function c(·, T(·)) has the following continuity property: given a compact K ⊂ X, we have uniformly over x_1, x_2 ∈ K that

|c(x_1, T(x_1)) + c(x_2, T(x_2)) − c(x_1, T(x_2)) − c(x_2, T(x_1))| = o(d(x_1, x_2)). (4.15)

As an example, consider X = Y = R^d with cost c(x, y) = ‖x − y‖²/2 and an optimal transport π given by a continuous transport map T on the arcwise connected support spt µ. Then Assumption 4.9 holds with X = spt µ, as the left-hand side of (4.15) equals |⟨x_1 − x_2, T(x_2) − T(x_1)⟩| ≤ ‖x_1 − x_2‖ ‖T(x_1) − T(x_2)‖ and T is uniformly continuous on compact sets. General sufficient conditions for the continuity of T can be found in [16, Theorem 1].

Next, we show how to establish the key half of (4.7) under Assumption 4.9.

Lemma 4.10.
Let Assumption 4.9 hold. If (x, y), (x′, y′) ∈ X × Y are such that (x′, y), (x, y′) ∈ Γ, then

I(x, y) + I(x′, y′) ≥ c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y). (4.16)

Figure 1: Schematic representation of the sums (4.17) (left) and (4.18) (right). Each dashed line stands for a term c(·, ·).

Proof.
Set (x_1, y_1) := (x, y) and (x′_1, y′_1) := (x′, y′). Let k ≥ 2 and consider arbitrary (x_i, y_i), (x′_i, y′_i) ∈ Γ for 2 ≤ i ≤ k. The definition of I yields that

I(x, y) + I(x′, y′) ≥ Σ_{i=1}^k [c(x_i, y_i) + c(x′_i, y′_i)] − Σ_{i=1}^k [c(x_i, y_{i+1}) + c(x′_i, y′_{i+1})].

This holds in particular for the choices x_k := x′ and x′_k := x, which entail that y_k = T(x′) = y and y′_k = T(x) = y′. Moreover, we have y_i = T(x_i) and y′_i = T(x′_i) for i ≥ 2. Separating the first term of the first sum and the last term of the second sum, we obtain that

I(x, y) + I(x′, y′) ≥ c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y)
+ Σ_{i=2}^k [c(x_i, y_i) + c(x′_i, y′_i)] − Σ_{i=1}^{k−1} [c(x_i, y_{i+1}) + c(x′_i, y′_{i+1})].

We further choose x′_i := x_{k−i+1} for i = 2, ..., k−1, which implies y′_i = y_{k−i+1} for i = 2, ..., k−1. Then the first sum can be rearranged as

Σ_{i=2}^k [c(x_i, y_i) + c(x′_i, y′_i)] = Σ_{i=1}^{k−1} [c(x_i, T(x_i)) + c(x_{i+1}, T(x_{i+1}))] (4.17)

and the second sum can be rearranged as

Σ_{i=1}^{k−1} [c(x_i, y_{i+1}) + c(x′_i, y′_{i+1})] = Σ_{i=1}^{k−1} [c(x_i, T(x_{i+1})) + c(x_{i+1}, T(x_i))]. (4.18)

(These rearrangements are elementary if tedious; Figure 1 may be helpful to complete them.) In summary, we have

I(x, y) + I(x′, y′) ≥ c(x, y) + c(x′, y′) − c(x, y′) − c(x′, y) + Ξ,

where, subject to x_1 = x and x_k = x′,

Ξ := sup_{k≥2} sup_{x_2,...,x_{k−1} ∈ spt µ} Ξ_k for Ξ_k := Σ_{i=1}^{k−1} [c(x_i, T(x_i)) + c(x_{i+1}, T(x_{i+1})) − c(x_i, T(x_{i+1})) − c(x_{i+1}, T(x_i))].

It remains to show that given η > 0, we can achieve Ξ_k ≥ −η by a suitable choice of k and x_2, ..., x_{k−1}.
Fix a continuous, rectifiable curve ϕ: [0, 1] → X with ϕ(0) = x and ϕ(1) = x′, and denote its length by C. For each k ≥ 2 there exist 0 = t_1 < t_2 < ··· < t_{k−1} < t_k = 1 such that x_i := ϕ(t_i) satisfy d(x_i, x_{i+1}) ≤ C/(k−1) for all 1 ≤ i ≤ k−1. Applying Assumption 4.9 on the compact set ϕ([0, 1]), we have that

Σ_{i=1}^{k−1} |c(x_i, T(x_i)) + c(x_{i+1}, T(x_{i+1})) − c(x_i, T(x_{i+1})) − c(x_{i+1}, T(x_i))| ≤ (k−1) o(C/(k−1)) = o(1) (4.19)

as k → ∞.

Remark 4.11.
The preceding arguments can be generalized to handle certain discontinuities in T, even though at a discontinuity, (4.15) can only be expected with o(1) rather than o(d(x_1, x_2)). Indeed, the conclusion of (4.19) still holds if for a bounded number of i's, the term under the sum is only o(1). For instance, this can be used to handle the case of semi-discrete transport with quadratic cost, where ν has finite support and hence the transport map is necessarily discontinuous.

Corollary 4.12.
Let Assumption 4.9 hold. For any open set U ⊂ X × Y,

lim inf_{ε→0} ε log π_ε(U) ≥ − inf_{(x,y)∈U} I(x, y).

Proof.
The argument is similar to the proof of Corollary 4.7, using the inequality (4.16) instead of the equality (4.7). In the course of the argument one also obtains that (4.16) already implies (4.7). We omit the details.
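As a numerical aside (our own illustration, not part of the paper's argument), the phenomenon of Example 4.8 is easy to reproduce: on the 2×2 assignment problem, the entropic plan is symmetric in the off-diagonal cells and its exponential rate is δ/2, strictly positive, even though I ≡ 0. The off-diagonal costs 0.3 and 0.5 below are arbitrary choices with δ = 0.8, and a moderate ε is used so that the plain Sinkhorn iteration is fully converged, at the price of an O(ε) offset (roughly ε log 2) in the empirical rate.

```python
import numpy as np

# Example 4.8 data: X = Y = {1, 2}, uniform marginals, c(1,1) = c(2,2) = 0.
# Off-diagonal costs are illustrative choices; delta := c(1,2) + c(2,1) > 0.
C = np.array([[0.0, 0.3],
              [0.5, 0.0]])
delta = C[0, 1] + C[1, 0]
mu = nu = np.array([0.5, 0.5])

def sinkhorn_plan(C, mu, nu, eps, iters=5000):
    K = np.exp(-C / eps)
    v = np.ones(2)
    for _ in range(iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

eps = 0.1
pi = sinkhorn_plan(C, mu, nu, eps)
r12 = -eps * np.log(pi[0, 1])  # empirical rate of the cell (1, 2)
r21 = -eps * np.log(pi[1, 0])
# The marginal constraints force pi[0,1] == pi[1,0], so both cells decay at the
# common rate delta/2 (up to an O(eps) correction), although I vanishes there.
```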
The aim of this section is to establish that, under certain conditions, I(x, y) of (4.3) is strictly positive for (x, y) ∈ X × Y outside the support Γ of the limiting optimal transport π_*. In view of Corollary 4.7, this implies that the mass of π_ε around (x, y) converges exponentially fast.

When both marginals are supported by finitely many points, it is known that exponential convergence holds for any cost function [14, 53]. We shall see that in the continuum case, such a statement must depend on the geometry of the cost. Throughout this section, we assume that X = R^d (while Y is Polish), that the cost c is continuous and differentiable in x with ∇_x c continuous, and that there exists an optimal transport π_* as in (4.1). We recall the twist condition of optimal transport (e.g., [52, p. 234]) which requires that ∇_x c(x, ·) be injective; it holds in particular for the quadratic cost. Example 5.7 below shows that I may vanish on a set of large measure µ ⊗ ν when the twist condition does not hold.

Similarly to the preceding section, we present a primal and a dual approach. The direct approach proceeds as follows. Given (x, y) ∉ Γ, we use the geometry of c and regularity of the optimal transport to find an auxiliary pair (x̃, ỹ) ∈ Γ such that c(x, y) − c(x̃, y) + c(x̃, ỹ) − c(x, ỹ) > 0. Then, the definition (4.3) of I (with k = 2) shows that I(x, y) > 0. The following is one possible implementation.

Lemma 5.1.
Fix (x, y′) ∈ Γ and y ∈ Y. Suppose that

v := ∇_x c(x, y) − ∇_x c(x, y′) ≠ 0 (5.1)

and that there exist (x_n, y_n) ∈ Γ such that (x_n, y_n) → (x, y′) and

lim inf_n cos α_n > 0 for α_n := ∠(v, x − x_n). (5.2)

Then I(x, y) > 0.

Proof. Set ∆(x, y′, y_n) := ∇_x c(x, y′) − ∇_x c(x, y_n); then ∆(x, y′, y_n) → 0 as d(y′, y_n) → 0, and we have

δ_n := c(x, y) − c(x_n, y) + c(x_n, y_n) − c(x, y_n)
= ⟨∇_x c(x, y) − ∇_x c(x, y_n), x − x_n⟩ + o(‖x − x_n‖)
= ⟨∇_x c(x, y) − ∇_x c(x, y′) + ∆(x, y′, y_n), x − x_n⟩ + o(‖x − x_n‖)
= ⟨v, x − x_n⟩ + o(‖x − x_n‖) + O(d(y′, y_n)) ‖x − x_n‖
= cos(α_n) ‖v‖ ‖x − x_n‖ + o(‖x − x_n‖) + O(d(y′, y_n)) ‖x − x_n‖.

As v ≠ 0 and lim inf_n cos α_n > 0, it follows that δ_n > 0 for n large enough. Fix such an n; then choosing k = 2 and (x_2, y_2) := (x_n, y_n) in (4.3) shows that I(x, y) ≥ δ_n > 0.

Recall the notation X = proj_X Γ and Γ = spt π_*. If x is interior to X, we can choose auxiliary points in any direction from x, and Lemma 5.1 yields a positivity result for I(x, y) as follows.

Lemma 5.2. Let x ∈ int X and y ∈ Y. Let π_* be given by a transport map T which is continuous at x. If ∇_x c(x, y) − ∇_x c(x, T(x)) ≠ 0, then I(x, y) > 0.

Proof.
For n large we can uniquely define a point x_n ∈ ∂B_{1/n}(x) ⊂ X by the requirement that x − x_n be parallel to v (here ∂B denotes the boundary). Then cos α_n = 1 in the notation of (5.2), and we conclude using Lemma 5.1 with (x, y′) = (x, T(x)) and (x_n, y_n) = (x_n, T(x_n)).

Sufficient conditions for the continuity (and higher regularity) of the transport map have been studied extensively; see [52, Section 12] for an overview of now-classical results and, among others, [16] for recent results including unbounded domains.

The situation is more delicate if x is a boundary point of X or a point of discontinuity of the transport map, as that restricts the viable choices for approximating sequences. We provide some examples of possible results; for simplicity of exposition, they are stated for the quadratic cost on X = Y = R^d. The extension of such arguments to a general class of cost functions is discussed in Appendix C.

Lemma 5.3.
Let c(x, y) = ‖x − y‖²/2, let X be strictly convex, and consider (x, y) ∈ (X × Y) \ Γ with x ∈ ∂X. Suppose that π_* is given by a transport map T which is continuous on a neighborhood B_r(x) ∩ X for some r > 0. Then I(x, y) > 0.

Proof. The main step is to find a point x′′ ∈ X such that

⟨v, x − x′′⟩ > 0. (5.3)

Once that is achieved, we may choose a sequence x_n → x in the open segment (x′′, x), which is contained in int X due to strict convexity. As (x_n, T(x_n)) → (x, T(x)) by continuity and α_n = ∠(v, x − x′′) for all n, we conclude by Lemma 5.1 with (x, y′) := (x, T(x)).

To find x′′ satisfying (5.3), we first fix x′ ∈ X such that (x′, y) ∈ Γ. As c is quadratic, we have v = y′ − y in (5.1), and the cyclical monotonicity of Γ yields

⟨v, x − x′⟩ = ⟨y′ − y, x − x′⟩ ≥ 0.

If this inequality is strict, we choose x′′ := x′. Whereas if ⟨v, x − x′⟩ = 0, we consider the mid-point x̄ = (x′ + x)/2, which satisfies x̄ ∈ int X by strict convexity as well as ⟨v, x − x̄⟩ = 0. After choosing ρ > 0 small enough such that ∂B_ρ(x̄) ⊂ X, we can find a point x′′ ∈ ∂B_ρ(x̄) ⊂ X such that ⟨v, x − x′′⟩ > 0, completing the proof.

Next, we illustrate the dual approach in a problem with a discontinuous optimal transport map. For the remainder of the section, we assume that there exists a Kantorovich potential ψ such that

I(x, y) = c(x, y) − ψ^c(y) + ψ(x), (x, y) ∈ X × Y. (5.4)

As seen in Proposition 4.5, a sufficient condition is Assumption 4.4 (uniqueness of potentials). If we assume that µ ∼ L^d on its support, the quadratic cost and the convexity condition in the results below already guarantee that Assumption 4.4 holds; cf. Proposition B.2.
The relevance of (5.4) is that it yields the representation

{I = 0} ∩ (X × Y) = ∂_c ψ ∩ (X × Y), (5.5)

so that our question regarding exponential convergence can be phrased as: does Γ fill the entire set ∂_c ψ ∩ (X × Y)?

The intersection with X × Y is crucial to avoid a negative answer in many cases with discontinuous transport (see also the proof of Proposition 5.5 below). On the other hand, the intersection is justified because the interpretation of I as a rate of convergence is meaningless outside spt π_ε.

We first state the following continuation argument similar to Lemma 5.3.

Lemma 5.4.
Let c(x, y) = ‖x − y‖²/2, let X be strictly convex, and consider (x, y) ∈ (X × Y) \ Γ with x ∈ ∂X. Suppose that I(x̃, y) > 0 for all x̃ ∈ int X ∩ B_r(x), for some r > 0. Then I(x, y) > 0.

Proof. We may state the proof with the equivalent cost c(x, y) = −⟨x, y⟩, so that the notions of c-convex analysis and convex analysis coincide. Suppose for contradiction that I(x, y) = 0. Fix x′ ∈ X such that (x′, y) ∈ Γ and denote φ := −ψ^c for ψ as in (5.4); then both x and x′ are in the set

{I(·, y) = 0} = ∂_c φ(y) = ∂φ(y),

where ∂φ(y) denotes the subdifferential of the convex function φ in the usual sense. The latter set being convex, it must include the whole segment [x, x′], meaning that I(x̃, y) = 0 for all x̃ ∈ [x, x′]. The interior of the segment is included in int X by strict convexity, contradicting the hypothesis.

Proposition 5.5 (Semidiscrete Transport). Let c(x, y) = ‖x − y‖²/2 on X = Y = R^d, let X be strictly convex, let µ ≪ L^d and let spt ν be at most countable, with no accumulation points. Then {I = 0} ∩ (X × Y) = Γ.

Proof. Again, we may state the proof with the equivalent cost c(x, y) = −⟨x, y⟩. Let (x, y) ∈ X × Y. In view of Lemma 5.4, it suffices to treat the case x ∈ int X. Denote by dom ∇ψ the set of points where ψ is differentiable and assume that I(x, y) = 0; that is, y ∈ ∂_c ψ(x) = ∂ψ(x). The (ordinary) subdifferential ∂ψ(x) equals {∇ψ(x)} if x ∈ dom ∇ψ, whereas in general, it can be described (cf. [49, Theorem 25.6, p. 246]) as the closed convex hull of

S(x) = { lim_{n→∞} ∇ψ(x_n) : x_n → x, x_n ∈ dom ∇ψ, lim_{n→∞} ∇ψ(x_n) exists }. (5.6)

Case 1: x ∈ dom ∇ψ. As Γ ⊂ ∂ψ and ∂ψ(x) is a singleton, it follows that (x, y) = (x, ∇ψ(x)) ∈ Γ.

Case 2: y ∈ S(x). Let x_n → x be as in (5.6). Recalling that x ∈ int X, we have x_n ∈ X for n large.
Thus (x_n, ∇ψ(x_n)) ∈ Γ by Case 1, and closedness entails that the limit (x, y) belongs to Γ as well.

Case 3: y ∈ ∂ψ(x) \ S(x). We shall show that this case does not occur. As a first step, we argue that

∂ψ(x) = conv S(x) (5.7)

in the present context (without taking the closure). As x ∈ int X ⊂ int{ψ < ∞}, the subdifferential ∂ψ(x) is bounded [49, Theorem 23.4, p. 217]. Let U be a bounded neighborhood of ∂ψ(x). The discreteness assumption on spt ν entails that U ∩ Y is a finite set (and that Y = spt ν). Let x_n → x be as in (5.6). For x_n close to x we have ∇ψ(x_n) ∈ U, but also ∇ψ(x_n) ∈ Y by Case 1. As a result, the set S(x) of limits is finite. In particular, its convex hull is already closed, and (5.7) follows.

Now let y ∈ ∂ψ(x) \ S(x). By (5.7), y is a nontrivial convex combination y = Σ_{i=1}^k θ_i y_i for some distinct y_i ∈ S(x) and θ_i ∈ (0, 1) with Σ θ_i = 1. Let φ := −ψ^c (which is the Legendre–Fenchel transform of ψ in this context) and x′ ∈ ∂φ(y). Then cyclical monotonicity of ∂φ implies ⟨x′ − x, y − y_i⟩ ≥ 0 for all i, and as

Σ_{i=1}^k θ_i ⟨x′ − x, y − y_i⟩ = ⟨x′ − x, y − Σ_i θ_i y_i⟩ = 0, (5.8)

it follows that ⟨x′ − x, y − y_i⟩ = 0 for all i. That is, we have

∂φ(y) − {x} ⊥ y − y_i for all 1 ≤ i ≤ k,

which implies in particular dim ∂φ(y) < d. On the other hand, ν({y}) > 0 by the discreteness of Y. Thus µ(∂φ(y)) = ν({y}) > 0, contradicting that µ ≪ L^d does not charge lower-dimensional sets. This shows that Case 3 does not occur and completes the proof.

The preceding arguments can be extended to a class of cost functions satisfying a Ma–Trudinger–Wang condition. This is detailed in Appendix C.

Proposition 5.6.
After replacing convexity by c-convexity, Lemma 5.4 and Proposition 5.5 extend to cost functions c satisfying Assumption C.1.

We conclude with a simple example illustrating the relevance of the twist condition. Here, ∇_x c(x, y) vanishes below the diagonal, so that the condition fails, and indeed the convergence π_ε → π_* is sub-exponential in that region.

Example 5.7 (No Twist). Consider X = Y = R with identical marginals µ = ν having support [0, 1] and the cost function

c(x, y) = (y − x)² for y ≥ x, and c(x, y) = 0 for y < x.

As ∇_x c(x, y) = 0 for all y < x, this cost does not satisfy the twist condition. Clearly there is a unique optimal transport π_* ∈ Π(µ, ν), given by the Monge map T(x) = x. Its support is Γ = {(x, x) : 0 ≤ x ≤ 1}, and one can check by direct calculation based on (4.3) that I = c on X × Y = [0, 1]². Assumption 4.9 is readily verified, hence Corollary 4.12 shows that I is indeed the rate function in this context. We can obtain the same conclusion from Corollary 4.7, at least if we also suppose that µ is equivalent to the Lebesgue measure on [0, 1]: then, Proposition B.2 shows that Assumption 4.4 holds. Or as a third option, we may verify directly that I satisfies (4.7), and then conclude as in the proof of Corollary 4.7. In any event, we see that I = 0 on {y < x}, indicating sub-exponential decay of the weight of π_ε.

A Cyclical Invariance and Factorization
In this section we detail some classical facts about static Schrödinger bridges as well as the proof of Proposition 2.2. Let (X, 𝒳, µ) and (Y, 𝒴, ν) be probability spaces; as before, we denote by Π(µ, ν) the set of couplings.

Proposition A.1.
Let R be a probability measure on (X × Y, 𝒳 ⊗ 𝒴) and suppose that there exists π ∈ Π(µ, ν) with

H(π | R) < ∞. (A.1)

Then there is a unique minimizer π_* ∈ Π(µ, ν) for inf_{π∈Π(µ,ν)} H(π | R). If we can choose π ∼ R in (A.1), then also π_* ∼ R.

Assume in addition that R ∼ µ ⊗ ν. Then there exist measurable functions Z: X × Y → (0, ∞), f: X → (0, ∞), g: Y → (0, ∞) such that

Z(x, y) = f(x) g(y), (x, y) ∈ X × Y (A.2)

and Z is a version of the Radon–Nikodym density dπ_*/dR. Conversely, if π ∈ Π(µ, ν) and a version of its density has the form (A.2) on a set of full π-measure, where f and g are arbitrary [−∞, ∞]-valued functions, then π = π_*.

The uniqueness result also holds without Assumption (A.1), if stated as follows. Let π, π′ ∈ Π(µ, ν) and π, π′, R ∼ µ ⊗ ν. If versions of dπ/dR and dπ′/dR both admit factorizations as above, then π = π′.

Proof. Assuming (A.1), the existence, uniqueness and equivalence claims on π_* are shown in [17]; see its Theorem 2.1 and the remark following Theorem 2.2. The factorization of the density and its measurability are delicate in general (see [9, 10, 26, 51], among others) but less so under our condition that R ∼ µ ⊗ ν. Indeed, using that R ∼ µ ⊗ ν, the proof of [26, Proposition 3.19] shows that a version of dπ_*/dR admits a factorization into measurable functions f, g on a set A of full π_*-measure,

dπ_*/dR(x, y) = f(x) g(y) for (x, y) ∈ A,

and moreover that the set A can be chosen to be a measurable rectangle, A = X_0 × Y_0, where necessarily µ(X_0) = ν(Y_0) = 1. On X_0^c ∪ {f ∉ (0, ∞)} we may redefine f := 1, and similarly for g. Then Z(x, y) := f(x) g(y) is a (0, ∞)-valued version of dπ_*/dR whose factorization holds on the whole space X × Y, as required.

Conversely, suppose that the density Z = dπ/dR of π ∈ Π(µ, ν) admits a factorization f(x) g(y) on a set of full π-measure.
Writing f(·) = Z(·, y)/g(y) for a suitable y, it readily follows that f is measurable, and similarly for g. Such a measurable factorization uniquely characterizes π_* within Π(µ, ν); see [26, Corollary 3.15].

For the final generalization of the uniqueness claim, let π, π′ be as stated and note that a version of the density dπ/dπ′ then admits a factorization. We consider π′ as an auxiliary reference measure, instead of R. Then the analogue of (A.1) holds as π′ is itself a coupling, and clearly π′ is the unique minimizer of H(· | π′). We can now apply the above results.

The above proof used [26] as a reference, which in turn is based on the stability of sums of functions [9, 51]. See [34] for a survey. An insightful approach with a direct construction of the factorization was very recently proposed in [2]; it yields similar results for the entropic function h(x) = x log x considered here but also allows for a generalization to nonconvex penalties h. In addition, it portrays what we called cyclical invariance as the cyclical monotonicity of an optimal transport problem arising from the linearization of the static Schrödinger bridge problem.

Proof of Proposition 2.2.
Recall the definition (2.5) of R and note that R ∼ P = µ ⊗ ν. The entropic optimal transport problem (2.2) can be rewritten as inf_{π∈Π(µ,ν)} εH(π | R), putting it in the realm of Proposition A.1. Similarly, the second part of (2.3) is equivalent to (A.1). Let Z be as in (A.2); then (2.6) follows, and hence also (2.4).

Conversely, if π ∈ Π(µ, ν) is cyclically invariant, then (2.6) holds for its density Z. Fix an arbitrary x_0 ∈ X and note that f(x) := Z(x, y)/Z(x_0, y) is independent of y due to (2.6) with k = 2. Setting g(y) := Z(x_0, y)/f(x_0) then yields the (measurable) factorization Z(x, y) = f(x) g(y), and we conclude by Proposition A.1. Alternatively, the existence of a factorization can be deduced from (2.6) by the general result of [9, Theorem 3.3].

Remark A.2.
The above proof shows that if the cyclical invariance condition (2.4) holds for k = 2, then it already holds for arbitrary k ≥ 2.

B Uniqueness of Potentials
Definition B.1.
Let Γ ⊂ X × Y and Λ ⊂ X. We say that uniqueness of potentials holds on Λ if for any c-convex functions ψ_1, ψ_2 on X with Γ ⊂ ∂_c ψ_i, the difference ψ_1 − ψ_2 is constant on Λ.

We detail two classes of optimal transport problems where uniqueness of potentials holds. Connectedness of at least one marginal support is essential: uniqueness fails even for the simplest discrete problem, µ = ν = (δ_1 + δ_2)/2 with cost c(i, j) = 1_{i≠j}.

Proposition B.2.
Let X = R^d and µ ∼ L^d on spt µ, where L^d(∂ spt µ) = 0 and int spt µ is connected. Let Γ = spt π, where π ∈ Π(µ, ν) is an optimal transport for the continuous cost function c.

(a) Lipschitz cost:
Suppose that c(·, y) is differentiable for all y, and locally Lipschitz uniformly in y. Then uniqueness of potentials holds on spt µ, and in particular on proj_X Γ.

(b) Convex, superlinear cost:
Let Y = R^d and c(x, y) = h(y − x), where

(i) h: R^d → R is convex and differentiable,
(ii) h has superlinear growth: h(x)/‖x‖ → ∞ whenever ‖x‖ → ∞,
(iii) given r < ∞ and θ ∈ (0, π), and for p ∈ R^d sufficiently far from the origin, there is a cone of the form {x ∈ R^d : ‖x − p‖ ‖z‖ cos(θ/2) ≤ ⟨z, x − p⟩ ≤ r‖z‖} for some z ∈ R^d \ {0} on which h assumes its maximum at p.

Then uniqueness of potentials holds on proj_X Γ.

Remark B.3. (a) If c ∈ C¹(R^d × R^{d′}) and ν is compactly supported, we can always change c outside a neighborhood of spt µ × spt ν to satisfy the condition of (a), without affecting the set of optimal transports.

(b) The convex cost with superlinear growth is essentially the well-known setting of Gangbo and McCann [27]; cf. their hypotheses (H2)–(H4). The technical condition (iii) is implied by (ii) in the radial case h(x) = h̃(‖x‖); in particular, all the conditions are satisfied for c(x, y) = ‖y − x‖^p with p ∈ (1, ∞). In contrast to the main result of [27], h is not assumed to be strictly convex; strictness is required for uniqueness of optimal transports, but not for uniqueness of potentials. For instance, the "parabola with an affine piece," given by h(x) = h̃(‖x‖) with

h̃(t) = t² 1_{[0,1]}(t) + (2t − 1) 1_{(1,2)}(t) + (t² − 2t + 3) 1_{[2,∞)}(t),

satisfies all the assumptions in (b). The affine piece will lead to non-uniqueness of optimal transports for a large class of marginals in the one-dimensional case.

(c) Dual uniqueness may fail if c is not differentiable. For c(x, y) = |y − x| on R × R, the c-convex functions are exactly the 1-Lipschitz functions. If µ = ν is the Lebesgue measure on [0, 1], the identical transport π is optimal and any 1-Lipschitz function ψ satisfies Γ = {(x, x) : x ∈ [0, 1]} ⊂ ∂_c ψ.

The proof of the proposition is based on the following standard consideration (e.g., [27, Lemma 3.1]).

Lemma B.4.
Let Γ ⊂ X × Y and let ψ, φ be R-valued functions such that φ(y) − ψ(x) ≤ c(x, y) on X × Y and φ(y) − ψ(x) = c(x, y) on Γ. If X = R^d and (x, y) ∈ Γ are such that ψ and c(·, y) are differentiable at x, then ∇ψ(x) = −∇_x c(x, y). In particular, if c(·, y) is differentiable for all y ∈ Y, then ∇ψ(x) is uniquely determined for x ∈ proj_X Γ ∩ dom ∇ψ.

Proof. Let (x, y) ∈ Γ be as stated. Then

ψ(x) + ∇ψ(x)·h + o(‖h‖) = ψ(x + h) ≥ φ(y) − c(x + h, y) = ψ(x) + c(x, y) − c(x + h, y) = ψ(x) − ∇_x c(x, y)·h + o(‖h‖),

and hence ∇ψ(x) = −∇_x c(x, y) as the direction of h is arbitrary.

Lemma B.5.
Let Γ = spt π for some π ∈ Π(µ, ν). Then spt µ = cl(proj_X Γ).

Proof. Let (x, y) ∈ Γ; then µ(B_r(x)) = π(B_r(x) × Y) > 0 for all r > 0. This shows proj_X Γ ⊂ spt µ. Let x ∈ spt µ. As µ(B_r(x)) > 0, there must be some x′ ∈ B_r(x) with x′ ∈ proj_X Γ, and this holds for all r > 0. Hence, spt µ ⊂ cl(proj_X Γ).

Proof of Proposition B.2. We denote by dom ψ the set where a function ψ is finite and by dom ∇ψ the subset where it is differentiable.

(a) Let ψ be a c-convex function on X = R^d with Γ ⊂ ∂_c ψ. The local Lipschitz bound of c(·, y) implies the same bound for ψ. In particular, ψ is continuous and L^d-a.e. differentiable on dom ψ = R^d. The coupling property guarantees that proj_X Γ ⊂ spt µ has full µ-measure, hence also full L^d-measure. It follows that Λ := dom ∇ψ ∩ proj_X Γ ⊂ spt µ has full L^d-measure, and ∇ψ is uniquely determined on Λ by Lemma B.4. As ψ is locally Lipschitz and int spt µ is open and connected, this implies that ψ is uniquely determined (up to a constant) on int spt µ (see, e.g., [46, Formula 2]). By continuity on R^d, it is also determined on the closure, which equals spt µ due to L^d(∂ spt µ) = 0.

(b) In this setting, the local Lipschitz property will only hold within int spt µ and ψ need not be continuous (or even finite) up to the boundary. As we require uniqueness at all (rather than almost all) points x, we argue the boundary case in a second step.

Step 1.
We first show that uniqueness of potentials holds on int spt µ. It is proved in [27, Proposition C.3 and Corollary C.5] that for any c-convex function ψ there is a convex set K with int K ⊂ dom ψ ⊂ K, and that ψ is locally Lipschitz (hence L^d-a.e. differentiable) within int dom ψ. By convexity, int cl(K) = int K = int dom ψ. If Γ ⊂ ∂^c ψ, then proj_X Γ ⊂ dom ψ and hence

spt µ = cl(proj_X Γ) ⊂ cl(dom ψ) ⊂ cl(K),

showing that int spt µ ⊂ int cl(K) = int dom ψ. It follows that ψ is locally Lipschitz and L^d-a.e. differentiable on int spt µ. On the other hand, proj_X Γ has full µ-measure in int spt µ by the coupling property, hence also full L^d-measure. Thus Λ := dom ∇ψ ∩ proj_X Γ has full L^d-measure within int spt µ and we conclude as in (a).

Step 2.
Define X₀ := proj_X Γ ∩ int spt µ. Then Γ = cl(Γ₀) for

Γ₀ := {(x, y) : x ∈ X₀, y ∈ Γ_x},  (B.1)

where Γ_x denotes the section {y ∈ Y : (x, y) ∈ Γ}. Indeed, µ(X₀) = 1 as stated in Step 1, which implies π(Γ₀) = 1 and hence Γ ⊂ cl(Γ₀) by the definition of Γ = spt π. Conversely, Γ₀ ⊂ Γ is clear, and then cl(Γ₀) ⊂ Γ by closedness. Fix (x, y) ∈ Γ. By (B.1) we can find (x_n, y_n) ∈ Γ₀ with (x_n, y_n) → (x, y) and in particular

ψ^c(y) − ψ(x) = c(x, y) = lim c(x_n, y_n) = lim [ψ^c(y_n) − ψ(x_n)].

The c-convex functions −ψ^c and ψ are lower semicontinuous thanks to the continuity of c, so that ψ^c(y) ≥ lim sup ψ^c(y_n) and −ψ(x) ≥ lim sup −ψ(x_n). Together, it follows that ψ^c(y) = lim ψ^c(y_n) and ψ(x) = lim ψ(x_n). As x_n ∈ int spt µ, we know from Step 1 that ψ(x_n) is uniquely determined, and then so is ψ(x).

C Proof of Proposition 5.6
In this section, we discuss how to extend Proposition 5.5 to a general class of cost functions c satisfying the Ma–Trudinger–Wang condition "(Aw)" introduced in [38]; we use Loeper's equivalent geometric characterization [37] to generalize from the quadratic case. We recall that the dual representation (5.4) of I has been assumed.

A number of terms from c-convex analysis are needed. For ease of reference, we (mostly) follow the notation of [37], whose Section 2 also provides an excellent introduction to the notions used below. Consider a C^2 function c(x, y) on the product of two domains Ω, Ω′ ⊂ R^d and suppose that c satisfies the twist condition in both variables; i.e., ∇_x c(x, ·) and ∇_y c(·, y) are injective. Given x ∈ Ω, the c-exponential map T_x is defined by T_x = (−∇_x c(x, ·))^{−1}. A c-segment wrt. x is the image of a segment (in the usual sense) under the map T_x. The c-segment of y₀, y₁ ∈ Ω′ wrt. x is the image of the segment joining −∇_x c(x, y₀) and −∇_x c(x, y₁) under T_x. The set Ω′ is c-convex wrt. Ω if the c-segment of y₀, y₁ wrt. x is contained in Ω′ for all y₀, y₁ ∈ Ω′ and x ∈ Ω, or equivalently, if −∇_x c(x, Ω′) is convex for x ∈ Ω. Strict c-convexity means that, in addition, the interior of the c-segment is in the interior of Ω′. A proper function ψ : Ω → R ∪ {+∞} is c-convex if there exists ζ : Ω′ → [−∞, ∞] such that ψ(x) = sup_{y ∈ Ω′} [ζ(y) − c(x, y)]. The c-transform of ψ is defined by ψ^c(y) := inf_{x ∈ Ω} [c(x, y) + ψ(x)] for y ∈ Ω′ and its c-subdifferential at x is the set ∂^c ψ(x) = {y ∈ Ω′ : ψ^c(y) − ψ(x) = c(x, y)}. The function ψ is semiconvex if it is the sum of a convex function and a function of class C^{1,1}. Its (ordinary) subdifferential ∂ψ(x) at x ∈ Ω is

∂ψ(x) := {y ∈ R^d : ψ(x′) ≥ ψ(x) + ⟨y, x′ − x⟩ + o(‖x − x′‖), x′ ∈ Ω}.

Clearly ∂ψ(x) is convex.
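To make the c-exponential map and c-segments concrete, consider the quadratic cost c(x, y) = ½‖x − y‖²: then −∇_x c(x, y) = y − x, so T_x(p) = x + p and c-segments are ordinary segments. The following numerical sketch is our illustration, not part of the original argument; the points x, y₀, y₁ are arbitrary:

```python
import numpy as np

# Quadratic cost: c(x, y) = 0.5 * ||x - y||^2, so grad_x c(x, y) = x - y.
def grad_x_c(x, y):
    return x - y

# c-exponential map T_x = (-grad_x c(x, .))^{-1}; for the quadratic
# cost, -grad_x c(x, y) = y - x, hence T_x(p) = x + p.
def T(x, p):
    return x + p

x = np.array([1.0, -2.0])
y0 = np.array([0.0, 0.5])
y1 = np.array([3.0, 1.0])

# c-segment of y0, y1 wrt. x: the image under T_x of the segment
# joining -grad_x c(x, y0) and -grad_x c(x, y1).
p0, p1 = -grad_x_c(x, y0), -grad_x_c(x, y1)
ts = np.linspace(0.0, 1.0, 5)
c_segment = np.array([T(x, (1 - t) * p0 + t * p1) for t in ts])

# For the quadratic cost this is just the ordinary segment [y0, y1].
ordinary = np.array([(1 - t) * y0 + t * y1 for t in ts])
assert np.allclose(c_segment, ordinary)
```

For a non-quadratic cost such as the reflector-antenna cost mentioned below, T_x is no longer affine and the image of a segment is in general a genuine curve.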
Moreover, ∂ψ(x) coincides with the subdifferential of convex analysis if ψ is convex, and it satisfies an analogue of the cyclical monotonicity of convex analysis: adding up the defining inequalities shows

⟨y − y′, x − x′⟩ ≥ o(‖x − x′‖) for y ∈ ∂ψ(x), y′ ∈ ∂ψ(x′).  (C.1)

We shall use analogous notation for functions on Ω′ instead of Ω (a minor abuse of notation since c is then used with its variables exchanged).

Assumption C.1.
Let Ω, Ω′ be domains in R^d with X ⊂ Ω and Y ⊂ Ω′, and let c ∈ C^2 satisfy the twist condition in both variables. Moreover, let X be strictly c-convex wrt. Y and let Ω′ be c-convex wrt. Ω. Finally, we assume that any c-convex function ψ on Ω is locally semiconvex and satisfies

−∇_x c(x, ∂^c ψ(x)) = ∂ψ(x),  (C.2)

and that the analogue holds for functions on Ω′.

The main condition is (C.2). As ∂ψ(x) is convex, it implies in particular that ∂^c ψ(x) is c-convex. (The converse implication also holds; see [37]. Note that our notation differs slightly from [37], where ∂^c ψ(x) denotes what is −∇_x c(x, ∂^c ψ(x)) in our notation.) It is shown in [37] how (C.2) can be deduced from the (Aw) condition when the domains are bounded, sufficiently c-convex and c ∈ C^4. Local semiconvexity of c-convex functions can be ensured by comparably mild conditions on the data, see for instance [37, Proposition 2.2] or [27, Corollary C.5]. Apart from the quadratic cost, another classical example treated in [37] is the reflector-antenna cost c(x, y) = −log ‖x − y‖. See also [52] for further background.

Proof of Proposition 5.6. Step 1: Generalization of Lemma 5.4.
This extension is straightforward: using the same notation as in the proof of Lemma 5.4, we again have x, x′ ∈ {I(·, y) = 0} = ∂^c(−ψ^c)(y). The latter set is c-convex by Assumption C.1, hence contains the c-segment of x, x′ wrt. y. The interior of the segment is contained in int X by strict c-convexity, and it includes points from the neighborhood where I was assumed to be positive, a contradiction.

Step 2: Generalization of Proposition 5.5.
Let (x, y) ∈ X × Y be such that I(x, y) = 0. In view of Step 1, it again suffices to treat the case x ∈ int X. Moreover, as the c-convex function ψ is semiconvex by our assumption, it still holds that ∂ψ(x) is the closed convex hull of S(x) as defined in (5.6). The proofs for Case 1 and Case 2 carry over by simply replacing ∂ψ(x) with ∂^c ψ(x) and ∇ψ(x) with T_x(∇ψ(x)). In Case 3, the proof of (5.7) also carries over using semiconvexity. The arguments around (5.8) can be adapted as follows: Let φ := −ψ^c and x′ ∈ ∂φ(y). Then the cyclical monotonicity property (C.1) of ∂φ implies ⟨x′ − x, y − y_i⟩ ≥ o(‖x′ − x‖) for all i. In view of (5.8), it now follows that ⟨x′ − x, y − y_i⟩ = o(‖x′ − x‖) for all i, but noting that the convex set ∂φ(y) contains the segment [x′, x], this already implies that ⟨x′ − x, y − y_i⟩ = 0 for all i. The remainder of the proof is identical.

References

[1] S. Adams, N. Dirr, M. A. Peletier, and J. Zimmer. From a large-deviations principle to the Wasserstein gradient flow: a new micro-macro passage.
Comm. Math. Phys., 307(3):791–815, 2011.
[2] J. Backhoff-Veraguas, M. Beiglböck, and G. Conforti. A non-linear monotonicity principle and application to the Schrödinger problem. Preprint arXiv:2101.09975v1, 2021.
[3] S. Bartz and S. Reich. Abstract convex optimal antiderivatives. Ann. Inst. H. Poincaré Anal. Non Linéaire, 29(3):435–454, 2012.
[4] M. Beiglböck, A. M. G. Cox, and M. Huesmann. Optimal transport and Skorokhod embedding. Invent. Math., 208(2):327–400, 2017.
[5] M. Beiglböck and N. Juillet. On a problem of optimal transport under marginal martingale constraints. Ann. Probab., 44(1):42–106, 2016.
[6] M. Beiglböck, M. Nutz, and F. Stebegg. Fine properties of the optimal Skorokhod embedding problem. To appear in J. Eur. Math. Soc. (JEMS).
[7] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput., 37(2):A1111–A1138, 2015.
[8] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. Preprint arXiv:1810.07717v1, 2018.
[9] J. M. Borwein and A. S. Lewis. Decomposition of multivariate functions. Canad. J. Math., 44(3):463–482, 1992.
[10] J. M. Borwein, A. S. Lewis, and R. D. Nussbaum. Entropy minimization, DAD problems, and doubly stochastic kernels. J. Funct. Anal., 123(2):264–307, 1994.
[11] G. Carlier, V. Duval, G. Peyré, and B. Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM J. Math. Anal., 49(2):1385–1418, 2017.
[12] V. Chernozhukov, A. Galichon, M. Hallin, and M. Henry. Monge-Kantorovich depth, quantiles, ranks and signs. Ann. Statist., 45(1):223–256, 2017.
[13] G. Clerc, G. Conforti, and I. Gentil. Long-time behaviour of entropic interpolations. Preprint arXiv:2007.07594v1, 2020.
[14] R. Cominetti and J. San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Math. Programming, 67(2, Ser. A):169–187, 1994.
[15] G. Conforti and L. Tamanini. A formula for the time derivative of the entropic cost and applications. Preprint arXiv:1912.10555v1, 2019.
[16] D. Cordero-Erausquin and A. Figalli. Regularity of monotone transport maps between unbounded domains. Discrete Contin. Dyn. Syst., 39(12):7101–7112, 2019.
[17] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probability, 3:146–158, 1975.
[18] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[19] M. Cuturi, O. Teboul, and J.-P. Vert. Differentiable ranks and sorting using optimal transport. Preprint arXiv:1905.11885v1, 2019.
[20] N. Deb and B. Sen. Multivariate rank-based distribution-free nonparametric testing using measure transportation. Preprint arXiv:1909.08733v1, 2019.
[21] E. del Barrio, J. A. Cuesta-Albertos, M. Hallin, and C. Matrán. Distribution and quantile functions, ranks, and signs in R^d: a measure transportation approach. Preprint arXiv:1806.01238v1, 2018.
[22] S. Di Marino and J. Louet. The entropic regularization of the Monge problem on the real line. SIAM J. Math. Anal., 50(4):3451–3477, 2018.
[23] M. H. Duong, V. Laschos, and M. Renger. Wasserstein gradient flows from large deviations of many-particle limits. ESAIM Control Optim. Calc. Var., 19(4):1166–1188, 2013.
[24] M. Erbar, J. Maas, and D. R. M. Renger. From large deviations to Wasserstein gradient flows in multiple dimensions. Electron. Commun. Probab., 20(89), 2015.
[25] H. Föllmer. Random fields and diffusion processes. In École d'Été de Probabilités de Saint-Flour XV–XVII, 1985–87, volume 1362 of Lecture Notes in Math., pages 101–203. Springer, Berlin, 1988.
[26] H. Föllmer and N. Gantert. Entropy minimization and Schrödinger processes in infinite dimensions. Ann. Probab., 25(2):901–926, 1997.
[27] W. Gangbo and R. J. McCann. The geometry of optimal transportation. Acta Math., 177(2):113–161, 1996.
[28] A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, PMLR, pages 1608–1617, 2018.
[29] P. Ghosal and B. Sen. Multivariate ranks and quantiles using optimal transportation and applications to goodness-of-fit testing. Preprint arXiv:1905.05340v1, 2019.
[30] C. T. Ireland and S. Kullback. Contingency tables with given marginals. Biometrika, 55(1):179–188, 1968.
[31] C. Léonard. Minimization of energy functionals applied to some inverse problems. Appl. Math. Optim., 44(3):273–297, 2001.
[32] C. Léonard. Minimizers of energy functionals. Acta Math. Hungar., 93(4):281–325, 2001.
[33] C. Léonard. From the Schrödinger problem to the Monge-Kantorovich problem. J. Funct. Anal., 262(4):1879–1920, 2012.
[34] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Contin. Dyn. Syst., 34(4):1533–1574, 2014.
[35] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Invent. Math., 211(3):969–1117, 2018.
[36] T. Lin, N. Ho, and M. Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of PMLR, pages 3982–3991, 2019.
[37] G. Loeper. On the regularity of solutions of optimal transportation problems. Acta Math., 202(2):241–283, 2009.
[38] X.-N. Ma, N. S. Trudinger, and X.-J. Wang. Regularity of potential functions of the optimal transportation problem. Arch. Ration. Mech. Anal., 177(2):151–183, 2005.
[39] R. J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Math. J., 80(2):309–323, 1995.
[40] G. Mena and J. Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems 32, pages 4541–4551, 2019.
[41] T. Mikami. Optimal control for absolutely continuous stochastic processes and the mass transportation problem. Electron. Comm. Probab., 7:199–213, 2002.
[42] T. Mikami. Monge's problem with a quadratic cost by the zero-noise limit of h-path processes. Probab. Theory Related Fields, 129(2):245–260, 2004.
[43] M. Nutz. Lectures on Entropic Optimal Transport. Lecture notes, Columbia University, 2020.
[44] S. Pal. On the difference between entropic cost and the optimal transport cost. Preprint arXiv:1905.12206v1, 2019.
[45] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[46] L. Qi. The maximal normal operator space and integration of subdifferentials of nonconvex functions. Nonlinear Anal., 13(9):1003–1011, 1989.
[47] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Vol. I: Theory. Probability and its Applications (New York). Springer-Verlag, New York, 1998.
[48] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Vol. II: Applications. Probability and its Applications (New York). Springer-Verlag, New York, 1998.
[49] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[50] L. Rüschendorf. Convergence of the iterative proportional fitting procedure. Ann. Statist., 23(4):1160–1174, 1995.
[51] L. Rüschendorf and W. Thomsen. Closedness of sum spaces and the generalized Schrödinger problem. Teor. Veroyatnost. i Primenen., 42(3):576–590, 1997.
[52] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.
[53] J. Weed. An explicit analysis of the entropic penalty in linear programming. Volume 75 of Proceedings of Machine Learning Research, pages 1841–1855, 2018.