A Consistent Extension of Discrete Optimal Transport Maps for Machine Learning Applications
Lucas De Lara (1)∗, Alberto González-Sanz (2)∗, and Jean-Michel Loubes (3)∗
IMT, Université de Toulouse III, France
(1) [email protected] (2) alberto.gonzalez [email protected] (3) [email protected]

ABSTRACT
Optimal transport maps define a one-to-one correspondence between probability distributions, and as such have grown popular for machine learning applications. However, these maps are generally defined on empirical observations and cannot be generalized to new samples while preserving asymptotic properties. We extend a novel method to learn a consistent estimator of a continuous optimal transport map from two empirical distributions. The consequences of this work are two-fold: first, it makes it possible to extend the transport plan to new observations without recomputing the discrete optimal transport map; second, it provides statistical guarantees for machine learning applications of optimal transport. We illustrate the strength of this approach by deriving a consistent framework for transport-based counterfactual explanations in fairness.
Keywords:
Optimal Transport, Counterfactuals, Explainability, Fairness.
Over the last few years, Optimal Transport (OT) methods have grown popular for machine learning applications. Signal analysis [Kolouri et al., 2017], domain adaptation [Courty et al., 2017], transfer learning [Gayraud et al., 2017] or fairness in machine learning [Jiang et al., 2020, Gordaliza et al., 2019], for instance, have proposed new methods that make use of optimal transport maps. Given two distributions $\mu_0$ and $\mu_1$ satisfying some assumptions, such a map $T$ has the property of pushing forward one measure to the other, in the sense that if a random variable $X$ follows the distribution $\mu_0$, then its image $T(X)$ follows the distribution $\mu_1$. This map comes as a tool to transform the distribution of observations.

However, since only empirical distributions are observed, the continuous optimal transport problem is replaced by an empirical one. Optimal transport between empirical distributions has been widely studied from both a theoretical and a computational point of view; we refer for instance to Peyré et al. [2019] and references therein. The resulting empirical maps between observations suffer from important drawbacks when implementing machine learning methods relying on OT. As they are one-to-one correspondences between the points used to compute the optimal transport, they are only defined on these observations, preventing their use on new inputs.

To cope with this issue, either the map must be recomputed for each new data set, or one must use a continuous approximation extending the empirical map to observations outside the support of the empirical distribution. Previous research on the latter topic includes GAN approximations of the OT map [Black et al., 2020] and Monte-Carlo approximations of the dual parameters [Chiappa and Pacchiano, 2021]. However, these methods do not provide consistent estimators, in the sense that the obtained transport plans are not asymptotically close to the continuous OT map as the sample size increases.

In this paper, we propose to fill the gap between continuous and empirical transport by considering a statistically consistent interpolation of the OT map for discrete measures. On the basis of the interpolation provided in del Barrio et al. [2020a], we generalize their results and prove that it is possible to learn from empirical observations an OT map suitable for machine learning methods. We then use this interpolation to derive the first consistent framework for empirically-based counterfactual explanations to audit the fairness of binary classifiers, extending the work in Black et al. [2020].

∗ Research partially supported by the AI Interdisciplinary Institute ANITI, which is funded by the French "Investing for the Future – PIA3" program under the Grant agreement ANR-19-PI3A-0004.
Let $\mu_0$ and $\mu_1$ be two unknown probability measures on $\mathbb{R}^d$ whose respective supports are denoted by $\mathcal{X}_0$ and $\mathcal{X}_1$. In this section, we address the problem of learning the optimal transport map between $\mu_0$ and $\mu_1$ from data points. Let $\|\cdot\|$ denote the Euclidean norm associated with the scalar product $\langle\cdot,\cdot\rangle$. The optimal transport map between $\mu_0$ and $\mu_1$ with respect to the squared Euclidean cost is defined as the solution to the following Monge problem:

$$\min_{T\,:\,T\sharp\mu_0=\mu_1} \int_{\mathbb{R}^d} \|x - T(x)\|^2 \, d\mu_0(x), \tag{1}$$

where $T\sharp\mu_0 = \mu_1$ denotes that $T$ pushes forward $\mu_0$ to $\mu_1$, namely $\mu_1(B) := \mu_0(T^{-1}(B))$ for any measurable set $B \subset \mathbb{R}^d$. Suppose that $\mu_0$ is absolutely continuous with respect to the Lebesgue measure $\ell^d$ on $\mathbb{R}^d$, and that both $\mu_0$ and $\mu_1$ have finite second order moments. Theorem 2.12 in Villani [2003] states that there exists a unique solution to (1), $T_0 : \mathcal{X}_0 \to \mathbb{R}^d$, called the Brenier map. This map coincides $\mu_0$-almost surely with the gradient of a convex function, and in consequence has a cyclically monotone graph. Recall that a set $S \subset \mathbb{R}^d \times \mathbb{R}^d$ is cyclically monotone if any finite collection $\{(x_k, y_k)\}_{k=1}^N \subset S$ satisfies

$$\sum_{k=1}^{N-1} \langle y_k, x_{k+1} - x_k\rangle + \langle y_N, x_1 - x_N\rangle \le 0.$$

Such a set is contained in the graph of the subdifferential of a convex function, see [Rockafellar, 1970]. The subdifferential at a point $x \in \mathbb{R}^d$ of a convex function $\psi$ is defined as the set $\partial\psi(x) := \{y \in \mathbb{R}^d \mid \forall z \in \mathbb{R}^d,\ \psi(z) - \psi(x) \ge \langle y, z - x\rangle\}$. We say that a multivalued map $F : \mathbb{R}^d \to \mathbb{R}^d$ is cyclically monotone if its graph is.

In a practical setting, we only have access to samples from $\mu_0$ and $\mu_1$, and consequently we cannot solve (1). However, we can compute a discrete optimal transport map between the empirical measures. Consider two $n$-samples $\{x_1^0, \dots, x_n^0\}$ and $\{x_1^1, \dots, x_n^1\}$ respectively drawn from $\mu_0$ and $\mu_1$. They define the empirical measures

$$\mu_n^0 := \frac{1}{n}\sum_{k=1}^n \delta_{x_k^0} \quad\text{and}\quad \mu_n^1 := \frac{1}{n}\sum_{k=1}^n \delta_{x_k^1}.$$

The discrete Monge problem between $\mu_n^0$ and $\mu_n^1$ is

$$\min_{T_n} \frac{1}{n}\sum_{k=1}^n \|x_k^0 - T_n(x_k^0)\|^2, \tag{2}$$

where the minimum is taken over all bijections $T_n$ from $\{x_i^0\}_{i=1}^n$ to $\{x_i^1\}_{i=1}^n$. Problem (2) defines a unique solution $T_n$, referred to as the discrete optimal transport map between the two samples. This solution is such that $\{(x_k^0, T_n(x_k^0))\}_{k=1}^n$ is cyclically monotone.

In this paper, we focus on the problem of estimating the optimal transport map $T_0$ solving (1). As mentioned in the introduction, the solution $T_n$ to (2) is not a suitable estimator because it has finite input and output spaces, whereas $T_0$ maps the whole domains. As a consequence, the empirical map cannot generalize to new observations. This limitation triggered the need for regularized approaches: a topic we explore next.
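Before turning to regularized extensions, note that problem (2) itself is a standard linear assignment problem. As a purely illustrative sketch (not the authors' implementation), it can be solved with an off-the-shelf assignment solver applied to the matrix of squared Euclidean costs; the function and variable names below are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def discrete_ot_map(x0, x1):
    """Solve the discrete Monge problem (2) between two n-samples.

    x0, x1: arrays of shape (n, d). Returns a permutation sigma such that
    the assignment x0[i] -> x1[sigma[i]] minimises the total squared cost."""
    # Cost matrix C[i, j] = ||x0_i - x1_j||^2.
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=2)
    row, col = linear_sum_assignment(cost)   # Hungarian-type solver, O(n^3)
    return col[np.argsort(row)]

rng = np.random.default_rng(0)
n, d = 200, 2
sample_x0 = rng.normal(loc=-1.0, size=(n, d))   # draws from mu_0
sample_x1 = rng.normal(loc=+1.0, size=(n, d))   # draws from mu_1
sigma = discrete_ot_map(sample_x0, sample_x1)
print(sample_x1[sigma][:3])   # images T_n(x0_i) of the first three points
```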
The heuristic approximation of a continuous OT map proposed in Black et al. [2020] handles new observations and has a satisfying empirical behaviour, but is not guaranteed to converge to the true OT map as the sample size increases. The problem of constructing an approximation able to generalize to new observations while being statistically consistent crucially raises the question of which properties of continuous optimal transport must be preserved by the empirical estimator.

Recall that in one dimension the continuous optimal transport map between two probability measures is a non-decreasing function $T$ such that $T\sharp\mu_0 = \mu_1$. Natural extensions and regularizations are then made by preserving that property. In several dimensions, cyclical monotonicity substitutes for the non-decreasing property. For the purpose of generalizing the notion of distribution function to higher dimensions, del Barrio et al. [2020a] designed such an extension of $T_n$ that converges to $T_0$ as the sample size increases. We briefly present the construction hereafter, and refer to del Barrio et al. [2020a] for further details.

The idea is to extend the discrete map $T_n : \{x_i^0\}_{i=1}^n \to \{x_i^1\}_{i=1}^n$ to a continuous map $\mathcal{T}_n : \mathbb{R}^d \to \mathbb{R}^d$ by regularizing a piece-wise constant approximation of $T_n$. The first step consists in solving (2) and permuting the observations so that for every $i \in \{1, \dots, n\}$, $T_n(x_i^0) = x_i^1$. Once the samples are aligned, we look for the parameters $\varepsilon$ and $\psi \in \mathbb{R}^n$ defined as the solutions to the linear program

$$\max_{\psi \in \mathbb{R}^n,\, \varepsilon \in \mathbb{R}} \varepsilon \quad \text{s.t.} \quad \langle x_i^1, x_i^0 - x_j^0\rangle \ge \psi_i - \psi_j + 2\varepsilon, \quad i \ne j. \tag{3}$$

Recall that $\{(x_i^0, x_i^1)\}_{i=1}^n$ is cyclically monotone, and consequently is contained in the graph of the subdifferential of some convex function. Since this is a finite set, there exist several convex functions satisfying this property. For any of them, denoted by $\varphi_n$, its convex conjugate $\varphi_n^* := \sup_{z\in\mathbb{R}^d}\{\langle z, \cdot\rangle - \varphi_n(z)\}$ is such that $\varphi_n^*(x_i^0) - \varphi_n^*(x_j^0) \le \langle x_i^1, x_i^0 - x_j^0\rangle$. The idea behind (3) is to find the most regular candidate convex function $\varphi_n$ by maximizing the strict convexity of $\varphi_n^*$. Proposition 3.1 in del Barrio et al. [2020a] implies that (3) is feasible. In practice, we solve (3) by applying Karp's algorithm [Karp, 1978] to its dual formulation:

$$\min_{z_{i,j}\,:\,i\ne j} \sum_{i,j\,:\,i\ne j} z_{i,j}\,\langle x_i^1, x_i^0 - x_j^0\rangle \quad \text{s.t.} \quad \sum_{j\,:\,j\ne i} (z_{i,j} - z_{j,i}) = 0, \quad \sum_{i,j\,:\,i\ne j} z_{i,j} = 1, \quad z_{i,j} \ge 0, \quad i, j = 1, \dots, n. \tag{4}$$

Next, define the following convex function:

$$\tilde\varphi_n(x) := \max_{1\le i\le n}\big\{\langle x, x_i^1\rangle - \psi_i\big\}. \tag{5}$$

Note that $\nabla\tilde\varphi_n$, wherever it is well-defined, is a piece-wise constant interpolation of $T_n$. To obtain a regular interpolation defined everywhere and preserving the cyclical monotonicity, we consider the Moreau-Yosida regularization of $\tilde\varphi_n$ given by

$$\varphi_n(x) := \inf_{z\in\mathbb{R}^d}\Big\{\tilde\varphi_n(z) + \frac{1}{2\varepsilon}\|z - x\|^2\Big\}.$$

Such a regularization is differentiable everywhere. Then, the mapping from $\mathbb{R}^d$ to $\mathbb{R}^d$ defined as $\mathcal{T}_n := \nabla\varphi_n$ satisfies the following properties:

1. $\mathcal{T}_n$ is continuous,
2. $\mathcal{T}_n$ is cyclically monotone,
3. for all $i \in \{1, \dots, n\}$, $\mathcal{T}_n(x_i^0) = x_i^1 = T_n(x_i^0)$,
4. for all $x \in \mathbb{R}^d$, $\mathcal{T}_n(x)$ belongs to the convex hull of $\{x_1^1, \dots, x_n^1\}$.

A more explicit expression of $\mathcal{T}_n$ can be derived using the gradient formula of Moreau-Yosida regularizations. For $g : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ a proper convex lower-semicontinuous function, the proximal operator of $g$ is defined on $\mathbb{R}^d$ by

$$\operatorname{prox}_g(x) := \arg\min_{z\in\mathbb{R}^d}\Big\{g(z) + \frac{1}{2}\|z - x\|^2\Big\}. \tag{6}$$

Note that it is well-defined since the minimized function is strictly convex. Then, according to Theorem 2.26 in Rockafellar and Wets [2009], we have

$$\mathcal{T}_n(x) = \frac{1}{\varepsilon}\big(x - \operatorname{prox}_{\varepsilon\tilde\varphi_n}(x)\big). \tag{7}$$

The interpolation at each new input $x$ is numerically computed by solving the optimization problem defining $\operatorname{prox}_{\varepsilon\tilde\varphi_n}(x)$. As a consequence, generalizing with $\mathcal{T}_n$ is not computationally free, as we must compute proximal operators. Let us benchmark this approach against classical discrete OT.

Suppose for instance that after constructing $\mathcal{T}_n$ on $\{x_i^0\}_{i=1}^n$ and $\{x_i^1\}_{i=1}^n$ we must generalize the OT map to a new sample $\{\bar x_i^0\}_{i=1}^m \sim \mu_0$ such that $m \le n$. Without additional observations from $\mu_1$, we are limited to: computing for each $\bar x_i^0$ its closest counterpart in $\{x_j^1\}_{j=1}^n$, which would deviate from optimal transport; or computing the OT map between $\{\bar x_i^0\}_{i=1}^m$ and an $m$-subsample of $\{x_j^1\}_{j=1}^n$, which would be greedy. With an additional sample $\{\bar x_i^1\}_{i=1}^m$ from $\mu_1$, we could upgrade $T_n$ to a $T_{n+m}$ by recomputing the empirical OT map between the $(n+m)$-samples. However, this would cost $O((n+m)^3)$ in computer time, require new observations, and not be a natural extension of $T_n$. On the other hand, building the interpolation $\mathcal{T}_n$ with Karp's algorithm has a running-time complexity of $O(n^3)$: the same order as for $T_n$. Then, to generalize the transport to $\{\bar x_i^0\}_{i=1}^m$ with $\mathcal{T}_n$, we must solve $m$ optimization problems, one for each $\operatorname{prox}_{\varepsilon\tilde\varphi_n}(\bar x_i^0)$. As this amounts to minimizing a function which is Lipschitz with constant $\max_{1\le i\le n}\|x_i^1\| + \varepsilon^{-1}$ and strongly convex with constant $\varepsilon^{-1}$, an $\epsilon$-optimal solution can be obtained in $O(\epsilon^{-1})$ steps with a subgradient descent [Bubeck, 2017]. Since evaluating $\partial\tilde\varphi_n$ at each step of the descent costs $n$ operations, computing the transport interpolation of an $m$-sample with precision $\epsilon$ has a computational complexity of order $O(mn\epsilon^{-1})$. Note also that this method is hyper-parameter free, and as such is more convenient than prior regularized approaches.
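As a purely illustrative sketch of this construction (not the authors' code), the snippet below solves the linear program (3) with a generic LP solver rather than Karp's algorithm on the dual (4), and evaluates formula (7) with a plain subgradient descent on the strongly convex proximal objective. All function names are ours, and the usage lines reuse the aligned samples of the previous sketch.

```python
import numpy as np
from scipy.optimize import linprog

def fit_interpolation(x0, x1):
    """Solve the linear program (3) for (psi, eps), given aligned samples:
    x0[i] is matched to x1[i] by the discrete OT map. Dense formulation,
    fine for moderate n; the paper uses Karp's algorithm on the dual (4)."""
    n = len(x0)
    rows, rhs = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            row = np.zeros(n + 1)
            row[i], row[j], row[-1] = 1.0, -1.0, 2.0     # psi_i - psi_j + 2*eps
            rows.append(row)
            rhs.append(float(x1[i] @ (x0[i] - x0[j])))   # <x1_i, x0_i - x0_j>
    c = np.zeros(n + 1)
    c[-1] = -1.0                                          # maximise eps
    # psi is defined up to an additive constant: fix psi_0 = 0.
    bounds = [(0.0, 0.0)] + [(None, None)] * n
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[-1]                           # psi, eps

def interpolated_map(x, x1, psi, eps, iters=2000):
    """Evaluate the smooth interpolation at a new point x through (7):
    (x - prox_{eps * phi_tilde}(x)) / eps, with phi_tilde as in (5).
    The prox is computed by subgradient descent on the 1-strongly convex
    objective z -> eps * phi_tilde(z) + 0.5 * ||z - x||^2."""
    z = x.copy()
    for t in range(1, iters + 1):
        i_star = np.argmax(x1 @ z - psi)                  # active piece of (5)
        subgrad = eps * x1[i_star] + (z - x)
        z -= subgrad / t                                  # 1/t step for strong convexity
    return (x - z) / eps

# Usage, continuing the previous sketch:
#   psi, eps = fit_interpolation(sample_x0, sample_x1[sigma])
#   y_new = interpolated_map(np.zeros(2), sample_x1[sigma], psi, eps)
```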
In addition, the obtained map is a statistically relevant estimator: we show hereafter that the theoretical interpolation (7) converges to the continuous OT map under mild assumptions. We provide an extension of Proposition 3.3 in del Barrio et al. [2020a]. While the original result ensures the convergence of the interpolation $\mathcal{T}_n$ to $T_0$ in the case where $\mu_0$ is the spherical uniform law over the $d$-dimensional open unit ball, we prove that the consistency holds in more general settings.

Theorem 1.
Let $\mathring{\mathcal{X}}_0$ and $\mathring{\mathcal{X}}_1$ be the respective interiors of $\mathcal{X}_0$ and $\mathcal{X}_1$, and $T_0$ the optimal transport map between $\mu_0$ and $\mu_1$. The following hold:

1. Assume that $\mathcal{X}_0$ is convex and that $\mu_0$ has positive density on its interior. Then, for $\mu_0$-almost every $x$,
$$\mathcal{T}_n(x) \xrightarrow[n\to\infty]{a.s.} T_0(x).$$

2. Additionally assume that $T_0$ is continuous on $\mathring{\mathcal{X}}_0$, and that $\mathcal{X}_1$ is compact. Then, for any compact set $C$ of $\mathbb{R}^d$,
$$\sup_{x\in C}\|\mathcal{T}_n(x) - T_0(x)\| \xrightarrow[n\to\infty]{a.s.} 0.$$
In particular, provided that $\mathcal{X}_0$ is compact, the convergence is uniform on the support.

3. Further assume that $\mathcal{X}_1$ is a strictly convex set; then
$$\sup_{x\in\mathbb{R}^d}\|\mathcal{T}_n(x) - T_0(x)\| \xrightarrow[n\to\infty]{a.s.} 0.$$

The proof falls naturally into three parts, each one dedicated to one of the points of Theorem 1. The first point is a consequence of Theorem 2.8 in Del Barrio et al. [2019] and Theorem 25.7 in Rockafellar [1970], which entail that the convergence of $\{\varphi_n\}_{n\in\mathbb{N}}$ to $\varphi_0$ extends to their gradients. The proofs of the second and third points follow the guidelines of the one in del Barrio et al. [2020a]. The idea is to replace the unit ball by a compact set, and then a strictly convex set. We refer to the appendix for a complete description of this proof, as well as for all the other theoretical claims introduced in this paper.

Remark 1.
We briefly discuss the assumptions of Theorem 1. Thanks to a recent work [González-Sanz et al., 2021], the convexity of $\mathcal{X}_0$ can be relaxed to having a connected support with negligible boundary. Note that the second and third points of this theorem require a continuous optimal transport map $T_0$ to ensure the uniform convergence of the estimator. Caffarelli's theory [Caffarelli, 1990, 1991, 1992, Figalli, 2017] provides sufficient conditions for this to hold. Suppose that $\mathcal{X}_0$ and $\mathcal{X}_1$ are compact and convex, and that $\mu_0$ and $\mu_1$ respectively admit $f_0$ and $f_1$ as density functions. If there exist $\Lambda \ge \lambda > 0$ such that for all $x \in \mathcal{X}_0$, $y \in \mathcal{X}_1$,
$$\lambda \le f_0(x),\ f_0(x)^{-1},\ f_1(y),\ f_1(y)^{-1} \le \Lambda,$$
then $T_0$ is continuous. For the non-compact cases, some results can be found in Figalli and Kim [2010], del Barrio et al. [2020b], Cordero-Erausquin and Figalli [2019].

In this section, we focus on the problem of repairing and auditing the bias of a trained binary classifier. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. The random vector $X : \Omega \to \mathbb{R}^d$ represents the observed features, while the random variable $S : \Omega \to \{0, 1\}$ encodes the observed sensitive or protected attribute, which divides the population into a supposedly disadvantaged class $S = 0$ and a default class $S = 1$. The random variable $S$ is supposed to be non-degenerate. The two measures $\mu_0$ and $\mu_1$ are respectively defined as $\mathcal{L}(X \mid S = 0)$ and $\mathcal{L}(X \mid S = 1)$. The predictor is defined as $\hat Y := h(X, S)$, where $h : \mathbb{R}^d \times \{0,1\} \to \{0,1\}$ is deterministic. We consider a setting in which $\hat Y = 1$ and $\hat Y = 0$ respectively represent a favorable and a disadvantageous outcome.

The standard way to deal with fairness in machine learning is to measure it by introducing fairness criteria. Among them, the disparate impact (DI) has received particular attention to determine whether a binary decision discriminates against a minority corresponding to $S = 0$; see for instance Zafar et al. [2017]. This corresponds to the notion of statistical parity introduced in Dwork et al. [2012]. For a classifier $h$ with values in $\{0,1\}$, set $DI(h, X, S)$ as
$$\frac{\min\big(\mathbb{P}(h(X,S)=1 \mid S=0),\ \mathbb{P}(h(X,S)=1 \mid S=1)\big)}{\max\big(\mathbb{P}(h(X,S)=1 \mid S=1),\ \mathbb{P}(h(X,S)=1 \mid S=0)\big)}.$$
This criterion is close to 1 when statistical parity is ensured, while the smaller the disparate impact, the stronger the discrimination against the minority group. Obtaining fair predictors can be achieved by several means, one consisting in pre-processing the data by modifying the distribution of the inputs. Originally inspired by Feldman et al. [2015], this method, proved in Gordaliza et al. [2019], consists in removing from the data the dependency with respect to the sensitive variable. This can be achieved by constructing two optimal transport maps, $T_0$ and $T_1$, satisfying $T_0\sharp\mu_0 = \mu_B$ and $T_1\sharp\mu_1 = \mu_B$, where $\mu_B$ is the Wasserstein barycenter of $\mu_0$ and $\mu_1$. The algorithm is then trained on the dataset of the modified observations following the distribution of the barycenter, which guarantees that $h\big(T_S(X), S\big)$ satisfies the statistical parity fairness criterion.
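As a small illustrative sketch (function, variable names and data are ours), the disparate impact of a black-box classifier can be estimated from a labeled sample as follows.

```python
import numpy as np

def disparate_impact(h, x, s):
    """Empirical disparate impact of a binary classifier h(x, s) in {0, 1}:
    ratio of the smaller to the larger positive rate across groups s=0 and s=1."""
    yhat = h(x, s)
    p0 = yhat[s == 0].mean()   # estimates P(h = 1 | S = 0)
    p1 = yhat[s == 1].mean()   # estimates P(h = 1 | S = 1)
    return min(p0, p1) / max(p0, p1)

# toy example: a rule that only looks at the second coordinate
rng = np.random.default_rng(1)
s = rng.integers(0, 2, size=1000)
x = rng.normal(size=(1000, 2)) + np.where(s[:, None] == 0, -1.0, 1.0)
h = lambda x, s: (x[:, 1] > 0).astype(int)
print(disparate_impact(h, x, s))   # well below 1: statistical parity fails
```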
Using the estimator we propose in this work makes it possible to compute, for any new observation $(x, s)$, a prediction $h(\mathcal{T}_{n,s}(x), s)$ with theoretical guarantees. Note that the same framework applies when considering post-processing of the outcomes of estimators (or of the scores of classifiers), which are then pushed towards a fair representative. We refer for instance to Le Gouic et al. [2020] for the regression case or to Chiappa et al. [2020] for the classification case.

A sharper approach to fairness is to explain the discriminatory behaviour of the predictor. Black et al. [2020] started laying out the foundations of auditing binary decision rules with transport-based mappings. Their Flip Test is an auditing technique for uncovering discrimination against protected groups in black-box classifiers. It is based on two types of objects: the Flip Sets, which list instances whose output changes had they belonged to the other group, and the Transparency Reports, which rank the features that are associated with a disparate treatment across groups. Crucially, building these objects requires assessing counterfactuals, statements on potential outcomes had a certain event occurred [Lewis, 1973]. The machine learning community has mostly focused on two divergent frameworks for computing counterfactuals: the nearest counterfactual instances principle, which models transformations as minimal translations [Wachter et al., 2017], and Pearl's causal reasoning, which designs alternative states of things through surgeries on a causal model [Pearl et al., 2016]. While the former implicitly assumes that the covariates are independent, hence fails to provide faithful explanations, the latter requires a fully specified causal model, which is a very strong assumption in practice. To address these shortcomings, Black et al. [2020] proposed substituting causal reasoning by matching the two groups with a one-to-one mapping $T : \mathbb{R}^d \to \mathbb{R}^d$, for instance an optimal transport map. However, because the GAN approximation they use for the OT map does not come with convergence guarantees, their framework for explainability fails to be statistically consistent. We fix this issue next. More precisely, after presenting this framework, we show that natural estimators of an optimal transport map, such as the interpolation introduced in Section 2.2, lead to consistent explanations as the sample size increases.

In contrast to Black et al. [2020], we present the framework from a non-empirical viewpoint. The following definitions depend on the choice of the binary classifier $h$ and the mapping $T$.

Definition 1.
For a given binary classifier $h$ and a measurable function $T : \mathbb{R}^d \to \mathbb{R}^d$, we define

• the Flip Set as the set of individuals whose $T$-counterparts are treated unequally,
$$F(h, T) = \{x \in \mathbb{R}^d \mid h(x, 0) \ne h(T(x), 1)\},$$

• the positive Flip Set as the set of individuals whose $T$-counterparts are disadvantaged,
$$F^+(h, T) = \{x \in \mathbb{R}^d \mid h(x, 0) > h(T(x), 1)\},$$

• the negative Flip Set as the set of individuals whose $T$-counterparts are advantaged,
$$F^-(h, T) = \{x \in \mathbb{R}^d \mid h(x, 0) < h(T(x), 1)\}.$$

When there is no ambiguity, we may omit the dependence on $T$ and $h$ in the notation.

The Flip Set characterizes a set of counterfactual explanations w.r.t. an intervention $T$. Such explanations are meant to reveal a possible bias towards $S$. The partition into a positive and a negative Flip Set sharpens the analysis by indicating whether $S$ is an advantageous attribute or not in the decision making process. As $S = 0$ represents the minority, one can think of the negative partition as the occurrences of negative discrimination, and the positive partition as the occurrences of positive discrimination. Black et al. [2020] noted that the relative sizes of the empirical positive and negative Flip Sets quantify the lack of statistical parity. Following their proof, we give a generalization of their result to the continuous case:

Proposition 1.
Let $h$ be a binary classifier. If $T : \mathcal{X}_0 \to \mathcal{X}_1$ satisfies $T\sharp\mu_0 = \mu_1$, then
$$\mathbb{P}(h(X,S)=1 \mid S=0) - \mathbb{P}(h(X,S)=1 \mid S=1) = \mathbb{P}(X \in F^+ \mid S=0) - \mathbb{P}(X \in F^- \mid S=0).$$
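To make these objects concrete, here is a small illustrative sketch (the variable names and the toy translation map are ours, not the authors') that computes empirical Flip Sets for a black-box classifier on a sample from $\mu_0$ and checks the identity of Proposition 1.

```python
import numpy as np

def flip_sets(h, T, x0):
    """Split a sample x0 from mu_0 into the positive / negative Flip Sets of Definition 1."""
    pred_0 = h(x0, 0)             # decisions for the original individuals (group S = 0)
    pred_1 = h(T(x0), 1)          # decisions for their counterparts in group S = 1
    return x0[pred_0 > pred_1], x0[pred_0 < pred_1]   # (F+, F-)

# toy setting: a translation map pushing mu_0 onto mu_1, and a rule using one coordinate
rng = np.random.default_rng(2)
x0 = rng.normal(size=(5000, 2)) + np.array([-1.5, -0.5])
T = lambda x: x + np.array([3.0, 1.0])                 # T # mu_0 = mu_1
h = lambda x, s: (x[:, 1] > 0).astype(int)
f_plus, f_minus = flip_sets(h, T, x0)
parity_gap = h(x0, 0).mean() - h(T(x0), 1).mean()      # P(h=1|S=0) - P(h=1|S=1), estimated
print(parity_gap, (len(f_plus) - len(f_minus)) / len(x0))   # coincide, as in Proposition 1
```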
However, the interest of such sets lies in their explanatory power rather than in being proxies for determining fairness scores. By analyzing the mean behaviour of $I - T$ for points in a Flip Set, one can shed light on the features that mattered the most in the decision making process. A Transparency Report indicates which coordinates change the most, in intensity and in frequency, when applying $T$ to a Flip Set. In what follows, for any $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$ we define $\operatorname{sign}(x) := (\operatorname{sign}(x_1), \dots, \operatorname{sign}(x_d))^T$, the sign function on vectors.

Definition 2.
Let $\star$ be in $\{-, +\}$, $h$ be a binary classifier and $T : \mathcal{X}_0 \to \mathbb{R}^d$ be a measurable map. Assume that $\mu_0$ and $T\sharp\mu_0$ have finite first-order moments. The Transparency Report is defined by the mean difference vector
$$\Delta^\star_{\mathrm{diff}}(h, T) = \mathbb{E}_{\mu_0}[X - T(X) \mid X \in F^\star(h, T)] = \frac{1}{\mu_0(F^\star(h, T))}\int_{F^\star(h,T)} \big(x - T(x)\big)\, d\mu_0(x),$$
and the mean sign vector
$$\Delta^\star_{\mathrm{sign}}(h, T) = \mathbb{E}_{\mu_0}\big[\operatorname{sign}(X - T(X)) \mid X \in F^\star(h, T)\big] = \frac{1}{\mu_0(F^\star(h, T))}\int_{F^\star(h,T)} \operatorname{sign}\big(x - T(x)\big)\, d\mu_0(x).$$

The first vector indicates how much the points moved; the second shows whether the direction of the transportation was consistent. We upgrade the notion of Transparency Report by introducing new objects extending the Flip Test framework.
Definition 3.
Let $T : \mathcal{X}_0 \to \mathbb{R}^d$ be measurable. Assume that $\mu_0$ and $T\sharp\mu_0$ have finite first-order moments. The difference Reference Vector is defined as
$$\Delta^{\mathrm{ref}}_{\mathrm{diff}}(T) := \mathbb{E}_{\mu_0}[X - T(X)] = \int \big(x - T(x)\big)\, d\mu_0(x),$$
and the sign Reference Vector as
$$\Delta^{\mathrm{ref}}_{\mathrm{sign}}(T) := \mathbb{E}_{\mu_0}\big[\operatorname{sign}(X - T(X))\big] = \int \operatorname{sign}\big(x - T(x)\big)\, d\mu_0(x).$$

The auditing procedure can be summarized as follows: (1) compute the Flip Sets and evaluate the lack of statistical parity by comparing their respective sizes; (2) if the Flip Sets are unbalanced, compute the Transparency Report and the Reference Vectors; (3) identify possible sources of bias by looking at the largest components of $\Delta^\star_{\mathrm{diff}}(h,T) - \Delta^{\mathrm{ref}}_{\mathrm{diff}}(T)$ and $\Delta^\star_{\mathrm{sign}}(h,T) - \Delta^{\mathrm{ref}}_{\mathrm{sign}}(T)$. While the original approach would have directly analyzed the largest components of the Transparency Report, the aforementioned procedure scales the uncovered variations with a reference. This benchmark is essential: it contrasts the disparity between paired instances with different outcomes against the disparity between the protected groups, thereby pointing out the actual treatment effect of the decision rule. We give an example to illustrate how the Reference Vectors act as a sanity check in settings where the Transparency Report fails to give explanations.

Example 1.
Let $g$ be the standard Gaussian measure on $\mathbb{R}^2$, and define $\mu_0 := (-2, -1)^T + g$ and $\mu_1 := (2, 1)^T + g$, so that $\delta := \mathbb{E}(\mu_0) - \mathbb{E}(\mu_1) = -(4, 2)^T$. Set $T$ as the Brenier map between $\mu_0$ and $\mu_1$, and suppose that the decision rule is $h(x_1, x_2, s) := \mathbb{1}\{x_2 > 0\}$. In this scenario, $T$ is the uniform translation $I - \delta$, and we have
$$F^-(h, T) = \{(x_1, x_2)^T \in \mathbb{R}^2 \mid -2 < x_2 \le 0\}, \qquad F^+(h, T) = \emptyset.$$
Clearly, the predictor $h$ is unfair towards $\mu_0$, since the negative Flip Set outsizes the positive one. In this case, the vector $\Delta^-_{\mathrm{diff}}(h, T)$ is simply equal to
$$\Delta^-_{\mathrm{diff}}(h, T) = \frac{1}{\mu_0(F^-(h, T))}\int_{F^-(h,T)} \delta\, d\mu_0(x) = \delta = (-4, -2)^T.$$

A misleading analysis would state that, because $|-4| > |-2|$, the Transparency Report has uncovered a potential bias towards the first coordinate. This would be inaccurate, since the classifier only takes into account the second variable. This issue comes from the fact that in this homogeneous case, the Transparency Report only reflects how the two conditional distributions differ, and does not give any insight into the decision rule. Our benchmark approach detects such shortcomings by systematically comparing the Transparency Report to the Reference Vectors. In this setting we have $\Delta^{\mathrm{ref}}_{\mathrm{diff}}(T) = \delta$, thus $\Delta^-_{\mathrm{diff}}(h, T) - \Delta^{\mathrm{ref}}_{\mathrm{diff}}(T) = 0$, which means that the Flip Test does not give insight into the decision making process.

To sum up, we argue that it is the deviation of $\Delta^-_{\mathrm{diff}}(h, T)$ from $\Delta^{\mathrm{ref}}_{\mathrm{diff}}(T)$, and not $\Delta^-_{\mathrm{diff}}(h, T)$ alone, that brings to light the possible bias of the decision rule. Note that $T\sharp\mu_0 = \mu_1$ entails $\Delta^{\mathrm{ref}}_{\mathrm{diff}}(T) = \mathbb{E}[X \mid S = 0] - \mathbb{E}[X \mid S = 1]$, which does not depend on $T$. Still, we define the Reference Vector with an arbitrary $T$ because in practice we operate with an estimator that only approximates the push-forward condition.

The first step for implementing the Flip Test technique is computing an estimator $T_{n_0,n_1}$ of the chosen matching function $T$. In theory, the matching is not limited to an optimal transport map, but it must define an intuitively justifiable notion of counterpart.

Definition 4.
Let $T : \mathbb{R}^d \to \mathbb{R}^d$ satisfy $T\sharp\mu_0 = \mu_1$, and let $T_{n_0,n_1}$ be an estimator of $T$ built on an $n_0$-sample from $\mu_0$ and an $n_1$-sample from $\mu_1$. $T_{n_0,n_1}$ is said to be $T$-admissible if

1. $T_{n_0,n_1} : \mathcal{X}_0 \to \mathcal{X}_1$ is continuous on $\mathcal{X}_0$,
2. $T_{n_0,n_1}(x) \xrightarrow[n_0,n_1\to+\infty]{a.s.} T(x)$ for $\mu_0$-almost every $x$.

According to Theorem 1, the smooth interpolation $\mathcal{T}_n$ is an admissible estimator of the optimal transport map $T_0$ under mild assumptions.

The second step consists in building empirical versions of the Flip Sets and Transparency Reports for $h$ and $T_{n_0,n_1}$ using $m$ data points from $\mu_0$. The consistency problem at hand becomes two-fold: w.r.t. $m$, the size of the sample, and w.r.t. the convergence of the estimator $T_{n_0,n_1}$. Proving this consistency is crucial, as $T_{n_0,n_1}$ satisfies the push-forward condition at the limit only.

Consider an $m$-sample $\{x_i^0\}_{i=1}^m$ drawn from $\mu_0$. We define the empirical counterparts of, respectively, the negative Flip Set, the positive Flip Set, the mean difference vector, the mean sign vector, and the Reference Vectors, for arbitrary $h$ and $T$. For any $\star \in \{-, +\}$, they are given by
$$F^\star_m(h,T) := \{x_i^0\}_{i=1}^m \cap F^\star(h,T),$$
$$\Delta^\star_{\mathrm{diff},m}(h,T) := \frac{\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0)\,\big(x_i^0 - T(x_i^0)\big)}{|F^\star_m(h,T)|}, \qquad \Delta^\star_{\mathrm{sign},m}(h,T) := \frac{\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0)\,\operatorname{sign}\big(x_i^0 - T(x_i^0)\big)}{|F^\star_m(h,T)|},$$
$$\Delta^{\mathrm{ref}}_{\mathrm{diff},m}(T) := \frac{1}{m}\sum_{i=1}^m \big(x_i^0 - T(x_i^0)\big), \qquad \Delta^{\mathrm{ref}}_{\mathrm{sign},m}(T) := \frac{1}{m}\sum_{i=1}^m \operatorname{sign}\big(x_i^0 - T(x_i^0)\big).$$

Note that the first four quantities correspond to the original definitions from Black et al. [2020]. A sketch of how these empirical quantities can be computed is given below. The strong law of large numbers implies the almost-sure convergence of each of these estimators, as stated in Proposition 2.
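As an illustrative sketch (function and variable names are ours), the empirical quantities above can be assembled as follows; in practice the matching $T$ would be replaced by an admissible estimator, such as the interpolation introduced in Section 2.2.

```python
import numpy as np

def flip_test_report(h, T, x0):
    """Empirical Flip Sets, Transparency Reports and Reference Vectors
    computed on an m-sample x0 from mu_0, for a classifier h and a matching T."""
    pred_0, pred_1 = h(x0, 0), h(T(x0), 1)
    diff = x0 - T(x0)
    report = {
        "ref_diff": diff.mean(axis=0),             # empirical difference Reference Vector
        "ref_sign": np.sign(diff).mean(axis=0),    # empirical sign Reference Vector
    }
    for star, mask in (("+", pred_0 > pred_1), ("-", pred_0 < pred_1)):
        report[f"flip_frac_{star}"] = mask.mean()  # |F*_m| / m
        if mask.any():
            report[f"diff_{star}"] = diff[mask].mean(axis=0)
            report[f"sign_{star}"] = np.sign(diff[mask]).mean(axis=0)
    return report

# e.g. with the toy h, T, x0 of the previous sketch: report = flip_test_report(h, T, x0)
# bias candidates: largest entries of report["diff_-"] - report["ref_diff"]
```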
Proposition 2.
Let $\star \in \{-, +\}$, $h$ be a binary classifier, and $T$ a measurable function. The following convergences hold as $m \to +\infty$:
$$\frac{|F^\star_m(h,T)|}{m} \xrightarrow{\mu_0\text{-a.s.}} \mu_0(F^\star(h,T)), \qquad \Delta^\star_{\mathrm{diff},m}(h,T) \xrightarrow{\mu_0\text{-a.s.}} \Delta^\star_{\mathrm{diff}}(h,T), \qquad \Delta^\star_{\mathrm{sign},m}(h,T) \xrightarrow{\mu_0\text{-a.s.}} \Delta^\star_{\mathrm{sign}}(h,T),$$
$$\Delta^{\mathrm{ref}}_{\mathrm{diff},m}(T) \xrightarrow{\mu_0\text{-a.s.}} \Delta^{\mathrm{ref}}_{\mathrm{diff}}(T), \qquad \Delta^{\mathrm{ref}}_{\mathrm{sign},m}(T) \xrightarrow{\mu_0\text{-a.s.}} \Delta^{\mathrm{ref}}_{\mathrm{sign}}(T).$$

In particular, these convergences hold for an admissible estimator $T_{n_0,n_1}$. To address the further convergence w.r.t. $n_0$ and $n_1$, we first introduce a new definition.

Definition 5.
A binary classifier $\tilde h : \mathbb{R}^d \to \{0, 1\}$ is separating with respect to a measure $\nu$ on $\mathbb{R}^d$ if

1. $H_0 := \tilde h^{-1}(\{0\})$ and $H_1 := \tilde h^{-1}(\{1\})$ are closed or open,
2. $\nu\big(\overline{H_0} \cap \overline{H_1}\big) = 0$.

We argue that, except in pathological cases that are not relevant in practice, machine learning always deals with such classifiers. For example, thresholded versions of continuous functions, which account for most machine learning classifiers (e.g., SVMs, neural networks, ...), are separating with respect to Lebesgue-continuous measures. As a very theoretical example of a non-separating classifier, one could propose the indicator of the rational numbers, which is not separating with respect to the Lebesgue measure. Working with classifiers $h$ such that $h(\cdot, 1)$ is separating w.r.t. $\mu_1$ fixes the regularity issues one might encounter when taking the limit in $h(T_{n_0,n_1}(\cdot), 1)$. More precisely, it ensures that the set of discontinuity points of $h(\cdot, 1)$ is $\mu_1$-negligible. As $T\sharp\mu_0 = \mu_1$ and since $T_{n_0,n_1} \to T$ $\mu_0$-almost everywhere, the following continuous mapping result holds:

Proposition 3.
Let $\tilde h : \mathbb{R}^d \to \{0,1\}$ be a separating classifier w.r.t. $\mu_1$, and $T_{n_0,n_1}$ a $T$-admissible estimator. Then, for $\mu_0$-almost every $x$,
$$\tilde h(T_{n_0,n_1}(x)) \xrightarrow[n_0,n_1\to+\infty]{a.s.} \tilde h(T(x)).$$

Next, we make a technical assumption for the convergence of the Transparency Report. Let $\{e_1, \dots, e_d\}$ be the canonical basis of $\mathbb{R}^d$, and define for every $k \in \{1, \dots, d\}$ the set $\Lambda_k(T) := \{x \in \mathbb{R}^d \mid \langle x - T(x), e_k\rangle = 0\}$.

Assumption 1.
For every $k \in \{1, \dots, d\}$, $\mu_0\big(\Lambda_k(T)\big) = 0$.

This assumption typically holds when $\mu_0$ is absolutely continuous with respect to the Lebesgue measure. It is crucial for the convergence of the mean sign vector, as it ensures that the points of discontinuity of $x \mapsto \operatorname{sign}\big(x - T(x)\big)$ are $\mu_0$-negligible. We now turn to our main consistency result.

Theorem 2.
Let $\star \in \{-, +\}$, $h$ be a binary classifier such that $h(\cdot, 1)$ is separating w.r.t. $\mu_1$, and $T_{n_0,n_1}$ a $T$-admissible estimator. The following convergences hold as $n_0, n_1 \to +\infty$:
$$\mu_0\big(F^\star(h, T_{n_0,n_1})\big) \xrightarrow{a.s.} \mu_0\big(F^\star(h, T)\big), \qquad \Delta^\star_{\mathrm{diff}}(h, T_{n_0,n_1}) \xrightarrow{a.s.} \Delta^\star_{\mathrm{diff}}(h, T), \qquad \Delta^{\mathrm{ref}}_{\mathrm{diff}}(T_{n_0,n_1}) \xrightarrow{a.s.} \Delta^{\mathrm{ref}}_{\mathrm{diff}}(T).$$
If Assumption 1 holds, then additionally
$$\Delta^\star_{\mathrm{sign}}(h, T_{n_0,n_1}) \xrightarrow{a.s.} \Delta^\star_{\mathrm{sign}}(h, T), \qquad \Delta^{\mathrm{ref}}_{\mathrm{sign}}(T_{n_0,n_1}) \xrightarrow{a.s.} \Delta^{\mathrm{ref}}_{\mathrm{sign}}(T).$$

As $h$ is binary, the probability of the negative Flip Set can be written as $\mu_0(F^-(h, T_{n_0,n_1})) = \int [1 - h(x, 0)]\, h(T_{n_0,n_1}(x), 1)\, d\mu_0(x)$. Note that the integrated function $[1 - h(\cdot, 0)]\, h(T_{n_0,n_1}(\cdot), 1)$ is dominated by the constant $1$. Then, it follows from Proposition 3 that this sequence of functions converges $\mu_0$-almost everywhere to $[1 - h(\cdot, 0)]\, h(T(\cdot), 1)$ when $n_0, n_1 \to +\infty$. By the dominated convergence theorem, we conclude that $\mu_0(F^-(h, T_{n_0,n_1})) \to \mu_0(F^-(h, T))$. The same argument holds for the positive Flip Sets. The proofs of the other convergences follow the same reasoning, using Proposition 3 and Assumption 1 to apply the dominated convergence theorem.

As aforementioned, the assumptions of Theorem 2 are not significantly restrictive in practice. Thus, the Flip Test framework is tailored for implementations.

We addressed the problem of constructing a statistically consistent approximation of the continuous optimal transport map. We argued that this has strong consequences for machine learning applications based on OT, as it makes it possible to generalize discrete optimal transport to new observations while preserving its key properties. We illustrated that using the proposed extension ensures the statistical consistency of OT-based frameworks, and as such derived the first consistency analysis for observation-based counterfactual explanations.
References
Emily Black, Samuel Yeom, and Matt Fredrikson. Fliptest: Fairness testing via optimal transport. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 111–121, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi: 10.1145/3351095.3372845. URL https://doi.org/10.1145/3351095.3372845.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8, 2017.

Luis A. Caffarelli. A localization property of viscosity solutions to the Monge-Ampère equation and their strict convexity. Annals of Mathematics, 131(1):129–134, 1990.

Luis A. Caffarelli. Some regularity properties of solutions of Monge-Ampère equation. Communications on Pure and Applied Mathematics, 44(8-9):965–969, 1991.

Luis A. Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99–104, 1992.

Silvia Chiappa and Aldo Pacchiano. Fairness with continuous optimal transport, 2021.

Silvia Chiappa, Ray Jiang, Tom Stepleton, Aldo Pacchiano, Heinrich Jiang, and John Aslanides. A general approach to fairness with optimal transport. In AAAI, pages 3633–3640, 2020.

Dario Cordero-Erausquin and Alessio Figalli. Regularity of monotone transport maps between unbounded domains, 2019.

N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. doi: 10.1109/TPAMI.2016.2615921.

Eustasio Del Barrio, Jean-Michel Loubes, et al. Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability, 47(2):926–951, 2019.

Eustasio del Barrio, Juan A. Cuesta-Albertos, Marc Hallin, and Carlos Matrán. Center-outward distribution functions, quantiles, ranks, and signs in R^d, 2020a.

Eustasio del Barrio, Alberto González-Sanz, and Marc Hallin. A note on the regularity of optimal-transport-based center-outward distribution and quantile functions. Journal of Multivariate Analysis, 180:104671, 2020b.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM, 2015.

Alessio Figalli. The Monge–Ampère Equation and its Applications. 2017.

Alessio Figalli and Young-Heon Kim. Partial regularity of Brenier solutions of the Monge-Ampère equation. Discrete Contin. Dyn. Syst., 28(2):559–565, 2010.

Nathalie T. H. Gayraud, Alain Rakotomamonjy, and Maureen Clerc. Optimal transport applied to transfer learning for P300 detection. In BCI 2017 - 7th Graz Brain-Computer Interface Conference, page 6, 2017.

Alberto González-Sanz, Eustasio del Barrio, and Jean-Michel Loubes. Central limit theorems for general transportation costs, 2021.

Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2357–2365, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/gordaliza19a.html.

Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, pages 862–872. PMLR, 2020.

Richard M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23(3):309–311, 1978.

Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.

T. Le Gouic, J. Loubes, and P. Rigollet. Projection to fairness in statistical learning. arXiv preprint arXiv:2005.11720, 2020.

David Lewis. Causation. Journal of Philosophy, 70(17):556–567, 1973.

Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

R. Tyrrell Rockafellar. Convex Analysis. Number 28. Princeton University Press, 1970.

R. Tyrrell Rockafellar and Roger J-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Cédric Villani. Topics in Optimal Transportation. Number 58. American Mathematical Soc., 2003.

Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2008. ISBN 978-3-540-71049-3. OCLC: ocn244421231.

Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.

M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180. International World Wide Web Conferences Steering Committee, 2017.
A Proofs of Section 2
A.1 Intermediary result
We first introduce a proposition adapted from del Barrio et al. [2020a] to suit our setting. In what follows, we denote by $N_C(x) := \{y \in \mathbb{R}^d \mid \forall x' \in C,\ \langle y, x' - x\rangle \le 0\}$ the normal cone at $x$ of the convex set $C$.

Proposition 4.
Suppose that $\mathcal{X}_1$ is a compact convex set. Let $x_n = \lambda_n u_n \in \mathbb{R}^d$ where $0 < \lambda_n \to +\infty$ and $u_n \in \partial\mathcal{X}_1 \to u$ as $n \to +\infty$. Note that by compactness of the boundary, necessarily $u \in \partial\mathcal{X}_1$. If $(T_0(x_n))_{n\in\mathbb{N}}$ has a limit $v$ (taking a subsequence if necessary), then $v \in \partial\mathcal{X}_1$, and $u \in N_{\mathcal{X}_1}(v) \ne \{0\}$.

Proof.
Taking subsequences if necessary, we can assume that $T_0(x_n) \to v$ for some $v \in \mathcal{X}_1$. The monotonicity of $T_0$ implies that for any $x \in \mathbb{R}^d$, $\langle x_n - x, T_0(x_n) - T_0(x)\rangle \ge 0$. In particular, for any $w \in T_0(\mathbb{R}^d)$, $\langle x_n - T_0^{-1}(w), T_0(x_n) - w\rangle \ge 0$. This can be written as $\langle u_n - \lambda_n^{-1} T_0^{-1}(w), T_0(x_n) - w\rangle \ge 0$. Taking the limit leads to $\langle u, v - w\rangle \ge 0$. Define $H := \{w \in \mathbb{R}^d \mid \langle u, w - v\rangle \le 0\}$, which is a closed half-space. As $T_0$ pushes $\mu_0$ towards $\mu_1$, $T_0(\mathbb{R}^d)$ contains a dense subset of $\mathcal{X}_1$. Since $H$ is closed, this implies that $\mathcal{X}_1 \subset H$ and $v \in \mathcal{X}_1 \cap H$. Consequently, $H$ is a supporting hyperplane of $\mathcal{X}_1$ and $v \in \partial\mathcal{X}_1$. Now, write the inclusion $\mathcal{X}_1 \subset H$ as: for all $w \in \mathcal{X}_1$, $\langle u, w - v\rangle \le 0$. Conclude by noting that this inequality reads $u \in N_{\mathcal{X}_1}(v)$. The cone does not narrow down to $\{0\}$ since $v$ does not belong to $\mathring{\mathcal{X}}_1$.

A.2 Proof of Theorem 1
We now turn to the proof of the main theorem.
Proof.
Recall that $\mu_0$ and $\mu_1$ are probability measures on $\mathbb{R}^d$ with respective supports $\mathcal{X}_0$ and $\mathcal{X}_1$. We denote their interiors by $\mathring{\mathcal{X}}_0$ and $\mathring{\mathcal{X}}_1$, and their boundaries by $\partial\mathcal{X}_0$ and $\partial\mathcal{X}_1$. We assume the measures to be absolutely continuous with respect to the Lebesgue measure. Recall that there exists a unique map $T_0$ such that $T_0\sharp\mu_0 = \mu_1$ and $T_0 = \nabla\varphi_0$ $\mu_0$-almost everywhere for some convex function $\varphi_0$ called a potential. We denote by $\mathrm{dom}(\nabla\varphi_0)$ the set of differentiability points of $\varphi_0$, which satisfies $\mu_0(\mathrm{dom}(\nabla\varphi_0)) = 1$ according to Theorem 25.5 in Rockafellar [1970].

Conversely, there also exists a convex function $\psi_0$ such that $S_0$, the Brenier map from $\mu_1$ to $\mu_0$, can be written as $S_0 := \nabla\psi_0$ $\mu_1$-almost everywhere. In addition, $S_0$ can be related to $T_0$ through the potential functions. Concretely, $\psi_0$ coincides with the convex conjugate $\varphi_0^*(y) = \sup_{x\in\mathbb{R}^d}\{\langle x, y\rangle - \varphi_0(x)\}$ of $\varphi_0$. We can then fix this function for $u \in \mathbb{R}^d \setminus \mathring{\mathcal{X}}_1$ using the lower semi-continuous extension on the support. This defines a specific $\varphi_0$ (hence a specific solution $T_0$) as

$$\varphi_0(x) := \sup_{u\in\mathbb{R}^d}\big\{\langle x, u\rangle - \varphi_0^*(u)\big\} = \sup_{u\in\mathcal{X}_1}\big\{\langle x, u\rangle - \varphi_0^*(u)\big\}. \tag{8}$$

Let $\{x_i^0\}_{i=1}^n$ and $\{x_i^1\}_{i=1}^n$ be $n$-samples drawn from respectively $\mu_0$ and $\mu_1$, defining empirical measures $\mu_n^0$ and $\mu_n^1$. Without loss of generality, assume that the samples are ordered such that $T_n : x_i^0 \mapsto x_i^1$ is the unique solution to the corresponding discrete Monge problem. Consider the interpolation $\mathcal{T}_n$. We recall the properties it satisfies:

1. $\mathcal{T}_n = \nabla\varphi_n$ where $\varphi_n$ is continuously differentiable,
2. $\mathcal{T}_n$ is cyclically monotone,
3. for all $i \in \{1, \dots, n\}$, $\mathcal{T}_n(x_i^0) = x_i^1 = T_n(x_i^0)$,
4. for all $x \in \mathbb{R}^d$, $\mathcal{T}_n(x) \in \mathrm{conv}\big(\{x_1^1, \dots, x_n^1\}\big)$.

Following the decomposition of Theorem 1, the proof is divided into three steps.
Step 1: Point-wise convergence.
Assume that the support $\mathcal{X}_0$ is a convex set. Recall that $\mathcal{T}_n = \nabla\varphi_n$ everywhere and $T_0 = \nabla\varphi_0$ $\mu_0$-almost everywhere. We prove the point-wise convergence of $\{\mathcal{T}_n\}_{n\in\mathbb{N}}$ to $T_0$ in two steps: first, we show the point-wise convergence of $\{\varphi_n\}_{n\in\mathbb{N}}$ to $\varphi_0$; second, we do the same for $\{\nabla\varphi_n\}_{n\in\mathbb{N}}$ to $\nabla\varphi_0$.

Theorem 5.19 in Villani [2008] implies that
$$\gamma_n = (I \times T_n)\sharp\mu_n^0 \xrightarrow[n\to+\infty]{w} \gamma_0 = (I \times T_0)\sharp\mu_0,$$
where $\xrightarrow{w}$ denotes the weak convergence of probability measures. It follows from Theorem 2.8 in Del Barrio et al. [2019] that for all $x \in \mathring{\mathcal{X}}_0$ the limit $\lim_{n\to+\infty}\varphi_n(x) = \varphi_0(x)$ holds after centering. Theorem 25.5 in Rockafellar [1970] states that for all $x \in \mathrm{dom}(\nabla\varphi_0)$ there exists an open convex subset $C$ such that $x \in C \subset \mathrm{dom}(\nabla\varphi_0)$. Take an arbitrary $x \in \mathring{\mathcal{X}}_0 \cap \mathrm{dom}(\nabla\varphi_0)$ and consider such a subset $C$ containing $x$. Since $\varphi_0$ is finite and differentiable on $C$, we can apply Theorem 25.7 in Rockafellar [1970] to conclude that $\nabla\varphi_0(x) = \lim_{n\to+\infty}\nabla\varphi_n(x)$. To sum up, the desired equality holds on $\mathring{\mathcal{X}}_0 \cap \mathrm{dom}(\nabla\varphi_0)$, hence $\mu_0$-almost surely (recall that the boundary of a convex set is Lebesgue negligible).

Step 2: Uniform convergence on compact sets.
Further assume that $\nabla\varphi_0$ is continuous on $\mathring{\mathcal{X}}_0$, and that the support $\mathcal{X}_1$ is a compact set. Set $K = \sup_{x\in\mathcal{X}_1}\|x\|$. This implies that for any $x \in \mathbb{R}^d$, $\|\mathcal{T}_n(x)\| \le \max_{1\le i\le n}\|x_i^1\| \le K$. Then $\|\nabla\varphi_n(x)\| \le K$ for all $n \in \mathbb{N}$ and $x \in \mathbb{R}^d$. In consequence, the sequence $\{\varphi_n\}_{n\in\mathbb{N}}$ is equicontinuous with respect to the topology of uniform convergence on compact sets. The Arzelà-Ascoli theorem applied on the compact sets of $\mathbb{R}^d$ implies that the sequence is relatively compact in the topology induced by the uniform norm on compact sets. Let $\rho$ be any accumulation point of $\{\varphi_n\}_{n\in\mathbb{N}}$. Then there exists a subsequence of $\{\varphi_n\}_{n\in\mathbb{N}}$ converging to $\rho$. Abusing notation, we keep denoting the subsequence by $\{\varphi_n\}_{n\in\mathbb{N}}$. The previous step implies that $\varphi_0 = \rho$ and $\nabla\varphi_0 = \nabla\rho$ on $\mathcal{X}_0$. Next, we show that this equality holds on $\mathbb{R}^d \setminus \mathcal{X}_0$.

The continuity of the transport map implies that $\mathring{\mathcal{X}}_0 \subset \mathrm{dom}(\nabla\varphi_0)$. Hence, by convexity, for every $z \in \mathbb{R}^d$ and $u = \nabla\varphi_0(x) = \nabla\rho(x) \in \nabla\varphi_0(\mathring{\mathcal{X}}_0)$,
$$\rho(z) \ge \rho(x) + \langle u, z - x\rangle = \langle u, z\rangle - \varphi_0^*(u), \tag{9}$$
where the equality comes from the equality case of the Fenchel-Young theorem. As $\mu_0(\mathring{\mathcal{X}}_0) = 1$, the push-forward condition $\nabla\varphi_0\sharp\mu_0 = \mu_1$ implies that $\mu_1(\nabla\varphi_0(\mathring{\mathcal{X}}_0)) = 1$ and consequently $\nabla\varphi_0(\mathring{\mathcal{X}}_0)$ is dense in $\mathcal{X}_1$. It follows that
$$\rho(z) \ge \sup_{u\in\nabla\varphi_0(\mathring{\mathcal{X}}_0)}\big\{\langle u, z\rangle - \varphi_0^*(u)\big\} = \sup_{u\in\mathcal{X}_1}\big\{\langle u, z\rangle - \varphi_0^*(u)\big\} = \varphi_0(z) \quad\text{for every } z \in \mathbb{R}^d. \tag{10}$$

To get the upper bound, set $z \in \mathbb{R}^d$ and $u_n = \nabla\varphi_n(z) = \mathcal{T}_n(z)$. Since $\mathcal{T}_n(z) \in \mathrm{conv}\big(\{x_1^1, \dots, x_n^1\}\big)$, then $u_n \in \mathcal{X}_1$. The Fenchel-Young equality once again implies that $\langle z, u_n\rangle = \varphi_n(z) + \varphi_n^*(u_n)$. This gives that
$$\varphi_n(z) \le \sup_{u\in\mathring{\mathcal{X}}_1}\big\{\langle u, z\rangle - \varphi_n^*(u)\big\} = \tilde\varphi_n(z),$$
where $\tilde\varphi_n$ is the Legendre transform of
$$\tilde\varphi_n^* : u \mapsto \begin{cases}\varphi_n^*(u) & \text{if } u \in \mathring{\mathcal{X}}_1,\\ +\infty & \text{otherwise.}\end{cases}$$
Since $\nabla\varphi_0^*$ is the Brenier map from $\mu_1$ to $\mu_0$, Theorem 2.8 in Del Barrio et al. [2019] implies that $\lim_{n\to+\infty}\varphi_n^*(u) = \varphi_0^*(u) = \lim_{n\to+\infty}\tilde\varphi_n^*(u)$ for every $u \in \mathring{\mathcal{X}}_1$. Outside $\mathring{\mathcal{X}}_1$ we have $\tilde\varphi_n^*(u) = +\infty = \varphi_0^*(u)$ by definition. Hence, the sequence $\{\tilde\varphi_n^*\}_{n\in\mathbb{N}}$ converges point-wise to $\varphi_0^*$ over $\mathbb{R}^d$. According to Theorem 7.17 together with Theorem 11.34 in Rockafellar and Wets [2009], the same convergence holds for their conjugates. This means that for any $x \in \mathbb{R}^d$ we have $\lim_{n\to+\infty}\tilde\varphi_n(x) = \varphi_0(x)$. This leads to $\rho(x) \le \varphi_0(x)$ for every $x \in \mathbb{R}^d$, hence $\rho = \varphi_0$. We conclude, using Theorem 25.7 in Rockafellar [1970], that $\mathcal{T}_n = \nabla\varphi_n$ converges uniformly to $T_0 = \nabla\varphi_0$ over the compact sets of $\mathbb{R}^d$, in particular over $\mathcal{X}_0$ if it is compact.
Step 3: Uniform convergence on $\mathbb{R}^d$. Further assume that the support $\mathcal{X}_1$ is a strictly convex set. To prove the result it suffices to show that, for every $w \in \mathbb{R}^d$,
$$\sup_{x\in\mathbb{R}^d}|\langle \mathcal{T}_n(x) - T_0(x), w\rangle| \xrightarrow[n\to+\infty]{} 0.$$
Assume on the contrary that there exist $\varepsilon > 0$, $w \ne 0$, and $\{x_n\}_{n\in\mathbb{N}} \subset \mathbb{R}^d$ such that
$$|\langle \mathcal{T}_n(x_n) - T_0(x_n), w\rangle| > \varepsilon \tag{11}$$
for all $n$. Necessarily, the sequence $\{x_n\}_{n\in\mathbb{N}}$ is unbounded. If not, we could extract a convergent subsequence so that, by using the point-wise convergence and the continuity of the transport functions, the left term of (11) would tend to zero. Taking subsequences if necessary, we can assume that $x_n = \lambda_n u_n$ where $\lim_{n\to+\infty} u_n = u$, with $u_n, u \in \partial\mathcal{X}_1$ and $0 < \lambda_n \to +\infty$. By compactness of $\mathcal{X}_1$ and Proposition 4, $\mathcal{T}_n(x_n) \to z \in \mathcal{X}_1$ and $T_0(x_n) \to y \in \partial\mathcal{X}_1$. Let $\tau > 0$, so that by monotonicity
$$\langle \mathcal{T}_n(x_n) - \mathcal{T}_n(\tau u_n), (\lambda_n - \tau) u_n\rangle \ge 0.$$
For $n$ large enough so that $\lambda_n > \tau$, we have
$$\langle \mathcal{T}_n(x_n) - T_0(\tau u_n), u_n\rangle + \langle T_0(\tau u_n) - \mathcal{T}_n(\tau u_n), u_n\rangle \ge 0.$$
The second term tends to zero, leading to $\langle z - T_0(\tau u), u\rangle \ge 0$. As this holds for any $\tau > 0$, we can take $\tau_n = \lambda_n \to +\infty$ to get
$$\langle z - y, u\rangle \ge 0. \tag{12}$$
According to Proposition 4, $u \in N_{\mathcal{X}_1}(y)$ with $u \ne 0$. In particular, as $z \in \mathcal{X}_1$, we have $\langle u, z - y\rangle \le 0$, which implies that $\langle u, z - y\rangle = 0$. This means that $u \perp (z - y)$ and $u \in N_{\mathcal{X}_1}(y)$. Hence, $z - y$ belongs to the tangent plane of $\partial\mathcal{X}_1$ at $y$ while $z \in \mathcal{X}_1$. Besides, $\mathcal{X}_1$ is strictly convex, implying that $z = y$. This contradicts (11) at the limit.

B Proofs of Section 3
Proof of Proposition 1.
Proof.
Note that
$$F^-(h, T) = \{h(x, 0) = 0 \text{ and } h(T(x), 1) = 1\} = \{h(T(x), 1) = 1\} \setminus \{h(x, 0) = 1 \text{ and } h(T(x), 1) = 1\}.$$
Similarly,
$$F^+(h, T) = \{h(x, 0) = 1 \text{ and } h(T(x), 1) = 0\} = \{h(x, 0) = 1\} \setminus \{h(x, 0) = 1 \text{ and } h(T(x), 1) = 1\}.$$
Taking the measures we get
$$\mu_0(F^-) - \mu_0(F^+) = \mu_0(\{x \in \mathbb{R}^d \mid h(T(x), 1) = 1\}) - \mu_0(\{x \in \mathbb{R}^d \mid h(x, 0) = 1\}).$$
Using the fact that $T\sharp\mu_0 = \mu_1$ we have
$$\mu_0(\{x \in \mathbb{R}^d \mid h(T(x), 1) = 1\}) = \mu_1(\{x \in \mathbb{R}^d \mid h(x, 1) = 1\}).$$
This leads to
$$\mu_0(F^-) - \mu_0(F^+) = \mu_1(\{x \in \mathbb{R}^d \mid h(x, 1) = 1\}) - \mu_0(\{x \in \mathbb{R}^d \mid h(x, 0) = 1\}),$$
which concludes the proof.

Proof of Proposition 2.
Proof.
Let $\star \in \{-, +\}$. The empirical probability of the Flip Set is
$$\frac{|F^\star_m(h,T)|}{m} = \frac{1}{m}\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0).$$
By the strong law of large numbers,
$$\frac{1}{m}\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0) \xrightarrow[m\to+\infty]{\mu_0\text{-a.s.}} \mathbb{E}_{\mu_0}[\mathbb{1}_{F^\star(h,T)}(X)] = \mu_0(F^\star(h,T)).$$
This concludes the first part of the proof. We now turn to the Transparency Report, and show the convergence of the mean difference vector, as the proof is similar for the mean sign vector. The empirical estimator can be written as
$$\Delta^\star_{\mathrm{diff},m}(h,T) = \frac{m}{|F^\star_m(h,T)|} \times \frac{1}{m}\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0)\,\big(x_i^0 - T(x_i^0)\big).$$
Then, by the strong law of large numbers we have
$$\frac{m}{|F^\star_m(h,T)|} \times \frac{1}{m}\sum_{i=1}^m \mathbb{1}_{F^\star(h,T)}(x_i^0)\,\big(x_i^0 - T(x_i^0)\big) \xrightarrow[m\to+\infty]{\mu_0\text{-a.s.}} \frac{1}{\mu_0(F^\star(h,T))}\int_{F^\star(h,T)} \big(x - T(x)\big)\, d\mu_0(x),$$
where by definition
$$\frac{1}{\mu_0(F^\star(h,T))}\int_{F^\star(h,T)} \big(x - T(x)\big)\, d\mu_0(x) = \Delta^\star_{\mathrm{diff}}(h,T).$$
The proof for the Reference Vectors is identical, even simpler as $h$ is not involved.

Proof of Proposition 3.

Proof.
Throughout this proof, we work with a given realization $T_{n_0,n_1} := T^{(\omega)}_{n_0,n_1}$ of the random estimator for an unimportant arbitrary $\omega \in \Omega$. Without loss of generality, consider that $H_0$ is open and $H_1$ is closed. Recall that by $T$-admissibility, the sequence $T_{n_0,n_1}(x)$ converges for $\mu_0$-almost every $x$. We aim at showing that for $\mu_0$-almost every $x$,
$$\tilde h(T_{n_0,n_1}(x)) \xrightarrow[n_0,n_1\to+\infty]{} \tilde h(T(x)).$$
For any $x \in \mathcal{X}_0$ there are only two different cases.
Case 1:
For any $n_0$ and $n_1$ large enough, $T_{n_0,n_1}(x) \in H_1$. Then at the limit, $T(x) \in H_1$, meaning that the expected convergence holds for this $x$.

Case 2:
For any $n_0$ and $n_1$ large enough, $T_{n_0,n_1}(x) \in H_0$. Then at the limit, either $T(x) \in H_0$ or $T(x) \in H_1$. If $T(x) \in H_0$, the expected convergence holds for this $x$. If $T(x) \in H_1$, necessarily $T(x) \in \overline{H_0} \cap \overline{H_1}$. As $\mu_1(\overline{H_0} \cap \overline{H_1}) = 0$ and $\mu_1 = \mu_0 \circ T^{-1}$, this only occurs for $x$ in a $\mu_0$-negligible set.

Any other case would contradict the convergence of $T_{n_0,n_1}(x)$. Consequently, the expected convergence holds $\mu_0$-almost everywhere.

Proof of Theorem 2.

Proof.
In this proof as well, we work with a given realization $T_{n_0,n_1} := T^{(\omega)}_{n_0,n_1}$ of the random estimator for an unimportant arbitrary $\omega \in \Omega$. Let us show the result for $\star = -$, as the proof is equivalent for $\star = +$. Because $h$ is binary, the probability of the Flip Set can be written as
$$\mu_0(F^-(h, T_{n_0,n_1})) = \mathbb{E}_{\mu_0}\big[\mathbb{1}_{F^-(h,T_{n_0,n_1})}\big] = \int [1 - h(x, 0)]\, h(T_{n_0,n_1}(x), 1)\, d\mu_0(x).$$
Note that the integrated function $[1 - h(\cdot, 0)]\, h(T_{n_0,n_1}(\cdot), 1)$ is dominated by the constant $1$. Then, it follows from Proposition 3 that this sequence of functions converges $\mu_0$-almost everywhere to $[1 - h(\cdot, 0)]\, h(T(\cdot), 1)$ when $n_0, n_1 \to +\infty$. By the dominated convergence theorem, we conclude that
$$\mu_0(F^-(h, T_{n_0,n_1})) \xrightarrow[n_0,n_1\to+\infty]{} \mu_0(F^-(h, T)).$$
We now turn to the mean difference vector,
$$\Delta^-_{\mathrm{diff}}(h, T_{n_0,n_1}) = \frac{1}{\mu_0(F^-(h, T_{n_0,n_1}))}\int_{F^-(h,T_{n_0,n_1})} \big(x - T_{n_0,n_1}(x)\big)\, d\mu_0(x).$$
We already proved that the left fraction converges to $\mu_0(F^-(h, T))^{-1}$. To deal with the integral, we exploit once again the fact that $h$ is binary to write
$$\int_{F^-(h,T_{n_0,n_1})} \big(x - T_{n_0,n_1}(x)\big)\, d\mu_0(x) = \int [1 - h(x, 0)]\, h(T_{n_0,n_1}(x), 1)\,\big(x - T_{n_0,n_1}(x)\big)\, d\mu_0(x).$$
Note that the sequence of functions $x \mapsto x - T_{n_0,n_1}(x)$ converges $\mu_0$-almost everywhere to $x \mapsto x - T(x)$. This is where Proposition 3 comes into play to ensure the $\mu_0$-almost everywhere convergence of the integrated function. This enables applying the dominated convergence theorem to conclude that
$$\Delta^-_{\mathrm{diff}}(h, T_{n_0,n_1}) \xrightarrow[n_0,n_1\to+\infty]{} \Delta^-_{\mathrm{diff}}(h, T).$$
We finally address the case of the mean sign vector:
$$\Delta^-_{\mathrm{sign}}(h, T_{n_0,n_1}) = \frac{1}{\mu_0(F^-(h, T_{n_0,n_1}))}\int [1 - h(x, 0)]\, h(T_{n_0,n_1}(x), 1)\,\operatorname{sign}\big(x - T_{n_0,n_1}(x)\big)\, d\mu_0(x).$$
The approach is the same as for the mean difference vector. The only crucial distinction to handle is the convergence of the sequence $x \mapsto \operatorname{sign}\big(x - T_{n_0,n_1}(x)\big)$ to $x \mapsto \operatorname{sign}\big(x - T(x)\big)$, which is not trivial as the sign function is discontinuous wherever a coordinate of its argument equals zero. We follow a similar reasoning as in the proof of Proposition 3 to show the convergence $\mu_0$-almost everywhere. The only pathological case happens when $x - T(x)$ ends up on a canonical axis, that is to say when $x \in \Lambda_k(T)$ for some $k$. If Assumption 1 holds, this occurs only for $x$ in a $\mu_0$-negligible set. Consequently, for $\mu_0$-almost every $x$,
$$\operatorname{sign}\big(x - T_{n_0,n_1}(x)\big) \xrightarrow[n_0,n_1\to+\infty]{} \operatorname{sign}\big(x - T(x)\big).$$
To conclude, we apply Proposition 3 along with the dominated convergence theorem to obtain
$$\Delta^-_{\mathrm{sign}}(h, T_{n_0,n_1}) \xrightarrow[n_0,n_1\to+\infty]{} \Delta^-_{\mathrm{sign}}(h, T).$$
The proof for the Reference Vectors is identical, even simpler as $h$ is not involved.