Achieving robustness in classification using optimal transport with hinge regularization
Mathieu Serrurier, Franck Mamalet, Alberto González-Sanz, Thibaut Boissin, Jean-Michel Loubes, Eustasio del Barrio
Mathieu Serrurier (IRIT, Université Paul Sabatier, Toulouse)
Franck Mamalet (IRT Saint-Exupéry)
Alberto González-Sanz (IMT, Université Paul Sabatier, Toulouse)
Thibaut Boissin (IRT Saint-Exupéry)
Jean-Michel Loubes (IMT, Université Paul Sabatier, Toulouse)
Eustasio del Barrio (Dpto. de Estadística e Investigación Operativa, Universidad de Valladolid)
Abstract
We propose a new framework for robust binary classification, with Deep Neural Networks, based on a hinge regularization of the Kantorovich-Rubinstein dual formulation for the estimation of the Wasserstein distance. The robustness of the approach is guaranteed by the strict Lipschitz constraint on the functions required by the optimization problem and by the direct interpretation of the loss in terms of adversarial robustness. We prove that this classification formulation has a solution, and is still the dual formulation of an optimal transportation problem. We also establish the geometrical properties of this optimal solution. We summarize state-of-the-art methods to enforce Lipschitz constraints on neural networks and we propose new ones for convolutional networks (associated with an open source library for this purpose). The experiments show that the approach provides the expected guarantees in terms of robustness without any significant accuracy drop. The results also suggest that adversarial attacks on the proposed models visibly and meaningfully change the input, and can thus serve as an explanation for the classification.
1 Introduction

The important progress in deep learning has led to a massive interest in these approaches in industry. However, when machine learning is applied to critical tasks, such as in the transportation or the medical domain, empirical and theoretical guarantees are required. Some recent papers [7] propose theoretical guarantees for particular neural networks, but this remains an open problem for Deep Neural Networks. Empirically, the weakness of deep models with respect to adversarial attacks was first shown in [27], and is an active research topic. [27] pointed out that sensitivity to adversarial attacks is closely related to the high Lipschitz constant of an unconstrained deep classifier. Defense methods relying on Lipschitz constraints have also been proposed [6; 16; 21; 22]. Moreover, it has been proven that limiting the Lipschitz constant improves generalisation [26] and the interpretability of the model [29], i.e. higher sparsity and interpretability. Counterfactual explanation in machine learning considers what we have to change in a situation to change the prediction [32]. For models based on a high-level language such as logic, this modification is interpretable in itself and provides an explanation of the prediction. It turns out that, for neural networks, the definition of a counterfactual corresponds exactly to an adversarial attack, but such attacks are usually indistinguishable from noise.
1-Lipschitz networks have also been used in Wasserstein GAN [2] to measure the distance between two distributions as a discriminator, in analogy with the initial GAN algorithm [12]. The Wasserstein distance is approximated using a loss based on the Kantorovich-Rubinstein dual formulation and a k-Lipschitz network constrained by weight clipping. An interesting property of this approach, but not harnessed in WGAN, is the link with optimal transportation, which provides a nice interpretation of the loss function in terms of robustness. However, we will show that a vanilla classifier based on the Kantorovich-Rubinstein problem is suboptimal, even on toy datasets.

In this paper, we propose a binary classifier based on a regularized version of the dual formulation of the Wasserstein distance, combining the Kantorovich-Rubinstein loss function with a hinge loss. With this new optimization problem, we ensure to have an accurate classifier with a loss that has a direct interpretation in terms of robustness due to the 1-Lipschitz constraint. Moreover, we show that it is still the dual of an optimal transportation problem. We prove that the optimal solution of the problem exists and makes no error when the classes are well separated. Our approach shares some similarities with Parseval networks [6], in the way we constrain the Lipschitz constant of the networks by spectral normalization and orthonormalization of the weights. However, the new loss function we propose takes better advantage of the Lipschitz constant limitation. The geometric properties of the optimal solution of our problem encourage us to consider more advanced regularization techniques proposed in [1], such as Björck normalization and a gradient preserving activation function. Experimentation shows that the proposed approach matches the state of the art in terms of accuracy and demonstrates higher robustness to adversarial attacks. We also emphasize that the classifier output has a direct interpretation in terms of robustness. Last, we show that adversarial examples for our classifier perform input changes that are interpretable, and thus are close to the notion of counterfactuals.

The paper, and the contributions, are structured as follows. In Section 2, we recall the definition of the Wasserstein distance and the associated dual optimization problem. We present the interesting properties of a classifier based on this approach and we illustrate that it leads to a suboptimal classifier. Section 3 describes the proposed binary classifier, based on a regularized version of the Kantorovich-Rubinstein loss with a hinge loss. We show that the primal of this classification problem is a new optimal transport problem and we demonstrate different properties of our approach. Section 4 is devoted to the way of constraining the classifier to be 1-Lipschitz. We recall the different approaches to perform Lipschitz regularization, and also propose a new way to consider the regularization for convolutional and pooling layers. Section 5 presents the results of experiments on the MNIST and CelebA datasets, measuring and comparing the results of the different approaches in terms of accuracy and robustness. It also illustrates that the 1-Lipschitz constraint is satisfied. Last, we demonstrate that with our approach, building an adversarial example requires explicitly changing the example to an in-between two-classes image.
We also show that we can easily build a counterfactual example, based on the gradient of our network at the considered point, that can be used as an explanation for a classification. Proofs, computation details and additional experiments are reported in the appendix.

2 Wasserstein distance for binary classification

In this paper we only consider the Wasserstein-1 distance, also called the Earth-mover distance, and denoted W instead of W_1. The 1-Wasserstein distance between two probability distributions µ and ν on Ω, together with its dual formulation given by Kantorovich-Rubinstein duality [30], is defined as the solution of:

W(µ, ν) = inf_{π ∈ Π(µ,ν)} E_{(x,z)∼π} ||x − z||    (1a)
        = sup_{f ∈ Lip_1(Ω)} E_{x∼µ}[f(x)] − E_{x∼ν}[f(x)]    (1b)

where Π(µ, ν) is the set of all probability measures on Ω × Ω with marginals µ and ν, and Lip_1(Ω) denotes the space of 1-Lipschitz functions over Ω. Although the infimum in Eq. (1a) is not tractable in general, the dual problem can be estimated through the optimization of a regularized neural network. This approach was introduced in WGAN [2], where Lip_1(Ω) is approximated by the set of neural networks with bounded weights (better approaches to achieve this will be discussed in Section 4).

[Figure 1: Wasserstein classification (Eq. (2)) on the two moons. (a) Distribution of f̂_c conditionally to the classes. (b) Level map of f̂_c.]

We consider a binary classification problem on a feature vector space X ⊂ Ω and labels Y = {−1, 1}. We name P_+ = P(X | Y = 1) and P_− = P(X | Y = −1) the conditional distributions with respect to Y. We note p = P(Y = 1) and 1 − p = P(Y = −1) the a priori class distribution. The classification problem is balanced when p = 1/2.

In WGAN, [2] proposed to use the learned neural network (denoted f̂ in the following), obtained by maximizing Eq. (1b), as a discriminator between fake and real images, in analogy with GAN [12]. To build a classifier based on f̂, one can simply note that if f* is an optimal solution of Eq. (1b), then f* + C, C ∈ R, is also optimal. Centering the function f* (resp. f̂), Eq. (2), enables classification according to the sign of f*_c(x) (resp. f̂_c for the empirical solution):

f*_c(x) = f*(x) − (1/2) ( E_{z∼P_+}[f*(z)] + E_{z∼P_−}[f*(z)] ).    (2)

Such a classifier would exhibit good properties in terms of robustness for two main reasons. First, it has been shown in [30] that the function f* is directly related to the cost of transportation between two points linked by the transportation plan (when π*(x = z) = 0), as follows:

P_{(x,z)∼π*} ( f*(x) − f*(z) = ||x − z|| ) = 1.    (3)

Second, it was shown in [13; 1] that this optimal solution also induces a property stronger than 1-Lipschitz:

||∇f*|| = 1 almost surely on the support of π*.    (4)

However, applying this vanilla classifier (Eq. (2)) to a toy dataset such as the two-moons problem leads to poor accuracy. Indeed, Figures 1a and 1b present respectively the distribution of the values of f̂_c(x) conditionally to the classes and the level map of f̂_c. We can observe that, even if the classes are easily separable, the distributions of the values of f̂_c conditionally to the class overlap. Thus, the 0-level threshold on f̂_c does not correspond to the optimal separator (even if it is better than random). Intuitively, f̂_c maximizes the difference of the expectations of the images of the two distributions but does not try to minimize their overlap (Fig. 1a).
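To make the dual estimation concrete, the following minimal sketch evaluates the empirical KR objective of Eq. (1b) on a mini-batch. It is our own illustration, not a fragment of the authors' code; TensorFlow is assumed since the paper's library is TensorFlow-based, labels are assumed to live in {−1, 1}, and `kr_loss` is a hypothetical helper name.

```python
import tensorflow as tf

def kr_loss(y_true, y_pred):
    """Empirical Kantorovich-Rubinstein objective (Eq. (1b)) on a batch.

    y_true holds labels in {-1, 1}; y_pred holds the scalar outputs f(x).
    Returns -(E_{x~P+}[f(x)] - E_{x~P-}[f(x)]), so that minimizing this
    value maximizes the dual objective.
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    pos = tf.boolean_mask(y_pred, y_true > 0)   # samples drawn from P+
    neg = tf.boolean_mask(y_pred, y_true < 0)   # samples drawn from P-
    return tf.reduce_mean(neg) - tf.reduce_mean(pos)
```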
3 Hinge regularized Kantorovich-Rubinstein classification

In order to improve the classification abilities of the classifier based on the Wasserstein distance, we propose a Kantorovich-Rubinstein optimization problem regularized by a hinge loss:

sup_{f∈Lip_1(Ω)} −L^{hKR}_λ(f) = sup_{f∈Lip_1(Ω)} −( E_{x∼P_−}[f(x)] − E_{x∼P_+}[f(x)] + λ E_{(x,y)} (1 − y f(x))_+ )    (5)

where (1 − y f(x))_+ stands for the hinge loss max(0, 1 − y f(x)) and λ ≥ 0. We name L^{hKR}_λ the hinge-KR loss. The goal is then to minimize this loss with a 1-Lipschitz neural network. When λ = 0, this corresponds to the Kantorovich-Rubinstein dual optimization problem. Intuitively, the 1-Lipschitz function f* optimal with respect to Eq. (5) is the one that both separates the examples with a margin and spreads as much as possible the images of the two distributions.

[Figure 2: Hinge regularized Kantorovich-Rubinstein (hinge-KR) classification on the two moons problem. (a) Distribution of f̂ conditionally to the classes. (b) Level map of f̂.]

In the following, we introduce theorems that prove the existence of such an optimal function f* together with important properties of this function. Proofs of these theorems are given in Appendix B.

Theorem 1 (Solution existence). For each λ > 0 there exists at least one solution f* to the problem

f* := f*_λ ∈ arg min_{f∈Lip_1(Ω)} L^{hKR}_λ(f).

Moreover, let ψ be an optimal transport potential for the transport problem from P_+ to P_−; then f* satisfies

||f*||_∞ ≤ M := (1 + diam(Ω) + L(ψ)) / inf(p, 1 − p).    (6)

The next theorem establishes that the Kantorovich-Rubinstein optimization problem with hinge regularization is still a transportation problem, with relaxed constraints on the joint measure (which is no longer a joint probability measure).

Theorem 2 (Duality). Set P_+, P_− ∈ P(Ω) and λ > 0; then the following equality holds:

sup_{f∈Lip_1(Ω)} −L^{hKR}_λ(f) = inf_{π∈Π^p_λ(P_+,P_−)} ∫_{Ω×Ω} ||x − z|| dπ + π_x(Ω) + π_z(Ω) − 2    (7)

where Π^p_λ(P_+, P_−) is the set of positive measures π ∈ M_+(Ω × Ω) which are absolutely continuous with respect to the joint measure dP_+ × dP_− and such that

dπ_x/dP_+ ∈ [p, p(1 + λ)],    dπ_z/dP_− ∈ [1 − p, (1 − p)(1 + λ)].

We note f̂ the solution obtained by minimizing L^{hKR}_λ on a set of labeled examples, and f* the solution of Eq. (5). We do not assume that the solution found is optimal (i.e. f̂ ≠ f*), but we do assume that f̂ is 1-Lipschitz. Given a function f, a classifier based on sign(f) and an example x, an adversarial example is defined as follows:

adv(f, x) = argmin_{z∈Ω : sign(f(z)) = −sign(f(x))} ||x − z||.    (8)

According to the 1-Lipschitz property of f̂ we have

|f̂(x)| ≤ |f̂(x) − f̂(adv(f̂, x))| ≤ ||x − adv(f̂, x)||.    (9)

So |f̂(x)| is a lower bound of the distance of x to the separating boundary defined by f̂, and thus a lower bound on the robustness to L2 adversarial attacks. By minimizing E((1 − y f(x))_+) we maximize the accuracy of the classifier, and by maximizing the discrepancy of the images of P_+ and P_− with respect to f we maximize the robustness with respect to adversarial attacks. The proposition below establishes that the gradient of the optimal function with respect to Eq. (5) has norm 1 almost surely, as in the unregularized case (Eq. (4)).
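In practice, this is the loss minimized over a 1-Lipschitz network. A minimal sketch of Eq. (5), completing the `kr_loss` helper introduced above (again our own rendering, with the unit margin and λ as in the text):

```python
import tensorflow as tf

def hkr_loss(y_true, y_pred, lam=10.0):
    """Hinge-regularized KR loss L^hKR_lambda of Eq. (5).

    Minimizing this loss with a 1-Lipschitz network both maximizes the
    KR dual objective and penalizes margin violations y * f(x) < 1.
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    hinge = tf.reduce_mean(tf.nn.relu(1.0 - y_true * y_pred))  # E(1 - y f(x))_+
    return kr_loss(y_true, y_pred) + lam * hinge
```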
Proposition 1. Let π be the optimal measure of the dual version (7) of the hinge regularized optimal transport problem. Suppose that it is absolutely continuous with respect to the Lebesgue measure. Then there exists at least one solution f* of (7) such that ||∇f*|| = 1 almost surely.

Furthermore, empirical results suggest that, given x, the image tr_{f*}(x) of x under the transportation plan and adv(x) lie in the same direction with respect to x, and that this direction is −∇_x f*(x). Combining this direction with Eq. (9), we will show empirically (Sect. 5) that

adv(x) ≈ x − c_x · f*(x) · ∇_x f*(x)   and   tr_{f*}(x) ≈ x − c'_x · f*(x) · ∇_x f*(x),   with 1 ≤ c_x ≤ c'_x ∈ R.

This provides a strong link between the adversarial attack and the optimal transport for the proposed classifier. The next proposition shows that, if the classes are well separated, maximizing the hinge-KR loss leads to a perfect classifier.

Proposition 2 (Separable classes). Set P_+, P_− ∈ P(Ω) such that P(Y = +1) = P(Y = −1), let λ > 0, and suppose that there exists ε > 0 such that

||x − z|| > 2ε   dP_+ × dP_−-almost surely.    (10)

Then for each

f_λ ∈ arg sup_{f∈Lip_{1/ε}(Ω)} ∫_Ω f (dP_+ − dP_−) − λ ( ∫_Ω (1 − f)_+ dP_+ + ∫_Ω (1 + f)_+ dP_− ),    (11)

it is satisfied that L(f_λ) = 0, i.e. the hinge term vanishes and f_λ makes no classification error. Furthermore, if ε ≥ 1 then f_λ is an optimal transport potential from P_+ to P_− for the cost ||x − z||.

We show in Fig. 2, on the two moons problem, that in contrast to the vanilla classifier based on the Wasserstein distance (Eq. (2)), the proposed approach yields non-overlapping distributions of f̂ conditionally to the classes (Fig. 2a). In the same way, the 0-level cut of f̂ (Fig. 2b) is a nearly optimal classification boundary. Moreover, the level cut of f̂, on the support of the distributions, is close to the distance to this classification boundary.

4 Enforcing the 1-Lipschitz constraint

In order to build a deep learning classifier based on the hinge-KR optimization problem (Eq. (5)), we have to constrain the Lipschitz constant of the neural network to be equal to 1. Even if the control of the Lipschitz constant of a neural network is a key step to guarantee some robustness [6], it is known that evaluating it exactly is an NP-hard problem [31]. The simplest strategy is to constrain each layer of the network to be 1-Lipschitz, which ensures that the Lipschitz constant of the composition of the functions will be less than or equal to one. Most common activation functions, such as ReLU or sigmoid, are 1-Lipschitz. In the case of a dense layer, constraints can be applied to its weights. Given a dense layer with weights W, it is commonly admitted that:

L(W) = ||W||_2 ≤ ||W||_F ≤ max_{ij}(|W_{ij}|) · √(nm)    (12)

where ||W||_2 is the spectral norm and ||W||_F is the Frobenius norm. The initial version of WGAN [2] consisted of clipping the weights of the networks. However, this is a very crude way to upper-bound the Lipschitz constant (last term in Eq. (12)). Normalizing by the Frobenius norm has also been proposed in [25]. In this paper, we use spectral normalization as proposed in [19], since the spectral norm is equal to the Lipschitz constant. At the inference step, we normalize the weights of each layer by dividing them by the spectral norm of the matrix. The spectral norm is computed by iteratively evaluating the largest singular value with the power iteration algorithm [11]. This is done during the forward step and taken into account for the gradient computation.
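As an illustration of this normalization step, here is a minimal NumPy sketch of the power iteration (our own code, not the DEEL.LIP implementation; the helper name and the default of 3 iterations are assumptions):

```python
import numpy as np

def spectral_normalize(W, n_iter=3, eps=1e-12):
    """Divide W by an estimate of its spectral norm ||W||_2.

    The largest singular value is estimated with power iteration [11];
    after normalization, the dense layer x -> W x is (approximately)
    1-Lipschitz for the L2 norm.
    """
    u = np.random.randn(W.shape[1])
    for _ in range(n_iter):
        v = W @ u
        v /= np.linalg.norm(v) + eps
        u = W.T @ v
        u /= np.linalg.norm(u) + eps
    sigma = v @ W @ u   # Rayleigh quotient: approx. largest singular value
    return W / sigma
```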
In the case of 2D convolutional layers, normalizing by the spectral norm of the convolution kernel is not enough and a supplementary multiplicative constant Λ is required (the regularization is then done by dividing W by Λ·||W||_2). Given a convolutional layer with a kernel size equal to k = 2k̄ + 1, the coefficient Λ can be estimated, as in [6], as the square root of the maximum number of duplications of the input matrix values: since each input can be used in at most k² positions, choosing Λ = k guarantees the convolutional layer to be 1-Lipschitz. However, due to the effect of the zero padding, the constant Λ is overestimated and the real Lipschitz constant is lower than 1, especially when the size of the image is small. When the convolutional network is very deep, this heavily decreases the Lipschitz constant of the neural network. To mitigate this effect, we propose, for zero padding, a tighter estimation of Λ, computing the average duplication factor of the non-zero-padded values in the feature map:

Λ = √( (k·w − k̄(k̄+1)) · (k·h − k̄(k̄+1)) / (h·w) ).    (13)

Even if this constant doesn't provide a strict upper bound of the Lipschitz constant (for instance, when the highest values are located in the center of the picture), it behaves very well empirically (see Figure 3b for instance). Convolutions with stride, pooling layers, detailed explanations and demonstrations are discussed in Appendix C.3.
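The factor of Eq. (13) only depends on the kernel and feature-map sizes; a small sketch (the helper name is ours) with a sanity check on MNIST-sized inputs:

```python
import math

def padding_corrected_factor(k, w, h):
    """Normalization factor Lambda of Eq. (13) for a k x k 'same'
    convolution with zero padding on a w x h feature map.

    The kernel W is then rescaled as W / (Lambda * ||W||_2).
    k must be odd, k = 2 * k_bar + 1.
    """
    k_bar = (k - 1) // 2
    dup_w = k * w - k_bar * (k_bar + 1)   # non-zero-padded uses, width-wise
    dup_h = k * h - k_bar * (k_bar + 1)   # idem, height-wise
    return math.sqrt(dup_w * dup_h / (h * w))

# Example: a 3x3 convolution on a 28x28 feature map gives
# Lambda = 82/28 ~ 2.93, instead of the worst-case value k = 3.
```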
As shown in Section 3.2, the optimal function f* with respect to Eq. (5) verifies ||∇f*|| = 1 almost surely. In [13], the authors propose to add to the loss function a regularization term on the average gradient norm with respect to the inputs. However, the estimation of this value is difficult and a regularization term doesn't guarantee the property. In this paper, we apply the approach described in [1], based on the use of specific activation functions and a process of normalization of the weights. Two norm-preserving activation functions are proposed: i) GroupSort2: order the vector by pairs; ii) FullSort: order the full vector. These functions are vector-wise rather than element-wise. We also use P-norm pooling [4] with P = 2, which is a norm-preserving average pooling. Concerning linear functions, a weight matrix W is norm preserving if and only if all the singular values of W are equal to 1. In [1], the authors propose to use the Björck orthonormalization algorithm [3]. This algorithm is fully differentiable and, as for spectral normalization, is applied during the forward inference and taken into account for back-propagation (see Appendix C.4 for details). We also developed a full TensorFlow [8] implementation in an open source library, called DEEL.LIP (https://github.com/deel-ai/deel-lip, to be published soon), that enables the training of k-Lipschitz neural networks and their export as standard layers for inference.
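For illustration, here are minimal TensorFlow sketches of the GroupSort2 activation and of the Björck orthonormalization iteration. These are our own renderings of [1] and [3], not the DEEL.LIP code; the first-order iteration with β = 1/2 and a 2-D input for the activation are assumptions.

```python
import tensorflow as tf

def group_sort_2(x):
    """GroupSort2 activation [1]: sort the coordinates of x by pairs.

    x is assumed to have shape (batch, n) with n even. The operation is
    1-Lipschitz and gradient-norm preserving, unlike ReLU.
    """
    b, n = tf.shape(x)[0], tf.shape(x)[1]
    pairs = tf.reshape(x, (b, n // 2, 2))
    return tf.reshape(tf.sort(pairs, axis=-1), (b, n))

def bjorck_orthonormalize(w, n_iter=15, beta=0.5):
    """First-order Bjorck orthonormalization [3], as used in [1]:
    w <- (1 + beta) w - beta w w^T w. The singular values of w converge
    to 1, provided ||w||_2 <= 1 beforehand, which is why the paper
    applies it after spectral normalization."""
    for _ in range(n_iter):
        w = (1.0 + beta) * w - beta * (w @ tf.transpose(w) @ w)
    return w
```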
5 Experiments

In the experiments, we compare three different approaches: i) a classical log-entropy classifier (MLP/CNN); ii) a 1-Lipschitz log-entropy classifier (1LIP-MLP/1LIP-CNN), in the spirit of Parseval networks; iii) the hinge-KR classifier (hKR-MLP/hKR-CNN). In order to have a fair comparison, all the classifiers share the same dense or convolutional architecture. For the 1-Lipschitz log-entropy classifiers, we perform spectral normalization with 3 power iteration steps and we use the ReLU activation (gradient preserving activations and pooling make no sense in this case, and Björck orthonormalization doesn't improve the results). For the hinge-KR classifiers, we apply Björck orthonormalization (15 steps with p = 1) after the spectral normalization (this improves the convergence of the Björck algorithm). We use the FullSort activation for the dense layers and GroupSort2 for the convolutional ones. The full description of the architecture, the optimization process, and the influence of each parameter are given in Appendix D.1.

For adversarial robustness estimation, since the hinge-KR primal problem is linked to the L2 distance, we focus on L2-based adversarial attacks using the DeepFool framework [20]. For each type of neural network, 500 attacks are carried out on the test sets (using the foolbox library [23]), storing for each image the output value and the L2 norm of the noise required to fool the network (up to a 50/50 score). The output value is either the last dense layer output before the sigmoid activation for the MLP/CNN and 1LIP-MLP/1LIP-CNN networks (commonly called the logit layer), or the last layer output for the hKR-MLP/hKR-CNN classifiers.

We consider two binary classification problems. The first one is the separation of the digits 0 and 8 on the MNIST dataset (balanced classes, 10596 training and 1954 test samples). We choose this particular pair of digits because they are the hardest to separate. In the second problem, we consider the separation between male people with or without mustaches on the CelebA dataset [18] (unbalanced classes, 11779 training and 11036 test samples).

MNIST 0-8
  Model      Loss                    Accuracy (%)   Avg. L2 noise
  MLP        binary cross-entropy    99.6           4.47
  1LIP-MLP   binary cross-entropy    99.5           6.29
  hKR-MLP    L^hKR_λ (λ = 10)        99.0           7.19
  hKR-MLP    L^hKR_λ (λ = 50)        99.2

CelebA Moustache
  Model      Loss                    Accuracy (%)   Avg. L2 noise
  CNN        binary cross-entropy    92.6           0.45
  1LIP-CNN   binary cross-entropy    92.4           0.27
  hKR-CNN    L^hKR_λ (λ = 20)        90.9

Table 1: Accuracy and robustness to adversarial attacks (average L2 norm of the adversarial noise) for each classifier.

Table 1 compares the classification accuracy and the adversarial robustness results for the three types of classifiers. The drop in accuracy for the proposed solution, compared to classical networks, is less than one point, even for a complex task such as the CelebA moustache classification. The last column of the table compares the average L2 norm of the adversarial noise for the three types of classifiers. On the MNIST dataset, the improvement of robustness with 1-Lipschitz layers is significant, with a clear advantage for our approach. When considering the CelebA problem, using 1-Lipschitz layers with log-entropy is not enough (the logit values tend to be small). The proposed hKR approach leads to a classifier that is up to 10 times more robust than its competitors, with an acceptable loss of accuracy.

In Fig. 3, we compare, on the MNIST 0-8 dataset, the L2 norm of adversarial samples with respect to the output of the hKR neural networks (and the logit values for the 1LIP network). Figures 3a and 3b show that, with our proposed learning rule, the fooling noise for a given input is linearly linked to the output (slope of one for the MLP, and around 1.1 for the CNN), which confirms that the Lipschitz constants of our networks are very close to (but lower than) 1. We also observe that, for a given input sample, the network prediction represents a very tight lower bound of the adversarial robustness of the network for this sample. In contrast, for the 1LIP-MLP (Fig. 3c), the 1-Lipschitz constraint is also guaranteed, but the output of the network can be far lower than the adversarial noise norm. While on some samples at the tail end of the distribution the L2 norm of the found noise can reach 30, the average adversarial robustness is lower than for the proposed solution (green vertical lines).

Besides the level of noise, Figure 4 empirically shows that, in contrast to the other approaches, the noise required to change the output class for the hKR classifier is highly structured and interpretable. Thus, on the MNIST 0-8 dataset, in order to fool the proposed learned network with DeepFool, a 0-image, for instance, is explicitly transformed into a mix between an 8 and a 0. Even for a more complex task, such as the mustache classifier on the CelebA dataset, Fig. 5 shows that fooling an image of a male without (resp. with) a mustache requires the addition (resp. removal) of dark pixels around the nasolabial fold. We also build counterfactual examples (line d) by applying the scheme proposed in Section 3.2 (i.e. counter(f̂, x) ≈ x − c_x · f̂(x) · ∇_x f̂(x)).
For each sample, we choose c_x to obtain a visually satisfactory counterfactual (and not one at the classification boundary, as for adversarial examples). Although the differential images are not as precise as the ones obtained with the adversarial approach, they clearly focus on the meaningful parts of the images and provide, in our opinion, convincing counterfactual explanations. Remark that this approach is closely related to saliency maps. However, the results obtained with our network are far more precise than the ones obtained with classical networks.

[Figure 3: Comparison of the L2 norm of the fooling noise (Y-axis) with the output value (X-axis) for (a) hKR-MLP, (b) hKR-CNN, (c) 1LIP-CNN. Green dots: noise norm for samples (green line: average value); red line: minimal possible noise (1-Lipschitz bound); blue histogram: output (resp. logit for 1LIP-CNN) distribution (blue line: average value).]

[Figure 4: DeepFool adversarial examples on the MNIST 0-8 dataset: source image, fooled image, differential image (yellow (resp. cyan): increase (resp. decrease) of the grey level), for (a) an MLP with sigmoid and binary cross-entropy, (b) a Lipschitz MLP with sigmoid and binary cross-entropy, (c) a Lipschitz MLP with the proposed hinge-KR loss.]
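For reference, a minimal sketch of the counterfactual scheme of Section 3.2 used above (our own code; `model` is assumed to be a scalar-output, Keras-style, 1-Lipschitz classifier):

```python
import tensorflow as tf

def counterfactual(model, x, c=2.0):
    """Counterfactual scheme of Section 3.2:
    counter(f, x) ~ x - c * f(x) * grad_x f(x).

    c plays the role of c_x: c ~ 1 lands near the decision boundary
    (the adversarial case), larger values give more visible changes.
    """
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)                     # f(x), shape (batch, 1)
    grad = tape.gradient(y, x)           # grad_x f(x), same shape as x
    # Broadcast c * f(x) over the spatial dimensions of x.
    step = tf.reshape(c * y, [-1] + [1] * (x.shape.rank - 1))
    # For the optimal f*, ||grad|| = 1 a.s., so the step length is c * |f(x)|.
    return x - step * grad
```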
6 Conclusion

This paper presents a novel binary classification framework and the associated deep learning process. Besides the interpretation of the classification task as a regularized optimal transport problem, we demonstrate that this new formalization has valuable properties regarding error bounds and robustness to adversarial attacks. We also propose a systematic approach to enforce the 1-Lipschitz constraint of a neural network. This includes a state-of-the-art regularization algorithm and a more precise constant evaluation for convolutional and pooling layers. Even if this regularization process can increase the computation time during learning, it doesn't impact inference. We developed an open source Python library based on TensorFlow for 1-Lipschitz layers and gradient preserving activation and pooling functions. This makes the approach very easy to implement and to use.

The experiments emphasize the theoretical results and confirm that the classifier has good and predictable robustness to adversarial attacks, with an acceptable cost on accuracy. We also show that our classifier forces adversarial attacks to explicitly modify the input. Moreover, we show that we can easily build counterfactual explanations. This suggests that, with our novel classification problem, the adversarial attack is linked to the optimal transportation.

In conclusion, we believe that this classification framework based on optimal transport is of great interest for critical problems, since it provides both empirical and theoretical guarantees. Future work will focus on the multiclass counterpart of the approach and on its applicability to large and deep networks.

[Figure 5: (a-c) DeepFool adversarial examples on the CelebA dataset: two left (resp. right) triplets of images without (resp. with) mustache. Each triplet consists of the source image, the fooled image, and the differential image of the V channel in HSV colorspace (yellow (resp. cyan): increase (resp. decrease) of the V channel level), with a common color scale for all settings: (a) fooling CelebA images, classical network; (b) fooling images, 1-Lipschitz network (binary cross-entropy); (c) fooling images with the proposed hinge-KR classifier. (d) Counterfactuals for the hinge-KR classifier using the gradient with, respectively, c_x = 2, …; the last images of the triplets are ∇_x f̂(x) with the same color representation as the differential images.]

Acknowledgments and Disclosure of Funding
This project received funding from the French Investing for the Future PIA3 program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL project.

References

[1] Cem Anil, James Lucas, and Roger Grosse. Sorting out Lipschitz function approximation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 291–301, Long Beach, California, USA, 2019. PMLR.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, Sydney, Australia, 2017. PMLR.
[3] Åke Björck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971.
[4] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 111–118, 2010.
[5] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, 2010.
[6] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv:1704.08847, 2017.
[7] Mélanie Ducoffe, Sébastien Gerchinovitz, and Jayant Sen Gupta. A high-probability safety guarantee for shifted neural network surrogates. In SafeAI, 2020.
[8] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[9] Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017.
[10] Wilfrid Gangbo and Robert McCann. The geometry of optimal transportation. Acta Mathematica, 177:113–161, 1996.
[11] Gene H. Golub and Henk A. van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1):35–65, 2000.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[13] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
[16] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. arXiv:1705.08475, 2017.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[19] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ArXiv, abs/1802.05957, 2018.
[20] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. arXiv:1511.04599, 2015.
[21] Hajime Ono, Tsubasa Takahashi, and Kazuya Kakizaki. Lightweight Lipschitz margin training for certified defense against adversarial examples. arXiv:1811.08080, 2018.
[22] Haifeng Qian and Mark N. Wegman. L2-nonexpansive neural networks. In International Conference on Learning Representations, 2019.
[23] Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv:1707.04131, 2018.
[24] R. Tyrrell Rockafellar and Roger Wets. Variational Analysis, volume 317. Springer, 2004.
[25] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868, 2016.
[26] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
[27] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, 2013.
[28] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
[29] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2018.
[30] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.
[31] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3835–3844. Curran Associates, Inc., 2018.
[32] Sandra Wachter, Brent D. Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. CoRR, abs/1711.00399, 2017.
A Optimal transportation: the discrete case
A.1 Optimal transport
When considering a limited number of samples for the two distributions, the computation of the Wasserstein distance can be performed through linear programming. In the balanced case, we have X = {x_1, ..., x_{2n}} where {x_1, ..., x_n} are sampled from P_+ and {x_{n+1}, ..., x_{2n}} are sampled from P_−. We note U = {u_1, ..., u_{2n}} the labels, with u_1, ..., u_n = 1 and u_{n+1}, ..., u_{2n} = −1, and C the n × n cost matrix with C_{i,j} = ||x_i − x_{n+j}||. The primal problem of the optimal transport is to find a transportation plan Π (an n × n matrix) such that:

min_Π Σ_{i,j} Π_{i,j} C_{i,j}    (14)

subject to Π_{i,j} ≥ 0,    (15)

Σ_i Π_{i,j} = 1/n,    Σ_j Π_{i,j} = 1/n.    (16)

The constraints enforce Π to be a discrete joint probability distribution with the appropriate marginals, as in the continuous case. The dual formulation of the discrete optimal transport problem is:

max_F F · U^T    (17)

subject to ∀(i, j), F_i − F_{n+j} ≤ C_{i,j}    (18)

where F is a 2n vector that is a discrete version of the function f of Eq. (1b). The constraint on F is the discrete counterpart of the 1-Lipschitz constraint.
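This linear program can be solved directly; a small sketch using the POT library [9] on synthetic Gaussian samples (the sample sizes and distributions here are illustrative):

```python
import numpy as np
import ot  # POT: Python Optimal Transport [9]

rng = np.random.default_rng(0)
n = 100
xp = rng.normal(loc=+1.0, size=(n, 2))    # samples from P+
xm = rng.normal(loc=-1.0, size=(n, 2))    # samples from P-

a = np.full(n, 1.0 / n)                   # uniform marginals, Eq. (16)
b = np.full(n, 1.0 / n)
C = ot.dist(xp, xm, metric='euclidean')   # cost matrix C_ij = ||x_i - x_{n+j}||

P = ot.emd(a, b, C)                       # optimal plan of the LP (14)-(16)
w1 = float(np.sum(P * C))                 # estimated Wasserstein-1 distance
```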
A.2 Hinge regularized optimal transport

Similarly to the classical case, the discrete counterpart of the regularized Wasserstein distance is also a transportation problem, with the following formulation:

min_Π Σ_{i,j} Π_{i,j} C_{i,j} − 2 Σ_{i,j} Π_{i,j}

subject to Π_{i,j} ≥ 0,    1/n ≤ Σ_i Π_{i,j} ≤ (1 + λ)/n,    1/n ≤ Σ_j Π_{i,j} ≤ (1 + λ)/n.

Roughly speaking, this allows giving more weight to the transportation of the closest pairs, by admitting deviations from the marginals with a tolerance that depends on λ. Since the closest pairs in the two distributions are the hardest to classify, this illustrates why this formulation is more adequate for a classification problem. The dual formulation of this transportation problem is a discrete counterpart of Equation (5):

max_F (1/2n) Σ_{k=1}^{2n} [ F_k u_k − λ max(0, 1 − F_k u_k) ]

subject to ∀(i, j), F_i − F_{n+j} ≤ C_{i,j}.

We observe that the constraints in the dual problem are not affected by the new formulation and still correspond to the 1-Lipschitz constraint of the continuous case.
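The relaxed-marginal problem is still a linear program. The sketch below encodes it with SciPy; it is illustrative only: the marginal bounds 1/n and (1 + λ)/n, and the objective, follow the reconstruction above and should be checked against the original formulation.

```python
import numpy as np
from scipy.optimize import linprog

def hinge_kr_plan(C, lam):
    """Relaxed-marginal transport LP of Appendix A.2 (illustrative):
    minimize <P, C> - 2 sum(P) over P >= 0, with every row and column
    sum of P constrained to lie in [1/n, (1 + lam)/n]."""
    n = C.shape[0]
    c = (C - 2.0).ravel()                   # vectorized objective
    rows = np.kron(np.eye(n), np.ones(n))   # row sums of vec(P)
    cols = np.kron(np.ones(n), np.eye(n))   # column sums of vec(P)
    A = np.vstack([rows, cols, -rows, -cols])
    b = np.concatenate([np.full(2 * n, (1.0 + lam) / n),   # upper bounds
                        np.full(2 * n, -1.0 / n)])         # lower bounds
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x.reshape(n, n)
```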
B Theorem proofs
B.1 Proof of Theorem 1
We denote

f* := f*_λ ∈ arg min_{f∈Lip_1(Ω)} L^{hKR}_λ(f)   and   f̂_n := f̂_{n,λ} ∈ arg min_{f∈Lip_1(Ω)} L̂^{hKR}_{λ,n}(f).    (19)

If we assume that (6) does not hold, then there exists x ∈ Ω such that f*(x) > (1 + diam(Ω) + L(ψ)) / inf(p, 1 − p), or f*(x) < −(1 + diam(Ω) + L(ψ)) / inf(p, 1 − p). We suppose without loss of generality that the first inequality holds. If z ∈ Ω, then the 1-Lipschitz condition on f* implies that f*(z) > 1 + L(ψ)/(1 − p). Hence (1 − f*)_+ = 0 and

L(f*) ≤ sup_{g∈Lip_1(Ω)} [ E_{X|Y=1}(g(X)) − E_{X|Y=−1}(g(X)) ] − E{ λ(1 − Y f*(X))_+ }
= L(ψ) − λ{ p E_{X|Y=1}(1 − f*(X))_+ + (1 − p) E_{X|Y=−1}(1 + f*(X))_+ }
≤ L(ψ) − λ(1 − p) E_{X|Y=−1}(1 + f*(X))
≤ L(ψ) − λ(1 − p) ( 2 + L(ψ)/(1 − p) )
= L(ψ) − 2λ(1 − p) − λ L(ψ).

Then f* cannot be an optimal solution of the problem (19). Hence there exists some constant M, large enough, such that any optimal f* belongs to Lip_{1,M}(Ω) := {f ∈ Lip_1(Ω) : ||f||_∞ ≤ M}. Since the functional L^{hKR}_λ is convex and Lip_{1,M}(Ω) is compact in C(Ω), we can make use of the Ascoli-Arzelà theorem and conclude that there exists at least one function minimizing the expected loss. Furthermore, the set of those functions is compact and convex.

B.2 Proof of Theorem 2

Definition B.1.
Let µ, ν be two positive measures on R^d. The Kullback-Leibler divergence from µ to ν is defined as

KL(µ|ν) = ∫ log(dµ/dν) dµ − ∫ dµ + ∫ dν   if µ << ν,   and ∞ otherwise.    (20)

Theorem 3.
Let φ_1, φ_2 : Ω → R̄ be lower semicontinuous convex functions and let µ, ν ∈ P(Ω) be probability measures. Then for all ε > 0 the following equality holds:

inf_{π∈Π_+(µ,ν)} ∫ φ_1(−dπ_x/dµ) dµ(x) + ∫ φ_2(−dπ_z/dν) dν(z) + ε KL( π | e^{−c(x,z)/ε} (dµ × dν) )
= sup_{f,g∈L_1(Ω)} −∫_Ω φ*_1(f(x)) dµ(x) − ∫_Ω φ*_2(g(z)) dν(z) − ε ∫ ( e^{(f(x)+g(z)−c(x,z))/ε} − e^{−c(x,z)/ε} ) dµ dν.    (21)

Furthermore, if ε = 0 then

inf_{π∈Π_+(µ,ν)} ∫_{Ω×Ω} c(x,z) dπ(x,z) + ∫ φ_1(−dπ_x/dµ) dµ(x) + ∫ φ_2(−dπ_z/dν) dν(z)
= sup_{(f,g)∈Φ(µ,ν)} −∫_Ω φ*_1(f(x)) dµ(x) − ∫_Ω φ*_2(g(z)) dν(z),    (22)

where Π_+(µ, ν) is the set of positive measures π ∈ M_+(Ω × Ω) which are absolutely continuous with respect to the joint measure dµ × dν, and Φ(µ, ν) consists of the pairs of functions (f, g) ∈ L_1(Ω) × L_1(Ω) such that c(x,z) − f(x) − g(z) ≥ 0, dµ × dν-a.s.

First we recall the Fenchel-Rockafellar duality result; we use a weaker version given in Theorem 1.12 of [5].

Proposition 3.
Let E be a Banach space and Υ, Ψ : E → R ∪ {∞} be two convex functions, and assume that there exists z_0 ∈ dom(Ψ) ∩ dom(Υ) such that Ψ is continuous at z_0. Then strong duality holds:

inf_{a∈E} { Υ(a) + Ψ(a) } = sup_{b∈E*} { −Υ*(−b) − Ψ*(b) }.    (23)

We identify the elements of our problem with those of the previous proposition.

• E is the space of continuous functions on Ω × Ω. Note that Ω is bounded, hence E*, by the Riesz theorem, is the set of regular measures on Ω × Ω.

• If ε = 0:

Ψ_0(u) = 0 if u(x,z) ≥ −c(x,z), and ∞ otherwise;    (24)

Υ_0(u) = ∫ φ*_1(−f(x)) dµ(x) + ∫ φ*_2(−g(z)) dν(z) if u(x,z) = f(x) + g(z), and ∞ otherwise.    (25)

If ε > 0:

Ψ_ε(u) = ε ∫ ( e^{(u(x,z)−c(x,z))/ε} − e^{−c(x,z)/ε} ) dµ(x) dν(z),    (26)

Υ_ε(u) = Υ_0(u).    (27)

Note that Υ_ε(u) = Υ_0(u) could be ill defined; to avoid this situation we fix z_0 ∈ Ω and consider the decomposition u(x,z) = (u(x,z_0) − u(z_0,z_0)/2) + (u(z_0,z) − u(z_0,z_0)/2). Now we compute the dual operators:

Ψ*_ε(−π) = sup_{u∈E} { −ε ∫ ( e^{(u(x,z)−c(x,z))/ε} − e^{−c(x,z)/ε} ) dµ(x) dν(z) − ∫ u(x,z) dπ(x,z) }.

If π were not absolutely continuous with respect to the joint measure e^{−c(x,z)/ε} (dµ × dν), then there would exist a continuous function u with u(x,z) = 0 dµ × dν-almost surely and ∫ u dπ ≠ 0; taking the function λu and letting λ tend to ±∞, we deduce that the supremum is ∞. Suppose then that dπ = m(x,z) e^{−c(x,z)/ε} (dµ × dν). Then

Ψ*_ε(−π) = sup_{u∈E} { ε ∫ ( −e^{u(x,z)/ε} + 1 + (u(x,z)/ε) m(x,z) ) e^{−c(x,z)/ε} dµ(x) dν(z) } = ε KL( π | e^{−c(x,z)/ε} (dµ × dν) ),

and Ψ*_ε(−π) = ∞ when π has no such density. With similar calculations we compute, for ε = 0:

Ψ*_0(−π) = ∫ c(x,z) dπ(x,z) if π is a positive measure, and ∞ otherwise.

Finally, for Υ*_ε = Υ*_0:

Υ*_ε(π) = sup_{u∈E, u(x,z)=f(x)+g(z)} { ∫ (f(x) + g(z)) dπ(x,z) − ∫ φ*_1(−f(x)) dµ(x) − ∫ φ*_2(−g(z)) dν(z) }
= sup_{f∈C(Ω)} { ∫ f(x) dπ_x(x) − ∫ φ*_1(−f(x)) dµ(x) } + sup_{g∈C(Ω)} { ∫ g(z) dπ_z(z) − ∫ φ*_2(−g(z)) dν(z) }
= (I_1) + (I_2).

We first consider (I_1); the same reasoning holds for (I_2). If π_x were not absolutely continuous with respect to µ then, reasoning as before, we obtain ∞. So dπ_x = (dπ_x/dµ) dµ and

(I_1) = sup_{f∈C(Ω)} ∫ ( −f(x) (dπ_x/dµ) − φ*_1(f(x)) ) dµ(x) = ∫ sup_m { −(dπ_x/dµ) m − φ*_1(m) } dµ(x) = ∫ φ_1(−dπ_x/dµ) dµ(x),

and similarly (I_2) = ∫ φ_2(−dπ_z/dν) dν(z). The inversion of the supremum and the integral is justified here since (x, m) ↦ −m (dπ_x/dµ)(x) + φ*_1(m) is lower semicontinuous and convex in m and measurable in (x, m); it is therefore a normal integrand, and we can apply Theorem 14.60 in [24].

Computing both sides of Equation (23) for ε = 0, we end up with

inf_{f(x)+g(z) ≥ −c(x,z)} { ∫ φ*_1(−f(x)) dµ(x) + ∫ φ*_2(−g(z)) dν(z) } = −sup_{f(x)+g(z) ≤ c(x,z)} { −∫ φ*_1(f(x)) dµ(x) − ∫ φ*_2(g(z)) dν(z) }

on the dual side, and

sup_{π∈M_+(Ω×Ω)} { −∫ c(x,z) dπ(x,z) − ∫ φ_1(−dπ_x/dµ) dµ(x) − ∫ φ_2(−dπ_z/dν) dν(z) } = −inf_{π∈M_+(Ω×Ω)} { ∫ c(x,z) dπ(x,z) + ∫ φ_1(−dπ_x/dµ) dµ(x) + ∫ φ_2(−dπ_z/dν) dν(z) }

on the primal side, which proves (22). For ε > 0, the same identification gives

−sup_{f,g} { −ε ∫ ( e^{(f(x)+g(z)−c(x,z))/ε} − e^{−c(x,z)/ε} ) dµ dν − ∫ φ*_1(f(x)) dµ(x) − ∫ φ*_2(g(z)) dν(z) }
= −inf_{π∈M_+(Ω×Ω)} { ε KL( π | e^{−c(x,z)/ε} (dµ × dν) ) + ∫ φ_1(−dπ_x/dµ) dµ(x) + ∫ φ_2(−dπ_z/dν) dν(z) },

which proves (21).

Proof of Theorem 2. With the same notation as in Theorem 3, it is enough to consider µ = P_+, ν = P_− and

φ_1(s) = 2p − s if s ∈ [p, p(1 + λ)], and ∞ otherwise;    (28)

φ_2(s) = 2(1 − p) − s if s ∈ [1 − p, (1 − p)(1 + λ)], and ∞ otherwise.    (29)

Then for each f ∈ L_1(dµ) and g ∈ L_1(dν),

−φ*_1(f(x)) = −sup_s {−φ_1(s) + f(x)s} = inf_s {φ_1(−s) − f(x)s} = inf_s {φ_1(s) + f(x)s} = f(x) if f(x) ≥ 1, and f(x) − pλ(1 − f(x)) otherwise;

that is, −φ*_1(f(x)) = f(x) − pλ(1 − f(x))_+, and similarly −φ*_2(g(z)) = g(z) − (1 − p)λ(1 − g(z))_+.

Note that when λ ≥ 0 the functions r ↦ h_1(r) := r − pλ(1 − r)_+ and r ↦ h_2(r) := r − (1 − p)λ(1 − r)_+ are nondecreasing. If we denote by J the right-hand side of (22), then

J = sup_{(f,g)∈Φ(µ,ν)} ∫_Ω h_1(f(x)) dµ(x) + ∫_Ω h_2(g(z)) dν(z).

We denote by f^d the d-conjugate of f, defined as f^d(r) := inf_{s∈Ω} { ||r − s|| − f(s) }; see for instance [10] for a suitable definition. It is clear that f^{dd} ≥ f, and equality holds if f is a d-concave function, where f is said to be d-concave if it is the d-conjugate of another function. Hence, using the nondecreasing property of h_1 and h_2, we get

J = sup_{f^{dd}, f^d} ∫_Ω h_1(f^{dd}(x)) dµ(x) + ∫_Ω h_2(f^d(z)) dν(z).

On the other side, f^d(r) = inf_{s∈Ω} { ||r − s|| − f(s) } is a limit of a sequence of 1-Lipschitz functions on Ω, hence it belongs to Lip_1(Ω). Using the 1-Lipschitz property and taking r = s in the infimum leads to

−f^d(r) ≤ inf_{s∈Ω} { ||r − s|| − f^d(s) } ≤ −f^d(r).

This means that f^{dd} = −f^d, hence

J = sup_{(−f^d, f^d)} ∫_Ω h_1(f^{dd}(x)) dµ(x) + ∫_Ω h_2(f^d(z)) dν(z) ≤ sup_{f∈Lip_1(Ω)} ∫_Ω h_1(f(x)) dµ(x) + ∫_Ω h_2(−f(z)) dν(z) ≤ J,

where the last inequality comes from the fact that if f ∈ Lip_1(Ω) then (f, −f) ∈ Φ(µ, ν).

B.3 Proof of Proposition 2
As a direct consequence of Theorem 2, we derive the following equality:

inf_{π∈Π_λ(P_+,P_−)} ∫_{Ω×Ω} ( (1/ε) ||x − z|| − 2 ) dπ + 2
= sup_{f∈Lip_{1/ε}(Ω)} ∫_Ω f (dP_+ − dP_−) − λ ( ∫_Ω (1 − f)_+ dP_+ + ∫_Ω (1 + f)_+ dP_− ).    (30)

We denote by I the left-hand side of (30) and by Π(P_+, P_−) the set of measures with marginals P_+ and P_−. Now, using hypothesis (10), we derive the following equality:

I = inf_{π∈Π(P_+,P_−)} ∫_{Ω×Ω} ( (1/ε) ||x − z|| − 2 ) dπ + 2 = (1/ε) W(P_+, P_−).

Since (1/ε) W(P_+, P_−) = sup_{f∈Lip_{1/ε}(Ω)} ∫_Ω f (dP_+ − dP_−), we denote by ψ_ε ∈ Lip_{1/ε}(Ω) the function where this supremum is achieved. Hence we derive the following inequality:

(1/ε) W(P_+, P_−) = ∫_Ω f_λ (dP_+ − dP_−) − λ ( ∫_Ω (1 − f_λ)_+ dP_+ + ∫_Ω (1 + f_λ)_+ dP_− )
≤ ∫_Ω ψ_ε (dP_+ − dP_−) − λ ( ∫_Ω (1 − f_λ)_+ dP_+ + ∫_Ω (1 + f_λ)_+ dP_− )
= (1/ε) W(P_+, P_−) − λ ( ∫_Ω (1 − f_λ)_+ dP_+ + ∫_Ω (1 + f_λ)_+ dP_− ).

Therefore ∫_Ω (1 − f_λ)_+ dP_+ + ∫_Ω (1 + f_λ)_+ dP_− = 0, and the first assertion of the proof is complete. The second assertion is a straightforward consequence of the previous one.

B.4 Proof of Proposition 1
Even though the proof of Proposition 1 could be done following the lines of the proof of Proposition 1 in [13], we provide here a simpler proof in order to make this document self-contained. The proof of this proposition requires some properties of the transport plan.

Definition B.2. A set Γ ⊂ R^d × R^d is said to be d-cyclically monotone if for all n ∈ N and {(x_k, y_k)}_{k=1}^n ⊂ Γ it is satisfied that

Σ_{k=1}^n c(x_k, y_k) ≤ Σ_{k=1}^n c(x_{k+1}, y_k),   with the convention n + 1 = 1.    (31)

A measure is said to be d-cyclically monotone if its support is d-cyclically monotone.

In particular, the optimal transference plan in the Kantorovich problem for the cost d is d-cyclically monotone; see Theorem 2.3 in [10]. The same characterization holds for the optimal measures of (21); this claim is proved in the following lemma.

Lemma 4. The optimal measure π of (21) is d-cyclically monotone for d(x,z) = ||x − z||.

If π were not d-cyclically monotone, then, as in [30], one can build another measure π̃, with the same marginals as π, such that ∫ ||x − z|| dπ(x,z) > ∫ ||x − z|| dπ̃(x,z). Computing this, we deduce

inf_{π∈Π^p_λ(µ,ν)} ∫_{Ω×Ω} ||x − z|| dπ + π_x(Ω) + π_z(Ω) − 2 > inf_{π̃∈Π^p_λ(µ,ν)} ∫_{Ω×Ω} ||x − z|| dπ̃ + π̃_x(Ω) + π̃_z(Ω) − 2.

Hence π cannot be optimal. We replicate this construction in order to make this proof as self-contained as possible.

If P_+ and P_− are discrete probabilities, P_+ = Σ_{k=1}^n u_k δ_{x_k} and P_− = Σ_{j=1}^n v_j δ_{z_j}, then the optimal measure has the form

(1/n) Σ_{k,j=1}^n π_{k,j} δ_{(x_k, z_j)}.    (32)

If it is not d-cyclically monotone, then there exist N ∈ N and {(x_{k_i}, z_{k_i})}_{i=1}^N ⊂ supp(π) such that

Σ_{i=1}^N ||x_{k_i} − z_{k_{i+1}}|| < Σ_{i=1}^N ||x_{k_i} − z_{k_i}||,   assuming k_{N+1} = k_1.

Let a := inf_{i=1,...,N} {π_{k_i, k_i}} > 0, and define π̃ as

π̃ := π + (a/n) Σ_{i=1}^N ( δ_{(x_{k_i}, z_{k_{i+1}})} − δ_{(x_{k_i}, z_{k_i})} ).

Then

π̃(A × Ω) = π(A × Ω) + (a/n) Σ_{i=1}^N ( δ_{x_{k_i}}(A) − δ_{x_{k_i}}(A) ) = π(A × Ω),

and the same holds for (Ω × B) and the other marginal; moreover,

(1/n) Σ_{k,j} ||x_k − z_j|| π̃_{k,j} < (1/n) Σ_{k,j} ||x_k − z_j|| π_{k,j}.

Hence π̃ is the searched measure in the discrete case.

Π^p_λ(S, T) is sequentially compact with respect to weak* convergence of measures if both S and T are also. Because of the compactness of Ω × Ω, we only have to check that the set is bounded in total variation. But this is straightforward, because for each π ∈ Π^p_λ(P_+, P_−) it is satisfied that |π|(Ω × Ω) ≤ p(1 + λ).

If P_+ and P_− are general probabilities, let X_1, ..., X_n and Z_1, ..., Z_n be sequences of independent random variables with laws P_+ and P_−, and let P^n_+, P^n_− be the associated empirical measures. By the strong law of large numbers, P^n_+ → P_+ and P^n_− → P_− with probability one. Now let π_n be the corresponding optimal measure for P^n_+, P^n_−; then there exists a measure π such that π_n ⇀* π. This means that for each continuous and bounded function f on Ω × Ω we get ∫ f dπ_n → ∫ f dπ. Since the norm (x,z) ↦ ||x − z|| is continuous and bounded (again because Ω is compact), we derive that

∫ ||x − z|| dπ_n + π^x_n(Ω) + π^z_n(Ω) − 2 → ∫ ||x − z|| dπ + π_x(Ω) + π_z(Ω) − 2.

Finally, it is known that if a sequence of measures is d-cyclically monotone and converges weak* to another measure, then the limit is also d-cyclically monotone. This concludes the proof of the lemma.

The proof of Proposition 1 is then achieved as follows. The d-cyclical monotonicity implies in particular that g(x) − g(z) = ||x − z|| π-a.s. for some function g. Then, for the balanced case,

∫ (g − 1) dπ_x − ∫ (g + 1) dπ_z + 2 = sup_{f∈Lip_1(Ω)} ∫_Ω f (dP_+ − dP_−) − λ ( ∫_Ω (1 − f)_+ dP_+ + ∫_Ω (1 + f)_+ dP_− ).

Then we split (g − 1) = (g − 1) 1_{g−1 ≥ 0} + (g − 1) 1_{g−1 < 0} and

∫ (g − 1) dπ_x + 1 = (1 + λ) ∫ (g − 1) 1_{g−1 ≥ 0} dP_+ + ∫ (g − 1) 1_{g−1 < 0} dP_+ = ∫ [ (g − 1) − λ(1 − g)_+ ] dP_+.

Doing the same with P_−, we deduce that this g is optimal and that g(x) − g(z) = ||x − z|| π-a.s. for the optimal measure π. As a consequence of these observations, and following exactly the same arguments as in the proof of Proposition 1 in [13] (the key point being that g(x) − g(z) = ||x − z|| π-a.s.), the result follows from what comes next.

Let f* be optimal, and let x be a point where f* is differentiable. By assumption, the density property implies that π(x = z) = 0, and then, with probability one, there exists z such that f*(x) − f*(z) = ||x − z|| with the two points distinct, x ≠ z. For each t ∈ [0, 1] let x_t = (1 − t)x + tz, and define the path σ : [0, 1] → R by σ(t) := f*(x_t) − f*(x). The proof is split into two steps.

Step 1 (σ(t) = ||x_t − x|| = t ||x − z||). First of all, we notice that for each s, t ∈ [0, 1],

|σ(t) − σ(s)| = |f*(x_t) − f*(x_s)| ≤ ||x_t − x_s|| ≤ |t − s| ||x − z||.

Moreover, if we consider t ∈ [0, 1], then

σ(1) − σ(0) ≤ σ(1) − σ(t) + σ(t) − σ(0) ≤ (1 − t) ||x − z|| + σ(t) − σ(0) ≤ (1 − t) ||x − z|| + t ||x − z|| = ||x − z|| = σ(1) − σ(0),

so the inequalities become equalities, and since σ(0) = 0 we conclude that σ(t) = t ||x − z||.

Step 2 (There exists a unitary vector v such that |(∂f*/∂v)(x)| = 1). The candidate is v = (z − x)/||x − z||; let us compute the partial derivative:

(∂f*/∂v)(x) = lim_{h→0} ( f*(x + hv) − f*(x) ) / h = lim_{h→0} σ(h/||x − z||) / h = 1.

Then, for each differentiable point x of f*, there exists a unitary vector v such that |(∂f*/∂v)(x)| = 1. By completing v into an orthonormal basis, we deduce that ||∇f*(x)|| = 1, and this event occurs almost surely by the Rademacher theorem.

C Lipschitz constant for convolutional networks
C.1 Enforcing 1-Lipschitz dense layer
A neural network is a composition of linear and non-linear functions. Let us first study a multilayer perceptron, defined as follows:

f(x) = φ_k(W_k · (φ_{k−1}(W_{k−1} · ... φ_1(W_1 · x)))).

We name L(f) the Lipschitz constant of a function f. As a composition of functions, the Lipschitz constant of a multilayer perceptron is upper bounded by the product of the individual Lipschitz constants:

L(f) ≤ L(φ_k) · L(W_k) · L(φ_{k−1}) · L(W_{k−1}) · ... · L(φ_1) · L(W_1).

The most common activation functions, such as ReLU or sigmoid, are 1-Lipschitz. Thus, we can ensure that a perceptron is 1-Lipschitz by ensuring that each dense layer W_k is 1-Lipschitz. Given a linear function represented by an n × m matrix W, it is commonly admitted that:

L(W) = ||W||_2 ≤ ||W||_F ≤ max_{ij}(|W_{ij}|) · √(nm)    (33)

where ||W||_2 is the spectral norm and ||W||_F is the Frobenius norm. The initial version of WGAN [2] clips the weights of the networks. However, this is a very crude way to upper-bound the Lipschitz constant (see Equation (33)). Normalizing by the Frobenius norm has also been proposed in [25]. In this paper, we use spectral normalization as proposed in [19]. At the inference step, we normalize the weights of each layer by dividing them by the spectral norm of the matrix:

W_s = W / ||W||_2.

Even if this method is more computationally expensive than Frobenius normalization, it gives a finer upper bound for the 1-Lipschitz constraint of the layer. The spectral norm is computed by iteratively evaluating the largest singular value with the power iteration algorithm [11]. This is done during the forward step and taken into account for the gradient computation.
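The layer-wise product bound can also be sanity-checked empirically. A small sketch (ours) that lower-bounds the Lipschitz constant of any scalar function on random pairs; it is only a lower bound, since exact evaluation is NP-hard [31], but a 1-Lipschitz network must keep this ratio below 1:

```python
import numpy as np

def empirical_lipschitz(f, dim, n_pairs=1000, seed=0):
    """Empirical lower bound on the Lipschitz constant of a scalar
    function f: max over random pairs of |f(x) - f(z)| / ||x - z||."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, z = rng.normal(size=(2, dim))
        best = max(best, abs(f(x) - f(z)) / np.linalg.norm(x - z))
    return best
```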
C.2 Enforcing 1-Lipschitz convolutional layer
In this section we show that constraining the convolution kernels to be 1-Lipschitz is not enough to ensure the 1-Lipschitz property of convolutional layers, and we propose two normalization factors.

Notations: we consider a convolutional layer with an input feature map X of size (c, w, h) and L output channels obtained with kernels W = {W_l}_{l∈[0,L[} of odd size (c, k, k), i.e. k = 2k̄ + 1. Considering the classical "same" configuration, whose output size is (L, w, h), we use the following matrix notations for the convolution Y = W ∗ X:

• X̃: the zero-padded matrix of X, of size (c, w + k − 1, h + k − 1);
• W̄: the vectorized matrix of weights, of size (L, c·k²);
• X̄: a matrix of size (c·k², w·h), a duplication of the input X̃, where each column j corresponds to the c·k² inputs of X̃ used for computing a given output j;
• Ȳ = W̄ · X̄: the vectorized output, of size (L, w·h).

Given two inputs X_1 and X_2, we can compute an upper bound of the convolutional layer's Lipschitz constant (Eq. (34)):

||Y_1 − Y_2|| = ||Ȳ_1 − Ȳ_2|| ≤ ||W̄||_2 · ||X̄_1 − X̄_2|| ≤ Λ · ||W||_2 · ||X_1 − X_2||.    (34)

The coefficient Λ can be estimated, as in [6], by the maximum number of duplications of the input matrix X̃ in X̄: each input can be used in at most k² positions. But since, within X̄, part of the values come from the zero-padded zones of X̃ and have no influence on ||Y_1 − Y_2||, we propose a tighter estimation of Λ, computing the average duplication factor of the non-zero-padded values in X̄.

For a 1D convolution (see Fig. 6), the number of zero values in the k̄ first columns of X̄ (symmetrically, in the k̄ last columns) is (k̄, k̄ − 1, ..., 1). So the number of non-zero-padded values is k·w − 2·Σ_{t=1}^{k̄} t = k·w − k̄(k̄ + 1).

[Figure 6: Zero-padded elements in a 1D convolution with k = 7 (k̄ = 3).]

We propose to use Eq. (35) as a tighter normalization factor (note that this factor is not a strict upper bound of the Lipschitz constant, since particular matrices with high values in the center and low values on the borders won't satisfy inequality (34)):

Λ = √( (k·w − k̄(k̄+1)) · (k·h − k̄(k̄+1)) / (h·w) ).    (35)
C.3 Convolution layers with zero padding and stride

Convolution layers are sometimes used with stride (as in ResNet layers [14]) to reduce their computation cost. One drawback of stride is that each point of the input feature map no longer has the same number of occurrences in X̄. Given a stride (s, s), the output size of the layer is (w_o, h_o), such that w = s·w_o + r_w and h = s·h_o + r_h. We also introduce α = ⌈k/s⌉, the maximum number of overlapping stride positions. As in the previous section, we can build a matrix X̄ of size (c·k², w_o·h_o) as a duplication of X̃. The maximum duplication factor of an element of X̃ in X̄ is Λ = α.

As in Section C.2, we can compute a tighter factor using the average duplication of the input in X̄, by counting the non-zero-padded values used in X̄. We introduce ᾱ, β̄ such that k̄ = ᾱ·s + β̄. For a 1D convolution (see Fig. 7), the numbers of zero values in the first columns of X̄ are (k̄, k̄ − s, ..., β̄). So the number of zero-padded values on the left side is

zl = Σ_{t=0}^{ᾱ} (k̄ − t·s) = (ᾱ+1)·k̄ − s·ᾱ(ᾱ+1)/2 = (ᾱ+1)(ᾱ·s + 2β̄)/2.

On the right side (last columns), we introduce γ_w = argmax{γ = w − 1 − i·s, such that i ≥ 0 and γ ≤ k̄}, i.e. γ_w = w − 1 − s·⌈(w − 1 − k̄)/s⌉: γ_w identifies the first half-kernel that includes the last element of the line. We also introduce α_w, β_w such that γ_w = α_w·s + β_w. The numbers of zero values in the last columns are (k̄ − γ_w, k̄ − γ_w + s, ..., k̄ − γ_w + α_w·s), i.e.

zr_w = Σ_{t=0}^{α_w} (k̄ − γ_w + t·s) = (α_w + 1)(k̄ − γ_w + s·α_w/2).

The average duplication factor of a value of the input X in X̄ is then ((k·w_o − zl − zr_w)·(k·h_o − zl − zr_h))/(h·w), and we propose to use its square root, Eq. 36, as a tighter normalization factor:

Λ = √( (k·w_o − zl − zr_w) · (k·h_o − zl − zr_h) / (h·w) )   (36)

As in the previous section, this factor is not a strict upper bound of the Lipschitz constant. As a sanity check, in the case of stride s = 1 we have ᾱ = k̄, β̄ = 0, γ_w = α_w = k̄ and β_w = 0, so we retrieve zl + zr_w = k̄(k̄+1)/2 + k̄(k̄+1)/2 = k̄(k̄+1), and since w_o = w, Eq. 36 reduces to Eq. 35.

[Figure 7: Zero-padded elements in a 1D convolution with k = 7 (k̄ = 3) and stride s = 2.]

Layer type              | Parameters                        | Upper Lipschitz constant | Tighter Lipschitz estimation
Dense                   |                                   | ‖W‖₂                     | ‖W‖₂
Convolution w/o stride  | kernel size (k, k), k = 2k̄ + 1    | k·‖W̄‖₂                   | √((k·w − k̄(k̄+1))·(k·h − k̄(k̄+1))/(h·w))·‖W̄‖₂
Convolution with stride | kernel size (k, k), stride (s, s) | ⌈k/s⌉·‖W̄‖₂               | √((k·w_o − zl − zr_w)·(k·h_o − zl − zr_h)/(h·w))·‖W̄‖₂
MaxPooling              |                                   | 1                        | 1
AveragePooling          | averaging size p_o, stride s      | ⌈p_o/s⌉/p_o              | —

Table 2: Main Lipschitz upper bounds and tighter estimations for each layer type.
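A small helper (ours, following the notations above) evaluates zl, zr and the factor of Eq. 36; for s = 1 it recovers Eq. 35, as in the sanity check above:

```python
import math

def lambda_strided(k, s, w, h):
    """Tighter normalization factor (Eq. 36) for a k x k convolution with
    stride s on a zero-padded ('same') (w, h) feature map."""
    k_bar = (k - 1) // 2
    wo, ho = w // s, h // s                    # output sizes (remainders dropped)
    a_bar, b_bar = divmod(k_bar, s)            # k_bar = a_bar*s + b_bar
    zl = (a_bar + 1) * (a_bar * s + 2 * b_bar) / 2

    def zr(dim):
        gamma = dim - 1 - s * math.ceil((dim - 1 - k_bar) / s)
        a_w = gamma // s                        # gamma = a_w*s + b_w
        return (a_w + 1) * (k_bar - gamma + s * a_w / 2)

    return math.sqrt((k * wo - zl - zr(w)) * (k * ho - zl - zr(h)) / (h * w))

print(lambda_strided(3, 1, 28, 28))   # stride 1: matches Eq. 35 (~2.93)
print(lambda_strided(3, 2, 28, 28))   # ~1.46, tighter than the crude bound ceil(k/s) = 2
```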
C.3.1 Pooling layers
By definition, the max pooling layer is 1-Lipschitz, since ‖max(X₁) − max(X₂)‖ ≤ ‖X₁ − X₂‖ [28]. Consider now an average pooling layer with an averaging size of p_o and a stride of s. Since a mean is equivalent to a convolution with the constant matrix (1/p_o²)·1_{p_o×p_o}, the average pooling layer is equivalent to a convolution with stride (Sec. C.3). Introducing α = ⌈p_o/s⌉, which equals 1 in the common case where s = p_o, an upper bound of the Lipschitz constant of the average pooling layer is Λ·‖W̄‖₂ = α/p_o.
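As a quick numerical check (our own construction, not code from the paper), one can materialize the average pooling operator as a matrix and verify that its largest singular value matches the α/p_o bound when s = p_o:

```python
import numpy as np

def avg_pool_matrix(w, h, po):
    """Matrix of 2D average pooling (size po, stride po) on flattened (w, h) inputs."""
    wo, ho = w // po, h // po
    A = np.zeros((wo * ho, w * h))
    for i in range(wo):
        for j in range(ho):
            for di in range(po):
                for dj in range(po):
                    # each output averages a disjoint po x po block
                    A[i * ho + j, (i * po + di) * h + (j * po + dj)] = 1 / po**2
    return A

A = avg_pool_matrix(8, 8, 2)
print(np.linalg.norm(A, 2))   # 0.5, i.e. alpha/po = 1/2 for po = 2 and s = po
```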
C.4 Gradient norm preserving and general architecture

As proven in Section 3.2, the optimal function f* with respect to Equation 5 verifies ‖∇f*‖ = 1 almost surely. In [13], the authors propose to add to the loss function a regularization term on the average norm of the gradient with respect to the inputs. However, the estimation of this value is difficult, and a regularization term does not guarantee the property. In this paper, we apply the approach described in [1], based on specific activation functions and a normalization process of the weights. Three norm-preserving activation functions are proposed (a minimal sketch is given after this list):

• MaxMin: sorts the vector entries by pairs;
• GroupSort: sorts the vector entries by groups of a fixed size;
• FullSort: sorts the whole vector.

These functions are vector-wise rather than element-wise.
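A minimal NumPy sketch of these sorting activations (our own illustration): GroupSort with a group size of 2 gives MaxMin, and a group size equal to the vector length gives FullSort.

```python
import numpy as np

def group_sort(x, group_size):
    """Sort the entries of x within consecutive groups of size `group_size`
    (the length of x must be divisible by group_size). Being a permutation of
    the coordinates, this activation preserves the gradient norm."""
    x = np.asarray(x, dtype=float)
    return np.sort(x.reshape(-1, group_size), axis=1).reshape(x.shape)

v = np.array([3.0, -1.0, 0.5, 2.0])
print(group_sort(v, 2))        # MaxMin:   [-1.   3.   0.5  2. ]
print(group_sort(v, len(v)))   # FullSort: [-1.   0.5  2.   3. ]
```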
We also propose the activation Const-PReLU, a PReLU [15] activation function complemented by the constraint |α| ≤ 1 (α being the learnt slope). This last function is norm preserving only when |α| = 1 (linear or absolute-value function), but, being computed element-wise, it is more efficient for convolutional layer outputs. Given a vector v of size k, the P-norm pooling is defined in [4] as follows:

Pool_{P-norm}(v) = ( (1/k) Σ_{i=1}^{k} v_i^P )^{1/P}

A linear layer W is norm preserving if and only if all the singular values of W are equal to 1. In [1], the authors propose to use the Björck orthonormalization algorithm [3]. The Björck algorithm computes the closest orthonormal matrix by repeating the following operation:

W_{k+1} = W_k ( I + Σ_{i=1}^{p} (−1)^i · C(−1/2, i) · Q_k^i )   (37)

where C(·, ·) denotes the binomial coefficient, Q_k = I − W_k^T W_k and W_0 = W. This algorithm is fully differentiable and, as for spectral normalization, it is applied during the forward pass and taken into account for back-propagation.
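A minimal NumPy sketch of the first-order case (p = 1) of Eq. 37, as an illustration rather than the actual DEEL.LIP implementation; the spectral pre-normalization is our addition, needed for the iteration to converge:

```python
import numpy as np

def bjorck_orthonormalize(W, n_iter=15):
    """First-order Bjorck iteration (p = 1 in Eq. 37):
    W <- W (I + 1/2 (I - W^T W)) = 1.5 W - 0.5 W W^T W."""
    W = W / np.linalg.norm(W, 2)        # bring the largest singular value to 1
    for _ in range(n_iter):
        W = 1.5 * W - 0.5 * W @ W.T @ W # pushes every singular value towards 1
    return W

rng = np.random.default_rng(0)
W = bjorck_orthonormalize(rng.standard_normal((3, 5)))
print(np.allclose(W @ W.T, np.eye(3), atol=1e-5))  # rows are orthonormal
```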
D Experiments: additional results

D.1 Networks architecture
We apply these normalizations to both dense and convolutional layers. All the weights are initialized with the Björck algorithm with 15 steps. We consider both ReLU and norm-preserving activation functions. For convolutional layers, we also apply the Lipschitz coefficient corrections (Sec. C.2), and we restrict the norm-preserving activation functions to GroupSort-2 and Const-PReLU for efficiency's sake. Note that other regularization layers, such as BatchNorm and Dropout, are useless here and can modify the Lipschitz factor.

For the Two-Moons dataset, the network architecture is an MLP with 256-128-64-1 layer sizes. For the MNIST 0-8 subset, we use either an MLP with 128-64-32-1 layer sizes or a convolutional neural network with 3×3 convolution kernels: C32-C32-P-C64-C64-C64-P-D128-D1 (the same architecture is used with a sigmoid output). For the CelebA male-mustache dataset, a VGG architecture is used with 3×3 convolution kernels: C16-C16-P-C32-C32-C32-P-C128-C128-C128-P-C256-C256-C256-P-D128-D64-D32-D1. For all experiments, average pooling has shown better results than max pooling. Other experiments were conducted with ResNet networks, leading to the same kind of results.

Lipschitz networks were trained with the DEEL.LIP library (https://github.com/deel-ai/deel-lip, to be published soon). After the training step, the weights are replaced by their normalized version.

D.2 Effect of λ

The proposed regularized optimal transport classifier is applied on the two-digit (0-8) MNIST subset [17] and on the CelebA [18] male with/without-mustache subset. Learning is done with the DEEL.LIP library, using the hinge-KR loss (Eq. 5; a sketch of this loss is given below), varying the λ parameter, with the Adam optimizer over 50 epochs, repeated 10 times; we collect on the test dataset both the Wasserstein distance estimation and the accuracy.
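For reference, here is a minimal NumPy sketch of a hinge-regularized KR loss in the spirit of Eq. 5; the exact formulation is the one given in the main text, and the values of λ and of the margin below are placeholders, not values from the paper:

```python
import numpy as np

def hkr_loss(y_true, y_pred, lam=10.0, margin=1.0):
    """Hinge-KR loss sketch. y_true takes values in {-1, +1};
    y_pred = f(x) is the output of a 1-Lipschitz network f."""
    # Kantorovich-Rubinstein term: difference of expectations of f on the two classes
    kr = y_pred[y_true > 0].mean() - y_pred[y_true < 0].mean()
    # Hinge term: penalizes margin violations
    hinge = np.maximum(0.0, margin - y_true * y_pred).mean()
    return -kr + lam * hinge   # minimized during training
```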
Fig. 8 shows the influence of the hyper-parameter λ introduced in Eq. 5. As expected, the regularization tends to enhance the classification performance, but induces a slight drop in the Wasserstein distance estimation.

[Figure 8: Effect of the λ hyper-parameter on the regularized OT classifier (left: Wasserstein distance estimation; right: classification accuracy), with a 1-Lipschitz MLP on the MNIST 0-8 dataset.]

D.3 Approximating the Wasserstein distance
We first consider the computation of the Wasserstein distance between two distributions using several kinds of 1-Lipschitz neural network architectures. The architectures are inspired by [1], but the objective is slightly different, since we are not in the WGAN setting, where the generated distribution aims at decreasing this distance. In our setting, the distributions are fixed, and the Wasserstein distance can be computed empirically (A.1) as a reference. We compare several kinds of 1-Lipschitz networks, multilayer perceptrons (MLP) and convolutional neural networks (CNN), with spectral normalization or Björck orthonormalization [3; 1], and with ReLU, constrained PReLU (Sec. C.4), MaxMin, GroupSort2 or FullSort [1] activation functions. The empirical Wasserstein distance is estimated using the POT library [9], applied on the two-digit (0-8) MNIST subset [17], along the lines of the sketch below.
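The empirical estimation can be reproduced along these lines (an illustrative sketch, with random arrays standing in for the flattened MNIST images):

```python
import numpy as np
import ot  # Python Optimal Transport library (POT)

rng = np.random.default_rng(0)
xs = rng.random((100, 784))   # e.g. flattened images of digit 0
xt = rng.random((100, 784))   # e.g. flattened images of digit 8

M = ot.dist(xs, xt, metric='euclidean')   # ground cost = L2 distance, for W1
a = np.full(len(xs), 1 / len(xs))         # uniform weights on the samples
b = np.full(len(xt), 1 / len(xt))
print(ot.emd2(a, b, M))                   # exact W1 between the empirical measures
```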
Learning is done using the DEEL.LIP library with the Wasserstein loss (Kantorovich-Rubinstein formulation) and the Adam optimizer, on 50 epochs. Each experiment on the MNIST dataset is repeated ten times.

Results presented in Table 3 show that the best approximation is obtained with the MLP architecture, with Björck orthonormalization and the FullSort activation. It has to be noticed that, even for a dataset as simple as MNIST 0-8, no configuration is able to reach the empirical Wasserstein distance value. Several explanations are possible: it can be due to the representativeness of the chosen Lipschitz neural networks (architecture and activation) among the 1-Lipschitz functions, but it could also be the consequence of an optimization problem. For the former, we use Lipschitz layers that only give an upper bound of the full-network Lipschitz constant; but, as shown in Figure 3b, the overall Lipschitz factor is close to one. Besides, many experiments (not presented in this paper) have been done on the width and depth of the neural networks, with roughly the same results. For the latter, since the orthonormalized (Björck) networks are a subclass of the spectral-normalization-based ones, but results are worse with spectral normalization, optimization issues can be suspected. Experiments done with many types of optimizers (Adam, SGD, RMSprop, ...) lead to the same results, and surprisingly the variance of the Wasserstein estimations is very low (Table 3). Besides, the choice of activation function also has a large influence on the Wasserstein estimation: choosing ReLU for an MLP leads to a clear drop in the estimation. For convolutional neural networks, the best results are obtained with Björck orthonormalization and the constrained PReLU activation, but they are worse than with the MLP. This may be due to the non-invariance to shift and scale of the Wasserstein distance.

Dataset / Model                  | Normalization   | Activation | Wasserstein estimation
(noise = 0.…) [empirical estim.] | N.A.            | N.A.       | 13.14
MLP₁                             | Björck (15 it.) | FullSort   | —
MLP₁                             | Björck (15 it.) | GroupSort2 | 13.11
MLP₁                             | Björck (15 it.) | ReLU       | 12.87
MLP₁                             | Spectral norm.  | FullSort   | 12.88
MNIST 0-8 [empirical estim.]     | N.A.            | N.A.       | 19.04
MLP₂                             | Björck (15 it.) | FullSort   | —
MLP₂                             | Björck (15 it.) | GroupSort2 | —
MLP₂                             | Björck (15 it.) | PReLU      | —
MLP₂                             | Björck (15 it.) | ReLU       | —
MLP₂                             | Spectral norm.  | FullSort   | —
CNN                              | Björck (15 it.) | FullSort   | —
CNN                              | Björck (15 it.) | GroupSort2 | —
CNN                              | Björck (15 it.) | PReLU      | —
CNN                              | Björck (15 it.) | ReLU       | —
CNN                              | Spectral norm.  | FullSort   | —

Table 3: Wasserstein estimation for various datasets and architectures (MLP₁: 256-128-64-1; MLP₂: 128-64-32-1; CNN: C32-C32-P-C64-C64-C64-P-128-1).

[Figure 9: Estimation of the Wasserstein distance with a Lipschitz MLP. (a) Variation of the activation function; (b) variation of the number of Björck iterations per batch (0: spectral normalization), using the FullSort activation.]