Machine Learning Journal manuscript No. (will be inserted by the editor)
Wasserstein Discriminant Analysis
Rémi Flamary · Marco Cuturi · Nicolas Courty · Alain Rakotomamonjy
Received: date / Accepted: date
Abstract
Wasserstein Discriminant Analysis (WDA) is a new supervised linear dimensionality reduction algorithm. Following the blueprint of classical Fisher Discriminant Analysis, WDA selects the projection matrix that maximizes the ratio of the dispersion of projected points pertaining to different classes and the dispersion of projected points belonging to a same class. To quantify dispersion, WDA uses regularized Wasserstein distances. Thanks to the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at samples' scale) interactions between classes. In addition, we show that WDA leverages a mechanism that induces neighborhood preservation. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; the optimization problem of WDA can be tackled using automatic differentiation of Sinkhorn's fixed-point iterations. Numerical experiments show promising results both in terms of prediction and visualization on toy examples and real datasets such as MNIST and on deep features obtained from a subset of the Caltech dataset.

This work was supported in part by grants from the ANR OATMIL ANR-17-CE23-0012, Normandie Region, Feder, CNRS PEPS DESSTOPT, and the Chaire d'excellence de l'IDEX Paris-Saclay. The authors also want to thank Alexandre Saint-Dizier, Charles Bouveyron and Julie Delon for fruitful discussions and for pointing out the class weights in Fisher Discriminant Analysis.

R. Flamary, Lagrange, Observatoire de la Côte d'Azur, Université Côte d'Azur. E-mail: remi.fl[email protected]
M. Cuturi, CREST, ENSAE, Campus Paris-Saclay, 5 avenue Henry Le Chatelier, 91120 Palaiseau, France
N. Courty, Laboratoire IRISA, Campus de Tohannic, 56000 Vannes, France
A. Rakotomamonjy, LITIS EA4108, Université Rouen Normandie
Keywords
Linear Discriminant Analysis · Optimal Transport · Wasserstein Distance
1 Introduction

Feature learning is a crucial component in many applications of machine learning. New feature extraction methods or data representations are often responsible for breakthroughs in performance, as illustrated by the kernel trick in support vector machines (Schölkopf and Smola, 2002) and their feature learning counterpart in multiple kernel learning (Bach et al, 2004), and more recently by deep architectures (Bengio, 2009).

Among all feature extraction approaches, one major family of dimensionality reduction methods (Van Der Maaten et al, 2009; Burges, 2010) consists in estimating a linear subspace of the data. Although very simple, linear subspaces have many advantages. They are easy to interpret, and can be inverted, at least in a least-squares sense. This latter property has been used for instance in PCA denoising (Zhang et al, 2010). Linear projection is also a key component in random projection methods (Fern and Brodley, 2003) and in compressed sensing, and it is often used as a first pre-processing step, such as the linear part of a neural network layer. Finally, linear projections only involve matrix products and therefore stream particularly well on any type of hardware (CPU, GPU, DSP).

Linear dimensionality reduction techniques come in all flavors. Some of them, such as PCA, are inherently unsupervised; some can consider labeled data and fall in the supervised category. We consider in this paper linear and supervised techniques. Within that category, two families of methods stand out. Given a dataset of pairs of vectors and labels {(x_i, y_i)}_i, with x_i ∈ R^d, the goal of Fisher Discriminant Analysis (FDA) and its variants is to learn a linear map P : R^d → R^p, p ≪ d, such that the embeddings Px_i of these points can easily be discriminated using linear classifiers. Mahalanobis metric learning (MML) follows the same approach, except that the quality of the embedding P is judged by the ability of a k-nearest neighbor algorithm (not a linear classifier) to obtain a good classification accuracy.

FDA and MML, in both Global and Local Flavors.
FDA attempts to maximize w.r.t. P the sum of all distances ‖Px_i − Px_j‖² between pairs of samples from different classes c, c′ while minimizing the sum of all distances ‖Px_i − Px_j‖² between pairs of samples within the same class c (Friedman et al, 2001). A local version of FDA was proposed by Sugiyama (2007), which boils down to discarding the computation for all pairs of points that are not neighbors. On the other hand, the first techniques that were proposed to learn metrics (Xing et al, 2003) used a global criterion, namely a sum on all pairs of points. Later on, variations that focused instead exclusively on local interactions, such as LMNN (Weinberger and Saul, 2009), were shown to be far more efficient in practice. Supervised dimensionality reduction approaches stemming from FDA or MML thus consider either global or local interactions between points: either all differences ‖Px_i − Px_j‖² have an equal footing in the criterion they optimize, or, on the contrary, ‖Px_i − Px_j‖² is only considered for points such that x_i is close to x_j.

Fig. 1 Weights used for inter/intra-class variances for FDA, Local FDA and WDA for different regularizations λ. Only weights for two samples from class 1 are shown. The color of the link darkens as the weight grows. FDA computes a global variance with uniform weight on all pairwise distances, whereas LFDA focuses only on samples that lie close to each other. WDA relies on an optimal transport matrix T that matches all points in one class to all other points in another class (most links are not visible because they are colored in white as related weights are too small). WDA has both a global (due to matching constraints) and local (due to transportation cost minimization) outlook on the problem, with a tradeoff controlled by the regularization strength λ.

WDA: Global and Local.
We introduce in this work a novel approach that incorporates both a global and a local perspective. WDA achieves this blend through the mathematics of optimal transport (see for instance the recent book of Peyré and Cuturi (2018) for an introduction and an exposition of some of the computational tools we will use in this paper). Optimal transport provides a powerful toolbox to compute distances between two empirical probability distributions. Optimal transport does so by considering all probabilistic couplings between these two measures, to select one, denoted T, that is optimal for a given criterion. This coupling describes interactions at both a global and a local scale, as reflected by the transportation weight T_{ij} that quantifies how important the distance ‖Px_i − Px_j‖² should be to obtain a good projection matrix P. Indeed, such weights are decided by (i) making sure that all points in one class are matched to all points in the other class (global constraint, derived through marginal constraints over the coupling); (ii) making sure that points in one class are matched only to few similar points of the other class (local constraint, thanks to the optimality of the coupling, which is a function of local costs). Our method has the added flexibility that it can interpolate, through a regularization parameter, between an exclusively global viewpoint (identical, in that case, to FDA) and a more local viewpoint with a global matching constraint (different, in that sense, from purely local tools such as LMNN or Local-FDA). In mathematical terms, we adopt the ratio formulation of FDA and maximize the ratio of the regularized Wasserstein distances between inter-class populations and between each intra-class population with itself, when these points are considered in their projected space:

  max_{P ∈ ∆}  ( ∑_{c,c′>c} W_λ(PX^c, PX^{c′}) ) / ( ∑_c W_λ(PX^c, PX^c) )    (1)

where ∆ = {P = [p_1, …, p_p] | p_i ∈ R^d, ‖p_i‖ = 1 and p_i^⊤ p_j = 0 for i ≠ j} is the Stiefel manifold (Absil et al, 2009), the set of orthogonal d × p matrices, and PX^c is the matrix of projected samples from class c. W_λ is the regularized Wasserstein distance proposed by Cuturi (2013), which can be expressed as W_λ(X, Z) = ∑_{i,j} T*_{i,j} ‖x_i − z_j‖², T*_{i,j} being the coordinates of the entropic-regularized Optimal Transport (OT) matrix T* (see Section 2). The parameter λ controls the local information involved in the distance computation. Further analyses and intuitions on the role of the within-class distances in the optimization problem are given in the Discussion section.

When λ is small, we will show that WDA boils down to FDA. When λ is large, WDA tries to split apart the distributions of the classes by maximizing their optimal transport distance. In that process, for a given example x_i in one class, only few components T_{i,j} will be activated, so that x_i will be paired with few examples. Figure 1 illustrates how the pairing weights T_{i,j} are defined when comparing Wasserstein discriminant analysis (WDA, with different regularization strengths) with FDA (purely global) and Local-FDA (purely local) (Sugiyama, 2007). Another strong feature brought by regularized Wasserstein distances is that the relations between samples (as given by the optimal transport matrix T) are estimated in the projected space. This is an important difference compared to all previous local approaches, which estimate local relations in the original space and make the hypothesis that these relations are unchanged after projection.

Paper outline.
Section 2 provides background on regularized Wasserstein distances. The WDA criterion and its practical optimization are presented in Section 3. Section 4 discusses properties of WDA and related works. Numerical experiments are provided in Section 5. Section 6 concludes the paper and introduces perspectives.
2 Regularized Wasserstein distances

Wasserstein distances, also known as earth mover's distances, define a geometry over the space of probability measures using principles from optimal transport theory (Villani, 2008). Recent computational advances (Cuturi, 2013; Benamou et al, 2015) have made them scalable to dimensions relevant to machine learning applications.
Notations and Definitions.
Let µ = (1/n) ∑_i δ_{x_i} and ν = (1/m) ∑_i δ_{z_i} be two empirical measures with locations in R^d stored in the matrices X = [x_1, …, x_n] and Z = [z_1, …, z_m]. The pairwise squared Euclidean distance matrix between samples in µ and ν is defined as M_{X,Z} := [‖x_i − z_j‖²]_{ij} ∈ R^{n×m}. Let U_{nm} be the polytope of n × m nonnegative matrices such that their row and column marginals are equal to 1_n/n and 1_m/m respectively. Writing 1_n for the n-dimensional vector of ones, we have:

  U_{nm} := { T ∈ R_+^{n×m} : T 1_m = 1_n/n,  T^⊤ 1_n = 1_m/m }.

Regularized Wasserstein distance.
Let ⟨A, B⟩ := tr(A^⊤B) be the Frobenius dot-product of matrices. For λ ≥ 0, the regularized Wasserstein distance we adopt in this paper between µ and ν is (with a slight abuse of notation):

  W_λ(µ, ν) := W_λ(X, Z) := ⟨T^λ, M_{X,Z}⟩,    (2)

where T^λ is the solution of an entropy-smoothed optimal transport problem,

  T^λ := argmin_{T ∈ U_{nm}}  λ⟨T, M_{X,Z}⟩ − Ω(T),    (3)

where Ω(T) is the entropy of T seen as a discrete joint probability distribution, namely Ω(T) := −∑_{ij} t_{ij} log(t_{ij}). Note that problem (3) can be solved very efficiently using Sinkhorn's fixed-point iterations (Cuturi, 2013). The solution of the optimization problem can be expressed as

  T = diag(u) K diag(v) = u 1_m^⊤ ∘ K ∘ 1_n v^⊤,    (4)

where ∘ stands for elementwise multiplication and K is the matrix whose elements are K_{i,j} = e^{−λ M_{i,j}}. The Sinkhorn iterations consist in updating the left/right scaling vectors u^k and v^k of the matrix K = e^{−λM}. Writing ⊘ for elementwise division, these updates take the following form at iteration k:

  v^k = (1_m/m) ⊘ (K^⊤ u^{k−1}),   u^k = (1_n/n) ⊘ (K v^k),    (5)

with the initialization fixed to u^0 = 1_n. Because it only involves matrix products, the Sinkhorn algorithm can be streamed efficiently on parallel architectures such as GPGPUs.
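For concreteness, here is a minimal NumPy sketch of the regularized distance of Eqs. (2)-(5); the function name and defaults are ours and not part of the released implementation.

```python
import numpy as np

def sinkhorn_wasserstein(X, Z, lam, n_iter=50):
    """Regularized Wasserstein distance W_lambda(X, Z) of Eqs. (2)-(5).

    X: (n, d) and Z: (m, d) sample matrices, lam: regularization strength.
    Returns the transport matrix T and the distance <T, M>.
    """
    n, m = X.shape[0], Z.shape[0]
    # squared Euclidean cost matrix M_{X,Z}
    M = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-lam * M)
    u = np.ones(n)                      # initialization u^0 = 1_n
    for _ in range(n_iter):             # Sinkhorn fixed-point updates, Eq. (5)
        v = (np.ones(m) / m) / (K.T @ u)
        u = (np.ones(n) / n) / (K @ v)
    T = u[:, None] * K * v[None, :]     # T = diag(u) K diag(v), Eq. (4)
    return T, float((T * M).sum())
```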
R´emi Flamary et al. C classes: samples of class c are stored in matrices X c ; the number of samplesfrom class c is n c . Using the definition (2) of regularized Wasserstein distance,we can write the Wasserstein Discriminant Analysis optimization problem asmax P ∈ ∆ (cid:40) J ( P , T ( P )) = (cid:80) c,c (cid:48) >c (cid:104) P T P , C c,c (cid:48) (cid:105) (cid:80) c (cid:104) P T P , C c,c (cid:105) (cid:41) (6)s.t. C c,c (cid:48) = (cid:88) i,j T c,c (cid:48) i,j ( x ci − x c (cid:48) j )( x ci − x c (cid:48) j ) T , ∀ c, c (cid:48) and T c,c (cid:48) = argmin T ∈ U ncnc (cid:48) λ (cid:104) T , M PX c , PX c (cid:48) (cid:105) − Ω ( T ) , which can be reformulated as the following bilevel problemmax P ∈ ∆ J ( P , T ( P )) (7)s.t. T ( P ) = argmin T ∈ U ncnc (cid:48) E ( T , P ) (8)where T = { T c,c (cid:48) } c,c (cid:48) contains all the transport matrices between classes andthe inner problem function E is defined as E ( T , P ) = (cid:88) c,c> = c (cid:48) λ (cid:104) T c,c (cid:48) , M PX c , PX c (cid:48) (cid:105) − Ω ( T c,c (cid:48) ) . (9)The objective function J can be expressed as J ( P , T ( P )) = (cid:104) P T P , C b (cid:105)(cid:104) P T P , C w (cid:105) where C b = (cid:80) c,c (cid:48) >c C c,c (cid:48) and C w = (cid:80) c C c,c are the between and withincross-covariance matrices that depend on T ( P ). Optimization problem (7)-(8)is a bilevel optimization problem, which can be solved using gradient descent(Colson et al, 2007). Indeed, J is differentiable with respect to P . This comesfrom the fact that optimization problems in Equation (8) are all strictly con-vex, making solutions of the problems unique, hence T ( P ) is smooth anddifferentiable (Bonnans and Shapiro, 1998).Thus, one can compute the gradient of J directly w.r.t. P using the chainrule as follows ∇ P J ( P , T ( P )) = ∂J ( P , T ) ∂ P + (cid:88) c,c (cid:48) ≥ c ∂J ( P , T ) ∂ T c,c (cid:48) ∂ T c,c (cid:48) ∂ P (10) The first term in gradient (10) suppose that T is constant and can be computed(Eq. 94-95 (Petersen et al, 2008)) as ∂J ( P , T ) ∂ P = P (cid:18) σ w C b − σ b σ w C w (cid:19) (11) asserstein Discriminant Analysis 7 with σ w = (cid:104) P T P , C w (cid:105) and σ b = (cid:104) P T P , C b (cid:105) . In order to compute the secondterm in (10), we will separate the cases when c = c (cid:48) and c (cid:54) = c (cid:48) as it correspondsto their position in the fraction of Equation (6). Their partial derivative isobtained directly from the scalar product and is a weighted vectorization ofthe transport cost matrix ∂J ( P , T ) ∂ T c,c (cid:48) (cid:54) = c = vec (cid:16) σ w M PX c , PX c (cid:48) (cid:17) and ∂J ( P , T ) ∂ T c,c = − vec (cid:16) σ b σ w M PX c , PX c (cid:17) . (12)We will see in the remaining that the main difficulty stands in computingthe Jacobian ∂ T c,c (cid:48) /∂ P since the optimal transport matrix is not available asa closed form. We solve this problem using instead an automatic differentiationapproach wrapped around the Sinkhorn fixed point iteration algorithm.3.2 Automatic Differentiation.A possible way to compute the Jacobian ∂ T c,c (cid:48) /∂ P is to use the implicit func-tion theorem as in hyperparameter estimation in ML (Bengio, 2000; Chapelleet al, 2002). We detail that approach in the appendix but it requires invertinga very large matrix, and does not scale in practice. 
3.2 Automatic differentiation.

A possible way to compute the Jacobian ∂T^{c,c′}/∂P is to use the implicit function theorem, as in hyperparameter estimation in machine learning (Bengio, 2000; Chapelle et al, 2002). We detail that approach in the appendix, but it requires inverting a very large matrix and does not scale in practice. It also assumes that the exact optimal transport T^λ is obtained at each iteration, which is clearly an approximation since we only have the computational budget for a finite, and usually small, number of Sinkhorn iterations.

Following the gist of Bonneel et al (2016), who differentiate not Sinkhorn iterations but a more complex fixed-point iteration designed to compute Wasserstein barycenters, we propose to differentiate the transportation matrices obtained after running exactly L Sinkhorn iterations, for a predefined L. Writing T^k(P) for the solution obtained after k iterations as a function of P for a given (c, c′) pair,

  T^k(P) = diag(u^k) e^{−λM} diag(v^k),

where M is the distance matrix induced by P, T^L(P) can then be directly differentiated:

  ∂T^k/∂P = ∂[u^k 1_m^⊤]/∂P ∘ e^{−λM} ∘ 1_n v^{k⊤} + u^k 1_m^⊤ ∘ ∂e^{−λM}/∂P ∘ 1_n v^{k⊤} + u^k 1_m^⊤ ∘ e^{−λM} ∘ ∂[1_n v^{k⊤}]/∂P.    (13)

Note that a recursion occurs, as u^k depends on v^k, which is itself related to u^{k−1}. The Jacobians that we need can then be obtained from Equation (5). For instance, the gradient of the j-th component of u^k is

  ∂u_j^k/∂P = −( (1/n) / ([Kv^k]_j)² ) ( ∑_i (∂K_{j,i}/∂P) v_i^k + ∑_i K_{j,i} (∂v_i^k/∂P) ),    (14)

while for v^k we have

  ∂v_j^k/∂P = −( (1/m) / ([K^⊤u^{k−1}]_j)² ) ( ∑_i (∂K_{i,j}/∂P) u_i^{k−1} + ∑_i K_{i,j} (∂u_i^{k−1}/∂P) ),    (15)

and finally

  ∂K_{i,j}/∂P = −2λ K_{i,j} P (x_i − x′_j)(x_i − x′_j)^⊤.

The Jacobian ∂T^k/∂P can thus be obtained by keeping track of all the Jacobians at each iteration and then successively applying these equations. This approach is far cheaper than the implicit function theorem approach. Indeed, in this case the computation of ∂T/∂P is dominated by the complexity of computing ∂K/∂P, whose cost for one iteration is O(pn²d²) for n = m. The complexity is then linear in L and quadratic in n.
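The same Jacobian information can alternatively be obtained by unrolling the L Sinkhorn iterations inside a modern automatic differentiation library, which then applies the recursion (13)-(15) mechanically. The PyTorch sketch below illustrates this strategy; it is our illustration under these assumptions, not the authors' original code.

```python
import torch

def sinkhorn_T(P, X, Z, lam, L):
    """Unroll L Sinkhorn iterations so that T stays differentiable w.r.t. P.

    P: (p, d) tensor with requires_grad=True; X: (n, d), Z: (m, d) tensors.
    """
    PX, PZ = X @ P.T, Z @ P.T
    M = torch.cdist(PX, PZ) ** 2          # squared distances in projected space
    K = torch.exp(-lam * M)
    n, m = K.shape
    u = torch.ones(n, dtype=P.dtype)
    for _ in range(L):                    # Eq. (5), kept on the autodiff tape
        v = (torch.ones(m, dtype=P.dtype) / m) / (K.T @ u)
        u = (torch.ones(n, dtype=P.dtype) / n) / (K @ v)
    return u[:, None] * K * v[None, :]    # Eq. (4)

# Usage sketch: gradient of <T, M> w.r.t. P through the L iterations.
# P = torch.randn(p, d, requires_grad=True)
# T = sinkhorn_T(P, X, Z, lam=1.0, L=10)
# loss = (T * torch.cdist(X @ P.T, Z @ P.T) ** 2).sum()
# loss.backward()   # P.grad now holds the unrolled-Sinkhorn gradient
```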
3.3 Algorithm

In the above subsections, we have reformulated the WDA optimization problem so as to make it tractable. We have derived closed-form expressions for some elements of the gradient, as well as an automatic differentiation strategy for computing the gradients of the transport plans T^{c,c′} with respect to P.

Now that all these partial derivatives are available, we can compute the gradient G^k = ∇_P J(P^k, T(P^k)) at iteration k and apply classical manifold optimization tools such as the projected gradient of Schmidt (2008) or a trust-region algorithm as implemented in Manopt/Pymanopt (Boumal et al, 2014; Koep and Weichwald, 2016). The latter toolboxes include tools to optimize over the Stiefel manifold, notably automatic conversions from Euclidean to Riemannian gradients. Algorithm 1 provides the steps of a projected gradient approach for solving WDA. Note that at each iteration we need to compute all the transport plans T^{c,c′}, which are needed for computing C_b and C_w. The automatic differentiation in the last steps of Algorithm 2 takes advantage of these transport plan computations to calculate and store the partial derivatives needed for Equation (13).

From a computational complexity point of view, each projected gradient iteration has the following costs, considering that all classes are composed of n samples. For one iteration of the Sinkhorn-Knopp algorithm given in Algorithm 2, u^k and v^k are of complexity O(n²), while {∂v_j^k/∂P}_j and {∂u_j^k/∂P}_j are both O(n²dp). In this algorithm, the complexity is dominated by that of ∂K_{i,j}/∂P, whose cost is O(n²d²p), although it is computed only once. In Algorithm 1, the costs of C_b and C_w are O(n²d²), and Equations (11) and (12) yield complexities of O(pd²) and O(n²dp) respectively. Note that the cost of computing ∂T^k/∂P is dominated by those of {∂v_j^k/∂P}_j and {∂u_j^k/∂P}_j. Finally, computing the sum in G^k = ∇_P J(P^k, T(P^k)) achieves a global complexity of O(C²n²d²p). In conclusion, our algorithm is quadratic in both the number of samples in the classes and in the original dimension of the problem.
Algorithm 1 Projected gradient algorithm for WDA

Require: Π_∆: projection onto the Stiefel manifold
1: Initialize k = 0, P^0
2: repeat
3:   compute all the T^{c,c′} as given in Equation (8) by means of Algorithm 2
4:   compute C_b and C_w
5:   compute Equation (11) for P^k
6:   compute Equation (12) for P^k
7:   compute ∂T^k/∂P using automatic differentiation based on Equations (13), (14) and (15)
8:   compute the gradient G^k = ∇_P J(P^k, T(P^k)) using all the above elements
9:   compute the descent direction D^k = Π_∆(P^k − G^k) − P^k
10:  linesearch on the step size α^k
11:  P^{k+1} ← Π_∆(P^k + α^k D^k)
12:  k ← k + 1
13: until convergence
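A stripped-down realization of Algorithm 1 could look as follows; this is a sketch assuming a fixed step size instead of the linesearch, with the projection Π_∆ computed via an SVD (the released code instead relies on Manopt/Pymanopt as stated above).

```python
import numpy as np

def stiefel_projection(P):
    """Project a p x d matrix onto the Stiefel manifold via SVD:
    the closest matrix with orthonormal rows."""
    U, _, Vt = np.linalg.svd(P, full_matrices=False)
    return U @ Vt

def wda_projected_gradient(P0, grad_fn, alpha=1.0, n_iter=100, tol=1e-6):
    """Skeleton of Algorithm 1 with a fixed step size. grad_fn(P) should
    return G^k, the gradient of the objective to be minimized (the negated
    ratio, since problem (1) is a maximization)."""
    P = stiefel_projection(P0)
    for _ in range(n_iter):
        G = grad_fn(P)
        D = stiefel_projection(P - G) - P       # descent direction D^k
        P_next = stiefel_projection(P + alpha * D)
        if np.linalg.norm(P_next - P) < tol:    # crude convergence test
            return P_next
        P = P_next
    return P
```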
Algorithm 2 Sinkhorn-Knopp algorithm with automatic differentiation

Require: K = e^{−λ M_{PX^c, PX^{c′}}}, L the number of iterations
1: Initialize k = 0, u^0 = 1_n, ∂u_j^0/∂P = 0 for all j
2: compute ∂K_{i,j}/∂P {store these gradients for computing ∂T^k/∂P}
3: for k = 1 to L do
4:   compute v^k and u^k as given in Equation (5)
5:   compute ∂v_j^k/∂P for all j {store these gradients for computing ∂T^k/∂P}
6:   compute ∂u_j^k/∂P for all j {store these gradients for computing ∂T^k/∂P}
7: end for
output: u^L, v^L and all the gradients

4 Discussion

The entries T_{i,j} of the transport matrices tend to be larger for nearby points, with an exponential decrease with respect to the distance between Px_i and Px_j (see Appendix A for an illustration on a toy example).

4.2 Regularized Wasserstein distance and Fisher criterion.

The Fisher criterion for measuring separability relies on the ratio of inter-class and intra-class variability of samples. However, this intra-class variability can be challenging to evaluate when information regarding the probability distributions comes only through empirical examples. Indeed, the classical (λ = ∞) Wasserstein distance of a discrete distribution with itself is 0, as with any other metric for empirical distributions. A recent result by Mueller and Jaakkola (2015) also suggests that even splitting the examples from one given class and computing the Wasserstein distance between the resulting empirical distributions will result in an arbitrarily small distance with high probability. This is why the entropy-regularized Wasserstein distance plays a key role in our algorithm, as, to the best of our knowledge, no other metric on empirical distributions would lead to relevant intra-class measures. Indeed, W_λ(PX, PX) = ⟨P^⊤P, C⟩ with C = ∑_{i,j} T_{i,j}(x_i − x_j)(x_i − x_j)^⊤. Hence, since λ < ∞ ensures that the mass of a given sample is split among its neighbours by the transport map T, W_λ(PX, PX) is non-zero and, interestingly, it depends on a weighted covariance matrix C which, because it depends on T, puts more emphasis on couples of neighbouring examples.

More formally, we can show that minimizing W_λ(PX, PX) with respect to P induces a neighbourhood-preserving map P. This means that if an example i is closer to an example j than to an example k in the original space, this relation should be preserved in the projected space; that is, ‖Px_i − Px_j‖ should be smaller than ‖Px_i − Px_k‖. This neighbourhood preservation is enforced if K_{i,j} > K_{i,k} (equivalently, M_{i,j} < M_{i,k}) implies T_{i,j} > T_{i,k}. Indeed, since W_λ(PX, PX) = ∑_{i,j} T_{i,j} ‖Px_i − Px_j‖², the inequality T_{i,j} > T_{i,k} means that examples that are close in the input space are encouraged to be close in the projected space. We show next that there exist situations in which this condition is guaranteed.
Proposition 1 Given T the solution of the entropy-smoothed optimal transport problem, as defined in Eq. (3), between the empirical distribution PX and itself, there exists α ≥ 1 such that, for all i, j, k:  K_{i,j} > α K_{i,k}  ⇒  T_{i,j} > T_{i,k}.
Proof Because we are interested in transporting PX onto itself, the resulting matrix K is symmetric and non-negative. Then, according to Lemma 1 (see appendix), the solution matrix T is also symmetric. Owing to this symmetry and the properties of the Sinkhorn-Knopp algorithm, it can be shown (Knight, 2008) that there exists a non-negative vector v such that T_{i,j} = K_{i,j} v_i v_j. In addition, because this vector is the limit of a convergent sequence obtained by the Sinkhorn-Knopp algorithm, it is bounded away from zero: let A > 0 be a constant such that ∀i, v_i ≥ A. Furthermore, in Lemma 2 (see appendix), we show that ∀i, v_i ≤ 1. Now, if K_{i,j} > (1/A) K_{i,k}, then

  v_j ≥ A > K_{i,k}/K_{i,j}  ⟹  K_{i,j} v_j − K_{i,k} > 0.

Since ∀k, v_k ≤ 1 and T_{i,j} = v_i K_{i,j} v_j, we thus have K_{i,j} v_j − K_{i,k} v_k > 0, which, multiplying by v_i > 0, gives T_{i,j} > T_{i,k}. Hence in Proposition 1 we can take α = 1/A, and since A ≤ 1 we indeed have α ≥ 1.

Note that this proposition provides us with a guarantee on the ratio K_{i,j}/K_{i,k} between examples that induces preservation of the neighbourhood. However, the constant A we exhibit here is probably loose, and thus a larger ratio may still preserve locality.

4.3 Connection to Fisher Discriminant Analysis.

We show next that WDA encompasses Fisher Discriminant Analysis in the limit case where λ approaches 0. In this case, we can see from Eq. (3) that the matrix T does not depend on the data: the solution T of each Wasserstein distance is the matrix that maximizes the entropy, namely the uniform probability distribution T = (1/(nm)) 1_{n,m}. The cross-covariance matrices thus become

  C_{c,c′} = (1/(n_c n_{c′})) ∑_{i,j} (x_i^c − x_j^{c′})(x_i^c − x_j^{c′})^⊤,

and the matrices C_w and C_b then correspond to the intra- and inter-class covariances used in FDA. Since these matrices do not depend on P, the optimization problem (1) boils down to the usual Rayleigh quotient, which can be solved using a generalized eigendecomposition of C_w^{−1} C_b as in FDA. Note that WDA is equivalent to FDA when the classes are balanced (in the unbalanced case one needs to weight the covariances/Wasserstein distances in (1) with the class ratios). Again, we stress that beyond this limit case, when λ > 0, the optimal transport matrix T promotes cross-covariance matrices that are estimated from local relations, as illustrated in Figure 1.

Following this connection, we want to stress again the role played by the within-class Wasserstein distances in WDA. First, from a theoretical point of view, optimizing the ratio instead of just maximizing the between-class distance allows us to encompass well-known methods such as FDA. Secondly, as we have shown in the previous subsection, minimizing the within-class distance provides interesting features such as neighbourhood preservation under mild conditions.

Another intuitive benefit of minimizing the within-class distance is the following. Suppose we have several projection maps that lead to the same optimal transport matrix T. Since W_λ(PX, PX) = ∑_{i,j} T_{i,j} ‖Px_i − Px_j‖² for any P, minimizing W_λ(PX, PX) with respect to P means preferring the projection map that yields the smallest weighted (according to T) pairwise distances of the samples in the projected space. Since, for an example i, the weights {T_{i,j}}_j are mainly non-zero among the neighbours of i, minimizing the within-class distance favours projection maps that tend to tightly cluster points of the same class.
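To make the λ → 0 limit concrete, the following sketch (ours, assuming balanced classes and an invertible C_w) builds the uniform-coupling matrices C_b and C_w and solves the resulting Rayleigh quotient by a generalized eigendecomposition, exactly as in FDA:

```python
import numpy as np
from scipy.linalg import eigh

def fda_limit_subspace(X_by_class, p):
    """WDA in the lambda -> 0 limit: uniform couplings T = 1/(n_c n_c')
    give FDA-style scatter matrices; the ratio (1) becomes a Rayleigh
    quotient solved by a generalized eigendecomposition."""
    d = X_by_class[0].shape[1]
    Cb, Cw = np.zeros((d, d)), np.zeros((d, d))
    for c, Xc in enumerate(X_by_class):
        for cp, Xcp in enumerate(X_by_class):
            if cp < c:
                continue
            D = Xc[:, None, :] - Xcp[None, :, :]
            # uniform coupling: plain averaged cross-covariance C_{c,c'}
            Ccc = np.einsum('ijk,ijl->kl', D, D) / (len(Xc) * len(Xcp))
            if c == cp:
                Cw += Ccc
            else:
                Cb += Ccc
    # largest generalized eigenvectors of (C_b, C_w) span the subspace;
    # C_w must be positive definite for eigh(a, b)
    w, V = eigh(Cb, Cw)
    return V[:, np.argsort(w)[::-1][:p]].T   # p x d projection matrix
```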
4.4 Relation to other information-theoretic discriminant analysis.

Several information-theoretic criteria have been considered for discriminant analysis and dimensionality reduction. Compared to Fisher's criterion, they have the advantage of going beyond a simple sketch of the data pdf based on second-order statistics. Two recent approaches are based on the idea of maximizing the distance between the probability distributions of the data in the projection subspace. They differ only in the choice of the metric between pdfs (one being an L2 distance (Emigh et al, 2015) and the second one a Wasserstein distance (Mueller and Jaakkola, 2015)). While our approach also seeks a projection that maximizes a pdf distance, it has the unique feature of finding projections that preserve neighbourhoods. Other recent approaches have addressed the problem of supervised dimensionality reduction from an information-theoretic learning perspective, but without directly maximizing a distance between pdfs in the projected subspaces. We discuss two methods to which we compare in the experimental analysis. The approach of Suzuki and Sugiyama (2013), denoted LSDR, seeks a low-rank subspace of the inputs that contains sufficient information for predicting the output values. In their work, the authors define the notion of sufficiency through conditional independence of the outputs and the inputs given the projected inputs, and evaluate this measure through squared-loss mutual information. One major drawback of their approach is that they need to estimate a density ratio, introducing an extra layer of complexity and an error-prone task. A similar idea has been investigated by Tangkaratt et al (2015), who use quadratic mutual information for evaluating the statistical dependence between projected inputs and outputs (the method has been named LSQMI). While they avoid the estimation of a density ratio, they still need to estimate derivatives of the quadratic mutual information. Like our approach, the method of Giraldo and Principe (2013) avoids density estimation for performing supervised metric learning. Indeed, the key aspect of their work is to show that the Gram matrix of some data samples can be related to information-theoretic quantities such as conditional entropy without the need of estimating pdfs. Based on this finding, they introduced a metric learning approach, coined CEML, that minimizes the conditional entropy between labels and projected samples. While their approach is appealing, we believe that a direct criterion such as Fisher's is more relevant and robust for classification purposes, as shown in our experiments.

4.5 Wasserstein distances and machine learning.

Wasserstein distances are mainly derived from the theory of optimal transport (Villani, 2008) and provide a useful way to compare probability measures. Their practical deployment in machine learning problems has been alleviated thanks to regularized versions of the original problem (Cuturi, 2013; Benamou et al, 2015). The geometry of the space of probability measures endowed with the Wasserstein metric allows one to consider various objects of interest such as means or barycenters (Cuturi and Doucet, 2014; Benamou et al, 2015), and has led to a generalization of PCA to the space of probability measures (Seguy and Cuturi, 2015).
It has been considered in the problem of semi-supervised learning (Solomon et al, 2014), in domain adaptation (Courty et al, 2016), and in the definition of loss functions (Frogner et al, 2015). More recently, it has also been considered in a subspace identification problem for analyzing the differences between distributions (Mueller and Jaakkola, 2015); contrary to our approach, however, they only consider projections to univariate distributions and as such cannot find subspaces of dimension greater than 1. Recent work has also proposed to use the Wasserstein distance to measure similarity between documents (Huang et al, 2016) and to learn a metric that encodes class information between samples. Note that in our work we use the Wasserstein distance between empirical distributions, and not between training samples, yielding a very different approach.

Fig. 2 Illustration of subspace learning methods on a nonlinearly separable 3-class toy example of dimension d = 10, with 2 discriminant features (shown on the upper left) and 8 Gaussian noise features. Projections onto p = 2 dimensions of the test data are reported for several subspace estimation methods.
5 Numerical experiments

In this section we illustrate how WDA works on several learning problems. First, we evaluate our approach on a simple simulated dataset with a 2-dimensional discriminative subspace. Then, we benchmark WDA on the MNIST and Caltech datasets with pre-defined hyperparameter settings for the methods that have some. Unless specified and justified otherwise, for LFDA and LMNN we have set the number of neighbours to 5. For CEML, the Gaussian kernel width σ has been fixed to √p, which is the value used by Giraldo and Principe (2013) across all their experiments. For WDA, we have chosen λ = 0.01 except for the toy problem. The final experiment compares the performance of WDA and its competitors on some UCI dataset problems, in which the relevant parameters have been validated. Note that, in the spirit of reproducible research, the code will be made available to the community; the Python implementation of WDA is already available as part of the POT Python Optimal Transport toolbox (Flamary and Courty, 2017) on GitHub: https://github.com/rflamary/POT/blob/master/ot/dr.py
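Usage of that implementation is along the following lines (the exact signature of ot.dr.wda may differ between POT versions, and the module additionally requires the pymanopt and autograd packages):

```python
import numpy as np
import ot.dr  # POT's dimensionality-reduction module

# X: (n, d) training samples, y: (n,) integer class labels
X = np.random.randn(100, 10)
y = np.repeat(np.arange(2), 50)

# learn a p-dimensional WDA subspace with regularization reg (= lambda)
P, proj = ot.dr.wda(X, y, p=2, reg=1.0)

X_embedded = proj(X)  # project data with the learned map
```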
Fig. 3 Prediction error on the simulated dataset: (left) projection dimension fixed to p = 2 and error for varying K in the KNN classifier; (right) evolution of the performance with the projection dimension p and the best K in the KNN classifier.

5.1 Practical implementation.

In order to make the method less sensitive to the dimension and scaling of the data, we propose to use a pre-computed adaptive regularization parameter for each Wasserstein distance in (1). Denote by λ_{c,c′} such a parameter, yielding a distance W_{λ_{c,c′}}. In practice, we initialize P with the PCA projection and define, between classes c and c′,

  λ_{c,c′} = λ ( (1/(n_c n_{c′})) ∑_{i,j} ‖Px_i^c − Px_j^{c′}‖² )^{−1}.

These values are computed a priori and kept fixed in the remaining iterations. They have the advantage of promoting a similar regularization strength between the inter- and intra-class distances.

We have compared our WDA algorithm to classical dimensionality reduction algorithms such as PCA and FDA, to locality-preserving methods such as LFDA and LMNN, and to the recent mutual-information-based supervised dimensionality and metric learning algorithms LSDR and CEML mentioned above. For the last three methods, we have used the authors' implementations. We have also considered LSQMI as a competitor but do not report its performances, as they were always worse than those of LSDR.

5.2 Simulated dataset.

This dataset has been designed to evaluate the ability of a subspace method to uncover a discriminative linear subspace when the classes are not linearly separable. It is a 3-class problem in dimension d = 10 with two discriminative dimensions, the remaining 8 containing Gaussian noise.
In the 2 discriminative features, each class is composed of two modes, as illustrated in the upper left part of Figure 2.

Figure 2 also illustrates the projections of the test samples onto the two-dimensional subspaces obtained by the different approaches. We can see that for this dataset WDA, LDSR and CEML lead to a good discriminant subspace. This illustrates the importance of estimating the relations between samples in the projected space, as opposed to the original space as done in LMNN and LFDA.

Quantitative results are given in Figure 3 (left), where we report the prediction error of a K-Nearest-Neighbors (KNN) classifier for n = 100 training examples and n_t = 5000 test examples. In this simulation, all prediction errors are averaged over 20 data generations, and the neighbour parameters of LMNN and LFDA have been selected empirically to maximize performance (respectively 5 for LMNN and 1 for LFDA). We can see in the left part of the figure that WDA, LDSR and CEML, and to a lesser extent LMNN, can estimate the relevant subspace, when the optimal dimension value is given to them, in a way that is robust to the choice of K. Note that slightly better performances are achieved by LSDR and CEML. In the right plot of Figure 3, we show the performances of all algorithms when varying the dimension of the projected space. We note that WDA, LMNN, LSDR and LFDA achieve their best performances for p = 2 and that prediction errors rapidly increase as p is misspecified. Instead, CEML performs very well for p ≥ 2. Being sensitive to the correct projected-space dimensionality can be considered an asset, as this dimension is typically optimized (e.g., by cross-validation), making it easier to spot the best dimension reduction. On the contrary, CEML is robust to a mis-specification of the projected-space dimensionality, at the expense of under-estimating the best reduction of dimension.

In the left plot of Figure 4, we illustrate the sensitivity of WDA w.r.t. the regularization parameter λ. WDA returns equivalently good performance over almost a full order of magnitude of λ. This suggests that a coarse validation can be performed in practice. The right panel of Figure 4 shows the performance of WDA for different numbers of inner Sinkhorn iterations L. We can see that even if this parameter leads to different performances for large values of λ, it is still possible to find some λ that yields near-best performance even for a small value of L.

Fig. 4 Comparison of WDA performances on the simulated dataset: (left) as a function of λ; (right) as a function of λ for different numbers of fixed-point iterations L.
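For reference, a generator matching this description might look as follows; only the overall structure is specified in the text (3 classes, two informative dimensions with two Gaussian modes per class, 8 noise dimensions), so the mode locations and variances below are our assumptions:

```python
import numpy as np

def make_toy(n_per_mode=50, noise_dim=8, seed=0):
    """Simulated dataset in the spirit of Section 5.2: 3 classes, two
    informative dimensions with two modes per class, plus Gaussian noise
    features. Mode placements are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    modes = {0: [(-2, -2), (2, 2)], 1: [(-2, 2), (2, -2)], 2: [(0, -3), (0, 3)]}
    for label, centers in modes.items():
        for cx, cy in centers:
            pts = rng.normal([cx, cy], 0.5, size=(n_per_mode, 2))
            X.append(np.hstack([pts, rng.normal(size=(n_per_mode, noise_dim))]))
            y.append(np.full(n_per_mode, label))
    return np.vstack(X), np.concatenate(y)
```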
Fig. 5 Averaged prediction error on MNIST with projection dimension (left) p = 10 and (right) p = 20. In these plots, LSQMI has been omitted due to poor performances.
Fig. 6 Averaged prediction error on the Caltech dataset along the projection dimension. In this plot, LSQMI has been omitted due to poor performances.

5.3 MNIST dataset.

Our objective with this experiment is to measure how robust our approach is with only few training samples despite the high dimensionality of the problem. To this end, we draw n = 1000 samples for training and report the KNN prediction error as a function of K for the different subspace methods when projecting onto p = 10 and p = 20 dimensions (resp. left and right plots of Figure 5). The reported scores are averages over 20 realizations of the same experiment. We also limit the analysis to L = 10 Sinkhorn fixed-point iterations and λ = 0.01.
For both values of p, WDA finds a better subspace than the original space, which suggests that most of the discriminant information available in the training dataset has been correctly extracted. Conversely, the other approaches struggle to find a relevant subspace in this configuration. In addition to the better prediction performance, we want to emphasize that in this configuration WDA leads to a dramatic compression of the data, from 784 to 10 or 20 features, while preserving most of the discriminative information.

To gain a better understanding of the corresponding embedding, we have further projected the data from the 10-dimensional space to a 2-dimensional one using t-SNE (Van der Maaten and Hinton, 2008). In order to make the embeddings comparable, we have used the same initializations of t-SNE for all methods. The resulting 2D projections of the test samples are shown in Figure 7. We can clearly see the overfitting behaviour of FDA, LFDA, LMNN and LDSR, which separate the training samples accurately but fail to separate the test samples. Instead, WDA is able to disentangle the classes in the training set while preserving generalization abilities.

Fig. 7 2D t-SNE of the MNIST samples projected on p = 10 dimensions for different approaches: (first and third lines) training set; (second and fourth lines) test set.

5.4 Caltech dataset.

For this experiment, λ of WDA was empirically set to 10^{−2}. The K in KNN was set to 3, which is a common standard setting for this classifier. The results reported in Figure 6 are averaged over 10 realizations of the same experiment.
When p ≥ 5, WDA already finds a subspace which gathers the relevant discriminative information from the original space. In this experiment, LMNN yields a better subspace for small values of p, while WDA is the best performing method for p ≥ 6. Those results highlight the potential interest of using WDA as a linear dimensionality reduction layer in neural-network architectures.

5.5 Running-time

For the above experiments on MNIST and Caltech, we have also evaluated the running times of the compared algorithms. The LFDA, LMNN, LDSR and CEML codes are the Matlab codes released by their authors. Our WDA code is Python-based and relies on the POT toolbox. All these codes have been run on a 16-core Intel Xeon E5-2630 CPU, operating at 2.4 GHz with GNU/Linux and 144 GB of RAM.

The running times needed for computing the learned subspaces are reported in Table 1. We first remark that LSDR is not scalable: for instance, it needs several tens of hours for computing the projection from 4096 to 14 dimensions on Caltech. More generally, we can note that our WDA algorithm scales well; it is cheaper to compute than LMNN and far less expensive than CEML on our machine. We believe our WDA algorithm better leverages multi-core machines owing to the large amount of matrix-vector multiplications needed for computing the Sinkhorn iterations.

5.6 UCI datasets.

We have also compared the performances of the dimensionality reduction algorithms on some UCI benchmark datasets (Lichman, 2013).
Table 1 Averaged running time in seconds of the different algorithms for computing the learned subspaces.

Datasets      PCA        FDA         LFDA        LMNN         LSDR            CEML        WDA
Mnist (10)    0.39(0.1)  0.69(0.2)   0.55(0.4)   20.55(14.2)  29813(5048)     87.02(8.7)  6.28(0.3)
Mnist (20)    0.38(0.0)  0.58(0.0)   0.54(0.2)   18.27(17.0)  60147(11176)    90.22(8.8)  6.15(0.1)
Caltech (14)  0.53(0.3)  21.38(6.1)  11.43(2.0)  39.56(6.3)   140776(53036)   14.59(7.6)  5.29(0.1)

Table 2 Average test errors over 20 trials on UCI datasets. In bold, the lowest test error across algorithms; underlined, averaged test errors that are statistically non-significantly different according to a signrank test with p-val = 0.05. The result of LSQMI on caltech has not been reported due to a lack of convergence after a few days of computation.

Datasets     Orig.   PCA     FDA     LFDA    LMNN    LSDR    LSQMI    CEML    WDA
wines        24.33   26.57   37.87   29.21   32.81   32.81   46.29
ionosphere   26.14   26.90   29.63   27.64   30.80   31.08   36.42    22.87
isolet       17.50   17.60   15.12   13.96
caltechpca   23.39   13.93   12.03   18.19   11.55   36.08   100.00   13.65
Aver. Rank   5.4     5.5     5.2     5.2     3.4     5.7     8.9      3.5     2.2

The experimental setting is similar to the one proposed by the authors of LSQMI (Tangkaratt et al, 2015). For these UCI datasets, we have appended the original input features with 100 noise features. We have split the examples 50%-
50% into a training and a test set. Hyper-parameters such as the number of neighbours for the KNN and the dimensionality of the projection have been cross-validated on the training set, chosen respectively among the values [1 : 2 : 19] (in Matlab notation) and [5, …]. Test errors are reported in Table 2. There is only one dataset (vehicles) for which WDA performs significantly worse than the top methods. Interestingly, LSDR and LSQMI seem to be less robust than LMNN and FDA, against which they had not been compared in the original paper (Tangkaratt et al, 2015).
6 Conclusion

This work presented Wasserstein Discriminant Analysis, a new and original linear discriminant subspace estimation method. Based on the framework of regularized Wasserstein distances, which measure a global similarity between empirical distributions, WDA operates by separating the distributions of the different classes in the subspace while maintaining a coherent structure at the class level. To this extent, the use of regularization in the Wasserstein formulation allows one to effectively bridge the gap between global coherency and the local structure of the class manifold. This comes at the cost of a difficult bilevel optimization problem, for which we proposed an efficient method based on automatic differentiation of the Sinkhorn algorithm. Numerical experiments show that the method performs well on a variety of features, including those obtained with a deep neural architecture. Future work will consider stochastic versions of the same approach in order to further enhance the ability of the method to handle large volumes of high-dimensional data.
References
Absil PA, Mahony R, Sepulchre R (2009) Optimization algorithms on matrix manifolds. Princeton University Press
Bach FR, Lanckriet GR, Jordan MI (2004) Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, p 6
Benamou JD, Carlier G, Cuturi M, Nenna L, Peyré G (2015) Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37(2):A1111–A1138
Bengio Y (2000) Gradient-based optimization of hyperparameters. Neural Computation 12(8):1889–1900
Bengio Y (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127
Bonnans JF, Shapiro A (1998) Optimization problems with perturbations: A guided tour. SIAM Review 40(2):228–264
Bonneel N, Peyré G, Cuturi M (2016) Wasserstein barycentric coordinates: Histogram regression using optimal transport. ACM Transactions on Graphics 35(4)
Boumal N, Mishra B, Absil PA, Sepulchre R (2014) Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research 15(1):1455–1459
Burges CJ (2010) Dimension reduction: A guided tour. Now Publishers
Chapelle O, Vapnik V, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Machine Learning 46(1-3):131–159
Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Annals of Operations Research 153(1):235–256
Courty N, Flamary R, Tuia D, Rakotomamonjy A (2016) Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence
Cuturi M (2013) Sinkhorn distances: Lightspeed computation of optimal transport. In: NIPS, pp 2292–2300
Cuturi M, Doucet A (2014) Fast computation of Wasserstein barycenters. In: ICML
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: A deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st International Conference on Machine Learning, pp 647–655
Emigh M, Kriminger E, Príncipe JC (2015) Linear discriminant analysis with an information divergence criterion. In: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–6
Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: A cluster ensemble approach. In: ICML, vol 3, pp 186–193
Flamary R, Courty N (2017) POT: Python Optimal Transport library
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer Series in Statistics, Springer, Berlin
Frogner C, Zhang C, Mobahi H, Araya M, Poggio T (2015) Learning with a Wasserstein loss. In: NIPS, pp 2044–2052
Giraldo LGS, Principe JC (2013) Information theoretic learning with infinitely divisible kernels. In: Proceedings of the First International Conference on Learning Representations (ICLR), pp 1–8
Griffin G, Holub A, Perona P (2007) Caltech-256 object category dataset. Tech. Rep. CNS-TR-2007-001, California Institute of Technology
Huang G, Guo C, Kusner MJ, Sun Y, Sha F, Weinberger KQ (2016) Supervised word mover's distance. In: Advances in Neural Information Processing Systems, pp 4862–4870
Knight PA (2008) The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications 30(1):261–275
Koep N, Weichwald S (2016) Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research 17:1–5
Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605
Mueller J, Jaakkola T (2015) Principal differences analysis: Interpretable characterization of differences between distributions. In: NIPS, pp 1693–1701
Petersen KB, Pedersen MS, et al (2008) The matrix cookbook. Technical University of Denmark 7:15
Peyré G, Cuturi M (2018) Computational Optimal Transport. To be published in Foundations and Trends in Computer Science, URL https://optimaltransport.github.io
Schmidt M (2008) minConf: Projection methods for optimization with simple constraints in Matlab
Schölkopf B, Smola AJ (2002) Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press
Seguy V, Cuturi M (2015) Principal geodesic analysis for probability measures under the optimal transport metric. In: NIPS, pp 3294–3302
Solomon J, Rustamov R, Leonidas G, Butscher A (2014) Wasserstein propagation for semi-supervised learning. In: ICML, pp 306–314
Sugiyama M (2007) Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. The Journal of Machine Learning Research 8:1027–1061
Suzuki T, Sugiyama M (2013) Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation 25(3):725–758
Tangkaratt V, Sasaki H, Sugiyama M (2015) Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. arXiv preprint arXiv:1508.01019
Van Der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative review. Journal of Machine Learning Research 10:66–71
Villani C (2008) Optimal transport: old and new, vol 338. Springer Science & Business Media
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10:207–244
Xing EP, Ng AY, Jordan MI, Russell S (2003) Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems 15:505–512
Zhang L, Dong W, Zhang D, Shi G (2010) Two-stage image denoising by principal component analysis with local pixel grouping. Pattern Recognition 43(4):1531–1549
A Illustration of the transport T^{c,c′}
In this section, we provide intuition on how the transport T^{c,c′} between classes c and c′ behaves on a 2D toy problem. Recall that this matrix plays an essential role in how the covariance matrix C is estimated in Equation (6).

In this example, illustrated in Figure 8, two bi-modal Gaussian distributions are sampled to produce two distributions representing two classes. We illustrate in Figure 9 the transports T^{1,2} (inter-class) and {T^{1,1}, T^{2,2}} (intra-class). The corresponding transport matrices are either displayed in matrix form as inserts, or as connections between the samples. Those connections have a width parametrized by the magnitude of the connection (i.e., a small t_{i,j} value is displayed as a very thin connection). Note that, for visualization purposes, the magnitudes of the T elements displayed in matrix form are normalized by the largest magnitude in the matrix. The transport maps can be observed in Figure 9 for three different values of the λ parameter. Larger values of λ concentrate the connections on specific modes of the distributions. When λ is smaller, inter-mode connections start to appear, which allows the data distributions to be considered at a larger scale when computing C. Regarding the inter-class transport T^{1,2}, one can also observe the specific relations induced by the optimal transport maps, which do not associate modes together, but rather dispatch one mode of each class onto the two modes of the other.

B Implicit function gradient computation
In this section, we propose to compute this Jacobian based on the implicit function theorem.
Fig. 8 Illustration of the two sampled bi-modal classes c = 1 and c′ = 2.

For clarity's sake, in this subsection we do not use the c, c′ indices, and T represents an optimal transport matrix between n and m samples projected with P. First, we express the function T(P) as an implicit function using the optimality conditions of the problem defining the optimal T in Equation (6). The Lagrangian of this problem can be expressed as (Cuturi, 2013):

  L = ∑_{i,j} ( λ t_{i,j} m_{i,j}(P) + t_{i,j} log(t_{i,j}) ) + ∑_i α_i ( ∑_j t_{i,j} − r_i ) + ∑_j β_j ( ∑_i t_{i,j} − c_j ),

where α and β are the dual variables associated with the sum constraints, m_{i,j} = ‖Px_i − Pz_j‖², and in our particular case r_i = 1/n and c_j = 1/m, ∀i, j. One can define an implicit function g(P, T, α, β) : R^{p×d+n×m+n+m} → R^{n×m+n+m} from the above Lagrangian by computing its gradient w.r.t. (T, α, β) and setting it to zero, owing to optimality. The implicit function theorem gives us the following relation:

  ∇_P g = ∂g/∂P + (∂g/∂T)(∂T/∂P) + (∂g/∂α)(∂α/∂P) + (∂g/∂β)(∂β/∂P) = 0,

which can be reformulated as

  [ ∂T/∂P ; ∂α/∂P ; ∂β/∂P ] = −E^{−1} ∂g/∂P,  with E = [ ∂g/∂T, ∂g/∂α, ∂g/∂β ],    (16)

when the function is well defined and E is invertible. The derivative ∂T/∂P can be deduced from the upper part of the left-hand term. Note that all the partial derivatives in Eq. (16) are easy to compute. Additionally, E is an (nm+n+m) × (nm+n+m) matrix which is very sparse, as shown in the sequel. However, assuming for instance that the number of points in each class is the same (m = n), using this technique would amount to solving a large n² × n² linear system with a worst-case complexity of O(n⁶).
Fig. 9 Illustration of the evolution of the transport for two classes for three decreasing values of the λ parameter, from λ = 1 (first row) down to λ = 0.1 (last row). The left column illustrates the inter-class relations, while the right column illustrates the intra-class relations.

We now detail the computation of the gradient using the implicit function theorem. Note that we use the notation of the paper and that we want to compute the Jacobian ∂T/∂P. First we construct the implicit function g(P, T, α, β) : R^{p×d+n×m+n+m} → R^{n×m+n+m} from the Lagrangian function given above by writing the optimality conditions of the OT problem:

  ∂L/∂t_{k,l} = λ (x_k − z_l)^⊤ P^⊤P (x_k − z_l) + log(t_{k,l}) + 1 + α_k + β_l = 0,
  ∂L/∂α_i = ∑_j t_{i,j} − r_i = 0,
  ∂L/∂β_j = ∑_i t_{i,j} − c_j = 0,

for all k, l, i, j. The Jacobian ∂T/∂P can then be computed from the implicit function by solving the following linear problem:

  [ ∂T/∂P ; ∂α/∂P ; ∂β/∂P ] = −E^{−1} ∂g/∂P,  with E = [ ∂g/∂T, ∂g/∂α, ∂g/∂β ].    (17)

First, t = vec(T) is vectorized (as in Matlab) in column-major format. Then

  ∂g/∂T = [ diag(t)^{−1} ; I_n I_n ⋯ I_n ; L¹_{m,n} L²_{m,n} ⋯ L^m_{m,n} ],    (18)

where L^k_{m,n} is the R^{m×n} matrix of zeros with all coefficients on line k equal to 1, and

  ∂g/∂α = [ I_n ; I_n ; ⋯ ; I_n ; 0_{n,n} ; 0_{m,n} ],   ∂g/∂β = [ (L¹_{m,n})^⊤ ; (L²_{m,n})^⊤ ; ⋯ ; (L^m_{m,n})^⊤ ; 0_{n,n} ; 0_{m,n} ].    (19)

Now we compute the last element, ∂g/∂P, using the vectorization p = vec(P) and ∆_{i,j} = x_i − z_j. First note that

  ∂(x^⊤P^⊤Px)/∂p_{m,l} = 2 x_l ∑_i x_i p_{m,i} = 2 x_l (P(m,:) x),

which leads to the following Jacobian:

  ∂g/∂P = 2λ [ ∆^⊤_{1,1} ⊗ ∆^⊤_{1,1}P^⊤ ; ∆^⊤_{2,1} ⊗ ∆^⊤_{2,1}P^⊤ ; ⋯ ; ∆^⊤_{n,m} ⊗ ∆^⊤_{n,m}P^⊤ ; 0_{n+m,dp} ],    (20)

where the upper part of the matrix can be seen as a column-only Kronecker product between ∆ and P∆. All the elements are now in place for the linear system (17), which can be solved using any efficient method for sparse linear systems.

C Lemmas
Lemma 1
If the matrix M ∈ R^{n×n} is non-negative and symmetric, then the matrix T defined as

  T = argmin_{T ∈ U_{n,n}}  λ⟨T, M⟩ − Ω(T)

is also symmetric and non-negative. Here, Ω is the entropy of the matrix T.

Proof This optimization problem is strictly convex for λ < ∞ and thus admits a unique solution. We show in the sequel that T^⊤ achieves the same objective value as T; thus T^⊤ is also a minimizer, which naturally leads to T^⊤ = T.

First note that the constraints are symmetric; thus T^⊤ is feasible. In addition, because the entropy only depends on the individual entries of the matrix, Ω(T) = Ω(T^⊤). Finally,

  ⟨T^⊤, M⟩ = ∑_{i,j} M_{i,j} T^⊤_{i,j} = ∑_{i,j} M_{i,j} T_{j,i} = ∑_{i,j} M_{j,i} T_{j,i} = ⟨T, M⟩,

which proves that both matrices lead to the same objective value.

Lemma 2 Suppose that T is the solution of an entropy-smoothed optimal transport problem, with matrix K being symmetric and such that ∀i, K_{i,i} = 1. There exists a vector v such that ∀i, j, T_{i,j} = K_{i,j} v_i v_j and ∀i, v_i ≤ 1.

Proof The existence of the v such that T_{i,j} = K_{i,j} v_i v_j comes from the fact that the optimization problem can be solved using the Sinkhorn-Knopp algorithm. Through the constraints of the optimal transport problem, we have ∀i, j, T_{i,j} = K_{i,j} v_i v_j ≤ 1/n. When i = j, as K_{i,i} = 1, we have v_i² ≤ 1/n and thus v_i ≤ 1/√n ≤ 1.
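As a side note, both lemmas are easy to check numerically; the following snippet is our own sanity check (not part of the paper) and verifies the symmetry of T and the bound v_i ≤ 1 on a random symmetric kernel with unit diagonal:

```python
import numpy as np

# Numerical check of Lemmas 1-2: a symmetric K with unit diagonal yields a
# symmetric coupling T = diag(v) K diag(v) with v_i <= 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
M = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # symmetric, zero diagonal
K = np.exp(-M)                                  # hence K_{ii} = 1
n = len(K)
u = np.ones(n)
for _ in range(2000):                           # Sinkhorn to convergence
    v = (np.ones(n) / n) / (K.T @ u)
    u = (np.ones(n) / n) / (K @ v)
T = u[:, None] * K * v[None, :]
assert np.allclose(T, T.T, atol=1e-6)           # Lemma 1: T is symmetric
w = np.sqrt(np.diag(T))                         # symmetric scaling: T_ii = v_i^2
assert (w <= 1 + 1e-8).all()                    # Lemma 2: v_i <= 1
```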