A contribution to Optimal Transport on incomparable spaces
Doctoral thesis of
Université Bretagne Sud
COMUE Université Bretagne Loire
École Doctorale: Mathématiques et Sciences et Technologies de l'Information et de la Communication
Specialty: Computer Science
By
Titouan VAYER
A contribution to Optimal Transport on incomparable spaces
Thesis presented and defended in Vannes, on November 5, 2020. Research unit: IRISA. Thesis No: 573.

Reviewers:
Gabriel Peyré, CNRS Senior Researcher (Directeur de Recherche), DMA, École Normale Supérieure.
Marc Sebban, Full Professor, Université Jean Monnet, Saint-Étienne.

Composition of the jury:

President: Filippo Santambrogio, Full Professor, Université Claude Bernard, Lyon 1.
Examiners: Julie Delon, Full Professor, Université Paris Descartes; Pierre Vandergheynst, Full Professor, École Polytechnique Fédérale de Lausanne.
Thesis supervisor: Nicolas Courty, Full Professor, Université Bretagne Sud.
Thesis co-supervisors: Laetitia Chapel, Associate Professor (Maître de Conférences), Université Bretagne Sud; Romain Tavenard, Associate Professor (Maître de Conférences), Université de Rennes 2.

Invited:
Rémi Flamary, Associate Professor (Maître de Conférences), HDR, CMAP, École Polytechnique.

Acknowledgements

Since it is thus permitted to write more freely here and to set rigor aside for these acknowledgements: thank you to my supervisors, Laetitia, Nicolas and Romain, the official ones. You let me play with mathematics and with computers with, I think, a rare freedom. It was exciting, gratifying and inspiring, and I will keep a lasting memory of your trust. Thank you Rémi, the unofficial one but just as present. The only thing matching the frequency of your bursts of enthusiasm is your conjectures, often correct, I believe. I apologize belatedly for the many aborted exothermic reactions that dotted these three years. Beyond everything that made this manuscript possible, thank you for the moments elsewhere and nearby, for the discussions, for the laughter. I will miss it. Thanks also to Nicolas Klutchnikoff, you saved me in some rather important situations; our hundred-year discussions with Romain in front of a blackboard in Villejean will remain. Thank you Ievgen, I hope we can keep playing Optimal Transport south of the Loire. Thanks to Obelix, it was a pleasure to spend these three years by your side.

I also thank the reviewers, Gabriel Peyré and Marc Sebban, as well as the members of the jury, Julie Delon, Filippo Santambrogio and Pierre Vandergheynst, researchers whose work I admire. Thank you for taking the time to look at mine.

Thank you Caglayan, we followed each other through space and time, and I am so happy that you were on this trip. I am confident that we will share a şerefe in Turkey one day. Until then I believe we can still have some Yec'hed mat in Rennes.

Thank you Mathilde, Maëlle, Thibault, Hadrien. Simply put, without you there would not be more than three badly set lines on these pages. You made life light; this Rennes cocoon will no doubt keep swirling up there for a long time, woven from happy memories. Thank you Étienne, brotherhood is not only a matter of blood. Thank you to the Maries, centers of the triangle triangle triangle. Bretons get lost everywhere, in Rennes, Saint-Brieuc, Nantes (oh yes); you are a thousand, and it will always be a great pleasure to do somersaults and laugh like whale calves in your company. Thank you la Lilla: I like to think that your meeting marks the beginning of all the commotion likely to follow. Thank you Ludo and Max, without you I would most probably be in the Empire.

Thank you to my parents, to my family; I have probably not managed to make all this truly clear, but with time, who knows... Your enthusiasm touches me. Finally, thank you Myriam, being by your side brings me so much; I have without a doubt learned more there than in all my mathematical wanderings.

Résumé (Français)
Introduction

"Thus we see in the Sciences how brilliant theories, long useless, may suddenly become the foundation of the most important applications, while applications that seem very simple may give rise to the idea of abstract theories not yet needed, steer the work of Geometers toward those theories, and open a new career to them." This is how Nicolas De Condorcet [De Condorcet 1781] introduced, in the 18th century, the work of Gaspard Monge [Monge 1781], which lies at the heart of optimal transport theory. How should one move masses from one place to another so as to minimize the overall displacement effort? Condorcet was right: this "simple idea" has matured over the years into an elegant theory at the crossroads of mathematics and optimization, and is today at the center of many applications in machine learning.

Broadly speaking, the interest of optimal transport lies both in its ability to provide relations, correspondences, between sets of points, and in the fact that it induces a geometric notion of distance between probability distributions (see Figure 1). These two properties have proven very useful for a wide range of tasks, to name only a few: image registration [Haker 2001], content-based image retrieval [Rubner 1998], domain adaptation [Courty 2017], signal processing [Kolouri 2017], unsupervised learning [Arjovsky 2017, Genevay 2018], supervised and semi-supervised learning [Frogner 2015, Solomon 2014], natural language processing [Kusner 2015], fairness in machine learning [Gordaliza 2019], as well as biology [Schiebinger 2019] and astrophysics [Frisch 2002].

Despite its many properties, the optimal transport problem remains difficult to solve in practice and is known to suffer from scalability issues that prevent its use on large data, which are ubiquitous in machine learning. The emergence of optimal transport has been greatly fostered by recent advances in optimization [Cuturi 2013, Altschuler 2017, Genevay 2016].

Moreover, in its original formulation, optimal transport remains mostly limited to applications where there is a "direct way" to compare the points, called samples, drawn from the distributions. It is thus often restricted to cases where the samples belong to the same metric space, most of the time a Euclidean space. This limitation notably prevents the use of optimal transport for a variety of tasks in which additional structural information is available on the data, information that generally cannot be described by Euclidean spaces, for instance when the samples are described by graphs, trees or time series. It also prevents its use when the samples lie in different, potentially unrelated, metric spaces, or when a notion of distance between the samples cannot easily be defined.
All these cases belong to what we will refer to in the sequel as the incomparable setting. An appealing solution lies in the elegant theory of the Gromov-Wasserstein distance [Memoli 2011], which does not require comparing samples across the distributions. However, this distance is known to be even harder to compute than classical optimal transport. The goal of this thesis is to help overcome these obstacles by:
(i) Defining new optimal transport problems on incomparable spaces, in particular for structured data.
(ii) Narrowing the gap between the theoretical understanding of classical optimal transport and that of the Gromov-Wasserstein problem.

Figure 1: Optimal Transport provides tools to find correspondences, relations, between sets of points, and yields a geometric notion of distance between probability distributions. (left) Two sets of points in 2D; the correspondences found by OT are shown as dotted lines. (right) The OT distance between a blue and a red probability distribution as a function of the mean of the blue distribution: the further the mean of the blue distribution is from the red one, the larger the distance.
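The behavior shown in the right panel of Figure 1 is easy to reproduce in 1D, where the optimal transport plan between empirical measures with uniform weights and equal sample sizes simply matches sorted samples. A minimal numpy sketch (the function name is mine):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two 1D empirical measures with uniform weights and
    equal sample sizes: the optimal plan simply matches sorted samples."""
    x_s, y_s = np.sort(x), np.sort(y)
    return np.mean(np.abs(x_s - y_s) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
red = rng.normal(0.0, 1.0, size=1000)
blue = rng.normal(5.0, 1.0, size=1000)
print(wasserstein_1d(blue, red))  # close to 5.0, the gap between the means
```

Shifting the mean of the blue distribution further away makes the returned distance grow accordingly, exactly as in the figure.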
Structured data and incomparable spaces in machine learning

Before going into the details of the contributions of this thesis, it is important to make explicit the notions of structure and of incomparable spaces. A first approach is to view structural information as the piece of information that encodes the specific relations existing between the components of an object. This definition can be related to the concept of relational reasoning [Battaglia 2018], in which entities (or elements with attributes, such as the intensity of a signal) coexist with certain relations or properties between them.

Such cases of structured data arise naturally when the structure is explicit. For example, in the context of graphs, the edges are representative of the structure, so that each attribute of the graph (typically a vector of R^d) can be linked to the others through the edges between the nodes. Such objects notably appear when modeling chemical compounds or molecules [Kriege 2016], brain connectivity [Ktena 2017] or social networks [Yanardag 2015]. This generic family of structured data also includes trees [Day 1985] as well as time series, whose values are correlated over time, so that comparing them requires taking the temporal structure, the direction of time, into account in the modeling.

Structural information on data in machine learning can appear in more subtle, even implicit, ways. It can materialize when building a prior, or structural bias, on the data representation. For example, in the context of deep learning, many architectures exploit equivariance to a transformation to improve the generalization of models. Convolutional neural networks (CNNs) are a prime example of such a structural prior: a CNN is equivariant to translations, i.e. if we translate the input of the neural network, the output of the convolutions is translated accordingly. This structural bias is known to reveal a spatial hierarchy on pixels that is useful in many applications [Wang 2018, Chen 2018]. Other works have studied the design of layers equivariant to other transformations such as permutations, rotations or reflections [Kondor 2018, Cohen 2016, Gens 2014].

In contrast with "end-to-end" methods such as neural networks, other more "hand-engineered" approaches, based on image segmentation, reveal useful structure in images [Bach 2007, Jianbo Shi 2000]. Implicit structures are also at the core of many natural language processing (NLP) tools used to find good vector representations of words [Mikolov 2013b, Mikolov 2013a, Pennington 2014], for speech recognition [Hinton 2012] or more generally for sequence learning [Sutskever 2014]. In these cases, the structural biases come either through the use of latent variables or through conditional probabilities. When available, labels or classes also induce an implicit structure on the data features. For instance, in domain adaptation, one may want samples of the source domain sharing the same label to be matched coherently into the same region of the target space, thus avoiding that they be split into locations that are too far apart [Courty 2017, Alvarez-Melis 2018b].
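The translation-equivariance prior described above can be checked numerically. A minimal sketch, assuming circular convolution so that equivariance is exact (real CNN layers with zero padding are only approximately equivariant near the boundaries):

```python
import numpy as np

def circular_conv(x, k):
    """Circular convolution of a 1D signal x with kernel k, via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, n=len(x))))

x = np.array([1.0, 2.0, 0.0, 0.0, 3.0, 0.0])  # input signal
k = np.array([0.5, 0.25, 0.25])               # convolution kernel
s = 2                                          # translate by s positions

# Translating the input then convolving equals convolving then translating.
shift_then_conv = circular_conv(np.roll(x, s), k)
conv_then_shift = np.roll(circular_conv(x, k), s)
print(np.allclose(shift_then_conv, conv_then_shift))  # prints True
```

The same identity holds for any shift s and any kernel, which is precisely the equivariance property exploited by CNNs.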
The recent trend in the machine learning community around graph neural networks (GNNs) [Wu 2020] is one of the many examples highlighting the importance of structured data nowadays.

While the previous examples consider structured data as inputs of the learning process, the prevalence of structured data in machine learning also shows in many works where structured data are outputs. One can cite for instance the field of structured prediction, in which one wants to learn to produce structured objects such as sequences, trees or assignments [Taskar 2005, Lafferty 2001, Collins 2002, Blondel 2020, Mensch 2018, Korba 2018].

In short, the notion of structure in machine learning is ubiquitous and appears whenever there is additional information on the objects that goes beyond their feature representations. As shown in many machine learning contexts such as graphical models [Pearl 1986, Pearl 2009], relational reinforcement learning [Dzeroski 2001] or Bayesian nonparametric models [Hjort 2010], viewing objects as a complex composition of entities with certain interactions is particularly useful in order to learn from small amounts of data.

The previous notion of structured data can be seen as a special case of data defined on incomparable spaces. In this situation, each sample has a "characteristic" of its own that may not be shared with the other samples. For example, when considering a dataset made of several graphs, the structure of one graph is generally not shared with the other graphs. This deliberately broad notion also covers the case where the data come from heterogeneous sources.
A particular instance of this problem is heterogeneous domain adaptation [Yeh 2014, Zhou 2014, Wang 2011], which aims to exploit knowledge from heterogeneous source domains to improve learning performance in a target domain whose feature space potentially differs from those of the sources. The MNIST/USPS datasets [LeCun 2010, Hull 1994] illustrate this situation quite well: based on the knowledge of digit images of size 28 × 28 (i.e. vectors of R^784) from MNIST, how can one build e.g. a classifier that works on digit images of size 16 × 16 (i.e. vectors of R^256) from USPS? It goes without saying that this problem arises frequently in all areas of machine learning: it is common for data to be collected from diverse and heterogeneous sources, and methods that leverage this diversity are often of great interest.

Figure 2: A dataset (x_i)_{i ∈ [[n]]} is associated with a probability distribution that describes it entirely; in this example x_i ∈ R. (left) Lagrangian (or point-cloud) formulation: Σ_{i=1}^n a_i δ_{x_i} is a discrete probability measure where (a_i)_{i ∈ [[n]]} is a probability vector, i.e. a_i ≥ 0 and Σ_{i=1}^n a_i = 1, and δ is the Dirac measure, i.e. δ_{x_i}(x) = 1 if x = x_i and 0 otherwise. (right) Eulerian formulation (histograms): one associates a probability measure Σ_{i=1}^N a_i δ_{x̂_i}, where (x̂_i)_{i ∈ [[N]]} is a regular grid discretizing the space R and (a_i)_{i ∈ [[N]]} is a probability vector.

Optimal transport for machine learning
The central question that often arises in machine learning is the following: how should we represent data and how should we compare them? The framework of probability distributions offers some answers to this question by associating a probability measure to a collection of samples forming a dataset (x_i)_{i ∈ [[n]]}. The Lagrangian representation of the dataset results in a discrete probability measure Σ_{i=1}^n a_i δ_{x_i}, in which each point x_i is associated with a Dirac δ_{x_i}(x) = 1 if x = x_i and 0 otherwise, together with a weight a_i ≥ 0; (a_i)_{i ∈ [[n]]} is a probability vector satisfying Σ_{i=1}^n a_i = 1. When no information about the relative importance of the samples in the dataset is available, the weights can be chosen uniform, so that a_i = 1/n. Similarly, a Eulerian representation can be built via the probability distribution Σ_{i=1}^N a_i δ_{x̂_i}, in which (x̂_i)_{i ∈ [[N]]} is a regular grid discretizing the space. This formulation is equivalent to building a histogram of our data (see Figure 2). These views on data call for a suitable way to compare their representations as probability distributions and, as such, the question of finding tools to compare them is at the heart of many machine learning algorithms.

Although various divergences exist, such as φ-divergences [Csiszar 1975] or Maximum Mean Discrepancies (MMD) [Gretton 2007], the richness of optimal transport lies in its ability to embed the geometry of the underlying space in its formulation and to pay attention to the relations, the correspondences, of the samples within their respective representations.
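The two representations above can be sketched in a few lines of numpy (variable names and grid choices are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lagrangian (point-cloud) view: each sample x_i carries a Dirac delta_{x_i}
# with a weight a_i; with no prior information the weights are uniform.
x = rng.normal(size=500)                   # a dataset of 500 samples on R
a = np.full(x.shape[0], 1.0 / x.shape[0])  # a_i = 1/n
assert np.all(a >= 0) and np.isclose(a.sum(), 1.0)  # probability vector

# Eulerian (histogram) view: a probability vector on the cells of a fixed
# regular grid discretizing the space.
edges = np.linspace(-4.0, 4.0, 33)         # regular grid with 32 cells
counts, _ = np.histogram(x, bins=edges)
b = counts / counts.sum()
assert np.isclose(b.sum(), 1.0)
print(len(a), len(b))  # 500 Dirac weights vs 32 histogram bins
```

Both `a` and `b` are probability vectors; the Lagrangian view keeps one weight per sample, while the Eulerian view aggregates samples into bins of a fixed grid.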
To briefly highlight the benefit of representing data as histograms/probability distributions combined with the use of optimal transport, we can cite for instance [Rubner 2000] for content-based image retrieval or [Kusner 2015] for natural language processing. At this point, natural questions arise: is this theoretical representation of data usable when their nature is intrinsically structured, or when the samples lie in incomparable spaces? In that case, how can we represent the data as probability distributions? To what extent is this representation valid? Is optimal transport still applicable and, if not, how can we compare these probability distributions? The purpose of this thesis, among others, is to provide answers to these questions.

Contributions
This thesis covers most of the author's work and focuses on a single research axis, namely optimal transport on incomparable spaces. Additional work [Vayer 2020a] on time series on incomparable spaces, which is not based on optimal transport, is not included in this thesis. The interested reader can nevertheless find the details in the bibliography.
Chapter 2
This chapter presents the fundamental results of classical optimal transport theory and summarizes and illustrates its different formulations, together with some well-known numerical solvers. The philosophy of this chapter is to provide a high-level overview of optimal transport, both in theory and in practice. It closes with the Gromov-Wasserstein theory, which is at the heart of the thesis. A reader familiar with the basic concepts of optimal transport may skip this part, although it contains essential concepts and notations that will be used throughout the thesis.
Chapter 3
This chapter is devoted to optimal transport for structured data, in particular in the context of graphs. It is based on the articles [Vayer 2019a] and [Vayer 2020b] and answers the question of defining a mathematical framework for optimal transport in the case of structured data. We provide a general framework, based on the notion of Fused Gromov-Wasserstein, which defines an optimal transport distance between structured objects such as undirected labeled graphs.

In short, we view undirected labeled graphs as tuples of the form G = (V, E, ℓ_f, ℓ_s), where (V, E) is the set of nodes and edges of the graph. ℓ_f : V → Ω_f is a function that maps each node v_i ∈ V to a feature a_i := ℓ_f(v_i) in a metric space (Ω_f, d). We call feature information the set of all features (a_i)_i of the graph. Similarly, ℓ_s : V → Ω_s maps a node v_i of the graph to a point x_i := ℓ_s(v_i) in a metric space (Ω_s, C) specific to each graph. C : Ω_s × Ω_s → R is a symmetric map that measures the similarity between the nodes of the graph. However, in contrast with the feature space, Ω_s is implicit and, in practice, knowing the similarity measure C is enough. With a slight abuse of notation, C denotes both the structure similarity measure and the matrix that encodes this similarity between pairs of nodes of the graph, (C(i, k) = C(x_i, x_k))_{i,k}. Depending on the context, C may encode neighborhood information of the nodes, edge information of the graph, or, more generally, model a distance between the nodes such as the shortest-path distance. We call structure information the set of all structure points (x_i)_i of the graph.

Figure 3: (left) Labeled graph with (a_i)_i the feature information, (x_i)_i the structure information and (h_i)_i a weight vector measuring the relative importance of the nodes. (right) Structured data are entirely described by a probability measure μ on the product space of features and structure, with marginals μ_X and μ_A on the structure and the features respectively.

We propose to enrich the graph described above with a weight vector whose purpose is to encode the relative importance of the nodes of the graph. To do so, assuming the graph has n nodes, we associate to the nodes weights (h_i)_i ∈ Σ_n. Through this procedure, we obtain the notion of structured data as a tuple S = (G, h_G), where G is a labeled graph and h_G is a function that associates a weight to each node. This definition allows the graph to be represented by a probability measure μ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} on the product feature/structure space, which describes the structured datum entirely (see Figure 3).

Now consider two structured data μ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} and ν = Σ_{j=1}^m g_j δ_{(y_j, b_j)}, where h ∈ Σ_n and g ∈ Σ_m are histograms. We write M_{AB} = (d(a_i, b_j))_{i,j} for the n × m matrix of distances between the features, and C_1, C_2 for the structure matrices of the graphs. We define a new optimal transport distance, called the Fused Gromov-Wasserstein distance. It is given, for a parameter α ∈ [0, 1], by:
FGW(C_1, C_2, h, g) = min_{π ∈ Π(h, g)} Σ_{i,j,k,l} [ (1 − α) d(a_i, b_j)^q + α |C_1(i, k) − C_2(j, l)|^q ] π_{i,j} π_{k,l}    (1)

We prove that this function indeed defines a distance between labeled graphs (Theorem 3.3.1) and that it also gives rise to a notion of barycenter, of average, of graphs based on the Fréchet mean (Section 3.3.2). We derive an algorithm to compute these different objects (Section 3.3.3), which in particular also allows us to solve the classical Gromov-Wasserstein problem, and we show that it proves useful in many machine learning scenarios on graphs such as classification, graph simplification or graph clustering (Section 3.4). We conclude this part by extending the previous definition to the case of continuous structured data (Section 3.5), where we show that FGW enjoys similar distance properties and, moreover, defines a geodesic on the space of structured data.

Chapter 4
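As an illustration, the FGW objective of Eq. (1) can be evaluated for a given admissible coupling in a few lines of numpy. This is only a sketch of the cost with q = 2 and toy graphs of my own choosing; the chapter's algorithm additionally optimizes over π:

```python
import numpy as np

def fgw_cost(M, C1, C2, pi, alpha, q=2):
    """Value of the FGW objective of Eq. (1) for a given coupling pi.
    M[i, j] = d(a_i, b_j)^q is the feature distance matrix (n x m);
    C1 (n x n) and C2 (m x m) are the structure matrices."""
    feature_term = (1.0 - alpha) * np.sum(M * pi)
    L = np.abs(C1[:, None, :, None] - C2[None, :, None, :]) ** q  # |C1(i,k)-C2(j,l)|^q
    structure_term = alpha * np.einsum('ijkl,ij,kl->', L, pi, pi)
    return feature_term + structure_term

# Two toy 2-node graphs with identical features and structures,
# and uniform node weights h = g = (1/2, 1/2).
C1 = np.array([[0.0, 1.0], [1.0, 0.0]])
C2 = C1.copy()
M = np.array([[0.0, 1.0], [1.0, 0.0]])   # |a_i - b_j|^q for scalar features
pi_diag = np.diag([0.5, 0.5])            # the diagonal coupling
print(fgw_cost(M, C1, C2, pi_diag, alpha=0.5))  # 0.0: the graphs coincide
```

Setting α = 0 recovers a Wasserstein-type cost on the features only, while α = 1 recovers a Gromov-Wasserstein-type cost on the structures only.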
This chapter aims to bridge the gap between classical optimal transport theory and the Gromov-Wasserstein theory. It opens with the special case of 1D distributions, whose study is based on the article [Vayer 2019b]. We establish the first closed form of Gromov-Wasserstein in the case of discrete measures on the real line with the squared Euclidean cost. We prove that it can be computed quickly, with O(n log(n)) complexity. In particular, we show that an optimal coupling for Gromov-Wasserstein is found by considering either the diagonal or the anti-diagonal coupling once the points are sorted. Using this closed form, we propose a new divergence called Sliced Gromov-Wasserstein, in the spirit of Sliced Wasserstein. We establish its properties (Theorem 4.1.3) and use it in scenarios such as 3D mesh comparison and generative neural networks.

A second, more prospective part focuses on the Gromov-Wasserstein theory for Euclidean spaces, connecting it to classical optimal transport theory. In particular, we address the question of the regularity of Gromov-Wasserstein transport plans. We give conditions under which we can prove that the optimal Gromov-Wasserstein coupling is supported by a deterministic function, in the same way as the optimal coupling for the Wasserstein case with a quadratic cost between regular measures (Brenier's theorem [Brenier 1991]). To this end, we consider the cases where the distance or similarity measures c_X, c_Y in each space are defined by inner products or by quadratic costs. We show that solving GW amounts to jointly solving a linear transport problem and an alignment problem. The regularity of optimal GW plans can thus be studied through equivalent formulations that are simpler to analyze. This also allows us to build algorithmic solutions for GW on Euclidean spaces. In summary:

(i) In Section 4.2.2 we consider the case where c_X, c_Y are defined by inner products in each space. Provided the source probability measure is regular with respect to the Lebesgue measure, we give a sufficient condition for the existence of a deterministic optimal transport plan, i.e. one supported by a deterministic function T. We show that this function is of the form ∇u ∘ P, where u is a convex function and P is a linear map that can be seen as a global transformation "registering" the probability measures into the same space (Theorem 4.2.1). We use this formulation to show that the GW distance between 1D probability measures admits a closed-form solution. More precisely, we show that the optimal coupling is determined by the cumulative and anticumulative distribution functions of the source distribution (Theorem 4.2.4).

(ii) In Section 4.2.3 we consider c_X, c_Y defined as the squared Euclidean distances in each space. We show that the problem is equivalent to maximizing a convex function over Π(μ, ν). We use Fenchel-Legendre duality in the space of measures to derive a problem equivalent to Gromov-Wasserstein (Theorem 4.2.5). We analyze it further and show that the regularity of optimal transport plans is harder to establish than in the previous case.

(iii) In Section 4.2.4 we use the previous formulations to obtain efficient numerical solutions for the GW problem based on BCD schemes. We show that these algorithms compare favorably with standard solvers such as conditional gradient or entropic regularization.
Figure 4: Illustration of COOT between the MNIST and USPS datasets. (left) Samples from the MNIST and USPS datasets; (center left) transport matrix π^s between the samples, sorted by class; (center) USPS image with pixels colored according to their 2D position; (center right) colors transported onto an MNIST image using π^v, where black pixels correspond to uninformative MNIST pixels (always 0); (right) colors transported onto an MNIST image using π^v with entropic regularization.

(iv) We conclude with Section 4.2.5 by considering the Gromov-Monge problem in Euclidean spaces, which is the analogue, in the Gromov-Wasserstein context, of the Monge problem of linear transport. We discuss the special case of Gromov-Monge between Gaussian measures and show that this problem admits a closed form when restricted to linear push-forwards (Theorem 4.2.6). We give geometric interpretations of this result and compare the optimal push-forward with that of classical optimal transport theory in the case of Gaussian measures.
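The 1D closed form recalled at the beginning of this chapter summary suggests a simple numerical sketch: sort the points and compare only the diagonal and anti-diagonal couplings. The sketch below assumes uniform weights and equal sample sizes, and evaluates each candidate cost naively in O(n²) for clarity (the thesis result brings the overall computation down to O(n log n)):

```python
import numpy as np

def gw_perm_cost(x, y, sigma):
    """GW cost (squared-difference inner cost, uniform weights 1/n) of the
    coupling matching sorted x[i] with sorted y[sigma[i]]."""
    Cx = (x[:, None] - x[None, :]) ** 2
    Cy = (y[:, None] - y[None, :]) ** 2
    return np.sum((Cx - Cy[np.ix_(sigma, sigma)]) ** 2) / len(x) ** 2

def gw_1d(x, y):
    """1D GW between uniform discrete measures of equal size: only the
    diagonal and anti-diagonal couplings on sorted points are candidates."""
    x_s, y_s = np.sort(x), np.sort(y)
    identity = np.arange(len(x))
    return min(gw_perm_cost(x_s, y_s, identity),
               gw_perm_cost(x_s, y_s, identity[::-1]))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(gw_1d(x, -x + 3.0))  # close to 0: GW is blind to translations and flips
```

The flipped-and-translated copy is matched by the anti-diagonal coupling, which is exactly the kind of invariance the Gromov-Wasserstein distance is designed to capture.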
Chapter 5
This chapter presents a new theoretical framework for comparing probability measures over incomparable spaces, namely the co-optimal transport problem. In contrast with the Gromov-Wasserstein approach, it simultaneously optimizes two couplings, between the samples and between the features of the data. The chapter provides a thorough theoretical analysis of this framework and, from the application point of view, addresses the problems of heterogeneous domain adaptation and co-clustering. It is based on the article [Redko 2020].

More precisely, we consider two arbitrary datasets X = [x_1, ..., x_n]^T ∈ R^{n×d} and X' = [x'_1, ..., x'_{n'}]^T ∈ R^{n'×d'}, with in general n ≠ n' and d ≠ d'. The rows are called samples and the columns features. We associate to the samples (x_i)_{i ∈ [[n]]} and (x'_i)_{i ∈ [[n']]} weights w = [w_1, ..., w_n]^T ∈ Σ_n and w' = [w'_1, ..., w'_{n'}]^T ∈ Σ_{n'}. In the same way, we associate to the features weights v ∈ Σ_d and v' ∈ Σ_{d'}. The co-optimal transport problem is defined by:

COOT(X, X', w, w', v, v') = min_{π^s ∈ Π(w, w'), π^v ∈ Π(v, v')} Σ_{i,j,k,l} L(X_{i,k}, X'_{j,l}) π^s_{i,j} π^v_{k,l}    (2)

To illustrate this definition, we solve the optimization problem between two classical datasets: MNIST and USPS. They contain images of different resolutions (USPS images are of size 16 × 16 and MNIST 28 × 28) that belong to the same classes (digits between 0 and 9). Moreover, the digits are also centered differently, as illustrated by the examples in the left part of Figure 4. This means that, without preprocessing, the images do not lie in the same topological space and thus cannot be compared directly with conventional distances. The images are the samples, while each pixel acts as a feature, leading to 256 and 784 features for USPS and MNIST respectively.

The result of solving the problem is reported in Figure 4. In the center-left part, we show the optimal coupling π^s between the samples, i.e. the different images, sorted by class. The coupling π^v, in turn, describes the relations between the features, i.e. the pixels, in the two domains. To visualize it, we color-code the pixels of the source USPS image and use π^v to transport the colors onto a target MNIST image, so that its pixels are defined as convex combinations of the colors of the former, with coefficients given by π^v. The corresponding results are shown in the right part of Figure 4.

We show that this new formulation includes Gromov-Wasserstein as a special case (Proposition 5.5.2), and that it has the advantage of working directly on the raw data without having to compute, store and choose similarity measures. Moreover, COOT provides two interpretable couplings, between the features and between the samples. We show that COOT defines a notion of distance between datasets (Proposition 5.3.1) and derive an optimization procedure based on solving linear transport problems (Section 5.4). On the practical side, we demonstrate the usefulness of COOT for machine learning, notably for heterogeneous domain adaptation (Section 5.6.1) and for co-clustering (Section 5.6.2).
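To make Eq. (2) concrete, here is a minimal numpy sketch (the toy datasets and couplings are mine) that evaluates the COOT objective with the squared loss. When the second dataset is a row- and column-permutation of the first, the pair of couplings that undoes both permutations drives the cost to zero:

```python
import numpy as np

def coot_cost(X1, X2, pi_s, pi_v):
    """Value of the COOT objective of Eq. (2) for given sample coupling
    pi_s and feature coupling pi_v, with L(a, b) = (a - b)^2."""
    L = (X1[:, None, :, None] - X2[None, :, None, :]) ** 2  # L(X1[i,k], X2[j,l])
    return np.einsum('ijkl,ij,kl->', L, pi_s, pi_v)

# Toy pair of datasets: X2 is X1 with its rows (samples) and columns
# (features) permuted, mimicking two heterogeneous views of the same data.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 3))
row_perm, col_perm = [2, 0, 3, 1], [1, 2, 0]
X2 = X1[row_perm][:, col_perm]

n, d = X1.shape
pi_s = np.zeros((n, n)); pi_s[row_perm, np.arange(n)] = 1.0 / n  # sample coupling
pi_v = np.zeros((d, d)); pi_v[col_perm, np.arange(d)] = 1.0 / d  # feature coupling
print(coot_cost(X1, X2, pi_s, pi_v))  # 0.0: the couplings undo both permutations
```

Note that, unlike Gromov-Wasserstein, no similarity matrices C_1, C_2 have to be computed or stored: the objective works directly on the raw data matrices.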
Linear algebra

a, A : all vectors in R^d and matrices are written in bold. The coordinates will be written A_{i,j} or A(i, j) for matrices and a_i for vectors, without bold.
[[n]] : the subset {1, ..., n} of N.
I_n, J_n : the n × n identity matrix and anti-identity matrix.
‖.‖, ⟨., .⟩ : a norm (depending on the context) and an inner product.
ℓ_p : denotes the standard ‖.‖_p norm.
⊗_K : the Kronecker product of matrices, i.e. for two matrices A ∈ R^{n×m}, B ∈ R^{p×q}, A ⊗_K B ∈ R^{np×mq} is defined by (A_{i,j} B)_{i,j}.
⊗ : the tensor-matrix multiplication, i.e. for a tensor L = (L_{i,j,k,l}) and a matrix A, L ⊗ A is the matrix (Σ_{k,l} L_{i,j,k,l} A_{k,l})_{i,j}.
tr, det : the trace operator for matrices, i.e. tr(P) = Σ_i P_{i,i}, and the determinant operator.
‖.‖_F : the Frobenius norm for matrices, i.e. ‖P‖_F = √(tr(P^T P)).
⟨., .⟩_F : the inner product for matrices, i.e. ⟨P, Q⟩_F = tr(Q^T P).
rank(A), ker(A) : the rank and the kernel of a matrix A.
DS : the set of doubly-stochastic matrices, i.e. DS = {X ∈ R^{n×n} | X 1_n = 1_n, X^T 1_n = 1_n, X ≥ 0}.
Π_n : the set of permutation matrices, i.e. X = (x_{ij})_{(i,j)∈[[n]]} ∈ Π_n if x_{ij} ∈ {0, 1} and Σ_{i=1}^n x_{ij} = Σ_{j=1}^n x_{ij} = 1.
O(p) : the subset of R^{p×p} of all orthogonal matrices.
V_p(R^q) : the Stiefel manifold, i.e. the set of all orthonormal p-frames in R^q, or equivalently V_p(R^q) = {B ∈ R^{q×p} | B^T B = I_p}.
S_n : the set of all permutations of [[n]].
⊘ : the element-wise division operator for two vectors, i.e. u ⊘ v = (u_i / v_i)_i.
Σ_n : the set of probability vectors in R^n_+ (or histograms with n bins), i.e. Σ_n = {a ∈ R^n_+ | Σ_{i=1}^n a_i = 1}.

Measure theory

P(X) : the set of probability measures on a space X.
M(X) : the set of Borel finite signed measures on a space X.
L : the Lebesgue measure on R or R^n depending on the context.
supp(µ) : the support of µ (see Definition 2.1.4).
δ_x : the Dirac measure at x, i.e. δ_x(y) = 1 if y = x and 0 otherwise.
µ ⊗ ν : the product measure of two probability measures µ, ν, i.e. µ ⊗ ν(A × B) = µ(A)ν(B).
(id × T)♯µ : the measure µ ⊗ (T♯µ).
Π(µ, ν) : the set of couplings of two probability measures µ, ν (see Definition 2.1.1).
M_c(µ, ν) : the Monge cost between two probability measures µ, ν (see definition (MP)).
T_c(µ, ν) : the Kantorovitch cost between two probability measures µ, ν (see definition (KP)).
W_p(µ, ν) : the p-Wasserstein distance between two probability measures µ, ν.
GW_p(µ, ν) : the p-Gromov-Wasserstein distance between two probability measures µ, ν.
Fµ : the Fourier transform of the probability measure µ, i.e. Fµ(s) = ∫ e^{−2iπ⟨s,x⟩} dµ(x).
N(m, Σ) : the multivariate Gaussian (normal) distribution with mean m and covariance Σ.

Functions

DT : the Jacobian of a function T.
C(X), C(X, Y) : the set of continuous functions from X to R (resp. from X to Y).
C_b(X), C_b(X, Y) : the set of continuous and bounded functions from X to R (resp. from X to Y).
C^p(X) : the set of functions of class C^p.
Lip_k(X) : the set of Lipschitz functions on X with Lipschitz constant k.
L^p(µ) : the set of p-integrable functions with respect to a measure µ, i.e. f ∈ L^p(µ) if ∫ |f|^p dµ < +∞.
∇ : the gradient operator.
d_X ⊕ d_Y : the "sum" distance on the Cartesian product (X, d_X) × (Y, d_Y). More precisely, it is defined by d_X ⊕ d_Y((x, y), (x', y')) = d_X(x, x') + d_Y(y, y').
S^{d−1} : the unit hypersphere of R^d, i.e. S^{d−1} = {θ ∈ R^d | ‖θ‖_2 = 1}.

Acronyms

OT : is the acronym for Optimal Transport.
W, SW, GW, SGW : stand respectively for Wasserstein, Sliced Wasserstein, Gromov-Wasserstein and Sliced Gromov-Wasserstein.
QAP, QP, BAP, BP : stand respectively for Quadratic Assignment Problem, Quadratic Program, Bilinear Assignment Problem and Bilinear Program.

Chapter 1

Introduction
"Thus, we see in science, sometimes brilliant but for a long time useless theories, suddenly becoming the basis of the most important applications, and sometimes seemingly very simple applications, giving rise to the idea of abstract theories that were not yet needed, directing the work of the Geometers towards these theories, and opening up a new career for them." That is how Nicolas De Condorcet [De Condorcet 1781] in the 18th century introduced the work of Gaspard Monge [Monge 1781], which is at the core of Optimal Transport (OT) theory. How to move some masses from one location to another so as to minimize the overall effort? Condorcet was right: this "simple idea" has evolved over the years to become a theory at the crossroads of mathematics and optimization, and is today at the center of many machine learning applications.

Broadly speaking, the interest of Optimal Transport lies both in its ability to provide correspondences between sets of points and in its ability to induce a geometric notion of distance between probability distributions (see Figure 1.1). Both have proved to be very useful for a wide range of tasks such as, to name a few, image registration [Haker 2001], image retrieval [Rubner 1998], domain adaptation [Courty 2017], signal processing [Kolouri 2017], unsupervised learning [Arjovsky 2017, Genevay 2018], supervised and semi-supervised learning [Frogner 2015, Solomon 2014], natural language processing [Kusner 2015], fairness [Gordaliza 2019], biology [Schiebinger 2019] or astrophysics [Frisch 2002].

Despite its many properties, the optimal transport problem remains difficult to solve in practice and is known to suffer from scalability issues that prevent its use on the large, ubiquitous datasets of machine learning. The emergence of optimal transport in the machine learning community has been greatly favoured by recent achievements on the optimization side [Cuturi 2013, Altschuler 2017, Genevay 2016] which tend to circumvent the heavy computational complexity of solving OT problems.

Original quote: "Ainsi, l'on voit dans les Sciences, tantôt des théories brillantes, mais longtemps inutiles, devenir tout à coup le fondement des applications les plus importantes, et tantôt des applications très simples en apparence, faire naître l'idée de théories abstraites dont on n'avait pas encore le besoin, diriger vers les théories des travaux des Géomètres, et leur ouvrir une carrière nouvelle."
Figure 1.1: Optimal Transport provides tools for finding correspondences between sets of points and gives a geometric notion of distance between probability distributions. (left) Two sets of points in 2D. The correspondences found by OT are depicted as dashed lines. (right) OT distance between two probability distributions, in red and blue. The further away the mean of the red probability distribution is from the blue one, the higher the OT distance.
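The correspondences of the left panel can be computed concretely: for two point clouds of the same size with uniform weights, an optimal coupling can be chosen as a permutation (a consequence of Birkhoff's theorem), so the OT problem reduces to a linear assignment problem on the pairwise cost matrix. The sketch below is ours, for illustration only, and is not the code used to produce the figure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_correspondences(xs, xt):
    """OT between two point clouds of equal size n with uniform weights 1/n
    and squared Euclidean cost: an optimal coupling can be chosen as a
    permutation, found by solving a linear assignment problem."""
    M = cdist(xs, xt, metric="sqeuclidean")  # pairwise cost matrix
    rows, cols = linear_sum_assignment(M)    # optimal one-to-one matching
    total_cost = M[rows, cols].mean()        # OT cost with weights 1/n
    return cols, total_cost
```

For instance, matching a point cloud to a shuffled translate of itself recovers the shuffle and an OT cost equal to the squared translation length.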
Moreover optimal transport, in its early formulation, is quite restricted to applications where there exists a direct way of comparing the samples of the data. Its applicability is thus often limited to the case where the samples are part of a common ground metric space, which is most of the time Euclidean. This limitation prevents its use for a variety of machine learning tasks where there is additional structural information on the data which cannot usually be described in the Euclidean setting, e.g. when the samples are described by graphs, trees or time series. It also prevents the use of optimal transport when the samples lie in different, seemingly unrelated, metric spaces, or when a meaningful notion of distance between the samples cannot be easily defined. All of these instances can be framed into the incomparable setting, that is, when the samples lie on incomparable spaces. An interesting remedy in this situation can be found in the theory of the Gromov-Wasserstein distance [Memoli 2011], which does not require comparing the samples across the distributions. However, it is well known to be arduous to solve and to suffer from tedious scalability issues. The purpose of this thesis is to help overcome these obstacles by:

(i) defining new Optimal Transport frameworks for incomparable spaces, and especially for structured data;

(ii) reducing the theoretical gap between classical Optimal Transport and the Gromov-Wasserstein theory in Euclidean spaces, and in particular deriving scalable and applicable formulations.
Structured Data and Incomparable Spaces in Machine Learning

Before going into the details of this thesis, it is important to make explicit what is behind the notion of structure throughout the manuscript. As a starter, we can see the structural information as the piece of information which encodes the specific relationships that exist among the components of an object. This definition can be related to the concept of relational reasoning [Battaglia 2018], where some entities (or elements with attributes, such as the intensity of a signal) coexist with some relations or properties between them.

Natural instances of such structured data arise when the structure is explicit. For example, in a graph context, edges are representative of this notion, so that each attribute of the graph (typically a vector of R^d) may be linked to some others through the edges between the nodes. Notable examples are found in chemical compounds or molecule modeling [Kriege 2016], brain connectivity [Ktena 2017], or social networks [Yanardag 2015]. This generic family of structured data also encompasses trees [Day 1985] or time series, where the signals' values are correlated through time, so that comparing time series requires one to take the direction of time into account.

Structure information of data in machine learning can also be more subtle, or even implicit. For example, it can appear when one builds structural priors, or inductive biases [Battaglia 2018], on the objects' representation. For instance, in the context of deep learning, many successful architectures exploit the equivariance to a symmetry transformation to improve generalization. The convolutional neural network (CNN) is a prime example of this "built-in" inductive bias, which satisfies translation equivariance, i.e. if we translate the input, the output of the convolutions will also be translated.
This inductive bias is known to reveal a useful spatial hierarchy, or structure, on the pixels [Wang 2018, Chen 2018], and other works have studied designing layers with equivariances to other transformations such as permutation, rotation or reflection [Kondor 2018, Cohen 2016, Gens 2014]. Unlike end-to-end methods, other more "hand-engineered" approaches, based e.g. on image segmentation, can leverage some structure in images that can further be usefully exploited [Bach 2007, Jianbo Shi 2000]. Implicit structure is also at the core of many natural language processing (NLP) tools used to find good word representations [Mikolov 2013b, Mikolov 2013a, Pennington 2014]. It is also the main ingredient of speech recognition [Hinton 2012] or sequence learning [Sutskever 2014]. In these cases, structural assumptions on the sequences of words are made, whether by means of latent variables or via conditional probabilities. When available, labels or classes also induce an implicit structure on the feature space of the data. For instance, in Domain Adaptation one may desire the source samples with the same label to be matched consistently within the same region of the target space, preventing them from being split into disjoint, far-away locations [Courty 2017, Alvarez-Melis 2018b]. The recent trend of graph neural networks (GNN) [Wu 2020] in the machine learning community is one of the many examples emphasizing that structured data remains an important and challenging setting nowadays. While the previous instances consider structured data as an input of the learning process, the prevalence of this notion in machine learning also arises in another line of works where it is an output. This setting is e.g. considered in the structured prediction approach, where one wants to learn to produce structured objects such as sequences, trees or assignments [Taskar 2005, Lafferty 2001, Collins 2002, Blondel 2020, Mensch 2018, Korba 2018].

In short, the notion of structure in machine learning is omnipresent and appears whenever there is additional information about the objects that goes beyond their feature representations. As shown in many contexts in machine learning such as graphical models [Pearl 1986, Pearl 2009], relational reinforcement learning [Džeroski 2001] or Bayesian nonparametrics [Hjort 2010], considering objects as a complex composition of entities together with some interactions is crucial in order to learn from small amounts of data.

The previous notion of structured data can be seen as a special case of data defined on incomparable spaces. Informally, in this situation each sample has its own "characteristic" that may not be shared with the other samples. For instance, when considering a dataset of multiple graphs (each graph being a data point), the structure of one graph is usually not shared among the other graphs. This notion, purposely broad, also encompasses the case where data come from heterogeneous sources. As a particular instance of this problem, Heterogeneous Domain Adaptation [Yeh 2014, Zhou 2014, Wang 2011] aims to exploit knowledge from heterogeneous source domains to improve the learning performance in a target domain with, potentially, different feature spaces between the source and target domains. The MNIST/USPS [LeCun 2010, Hull 1994] case is a prime example of this situation: based on the knowledge of 28 × 28 digit images (i.e. R^784 vectors) from MNIST, how to build e.g. a classifier that works well on 16 × 16 digit images (i.e. R^256 vectors) from USPS?
Figure 1.2: We associate to a dataset (x_i)_{i∈[[n]]} a probability measure describing the dataset. In this example x_i ∈ R. (left) Lagrangian formulation (point clouds): Σ_{i=1}^n a_i δ_{x_i} is a discrete probability measure, where (a_i)_{i∈[[n]]} is a probability vector, i.e. a_i ≥ 0 and Σ_{i=1}^n a_i = 1, and δ is the Dirac measure, δ_{x_i}(x) = 1 if x = x_i and 0 otherwise. (right) Eulerian formulation (histograms): Σ_{i=1}^N a_i δ_{x̂_i}, where (x̂_i)_{i∈[[N]]} is a regular grid on R and (a_i)_{i∈[[N]]} is a probability vector.

Needless to say, this problem often arises in all the fields of machine learning: in practice it is common that the data are gathered from heterogeneous sources, and methods that build upon this diversity are often of high interest. A central question in machine learning is: how to represent data and how to compare them? The framework of probability distributions provides an answer to this query by associating a probability measure to a collection of samples (x_i)_{i∈[[n]]} that forms a dataset. The Lagrangian representation of the dataset results in a discrete probability measure Σ_{i=1}^n a_i δ_{x_i}, in which one associates to each point x_i a Dirac δ_{x_i} (with δ_{x_i}(x) = 1 if x = x_i and 0 otherwise) as well as a weight a_i ≥ 0, where (a_i)_{i∈[[n]]} is a probability vector which satisfies Σ_{i=1}^n a_i = 1. When no information about the relative importance of the samples in the dataset is available, the weights can be chosen as uniform, so that a_i = 1/n. Similarly, a Eulerian representation can be constructed via the probability distribution Σ_{i=1}^N a_i δ_{x̂_i}, in which (x̂_i)_{i∈[[N]]} is a regular grid on the space. This formulation produces a histogram of our data (see Figure 1.2). This point of view on the data advocates finding an appropriate way of comparing their representations as probability distributions and, as such, the question of finding adequate measures of "how far" two probability distributions are is at the core of many machine learning algorithms. Although various divergences exist, such as φ-divergences [Csiszar 1975] or Maximum Mean Discrepancies (MMD) [Gretton 2007], the richness of optimal transport lies in its ability to incorporate the geometry of the underlying space in its formulation and to pay attention to the relations, the correspondences, of the samples within their respective representations. To highlight in short the benefit of representing data through probability
distributions coupled with OT, we can cite [Rubner 2000] for image retrieval or [Kusner 2015] for natural language processing. At this point natural questions arise: is this framework applicable when the nature of the data is inherently structured, or when the different data points lie in incomparable spaces? In this case, how can we represent data as probability distributions? To what extent is this representation valuable? Is the Optimal Transport framework still applicable and, if not, how do we compare these probability distributions? The purpose of this thesis, inter alia, is to give some answers to these questions.

Outline of the Thesis

This thesis covers most of the author's conducted work and focuses on a single line of research, that is Optimal Transport on Incomparable Spaces. An additional line of work [Vayer 2020a] on time series on incomparable spaces, which is not based on optimal transport, is not included in this thesis, but the interested reader can find the details in the bibliography. The rest of the thesis is organized so that all chapters can be read separately, in any order, except for Chapter 2, which provides all the mathematical background and tools used in the other chapters.
Chapter 2 sets up the mathematical and numerical background of optimal transport. It presents the fundamental results of classical optimal transport theory and summarizes and illustrates its different formulations, as well as some well-known solvers. The philosophy of this chapter is to provide a high-level overview of OT both in theory and in practice. The chapter concludes with the Gromov-Wasserstein theory, which is at the core of this thesis. A reader familiar with the basic concepts of optimal transport may skip this part, although it contains crucial concepts and notations that will be used throughout the thesis.
Chapter 3 is dedicated to optimal transport for structured data, especially in the context of graphs. It is based on the articles [Vayer 2019a] and [Vayer 2020b] and gives some answers to the question of defining a mathematical framework for optimal transport in the case of structured data. A general framework for this setting is given, based on the Fused Gromov-Wasserstein distance, which defines an OT distance between structured data such as undirected graphs, and it is applied to real-world graph data applications.
Chapter 4 aims at bridging the gap between the classical optimal transport theory and the Gromov-Wasserstein theory. The chapter opens with the special case of 1D distributions, which results in the Sliced Gromov-Wasserstein formulation based on the works in [Vayer 2019b]. A second, more prospective part focuses on the Gromov-Wasserstein theory for Euclidean spaces and connects with the classical optimal transport theory by questioning the regularity of Gromov-Wasserstein optimal transport plans.
Chapter 5 presents a new framework for comparing probability measures on incomparable spaces, namely the CO-Optimal Transport problem. Contrary to the Gromov-Wasserstein approach, this approach simultaneously optimizes two transport plans, between both the samples and the features of the data. The chapter provides a thorough theoretical analysis of this framework and, from an application side, this work tackles the problems of Heterogeneous Domain Adaptation and co-clustering/data summarization. This chapter is based on the article [Redko 2020].

Chapter 2

Generality about optimal transport
Optimal transport is a long-standing mathematical problem whose theory has matured over the years. A good gateway to this theory can be found in [Santambrogio 2015]. A more mathematically oriented overview can be found in [Villani 2008], while the most complete document about the numerical aspects of OT is [Peyré 2019]. The objective of this chapter is to present in short the main results of the "classical" OT theory, both mathematically and numerically. In the last part of this chapter, we will discuss the theory related to the Gromov-Wasserstein transportation problem, for which we refer the reader to [Memoli 2011, Sturm 2012] for its foundations.
Linear Optimal Transport theory

The Monge problem
The OT problem has been historically introduced by Gaspard Monge [Monge 1781] and can be described as the following "least effort problem": given two probability distributions µ and ν, how do we transfer all the probability mass of µ onto ν so that the overall effort of transferring this mass is minimized? Originally the idea was to move dirt (déblais) from one place to another (remblais) in the most efficient way.

To properly define this problem, we need to define the notions of transfer and effort: the former can be expressed through the notion of push-forward and the latter through the notion of cost. More precisely, given two Polish spaces X, Y and two probability measures µ ∈ P(X), ν ∈ P(Y), a cost is a function c : X × Y → R_+ ∪ {+∞} whose values c(x, y) aim at measuring how far x ∈ X is from y ∈ Y and quantify somehow the "effort" of moving x forward to y. The push-forward of a probability measure µ through a function T : X → Y is defined as the probability measure T♯µ ∈ P(Y) which satisfies, equivalently, one of the two following conditions:

(i) T♯µ(A) = µ(T^{−1}(A)) = µ({x ∈ X | T(x) ∈ A}) for every measurable set A;
(ii) ∫_Y φ(y) d(T♯µ)(y) = ∫_X φ ∘ T(x) dµ(x) for every measurable function φ.

These conditions simply state that we transform, or push, the probability measure µ through T so as to create another probability measure on Y. When we consider a discrete probability distribution µ = Σ_{i=1}^n a_i δ_{x_i}, the push-forward measure T♯µ is simply given by T♯µ = Σ_{i=1}^n a_i δ_{T(x_i)}.

As described at the beginning, one wants to move the source distribution µ forward to the target distribution ν. This translates mathematically into finding a map T which satisfies T♯µ = ν. When we consider the Euclidean setting, and when the probability measures have densities f, g with respect to the Lebesgue measure, by the change of variables formula the push-forward condition writes:

g(T(x)) |det(DT(x))| = f(x)   (2.1)

where DT stands for the Jacobian of T. Among all these possible push-forwards, OT aims at finding the map T which minimizes the total cost of moving µ forward to ν, that is ∫_X c(x, T(x)) dµ(x). Overall, the problem of Monge (MP) can be formulated as the following non-convex optimization problem:

M_c(µ, ν) = inf_{T♯µ = ν} ∫_X c(x, T(x)) dµ(x).   (MP)

In general, the (MP) problem is quite difficult to solve since the optimal map T may not be unique and may not even exist (see Figure 2.1). Even in the regular setting where µ, ν have densities, equation (2.1) is highly non-linear in T, which is one of the major difficulties preventing an easy analysis of the Monge problem. As such, the Monge problem remained an open question for many years, and results about the existence and uniqueness of the optimal Monge map were limited to special cases until the works of Brenier [Brenier 1991], whose implications will be detailed later.

Kantorovitch formulation
Major breakthroughs in OT theory were made possible thanks to Kantorovitch [Kantorovich 1942], who proposed a relaxation of the (MP) problem. The key idea is to consider a probabilistic mapping instead of a deterministic map T to push the source measure forward to the target one. In the Kantorovitch formulation, it is allowed to split the mass of the probability measures into pieces and to transport them towards several target points. This translates mathematically by replacing the push-forward of a measure by a probabilistic coupling:

Definition 2.1.1 (Couplings). Let µ ∈ P(X), ν ∈ P(Y). A coupling π of µ and ν is a probability distribution on X × Y such that the marginals of π are respectively µ and ν. More precisely, π is part of the following set:

Π(µ, ν) = {π ∈ P(X × Y) | ∀A ⊂ X, B ⊂ Y measurable, π(A × Y) = µ(A); π(X × B) = ν(B)}   (2.2)

A Polish space X is a separable completely metrizable topological space. T is often called a map, or Monge map, in the OT literature.

Figure 2.1: Push-forward between two discrete probability measures µ and ν in three scenarios. (left) µ is supported on x_1, x_2 with corresponding weights 1/2, 1/2, and ν on y_1 with weight 1. The only possible push-forward T is T(x_1) = y_1 and T(x_2) = y_1. (center) In this situation there is no push-forward of µ onto ν because no function can satisfy T(x_1) = y_1 and T(x_1) = y_2 when y_1 ≠ y_2. The problem of Monge admits no solution in this case. (right) All points are equidistant from each other, that is c(x_i, y_j) = 1 for i, j ∈ [[2]]. In this case the solution of problem (MP) is not unique: it may associate x_1 with either y_1 or y_2 with the same overall cost (and similarly for x_2).

Remark 2.1.1.
Unlike the set of push-forwards of µ forward to ν, the set of couplings of two probability measures is always non-empty, as the product measure µ ⊗ ν is in Π(µ, ν).

An important illustration of the former definition is when µ and ν are discrete probability measures. Since this situation will be omnipresent throughout the manuscript, we detail the notations in the following example:

Example 2.1.1 (The case of discrete probability measures). Let µ = Σ_{i=1}^n a_i δ_{x_i}, ν = Σ_{j=1}^m b_j δ_{y_j} be discrete probability measures where x_i ∈ X, y_j ∈ Y and a = (a_i)_{i∈[[n]]} ∈ Σ_n, b = (b_j)_{j∈[[m]]} ∈ Σ_m are probability vectors which belong to the following probability simplex:

Σ_n := {a ∈ R^n_+ | Σ_{i=1}^n a_i = 1}.   (2.3)

We will use interchangeably the terms histogram and probability vector for an element a ∈ Σ_n. In this case, a coupling π is a matrix of the following set:

Π(a, b) = {π ∈ R^{n×m}_+ | π 1_m = a; π^T 1_n = b}
        = {π ∈ R^{n×m}_+ | ∀(i, j) ∈ [[n]] × [[m]], Σ_{j=1}^m π_{ij} = a_i; Σ_{i=1}^n π_{ij} = b_j}   (2.4)

Using couplings instead of a deterministic map allows defining the OT problem for a very large class of probability measures under very mild assumptions. More precisely, let X, Y be Polish spaces and

Memo 2.1.1 (Lower semi-continuity). On a metric space (X, d), a function f : X → R ∪ {+∞} is said to be lower semi-continuous (l.s.c.) if for every sequence x_n → x we have f(x) ≤ lim inf f(x_n). Such functions have the following properties:
• If (f_k)_k is a sequence of l.s.c. functions on X, then f = sup_k f_k is l.s.c.
• If f is l.s.c. and bounded from below, then there exists a sequence of continuous and bounded functions (f_k)_{k∈N} converging increasingly to f. We can also suppose that each f_k is k-Lipschitz.
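As a quick numerical illustration of Example 2.1.1 and Remark 2.1.1, the product coupling a b^T (the discrete analogue of µ ⊗ ν) always satisfies the marginal constraints of (2.4). The following check is ours, for illustration:

```python
import numpy as np

# Two histograms (probability vectors) of different sizes
a = np.array([0.2, 0.5, 0.3])         # a in Sigma_3
b = np.array([0.4, 0.1, 0.25, 0.25])  # b in Sigma_4

# The product coupling is always a member of Pi(a, b)
pi = np.outer(a, b)                   # pi[i, j] = a_i * b_j

# Marginal constraints of equation (2.4)
assert np.allclose(pi.sum(axis=1), a)   # pi @ 1_m = a
assert np.allclose(pi.sum(axis=0), b)   # pi.T @ 1_n = b
assert np.isclose(pi.sum(), 1.0)        # pi is a probability distribution
```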
Figure 2.2: The bakery/cafés problem in Manhattan, based on Example 2.1.2. The bakeries (resp. cafés) are in blue (resp. red) and the dots' sizes denote the amount of bread available (resp. needed). Lines represent the amount of bread transferred dπ(x, y): the wider, the larger. The optimal transport plan is depicted in the figure and represents the cheapest transport plan such that all the bread is transferred from bakeries to cafés.

µ ∈ P(X), ν ∈ P(Y). Given a lower semi-continuous cost c : X × Y → R_+ ∪ {+∞} (see Memo 2.1.1), the Kantorovitch problem aims at finding:

T_c(µ, ν) = inf_{π ∈ Π(µ, ν)} ∫_{X×Y} c(x, y) dπ(x, y).   (KP)

The resulting cost, potentially infinite without further assumptions, corresponds to the minimal cost of moving µ forward to ν by splitting their masses and transporting the pieces according to the transport plan π. The good news about this formulation is that the infimum is always well defined provided that the cost is positive and lower semi-continuous (actually it suffices that c is bounded from below, see [Santambrogio 2015, Theorem 1.7]). The problem (KP) is defined regardless of the nature of the probability distributions: they can be both discrete or continuous (see Figure 2.3). The problem is linear in π, so we will refer to (KP) as the linear transportation problem.

Relying on the Kantorovitch formulation (KP) appears to be very useful in order to find a solution of the Monge problem (MP). Indeed, a push-forward T♯µ = ν induces a coupling π = (id × T)♯µ, so it is easy to verify that T_c(µ, ν) ≤ M_c(µ, ν). To find the converse inequality, it suffices to find an optimal solution π* of (KP) which is of the form π* = (id × T*)♯µ where T*♯µ = ν. In this case we would have proven that both problems are equal and that T* is optimal for (MP). In other words, if there is an optimal coupling supported on a deterministic function, then both (KP) and (MP) are equivalent.
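In the discrete setting of Example 2.1.1, (KP) is exactly a linear program over the couplings of (2.4) and can be solved with any generic LP solver. The sketch below (the function name is ours) is for illustration only; dedicated OT solvers such as the network simplex are far more efficient in practice.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(a, b, M):
    """Solve the discrete Kantorovitch problem (KP) as a linear program:
    minimize <pi, M> over pi in Pi(a, b). The coupling pi is flattened
    row-major into a vector of size n*m; the two marginal constraints of
    equation (2.4) are encoded as equality constraints A_eq x = b_eq."""
    n, m = M.shape
    # Row-marginal constraints: sum_j pi[i, j] = a[i]
    A_rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column-marginal constraints: sum_i pi[i, j] = b[j]
    A_cols = np.kron(np.ones((1, n)), np.eye(m))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None),
                  method="highs")
    return res.x.reshape(n, m), res.fun
```

For instance, with a = b = (1/2, 1/2) and the cost matrix M = [[0, 1], [1, 0]], the optimal plan is the diagonal coupling with zero cost, a toy version of the bakery/cafés routing of Figure 2.2.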
We will see in the next sections cases where we can ensure that the optimal coupling is necessarily of this form.

Example 2.1.2.
The famous bakery analogy of Villani's book [Villani 2003] provides a simple illustration of the linear OT problem. Suppose that someone is in charge of the distribution of bread from bakeries to cafés in Manhattan. The bakeries are located at some points y and the cafés at some points x, distant from each other by c(x, y). At 8 a.m. sharp, all the bread from the bakeries has to be transferred to the cafés in order for the citizens of Manhattan to have a good day. The company in charge of the distribution wants to route the bread from bakeries to cafés in the cheapest way possible. This problem can be recast as a linear OT problem. Considering two distributions µ = "all available bread in bakeries" and ν = "all the demands for bread in cafés", the company seeks a transport plan π such that dπ(x, y) is the amount of bread transferred from bakery y to café x. The best transport plan minimizes the overall cost of moving all the bread from bakeries to cafés, which is ∫ c(x, y) dπ(x, y) (see Figure 2.2).

(id × T)♯µ is the measure µ ⊗ T♯µ, or equivalently d(id × T)♯µ(x, y) = dµ(x) d(T♯µ)(y).

Figure 2.3: Different OT problems between (left) two discrete probability distributions, where the coupling π is a matrix, (center) one continuous and one discrete probability distribution, which is called the semi-discrete OT problem, and (right) two continuous distributions. Inspired from [Peyré 2019].

Wasserstein distance
The most notable scenario in many OT applications is when X = Y = Ω, where (Ω, d) is a Polish space, e.g. a Euclidean space. In this case there is a natural way of defining the cost c, since the space Ω is already endowed with a notion of distance between the points. In this situation we can define the so-called p-Wasserstein distance, for p ∈ [1, +∞[, as W_p(µ, ν) = (T_{d^p}(µ, ν))^{1/p}, or more precisely:

W_p(µ, ν) = ( inf_{π ∈ Π(µ, ν)} ∫_{Ω×Ω} d^p(x, y) dπ(x, y) )^{1/p}   (2.5)

The name is not misleading: this function satisfies all the axioms of a distance on the space of probability distributions with bounded p-moments, as stated in the next theorem (see [Villani 2008, Definition 6.4]):

Theorem 2.1.1 (The Wasserstein distance is a distance). Let (Ω, d) be a Polish space, p ∈ [1, +∞[ and:

P_p(Ω) := {µ ∈ P(Ω) | ∫_Ω d(x, x_0)^p dµ(x) < +∞}   (2.6)

with x_0 ∈ Ω arbitrary. Let µ, ν ∈ P_p(Ω). Then W_p(µ, ν) < +∞. Moreover:
(i) W_p(µ, ν) = W_p(ν, µ) (symmetry)
(ii) W_p(µ, ν) = 0 ⟺ µ = ν (identity of indiscernibles)
(iii) Let ζ ∈ P_p(Ω); then W_p(µ, ν) ≤ W_p(µ, ζ) + W_p(ζ, ν) (triangle inequality)

Example 2.1.3.
A direct example of the Wasserstein distance is between two Diracs supported at x, y. In this case W_p(δ_x, δ_y) = d(x, y), which is quite the intuitive behavior: the further apart the Diracs are located, the larger their Wasserstein distance.

The distance property renders W_p a powerful tool for comparing probability measures. Another valuable feature of this distance is that it gives a characterization of the weak convergence of probability measures. Informally, a sequence of probability measures gets as close as possible to a probability measure µ if the Wasserstein distance tends to zero. The convergence is based on the following definition:

Definition 2.1.2 (Weak convergence). Let (µ_n)_{n∈N} be a sequence of probability measures on X, a Polish space. We say that (µ_n)_{n∈N} converges weakly to µ in X if for all continuous and bounded functions f : X → R:

∫_X f dµ_n → ∫_X f dµ   (2.7)

The Wasserstein distance metrizes the weak convergence of probability measures; in other words, (µ_n)_{n∈N} converges weakly to µ if and only if W_p(µ_n, µ) → 0. Other distances also metrize the weak convergence, e.g. the Lévy-Prokhorov distance, but the richness of W_p lies in its ability to incorporate a lot of the geometry of the underlying space through the distance d. Consequently, Wasserstein spaces (P_p(Ω), W_p) are very large, and many metric spaces can be embedded into Wasserstein spaces with low distortion [Bourgain 1986, Andoni 2015, Frogner 2019].

A fundamental result of the linear OT theory is the cyclical monotonicity property of its optimal transport plans. Basically, it illustrates that an optimal transport plan cannot be improved locally and, more importantly, that this is also sufficient for being a globally optimal transport plan. Consequently, it characterizes the set of optimal couplings using the notion of c-concave functions, which appears to be very useful for defining duality and for solving the Monge problem (MP) by relying on the Kantorovitch relaxation (KP).
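As a concrete complement to the definition (2.5) and to Example 2.1.3: in one dimension, W_p has a simple closed form. For two empirical distributions with the same number of uniformly weighted points, the optimal coupling is the monotone (sorted) matching, a standard result of 1D optimal transport, so W_p^p is the mean of the p-th powers of the gaps between sorted samples. A minimal sketch (the function name is ours):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1D empirical distributions with
    uniform weights and equally many points: sort both samples and match
    them monotonically, which is optimal in 1D for costs |x - y|^p."""
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)
```

For two Diracs this recovers Example 2.1.3, W_p(δ_0, δ_3) = 3, and translating a sample by t shifts the distance to exactly |t| for every p.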
As such, cyclical monotonicity is perhaps the main ingredient of linear OT. This section aims at presenting briefly both this result and its consequences.

Definition 2.1.3 (Cyclically Monotone Set). Let $c : \mathcal{X} \times \mathcal{Y} \to ]-\infty, +\infty]$ be a real valued function on arbitrary sets $\mathcal{X}, \mathcal{Y}$. A set $\Gamma \subset \mathcal{X} \times \mathcal{Y}$ is said to be cyclically monotone if for all $(x_i, y_i)_{i=1}^N \in \Gamma^N$ and $\sigma$ a permutation of $\{1, \dots, N\}$:
$$ \sum_{i=1}^N c(x_i, y_i) \le \sum_{i=1}^N c(x_i, y_{\sigma(i)}) \qquad (2.8) $$
We will call such sets $c$-CM.

$c$-CM sets are an important notion in OT theory since they characterize optimal transport plans for well-behaved costs. We consider the following definition:

Definition 2.1.4 (Support). Let $(\mathcal{X}, d)$ be a Polish space and $\mu \in \mathcal{P}(\mathcal{X})$. The support of $\mu$ is defined as the smallest closed set $F$ such that $\mu(F) = 1$, or equivalently:
$$ \mathrm{supp}(\mu) = \{ x \in \mathcal{X} \mid \forall \varepsilon > 0, \ \mu(B(x,\varepsilon)) > 0 \} \qquad (2.9) $$
Informally the support of a distribution is where the distribution “lives”, i.e. where it is not zero. In the discrete case $\pi$ is a matrix and the support is given by the indices $(i,j)$ such that $\pi_{ij} > 0$. The following theorem states that the support of an optimal coupling is actually a $c$-CM set:

Theorem 2.1.2 (Theorem 1.38 in [Santambrogio 2015]). Let $c : \mathcal{X} \times \mathcal{Y} \to [0, +\infty[$ be continuous and $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$ with $\mathcal{T}_c(\mu,\nu) < +\infty$. If a coupling $\pi \in \Pi(\mu,\nu)$ is optimal for (KP) then $\mathrm{supp}(\pi)$ is a $c$-CM set.

Interestingly enough, $c$-CM sets are characterized by specific functions based on the notion of $c$-transforms. This property gives another way of computing optimal transport plans based on deterministic functions.

Definition 2.1.5 ($c$-transforms). Let $\mathcal{X}, \mathcal{Y}$ be Polish spaces and $\psi : \mathcal{X} \to \mathbb{R}$ be a function. We define its $c$-transform as the function $\psi^c : \mathcal{Y} \to \mathbb{R}$:
$$ \psi^c(y) = \inf_{x \in \mathcal{X}} c(x,y) - \psi(x) \qquad (2.10) $$
and the $\bar{c}$-transform of a function $\phi : \mathcal{Y} \to \mathbb{R}$ as the function $\phi^{\bar{c}} : \mathcal{X} \to \mathbb{R}$:
$$ \phi^{\bar{c}}(x) = \inf_{y \in \mathcal{Y}} c(x,y) - \phi(y) \qquad (2.11) $$
Functions that can be written as $\psi^c$ or $\phi^{\bar{c}}$ are called respectively $c$-concave or $\bar{c}$-concave functions.

Remark 2.1.2.
The $c$-transform is a generalization of the Legendre transform that is well-known in convex analysis [Rockafellar 1970]. More precisely, for a function $u : \mathbb{R}^d \to \mathbb{R}$ its Legendre transform is defined as $u^*(y) = \sup_{x \in \mathbb{R}^d} \langle y, x \rangle - u(x)$ (see Memo 2.1.2). The $c$-transform corresponds to this notion by considering $c(x,y) = \langle x, y \rangle$ (up to a change of sign). Another special case deserves attention, namely $c(x,y) = \frac{1}{2}\|x-y\|_2^2$. Consider $\psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$; then $\psi$ is $c$-concave if and only if the function $u : x \to \frac{1}{2}\|x\|_2^2 - \psi(x)$ is convex and lower semi-continuous, and the Legendre transform of $u$ is the function $x \to \frac{1}{2}\|x\|_2^2 - \psi^c(x)$ (see [Santambrogio 2015, Proposition 1.21]). When $\mathcal{X} = \mathcal{Y}$ and $c$ is symmetric both notions are equivalent and in this case we will drop this distinction. One important property of the $c$-transform is that it satisfies:
$$ \forall (x,y) \in \mathcal{X} \times \mathcal{Y}, \quad \psi(x) + \psi^c(y) \le c(x,y) \qquad (2.12) $$
Equality in (2.12) is attained on special subsets of $\mathcal{X} \times \mathcal{Y}$, which are precisely the $c$-CM sets, as stated in the next theorem:

Memo 2.1.2 (Convex analysis). For any function $u : \mathbb{R}^d \to \mathbb{R}$ its convex conjugate or Legendre transform is defined by $u^*(y) = \sup_{x \in \mathbb{R}^d} \langle y, x \rangle - u(x)$. The subdifferential of $u$, denoted $\partial u$, is defined for $x \in \mathbb{R}^d$ as $\partial u(x) = \{ y \in \mathbb{R}^d \mid \forall z \in \mathbb{R}^d, \ u(z) - u(x) \ge \langle y, z - x \rangle \}$, which reduces to $\partial u(x) = \{\nabla u(x)\}$ when $u$ is differentiable at $x$. When $u$ is convex and differentiable the convex conjugate has the following important properties (see [Rockafellar 1970, Theorem 23.5]):
(i) $u(x) + u^*(y) \ge \langle x, y \rangle$ (Fenchel–Young inequality)
(ii) $u(x) + u^*(y) = \langle x, y \rangle \iff y \in \partial u(x)$
(iii) In particular $u(x) + u^*(\nabla u(x)) = \langle x, \nabla u(x) \rangle$

Theorem 2.1.3 (Theorem 1.37 in [Santambrogio 2015]). If $\Gamma \neq \emptyset$ is a $c$-CM set in $\mathcal{X} \times \mathcal{Y}$ and $c : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, then there exists a $c$-concave function $\psi : \mathcal{X} \to \mathbb{R} \cup \{-\infty\}$ such that:
$$ \Gamma \subset \{ (x,y) \in \mathcal{X} \times \mathcal{Y} : \psi(x) + \psi^c(y) = c(x,y) \} \qquad (2.13) $$
To summarize, the supports of optimal couplings are necessarily $c$-CM sets, and these $c$-CM sets are also characterized by functions using the notion of $c$-transform. The fundamental theorem of optimal transport states that all these results are in fact equivalent:

Theorem 2.1.4 (Fundamental theorem of linear OT). Let $\mathcal{X}, \mathcal{Y}$ be Polish spaces, $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$ and $c : \mathcal{X} \times \mathcal{Y} \to [0, +\infty[$ lower semi-continuous such that $\mathcal{T}_c(\mu,\nu) < \infty$. Let $\pi \in \Pi(\mu,\nu)$; then the following conditions are equivalent:
(i) $\pi$ is optimal for (KP)
(ii) The support of $\pi$ is $c$-cyclically monotone
(iii) There exists a measurable $c$-concave function $\psi$ such that $\psi(x) + \psi^c(y) = c(x,y)$ $\pi$-a.e.

Proof. We give a sketch of proof; for completeness the reader can refer to Theorem 5.10 in [Villani 2008]. Theorems 2.1.2 and 2.1.3 already prove that $(i) \implies (ii) \implies (iii)$ in the case where $c$ is continuous. To pass from continuity to lower semi-continuity we can consider a sequence $(c_k)_k$ of costs that converges increasingly to $c$ and observe that [Santambrogio 2015, Lemma 1.41]:
$$ \lim_{k \to +\infty} \inf_{\pi \in \Pi(\mu,\nu)} \int c_k(x,y)\, \mathrm{d}\pi(x,y) = \inf_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, \mathrm{d}\pi(x,y) $$
and, using some subtleties, this proves $(i) \implies (ii) \implies (iii)$ when $c$ is lower semi-continuous. For the converse we can easily prove that $(iii) \implies (i)$. By hypothesis $\int c(x,y)\, \mathrm{d}\pi(x,y) = \int \psi(x)\, \mathrm{d}\mu(x) + \int \psi^c(y)\, \mathrm{d}\nu(y)$. However for any other coupling $\pi'$ we have $\int c(x,y)\, \mathrm{d}\pi'(x,y) \ge \int \psi(x)\, \mathrm{d}\mu(x) + \int \psi^c(y)\, \mathrm{d}\nu(y)$ by relation (2.12), and so $\int c(x,y)\, \mathrm{d}\pi'(x,y) \ge \int c(x,y)\, \mathrm{d}\pi(x,y)$, so that $\pi$ is optimal. Technical details are hidden here for proving the measurability and integrability of $\psi, \psi^c$.

The main theorem of OT for duality
A first implication of the fundamental theorem is related to a duality principle, a widely used property in linear programming (see Section 2.1.4 for more details). This property extends in full generality to the context of OT, as stated in the next theorem:
Theorem 2.1.5 (Duality theorem). Let $\mathcal{X}, \mathcal{Y}$ be Polish spaces, $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$ and $c : \mathcal{X} \times \mathcal{Y} \to [0, +\infty]$ be lower semi-continuous (l.s.c.) such that $\mathcal{T}_c(\mu,\nu) < +\infty$; then strong duality holds. More precisely the dual problem:
$$ \sup_{\substack{\phi, \psi \in C_b(\mathcal{X}) \times C_b(\mathcal{Y}) \\ \forall (x,y) \in \mathcal{X} \times \mathcal{Y}, \ \phi(x) + \psi(y) \le c(x,y)}} \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\nu(y) \qquad \text{(DKP)} $$
leads to the same optimum as the (KP) problem. Equivalently:
$$ \mathcal{T}_c(\mu,\nu) = \sup_{\phi, \psi \in \Phi_c} \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\nu(y) \qquad (2.14) $$
where $\Phi_c$ is the set of pairs of continuous bounded functions which verify:
$$ \forall (x,y) \in \mathcal{X} \times \mathcal{Y}, \quad \phi(x) + \psi(y) \le c(x,y) \qquad (2.15) $$
This result still holds when $\Phi_c$ is replaced by $\Phi_c(\mu,\nu)$, the set of integrable functions which satisfy (2.15).

Proof. For completeness we give here an idea of the proof; the interested reader can refer to Theorem 5.10 in [Villani 2008] for more details. If $\phi, \psi \in \Phi_c(\mu,\nu)$ and $\pi \in \Pi(\mu,\nu)$ then by hypothesis:
$$ \int \phi(x)\, \mathrm{d}\mu(x) + \int \psi(y)\, \mathrm{d}\nu(y) = \int \phi(x) + \psi(y)\, \mathrm{d}\pi(x,y) \le \int c(x,y)\, \mathrm{d}\pi(x,y) \qquad (2.16) $$
which implies that $\sup_{\phi,\psi \in \Phi_c(\mu,\nu)} \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\nu(y) \le \mathcal{T}_c(\mu,\nu)$. To show the converse inequality we use the cyclical monotonicity properties of optimal transport plans. Let $\pi^*$ be an optimal coupling for the (KP) problem. Using Theorem 2.1.4 we know that there exists a $c$-concave function $\psi$ such that $\psi(x) + \psi^c(y) = c(x,y)$ $\pi^*$-a.e. In this way:
$$ \int c(x,y)\, \mathrm{d}\pi^*(x,y) = \int \psi(x)\, \mathrm{d}\mu(x) + \int \psi^c(y)\, \mathrm{d}\nu(y) \le \sup_{\phi, \psi \in \Phi_c} \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\nu(y) \qquad (2.17) $$
The last inequality stems from property (2.12) of the $c$-transform, since $(\psi, \psi^c) \in \Phi_c$. If $c$ is continuous and bounded then so are $\psi, \psi^c$, so the last inequality is valid. If $c$ is only l.s.c. then we can show that there is a sequence $(c_k)_k$ of bounded Lipschitz costs such that $c = \sup_k c_k$. A limit argument suffices to conclude in this case.

The functions $\phi, \psi$ are usually called Kantorovitch potentials and play an important role in OT problems. Given two admissible potentials $\phi, \psi$, i.e. satisfying $\phi(x) + \psi(y) \le c(x,y)$, we can always cook up a pair of “better” potentials using the $c$-transform. Indeed, due to (2.12), one can check that the pairs $(\phi, \phi^c)$ and $(\psi^{\bar{c}}, \psi)$ are also admissible potentials and improve the objective function. It turns out that after one iteration of this procedure we cannot improve the potentials anymore. Based on this remark we can also write the duality as a maximization over one single potential, which is the semi-dual formulation:
$$ \sup_{\phi \ c\text{-concave}} \int_{\mathcal{X}} \phi(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{Y}} \phi^c(y)\, \mathrm{d}\nu(y) \qquad (2.18) $$

Example 2.1.4.
When $c = d$ is a distance on some space $\Omega$ there is a tight connection between the $c$-transform and 1-Lipschitz functions. Indeed, suppose that $\phi$ is a 1-Lipschitz function; then for $x, y \in \Omega$, $\phi(y) \le \phi(x) + d(x,y)$, with equality at $x = y$, so that $\phi(y) = \inf_{x \in \Omega} \phi(x) + d(x,y) = (-\phi)^d(y)$, which proves that the $d$-transform of $-\phi$ is $\phi$. The converse is also true, so that the semi-dual formulation can be written:
$$ \sup_{\phi \in \mathrm{Lip}_1(\Omega)} \int_{\Omega} \phi(x)\, \mathrm{d}\mu(x) - \int_{\Omega} \phi(y)\, \mathrm{d}\nu(y) \qquad (2.19) $$
This formulation is very useful in practice in the context of generative modeling [Arjovsky 2017] (see Section 2.1.5).
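On discrete supports the $c$-transform (2.10) reduces to a minimum over the source points, which makes the potential-improvement idea discussed above easy to experiment with. A minimal NumPy sketch (the function names are ours; the cost matrix $C_{ij} = c(x_i, y_j)$ plays the role of $c$):

```python
import numpy as np

def c_transform(psi, C):
    # Discrete version of (2.10): psi_c[j] = min_i C[i, j] - psi[i].
    return np.min(C - psi[:, None], axis=0)

def cbar_transform(phi, C):
    # Discrete version of (2.11): phi_cbar[i] = min_j C[i, j] - phi[j].
    return np.min(C - phi[None, :], axis=1)

# By construction any pair (psi, psi^c) is admissible for the dual (2.14):
C = np.array([[0.0, 1.0], [1.0, 0.0]])
psi = np.array([-1.0, 0.0])
psi_c = c_transform(psi, C)
assert np.all(psi[:, None] + psi_c[None, :] <= C + 1e-12)
```

Applying `cbar_transform` to `psi_c` then yields a potential whose dual objective is at least as good, mirroring the one-step improvement property mentioned before Example 2.1.4.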
The main theorem of OT for the Monge problem
From a theoretical perspective, one fundamental question that arises is the regularity of these potentials; under some assumptions, we can use them to solve the Monge problem (MP) through the Kantorovitch relaxation (KP). We have the following result:
Proposition 2.1.1 (Proposition 1.15 in [Santambrogio 2015]). Let $\mathcal{X} = \mathcal{Y} = \Omega \subset \mathbb{R}^d$ and $c \in C^1$. In the following $\phi$ denotes a Kantorovitch potential. If $(x_0, y_0) \in \mathrm{supp}(\pi^*)$ then:
$$ \nabla \phi(x_0) = \nabla_x c(x_0, y_0) \qquad (2.20) $$
provided that $\phi$ is differentiable at $x_0$.

This proposition suggests the following strategy in order to find an optimal coupling: (1) Ensure that $\phi$ is differentiable $\mu$-a.e. This can be guaranteed when $\mu$ is absolutely continuous with respect to the Lebesgue measure and when $\phi$ is regular enough, such as Lipschitz. (2) Deduce from the previous proposition that $\pi$ is characterized by a deterministic function, namely the map associating $y_0$ to each $x_0$. The idea here is to “invert” $\nabla_x c$ in (2.20) and deduce from $(x_0, y_0)$ that $y_0$ is uniquely determined by $x_0$. This can be done under some regularity assumptions on $c$ and the spaces $\mathcal{X}, \mathcal{Y}$. When conditions (1) and (2) are satisfied we can deduce that the optimal coupling is unique, since it was constructed using $\phi$ and $c$ only. Step (2) can be verified e.g. under the following condition:

Definition 2.1.6 (Twist condition). For $\Omega \subset \mathbb{R}^d$ we say that $c : \Omega \times \Omega \to \mathbb{R}$ satisfies the Twist condition whenever $c$ is differentiable w.r.t. $x$ at every point and the map $y \to \nabla_x c(x_0, y)$ is injective for every $x_0$.

When working on Euclidean domains and when the cost $c \in C^2$, this condition corresponds to $\det\left( \frac{\partial^2 c}{\partial y_i \partial x_j} \right) \neq 0$. The squared Euclidean cost is an important example of a cost which satisfies the Twist condition and leads to the celebrated Brenier theorem [Brenier 1991]:

Theorem 2.1.6 (Brenier). Let $\Omega = \mathbb{R}^p$, $c(x,y) = \|x-y\|_2^2$, $\mu \in \mathcal{P}(\mathbb{R}^p)$ absolutely continuous with respect to the Lebesgue measure, and $\nu \in \mathcal{P}(\mathbb{R}^p)$ with $\int \|x\|_2^2\, \mathrm{d}\mu(x) < +\infty$, $\int \|y\|_2^2\, \mathrm{d}\nu(y) < +\infty$. The optimal transport plan $\pi^*$ of (KP) is unique and supported on the gradient of a convex function. More precisely it can be written as $\pi^* = (id \times T)_\# \mu$ where $T = \nabla \phi$ and $\phi : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is convex and finite almost everywhere. Moreover $T$ is the unique solution of (MP). If $T'$ is another optimal solution then $T = T'$ $\mu$-a.e.

This result can be generalized to costs $c(x,y) = h(y-x)$ with $h$ strictly convex, in which case $T$ can be written as $T(x) = x - (\nabla h)^{-1}(\nabla \phi(x))$ where $\phi$ is a $c$-concave function (see e.g. [Gangbo 1996]). The regularity of the potential functions $\phi, \psi$ and its consequences for Optimal Transport problems is a long-standing line of research. Hypotheses on the cost function $c$ more general than the Twist condition were found by Ma, Trudinger and Wang, who identified a key assumption requiring fourth-order conditions on the cost function [Ma 2005]. The resulting MTW conditions turned out to be sufficient to prove the regularity of the Kantorovitch potentials. We refer the reader to [Figalli 2010] for a survey on this topic.

Two important special cases will be considered in this manuscript, namely the cases where $\mu$ and $\nu$ are probability distributions on $\mathbb{R}$ or when they are Gaussian distributions. These cases are well-known in linear OT for having closed-form solutions, which are given in the next results (respectively [Santambrogio 2015, Theorem 2.9] and [Peyré 2019, Remark 2.30]).

Figure 2.4: OT for 1D probability measures can be computed using simple sorts. (left)
Optimal coupling between two discrete probability measures with uniform weights. When the points are sorted, the coupling associates the first point of the source with the first point of the target and so on. (right) Generic case: the optimal coupling associates the points horizontally w.r.t. the cumulative distributions of the probability measures. In this example $x_1$ is associated with $y_1$.

Theorem 2.1.7 (Closed-form expression on the real line). Assume that $\Omega = \mathbb{R}$ and $\mu, \nu \in \mathcal{P}(\mathbb{R})$. Let $F_\mu$ be the cumulative distribution function:
$$ \forall t \in \mathbb{R}, \quad F_\mu(t) = \mu(]-\infty, t]) \qquad (2.21) $$
and $F_\mu^{-1}$ its pseudo-inverse, namely:
$$ \forall x \in [0,1], \quad F_\mu^{-1}(x) = \inf \{ t \in \mathbb{R} \mid F_\mu(t) \ge x \} \qquad (2.22) $$
If $c(x,y) = h(y-x)$ where $h$ is strictly convex then (KP) has a unique solution given by $\pi^* = (F_\mu^{-1} \times F_\nu^{-1})_\# \mathcal{L}_{[0,1]}$, where $\mathcal{L}_{[0,1]}$ is the Lebesgue measure restricted to $[0,1]$. Moreover if $\mu$ is atomless $\pi^*$ is supported on $T_{mon}(x) = F_\nu^{-1}(F_\mu(x))$, i.e. $\pi^* = (id \times T_{mon})_\# \mu$. If $h$ is only convex then $\pi^*$ is still optimal but uniqueness cannot be guaranteed.

This theorem states that it suffices to sort the supports of the distributions in order to recover the optimal coupling (see Figure 2.4). In the special case where $\mu = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, $\nu = \frac{1}{n} \sum_{i=1}^n \delta_{y_i}$, this corresponds to sorting $x_1 \le \cdots \le x_n$, $y_1 \le \cdots \le y_n$ and associating $x_1$ with $y_1$, $x_2$ with $y_2$ and so on. In the case $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, the previous theorem states that, after sorting the points, the optimal mapping is obtained by putting as much mass as possible from $x_1$ to $y_1$ and adding the remaining mass to $y_2$; this procedure is repeated until there is no more mass left. This corresponds to a monotone rearrangement: if $\pi_{ij} > 0$ and $\pi_{i'j'} > 0$, then $x_i \le x_{i'}$ implies $y_j \le y_{j'}$. Overall the Wasserstein distance in 1D can be computed using simple sorts. This result is the main ingredient of the sliced-Wasserstein distance (see Section 2.1.5).

Another special case arises when the probability measures are Gaussian. This is a well-known result in the literature of OT geometry [Givens 1984, McCann 1997, Takatsu 2011] which is recalled in the following theorem:

Theorem 2.1.8 (Closed-form expression for Gaussians). Let $\mu = \mathcal{N}(m_\mu, \Sigma_\mu)$, $\nu = \mathcal{N}(m_\nu, \Sigma_\nu)$ and suppose that $c(x,y) = h(y-x)$ with $h$ strictly convex.

Figure 2.5: Linear displacement interpolation between two discrete probability measures $\mu, \nu$ on $\mathbb{R}$ using the linear map $T$ defined in Theorem 2.1.8.
The figure shows the displacement interpolant, which is a probability measure defined by $((1-t)\, id + t\, T)_\# \mu$ for $t \in [0,1]$ [McCann 1997] (see Section 2.1.6 for more details).
Let $T : x \to m_\nu + A(x - m_\mu)$ where:
$$ A = \Sigma_\mu^{-1/2} \left( \Sigma_\mu^{1/2} \Sigma_\nu \Sigma_\mu^{1/2} \right)^{1/2} \Sigma_\mu^{-1/2} \qquad (2.23) $$
Then $T$ is the unique optimal solution of (MP) and $\pi^* = (id \times T)_\# \mu$ is the unique optimal solution of (KP). In particular when $c(x,y) = \|x-y\|_2^2$ is the squared Euclidean distance on $\mathbb{R}^d$, the $2$-Wasserstein distance is given by:
$$ W_2^2(\mu,\nu) = \|m_\mu - m_\nu\|_2^2 + \mathcal{B}^2(\Sigma_\mu, \Sigma_\nu) \qquad (2.24) $$
where $\mathcal{B}^2(\Sigma_\mu, \Sigma_\nu) = \mathrm{tr}\left( \Sigma_\mu + \Sigma_\nu - 2 \left( \Sigma_\mu^{1/2} \Sigma_\nu \Sigma_\mu^{1/2} \right)^{1/2} \right)$ is the (squared) Bures metric [Bures 1969].

Interestingly enough, the problem of computing OT between Gaussian measures draws connections with the general case. Indeed, for $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ and $c(x,y) = \|x-y\|_2^2$, the optimal map $T$ defined in Theorem 2.1.8 is actually the optimal Monge map of (MP) when restricted to the class of linear Monge maps [Flamary 2019, Proposition 1]. Figure 2.5 illustrates the behavior of this map for two discrete probability measures on $\mathbb{R}$. Note that a generalization of the previous result exists for elliptical distributions, which are generalizations of Gaussian densities; in this case $W_2$ also admits a closed form (see [Muzellec 2018]).

In most machine learning applications we do not have access to the true distributions $\mu, \nu$ but only to samples from these distributions. As such a natural question arises: can we infer from these samples good estimates of OT objects such as couplings or OT distances? One particular question is how well we can estimate the Wasserstein distance by relying only on samples of the distribution. If we consider a probability distribution $\mu \in \mathcal{P}(\mathbb{R}^d)$ and an empirical distribution $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ where $x_i \sim \mu$ are i.i.d. samples, is $\mu_n$ a good proxy for $\mu$? Unfortunately the sample complexity of the estimation of the Wasserstein distance is exponential in the dimension of the ambient space. More precisely $\mathbb{E}[W_p(\mu_n, \mu)] = O(n^{-1/d})$, so that the Wasserstein distance suffers from the curse of dimensionality [Dudley 1969, Weed 2017].
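Returning to the Gaussian case, the closed form (2.23)-(2.24) is straightforward to evaluate numerically. A minimal NumPy sketch (function names are ours; the PSD square root is computed by eigendecomposition):

```python
import numpy as np

def sqrtm_psd(M):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussians(m_mu, S_mu, m_nu, S_nu):
    # Squared 2-Wasserstein distance between N(m_mu, S_mu) and N(m_nu, S_nu):
    # ||m_mu - m_nu||^2 + tr(S_mu + S_nu - 2 (S_mu^{1/2} S_nu S_mu^{1/2})^{1/2})
    S_mu_half = sqrtm_psd(S_mu)
    cross = sqrtm_psd(S_mu_half @ S_nu @ S_mu_half)
    bures_sq = np.trace(S_mu + S_nu - 2.0 * cross)
    return float(np.sum((m_mu - m_nu) ** 2) + bures_sq)
```

For two 1D Gaussians $\mathcal{N}(0, \sigma_1^2)$ and $\mathcal{N}(m, \sigma_2^2)$ this reduces to $m^2 + (\sigma_1 - \sigma_2)^2$, which gives a quick sanity check of the formula.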
It was shown in [Weed 2017] that this result can be refined to $O(n^{-1/d^*})$ where $d^*$ is the intrinsic dimension of the data but, generally, this is a major bottleneck for the use of OT in high-dimensional machine learning problems. The previous analysis can be extended to the infinite-dimensional setting, as analysed in [Lei 2020]. The problem of estimating the optimal coupling by relying on small batches of $\mu$ when it is discrete was further analyzed in [Fatras 2020]. To circumvent this limitation, some robust projection formulations have been proposed [Niles-Weed 2019, Lin 2020], as well as strategies such as Gaussian smoothing [Goldfeld 2020] or wavelet estimators [Weed 2019a]. The entropic regularization presented in the next section is also one of the tools that facilitates the estimation of $W_p$ in high-dimensional settings. For more details about statistical aspects of OT we refer the reader to [Weed 2019b].

Figure 2.6: An assignment of $x_1, x_2, x_3, x_4$ to $y_1, y_2, y_3, y_4$ can be described either by a permutation $\sigma$ or by the assignment matrix $\pi_\sigma$. In this example both correspond to the same permutation $\sigma$.

In this section we consider the problem of computing OT between discrete probability measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^m b_j \delta_{y_j}$. The problem can be solved in many ways and we aim here at giving a brief summary of these possibilities. We denote by $C = (c_{ij})_{i \in [\![n]\!], j \in [\![m]\!]}$ the matrix of all pairwise costs between the samples $x_i, y_j$, i.e. $c_{ij} = c(x_i, y_j)$ for all $i \in [\![n]\!], j \in [\![m]\!]$. In the discrete case the underlying problem reads:
$$ \min_{\pi \in \Pi(a,b)} \langle C, \pi \rangle_F = \min_{\pi \in \Pi(a,b)} \sum_{ij} c_{ij} \pi_{ij}. \qquad (2.25) $$
As described previously the problem is linear in $\pi$; in this way the discrete case corresponds to a linear program (LP) [Dantzig 1997]. Before discussing potential algorithms for solving equation (2.25) we detail one important special case.

Assignment problems
Suppose that $m = n$. In this case we can look for an assignment of the points, that is, a one-to-one correspondence between the points $x_i, y_j$. Mathematically this translates into looking for a permutation $\sigma \in S_n$ of the points, or for the permutation matrix $(\pi_\sigma)_{ij} = \frac{1}{n}$ if $j = \sigma(i)$ and $0$ otherwise, such that the overall cost is minimized (see Figure 2.6). In this situation one aims at solving:
$$ \min_{\sigma \in S_n} \frac{1}{n} \sum_{i=1}^n c_{i,\sigma(i)} = \min_{\sigma \in S_n} \langle C, \pi_\sigma \rangle_F \qquad (2.26) $$
This problem is well known in the literature as the linear assignment problem (see e.g. [Burkard 1999]). It is worth pointing out that, in this case, it exactly corresponds to the Monge problem (MP) in the discrete case. Interestingly enough, when the weights of the OT problem (2.25) are set as uniform, i.e. $a = b = \frac{\mathbf{1}_n}{n}$, both problems (2.25) and (2.26) are equivalent. More precisely, by combining the fundamental theorem of linear programming [Bertsimas 1997], which states that the minimum of a linear program is reached at an extremal point of the polyhedron, and Birkhoff’s theorem [Birkhoff 1946], which states that the extremal points of $\Pi(\frac{\mathbf{1}_n}{n}, \frac{\mathbf{1}_n}{n})$ are the (rescaled) permutation matrices, we can conclude that the optimum of (2.25) is reached at a $\pi_\sigma$ which is optimal for (2.26).

Algorithm 1: North-West corner rule
Input: $a$, $b$; set $i, j = 1$
while $i \le n$, $j \le m$ do
  $\pi_{ij} = \min\{a_i, b_j\}$  // Send as many units as possible from $i$ to $j$
  $a_i = a_i - \pi_{ij}$  // Adjust the supply
  $b_j = b_j - \pi_{ij}$  // Adjust the demand
  If $a_i = 0$, $i = i + 1$; if $b_j = 0$, $j = j + 1$
end while

Algorithmic solutions
To solve the OT problem (2.25) in general one can rely on classical algorithms for solving (LP) [Dantzig 1997]. We give here a brief overview of possible numerical solutions and refer the reader to Section 3 in [Peyré 2019] for more details. As seen in Theorem 2.1.5, the OT problem can be solved using duality, which reads in the discrete case:
$$ \max_{\substack{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m \\ \forall (i,j) \in [\![n]\!] \times [\![m]\!], \ \alpha_i + \beta_j \le c_{ij}}} \alpha^T a + \beta^T b \qquad (2.27) $$
where $\alpha, \beta$ denote the (discrete) Kantorovitch potentials. Thanks to the fundamental Theorem 2.1.4, an optimal solution $\pi^*$ of the primal problem is found when $\alpha^*_i + \beta^*_j = c_{ij}$ for $\pi^*_{ij} > 0$, where $\alpha^*, \beta^*$ are solutions of the dual problem. Using this remark we can solve (2.25) by relying on the Network Simplex algorithm, whose philosophy is to find feasible solutions $(\alpha, \beta)$ such that $\alpha_i + \beta_j = c_{ij}$ whenever $\pi_{ij} > 0$ (we say that $\pi$ and $(\alpha, \beta)$ are complementary w.r.t. $C$). The complexity of this algorithm is $O(n^3 \log(n))$ when $m = n$. The special case of uniform weights for assignment problems can be solved using the Auction algorithm, which has a cubic complexity $O(n^3)$.

Special cases: Monge property
The case where $C$ has a special structure deserves attention, in particular when $C$ satisfies the following Monge property [Burkard 1996]:
$$ \forall (i,j), \quad c_{i,j} + c_{i+1,j+1} \le c_{i+1,j} + c_{i,j+1} \qquad (2.28) $$
which can be tested in $O(mn)$ operations. This property has some interesting historical background. It is actually based on the original observation of Monge, who states that if mass located at two points $A, B$ must be transported to two locations $a, b$, then the chosen routes must not intersect, since the sum $Ab + Ba$ of crossing routes is always larger than the sum $Aa + Bb$ of non-crossing ones: better not to cross the paths! In this case the simple North-West corner rule (see Algorithm 1) produces an optimal solution in $O(n+m)$. The original quote by Monge is [Monge 1781]: "Lorsque le transport du deblai se fait de manière que la somme des produits des molécules par l’espace parcouru est un minimum, les routes de deux points quelconques A & B, ne doivent plus se couper entre leurs extrémités, car la somme Ab + Ba, des routes qui se coupent, est toujours plus grande que la somme Aa + Bb, de celles qui ne se coupent pas." (When the transport of the rubble is done so that the sum of the products of the molecules by the distance travelled is a minimum, the routes of any two points A & B must no longer cross between their endpoints, for the sum Ab + Ba of crossing routes is always greater than the sum Aa + Bb of those that do not cross.)
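Algorithm 1 is short enough to transcribe directly. A minimal NumPy sketch (function name is ours; we assume $a$ and $b$ sum to the same total mass):

```python
import numpy as np

def north_west_corner(a, b):
    # Fill the coupling greedily from the top-left ("north-west") corner:
    # send as much mass as possible, then advance the exhausted index.
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n, m = len(a), len(b)
    pi = np.zeros((n, m))
    i = j = 0
    while i < n and j < m:
        t = min(a[i], b[j])   # send as many units as possible from i to j
        pi[i, j] = t
        a[i] -= t             # adjust the supply
        b[j] -= t             # adjust the demand
        if a[i] == 0.0:
            i += 1
        if b[j] == 0.0:
            j += 1
    return pi
```

The returned plan always has the prescribed marginals; when $C$ additionally satisfies the Monge property (2.28), for instance for convex costs on sorted 1D supports as in Theorem 2.1.7, it is also optimal.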
Figure 2.7: Effect of the entropic regularization parameter $\varepsilon$ on the optimal coupling $\pi^*_\varepsilon$ between two 1D probability distributions. As $\varepsilon$ increases the coupling tends to blur and converges to the product coupling of the marginals.

Special cases: 1D probability distributions
As seen in Section 2.1.2, the case of 1D probability distributions can be solved efficiently using simple sorts when $C$ is e.g. a squared Euclidean distance matrix. The complexity of computing the Wasserstein distance is $O(n \log(n))$ when $n = m$ and the weights are uniform; in general it suffices to compute the two cumulative distribution functions, which is $O(n \log(n) + m \log(m))$.

Special cases: Gaussian distributions
When $\mu$ and $\nu$ are Gaussian distributions (and with a squared Euclidean cost) the OT problem is also quite easy to solve. In the discrete case, when relying on samples from $\mu, \nu$ and using the empirical versions of the means and covariances, finding the optimal solution has a $O((n+m)d^2 + d^3)$ complexity. Although these special cases exist, solving the OT problem in general remains costly. The next section presents a regularization scheme that tends to lower this computational complexity and was one of the major breakthroughs in OT in the past years.

Entropic regularization
The idea of penalizing the entropy of the joint coupling can be traced back to Schrödinger [Schrodinger 1931] and its use for linear OT to Wilson [Wilson 1969], yet it was made popular quite recently in the OT community [Cuturi 2013]. The entropic regularization has multiple virtues in practice: 1) it turns the optimal transport problem into a strongly-convex minimization problem whose solution is unique; 2) solving an entropic regularized OT only involves simple iterations of matrix-vector products, which can be plugged easily into modern differentiable frameworks; 3) it can be accelerated on GPU and can solve several OT problems in parallel; 4) it has many desirable properties for high-dimensional problems, statistically speaking. The entropy term for a coupling $\pi$ reads:
$$ H(\pi) = -\sum_{ij} \pi_{ij} (\log(\pi_{ij}) - 1) \qquad (2.29) $$
and the corresponding entropic regularized OT problem is:
$$ \mathcal{T}_c^\varepsilon(\mu,\nu) = \min_{\pi \in \Pi(a,b)} \langle C, \pi \rangle_F - \varepsilon H(\pi) \qquad (\varepsilon\text{-KP}) $$
Interestingly enough, the optimal cost of (KP) is recovered as $\varepsilon \to 0$, i.e. $\lim_{\varepsilon \to 0} \mathcal{T}_c^\varepsilon(\mu,\nu) = \mathcal{T}_c(\mu,\nu)$ [Peyré 2019, Proposition 4.1]. As a side effect, the entropic term tends to blur the optimal coupling, so that more points are associated compared to the sparse optimal solution of the original problem. In other words, entropy forces the solution to have a spread support. In the limit $\varepsilon \to +\infty$ all points are coupled together, such that $\lim_{\varepsilon \to +\infty} \pi^*_\varepsilon = ab^T$, where $\pi^*_\varepsilon$ denotes the optimal coupling of ($\varepsilon$-KP) (see Figure 2.7). Note that the entropic regularization can also be defined when the probability measures are not discrete, in which case it reads $H(\pi) = -\int_{\mathcal{X} \times \mathcal{Y}} \left( \log\left( \frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\mu(x)\mathrm{d}\nu(y)} \right) - 1 \right) \mathrm{d}\pi(x,y)$.

Algorithm 2: Sinkhorn-Knopp Algorithm for entropic transport
Input: $a$, $b$, $C$, $\varepsilon > 0$; initialize $u^{(0)}, v^{(0)} = \mathbf{1}$, $K = \exp(-C/\varepsilon)$
for $i = 1, \dots, n_{it}$ do
  $u^{(i)} = a \oslash K v^{(i-1)}$  // Update left scaling
  $v^{(i)} = b \oslash K^\top u^{(i)}$  // Update right scaling
end for
return $\pi^* = \mathrm{Diag}(u) K \mathrm{Diag}(v)$

Sinkhorn-Knopp and Bregman projections
A simple analytic solution of ($\varepsilon$-KP) can be found using Lagrangian duality, as expressed in the following proposition [Peyré 2019, Proposition 4.3]:

Proposition 2.1.2. Problem ($\varepsilon$-KP) has a unique solution of the form $\pi^* = \mathrm{Diag}(u) K \mathrm{Diag}(v)$ with $K = e^{-C/\varepsilon}$ and $(u, v) \in \mathbb{R}^n_+ \times \mathbb{R}^m_+$.

As shown in [Sinkhorn 1967], there is a unique solution of the form $\pi^* = \mathrm{Diag}(u) K \mathrm{Diag}(v)$ with marginals $a, b$ provided that $K$ is positive. Moreover it can be recovered with the Sinkhorn-Knopp matrix scaling algorithm, which relies on matrix multiplications, alternately updating $u$ and $v$ so that $\pi^*$ has the prescribed marginals (see Algorithm 2). When $n = m$, and by setting $\varepsilon = \frac{\tau}{4 \log(n)}$, the Sinkhorn algorithm produces a solution $\pi^*$ such that $\langle C, \pi^* \rangle_F \le \mathcal{T}_c(\mu,\nu) + \tau$ after $O(\|C\|_\infty^3 \log(n) \tau^{-3})$ iterations [Altschuler 2017]. In particular this implies that a $\tau$-approximate solution of the original unregularized problem can be computed in $O(n^2 \log(n) \tau^{-3})$ time.

From a practical point of view the Sinkhorn algorithm suffers from stability issues when $\varepsilon \to 0$: $K$ vanishes rapidly, which results in divisions by 0 during the algorithm's iterations. To avoid such underflows for small values of $\varepsilon$, [Schmitzer 2016] suggests a log-sum-exp stabilization trick whose iterations turn out to be mathematically equivalent to the original ones.

This problem is also a special case of a Kullback-Leibler minimization problem, where one wants to find a coupling matrix $\pi^*$ as close as possible to a kernel $K$ in the sense of the Kullback-Leibler geometry. More precisely, ($\varepsilon$-KP) is equivalent to:
$$ \min_{\pi \in \Pi(a,b)} KL(\pi | K) \qquad (2.30) $$
where $K = e^{-C/\varepsilon}$ and $KL(\pi | K) = \sum_{ij} \pi_{ij} \log\left( \frac{\pi_{ij}}{K_{ij}} \right) - \pi_{ij} + K_{ij}$ is the Kullback-Leibler divergence between $\pi$ and $K$. Reformulating OT problems as the minimization of a Kullback-Leibler divergence allows the use of the machinery of Bregman projections in order to find a solution and to analyse the convergence [Benamou 2015]. This formulation is particularly interesting for solving multi-marginal OT problems [Nenna 2016], regularized OT barycenters [Benamou 2015, Bigot 2019b] and the Gromov-Wasserstein problem [Peyré 2016], in short.
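The iterations of Algorithm 2 can be sketched in a few lines of NumPy (a minimal, unstabilized version; for small $\varepsilon$ the log-sum-exp trick of [Schmitzer 2016] mentioned above should be used instead):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    # Sinkhorn-Knopp scaling: find u, v such that
    # pi = Diag(u) K Diag(v) has marginals a and b, with K = exp(-C / eps).
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)       # update left scaling
        v = b / (K.T @ u)     # update right scaling
    return u[:, None] * K * v[None, :]
```

Each iteration costs two matrix-vector products, which is what makes the scheme easy to batch and to run on GPU, as discussed above.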
Sinkhorn divergences
One drawback of entropic regularized OT is that it induces a bias, $\mathcal{T}_c^\varepsilon(\mu,\mu) \neq 0$, which can be problematic for learning with $\mathcal{T}_c^\varepsilon$. In [Genevay 2018] the authors propose to correct this bias by considering the so-called Sinkhorn divergence:
$$ SD_{c,\varepsilon}(\mu,\nu) = \mathcal{T}_c^\varepsilon(\mu,\nu) - \tfrac{1}{2} \mathcal{T}_c^\varepsilon(\mu,\mu) - \tfrac{1}{2} \mathcal{T}_c^\varepsilon(\nu,\nu). \qquad (2.31) $$
This divergence enjoys many valuable properties. First, it defines a symmetric, positive definite, smooth function on the space of probability measures that is convex in both $\mu$ and $\nu$ and that metrizes the weak convergence of probability measures [Feydy 2019]. Second, it interpolates, through $\varepsilon$, between the Wasserstein distance and kernel norms (MMD), allowing a trade-off between both. Finally, it is more suited for high-dimensional problems, where the estimation of the Wasserstein distance is known to suffer from the curse of dimensionality (the sample complexity is $O(n^{-1/d})$ as explained in Section 2.1.3), whereas the sample complexity of $SD_{c,\varepsilon}$ is $O(\varepsilon^{-d/2} n^{-1/2})$ [Genevay 2019].

Stochastic Optimal Transport: going large scale.
The regularization of linear OT allows deriving stochastic formulations that are useful in practice to handle large scale datasets. This setting was considered in [Genevay 2016, Seguy 2018], where the authors rely on the dual formulation (2.14) or the semi-dual formulation (2.18) in the regularized case. More precisely, for $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ and $\varepsilon > 0$, the dual (resp. semi-dual) boils down to solving the following unconstrained maximization problems:
$$ \sup_{\phi, \psi \in C(\mathbb{R}^d) \times C(\mathbb{R}^d)} \mathbb{E}_{x \sim \mu, y \sim \nu} \left[ F_\varepsilon(\phi(x), \psi(y)) \right] \qquad \text{(s-D)} $$
$$ \sup_{\psi \in C(\mathbb{R}^d)} \mathbb{E}_{x \sim \mu} \left[ H_\varepsilon(x, \psi) \right] \qquad \text{(s-SD)} $$
where $F_\varepsilon(\phi(x), \psi(y)) = \phi(x) + \psi(y) - \varepsilon e^{\frac{\phi(x) + \psi(y) - c(x,y)}{\varepsilon}}$ and $H_\varepsilon(x, \psi) = \int_{\mathcal{Y}} \psi(y)\, \mathrm{d}\nu(y) - \varepsilon \log\left( \int_{\mathcal{Y}} e^{\frac{\psi(y) - c(x,y)}{\varepsilon}}\, \mathrm{d}\nu(y) \right) - \varepsilon$ when entropic regularization is used. Since the problem is recast as the unconstrained maximization of an expectation, the idea is to use stochastic gradient tools such as Stochastic Gradient Descent (SGD) or Stochastic Averaged Gradient (SAG) to compute a solution of (s-D), (s-SD). When both $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ are discrete, (SGD) or (SAG) are directly applicable to maximize the following finite sums:
$$ \max_{\alpha, \beta \in \mathbb{R}^n \times \mathbb{R}^m} \sum_{i=1,j=1}^{n,m} F_\varepsilon(\alpha_i, \beta_j)\, a_i b_j \qquad \text{(s-Ddis)} $$
$$ \max_{\beta \in \mathbb{R}^m} \sum_{i=1}^n H_\varepsilon(x_i, \beta)\, a_i \qquad \text{(s-SDdis)} $$
In [Genevay 2016] the authors propose to use (SAG) to compute (s-SDdis), which operates at each iteration by sampling a point $x_k$ from $\mu$ and computing the gradient of $H_\varepsilon(x_k, \beta)$ corresponding to that sample while keeping in memory a copy of past gradients. This approach costs $O(n)$ per iteration due to the computation of the gradient and converges to a solution within $O(1/k)$. In contrast, [Seguy 2018] propose to solve (s-Ddis) by applying (SGD) on mini-batches of both $\mu, \nu$, which comes with a $O(p^2)$ cost per iteration, where $p$ is the mini-batch size, and converges in $O(1/\sqrt{k})$.
In the continuous setting theproblem is infinite dimensional so that it can not be solved using (SGD) anymore. In [Genevay 2019] authors propose to represent the dual variables as kernel expansions while in [Seguy 2018] the dualvariables are parametrized by a neural network. Another line of works rely on the unregularized problemand on the special case of W . In this case the duality reads sup φ E x ∼ µ, y ∼ ν [ φ ( x ) − φ ( y )] where themaximization is done over all 1-Lipschitz function φ . In [Arjovsky 2017] authors tackled this problem inthe context of generative modelling. They parametrized µ, ν using a neural network and used the same(SGD)+mini-batch procedure resulting on a O ( p ) cost per iteration. Their approach however relies on aweight clipping of the (NN) weights in-between gradient updates to enforce the Lipschitz constraint whichlead to optimization difficulties [Gulrajani 2017]. Note all approaches comes at the price of biasing theoptimal coupling due to the mini-batch sampling. This effect was further analyzed in [Fatras 2020]. Apart from entropic-regularized OT there are a lot of other methods for approximating OT. One of themrelies on the closed-form expression of OT for probability distributions over the real line resulting onthe so-called
Sliced Wasserstein distance (SW) [Rabin 2011]. Considering µ, ν ∈ P(R^d), the key idea is to randomly select lines in R^d, to project the measures onto these lines and to compute the resulting 1D Wasserstein distance, which can be done using simple sorts as seen previously. The Sliced Wasserstein distance is the average of all these 1D Wasserstein distances over all drawn lines. More precisely:

Definition 2.1.7 (Sliced Wasserstein distance). Let λ_{d−1} be the uniform measure on the sphere S^{d−1}. For θ ∈ S^{d−1} we note P_θ the projection on θ, i.e. P_θ(x) = ⟨x, θ⟩. Let µ, ν ∈ P(R^d). The Sliced Wasserstein distance between µ and ν is defined as:

SW_p^p(µ, ν) = ∫_{S^{d−1}} W_p^p(P_θ#µ, P_θ#ν) dλ_{d−1}(θ)    (2.32)

where the Wasserstein distance is defined with the standard Euclidean distance on R^d.

SW enjoys several interesting properties. First, SW induces a topology similar to that of W_p: it defines a distance on P_p(R^d) [Bonnotte 2013] that metrizes the weak convergence [Nadjahi 2019] and which is equivalent to the Wasserstein distance for measures with compact supports [Nadjahi 2020, Bonnotte 2013]. Second, it defines a positive definite kernel e^{−γ SW(µ,ν)} for γ > 0, in contrast with W_2, which is not Hilbertian and consequently does not define a positive definite kernel (see Section 8.3 in [Peyré 2019]). In terms of sample complexity SW is known to be dimension-independent [Nadjahi 2020], e.g. O(n^{−1/2}) when p = 2 [Lin 2020, Nadjahi 2020], and better sample complexities can be obtained by projecting on subspaces of dimension k > 1. However, SW is unable to find the correspondences between the samples of the distributions, as it does not provide an optimal coupling π, which is valuable for certain applications such as domain adaptation [Courty 2017]. From a practical side, estimating SW requires the calculation of an integral over the hypersphere, which can be done using a simple Monte-Carlo scheme.
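This Monte-Carlo scheme can be sketched in a few lines of NumPy; the sketch below assumes both measures are uniform with the same number of atoms, so that each projected 1D Wasserstein distance reduces to a sort (function and parameter names are ours):

```python
import numpy as np

def sliced_wasserstein(X, Y, L=200, p=2, seed=0):
    # Monte-Carlo estimate of SW_p between the uniform empirical measures on
    # the rows of X and Y (same number of samples assumed).
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(L, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # directions ~ lambda_{d-1}
    xs = np.sort(X @ theta.T, axis=0)   # sorted projections P_theta#mu, one column per theta
    ys = np.sort(Y @ theta.T, axis=0)
    return np.mean(np.abs(xs - ys) ** p) ** (1 / p)
```

Each of the L directions costs one sort, which is where the O(L n log(n)) complexity discussed next comes from.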
Hence for discrete probability measures with n atoms the overall complexity of computing SW is O(L n log(n)), where L is the number of projection directions on S^{d−1}. The quality of the Monte-Carlo estimate is impacted by the number of projections as well as by the variance of the evaluations of the Wasserstein distance, as pointed out empirically in [Kolouri 2019a, Deshpande 2019] and more formally in [Nadjahi 2020]. The low computational complexity of SW makes it very attractive for a number of scenarios, such as in deep learning for generative modeling [Deshpande 2018, Deshpande 2019], for barycenter computation [Bonneel 2015] or topological data analysis [Carrière 2017], to name a few. This "projection" idea was further developed and improved in several works, which proposed to project on k-dimensional subspaces [Paty 2019], to use non-linear projections [Kolouri 2019a] or to generalize SW to the unbalanced setting [Bonneel 2019]. Many other interesting formulations can be derived from the original OT formulation. Since they are not considered in this manuscript we only give a brief overview here. In [Ferradans 2014] authors propose to regularize the linear OT with a quadratic term, resulting in a quadratically regularized OT. For regular grids, [Solomon 2015] define a Wasserstein distance that can be computed efficiently in O(n log(n)) using convolutions. Another line of work considers an unbalanced setting, where the source probability measure is only partially transferred to the target probability measure, resulting in the unbalanced formulation [Chizat 2017]. A case of particular interest is when the target probability measure is discrete and the source continuous, namely semi-discrete OT. It finds many applications in practice and can be tackled using Laguerre cells [Lévy 2018].
Finally, multi-marginal OT aims at solving a linear OT problem where there are many target/source probability measures and one optimal coupling for transporting them all [Nenna 2016]. The Wasserstein distance is also an interesting tool in order to compute a notion of barycenter of probability distributions. In a Euclidean setting the traditional barycenter of points (x_i)_{i∈[[n]]} can be computed by solving inf_{x∈R^d} Σ_{i=1}^n λ_i ||x − x_i||_2^2, where λ_i ≥ 0 and Σ_i λ_i = 1. The barycenter vector is then given by x = Σ_{i=1}^n λ_i x_i. This can be generalized to arbitrary metric spaces (X, d) using the so-called Fréchet (or Karcher) mean [Karcher 2014]:

inf_{x∈X} Σ_{i=1}^n λ_i d(x, x_i)^p    (2.33)

for p ∈ N^*. The problem (2.33) motivates the use of the Wasserstein barycenter by considering the metric space (P_p(Ω), W_p). Generally (2.33) is non-convex and difficult to solve for an arbitrary metric space; however in the case of the Wasserstein distance the situation is somehow easier, since it can be formulated as a convex problem for which existence can be proved and efficient numerical solvers exist. For a set of input probability measures (ν_i)_{i∈[[n]]} ∈ P(Ω)^n, the Wasserstein barycenter reads as the following variational problem:

inf_{µ∈P(Ω)} Σ_{i=1}^n λ_i T_c(µ, ν_i).    (2.34)

The barycentric formulation finds many applications in machine learning, such as in Bayesian inference [Srivastava 2015], fairness [Gordaliza 2019], in image processing for texture synthesis and mixing [Rabin 2012] or in neuroimaging [Gramfort 2015], to name a few. As proven in [Agueh 2011], in the context of W_2 for Ω = R^d this problem is convex, and when one of the input measures has a density the barycenter is well-defined and unique. Even though there exist special cases (see Section 9.2 in [Peyré 2019]), in practice finding a solution in the general setting is difficult. In the following we detail one solution for the scenario where the input measures are discrete.
More formally, let (ν_i)_{i=1}^n be discrete probability measures with weights b_i ∈ Σ_{n_i}, supported on Y_i = (y_q^i)_{q∈[[n_i]]} ∈ R^{n_i×d} for each i ∈ [[n]]. Instead of looking at all possible discrete probability measures we can search for a probability measure with k atoms, i.e. of the form µ̂ = Σ_{p=1}^k a_p δ_{x_p}, where X = (x_p)_{p∈[[k]]} ∈ R^{k×d} and a ∈ Σ_k. Overall the resulting problem is:

min_{a∈Σ_k, X∈R^{k×d}} Σ_{i=1}^n λ_i min_{π∈Π(a,b_i)} ⟨π, C_{XY_i}⟩_F = min_{a∈Σ_k, X∈R^{k×d}, ∀i∈[[n]] π_i∈Π(a,b_i)} Σ_{i=1}^n λ_i ⟨π_i, C_{XY_i}⟩_F    (2.35)

where C_{XY_i} ∈ R^{k×n_i} is the matrix of all pairwise costs between the points of the barycenter and those of ν_i, i.e. C_{XY_i} = (c(x_p, y_q^i))_{(p,q)∈[[k]]×[[n_i]]}. In [Cuturi 2014] authors propose to solve (2.35) using a Block Coordinate Descent (BCD) that alternates between minimizing w.r.t. a, X and the π_i while keeping the others fixed:
(i) The minimization w.r.t. all π_i with a, X fixed involves solving n OT problems, which can be done using the algorithms described in Section 2.1.4.
(ii) The minimization w.r.t. X with a, π_i fixed can be performed in closed form in the case Ω = R^d and c(x, y) = ||x − y||_2^2 [Cuturi 2014, Equation 8]:

X = Diag(1/a) ( Σ_{i=1}^n λ_i π_i Y_i )    (2.36)

(iii) The minimization w.r.t. the weights a with X, π_i fixed relies on the optimal dual variables of all OT sub-problems of step (i) and applies a projected subgradient step w.r.t. a, as described in Algorithm 1 in [Cuturi 2014].
These three steps are repeated until convergence of X and a. The major bottleneck of this approach is its computational complexity, which is driven by the calculation of many OT problems. When the support X is fixed, and denoting C_{XY_i} = C_i, the problem reduces to:

min_{a∈Σ_k, ∀i∈[[n]] π_i∈Π(a,b_i)} Σ_{i=1}^n λ_i ⟨π_i, C_i⟩    (2.37)

which is an (LP) with on the order of nkN variables when each ν_i has N atoms.
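Step (ii) above admits a one-line implementation. A minimal NumPy sketch of the closed-form update (2.36), with names of our choosing, checked on the degenerate case of a single input measure coupled to itself by the admissible diagonal coupling:

```python
import numpy as np

def update_support(a, lams, pis, Ys):
    # Closed-form step (ii): X = Diag(1/a) * sum_i lam_i * pi_i @ Y_i,
    # valid for the squared Euclidean cost c(x, y) = ||x - y||^2.
    X = sum(lam * pi @ Y for lam, pi, Y in zip(lams, pis, Ys))
    return X / a[:, None]
```

With a single measure (n = 1, λ_1 = 1) and π_1 = Diag(a) ∈ Π(a, a), the update returns the support Y_1 itself, as expected.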
Note that first-order methods such as subgradient descent on the dual have been proposed in [Carlier 2015] to solve (2.37), but in general its scale forbids the use of generic solvers even for medium-scale problems. These remarks advocate for the use of entropic regularized OT to obtain fast and smooth approximations of the original barycenter problem, as given by:

min_{a∈Σ_k, ∀i∈[[n]] π_i∈Π(a,b_i)} Σ_{i=1}^n λ_i (⟨π_i, C_i⟩_F − εH(π_i))    (2.38)

The resulting problem is a smooth convex minimization problem, which can be tackled using gradient descent [Cuturi 2014] or with descent methods on the semi-dual [Cuturi 2018]. Another possibility is to rewrite (2.38) as the following weighted KL minimization problem [Benamou 2015]:

min_{(π_i)_i : ∀i∈[[n]], π_i^T 1 = b_i, π_1 1 = ··· = π_n 1} Σ_{i=1}^n λ_i ε KL(π_i | K_i)    (2.39)

where K_i = e^{−C_i/ε}. In this formulation the barycenter a is encoded in the row marginals of all the couplings π_i, so that a = π_1 1 = ··· = π_n 1. It is shown in [Benamou 2015] that this problem can also be solved using a generalized Sinkhorn algorithm which involves iterative projections. As such, entropic regularization is quite suited to the barycenter problem and was further analyzed in the general case of continuous probability measures in [Bigot 2019a, Bigot 2019b]. Note that other methods have been proposed, relying e.g. on the sliced Wasserstein formulation [Bonneel 2015], the unbalanced formulation [Chizat 2017] or on convolutions for geometric domains [Solomon 2015].

The case n = 2: McCann interpolant. One special case deserves attention, namely when n = 2 and Ω = R^d equipped with ||·||_2. This setting corresponds to the so-called McCann interpolant [McCann 1997], where one wants to find:

inf_{µ∈P(R^d)} (1 − t) W_2^2(µ, ν_0) + t W_2^2(µ, ν_1)    (2.40)

with t ∈ [0,
1] and where ν_0 is regular with respect to the Lebesgue measure. Using Brenier's theorem we know that there exists a unique push-forward map T such that T#ν_0 = ν_1. In this case the barycenter is unique and obtained as µ_t = ((1 − t) id + t T)#ν_0. In practice, when the probability measures ν_0, ν_1 are discrete with respectively n and m atoms, this interpolant can be computed as µ_t = Σ_{i=1,j=1}^{n,m} π*_{ij} δ_{(1−t) x_i + t y_j}, where π* is an optimal coupling between ν_0 and ν_1.

Despite its valuable properties, the linear OT problem faces the challenging situation of probability measures whose supports lie in incomparable spaces, that is to say when X, Y are not part of a common ground metric space. For example when µ ∈ P(R^2), ν ∈ P(R^3) the definition of a meaningful cost c : R^2 × R^3 → R_+ is not straightforward. In particular, in this setting we cannot define a distance between x, y ∈ R^2 × R^3, so that the Wasserstein distance can no longer be defined. Moreover the Wasserstein distance is not invariant to important families of transformations, such as translations, rotations or, more generally, isometries, which is an important flaw of linear OT for certain applications such as shape matching. The Gromov-Wasserstein (GW) framework is an elegant remedy for this situation. It is built upon a quadratic Optimal Transport problem, as opposed to a linear one for the linear OT problem, and, informally, its optimal value quantifies the metric distortion when transporting points from one space to another. This section aims at presenting the GW problem, its fundamental metric properties as well as numerical solvers. We refer the reader to [Sturm 2012, Memoli 2011, Chowdhury 2019a] for further reading. We consider two Polish spaces (X, d_X), (Y, d_Y). Let c_X : X×X → R and c_Y : Y×Y → R be continuous measurable functions and µ ∈ P(X), ν ∈ P(Y) be probability measures on X, Y.
The Gromov-Wasserstein (GW) problem aims at finding:

GW_p(c_X, c_Y, µ, ν) = inf_{π∈Π(µ,ν)} ( ∫_{X×Y} ∫_{X×Y} |c_X(x, x′) − c_Y(y, y′)|^p dπ(x, y) dπ(x′, y′) )^{1/p}    (2.41)

for p ∈ N^* (see Figure 2.8). GW depends on the choice of the similarities c_X, c_Y between points in X and Y. When it is clear from the context we will simply note GW_p(µ, ν) instead of GW_p(c_X, c_Y, µ, ν). Since X, Y are already endowed with a natural metric, one choice would be to consider c_X = d_X and c_Y = d_Y. This setting brings to light the notion of metric measure spaces, as triplets of the form (X, d_X, µ) where (X, d_X) is a complete separable metric space and µ is a Borel probability measure on X. It was studied in depth in [Sturm 2012].

Figure 2.8: The GW problem considers two probability measures µ ∈ P(X), ν ∈ P(Y) over two spaces that do not necessarily share a common metric. It is built upon the similarities c_X, c_Y within each space and on a measure of the distortion between each pair of points |c_X(x, x′) − c_Y(y, y′)|.

Another possibility is to consider triplets (X, c_X, µ), where c_X is an integrable function; this notion refers to measure networks and was studied in [Chowdhury 2019a]. The GW objective is constructed so that if an optimal coupling π maps x to y and x′ to y′, then the couple (x, x′) should be "as similar" in X as (y, y′) is in Y. When c_X, c_Y are distances it implies that x, x′ are as close in X as y, y′ are in Y. In this work we consider a general setting where c_X, c_Y are continuous and X, Y are Polish spaces, and we will detail the two previous settings. As for the linear OT problem, equation (2.41) always admits a solution. To show that, we define L(x, x′, y, y′) = |c_X(x, x′) − c_Y(y, y′)|^p.
If Π(µ, ν) is compact and the functional π ↦ ∫∫ L dπ dπ is l.s.c. for the weak convergence, the Weierstrass theorem (see Memo 2.2.1) proves that the infimum is attained at some optimal coupling. The first condition is a well-known result in OT theory, provided that X, Y are Polish spaces [Santambrogio 2015, Theorem 1.7]. For the lower semi-continuity w.r.t. the weak convergence we can show that it suffices that L be itself l.s.c., using the following lemma:

Lemma 2.2.1.
Let Ω be a Polish space. If f : Ω × Ω → R_+ ∪ {+∞} is lower semi-continuous, then the functional J : P(Ω) → R ∪ {+∞} with J(µ) = ∫∫ f(w, w′) dµ(w) dµ(w′) is l.s.c. for the weak convergence of measures.

Proof. Since f is l.s.c. and bounded from below by 0, we can consider (f_k)_k a sequence of continuous and bounded functions converging increasingly to f (see e.g. [Santambrogio 2015]). By the monotone convergence theorem, J_k(µ) → J(µ) := sup_k J_k(µ) = sup_k ∫∫ f_k dµ dµ. Moreover, every J_k is continuous for the weak convergence: using Theorem 2.8 in [Billingsley 1999] on the Polish space Ω × Ω, we know that if µ_n converges weakly to µ then the product measure µ_n ⊗ µ_n converges weakly to µ ⊗ µ. In this way lim_{n→∞} J_k(µ_n) = J_k(µ), since the f_k are continuous and bounded. In particular, every J_k is l.s.c. We can conclude that J is l.s.c. as the supremum of l.s.c. functionals on the metric space (P(Ω), δ) (see e.g. [Santambrogio 2015]). Here we equipped P(Ω) with a metric δ metrizing the weak convergence, e.g. δ(µ, ν) = Σ_{k=1}^∞ 2^{−k} |∫_Ω f_k dµ − ∫_Ω f_k dν| (see Remark 5.11 in [Ambrosio 2005]).

Memo 2.2.1 (Weierstrass theorem). The Weierstrass theorem states that if f : X → R ∪ {+∞} is l.s.c. and X is compact, then the infimum is attained, i.e. there exists x* ∈ X with f(x*) = inf_{x∈X} f(x) (see box 1.1 in [Santambrogio 2015]).

Overall, since L is l.s.c. due to the continuity of c_X, c_Y, we can apply Lemma 2.2.1 with Ω = X × Y and conclude that π ↦ ∫∫ L dπ dπ is l.s.c. for the weak convergence; by means of the Weierstrass theorem, equation (2.41) always admits a minimizer.

Remark 2.2.1.
As a consequence of our formulation, the resulting GW cost may be infinite. A simple condition ensuring finiteness is ∫∫ |c_X(x, x′) − c_Y(y, y′)|^p d(µ ⊗ µ)(x, x′) d(ν ⊗ ν)(y, y′) < +∞. Since:

|c_X(x, x′) − c_Y(y, y′)|^p ≤ (|c_X(x, x′)| + |c_Y(y, y′)|)^p ≤(*) 2^{p−1} (|c_X(x, x′)|^p + |c_Y(y, y′)|^p)    (2.42)

if c_X, c_Y are p-integrable functions, i.e. c_X ∈ L^p(µ ⊗ µ) and c_Y ∈ L^p(ν ⊗ ν), then the cost is finite (we used a consequence of Hölder's inequality in (*), see Memo 2.2.2).

GW for measures on Euclidean spaces
One special case will be considered in Chapter 4, namely when X = R^p, Y = R^q and µ ∈ P(R^p), ν ∈ P(R^q), with p not necessarily equal to q. We can consider the GW problem with the standard Euclidean distances for c_X, c_Y on respectively R^p, R^q. This setting illustrates the invariance property of the GW problem w.r.t. rotations and translations. More precisely, let O ∈ O(p), x_0 ∈ R^p and T(x) = Ox + x_0. Then the GW problem is invariant by T, that is GW_p^p(T#µ, ν) = GW_p^p(µ, ν) (the same applies to ν). To see that, we simply use, for all x, x′: c_X(T(x), T(x′)) = ||Ox + x_0 − Ox′ − x_0||_2 = ||O(x − x′)||_2 = ||x − x′||_2, since O ∈ O(p). This property will be generalized to any metric space by considering the notion of isometry, and contrasts with the Wasserstein distance, which is invariant neither to translations nor to rotations of the support of one probability measure.

Example 2.2.1.
As a first illustration of the GW problem we consider two discrete probability measures µ ∈ P(R^2), ν ∈ P(R^3), following respectively a spiral in R^2 and a spiral in R^3, and composed of a mixture of Gaussian distributions. c_X, c_Y are defined by the Euclidean distances between the points. We compute an optimal coupling of the GW problem using the FW solver presented in Chapter 3 (see Section 2.2.3 for more details). The result is depicted in Figure 2.9.

Memo 2.2.2 (Hölder's inequality). Let (X, µ) be a measurable space and (f, g) ∈ L^p(µ) × L^q(µ) with p, q > 1 verifying 1/p + 1/q = 1. Hölder's inequality states:

∫ |f g| dµ ≤ (∫ |f|^p dµ)^{1/p} (∫ |g|^q dµ)^{1/q}    (2.43)

As a corollary, for q ≥ 1 we have:

∀ x, y ∈ R_+, (x + y)^q ≤ 2^{q−1} (x^q + y^q).    (2.44)

Indeed, if q > 1, writing x + y = 1·x + 1·y and applying Hölder's inequality to the two-point counting measure, with exponents q/(q−1) and q, gives x + y ≤ 2^{(q−1)/q} (x^q + y^q)^{1/q}; raising both sides to the power q yields (2.44). The result remains valid for q = 1.

Figure 2.9: GW problem between two discrete probability measures µ ∈ P(R^2), ν ∈ P(R^3). The optimal coupling π is depicted in dashed lines. It associates points so as to minimize the distortion between all pairwise distances within the support of each measure.

One of the main properties of the GW problem is that it allows for comparing probability measures whose supports dwell in different, potentially non-related, spaces, by defining a notion of equivalence of two probability distributions in this case. This is made possible thanks to the concepts of isometry and isomorphism.

Definition 2.2.1 (Isometry). Let (X, d_X) and (Y, d_Y) be two metric spaces. An isometry is a surjective map φ : X → Y that preserves the distances:

∀ x, x′ ∈ X, d_Y(φ(x), φ(x′)) = d_X(x, x′).
(2.45)

An isometry is necessarily bijective: if φ(x) = φ(x′) we have d_Y(φ(x), φ(x′)) = 0 = d_X(x, x′) and hence x = x′ (in the same way, φ^{−1} is also an isometry). When an isometry exists, X and Y share the same "size" and any statement about X which can be expressed through its distance is transported to Y by the isometry φ.

Example 2.2.2.
Let us consider the two following graphs, whose discrete metric spaces (d_X(x_i, x_j))_{i,j} and (d_Y(y_i, y_j))_{i,j} are obtained as shortest paths between the vertices (see the corresponding graphs in Figure 2.10).

Figure 2.10: Two isometric metric spaces. Distances between the nodes are given by the shortest path, and the weight of each edge is equal to 1.

Figure 2.11: Two isometric but not isomorphic spaces.
These spaces are isometric since the surjective map φ sending each vertex x_i to the corresponding vertex y_j of Figure 2.10 verifies equation (2.45). Another natural and straightforward example is two point clouds rotated from each other. More precisely, consider (x_i)_{i∈[[n]]}, (y_i)_{i∈[[n]]} where x_i, y_i ∈ R^p, equipped with the Euclidean norm ||·||_2. Suppose that there exists an orthogonal matrix O ∈ O(p) such that y_i = O x_i for all i ∈ [[n]] (with a slight abuse of notation we identify the matrix with its linear map). Then for all (i, j) ∈ [[n]]^2 we have:

||y_i − y_j||_2 = ||O x_i − O x_j||_2 = ||O(x_i − x_j)||_2 = ||x_i − x_j||_2    (2.46)

since O ∈ O(p), so that X = (x_i)_{i∈[[n]]} and Y = (y_i)_{i∈[[n]]} are isometric. This notion can be enriched in order to take the measures into account, which results in the notion of strong isomorphism:

Definition 2.2.2 (Strong isomorphism). Let (X, d_X), (Y, d_Y) be Polish spaces and µ ∈ P(X), ν ∈ P(Y). We say that (X, d_X, µ) is strongly isomorphic to (Y, d_Y, ν) if there exists a bijection φ : supp(µ) → supp(ν) such that:
i) φ is an isometry, i.e. d_Y(φ(x), φ(x′)) = d_X(x, x′) for x, x′ ∈ supp(µ);
ii) φ pushes µ forward to ν, i.e. φ#µ = ν.
When it is clear from the context we will simply say that µ is strongly isomorphic to ν when the previous conditions are satisfied.

Example 2.2.3.
Let us consider two mm-spaces (X = {x_1, x_2}, d_X, µ) and (Y = {y_1, y_2}, d_Y, ν) with the same pairwise distance between the two points but different weights, as depicted in Figure 2.11. These spaces are isometric but not isomorphic, as there exists no measure-preserving map which pushes µ forward to ν. Another notion of isomorphism deserves attention, especially when c_X, c_Y are not distances. In this case we will consider the following weak isomorphism property:

Definition 2.2.3 (Weak isomorphism). Let X, Y be Polish spaces and µ ∈ P(X), ν ∈ P(Y). We say that (X, c_X, µ) is weakly isomorphic to (Y, c_Y, ν) if there exists (Z, c_Z, m), with supp(m) = Z, and maps φ_1 : Z → X, φ_2 : Z → Y such that:
i) c_Z(z, z′) = c_X(φ_1(z), φ_1(z′)) = c_Y(φ_2(z), φ_2(z′)) for z, z′ ∈ Z;
ii) φ_1#m = µ and φ_2#m = ν.
When it is clear from the context we will simply say that µ is weakly isomorphic to ν when the previous conditions are satisfied. The weak isomorphism brings to light a kind of "tripod structure" in which the isomorphism is defined through a third space Z. In fact both notions are equivalent when c_X, c_Y are distances, as stated in the next proposition [Sturm 2012, Lemma 1.10]:

Proposition 2.2.1.
The spaces (X, d_X, µ) and (Y, d_Y, ν) are strongly isomorphic if and only if (X, d_X, µ) and (Y, d_Y, ν) are weakly isomorphic.

However, the weak-isomorphism property has its own interest when working with arbitrary similarity measures c_X, c_Y. The following theorem is fundamental for GW and aims to unify the metric properties of GW given in [Sturm 2012, Chowdhury 2019a]. It proves that GW defines a metric w.r.t. the isomorphism notions:

Theorem 2.2.1 (Metric properties of GW). In the following, (X, d_X), (Y, d_Y) are Polish spaces and µ ∈ P(X), ν ∈ P(Y).
i) GW_p is symmetric, positive and satisfies the triangle inequality. More precisely, for (X, c_X, µ), (Y, c_Y, ν), (Z, c_Z, m) we have:

GW_p(c_X, c_Y, µ, ν) ≤ GW_p(c_X, c_Z, µ, m) + GW_p(c_Z, c_Y, m, ν)    (2.47)

ii) GW_p(d_X, d_Y, µ, ν) = 0 if and only if (X, d_X, µ) and (Y, d_Y, ν) are strongly isomorphic.
iii) GW_p(c_X, c_Y, µ, ν) = 0 if and only if (X, c_X, µ) and (Y, c_Y, ν) are weakly isomorphic.
iv) More generally, for any q ≥ 1, GW_p(d_X^q, d_Y^q, µ, ν) = 0 if and only if (X, d_X, µ) and (Y, d_Y, ν) are strongly isomorphic.

Proof. For the first point (i) positivity is straightforward; for the triangle inequality and symmetry see [Chowdhury 2019a, Theorem 16]. For (ii) see [Sturm 2012, Lemma 1.10] and for (iii) see [Chowdhury 2019a, Theorem 18]. Note in the proof that for (iii) the result is still valid even if c_X, c_Y are not p-integrable. For (iv) see [Sturm 2012, Lemma 9.2].

This theorem can endow the space of all spaces of the form (X, c_X, µ) with a distance defined by GW which, however, requires the finiteness of GW. More precisely:

Definition 2.2.4.
Let X be a Polish space, µ ∈ P(X) and c_X : X × X → R be measurable. We define the size size_p of X, given c_X and µ, by size_p(X, c_X, µ) = (∫∫ |c_X(x, x′)|^p dµ(x) dµ(x′))^{1/p}. We define X_p to be the space of all metric measure spaces with finite L^p-size, i.e. X_p = {(X, d_X, µ) | size_p(X, d_X, µ) < +∞}, where (X, d_X) is a Polish space and µ ∈ P(X). We also define N_p to be the space of all network measure spaces [Chowdhury 2019a] with finite L^p-size, i.e. N_p = {(X, c_X, µ) | size_p(X, c_X, µ) < +∞}, where X is a Polish space, µ ∈ P(X) and c_X a continuous measurable function.

The function size_p quantifies somehow an average diameter of X given a probability measure and a function c_X. Using this notion and Theorem 2.2.1, we now state the main theorem about the metric properties of GW:

Theorem 2.2.2 (GW is a distance). GW_p is a distance on X_p quotiented by the strong isomorphisms. GW_p is a distance on N_p quotiented by the weak isomorphisms.

This theorem has a lot of implications. It endows the space of all metric (network) measure spaces with a topology, a geometric structure, induced by Gromov-Wasserstein and, as such, allows the use of a wide family of geometric tools and a notion of convergence of metric measure spaces. Moreover, it indicates that GW is well suited for comparing objects with respect to a large class of invariants, such as rotations, translations or permutations. This property is important e.g. for shape comparison, where the orientation of a shape does not define its nature, or for graphs, where any permutation of the nodes results in the same graph. It is sometimes valuable to have a notion of distance which is insensitive to these transformations, so as to focus properly on what matters rather than to encode the invariance (see Remark 2.2.2).
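The invariance to permutations can be checked numerically on a toy example: relabelling the points of a discrete mm-space with uniform weights yields a strongly isomorphic space, and the coupling induced by the permutation achieves zero distortion in the discrete objective Σ_{i,j,k,l} |C_1(i,k) − C_2(j,l)|^2 π_{ij} π_{kl} (a sketch; the matrix construction below is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
C1 = np.abs(rng.normal(size=(n, n)))
C1 = (C1 + C1.T) / 2
np.fill_diagonal(C1, 0)                 # a symmetric "distance-like" matrix
sigma = rng.permutation(n)
P = np.eye(n)[sigma]                    # permutation matrix of sigma
C2 = P @ C1 @ P.T                       # the same space with relabelled points
pi = P.T / n                            # coupling induced by the relabelling
L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2
obj = np.einsum('ijkl,kl,ij->', L, pi, pi)   # discrete GW_2 objective at pi
```

Every pair charged by π has exactly matched intra-space similarities, so the objective vanishes, consistently with point (ii) of Theorem 2.2.1.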
Finally, if GW vanishes it necessarily implies that the objects are isomorphic, which is interesting for detecting such cases. Note also that GW is deeply connected to the Gromov-Hausdorff distance [Gromov 1999], which aims at measuring how far (X, d_X) and (Y, d_Y) are from being isometric and can be used for studying the convergence of metric spaces [Burago 2001]. However, computing this distance results in a highly non-convex optimization problem whose global solution is intractable. As shown in [Memoli 2011], the introduction of measures turns out to "smooth" the definition of the Gromov-Hausdorff distance and results in the Gromov-Wasserstein distance.

Remark 2.2.2 (Implicit or explicit encoding of the invariances). Let (µ, ν) ∈ P(R^d) × P(R^d) and c : R^d × R^d → R a cost. In [Alvarez-Melis 2019] authors propose the following OT problem:

InvOT(µ, ν) = min_{π∈Π(µ,ν)} min_{f∈F} ∫ c(x, f(y)) dπ(x, y)    (2.48)

where F is a class of functions from R^d to R^d aiming at encoding a global transformation of the features. For example, F can be defined as O(d), the set of orthogonal transformations, or any linear transformations with bounded Schatten norm (see [Alvarez-Melis 2019] or Chapter 4 for more details). When considering F = O(d), the resulting InvOT becomes invariant by rotation of the support of the target measure. It can be interesting in a setting where one wants to match two distributions modulo a rotation, as e.g. in unsupervised word translation, where word embedding algorithms are known to produce vectors intrinsically invariant to angle [Alvarez-Melis 2019, Grave 2019]. This approach can be put into perspective with the GW distance: in the GW case one makes the implicit assumption, or prior, that the invariances are the isometric transformations of the data, whereas in the InvOT approach one makes the prior that we somehow know which class of invariances is of interest for the problem, and we encode it into the loss.
Both approaches are relevant and related (see Chapter 4): if we have a prior other than isometric transformations it may be better suited to encode it directly via F, and if we know that isometries are relevant we can directly build upon GW.

Geodesics and GW interpolation
The space of all mm-spaces X_p endowed with GW also has a nice geodesic structure, which is important in order to derive dynamic formulations and gradient flows [Ambrosio 2005]. Informally, in X_p we can connect any two points (that are mm-spaces) with a curve that somehow represents the shortest path connecting these points. More precisely, given two mm-spaces (X, d_X, µ), (Y, d_Y, ν), the curve t ∈ [0, 1] ↦ (X × Y, d_t, π*), where π* is an optimal coupling of the GW problem between µ and ν and:

∀ (x, y), (x′, y′) ∈ (X × Y)^2, d_t((x, y), (x′, y′)) = (1 − t) d_X(x, x′) + t d_Y(y, y′)    (2.49)

is a geodesic [Sturm 2012]. However, computing this geodesic is often intractable in practice, since it implies the calculation of the Cartesian product X × Y. One can rely instead on the barycenter formulation defined in Section 2.2.4.
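For discrete measures, where an optimal coupling charges finitely many pairs, the interpolated metric (2.49) can nevertheless be materialized on those pairs only. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def interpolated_metric(dX, dY, pairs, t):
    # Distance matrix d_t of equation (2.49), restricted to the pairs
    # (x_i, y_j) charged by a coupling -- the only support that matters
    # in the discrete case.
    idx = np.array(pairs)
    dXs = dX[np.ix_(idx[:, 0], idx[:, 0])]   # d_X restricted to selected x's
    dYs = dY[np.ix_(idx[:, 1], idx[:, 1])]   # d_Y restricted to selected y's
    return (1 - t) * dXs + t * dYs
```

At t = 0 one recovers the metric of X restricted to the charged pairs, and at t = 1 that of Y.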
In this section we describe some numerical solutions to the GW problem. In the following, µ = Σ_{i=1}^n a_i δ_{x_i} ∈ P(X), ν = Σ_{j=1}^m b_j δ_{y_j} ∈ P(Y) are discrete probability measures over respectively (X, d_X), (Y, d_Y). We also note C_1, C_2 the matrices of pairwise distances inside each space, i.e. ∀(i, k) ∈ [[n]]^2, C_1(i, k) = d_X(x_i, x_k) and ∀(j, l) ∈ [[m]]^2, C_2(j, l) = d_Y(y_j, y_l). The GW problem aims at solving:

GW_p^p(C_1, C_2, a, b) = min_{π∈Π(a,b)} Σ_{i,j,k,l} |C_1(i, k) − C_2(j, l)|^p π_{i,j} π_{k,l} = min_{π∈Π(a,b)} ⟨L(C_1, C_2)^p ⊗ π, π⟩_F    (2.50)

where we define L(C_1, C_2) as the tensor L(C_1, C_2) = (|C_1(i, k) − C_2(j, l)|)_{i,j,k,l}, and ⊗ is the tensor-matrix multiplication, i.e. for a tensor L = (L_{i,j,k,l}), L ⊗ π is the matrix (Σ_{k,l} L_{i,j,k,l} π_{k,l})_{i,j}. The optimization problem (2.50) is a non-convex Quadratic Program (QP), which is NP-hard in general [Loiola 2007] and notoriously hard to approximate. When p = 2, i.e. when L = |·|^2, equation (2.50) can be recast as:

min_{π∈Π(a,b)} tr(c_{C_1,C_2} π^T) − 2 tr(C_1 π C_2^T π^T)    (2.51)

where c_{C_1,C_2} = (C_1)^{⊙2} a 1_m^T + 1_n b^T ((C_2)^{⊙2})^T, with (·)^{⊙2} the element-wise square (see Proposition 1 in [Peyré 2016]). In standard QP form this problem also reads:

min_{π∈Π(a,b)} c^T x(π) + (1/2) x(π)^T Q x(π)    (2.52)

where x(π) = vec(π), c = vec(c_{C_1,C_2}) and Q = −4 C_2 ⊗_K C_1, with ⊗_K the Kronecker product of matrices, defined for two arbitrary matrices A ∈ R^{n×m}, B ∈ R^{p×q} as A ⊗_K B ∈ R^{np×mq} with A ⊗_K B = (A_{i,j} B)_{i,j} (see Memo 2.2.3). Equation (2.52) is a non-convex QP, as the Hessian Q is not positive semi-definite in general (its eigenvalues are, up to a constant, the products of the eigenvalues of C_1, C_2).

Relations with Quadratic Assignment Problem and Graph Matching
The GW problem is very related to the so-called Quadratic Assignment Problem (QAP). This problem was first introduced by Koopmans and Beckmann [Koopmans 1957] to model a plant location problem and plays today many roles in optimization. Given two matrices A = (a_{i,j})_{(i,j)∈[[n]]^2} and B = (b_{i,j})_{(i,j)∈[[n]]^2}, the standard form of the QAP reads:

min_{σ∈S_n} Σ_{i=1,j=1}^n a_{σ(i)σ(j)} b_{ij}    (2.53)

The QAP can be understood as a facility location problem: given n facilities and n locations, one wants to assign each facility to a location, with a_{ij} the flow of material moving from facility i to facility j and b_{ij} the distance from location i to location j. In this context, the cost of simultaneously assigning facility σ(i) to location i and facility σ(j) to location j is a_{σ(i)σ(j)} b_{ij}. In this model, one wants to find the assignment that minimizes the overall cost of locating each facility. This problem was considered for example in [Elshafei 1977] for locating hospital departments so as to minimize the total distance traveled by patients, but it also covers a large variety of applications such as scheduling [Geoffrion 1976], parallel and distributed computing [Bokhari 1981] or balancing of turbine runners [Laporte 1988]. For a comprehensive survey on this topic we refer to [Çela 2013, Loiola 2007]. Unfortunately, the QAP is NP-hard in general and only a few special cases are known to be solvable in polynomial time. That is the case, for example, when the matrices A and B have simple known structures, such as a diagonal or Toeplitz structure, or separability properties such as a_{i,j} = α_i α_j [Çela 2018, Çela 2011, Çela 2015]. The question of finding a polynomial-time algorithm that solves the QAP when A and B satisfy the Monge and the anti-Monge properties is, to the best of our knowledge, still open [Çela 2013]. The QAP is intrinsically linked with the graph matching problem, whose literature is also extensive (see e.g.
[Berg 2005, Lyzinski 2016, Zaslavskiy 2009, Maron 2018, Caetano 2009]). The graph matching problem refers to optimization problems where the goal is to match the edge affinities of two graphs represented by symmetric matrices A = (a_{i,j})_{(i,j) ∈ [[n]]²} and B = (b_{i,j})_{(i,j) ∈ [[n]]²}. A common approach for this type of problem is to attempt to solve:

min_{X ∈ Π_n} ‖AX − XB‖_F   (2.54)

where Π_n is the set of permutation matrices, i.e. X = (x_{ij})_{(i,j) ∈ [[n]]²} ∈ Π_n if x_{ij} ∈ {0,1} and Σ_{i=1}^n x_{ij} = Σ_{j=1}^n x_{ij} = 1. By noticing that ‖AX − XB‖²_F = ‖AX‖²_F + ‖XB‖²_F − 2 tr(AXBX^T), and that ‖AX‖²_F = tr(X^T A^T A X) = tr(XX^T A^T A) = tr(A^T A) = ‖A‖²_F (and similarly ‖XB‖²_F = ‖B‖²_F), problem (2.54) is equivalent to:

min_{X ∈ Π_n} −tr(AXBX^T)   (2.55)

As such the graph matching problem is a QAP and consequently NP-hard in general. A way of finding an approximate solution is to consider a relaxation of the constraints by replacing Π_n with its convex hull, namely the set of doubly stochastic matrices DS = {X ∈ R^{n×n} | X 1_n = 1_n, X^T 1_n = 1_n, X ≥ 0} (see [Dym 2017, Bernard 2018, Schellewald 2001]).

The previous discussion relates the GW problem to graph matching. Indeed, when we consider two discrete probability measures with the same number of atoms and with uniform weights (i.e. a = b = 1_n/n), GW is equivalent to the relaxation of the graph matching problem with affinity matrices C_1, C_2. To see this it suffices to notice that, in this case, Π(a, b) is the set of doubly stochastic matrices (modulo a factor n which has no impact). We will see in Chapter 4 that the QAP and graph matching points of view can be quite enlightening for deriving properties of the GW distance.

Entropic regularization
In [Peyré 2016, Solomon 2016] the authors propose to solve (2.50) using entropic regularization, which results in the following optimization problem:

min_{π ∈ Π(a,b)} ⟨L(C_1, C_2)^p ⊗ π, π⟩_F − εH(π).   (2.56)

This is a non-convex optimization problem which was tackled using projected gradient descent, relying on the geometry of the KL divergence for both the gradient step and the projection step [Peyré 2016]. More precisely, denoting E_ε(π) the loss in (2.56), the iterations of this algorithm read:

π ← Proj^{KL}_{Π(a,b)}(π ∗ e^{−τ∇E_ε(π)})   (2.57)

Algorithm 3
Solving entropic-regularized GW
Input: a, b, C_1, C_2, ε > 0
Initialize π = ab^T, u^{(0)} = 1_n, v^{(0)} = 1_m
for i = 1, ..., n_it do
    Compute the gradient of the GW loss: C = 2 L(C_1, C_2)^p ⊗ π
    Set K = exp(−C/ε)
    // Sinkhorn-Knopp iterations (Algorithm 2)
    for j = 1, ..., n_it do
        u^{(j)} = a ⊘ K v^{(j−1)}    // update left scaling
        v^{(j)} = b ⊘ K^T u^{(j)}    // update right scaling
    end for
    π = Diag(u) K Diag(v)
end for
return π* = Diag(u) K Diag(v)

In (2.57), τ > 0 is a step size and ∗ denotes elementwise (Hadamard) matrix multiplication. The projection operator is defined as the result of the minimization problem:

Proj^{KL}_{Π(a,b)}(K) = argmin_{π ∈ Π(a,b)} KL(π | K)   (2.58)

As shown in Section 2.1.4, the projection can be computed using the efficient Sinkhorn-Knopp algorithm (see Algorithm 2). The gradient ∇E_ε(π) can be calculated as 2 L(C_1, C_2)^p ⊗ π + ε log(π) and, as noted in [Peyré 2016], in the special case where the step size is chosen as τ = 1/ε, the iterations (2.57) boil down to solving an entropic-regularized linear OT problem with ground cost 2 L(C_1, C_2)^p ⊗ π. Overall the procedure to solve GW with entropic regularization is a projected gradient procedure where the projection step can be solved using an entropic linear OT, as described in Algorithm 3. Note that, as in the linear OT case, the resulting optimal coupling is not sparse since the entropy term tends to blur the solution. Moreover one usually wants a gradient step τ which is not “too big” in order to ensure the convergence of the algorithm. Since the gradient step τ is inversely proportional to the regularization parameter ε, this comes at the price of blurring the resulting optimal solution. In practice, there is a trade-off between regularization and convergence of the algorithm (see [Peyré 2016] and the experiments of Chapter 4).

Memo 2.2.3 (Kronecker product and vec operator). Let A ∈ R^{n×m}, B ∈ R^{p×q} be two arbitrary matrices. The vec operator converts a matrix into a column vector by stacking its columns: vec(A) ∈ R^{nm×1} = (A_{1,1}, ..., A_{n,1}, A_{1,2}, ..., A_{n,2}, ..., A_{n,m})^T. The Kronecker product of two matrices results in the block matrix A ⊗_K B ∈ R^{np×mq} defined by A ⊗_K B = (A_{i,j} B)_{i,j}. These two operators satisfy the following properties (see [Petersen 2012]):
• vec(AB) = (I ⊗_K A) vec(B) = (B^T ⊗_K I) vec(A)
• (A ⊗_K B)^T = A^T ⊗_K B^T
• (A ⊗_K B)(C ⊗_K D) = AC ⊗_K BD (for compatible dimensions)
• tr(A^T B) = vec(A)^T vec(B).

When p = 2, the previous projected gradient scheme reduces to the softassign quadratic assignment algorithm. In the special case where the problem is convex, the convergence of this scheme was analyzed in [Rangarajan 1997b, Rangarajan 1999] but, for arbitrary matrices C_1, C_2, there are no known results to the best of our knowledge.

Computing a lower-bound
Originally, (2.50) was tackled by computing a lower bound (called the TLB) in [Memoli 2011]. More precisely, the author proposes to solve the following problem:

min_{π ∈ Π(a,b)} Σ_{k,l} ( min_{π' ∈ Π(a,b)} Σ_{i,j} |C_1(i,k) − C_2(j,l)|^p π'_{i,j} ) π_{k,l}   (2.59)

This problem is actually a “Wasserstein of Wasserstein distances”: it is equivalent to solving an OT problem whose ground cost is itself the result of OT problems between the 1D empirical distributions of the rows of C_1, C_2. More precisely, considering µ_k = Σ_i a_i δ_{C_1(i,k)} ∈ P(R) and ν_l = Σ_j b_j δ_{C_2(j,l)} ∈ P(R), (2.59) is equivalent to:

min_{π ∈ Π(a,b)} Σ_{k,l} W_p^p(µ_k, ν_l) π_{k,l}   (2.60)

where the Wasserstein distance is computed using |·| as ground cost. The advantage of this formulation is that (2.60) only involves linear OT problems, which can be solved using the tools presented in Section 2.1.4. This idea is based on the local distribution of distances and was also successfully applied in computer graphics for 3D shape comparison [Gelfand 2005, Memoli 2011].

Computational complexities
One major bottleneck for computing GW is first the computation of the large tensor (|C_1(i,k) − C_2(j,l)|^p)_{i,j,k,l}, which is O(n²m²) in general. By noticing that it suffices to compute L(C_1, C_2)^p ⊗ π instead, the authors of [Peyré 2016] show that, in the case p = 2, one can rely on the separability of L, which results in a O(n²m + m²n) complexity. A second bottleneck is the complexity of finding an optimal solution, which is driven by the algorithmic method. The convergence of Algorithm 3 for the entropic-regularized GW problem is still not well understood and is slow in practice, as shown for example in Chapter 4. We will present in Chapter 3 an algorithm based on Frank-Wolfe (FW) to find a sparse local optimal solution with a cubic complexity. Using the FW properties we will show that this algorithm converges to a local stationary point at a O(1/√t) rate. Regarding the lower bound computation, one can rely on sorting strategies for 1D distributions to compute an inner Wasserstein distance W_p^p(µ_k, ν_l) in O(n log(n) + m log(m)), hence a O(n²m log(n) + m²n log(m)) complexity for all pairs. Then finding a solution π has the same complexity as computing a linear OT problem, and one can rely on entropic regularization to reach a quadratic time complexity. More recently, in [Sato 2020] the authors propose to fix the outer optimal transport plan of the lower bound to ab^T, which results in a divergence that can be computed in O((n + m) log(nm)) using a sweep-line strategy.

Illustration of previous solvers
In order to give a simple illustration of the different solvers presented above, we consider two unlabeled graphs with the same number of nodes (n = m = 20) and with 4 communities. Each node of the graph lies in the implicit metric space defined by the shortest-path distance inside the graph, so that C_1 is the shortest-path distance matrix between the nodes (same for C_2). We consider uniform weights, i.e. a = b = 1_n/n, and p = 2 for GW. We solve the GW problem by relying on (1) the FW algorithm defined in Chapter 3, (2) the entropic-regularized GW problem with ε = 8e−, and (3) the lower bound TLB. The behavior of the three approaches is depicted in Figure 2.12.
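As a rough companion to this illustration, Algorithm 3 can be sketched in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: a single toy 4-cycle (with its shortest-path matrix hard-coded) compared with itself rather than the 20-node community graphs, a fixed number of iterations, and the gradient computed with the separable p = 2 decomposition of [Peyré 2016].

```python
import numpy as np

def entropic_gw(C1, C2, a, b, eps=0.1, outer=30, inner=100):
    """Sketch of Algorithm 3: projected gradient in KL geometry, p = 2, tau = 1/eps."""
    n, m = len(a), len(b)
    pi = np.outer(a, b)
    for _ in range(outer):
        # Gradient 2 L(C1,C2)^2 (x) pi via the separable decomposition
        # c_{C1,C2} - 2 C1 pi C2 (valid for p = 2 and pi in Pi(a, b)).
        cC = np.outer((C1 ** 2) @ a, np.ones(m)) + np.outer(np.ones(n), (C2 ** 2) @ b)
        grad = 2 * (cC - 2 * C1 @ pi @ C2)
        # With tau = 1/eps the KL step pi * exp(-grad/eps - log pi) simplifies:
        K = np.exp(-grad / eps)
        u, v = np.ones(n), np.ones(m)
        for _ in range(inner):  # Sinkhorn-Knopp projection on Pi(a, b)
            u = a / (K @ v)
            v = b / (K.T @ u)
        pi = u[:, None] * K * v[None, :]
    return pi

# Shortest-path matrix of a 4-cycle, compared with itself
C1 = np.array([[0., 1., 2., 1.],
               [1., 0., 1., 2.],
               [2., 1., 0., 1.],
               [1., 2., 1., 0.]])
a = np.full(4, 0.25)
pi = entropic_gw(C1, C1, a, a, eps=0.05)  # pi is a valid coupling in Pi(a, a)
```

As discussed above, the entropy term blurs the solution: for small ε the scheme is closer to the unregularized problem but the Sinkhorn projection becomes numerically fragile.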
Figure 2.12: Illustration of the optimal coupling for the GW problem between two graphs described by their shortest-path matrices. (left) FW solver of Chapter 3. (middle) Entropic-regularized GW (Algorithm 3). (right) Lower bound approach TLB. The optimal coupling π* is depicted in dashed lines; the darker, the stronger.

In the same vein as the Wasserstein barycenter, we can build upon the Gromov-Wasserstein distance a notion of barycenter. This setting was first tackled in [Peyré 2016], where the authors consider the problem of computing the barycenter of a family of discrete probability measures over different metric spaces w.r.t. the Gromov-Wasserstein geometry. More formally, let (C_i, b_i)_{i ∈ [[n]]} be this family, where C_i is an arbitrary matrix and b_i is a probability vector. C_i can be chosen to be a distance matrix or, more generally, any similarity matrix, such as a kernel matrix encoding a notion of similarity between the points inside each distribution as in [Chowdhury 2019a]. When the weights of the barycenter are given and fixed to a ∈ Σ_k with k ∈ N*, the GW barycenter problem aims at finding:

min_{C ∈ R^{k×k}} Σ_{i=1}^n λ_i GW_p^p(C, C_i, a, b_i) = min_{C ∈ R^{k×k}, ∀i ∈ [[n]], π_i ∈ Π(a, b_i)} Σ_{i=1}^n λ_i ⟨L(C, C_i)^p ⊗ π_i, π_i⟩_F   (2.61)

with λ_i ≥ 0 and Σ_{i=1}^n λ_i = 1. In [Peyré 2016] the authors consider this problem based on the entropic-regularized version of GW. They propose to solve (2.61) by relying on a BCD procedure which alternates between solving n GW problems with C fixed and finding C with all π_i fixed. The latter is given in closed form when p = 2 by [Peyré 2016, Equation 14]:

C = (1 / aa^T) Σ_{i=1}^n λ_i π_i^T C_i π_i   (2.62)

where the division is made element-wise. Solving the n GW problems can be done using Algorithm 3 as in [Peyré 2016], or with the FW or the lower bound approach presented in the previous section. This approach was further generalized in [Chowdhury 2019b], where the authors leverage the Riemannian geometry of the Gromov-Wasserstein space to treat e.g. the case where the C_i are not necessarily symmetric.

Illustration
To illustrate the GW barycenter, we consider a simple dataset consisting of 4 2D shapes from the apple class of the MPEG-7 computer vision database. Each shape is associated with a discrete probability measure with uniform weights. The matrices C_i are simply the Euclidean distances between the points in each shape. We compute the barycenter as in the previous discussion, with both the FW solver for the GW problems and the entropic-regularized GW (ε = 1e−3) solved with Algorithm 3. We compute an MDS on the C matrix obtained by the barycenter procedure to recover 2D points. Results are depicted in Figure 2.13.

Figure 2.13: GW barycenter of 4 apple shapes in blue. The barycenters are depicted on the right side of the figure in red. The first barycenter uses the FW algorithm for solving the GW problems and the second is computed using entropic regularization solved with Algorithm 3. Note that the barycenters are arbitrarily rotated due to the MDS procedure.

Applications of GW
The GW problem is well suited to comparing heterogeneous data while being invariant to isometries of the data. As such it first received attention for shape comparison [Memoli 2011, Solomon 2016] and in computer vision [Schmitzer 2013], where it is often valuable to compare objects without any assumption on their orientation. GW was further exploited to handle unstructured geometric data such as point clouds or meshes in [Ezuz 2017], where the authors use GW to learn regular 2D grids that faithfully represent 3D meshes while being applicable to standard CNN architectures. More recently GW has been the subject of much attention in the graph community, as a graph matching tool [Xu 2019b, Xu 2019a, Fey 2020] or for graph representation [Kwon 2020] (see Chapter 4 for more details). It has also proven useful in cellular biology thanks to its ability to align heterogeneous types of single-cell measurements [Demetci 2020]. Closer to the machine learning community, GW has been applied in Domain Adaptation (DA) in the complex settings of Unsupervised DA [Xia 2020] and Heterogeneous DA [Yan 2018] (see Chapter 5 for more details), in generative modeling on incomparable spaces, i.e. when the generated data do not share the same Euclidean space as the source data [Bunne 2019], or for cross-lingual correspondences of word embeddings [Alvarez-Melis 2018a].
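Looking back at the lower bound (2.60): for uniform weights and n = m, the inner 1D Wasserstein distances reduce to sorting (quantile formula), and the outer linear OT admits a permutation-supported optimal plan (a vertex of the Birkhoff polytope), so it can be solved by assignment. A minimal sketch with hypothetical toy matrices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def tlb(C1, C2, p=2):
    """TLB sketch for uniform weights and n = m."""
    # Sorted columns: empirical 1D laws mu_k and nu_l of the distances
    S1 = np.sort(C1, axis=0)
    S2 = np.sort(C2, axis=0)
    # W_p^p(mu_k, nu_l) for every pair (k, l) via the quantile formula
    cost = (np.abs(S1[:, :, None] - S2[:, None, :]) ** p).mean(axis=0)
    # Outer OT with uniform marginals: solved by a linear assignment
    row, col = linear_sum_assignment(cost)
    return cost[row, col].mean()

# Toy symmetric "distance" matrix (hypothetical data)
rng = np.random.default_rng(0)
C = rng.random((6, 6)); C = (C + C.T) / 2; np.fill_diagonal(C, 0)
assert np.isclose(tlb(C, C), 0.0)   # identical spaces: the bound vanishes
```

For general (non-uniform) weights, the outer problem would instead be solved with a generic linear OT solver, as discussed in the text.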
The optimal transport framework is a powerful tool for comparing probability distributions, relying on both the Wasserstein and Gromov-Wasserstein distances. The linear OT problem is well studied from both the theoretical and the numerical side. In contrast, many questions remain unanswered for the Gromov-Wasserstein theory, and interesting connections with the Wasserstein theory can be made. From a theoretical perspective there is no known result yet about the regularity of the optimal transport plans. On the practical side, the GW problem remains very costly to solve and difficult to approximate. The purpose of the next chapters is, inter alia, to work in these directions and to address the following questions:

i Are there favorable cases where the optimal transport plans of GW are supported on a Monge map, as in the Brenier theorem? (see Chapter 4)
ii Can we find special cases where GW admits a closed-form solution, such as the 1D or the Gaussian cases? If so, can we derive scalable and useful formulations from these cases? (see Chapter 4)
iii Is the GW framework suited to the structured data setting? How does it behave on concrete structured data problems such as graph applications? (see Chapter 3)
iv More generally, can we derive formulations other than GW that are perhaps more useful for data on incomparable spaces? (see Chapter 5)

Chapter 3: Optimal Transport for structured data
Perhaps as you went along you did learn something. – Ernest Hemingway,
The Sun Also Rises
Summary of the contributions
This chapter is based on the papers [Vayer 2019a, Vayer 2020b] and considers the problem of computing distances between structured objects such as undirected graphs, seen as probability distributions in a specific metric space. We consider a new transportation distance (i.e. one that minimizes a total cost of transporting probability masses) that unveils the geometric nature of the space of structured objects. Unlike the Wasserstein or Gromov-Wasserstein metrics, which focus solely and respectively on features (by considering a metric in the feature space) or on structure (by seeing the structure as a metric space), our new distance exploits both pieces of information jointly, and is consequently called Fused Gromov-Wasserstein (FGW). After discussing its properties and computational aspects, we show results on a graph classification task, where our method outperforms both graph kernels and deep graph convolutional networks. Exploiting further the metric properties of FGW, interesting geometric objects such as Fréchet means or barycenters of graphs are illustrated and discussed in a clustering context. In a second part we provide the mathematical framework for this distance in the continuous setting, prove its metric, geodesic and interpolation properties, and provide a concentration result for the convergence of finite samples.
There is a longstanding line of research on learning from structured data, i.e. objects that combine feature and structural information (see for example [Bakir 2007, Battaglia 2018]). As immediate instances, graph data are usually ensembles of nodes with attributes (typically vectors of R^d) linked by some specific relation. Notable examples are found in the modeling of chemical compounds or molecules [Kriege 2016], brain connectivity [Ktena 2017], or social networks [Yanardag 2015]. This generic family of objects also encompasses time series [Cuturi 2017], trees [Day 1985] or even images [Bach 2007].

Being able to leverage both feature and structural information in a learning task is tedious, as it requires associating, in some way, those two pieces of information in order to capture the similarity between structured data. Several kernels have been designed to perform this task [Shervashidze 2011, Vishwanathan 2010]. As a good representative of those methods, the Weisfeiler-Lehman kernel [Vishwanathan 2010] captures in each node a notion of vicinity by aggregating, in the sense of the topology of the graph, the surrounding features. Recent advances in graph convolutional networks [Bronstein 2017, Kipf 2016, Defferrard 2016, Wu 2020] allow learning end-to-end the best combination of features by relying on parametric convolutions on the graph, i.e. learnable linear combinations of features. In the end, and in order to compare two graphs that might have different numbers of nodes and connections, those two categories of methods build, for every graph, a new representation that shares the same space and is amenable to classification.

Contrasting with those previous methods, we suggest in this chapter to see graphs as probability distributions embedded in a specific metric space. We propose to define a specific notion of distance between these probability distributions that can be used in most classical machine learning approaches.
Beyond its mathematical properties, having at one's disposal a distance between structured data, provided it is meaningful, is desirable in many ways: i) it can be plugged into distance-based machine learning algorithms such as k-nn or t-SNE; ii) its quality does not depend on the size of the learning set; and iii) it allows considering interesting quantities such as geodesic interpolations or barycenters.

Yet, defining this distance is not a trivial task. While features can always be compared using a standard metric, such as the Euclidean distance, comparing structures requires a notion of similarity, which can be found via the notion of isometry, since the graph nodes are not ordered (we define later in which cases two graphs are considered identical). We use the notion of transportation distance to compare two graphs represented as probability distributions. Optimal transport has inspired a number of recent breakthroughs in machine learning (e.g. [Huang 2016, Courty 2017, Arjovsky 2017]) because of its capacity to compare empirical distributions, and also thanks to recent advances in solving the underlying problem [Peyré 2019]. Yet, the natural formulation of OT cannot leverage the structural information of objects since it only relies on a cost function that compares their feature representations.

However, some modifications of the OT formulation have been proposed in order to compare structural information of objects. Following the pioneering work of Mémoli [Memoli 2011], Peyré et al. [Peyré 2016] propose a way of comparing two distance matrices that can be seen as representations of some objects' structures. They use the Gromov-Wasserstein distance (see Chapter 2), capable of comparing two distributions even if they do not lie in the same ground space, and apply it to compute barycenters of molecular shapes. Even though this approach has wide applications, it only encodes the intrinsic structural information in the transportation problem.
To the best of our knowledge, the problem of including both structural and feature information in a unified OT formulation remains largely under-addressed.

OT distances that include both features and structures.
Recent approaches tend to incorporate some structure information as a regularization of the OT problem. For example, in [Alvarez-Melis 2018b] and [Courty 2017], the authors constrain transport maps to favor some assignments within certain groups. These approaches require a known and simple structure, such as class clusters, to work, but do not generalize well to more general structural information. In [Thorpe 2017], the authors propose an OT distance that combines both a Lagrangian formulation of a signal and its temporal structural information. They define a metric, called the Transportation L^p distance, that can be seen as a distance over the coupled space of time and features. They apply it to signal analysis and show that combining both structure and features tends to better capture the signal information. Yet, for their approach to work, the structure and feature information should lie in the same ambient space, which is not a valid assumption for more general problems such as similarity between graphs. In [Nikolentzos 2017], the authors propose an OT-based graph similarity measure for discrete labeled graphs. Using the eigenvector decomposition of the adjacency matrix, which captures graph connectivities, the nodes of a graph are first embedded in a new space; then a ground metric based on the distance in both this embedding and the labels is used to compute a Wasserstein distance serving as a graph similarity measure.

Contributions.
After defining structured data as discrete probability measures (Section 3.2), we propose a new framework, namely FGW, capable of taking into account both structure and feature information in the optimal transport problem. The framework can compare any usual structured machine learning data, even if the feature and structure information dwell in spaces of different dimensions, allowing the comparison of undirected labeled graphs. It is based on a distance that embeds a trade-off parameter which allows balancing the importance of the features and of the structure. We provide a conditional-gradient algorithm for computing FGW (Section 3.3), and we evaluate it (Section 3.4) on both synthetic and real-world graph datasets on various tasks. We show that FGW is particularly useful for both supervised and unsupervised learning on graphs.

Among the contributions of this chapter, the numerical solution presented in Section 3.3 can also be used to compute the Gromov-Wasserstein distance. To the best of our knowledge this is the first optimization scheme for GW that does not require entropic regularization and which results in a sparse optimal solution.

We also define and illustrate a notion of labeled graph barycenters using FGW (Section 3.3.2), based on the Fréchet mean, and apply it to graph clustering and coarsening problems.

In a last part (Section 3.5), we generalize the definition of structured data to compact metric spaces. We present the theoretical foundations of our framework in this general setting and state the mathematical properties of FGW. Notably, we show that it is a metric on the space of structured objects with respect to an intuitive equivalence relation between structured objects, we give a concentration result for the convergence of finite samples, and we study its interpolation and geodesic properties.

Figure 3.1: (left) Labeled graph with (a_i)_i its feature information, (x_i)_i its structure information and a probability vector (h_i)_i that measures the relative importance of the vertices. (right) Associated structured data, entirely described by a fully supported probability measure µ over the product space of feature and structure, with marginals µ_X and µ_A on the structure and the features respectively.
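The discrete representation of Figure 3.1 is easy to materialize in code. A minimal sketch with a hypothetical toy graph, choosing the shortest-path distance for the structure matrix C (one of the possible choices discussed in the text):

```python
import numpy as np

def shortest_path_matrix(adj):
    """Floyd-Warshall shortest-path distances from a binary adjacency matrix."""
    C = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(C, 0.0)
    for k in range(adj.shape[0]):
        # relax all paths through intermediate node k
        C = np.minimum(C, C[:, [k]] + C[[k], :])
    return C

# A 4-cycle with 1D node features and uniform node weights
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
features = np.array([[0.0], [1.0], [1.0], [0.0]])   # (a_i)_i
h = np.full(4, 0.25)                                # (h_i)_i in Sigma_4
C = shortest_path_matrix(adj)                       # structure matrix
# mu = sum_i h_i delta_{(x_i, a_i)} is then fully described by (C, features, h)
```

With this triplet (C, features, h) in hand, the structured data is exactly the measure µ of Figure 3.1.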
In this chapter, we focus on comparing structured data which combine feature and structure information. In order to give a good intuition of the method we first consider the discrete setting, which corresponds to labeled graphs. More formally, we consider undirected labeled graphs as tuples of the form G = (V, E, ℓ_f, ℓ_s) where (V, E) are the sets of vertices and edges of the graph. ℓ_f : V → Ω_f is a labelling function which associates each vertex v_i ∈ V with a feature a_i := ℓ_f(v_i) in some feature metric space (Ω_f, d). We will call feature information the set of all features (a_i)_i of the graph. Similarly, ℓ_s : V → Ω_s maps a vertex v_i of the graph to its structure representation x_i := ℓ_s(v_i) in some structure space (Ω_s, C) specific to each graph. C : Ω_s × Ω_s → R is a symmetric application which aims at measuring the similarity between the nodes of the graph. Unlike the feature space, however, Ω_s is implicit and, in practice, knowing the similarity measure C is sufficient. With a slight abuse of notation, C will be used in the following to denote both the structure similarity measure and the matrix that encodes this similarity between pairs of nodes in the graph, (C(i,k) = C(x_i, x_k))_{i,k}. Depending on the context, C can either encode the neighborhood information of the nodes, the edge information of the graph or, more generally, it can model a distance between the nodes, such as the shortest-path distance or the harmonic distance [Verma 2017]. When C is a metric, such as the shortest-path distance, we naturally endow the structure with the metric space (Ω_s, C). We will call structure information the set of all structure embeddings (x_i)_i of the graph.

We propose to enrich the previously described graph with a probability vector which serves the purpose of signaling the relative importance of the vertices in the graph.
To do so, if we assume that the graph has n vertices, we equip those vertices with weights (h_i)_i ∈ Σ_n. Through this procedure, we derive the notion of structured data as a tuple S = (G, h_G), where G is a graph as described previously and h_G is a function that associates a weight with each vertex. This definition allows the graph to be represented by a fully supported probability measure over the product space feature/structure, µ = Σ_{i=1}^n h_i δ_{(x_i, a_i)}, which describes the entire structured data (see Figure 3.1). When all the weights are equal (i.e. h_i = 1/n), so that all vertices have the same relative importance, the structured data holds exactly the same information as its graph. However, the weights can be used to encode some a priori information. For instance, on segmented images, one can construct a graph using the spatial neighborhood of the segmented zones; the features can be taken as the average color in each zone, and the weights as the ratio of image pixels in the zone.

Figure 3.2: The FGW loss E_q for a coupling π depends both on a similarity between the features of the nodes of each graph, (d(a_i, b_j))_{i,j}, and on all intra-graph structure similarities, (|C_1(x_i, x_k) − C_2(y_j, y_l)|)_{i,j,k,l}.

We aim at defining a distance between two graphs G_1 and G_2, described respectively by their probability measures µ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} and ν = Σ_{j=1}^m g_j δ_{(y_j, b_j)}, where h ∈ Σ_n and g ∈ Σ_m are probability vectors. Without loss of generality we suppose (x_i, a_i) ≠ (x_j, a_j) for i ≠ j (and similarly for the (y_j, b_j)). We recall that Π(h, g) is the set of all admissible couplings between h and g. To that extent, a matrix π ∈ Π(h, g) describes a probabilistic matching of the nodes of the two graphs. We denote by M_AB = (d(a_i, b_j))_{i,j} the n × m matrix of distances between the features. The structure matrices are denoted C_1 and C_2, and µ_X and µ_A (resp. ν_Y and ν_B) are the marginals of µ (resp. ν) w.r.t. the structure and the features respectively (see Figure 3.1).
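As a quick sanity check of these objects, here is a toy construction (hypothetical 1D features, with d taken as the absolute difference): the feature distance matrix M_AB and the product coupling h g^T, which always belongs to Π(h, g).

```python
import numpy as np

# Features of two graphs (1D labels) and node weights
A = np.array([0.0, 1.0, 1.0, 0.0])      # (a_i)_i, n = 4
B = np.array([0.0, 1.0, 0.5])           # (b_j)_j, m = 3
h = np.full(4, 0.25)
g = np.array([0.5, 0.25, 0.25])

M_AB = np.abs(A[:, None] - B[None, :])  # (d(a_i, b_j))_{i,j}, here d = |.|
pi = np.outer(h, g)                     # product coupling: always in Pi(h, g)
assert np.allclose(pi.sum(axis=1), h) and np.allclose(pi.sum(axis=0), g)

# Linear (Wasserstein-like) part of the transport cost for this coupling
feat_cost = (M_AB * pi).sum()
```

Any other matrix with row sums h and column sums g is an equally admissible probabilistic matching; the distance defined next selects the one minimizing a combined feature/structure cost.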
We also define the similarity between the structures by measuring the similarity between all pairwise distances within each graph, thanks to the 4-dimensional tensor L(C_1, C_2):

L(C_1, C_2) = (L_{i,j,k,l}(C_1, C_2))_{i,j,k,l} = (|C_1(i,k) − C_2(j,l)|)_{i,j,k,l}.

We define a novel optimal transport discrepancy called the Fused Gromov-Wasserstein distance. It is defined, for a trade-off parameter α ∈ [0,1], as

FGW_{q,α}(M_AB, C_1, C_2, h, g) = min_{π ∈ Π(h,g)} E_q(M_AB, C_1, C_2, π)   (3.1)

where

E_q(M_AB, C_1, C_2, π) = ⟨(1 − α) M_AB^q + α L(C_1, C_2)^q ⊗ π, π⟩_F = Σ_{i,j,k,l} ((1 − α) d(a_i, b_j)^q + α |C_1(i,k) − C_2(j,l)|^q) π_{i,j} π_{k,l}   (3.2)

The FGW distance looks for the coupling π between the vertices of the graphs that minimizes the cost E_q, which is a linear combination of a cost d(a_i, b_j) of transporting one feature a_i to a feature b_j and a cost |C_1(i,k) − C_2(j,l)| of transporting pairs of nodes in each structure (see Figure 3.2). As such, the optimal coupling tends to associate pairs of feature and structure points with similar distances within each structure pair and with similar features. α acts as a trade-off parameter between the cost of the structures, represented by L(C_1, C_2), and the cost of the features, M_AB. In this way, the convex combination of both terms leads to the use of both pieces of information in one formalism, resulting in a single map π which moves the mass from one joint probability measure to the other. As an important feature of FGW, by relying on a sum of (inter- and intra-)vertex-to-vertex distances, it can handle structured data with continuous attributes or discrete node labels (thanks to the definition of d) and can be computed even if the graphs have different numbers of nodes.

This new distance is called the FGW distance as it generalizes both the Wasserstein and the Gromov-Wasserstein distances. Indeed, as α tends to zero, the FGW distance recovers the Wasserstein distance between the features, W_q(µ_A, ν_B)^q, and as α tends to one, we recover the Gromov-Wasserstein distance between the structures, GW_q(µ_X, ν_Y)^q (see Proposition 3.5.2 of Section 3.5.5).

More importantly, FGW enjoys metric properties on labeled graphs, as stated in the following theorem:
Theorem 3.3.1 (FGW defines a metric for q = 1 and a semi-metric for q > 1). If q = 1, and if C_1, C_2 are distance matrices such as shortest-path matrices, then FGW defines a metric over the space of structured data quotiented by the measure-preserving isometries that are also feature-preserving. More precisely, FGW satisfies the triangle inequality and vanishes iff n = m and there exists a permutation σ ∈ S_n such that:

∀i ∈ [[n]], h_i = g_{σ(i)}   (3.3)
∀i ∈ [[n]], a_i = b_{σ(i)}   (3.4)
∀(i,k) ∈ [[n]]², C_1(i,k) = C_2(σ(i), σ(k))   (3.5)

If q > 1, the triangle inequality is relaxed by a factor 2^{q−1}, so that FGW defines a semi-metric.
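The vanishing case of the theorem can be checked numerically on a toy example: relabeling a graph by a permutation yields a coupling with zero FGW cost. A sketch with hypothetical random data, where the coupling supported on the permutation is feasible since the weights are uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Hypothetical graph: symmetric structure matrix and 1D features
C1 = rng.random((n, n)); C1 = (C1 + C1.T) / 2; np.fill_diagonal(C1, 0)
feats1 = rng.random(n)

perm = rng.permutation(n)
P = np.eye(n)[perm]          # (P x)_i = x_{perm[i]}
C2 = P @ C1 @ P.T            # relabeled structure: C2[i, j] = C1[perm[i], perm[j]]
feats2 = feats1[perm]        # relabeled features

pi = P.T / n                 # coupling matching node perm[j] of G1 with node j of G2
M = np.abs(feats1[:, None] - feats2[None, :])
L = np.abs(C1[:, None, :, None] - C2[None, :, None, :])
alpha = 0.5                  # q = 1 here, so no exponent on M and L
E = ((1 - alpha) * M * pi).sum() + alpha * np.einsum('ijkl,ij,kl->', L, pi, pi)
assert np.isclose(E, 0.0)    # FGW vanishes on isomorphic labeled graphs
```

Conversely, perturbing either the features or the structure of one graph makes E strictly positive for every coupling, which is the intuition behind conditions (3.3)-(3.5).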
This result is a direct consequence of Theorem 3.5.1 in Section 3.5.2, where FGW is defined for general metric spaces. The resulting permutation σ preserves the weight of each node (equation (3.3)), the features (equation (3.4)) and the pairwise structure relations between the nodes (equation (3.5)). For example, when comparing two graphs with uniform weights on the vertices and with shortest-path structure matrices, the FGW distance vanishes iff the graphs have the same number of vertices and there exists a one-to-one mapping between the vertices of the graphs which preserves both the shortest paths and the features. More informally, in this case the graphs have vertices with the same labels connected by the same edges, and thus FGW can be used to determine whether two graphs are isomorphic [West 2000].

The metric FGW is fully unsupervised and can be used in a wide range of applications such as k-nearest neighbors, distance-substitution kernels, pseudo-Euclidean embeddings, or representative-set methods. Arguably, such a distance also allows for a fine interpretation of the similarity (through the optimal mapping π), contrary to end-to-end learning machines such as neural networks.

OT barycenters have many desirable properties and applications (see Chapter 2 for more details), yet no existing formulation can leverage both structural and feature information in the barycenter computation. In this
F GW distance to define a barycenter of a set of structured data as a Fréchet mean.We look for the structured data µ that minimizes the sum of (weighted) F GW distances within a givenset of structured data ( µ k ) k ∈ [[ K ]] associated with structure matrices ( C k ) k ∈ [[ K ]] , features ( B k ) k ∈ [[ K ]] andbase histograms ( h k ) k ∈ [[ K ]] . For simplicity, we assume that the histogram h associated to the barycenteris known and fixed; in other words, we set the number of vertices N and the weight associated to each ofthem.In this context, for a fixed N ∈ N and ( λ k ) k such that P Kk =1 λ k = 1 , we aim to find the set of features A = ( a i ) i and the structure matrix C of the barycenter that minimize the following equation:min µ K X k =1 λ k F GW q,α ( µ, µ k ) = min C ∈ R N × N , A ∈ R N × d ∀ k ∈ [[ K ]] , π k ∈ Π( h , h k ) K X k =1 λ k E q ( M AB k , C , C k , π k ) (3.6)Note that this problem is jointly convex w.r.t. C and A but not w.r.t. π k . We discuss the proposedalgorithm to solve this problem in the next section. Interestingly enough, one can derive several variantsof this problem, where the features or the structure matrices of the barycenter can be fixed. Solving therelated simpler optimization problem extends straightforwardly. We give examples of such barycentersboth in the experimental section where we solve a graph based k -means problem. In this section we discuss the numerical optimization problem for computing the
FGW distance between discrete distributions.
Solving the Quadratic Optimization problem.
Equation (3.1) is clearly a quadratic problem w.r.t. π, which is NP-hard in general [Loiola 2007]. However, a solution can be found quite efficiently in practice. We propose here a method based on the Frank-Wolfe algorithm [Jaggi 2013] (aka Conditional Gradient). When considering q = 2, the FGW computation problem can be rewritten as finding π* such that:

π* = argmin_{π ∈ Π(h, g)} vec(π)^T Q(α) vec(π) + vec(D(α))^T vec(π)   (3.7)

where Q(α) = −2α C₂ ⊗_K C₁ and D(α) = (1 − α) M_AB; ⊗_K denotes the Kronecker product of two matrices and vec the column-stacking operator. In this form, the resulting optimal map can be seen as a quadratically regularized map from the initial Wasserstein problem [Ferradans 2014, Flamary 2014]. However, unlike these approaches, we have a quadratic but provably non-convex term. The gradient G that arises from equation (3.1) can be expressed with the following partial derivative w.r.t. π:

G(π) = (1 − α) M_AB^q + 2α L(C₁, C₂)^q ⊗ π   (3.8)

Note that despite the apparent O(m²n²) complexity of computing the tensor product L(C₁, C₂)^q ⊗ π given π, one can simplify the sum to O(mn² + m²n) operations when q = 2 [Peyré 2016]. Solving a large-scale QP with a classical solver can be computationally expensive. In [Ferradans 2014], the authors propose a solver for a graph-regularized optimal transport problem whose resulting optimization problem is also a QP. We can then directly use their conditional gradient scheme to solve our optimization problem, as presented in Algorithm 4. At each iteration, it only needs to compute the gradient in equation (3.8) and to solve a linear OT problem with classical solvers (see Chapter 2 for more details). The line-search part is a constrained minimization of a second-degree polynomial and, as such, admits the closed-form expression written in Algorithm 5.

Algorithm 4: Conditional Gradient (CG) for FGW
  π^(0) ← h g^T
  for i = 1, 2, ... do
    G ← gradient from equation (3.8) w.r.t. π^(i−1)
    π̃^(i) ← solve OT with ground loss G
    τ^(i) ← line-search for loss (3.1) with τ ∈ (0, 1) using Algorithm 5
    π^(i) ← (1 − τ^(i)) π^(i−1) + τ^(i) π̃^(i)
  end for

Algorithm 5: Line-search for CG (q = 2)
  Compute ĉ(C₁, C₂) from equation (6) in [Peyré 2016]
  a = −2α ⟨C₁ π̃^(i) C₂, π̃^(i)⟩_F
  b = ⟨(1 − α) M_AB + α ĉ(C₁, C₂), π̃^(i)⟩_F − 2α (⟨C₁ π̃^(i) C₂, π^(i−1)⟩_F + ⟨C₁ π^(i−1) C₂, π̃^(i)⟩_F)
  if a > 0 then τ^(i) ← min(1, max(0, −b/(2a)))
  else if a + b < 0 then τ^(i) ← 1 else τ^(i) ← 0
  end if

While the problem is non-convex, conditional gradient is known to converge to a local stationary point with a O(1/√t) rate [Lacoste-Julien 2016]. More precisely, we note g^(i) the Frank-Wolfe gap at iteration i, defined by:

g^(i) = max_{π ∈ Π(h, g)} ⟨π − π^(i), −G(π^(i))⟩_F   (3.9)

We also note |||·||| the dual norm for tensors: |||L||| = sup_{∥A∥_F = 1} ∥L ⊗ A∥_F, where L is a 4-dimensional tensor and A a matrix. Using this dual norm, the gradient G(·) is 2α|||L(C₁, C₂)^q|||-Lipschitz:

∀(π₁, π₂) ∈ Π(h, g)², ∥G(π₁) − G(π₂)∥_F = ∥2α L(C₁, C₂)^q ⊗ (π₁ − π₂)∥_F ≤ 2α |||L(C₁, C₂)^q||| ∥π₁ − π₂∥_F   (3.10)

We also note diam_{∥·∥_F}(Π(h, g)) the diameter of Π(h, g) (see [Jaggi 2013]):

diam_{∥·∥_F}(Π(h, g)) = max_{(π₁, π₂) ∈ Π(h, g)²} ∥π₁ − π₂∥_F   (3.11)

Then, by Theorem 1 in [Lacoste-Julien 2016], the minimal gap encountered by the iterates of the algorithm after t iterations satisfies:

min_{0 ≤ i ≤ t} g^(i) ≤ max{2h₀, C₀} / √(t + 1)   (3.12)

where C₀ = 4α |||L(C₁, C₂)^q||| diam²_{∥·∥_F}(Π(h, g)) and h₀ is the initial global suboptimality.
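The two computational ingredients of the scheme above can be sketched in a few lines of numpy: the q = 2 tensor product computed with the O(mn² + m²n) factorization, and the closed-form line search over [0, 1]. This is a minimal illustration with our own function names, not the toolbox implementation.

```python
import numpy as np

def tensor_product(C1, C2, pi):
    """(L(C1, C2)^2 ⊗ pi)[i, j] = sum_{k,l} (C1[i,k] - C2[j,l])**2 * pi[k,l],
    computed with the factorization of [Peyré 2016] in O(mn^2 + m^2 n)
    instead of the naive O(m^2 n^2)."""
    p = pi.sum(axis=1)  # first marginal of the coupling
    q = pi.sum(axis=0)  # second marginal of the coupling
    return ((C1 ** 2) @ p)[:, None] + ((C2 ** 2) @ q)[None, :] - 2 * C1 @ pi @ C2.T

def fgw_gradient(M, C1, C2, pi, alpha):
    # gradient of the FGW energy w.r.t. pi for q = 2 (equation (3.8));
    # M holds the (already q-th powered) feature distances
    return (1 - alpha) * M + 2 * alpha * tensor_product(C1, C2, pi)

def quadratic_linesearch(a, b):
    """Minimize a*tau**2 + b*tau over tau in [0, 1] (closed form of Alg. 5)."""
    if a > 0:  # convex parabola: clip the unconstrained minimizer
        return min(1.0, max(0.0, -b / (2 * a)))
    return 1.0 if a + b < 0 else 0.0  # concave or linear: best endpoint
```

The test against a brute-force double sum is a direct check that the factorized tensor product matches its definition.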
We propose to minimize equation (3.6) using a BCD algorithm, i.e. iteratively minimizing with respect to the couplings π_k, to the metric C and to the feature matrix A. The minimization of this problem w.r.t. (π_k)_{k ∈ [[K]]} is equivalent to computing K independent Fused Gromov-Wasserstein distances, as discussed above. We suppose that the feature space is Ω_f = (R^d, ∥·∥) and we consider q = 2. Minimization w.r.t. C in this case has a closed form (see Proposition 4 in [Peyré 2016] and Chapter 2):

C ← (1 / (h h^T)) Σ_{k=1}^K λ_k π_k C_k π_k^T   (3.13)

where the division is computed elementwise and h is the histogram of the barycenter, as discussed in Section 3.3.2. Minimization w.r.t. A can be computed with [Cuturi 2014, Equation 8]:

A ← diag(1/h) Σ_{k=1}^K λ_k π_k B_k   (3.14)

Figure 3.3: Example of FGW, GW and W on synthetic trees. Dark grey represents a non-null π_{i,j} value between two nodes i and j. (left) The W distance between the features, with α = 0 and W = 0; (middle) FGW > 0; (right) the GW distance between the structures, with α = 1 and GW = 0.

We illustrate in this section the behavior of our method on synthetic and real datasets. The algorithms presented in the previous section have been implemented in the Python Optimal Transport toolbox [Flamary 2017].
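The two closed-form barycenter updates of equations (3.13) and (3.14) can be sketched as follows. This is a minimal numpy illustration under the convention π_k ∈ Π(h, h_k) of shape (N, N_k), with our own function names; it is not the toolbox implementation.

```python
import numpy as np

def update_structure(pis, Cs, lambdas, h):
    """Closed-form update of the barycenter structure C, equation (3.13):
    C <- (sum_k lambda_k pi_k C_k pi_k^T) / (h h^T), elementwise division.
    pi_k has shape (N, N_k) with row marginal h; C_k has shape (N_k, N_k)."""
    num = sum(l * pi @ Ck @ pi.T for l, pi, Ck in zip(lambdas, pis, Cs))
    return num / np.outer(h, h)

def update_features(pis, Bs, lambdas, h):
    """Closed-form update of the barycenter features A, equation (3.14):
    A <- diag(1/h) sum_k lambda_k pi_k B_k, with B_k of shape (N_k, d)."""
    num = sum(l * pi @ Bk for l, pi, Bk in zip(lambdas, pis, Bs))
    return num / h[:, None]
```

As a sanity check, with a single input graph (K = 1), identical histograms and the "identity" coupling π = diag(h), both updates return the input structure and features unchanged.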
We construct two trees as illustrated in Figure 3.3, where the 1D node features are shown with colors (in red, features lie in [0, 1]; in blue, in a disjoint interval starting at 9). C₁ and C₂ are the shortest-path matrices between the nodes. Both trees have the same individual structure and the same features up to a permutation. However, when combining both pieces of information, the trees are not the same, as they do not have the same labels at the same place. Figure 3.3 illustrates the behavior of the FGW distance when the trade-off parameter α changes. The left part recovers the Wasserstein distance between the features (α = 0): red nodes are coupled to red ones and blue nodes to blue ones. For α close to 1 (right), we recover the Gromov-Wasserstein distance between the structures of the trees: all pairs of points are coupled to another pair of points, without taking the features into account. Both approaches fail at discriminating the two trees. Finally, for an intermediate α in FGW (center), the bottom and first-level structures are preserved as well as the feature matching (red on red and blue on blue), resulting in a positive distance. Note that
FGW also preserves the substructures of the trees through its coupling.

Figure 3.4: Illustration of the difference between W, GW and FGW couplings. (left) Empirical distributions µ = Σ_i δ_{(x_i, a_i)} with 20 samples and ν = Σ_j δ_{(y_j, b_j)} with 30 samples, whose color is proportional to their index. (middle) Cost matrices in the feature (M_AB) and structure (C₁, C₂) domains, with similar samples in white. (right) Solutions for all methods; dark blue indicates a non-zero coefficient of the transport map. Feature distances are large between points lying on the diagonal of M_AB, so that the Wasserstein map is anti-diagonal but unstructured. Fused Gromov-Wasserstein incorporates both feature and structure in a single transport map.

Figure 3.4 illustrates the differences between the Wasserstein, Gromov-Wasserstein and Fused Gromov-Wasserstein couplings π* on 1D distributions. In this example both the features and the structure are 1-dimensional, that is x_i, y_j ∈ R and a_i, b_j ∈ R (Figure 3.4, left). The feature space (vertical axis) exhibits two clusters among the elements of both objects, visible in the cost matrix M_AB; the structure space (horizontal axis) encodes a noisy temporal sequence along the indexes, visible in the matrices C₁ and C₂ (Figure 3.4, center). Wasserstein respects the clustering but forgets the temporal structure; Gromov-Wasserstein respects the structure but does not take the clustering into account. Only FGW retrieves a transport matrix respecting both feature and structure. We extract a 28 ×
28 image from the MNIST dataset and generate a second one through translation or mirroring of the digit in the original image. We use pixel gray levels as the features, and the structure is defined as the city-block distance on the pixel coordinate grid. We use equal weights for all the pixels in the image. Figure 3.5 shows the different couplings obtained when considering either the features only, the structure only, or both pieces of information.
FGW aligns the pixels of the digits, recovering the correct order of the pixels, while both the Wasserstein and Gromov-Wasserstein distances fail at providing a meaningful transportation map. Note that in the Wasserstein and Gromov-Wasserstein cases, the distances are equal to 0, whereas
FGW manages to spot that the two images are different. Also note that, in the FGW sense, the original digit and its mirrored version are equivalent, as there exists an isometry between their structure spaces, making FGW invariant to rotations or flips in the structure space in this case.
One of the main assets of FGW is that it can be used on a wide class of structured data, such as graphs but also time series. We consider here 25 monodimensional time series composed of two humps in [0, 1], with random uniform height between 0 and 1.

Figure 3.5: Couplings obtained when considering (top left) the features only, where we have W = 0, (top right) the structure only, with GW = 0, and (bottom left and right) both the features and the structure, with FGW > 0. For readability reasons, only the couplings starting from non-white pixels on the left picture are depicted.

Signals are distributed according to two classes translated from each other with a fixed gap. The FGW distance is computed by considering M_AB as the Euclidean distance between the features of the signals (here the value of the signal at each point) and C₁ and C₂ as the Euclidean distance between timestamps. A 2D embedding is computed from a FGW distance matrix between a number of examples in this dataset with multidimensional scaling (MDS) in Figure 3.6 (top). One can clearly see that the representation with a reasonable α value, in the center, is the most discriminant one. This can be better understood by looking at the OT matrices between the classes. Figure 3.6 (bottom) illustrates the behavior of FGW on one pair of examples when going from Wasserstein to Gromov-Wasserstein. The black lines depict the matching provided by the transport matrix, and one can clearly see that while Wasserstein on the left assigns samples completely independently of their temporal position, Gromov-Wasserstein on the right tends to align the samples perfectly (note that it could have exactly reversed the alignment with the same loss) but discards the values of the signal. Only the true
FGW, in the center, finds a transport matrix that both respects the time sequences and aligns similar values in the signals.
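The embedding step can be reproduced with scikit-learn's MDS on a precomputed distance matrix. This is a minimal sketch: the matrix D below is fabricated for illustration (two clusters of three signals), not the experiment's actual FGW distances.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical precomputed FGW distance matrix between 6 signals,
# two clusters of 3 (values are made up for illustration)
D = np.array([[0.0, 0.1, 0.1, 1.0, 1.0, 1.0],
              [0.1, 0.0, 0.1, 1.0, 1.0, 1.0],
              [0.1, 0.1, 0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 0.0, 0.1, 0.1],
              [1.0, 1.0, 1.0, 0.1, 0.0, 0.1],
              [1.0, 1.0, 1.0, 0.1, 0.1, 0.0]])

# metric MDS on the precomputed dissimilarities, as used for Figure 3.6 (top)
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
```

The `dissimilarity="precomputed"` option is what allows plugging in any distance matrix, here standing in for the pairwise FGW distances.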
We now use
FGW on real-world datasets, studying its behavior on a graph classification task. More precisely, we address the question of training a classifier for graph data and evaluate the FGW distance used in a kernel with an SVM.
Datasets
We consider 12 widely used benchmark datasets divided into 3 groups. BZR, COX2 [Sutherland 2003], PROTEINS, ENZYMES [Borgwardt 2005], CUNEIFORM [Kriege 2018] and SYNTHETIC [Feragen 2013] are vector-attributed graphs. MUTAG [Debnath 1991], PTC-MR [Kriege 2016] and NCI1 [Wale 2008] contain graphs with discrete attributes derived from small molecules. IMDB-B and IMDB-M [Yanardag 2015] contain unlabeled graphs derived from social networks. All datasets are available in [Kersting 2016].
Figure 3.6: Behavior of the trade-off parameter α on a toy time series classification problem. α is increasing from left (α = 0: Wasserstein distance) to right (α = 1: Gromov-Wasserstein distance). (top row) 2D MDS embeddings computed from the distances, with Class 1 and Class 2 in different colors; (bottom row) illustration of couplings between two sample time series from opposite classes.
Regarding the feature distance matrix M_AB between node features, when dealing with real-valued vector-attributed graphs, we consider the ℓ₂ distance between the labels of the vertices. In the case of graphs with discrete attributes, we consider two settings: in the first one, we keep the original labels (denoted as raw); we also consider a Weisfeiler-Lehman labeling (denoted as wl) obtained by concatenating the labels of the neighbors. A vector of size h is created by repeating this procedure h times [Vishwanathan 2010, Kriege 2016]. In both cases, we compute the feature distance matrix using d(a_i, b_j) = Σ_{k=0}^{h} δ(τ(a_i^k), τ(b_j^k)), where δ(x, y) = 1 if x ≠ y and 0 otherwise, and τ(a_i^k) denotes the concatenated label at iteration k (for k = 0 the original labels are used). Regarding the structure distances C₁ and C₂, they are computed by considering a shortest-path distance between the vertices. For the classification task, we run an SVM using the indefinite kernel matrix e^{−γ FGW}, which is seen as a noisy observation of the true positive semidefinite kernel [Luss 2007]. We compare classification accuracies with the following state-of-the-art graph kernel methods: (SPK) denotes the shortest path kernel [Borgwardt 2005], (RWK) the random walk kernel [Gärtner 2003], (WLK) the Weisfeiler-Lehman kernel [Vishwanathan 2010], and (GK) the graphlet count kernel [Shervashidze 2009]. For real-valued vector attributes, we consider the HOPPER kernel (HOPPERK) [Feragen 2013] and the propagation kernel (PROPAK) [Neumann 2016]. We build upon the GraKel library [Siglidis 2018] to construct the kernels and use a C-SVM to perform the classification.
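The Weisfeiler-Lehman relabeling and the associated Hamming-type feature distance can be sketched as follows. This is a minimal illustration with our own function names (not the GraKel or thesis implementation): each iteration replaces a node's label by the pair (own label, sorted multiset of neighbor labels), and the distance counts the iterations at which two label sequences differ.

```python
def wl_labels(neighbors, labels, h):
    """Return, for each node, its sequence of labels over h WL iterations.

    neighbors: dict node -> list of adjacent nodes
    labels: dict node -> initial discrete label
    """
    history = {v: [labels[v]] for v in neighbors}
    current = dict(labels)
    for _ in range(h):
        new = {}
        for v in neighbors:
            # compress (own label, sorted neighbor labels) into a new label
            new[v] = (current[v], tuple(sorted(current[u] for u in neighbors[v])))
        current = new
        for v in neighbors:
            history[v].append(current[v])
    return history

def wl_distance(seq_a, seq_b):
    # number of WL iterations (including iteration 0) where the labels differ
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)
```

On a path graph with identical initial labels, the endpoints get the same WL sequence while the middle node differs after one iteration, so its distance to an endpoint is 1.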
We also compare FGW with the PATCHY-SAN framework for CNN on graphs (PSCN) [Niepert 2016], building on our own implementation of the method (https://github.com/tvayer/PSCN). To compare the methods, most papers about graph classification usually perform a nested cross-validation (using 9 folds for training, 1 for testing, and reporting the average accuracy of this experiment repeated 10 times) and report accuracies of the other methods taken from the original papers. However, these comparisons are not fair because of the high variance on most datasets w.r.t. the folds chosen for training and testing. This is why, in our experiments, the nested cross-validation is performed on the same folds for training and testing for all methods. In the result tables 3.1, 3.2 and 3.3 we add a (*) when the best score does not yield a significant improvement (based on a Wilcoxon signed rank test on the test scores) compared to the second best one. Note that, because of their small sizes, we repeat the experiments 50 times for the MUTAG and PTC-MR datasets. For all methods using SVM, we cross-validate the parameter C over a logarithmic grid. The range of the WL parameter h is cross-validated, and we also compute this kernel with h fixed at 2 and 4. The decay factor λ for RWK is cross-validated over a logarithmic grid; for the GK kernel we set the graphlet size κ = 3 and cross-validate the precision level ε and the confidence δ as in the original paper [Shervashidze 2009]. The t_max parameter for PROPAK is cross-validated, as are the batch size and number of epochs for PSCN, for which we choose the normalized betweenness centrality as labeling procedure. Finally, for FGW, γ is cross-validated over a logarithmic grid and α via a logspace search in [0, 0.5] and symmetrically in [0.5, 1] (15 values are drawn).

Table 3.1: Average classification accuracy on the graph datasets with vector attributes (datasets: BZR, COX2, CUNEIFORM, ENZYMES, PROTEIN, SYNTHETIC; methods: FGW, HOPPERK, PROPAK, PSCN with k = 10 and k = 5).
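The classification pipeline (indefinite kernel e^{−γ·FGW} fed to an SVM with a precomputed kernel) can be sketched as follows; the distance matrix, labels and parameter values below are made up for illustration and do not come from the experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 6x6 pairwise FGW distance matrix between graphs
# (two classes of 3 graphs; all values fabricated for illustration)
D = np.array([[0.0, 0.2, 0.2, 3.0, 3.0, 3.0],
              [0.2, 0.0, 0.2, 3.0, 3.0, 3.0],
              [0.2, 0.2, 0.0, 3.0, 3.0, 3.0],
              [3.0, 3.0, 3.0, 0.0, 0.2, 0.2],
              [3.0, 3.0, 3.0, 0.2, 0.0, 0.2],
              [3.0, 3.0, 3.0, 0.2, 0.2, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])

gamma = 1.0
K = np.exp(-gamma * D)  # possibly indefinite similarity used as a kernel
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
pred = clf.predict(K)  # rows of K index the same (training) graphs
```

In the real experiments the train/test split would use the rectangular kernel between test and training graphs; here the training kernel is reused only to show the API.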
Results and discussion
Vector attributed graphs.
The average accuracies reported in Table 3.1 show that FGW is a clear state-of-the-art method: it performs best on 4 out of 6 datasets, with performances within the error bars of the best methods on the other two. Results for CUNEIFORM are significantly below those from the original paper [Kriege 2018], which can be explained by the fact that the method in that paper uses a graph convolutional approach specially designed for this dataset, and that the experimental setting is different. In comparison, the other competitive methods are less consistent, as they exhibit good performances on some datasets only.
Discrete labeled graphs.
We first note in Table 3.2 that FGW using WL attributes outperforms all competitive methods, including FGW with raw features. Indeed, the WL attributes encode the neighborhood of the vertices more finely by stacking their attributes, whereas FGW with raw features only considers the shortest-path distance between vertices, not their sequences of labels. This result calls for using meaningful feature and/or structure matrices in the FGW definition, possibly dataset-dependent, in order to enhance the performances. We also note that FGW with WL attributes outperforms the WL kernel method, highlighting the benefit of an optimal-transport-based distance over a kernel-based similarity. Surprisingly, results of PSCN are significantly lower than those from the original paper. We believe that this comes from the difference in the fold assignment for training and testing, which suggests that PSCN is difficult to tune.
Table 3.2: Average classification accuracy on the graph datasets with discrete attributes.
Discrete attributes: MUTAG, NCI1, PTC-MR. Rows include FGW with raw and wl attributes and the kernel baselines; FGW with raw attributes attains 83.26 and GK (k = 3) 82.42 on MUTAG.

Table 3.3: Average classification accuracy on the graph datasets with no attributes.
Without attributes: IMDB-B, IMDB-M. Rows: GW, GK (k = 3, attaining 56.00 on IMDB-B), SPK.

Non-attributed graphs.
The particular case of the GW distance for graph classification is also illustrated on the social datasets, which contain no labels on the vertices. Accuracies reported in Table 3.3 show that it greatly outperforms the SPK and GK graph kernel methods.
Comparison between FGW, W and GW. During the validation step, the optimal value of α was consistently selected inside the open interval ]0, 1[, excluding 0 and 1, suggesting that both structure and feature pieces of information are necessary (details are given in Section 6.1.1).
In this section we use the barycenter formulation of FGW, i.e. the Fréchet mean formulation of equation (3.6), in two settings. The first one is the computation of the barycenter of several toy labeled graphs; the second is the compression of one graph into a smaller one, also known as coarsening [Loukas 2019].
Graph barycenter
In this part, we use FGW to compute barycenters of toy graphs. In a first example, we generate graphs following either a circle or an 8 symbol, with 1D features following respectively a sine and a linear variation. For each example, the number of nodes is drawn randomly between 10 and 25, Gaussian noise is added to the features, and a small noise is applied to the structure (some connections are randomly added). An example graph with no noise is provided for each class in the first column of Figure 3.7.

Figure 3.7: Illustration of FGW graph barycenters. The first column illustrates the original settings with the noiseless graphs, and columns 2 to 5 are the noisy samples that constitute the datasets. Column 6 shows the barycenters for each setting, with different numbers of nodes. Blue nodes indicate a low feature value.

We compute the FGW barycenter of each dataset containing 10 samples, using the shortest-path distance between the nodes as the structural information and the distance induced by the Euclidean norm ∥·∥ for the features. Note that the iterations of the barycenter update defined in equation (3.13) result in a dense C matrix, and to visualize the graph barycenter properly we need an adjacency matrix. We propose a simple heuristic procedure to recover an adjacency matrix for the graphs' barycenter, based on a thresholding of the matrix C. Given a threshold t, the matrix thresh_t(C) is defined by (thresh_t(C))_{ij} = 1 if C_{ij} ≤ t and 0 otherwise. The threshold t is tuned so as to minimize the Frobenius norm between the original C matrix and the shortest-path matrix constructed after thresholding C. More precisely, if SP denotes the algorithm which takes as input an adjacency matrix and outputs a shortest-path matrix, then the threshold is given by:

argmin_{t ∈ R+} ∥C − SP(thresh_t(C))∥_F   (3.15)

The idea behind equation (3.15) is that C somehow represents the shortest-path matrix of a graph, so that we want the adjacency matrix thresh_t(C) to give a shortest-path matrix as close as possible to C. Unfortunately, equation (3.15) is not differentiable with respect to t. We use a simple brute-force strategy to find a suitable threshold by looking at argmin_{t ∈ {t_1,...,t_L}} ∥C − SP(thresh_t(C))∥_F, where t_1, ..., t_L are drawn from R+.
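The brute-force search of equation (3.15) can be sketched as follows (a minimal numpy/scipy illustration with our own function name; the penalty used for disconnected graphs is our own choice):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def threshold_adjacency(C, thresholds):
    """Pick the threshold whose graph's shortest-path matrix is closest to C
    in Frobenius norm, by brute force over the candidate thresholds."""
    best_t, best_err, best_adj = None, np.inf, None
    for t in thresholds:
        adj = (C <= t).astype(float)
        np.fill_diagonal(adj, 0)  # no self-loops
        sp = shortest_path(adj, method="D", unweighted=True)
        sp[np.isinf(sp)] = C.max() * len(C)  # penalize disconnected graphs
        err = np.linalg.norm(C - sp)
        if err < best_err:
            best_t, best_err, best_adj = t, err, adj
    return best_t, best_adj
```

When C is exactly the shortest-path matrix of a graph (e.g. a path graph), the search recovers its adjacency matrix with zero error.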
Figure 3.8: Example of community clustering on graphs using FGW. In each row, the left panel shows the original graph and the right panel the approximate graph clustering obtained with the transport matrix. (top) Community clustering with 4 communities and uniform features per cluster. (bottom) Community clustering with 4 communities and bimodal features per cluster (and two nodes per cluster in the approximate graph).
Resulting barycenters are shown in Figure 3.7 for n = 15 and n = 7 nodes. First, one can see that the barycenters are denoised both in the feature space and in the structure space. Also note that the sharp change at the center of the 8 class is conserved in the barycenters, which is a nice result compared to other divergences that tend to smooth out their barycenters (the ℓ₂ distance for instance). Furthermore, by selecting the number of nodes in the barycenter, one can compress the graph or estimate a "high resolution" representation from all the samples. To the best of our knowledge, no other method can compute such graph barycenters. Finally, note that FGW is interpretable, because the resulting OT matrix provides correspondences between the nodes of the samples and those of the barycenter.
Graph compression
In the second experiment, we evaluate the ability of FGW to perform graph approximation and compression on a Stochastic Block Model graph [Wang 1987, Nowicki 2001]. The question is to see whether estimating an approximate graph can recover the relations between the blocks and simultaneously perform a community clustering on the original graph (using the coupling matrix π). We generate two community graphs, illustrated in the left column of Figure 3.8. The coarsened graph is obtained by solving the Fréchet mean formulation with K = 1. More precisely, given an original graph µ₁ described by its features B₁ and its structure C₁, we look for a graph µ of N nodes, with N smaller than the number of nodes of the original graph, which solves:

min_µ FGW_{q,α}(µ, µ₁) = min_{C ∈ R^{N×N}, A ∈ R^{N×d}, π} E_q(M_{A B₁}, C, C₁, π)   (3.16)

The results are depicted in Figure 3.8. We can see that the relations between the blocks are sparse and have a "linear" structure; the example in the first row has features that follow the blocks (noisy but similar in each block), whereas the example in the second row has two modes per block. The first graph approximation (top row) is done with N = 4 nodes, and we recover both the blocks in the graph and the average feature on each block (colors on the nodes). The second problem is more complex, due to the two modes per block, but one can see that when approximating the graph with N = 8 nodes we recover both the structure between the blocks and the sub-clusters in each block, which illustrates the strength of FGW in encoding both features and structure.
In the last experiment, we evaluate the ability of
FGW to perform a clustering of multiple graphs and to retrieve meaningful barycenters for such clusters. To do so, we generate a dataset of 4 groups of community graphs. Each graph follows a simple Stochastic Block Model [Wang 1987, Nowicki 2001], and the groups are defined w.r.t. the number of communities inside each graph and the distribution of their labels. The dataset is composed of 40 graphs (10 graphs per group), and the number of nodes of each graph is drawn at random, as illustrated in Figure 3.9. We perform a k-means clustering using the FGW barycenter defined in equation (3.6) as the centroid of the groups and the
FGW distance for the cluster assignment. We fix the number of nodes of each centroid to 30. At the end, we perform a thresholding on the pairwise similarity matrix C of each centroid in order to obtain an adjacency matrix for visualization purposes; the threshold value is empirically chosen with the procedure described in the previous section. The evolution of the barycenters along the iterations is reported in Figure 3.9. We can see that these centroids recover community structures and feature distributions that are representative of their cluster content. On this example, note that the clustering perfectly recovers the known groups in the dataset. To the best of our knowledge, there exists no other method able to perform a clustering of graphs and to retrieve the average graph of each cluster without having to solve a pre-image problem.

Since its introduction, the Fused Gromov-Wasserstein distance has also been used in various contexts where structured data are involved. It has found applications in molecular biology for the analysis of single-cell RNA sequencing (scRNA-seq) data in [Cang 2020], where the authors use
FGW as a building block to recover spatial properties of scRNA-seq data. In machine learning,
FGW was used for learning structured autoencoders [Xu 2020], to study the continuity of Graph Neural Networks [Béthune 2020], and for domain adaptation tasks on functional Near-Infrared Spectroscopy (fNIRS) data [Lyu 2020]. The Fused Gromov-Wasserstein approach was further analyzed in [Barbe 2020], where the authors propose to improve the
FGW distance using a smoothing strategy on the features of the graphs. They propose to incorporate a diffusion kernel on the features, which results in a more robust similarity measure for labeled graphs. The authors tackle domain adaptation tasks between labeled graphs where the label information is only available in a target domain.
FGW was also used on point clouds in [Puy 2020] for the estimation of scene flows, i.e. the 3D motion of points at the surface of objects in a scene.
Figure 3.9: (left) Examples from the clustering dataset, one row per cluster; color indicates the labels. (right) Evolution of the centroids of each cluster in the k-means clustering, from the random initialization until convergence to the final centroid.

The previous graph representation for objects with a finite number of points/vertices extends naturally to the continuous setting. The purpose of this section is to generalize
FGW to general probability distributions and to state some of its mathematical properties. We consider the following definition of structured objects:
Definition 3.5.1 (Structured objects). A structured object over a metric space (Ω, d) is a triplet (X × Ω, d_X, µ), where (X, d_X) is a metric space and µ is a probability measure over X × Ω. (Ω, d) is denoted as the feature space, such that d: Ω × Ω → R+ is the distance in the feature space, and (X, d_X) as the structure space, such that d_X: X × X → R+ is the distance in the structure space. We will note µ_X and µ_A the structure and feature marginals of µ.

Definition 3.5.2 (Space of structured objects). We note X the set of all metric spaces. The space of all structured objects over (Ω, d) will be written S(Ω) and is defined by all the triplets (X × Ω, d_X, µ) where (X, d_X) ∈ X and µ ∈ P(X × Ω). To avoid finiteness issues, we define for p ∈ N* the space S_p(Ω) ⊂ S(Ω): (X × Ω, d_X, µ) ∈ S_p(Ω) if:

∫_Ω d(a, a₀)^p dµ_A(a) < +∞   (3.17)

(the finiteness of this integral does not depend on the choice of a₀), and

∫_{X×X} d_X(x, x′)^p dµ_X(x) dµ_X(x′) < +∞.   (3.18)

For the sake of simplicity, and when it is clear from the context, we will sometimes denote by µ the whole structured object. In the same way as in the discrete case, we will note µ_X and µ_A the structure and feature marginals of µ. We recall that those marginals encode very partial information, since they focus only on the independent feature distribution or only on the structure. This definition encompasses the discrete setting discussed above. More precisely, let us consider a labeled graph of n nodes with features A = (a_i)_{i=1}^n, a_i ∈ Ω, and X = (x_i)_{i=1}^n the structure representation of the nodes. Let (h_i)_{i=1}^n be a histogram; then the probability measure µ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} defines a structured object in the sense of Definition 3.5.1, since it lies in P(X × Ω). In this case, an example of µ, µ_X and µ_A is provided in Figure 3.1.

Note that the set of structured objects is quite general and also allows considering discrete probability measures of the form µ = Σ_{i,j=1}^{p,q} h_{i,j} δ_{(x_i, a_j)} with p, q possibly different from n. We propose to focus on a particular type of structured objects, namely the generalized labeled graphs, as described in the following definition:

Definition 3.5.3 (Generalized labeled graph). We call generalized labeled graph a structured object (X × Ω, d_X, µ) ∈ S_p(Ω) such that µ can be expressed as µ = (id × ℓ_f)#µ_X, where ℓ_f: X → Ω is surjective and pushes µ_X forward to µ_A, i.e. ℓ_f#µ_X = µ_A.

This definition implies that there exists a function ℓ_f which associates a feature a = ℓ_f(x) to a structure point x ∈ X and, since ℓ_f is surjective, one structure point cannot have two different features. The labeled graph described by µ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} is a particular instance of a generalized labeled graph in which ℓ_f is defined by ℓ_f(x_i) = a_i.

We now aim to define a notion of equivalence between two structured objects (X × Ω, d_X, µ) and (Y × Ω, d_Y, ν). We note in the following ν_Y, ν_B the marginals of ν. Intuitively, two structured objects are the same if they share the same feature information, if their structure information is alike, and if the probability measures correspond in some sense. In this section, we present mathematical tools for the individual comparison of the elements of structured objects. For completeness, we recall here some useful mathematical tools defined in Chapter 2 and refer the reader to that chapter for more details.

Definition 3.5.4 (Isometry). Let (X, d_X) and (Y, d_Y) be two metric spaces. An isometry is a surjective map φ: X → Y that preserves the distances:

∀(x, x′) ∈ X², d_Y(φ(x), φ(x′)) = d_X(x, x′)   (3.19)

We refer the reader to Section 2.2.2 for wider explanations about isometries. The previous map φ can be used in order to compare the structure information of two structured objects. When the metric spaces are enriched with a probability measure, they define measured metric spaces, also called mm-spaces (see Section 2.2.2). In this case, the notion of strong isomorphism can be used for comparing mm-spaces:

Definition 3.5.5 (Strong isomorphism). Two mm-spaces (X, d_X, µ_X), (Y, d_Y, µ_Y) are strongly isomorphic if there exists an isometry φ: supp(µ_X) → supp(µ_Y) which pushes µ_X forward to µ_Y, i.e. φ#µ_X = µ_Y. In this case we say that φ is measure preserving.

All this considered, we can now define a notion of equivalence between structured objects:
Definition 3.5.6 ((II)-Strong isomorphism of structured objects). Two structured objects are said to be (II)-strongly isomorphic if there exists an isometry I : supp(µ_X) → supp(ν_Y) between the structures such that φ = (I, id) is bijective between supp(µ) and supp(ν) and measure preserving. More precisely, φ satisfies the following properties:

P.1 φ#µ = ν.
P.2 The function φ satisfies: ∀ (x, a) ∈ supp(µ), φ(x, a) = (I(x), a).
P.3 The function I : supp(µ_X) → supp(ν_Y) is surjective, satisfies I#µ_X = ν_Y and: ∀ x, x' ∈ supp(µ_X), d_X(x, x') = d_Y(I(x), I(x')).

Remark 3.5.1.
It is easy to check that the (II)-strong isomorphism defines an equivalence relation over S_p(Ω). Moreover, the function φ described in this definition can be seen as a feature, structure and measure preserving function. Indeed, from P.1, φ is measure preserving. Moreover, (X, d_X, µ_X) and (Y, d_Y, ν_Y) are isomorphic through I. Finally, using P.1 and P.2 we have that µ_A = ν_B, so that the feature information is also preserved. To illustrate this definition, we consider a simple example of two structured objects in the discrete case:
Example 3.5.1.
Let two structured objects be defined by the weighted points {((x_i, a_i), h_i)}_{i=1,…,4} and {((y_i, b_i), h_i)}_{i=1,…,4} with uniform weights h_i = 1/4 and structure distances d_X(x_i, x_j), d_Y(y_i, y_j) as depicted in Figure 3.10, such that a_i = b_i for all i and a_i ≠ a_j for i ≠ j. The two structured objects have isometric structures and the same features individually, but they are not (II)-strongly isomorphic. Indeed, any map φ = (φ_1, φ_2) : X × Ω → Y × Ω such that φ_1 leads to an isometry must send each (x_i, a_i) to some (y_{σ(i)}, b_{σ(i)}) where σ is a permutation imposed by the structures, and every such σ moves at least one index; since the features are pairwise distinct, φ_2(x_i, a_i) = b_{σ(i)} ≠ a_i for such an index. The possible maps such that φ_1 leads to an isometry are simply permutations of one another, and it is easy to check that none of them verifies P.2. We generalize the definition (3.1) of Fused Gromov-Wasserstein (FGW) to the continuous setting as follows:
Definition 3.5.7.
The Fused Gromov-Wasserstein distance is defined for α ∈ [0, 1] and p, q ≥ 1 as:

FGW_{α,p,q}(µ, ν) = ( inf_{π ∈ Π(µ,ν)} E_{p,q,α}(π) )^{1/p}   (3.20)

where:

E_{p,q,α}(π) = ∫∫ ( (1 − α) d(a, b)^q + α |d_X(x, x') − d_Y(y, y')|^q )^p dπ((x, a), (y, b)) dπ((x', a'), (y', b'))

We will write in the following L(x, y, x', y') = |d_X(x, x') − d_Y(y, y')|.

Figure 3.10: Two structured objects with isometric structures and identical features that are not (II)-strongly isomorphic. The color of the nodes represents the node feature and each edge represents a distance of 1 between the connected nodes.

Figure 3.11: Illustration of Definition 3.5.7. The figure shows two structured objects (X × Ω, d_X, µ) and (Y × Ω, d_Y, ν). The feature space Ω is the common space for all features. The two metric spaces (X, d_X) and (Y, d_Y) represent the structures of the two structured objects; the similarity between all pair-to-pair distances of the structure points is measured by L(x, y, x', y'). µ and ν are the joint measures on the structure space and the feature space.

Note that this definition is coherent with the definition given in equation (3.1) when p = 1. For brevity we will simply write FGW instead of FGW_{α,p,q} when it is clear from the context. Many desirable properties arise from this definition. Among them, one can define a topology over the space of structured objects using the FGW distance to compare structured objects, in the same philosophy as for the Wasserstein and Gromov-Wasserstein distances. The definition also implies that FGW acts as a generalization of both the Wasserstein and Gromov-Wasserstein distances, with FGW achieving an interpolation between these two distances. More remarkably, the FGW distance also enjoys geodesic properties over the space of structured objects, allowing the definition of gradient flows. Before reviewing all these properties, we first compare FGW with GW and W (by assuming for now that FGW exists, which will be shown later in Theorem 3.5.1).
Proposition 3.5.1 (Comparison between FGW, GW and W). With previous notations:

• The following inequalities hold:

FGW_{α,p,q}(µ, ν) ≥ (1 − α) W_{pq}(µ_A, ν_B)^q   (3.21)
FGW_{α,p,q}(µ, ν) ≥ α GW_{pq}(µ_X, ν_Y)^q   (3.22)

• Let us suppose that the structure spaces (X, d_X), (Y, d_Y) are part of a single ground space (Z, d_Z) (i.e. X, Y ⊂ Z and d_X = d_Y = d_Z). We consider the Wasserstein distance between µ and ν for the distance on Z × Ω: d̃((x, a), (y, b)) = (1 − α) d(a, b) + α d_Z(x, y). Then:

FGW_{α,p,1}(µ, ν) ≤ 2 W_p(µ, ν).   (3.23)

Proof of this proposition can be found in Section 6.1.2. In particular, following this proposition, when the FGW distance vanishes then both the GW and W distances vanish, so that the structure and the features of the structured object are individually “the same” (with respect to their corresponding equivalence relations). However the converse is not necessarily true, as shown further in Section 3.5.3. In the following we establish some mathematical properties of the FGW distance. The first result relates to the existence of the
FGW distance and the topology of the space of structured objects. We prove that the FGW distance is indeed a distance with respect to the equivalence relation between structured objects defined in Definition 3.5.6, allowing us to derive a topology on S(Ω). The FGW distance has the following properties:
Theorem 3.5.1 (Metric properties). Let p, q ≥ 1, α ∈ ]0, 1[ and (µ, ν) ∈ S_{pq}(Ω) × S_{pq}(Ω). The functional π → E_{p,q,α}(π) always achieves an infimum π* in Π(µ, ν) s.t. FGW_{α,p,q}(µ, ν) = E_{p,q,α}(π*)^{1/p} < +∞. Moreover:

• FGW_{α,p,q} is symmetric and, for q = 1, satisfies the triangle inequality. For q > 1, the triangle inequality is relaxed by a factor 2^{q−1}.

• For α ∈ ]0, 1[, FGW_{α,p,q}(µ, ν) = 0 if and only if there exists a bijective function φ = (φ_1, φ_2) : supp(µ) → supp(ν) such that:

φ#µ = ν   (3.24)
∀ (x, a) ∈ supp(µ), φ_2(x, a) = a   (3.25)
∀ (x, a), (x', a') ∈ supp(µ), d_X(x, x') = d_Y(φ_1(x, a), φ_1(x', a'))   (3.26)

• If (µ, ν) are generalized labeled graphs then FGW_{α,p,q}(µ, ν) = 0 if and only if (X × Ω, d_X, µ) and (Y × Ω, d_Y, ν) are (II)-strongly isomorphic.

Proof of this theorem can be found in Section 6.1.3. The identity of indiscernibles is the most delicate part to prove and is based on using the Gromov-Wasserstein distance between the spaces X × Ω and
Y ×
Ω. The previous theorem states that
FGW is a distance over the space of generalized labeled graphs endowed with the (II)-strong isomorphism as equivalence relation defined in Definition 3.5.6. More generally, for arbitrary structured objects the equivalence relation is given by equations (3.24), (3.25) and (3.26). Informally, the invariants of FGW are structured objects that have both the same structure and the same features in the same place. Despite the fact that q = 1 leads to a proper metric, the case q = 2 can be computed more efficiently using a separability trick from [Peyré 2016], as seen in Section 3.3.3. Remark 3.5.2.
Note that the previous theorem actually proves Theorem 3.3.1. Indeed, consider µ = Σ_{i=1}^n h_i δ_{(x_i, a_i)} and ν = Σ_{j=1}^m g_j δ_{(y_j, b_j)} describing two labeled graphs as discussed in the previous section. Then µ and ν are generalized labeled graphs. Using Theorem 3.5.1, µ and ν are (II)-strongly isomorphic if and only if there exists a bijection between the supports that satisfies (P.1), (P.2) and (P.3). Since the supports are discrete, this is equivalent to the condition that n = m and that there exists a permutation σ ∈ S_n which satisfies the conditions of Theorem 3.3.1. The triangle inequality property of Theorem 3.3.1 derives directly from the triangle inequality of Theorem 3.5.1. There are some special cases where W and GW can be adapted to structured objects and can also be used to compare them. These cases result in different notions of equivalence, as described in the following discussion. Despite the appealing properties of both the Wasserstein and Gromov-Wasserstein distances, they fail at comparing structured objects by focusing only on the feature and structure marginals respectively. However, with some hypotheses, one could adapt these distances for structured objects.
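In the discrete setting of Remark 3.5.2, the objective E_{p,q,α} of Definition 3.5.7 becomes a finite double sum and can be evaluated directly. The following numpy sketch illustrates this (the function name and array layout are ours, not from the thesis); the FGW distance itself is the p-th root of the minimum of this quantity over all admissible couplings:

```python
import numpy as np

def fgw_objective(M, C1, C2, pi, alpha=0.5, p=1, q=1):
    """Discrete FGW objective E_{p,q,alpha}(pi).

    M[i, j]  : feature distance d(a_i, b_j)
    C1[i, k] : structure distance d_X(x_i, x_k)
    C2[j, l] : structure distance d_Y(y_j, y_l)
    pi       : coupling matrix with marginals h and g
    """
    # structure term |d_X(x_i, x_k) - d_Y(y_j, y_l)|^q, shape (n, m, n, m)
    L = np.abs(C1[:, None, :, None] - C2[None, :, None, :]) ** q
    # feature term d(a_i, b_j)^q, broadcast against the second pair of indices
    feat = M[:, :, None, None] ** q
    integrand = ((1 - alpha) * feat + alpha * L) ** p
    # double integral against pi (x) pi
    return np.einsum('ij,ijkl,kl->', pi, integrand, pi)
```

For two identical structured objects, the diagonal coupling gives a zero objective, in accordance with Theorem 3.5.1.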
Adapting Wasserstein: common structure space
If the structure spaces (X, d_X) and (Y, d_Y) are part of a same ground space (Z, d_Z), one can build a distance d̂ = d_Z ⊕ d between couples (x, a) and (y, b) and apply the Wasserstein distance. In this case, when the Wasserstein distance vanishes then the structured objects are equal in the sense µ = ν, which implies that µ and ν are de facto (II)-strongly isomorphic. This approach is closely related to the one discussed in [Thorpe 2017], where the authors define the Transportation L^p distance for signal analysis purposes. Their approach can be viewed as a transport between two joint measures:

µ(X × Ω) = L({(x, f(x)) | x ∈ X ⊂ Z = R^d, f(x) ∈ Ω ⊂ R^m})   (3.27)
ν(Y × Ω) = L({(y, g(y)) | y ∈ Y ⊂ Z = R^d, g(y) ∈ Ω ⊂ R^m})   (3.28)

for functions f, g : Z → R^m representative of the signal values and L the Lebesgue measure. The distance for the transport is defined as d̂((x, f(x)), (y, g(y))) = α ||x − y||_p^p + ||f(x) − g(y)||_p^p for α > 0, with ||·||_p the l^p norm. In this case f(x) and g(y) can be interpreted as encoding the feature information of the signal while x, y encode its structure information. This approach is interesting but cannot be used on structured objects such as graphs that do not share a common structure embedding space.

Adapting Gromov-Wasserstein
The Gromov-Wasserstein distance can also be adapted to structured objects by considering the distances (1 − β) d_X ⊕ βd and (1 − β) d_Y ⊕ βd within each space X × Ω and Y × Ω respectively, for β ∈ ]0, 1[. When this GW distance vanishes, the structured objects are strongly isomorphic with respect to (1 − β) d_X ⊕ βd and (1 − β) d_Y ⊕ βd. However the (II)-strong isomorphism is stronger than this notion, since the strong isomorphism allows for “permuting the labels” while the (II)-strong isomorphism does not. More precisely, we have the following lemma:

Lemma 3.5.1.
Let (X × Ω, d_X, µ), (Y × Ω, d_Y, ν) be two structured objects and β ∈ ]0, 1[. If (X × Ω, d_X, µ) and (Y × Ω, d_Y, ν) are (II)-strongly isomorphic then (X × Ω, (1 − β) d_X ⊕ βd, µ) and (Y × Ω, (1 − β) d_Y ⊕ βd, ν) are strongly isomorphic. However the converse is not true in general.

Proof. To see this, if we consider φ as defined in Theorem 3.5.1, then for (x, a), (x', a') ∈ supp(µ) we have d_X(x, x') = d_Y(I(x), I(x')). In this way:

(1 − β) d_X(x, x') + β d(a, a') = (1 − β) d_Y(I(x), I(x')) + β d(a, a')   (3.29)

which can be rewritten as:

((1 − β) d_X ⊕ βd)((x, a), (x', a')) = ((1 − β) d_Y ⊕ βd)(φ(x, a), φ(x', a'))   (3.30)

and so φ is an isometry with respect to (1 − β) d_X ⊕ βd and (1 − β) d_Y ⊕ βd. Since φ is also measure preserving and surjective, (X × Ω, (1 − β) d_X ⊕ βd, µ) and (Y × Ω, (1 − β) d_Y ⊕ βd, ν) are strongly isomorphic. However the converse is not necessarily true, as it is easy to cook up an example with the same structure but with permuted labels, so that the objects are strongly isomorphic but not (II)-strongly isomorphic. For example, in the tree example depicted in Figure 3.10, the structures are isometric and the distances between the features within each space are the same for each structured object, so that (X × Ω, (1 − β) d_X ⊕ βd, µ) and (Y × Ω, (1 − β) d_Y ⊕ βd, ν) are strongly isomorphic, yet not (II)-strongly isomorphic as shown in the example since FGW > 0. The metric properties of
F GW naturally endow the structured object space with a notion of convergenceas described in the next definition:
Definition 3.5.8 (Convergence of structured objects). Let ((X_n × Ω, d_{X_n}, µ_n))_{n∈N} be a sequence of structured objects. It converges to (X × Ω, d_X, µ) in the Fused Gromov-Wasserstein sense if:

lim_{n→∞} FGW_{α,p,1}(µ_n, µ) = 0   (3.31)

We consider in this definition only the case q = 1 as it gives a proper metric (for q > 1 the triangle inequality only holds up to a factor 2^{q−1}). Using Prop. 3.5.1, it is straightforward to see that if the sequence converges in the FGW sense, both the features and the structure converge respectively in the Wasserstein and Gromov-Wasserstein sense. An interesting question arises from this definition. If we consider a structured object (
X × Ω, d_X, µ) and if we sample the joint distribution so as to consider ({(x_i, a_i)}_{i∈{1,..,n}}, d_X, µ_n)_{n∈N} with µ_n = (1/n) Σ_{i=1}^n δ_{(x_i, a_i)}, where the (x_i, a_i) ∈ X × Ω are sampled from µ, does this sequence converge to (X × Ω, d_X, µ) in the FGW sense, and how fast is the convergence? This question can be answered thanks to a notion of “size” of a probability measure. For the sake of conciseness we will not present the theory exhaustively, but the reader can refer to [Weed 2017] for more details. Given a measure µ on Ω we denote as dim*_p(µ) its upper Wasserstein dimension. It coincides with the intuitive notion of “dimension” when the measure is sufficiently well behaved. For example, for any measure µ absolutely continuous with respect to the Lebesgue measure on [0, 1]^d, we have dim*_p(µ) = d for any p ∈ [1, d/2]. Using this definition and the results in [Weed 2017], we can answer the question of convergence of finite samples in the following proposition (proof can be found in Section 6.1.4):

Theorem 3.5.2 (Convergence of finite samples and a concentration inequality). With previous notations, let p ≥ 1. We have:

lim_{n→∞} FGW_{α,p,1}(µ_n, µ) = 0   (3.32)

Moreover, suppose that s > dim*_p(µ). Then there exists a constant C that does not depend on n such that:

E[FGW_{α,p,1}(µ_n, µ)] ≤ C n^{−1/s}.   (3.33)

The expectation is taken over the i.i.d. samples (x_i, a_i). A particular case of this inequality is α = 1, so that we can use the result above to derive a concentration result for the Gromov-Wasserstein distance. More precisely, if ν_n = (1/n) Σ_i δ_{x_i} denotes the empirical measure of ν ∈ P(X) and if s > dim*_p(ν), we have:

E[GW_p(ν_n, ν)] ≤ C n^{−1/s}.
(3.34)

This result is a simple application of the convergence of finite samples properties of the Wasserstein distance, since in this case µ_n and µ are part of the same ground space, so that equation (3.33) derives naturally from equation (3.23) and the properties of the Wasserstein distance. In contrast to the Wasserstein case this inequality is not necessarily sharp, and future work will be dedicated to the study of its tightness. The FGW distance is a generalization of both the Wasserstein and Gromov-Wasserstein distances in the sense that it achieves an interpolation between them. More precisely, we have the following result:
Proposition 3.5.2 (Interpolation properties). As α tends to zero, one recovers the Wasserstein distance between the feature information, and as α goes to one, one recovers the Gromov-Wasserstein distance between the structure information:

lim_{α→0} FGW_{α,p,q}(µ, ν) = (W_{pq}(µ_A, ν_B))^q   (3.35)
lim_{α→1} FGW_{α,p,q}(µ, ν) = (GW_{pq}(µ_X, ν_Y))^q   (3.36)

Proof of this proposition can be found in Section 6.1.5. This result shows that FGW can revert to either of the other distances and thus acts as a generalization of the Wasserstein and Gromov-Wasserstein distances, as claimed in the discrete case section.
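For p = 1 the interpolation is already visible at the level of the objective: by linearity of the integral, for any fixed coupling π, E_{1,q,α}(π) = (1 − α) ∫ d(a, b)^q dπ + α ∫∫ L^q dπ dπ, i.e. an exact linear interpolation between a Wasserstein-type feature term and a Gromov-Wasserstein-type structure term. A small numpy check of this identity on a random discrete example (variable names are ours):

```python
import numpy as np

# For p = 1 and a fixed coupling pi, the discrete FGW objective splits linearly:
#   E_{1,q,alpha}(pi) = (1 - alpha) * feature_term + alpha * structure_term
rng = np.random.default_rng(1)
n, m, q = 4, 5, 2
M = rng.random((n, m))                      # feature distances d(a_i, b_j)
C1 = rng.random((n, n)); C1 = C1 + C1.T     # structure distances d_X
C2 = rng.random((m, m)); C2 = C2 + C2.T     # structure distances d_Y
pi = np.outer(np.full(n, 1 / n), np.full(m, 1 / m))  # any admissible coupling

feat = (pi * M ** q).sum()                                  # Wasserstein-type term
L = np.abs(C1[:, None, :, None] - C2[None, :, None, :]) ** q
struct = np.einsum('ij,ijkl,kl->', pi, L, pi)               # GW-type term

for alpha in (0.0, 0.3, 0.7, 1.0):
    E = np.einsum('ij,ijkl,kl->', pi,
                  (1 - alpha) * M[:, :, None, None] ** q + alpha * L, pi)
    assert np.isclose(E, (1 - alpha) * feat + alpha * struct)
```

Note that this linearity holds for a fixed coupling only; the optimal coupling itself changes with α, which is why the limits in Proposition 3.5.2 require a proof.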
One desirable property in OT is the underlying geodesics defined by the mass transfer between two probability distributions. These properties are useful in order to define dynamic formulations of OT problems. This dynamic point of view is inspired by fluid dynamics and finds its origin in the Wasserstein context with [Benamou 2000]. Various applications in machine learning can be derived from this formulation: interpolation along geodesic paths was used in computer graphics for color or illumination interpolations [Bonneel 2011]. More recently, [Chizat 2018] used Wasserstein gradient flows in an optimization context, deriving global minima results for non-convex particle gradient descent. In [Zhang 2018] the authors used Wasserstein gradient flows in the context of reinforcement learning for policy optimization. The main idea of this dynamic formulation is to describe the optimal transport problem between two measures as a curve in the space of measures minimizing its total length. We first describe some generalities about geodesic spaces and recall classical results for the dynamic formulation in both the Wasserstein and Gromov-Wasserstein contexts. In a second part, we derive new geodesic properties in the
F GW context.
Geodesic spaces
Let (X, d_X) be a metric space and x, y two points in X. We say that a curve w : [0, 1] → X joining the endpoints x and y (i.e. with w(0) = x and w(1) = y) is a constant speed geodesic if it satisfies d_X(w(t), w(s)) ≤ |t − s| d_X(w(0), w(1)) = |t − s| d_X(x, y) for t, s ∈ [0, 1]. If (X, d_X) is a length space (i.e. if the distance between two points of X is equal to the infimum of the lengths of the curves connecting these two points) then the converse is also true, and a constant speed geodesic satisfies d_X(w(t), w(s)) = |t − s| d_X(x, y). It is easy to compute distances along such curves as they are directly embedded into R. In the Wasserstein context, if the ground space is a complete, separable, locally compact length space and if the endpoints of the geodesic are given, then there exists a geodesic curve. Moreover, if the transport between the endpoints is unique then there is a unique displacement interpolation between the endpoints (see Corollaries 7.22 and 7.23 in [Villani 2008]). For example, if the ground space is R^d and the distance between the points is measured via the Euclidean norm, then the geodesics exist and are uniquely determined (note that this can be generalized to costs of the form c(x, y) = h(y − x) where h is strictly convex). In the Gromov-Wasserstein context, there always exist constant speed geodesics as long as the endpoints are given. These geodesics are unique modulo strong isomorphisms (see [Sturm 2012]). The
F GW case
In this paragraph, we suppose that Ω = R^d. We are interested in finding a geodesic curve in the space of structured objects, i.e. a constant speed curve of structured objects joining two structured objects. As for Wasserstein and Gromov-Wasserstein, the structured object space endowed with the Fused Gromov-Wasserstein distance maintains some geodesic properties. The following result proves the existence of such a geodesic and characterizes it:

Theorem 3.5.3 (Constant speed geodesic). Let p ≥ 1 and (X × Ω, d_X, µ_0) and (Y × Ω, d_Y, µ_1) in S_p(R^d). Let π* be an optimal coupling for the Fused Gromov-Wasserstein distance between µ_0 and µ_1, and let t ∈ [0, 1]. We equip R^d with the l^m norm for m ≥ 1. We define η_t : X × Ω × Y × Ω → X × Y × Ω such that:

∀ ((x, a), (y, b)) ∈ X × Ω × Y × Ω, η_t(x, a, y, b) = (x, y, (1 − t) a + t b)   (3.37)

Then:

(X × Y × Ω, (1 − t) d_X ⊕ t d_Y, µ_t = η_t#π*)_{t∈[0,1]}   (3.38)

is a constant speed geodesic connecting (X × Ω, d_X, µ_0) and (Y × Ω, d_Y, µ_1) in the metric space (S_p(R^d), FGW_{α,p,1}).

Proof of the previous theorem can be found in Section 6.1.6. In a sense this result combines the geodesics in the Wasserstein space and in the space of all mm-spaces, since it suffices to interpolate the distances in the structure space and the features to construct a geodesic. The main interest is that it defines the minimum path between two structured objects. For example, considering two discrete structured objects represented by the measures µ_0 = Σ_{i=1}^n h_i δ_{(x_i, a_i)} and µ_1 = Σ_{j=1}^m g_j δ_{(y_j, b_j)}, the interpolation path is given for t ∈ [0,
1] by the measure µ_t = Σ_{i=1}^n Σ_{j=1}^m π*(i, j) δ_{(x_i, y_j, (1−t) a_i + t b_j)}, where π* is an optimal coupling for the FGW distance. However this geodesic is difficult to handle in practice since it requires the computation of the Cartesian product X × Y. The Fréchet mean defined in Section 3.3.2 seems to be more suited in practice. The proper definition and properties of the velocity fields associated to this geodesic are postponed to further work.

Discussion and conclusion

Countless problems in machine learning involve structured data, usually stressed in light of the graph formalism. We consider here labeled graphs enriched by a histogram, which naturally leads to representing structured data as probability measures in the joint space of their features and structures. Widely known for their ability to meaningfully compare probability measures, transportation distances are generalized in this chapter so as to be suited to the context of structured data, motivating the so-called Fused
Gromov-Wasserstein distance. We theoretically prove that it indeed defines a distance on structured data, and consequently on graphs of arbitrary sizes.
FGW provides a natural framework for the analysis of labeled graphs, as we demonstrate on classification, where it reaches and most of the time surpasses state-of-the-art performances, and in graph-based k-means, where we develop a novel approach to represent the cluster centroids using a barycentric formulation of FGW. We believe that this metric can have a significant impact on challenging graph signal analysis problems. While we considered a unique measure of distance between nodes in the graph structure (shortest path), other choices could be made with respect to the problem at hand, or eventually learned in an end-to-end manner. The same applies to the distance between features. We also envision a potential use of this distance in deep learning applications where a distance between graphs is needed (such as graph auto-encoders). Another line of work will also try to lower the computational complexity of the underlying optimization problem to ensure better scalability to very large graphs.

Chapter 4

The Gromov-Wasserstein problem in Euclidean spaces

“Is it all right if I go out there?”
“Sure,” Thomas Hudson had told him. “But it's rugged from now on until spring and spring isn't easy.”
“I want it to be rugged,” Roger had said. “I am going to start new again.”
“How many times is it now you've started new?”
“Too many,” Roger had said. “And you don't have to rub it in.”
– Ernest Hemingway,
Islands in the Stream
Summary of the contributions
This chapter is based on the paper [Vayer 2019b] and addresses the problem of GW in Euclidean spaces. Recently used in various machine learning contexts, the Gromov-Wasserstein distance allows for comparing distributions whose supports do not necessarily lie in the same metric space. However, this optimal transport distance requires solving a complex non-convex quadratic program which is most of the time very costly both in time and memory. Contrary to GW, the Wasserstein distance enjoys several properties (e.g. duality) that permit large scale optimization. Among those, the solution of W on the real line, which only requires sorting discrete samples in 1D, allows defining the Sliced Wasserstein (SW) distance. The first part of this chapter presents a new divergence based on GW akin to SW. More precisely, the contributions are the following:

• We derive the first closed-form solution for GW when dealing with discrete 1D distributions, based on a new result for the related quadratic assignment problem (Theorem 4.1.1 and Theorem 4.1.2).
• Based on this result, we define a novel OT discrepancy which can deal with large scale distributions via a slicing approach, and we show how it relates to the GW distance while being O(n log(n)) to compute.
• We illustrate the behavior of this so-called Sliced Gromov-Wasserstein (SGW) discrepancy in experiments where we demonstrate its ability to tackle similar problems as GW while being several orders of magnitude faster to compute.

The second part of this chapter is more prospective and tackles the problem of probability distributions whose supports lie on Euclidean spaces with, potentially, different dimensions. This part investigates the regularity of the GW optimal transport plan in the cases of inner product similarities and Euclidean distances.
The contributions of this part are, in summary:

• We show that the GW problem in Euclidean spaces is equivalent to jointly solving a linear transportation problem and an “alignment” problem (Theorem 4.2.1 and Theorem 4.2.5).
• We give necessary conditions under which a GW optimal transport plan is supported on a deterministic function (Theorem 4.2.3 and Proposition 4.2.4). This allows deriving a closed-form expression for GW with inner product similarities between 1D probability distributions (not necessarily discrete, see Theorem 4.2.4).
• We study the Gromov-Monge problem in Euclidean spaces, and in particular the linear Gromov-Monge problem, for which we exhibit a closed-form expression between Gaussian distributions (Theorem 4.2.6).
As described in Chapter 2, the linear optimal transport problem aims at defining ways to compare probability distributions, through e.g. the Wasserstein distance. It has proved to be very useful for a wide range of machine learning tasks including generative modelling (Wasserstein GANs [Arjovsky 2017]), domain adaptation [Courty 2017] or supervised embeddings for classification purposes [Huang 2016]. However, one limitation of this approach is that it implicitly assumes aligned distributions, i.e. distributions that lie in the same metric space, or at least between spaces where a meaningful distance across domains can be computed. From another perspective, the Gromov-Wasserstein distance benefits from more flexibility when it comes to the more challenging scenario where heterogeneous distributions are involved, i.e. distributions whose supports do not necessarily lie in the same metric space. It only requires modelling the topological or relational aspects of the distributions within each domain in order to compare them. As such, it has recently received high interest in the machine learning community, solving learning tasks such as heterogeneous domain adaptation [Yan 2018], deep metric alignment [Ezuz 2017], graph classification (see Chapter 3 for more details) or generative modelling [Bunne 2019].
OT is known to be a computationally difficult problem: the Wasserstein distance involves a linear program that most of the time prevents its use in settings with more than a few tens of thousands of points. For medium to large scale problems, some methods relying e.g. on entropic regularization or dual formulations (as seen in Chapter 2) have been investigated in the past years. Among them, one builds upon the mono-dimensional case, where computing the Wasserstein distance can be trivially solved in O(n log n) by sorting points in order and pairing them from left to right. While this 1D case has a limited interest per se, it is one of the main ingredients of the Sliced
Wasserstein distance [Rabin 2011]: high-dimensional data are linearly projected onto sets of mono-dimensional distributions, the Sliced Wasserstein distance being the average of the Wasserstein distances between all projected measures. This framework provides an efficient algorithm that can handle millions of points and has similar properties to the Wasserstein distance [Bonnotte 2013]. As such, it has attracted attention and has been successfully used in various tasks such as barycenter computation [Bonneel 2015], classification [Kolouri 2016] or generative modeling [Kolouri 2019b, Deshpande 2018, Liutkus 2019, Wu 2019]. Regarding GW, the optimization problem is a non-convex quadratic program with a prohibitive computational cost for problems with more than a few thousands of points: the number of terms grows quadratically with the number of samples and one cannot rely on a dual formulation as for Wasserstein. However, several approaches have been proposed to tackle its computation. Initially approximated by a linear lower bound, GW was thereafter estimated through an entropy-regularized version that can be efficiently computed by iterating Sinkhorn projections (see Chapter 2) or using a conditional gradient scheme relying on linear program OT solvers (see Chapter 3). However, all these methods are still too costly for large scale scenarios. In this section, we propose a new formulation related to GW that lowers its computational cost. To that extent, we derive a novel OT discrepancy called Sliced Gromov-Wasserstein (SGW). It is similar in spirit to the Sliced Wasserstein distance as it relies on the exact computation of 1D GW distances of distributions projected onto random directions. We notably provide the first 1D closed-form solution of the GW problem by proving a new result about the Quadratic Assignment Problem for matrices that are squared Euclidean distances of real numbers.
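The 1D sort-based solution mentioned above can be sketched as follows (a hypothetical helper for two uniform empirical measures with equally many points, not code from the thesis):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two uniform empirical measures on R with the same number
    of points: sort both samples and pair them from left to right."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    # optimal 1D assignment pairs the k-th smallest of x with the k-th smallest of y
    return np.mean(np.abs(x - y) ** p) ** (1 / p)
```

For instance, wasserstein_1d([0, 2], [1, 3], p=1) pairs 0 with 1 and 2 with 3, giving 1.0.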
Computation of SGW for discrete distributions of n points is O(Ln log(n)), where L is the number of sampled directions. This complexity is the same as for the Sliced Wasserstein distance and is even lower than computing the value of GW, which is O(n^2) for a known coupling (once the optimization problem is solved) in the general case [Peyré 2016]. Experimental validation shows that SGW retains various properties of GW while being much cheaper to compute, allowing its use in difficult large scale settings such as large mesh matching or generative adversarial networks. We first provide and prove a solution for a 1D Quadratic Assignment Problem with a quasilinear time complexity of O(n log(n)). This new special case of the QAP is shown to be equivalent to the hard assignment version of GW, called the Gromov-Monge (GM) problem, with squared Euclidean cost for distributions lying on the real line. We also show that, in this context, solving GM is equivalent to solving GW. We derive a new discrepancy named Sliced Gromov-Wasserstein (SGW) that relies on these findings for efficient computation.
Solving a Quadratic Assignment Problem in 1D
In Koopmans-Beckmann form [Koopmans 1957], a QAP takes as input two n × n matrices A = (a_{ij}), B = (b_{ij}). The goal is to find a permutation σ ∈ S_n, the set of all permutations of [[n]], which minimizes the objective function Σ_{i,j=1}^n a_{i,j} b_{σ(i),σ(j)}. In full generality this problem is NP-hard (see Section 2.2.3 for more details). The following theorem is a new result about the QAP and states that it can be solved in polynomial time when A and B are squared Euclidean distance matrices of sorted real numbers:

Theorem 4.1.1 (A new special case for the Quadratic Assignment Problem). For real numbers x_1 < · · · < x_n and y_1 < · · · < y_n,

min_{σ ∈ S_n} Σ_{i,j} −(x_i − x_j)^2 (y_{σ(i)} − y_{σ(j)})^2   (4.1)

is achieved either by the identity permutation σ(i) = i (Id) or the anti-identity permutation σ(i) = n + 1 − i (anti-Id). In other words:

∃ σ ∈ {Id, anti-Id}, σ ∈ arg min_{σ' ∈ S_n} Σ_{i,j} −(x_i − x_j)^2 (y_{σ'(i)} − y_{σ'(j)})^2   (4.2)

To the best of our knowledge, this result is new. It states that if one wants to find the best one-to-one correspondence of real numbers such that their pairwise distances are best conserved, it suffices to sort the points and check whether the identity has a better cost than the anti-identity. Proof of this theorem can be found in Section 6.2.1. We postulate that this result also holds for a_{ij} = |x_i − x_j|^k and b_{ij} = −|y_i − y_j|^k for any k ≥ 1.

Gromov-Wasserstein distance on the real line
When n = m and a_i = b_j = 1/n, one can look for the hard assignment version of the GW distance, resulting in the Gromov-Monge problem [Mémoli 2018] associated with the following GM distance:

GM(c_X, c_Y, µ, ν) = min_{σ ∈ S_n} (1/n^2) Σ_{i,j} |c_X(x_i, x_j) − c_Y(y_{σ(i)}, y_{σ(j)})|^2   (4.3)

where σ ∈ S_n is a one-to-one mapping [[n]] → [[n]]. Interestingly, when the permutation σ is known, the computation of the cost is O(n^2), which is far better than O(n^4) for the general GW case. It is easy to see that this problem is equivalent to minimizing Σ_{i,j=1}^n a_{i,j} b_{σ(i),σ(j)} with a_{ij} = c_X(x_i, x_j) and b_{ij} = −c_Y(y_i, y_j). Indeed we have:

Σ_{i,j} |c_X(x_i, x_j) − c_Y(y_{σ(i)}, y_{σ(j)})|^2
= Σ_{i,j} c_X(x_i, x_j)^2 + Σ_{i,j} c_Y(y_{σ(i)}, y_{σ(j)})^2 − 2 Σ_{i,j} c_X(x_i, x_j) c_Y(y_{σ(i)}, y_{σ(j)})
= Σ_{i,j} c_X(x_i, x_j)^2 + Σ_{i,j} c_Y(y_i, y_j)^2 − 2 Σ_{i,j} c_X(x_i, x_j) c_Y(y_{σ(i)}, y_{σ(j)})

so that only the term −2 Σ_{i,j} c_X(x_i, x_j) c_Y(y_{σ(i)}, y_{σ(j)}) depends on σ. Thus, when squared Euclidean costs are used for distributions lying on the real line, Theorem 4.1.1 exactly recovers the solution of the GM problem defined in equation (4.3). As a matter of consequence, Theorem 4.1.1 provides an efficient way of solving the Gromov-Monge problem. Moreover, this theorem also allows finding a closed form for the GW distance. Indeed, some recent advances in graph matching state that, under some conditions on A and B, the assignment problem is equivalent to its soft-assignment counterpart [Maron 2018]. This way, using both Theorem 4.1.1 and [Maron 2018], one can find a solvable case for the GW distance as stated in the following theorem:

Theorem 4.1.2 (Equivalence between GW and GM for discrete measures).
Let µ ∈ P(R^p), ν ∈ P(R^q) be discrete probability measures with the same number of atoms and uniform weights, i.e. µ = (1/n) ∑_{i=1}^n δ_{x_i}, ν = (1/n) ∑_{i=1}^n δ_{y_i} with x_i ∈ R^p, y_i ∈ R^q. For x ∈ R^p we note ‖x‖_{2,p} = √(∑_{i=1}^p |x_i|²) the ℓ₂ norm on R^p (same for R^q). Let c_X(x, x') = ‖x − x'‖²_{2,p}, c_Y(y, y') = ‖y − y'‖²_{2,q}. Then:

GW(c_X, c_Y, µ, ν) = GM(c_X, c_Y, µ, ν)  (4.4)

Moreover, when p = q = 1, i.e. c_X(x, x') = c_Y(x, x') = |x − x'|², and x_1 < · · · < x_n and y_1 < · · · < y_n, the optimal values are achieved by considering either the identity or the anti-identity permutation.

A detailed proof is provided in Section 6.2.2. Note also that, while both possible solutions for problem (4.3) can be computed in O(n log(n)), finding the best one requires the computation of the cost, which seems, at first sight, to have an O(n²) complexity. However, under the hypothesis of squared Euclidean distances, the cost can be computed in O(n). Indeed, in this case, one can develop the sum in equation (4.3) to compute it in O(n) operations using binomial expansion (see details in Section 6.2.3), so that the overall complexity of finding the best assignment and computing the cost is O(n log(n)), which is the same complexity as the Wasserstein distance for 1D distributions.

Sliced Gromov-Wasserstein discrepancy
Theorem 4.1.2 can be put in perspective with the Wasserstein distance for 1D distributions, which is achieved by the identity permutation when points are sorted [Peyré 2019]. As explained in Chapter 2, this result was used to approximate the Wasserstein distance between measures of R^q using the so-called Sliced Wasserstein (SW) distance [Bonneel 2015]. The main idea is to project the points of the measures on lines of R^q, where computing a Wasserstein distance is easy since it only involves a simple sort, and to average these distances. In the same philosophy, we build upon Theorem 4.1.2 to define a "sliced" version of the GW distance. In the following, we consider probability distributions µ ∈ P(R^p), ν ∈ P(R^q) (not necessarily discrete).

Let S^{q−1} = {θ ∈ R^q | ‖θ‖₂ = 1} be the q-dimensional hypersphere and λ_{q−1} the uniform measure on S^{q−1}. For θ we note P_θ the projection on θ, i.e. P_θ(x) = ⟨x, θ⟩. For a linear map ∆ ∈ R^{q×p} (identified with slight abuses of notation with its corresponding matrix), we define the Sliced Gromov-Wasserstein (SGW) discrepancy as follows:

SGW_∆(µ, ν) = E_{θ∼λ_{q−1}}[GW(P_{θ#}µ_∆, P_{θ#}ν)] = ∫_{S^{q−1}} GW(d², d², P_{θ#}µ_∆, P_{θ#}ν) dλ_{q−1}(θ)  (4.5)

where µ_∆ = ∆#µ ∈ P(R^q) and d² denotes the squared Euclidean cost on the real line. The function ∆ acts as a mapping of a point in R^p of the measure µ onto R^q. When p = q and when we consider ∆ as the identity map, we simply write SGW(µ, ν) instead of SGW_{I_p}(µ, ν). When p < q, one straightforward choice is ∆ = ∆_pad, the "uplifting" operator which pads each point of the measure with zeros: ∆_pad(x) = (x_1, . . . , x_p, 0, . . . , 0) with q − p zeros. The procedure is illustrated in Figure 4.1.

In general, fixing ∆ implies that some properties of GW, such as the rotational invariance, are lost. Consequently, we also propose a variant of SGW that does not depend on the choice of ∆, called Rotation Invariant SGW (RISGW) and expressed for p ≤ q as:

RISGW(µ, ν) = min_{∆ ∈ V_p(R^q)} SGW_∆(µ, ν).  (4.6)

Figure 4.1: Example of distributions in dimension p = 2 and q = 3 (left) that are projected on the line (right). The solution for this projection is the anti-diagonal coupling.

We propose to minimize
SGW_∆ with respect to ∆ in the Stiefel manifold V_p(R^q) [Absil 2009], which is defined as V_p(R^q) = {∆ ∈ R^{q×p} | ∆ᵀ∆ = I_p}. It can be seen as finding an optimal projector of the measure µ [Paty 2019, Deshpande 2019]. This formulation comes at the cost of an additional optimization step but allows recovering one key property of GW. When p = q this encompasses, e.g., all rotations of the space, making RISGW invariant by rotation. Interestingly enough,
SGW retains various properties of the GW distance, as summarized in the following theorem: Theorem 4.1.3 (Properties of
SGW ) . • For all ∆ , SGW ∆ and RISGW are translation invariant.
RISGW is also rotation invariant when p = q; more precisely, if Q ∈ O(p) is an orthogonal matrix, RISGW(Q#µ, ν) = RISGW(µ, ν) (same for any Q ∈ O(q) applied on ν). • SGW and
RISGW are pseudo-distances on P(R^p), i.e. they are symmetric, satisfy the triangle inequality and SGW(µ, µ) = RISGW(µ, µ) = 0. • Let µ, ν ∈ P(R^p) × P(R^p) be probability distributions with compact supports. If SGW(µ, ν) = 0 then µ and ν are isomorphic for the cost induced by the squared ℓ₂ norm on R^p, i.e. d(x, x') = ∑_{i=1}^p |x_i − x'_i|² for (x, x') ∈ R^p × R^p. In particular this implies:

SGW(µ, ν) = 0 ⟹ GW(d, d, µ, ν) = 0  (4.7)

(with a slight abuse of notation we identify the matrix Q with its linear application). A proof of this theorem can be found in Section 6.2.4. This theorem states that if SGW vanishes then the measures must be isomorphic, as is the case for GW. It also states that RISGW retains most of the properties of GW in terms of invariance. Remark 4.1.1.
The ∆ map can also be used in the context of the Sliced Wasserstein distance, so as to define SW_∆(µ, ν) and RISW(µ, ν) for µ, ν ∈ P(R^p) × P(R^q) with p ≠ q. Please note that, from a purely computational point of view, the complexities of these discrepancies are the same as SGW and
RISGW when µ and ν are discrete measures with the same number of atoms n = m , and uniform weights. Also,unlike SGW and
RISGW, these discrepancies are not translation invariant. This approach was studied in [Lai 2014] for the case p = q in the context of point cloud registration. More details are given in Section 6.2.5.

Algorithm 6
Sliced Gromov-Wasserstein for discrete measures
Inputs: p < q, µ = (1/n) ∑_{i=1}^n δ_{x_i} ∈ P(R^p) and ν = (1/n) ∑_{j=1}^n δ_{y_j} ∈ P(R^q)
  ∀i, x_i ← ∆(x_i); sample uniformly (θ_l)_{l=1,...,L} ∈ S^{q−1}
  for l = 1, . . . , L do
    Sort (⟨x_i, θ_l⟩)_i and (⟨y_j, θ_l⟩)_j in increasing order
    Solve (4.3) for the real numbers (⟨x_i, θ_l⟩)_i and (⟨y_j, θ_l⟩)_j to get σ_{θ_l} (anti-Id or Id is a solution)
  end for
  return (1/(Ln²)) ∑_{l=1}^L ∑_{i,k=1}^n (⟨x_i − x_k, θ_l⟩² − ⟨y_{σ_{θ_l}(i)} − y_{σ_{θ_l}(k)}, θ_l⟩²)²

Computational aspects
In the following, µ, ν are discrete measures with the same number of atoms n = m and uniform weights, i.e. µ = (1/n) ∑_{i=1}^n δ_{x_i}, ν = (1/n) ∑_{i=1}^n δ_{y_i} with x_i ∈ R^p, y_i ∈ R^q, so that we can apply Theorem 4.1.2. Similarly to Sliced Wasserstein, SGW can be approximated by replacing the integral by a finite sum over randomly drawn directions. In practice we compute
SGW as the average of GW projected on L directions θ. While the sum in (4.5) can be implemented with libraries such as PyKeOps [Charlier 2018], Theorem 4.1.2 shows that computing (4.5) is achieved by an O(n log(n)) sorting of the projected samples and by finding the optimal permutation, which is either the identity or the anti-identity. Moreover, computing the cost is O(n) for each projection, as explained previously. Thus the overall complexity of computing SGW with L projections is O(Ln(p + q) + Ln log(n) + Ln) = O(Ln(p + q + log(n))) when taking into account the cost of projections. The pseudo-code for SGW is presented in Algorithm 6. Note that these computations can be efficiently implemented in parallel on GPUs with modern toolkits such as PyTorch [Paszke 2017]. The complexity of solving
RISGW is higher, but one can rely on efficient algorithms for optimizing on the Stiefel manifold [Absil 2009] that have been implemented in several toolboxes [Townsend 2016, Meghwanshi 2018]. Note that each iteration of a manifold gradient descent requires the solution of
SGW, which can be computed and differentiated efficiently with the frameworks described above. Moreover, the optimization over the Stiefel manifold does not depend on the number of points but only on the dimension d of the problem, so that the overall complexity is of order n_iter(Ln(d + log(n)) + poly(d)), which is affordable for small d. In practice, we observed in the numerical experiments that RISGW converges in a few iterations (on the order of 10). The goal of this section is to validate
SGW and its rotation-invariant variant on both quantitative (execution time) and qualitative sides. All the experiments were conducted on a standard computer equipped with an NVIDIA Titan X GPU.
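To make Algorithm 6 concrete, here is a minimal NumPy sketch of the SGW estimator (our own illustration, not the thesis implementation; for clarity it evaluates each 1D cost naively in O(n²) rather than with the O(n) binomial-expansion trick, and assumes uniform weights with n = m):

```python
import numpy as np

def sgw(X, Y, L=50, seed=0):
    """Monte-Carlo Sliced Gromov-Wasserstein between two uniform discrete
    measures with n points each, X in R^p and Y in R^q with p <= q.
    X is zero-padded to dimension q (the Delta_pad uplifting)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    _, q = Y.shape
    Xp = np.hstack([X, np.zeros((n, q - p))])                # Delta_pad(x)
    thetas = rng.normal(size=(L, q))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # uniform on S^{q-1}

    total = 0.0
    for theta in thetas:
        xt = np.sort(Xp @ theta)                             # sorted projections
        yt = np.sort(Y @ theta)
        costs = []
        for y_cand in (yt, yt[::-1]):                        # Id and anti-Id candidates
            dx = (xt[:, None] - xt[None, :]) ** 2
            dy = (y_cand[:, None] - y_cand[None, :]) ** 2
            costs.append(np.mean((dx - dy) ** 2))
        total += min(costs)
    return total / L
```

In line with Theorem 4.1.3, sgw(X, X) returns 0 and the value is unchanged under translations of either input.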
SGW and RISGW on spiral dataset
As a first example, we use the spiral dataset from the sklearn toolbox and compute GW, SGW and
RISGW on n = 100 samples with L = 20 sampled lines for different rotations of the target distribution. The optimization of ∆ on the Stiefel manifold is performed using Pymanopt [Townsend 2016] with automatic differentiation with autograd [Maclaurin 2015]. Some examples of empirical distributions are available in Figure 4.2 (left). The mean values of GW, SGW and
RISGW are reported on Figure 4.2 (right) where we can see that
RISGW is invariant to rotation as GW whereas SGW with ∆ = I p is clearly not. Figure 4.2: Illustration of
SGW, RISGW and GW for varying rotations on the discrete 2D spiral dataset. (left) Examples of spiral distributions for source and target with different rotations. (right)
Average value of
SGW, GW and RISGW with L = 20 as a function of the rotation angle of the target. Colored areas correspond to the 20th and 80th percentiles.

Runtimes comparison
We perform a comparison between runtimes of
SGW, GW and its entropic counterpart [Solomon 2016]. We calculate these distances between two 2D random measures of n points, with n ranging on a log scale up to 10⁶. For SGW, results are reported for two numbers of projections L. We use the Python Optimal Transport (POT) toolbox [Flamary 2017] to compute the GW distance on CPU. For entropic-GW we use the PyTorch GPU implementation from [Bunne 2019] that uses the log-stabilized Sinkhorn algorithm [Schmitzer 2016] with a regularization parameter ε = 100. For SGW, we implemented both a NumPy version and a PyTorch version running on GPU. Figure 4.3 illustrates the results.
SGW is the only method which scales w.r.t. the number of samples and allows computation for the largest values of n. While entropic-GW uses the GPU, it is still slow because the gradient step size in the algorithm is inversely proportional to the regularization parameter [Peyré 2016], which highly curtails the convergence of the method. On CPU, SGW is two orders of magnitude faster than GW. On GPU, SGW is five orders of magnitude faster than GW and four orders of magnitude faster than entropic-GW. Still, the slopes of both GW implementations are surprisingly good, probably due to their maximum-iteration stopping criteria. In this experiment we were able to compute SGW between 10⁶ points in 1 s. Finally, note that we recover exactly a quasi-linear slope, corresponding to the O(n log(n)) complexity of SGW.

Meshes comparison
In the context of computer graphics, GW can be used to quantify the correspondences between two meshes. A direct interest is found in shape retrieval, search, exploration or organization of databases. In order to recover experimentally some of the desired properties of the GW distance, we reproduce an experiment originally conducted in [Rustamov 2013] and presented in [Solomon 2016] with the use of entropic-GW.

From a given time series of 45 meshes representing a galloping horse, the goal is to conduct a multidimensional scaling (MDS) of the pairwise distances, computed with SGW between the meshes, that allows plotting each mesh as a 2D point. As one can observe in Figure 4.4, the cyclical nature of this motion is recovered in this 2D plot, as already illustrated in [Solomon 2016] with the GW distance. Each horse mesh is composed of approximately 9,000 vertices. The average time for computing one distance is around 30 minutes using the POT implementation, which makes the computation of the full pairwise distance matrix impractical (as already mentioned in [Solomon 2016]). In contrast, our method only requires 25 minutes to compute the full distance matrix, with an average of 1.5 s per mesh pair, using our CPU implementation. This clearly highlights the benefits of our method in this case.
Figure 4.3: Runtimes comparison between
SGW, GW and entropic-GW between two 2D random distributions with a varying number of points (up to 10⁶), in log-log scale. The time includes the calculation of the pair-to-pair distances.

SGW as a generative adversarial network (GAN) loss
In a recent paper [Bunne 2019], Bunne and colleagues propose a new variant of GAN between incomparable spaces, i.e. of different dimensions. In contrast with classical divergences such as Wasserstein, they suggest to capture the intrinsic relations between the samples of the target probability distribution by using GW as a loss for learning. More formally, this translates into the following optimization problem over a desired generator G:

G* = arg min_G GW(c_X, c_{G(Z)}, µ, ν_G),  (4.8)

where Z is a random noise following a prescribed low-dimensional distribution (typically Gaussian), G(Z) performs the uplifting of Z in the desired dimensional space, and c_{G(Z)} is the corresponding metric. µ and ν_G correspond respectively to the target and generated distributions, which we want to align in the sense of GW. Following the same idea, and the fact that sliced variants of the Wasserstein distance have been successfully used in the context of GANs [Deshpande 2018], we propose to use SGW instead of GW as a loss for learning G. As a proof of concept, we reproduce the simple toy examples of [Bunne 2019]. Those examples consist in generating 2D or 3D distributions from target distributions either in 2D or 3D spaces (Figure 4.5 and Figure 4.6). These distributions are formed by 3,000 samples. We do not use their adversarial metric learning as it might confuse the objectives of this experiment and as it is not required for these low-dimensional problems [Bunne 2019]. The generator G is designed as a simple multilayer perceptron with 2 hidden layers of respectively 256 and 128 units with ReLU activation functions, and one final layer with 2 or 3 output neurons (with linear activation), depending on the experiment. The Adam optimizer is used, with a learning rate of 2·10⁻⁴ and β₁ = 0.5, β₂ = 0.99. The convergence to a visually acceptable solution takes a few hundred epochs. Contrary to [Bunne 2019], we directly back-propagate through our loss, without having to make explicit a coupling matrix and resort to the envelope theorem. Compared to [Bunne 2019] and the use of entropic-GW, the time per epoch is more than one order of magnitude faster, as expected from the previous experiment.

Figure 4.4: Each sample in this figure corresponds to a mesh and is colored by the corresponding time iteration. One can see that the cyclical nature of the motion is recovered.

Figure 4.5: Using
SGW in a GAN loss. The first image shows the loss value along epochs. The next 4 images are produced by sampling the generated distribution (3,000 samples, plotted as a continuous density map). The last image shows the target 3D distribution.
In this section we establish a new result about the Quadratic Assignment Problem when the matrices are squared Euclidean distance matrices of points on the real line, and use it to state a closed-form expression for GW between monodimensional measures. Building upon this result, we define a new similarity measure called the Sliced Gromov-Wasserstein, together with a variant, Rotation-Invariant SGW, and prove that both conserve various properties of the GW distance while being cheaper to compute and applicable in a large-scale setting. Notably, SGW can be computed in 1 second for distributions with 1 million samples each. This paves the way for novel promising machine learning applications of optimal transport between metric spaces.
Figure 4.6: Using
SGW in a GAN loss. The three rows depict three different examples: the first row is 2D (generator) to 2D (target), the second 3D to 2D. The first column is the initialization, the second is at 100 epochs, the third at 1000. The last column depicts the target distribution.
Yet, several questions are raised in this work. Notably, our method perfectly fits the case when the two distributions are given empirically through samples embedded in a Hilbertian space, which allows for projection on the real line. This is the case in most of the machine learning applications that use the Gromov-Wasserstein distance. However, when only distances between samples are available, the projection operation cannot be carried out anymore, while the computation of GW is still possible. One can argue that it is possible to embed those distances into a Hilbertian space, either isometrically or at least with a low distortion, and then apply the presented technique. Our future line of work considers this option, as well as a possible direct reasoning on the distance matrix. For example, one should be able to consider geodesic paths (in a graph for instance) as the equivalent appropriate geometric object related to the line. This constitutes the direct follow-up of this work, as well as a better understanding of the accuracy of the estimated discrepancy with respect to the ambient dimension and the number of projections.

In the previous part we built upon the special case of 1D discrete probability measures. We consider in this section general probability measures µ ∈ P(R^p) and ν ∈ P(R^q) supported on Euclidean spaces R^p and R^q with (possibly) p ≠ q. The corresponding inner products are denoted by ⟨x, x'⟩_p (resp. ⟨y, y'⟩_q) for vectors in R^p (resp. R^q), associated with the Euclidean norms, which are both denoted by ‖·‖ to avoid overloading notations.

We tackle in this section the problem of the regularity of the optimal transport plans of GW in the Euclidean setting. More precisely we consider the following problem: Problem 1.
Let µ ∈ P(R^p), ν ∈ P(R^q). Can we find a deterministic transport map for the GW problem? More precisely, does the following statement hold? ∃ T : R^p → R^q such that T#µ = ν and γ_T = (id × T)#µ is optimal for the GW problem:

inf_{π ∈ Π(µ,ν)} ∬ |c_X(x, x') − c_Y(y, y')|² dπ(x, y) dπ(x', y')  (4.9)

The Euclidean setting is motivated by the linear transportation theory, where the first regularity result for Optimal Transport was proved by Brenier for probability measures in R^p (see Chapter 2). In this case, we recall that the optimal transport plan γ_T between two probability measures (µ, ν) ∈ P(R^p) × P(R^p), with µ "well behaved" and with cost c(x, y) = ‖x − y‖², is unique and supported by a map T such that γ_T = (id × T)#µ. The purpose of this section is to show that the Euclidean setting is also quite suited to the GW case. The problem of regularity of GW optimal transport plans was first addressed by Sturm in his seminal work about GW [Sturm 2012, Challenge 3.6]. More precisely, Sturm asks the following question: are there some "nice" spaces in which we are able to prove Brenier-like results for GW? We give in this section some partial answers to this query by considering the two cases where c_X, c_Y are defined by the inner products or by the squared Euclidean distances in each space.

The main result of this section is to derive equivalent formulations of GW in these two cases. More precisely, we show that solving GW is equivalent to jointly solving a linear transportation problem and an "alignment" problem. As such, the regularity of GW optimal plans can be observed in the light of these "dual problems". As another consequence, it also allows deriving algorithmic solutions for the GW problems in Euclidean spaces based on simple Block Coordinate Descent procedures.
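For discrete measures, the GW objective of (4.9) is a simple tensor contraction in the coupling. The following sketch (our own naming, naive O(n²m²) evaluation) makes explicit the quantity whose optimal plans are studied in this section:

```python
import numpy as np

def gw_loss(C1, C2, pi):
    """GW objective (4.9) for discrete measures:
    J(pi) = sum_{i,j,k,l} |C1[i,k] - C2[j,l]|^2 pi[i,j] pi[k,l]."""
    M = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2   # indexed (i, j, k, l)
    return np.einsum('ijkl,ij,kl->', M, pi, pi)

# Example: two identical 1D configurations with the identity coupling.
x = np.array([0.0, 1.0, 3.0])
C = np.abs(x[:, None] - x[None, :]) ** 2        # squared Euclidean costs
pi_id = np.eye(3) / 3                           # uniform identity coupling
print(gw_loss(C, C, pi_id))                     # 0.0 for a perfect matching
```

A deterministic transport map T, as asked in Problem 1, corresponds to a coupling matrix with a single nonzero entry per row, such as pi_id above.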
This section is organized as follows:
(i) In Section 4.2.2 we consider the case where c_X, c_Y are defined by the inner products in each space. Provided that the source probability measure is regular with respect to the Lebesgue measure, we give a sufficient condition for the existence of a deterministic optimal transport plan, i.e. one supported on a deterministic function T. We show that this function is of the form ∇u ◦ P, where u is a convex function and P is a linear application which can be seen as a global transformation "realigning" the probability measures in the same space. We use this formulation to show that the GW distance between 1D probability measures admits a closed-form solution. More precisely, we show that the optimal coupling is determined by the cumulative and the anti-cumulative distribution functions of the source distribution. We further discuss the difference between the linear OT problem W₂ and the GW problem when the target measure is a perturbed version of the source measure.
(ii) In Section 4.2.3 we consider c_X, c_Y as the squared Euclidean distances in each space. We show that this setting is equivalent to a maximization of a convex function on Π(µ, ν). We use the Fenchel-Legendre duality in the space of measures to derive a problem equivalent to that of Gromov-Wasserstein. We further analyse it and show that the regularity of optimal transport plans is more complicated to state than in the previous case.
(iii) In Section 4.2.4 we use the previous formulations to derive efficient numerical solutions for the GW problem based on Block Coordinate Descent. We show that these procedures compare favourably with standard solvers such as Conditional Gradient (see Chapter 3) or entropic regularization (see Chapter 2).
(iv) We conclude in Section 4.2.5 by considering the Gromov-Monge problem in Euclidean spaces, which is the exact counterpart of the Monge problem of linear transportation in the Gromov-Wasserstein context.
We discuss the special case of Gromov-Monge between Gaussian measures and we show that this problem admits a closed-form solution when restricting to linear push-forwards. We give geometric interpretations of this result and compare the optimal push-forward with the standard optimal map of linear OT theory in the case of Gaussian measures (see Chapter 2). This section is more prospective and somehow opens more doors than it closes. We hope that it will pave the way for further interesting works on this topic and believe that it could help bridge the
To encompass the two cases described in the introduction we consider the following lemma (a proof canbe found in Section 6.2.6):
Lemma 4.2.1.
For a coupling π ∈ Π(µ, ν) we note

J(c_X, c_Y, π) := ∫_{X×X} ∫_{Y×Y} |c_X(x, x') − c_Y(y, y')|² dπ(x, y) dπ(x', y')  (4.10)

the GW loss. Suppose that there exist scalars a, b, c such that c_X(x, x') = a‖x‖² + b‖x'‖² + c⟨x, x'⟩_p and c_Y(y, y') = a‖y‖² + b‖y'‖² + c⟨y, y'⟩_q. Then:

J(c_X, c_Y, π) = C_{µ,ν} − Z(π)  (4.11)

where C_{µ,ν} = ∬ c_X² dµ dµ + ∬ c_Y² dν dν − 4ab ∬ ‖x‖²‖y‖² dµ(x) dν(y) and:

Z(π) = 2(a² + b²) ∫ ‖x‖²‖y‖² dπ(x, y) + 2c² ‖∫ y xᵀ dπ(x, y)‖²_F + 2(a + b)c ∫ [‖x‖² ⟨E_{Y∼ν}[Y], y⟩_q + ‖y‖² ⟨E_{X∼µ}[X], x⟩_p] dπ(x, y)  (4.12)

In this section we study the GW problem with c_X = ⟨x, x'⟩_p and c_Y = ⟨y, y'⟩_q. This corresponds to the a, b = 0 and c = 1 case of Lemma 4.2.1. With a small abuse of notation we will denote by P ∈ R^{q×p} both the linear application P : R^p → R^q and its associated matrix. Moreover, we will often make no distinction between a vector x ∈ R^p and the associated matrix x ∈ R^{p×1}, so that xᵀ ∈ R^{1×p}. The next theorem gives an equivalent formulation of GW in this context:

Theorem 4.2.1 (Equivalence of GW for the inner product case). Let µ ∈ P(R^p), ν ∈ P(R^q) with ∫‖x‖² dµ(x) < +∞, ∫‖y‖² dν(y) < +∞. Suppose without loss of generality that p ≥ q and let:

F_{p,q} := {P ∈ R^{q×p} | ‖P‖_F = √p}  (4.13)

Then the problems:

inf_{π ∈ Π(µ,ν)} ∬ (⟨x, x'⟩_p − ⟨y, y'⟩_q)² dπ(x, y) dπ(x', y')  (innerGW)
sup_{π ∈ Π(µ,ν)} sup_{P ∈ F_{p,q}} ∫ ⟨Px, y⟩_q dπ(x, y)  (MaxOT)

are equivalent. In other words, π* ∈ Π(µ, ν) is an optimal solution of (innerGW) if and only if π* is an optimal solution of (MaxOT). Remark 4.2.1.
The condition ∫‖x‖² dµ(x) < +∞, ∫‖y‖² dν(y) < +∞ suffices to prove that both (innerGW) and (MaxOT) are finite and that (MaxOT) admits an optimal solution π* ∈ Π(µ, ν) (we postpone this study to Lemma 6.2.7 in Section 6.2.7).

This theorem gives another interesting formulation of the Gromov-Wasserstein problem. It proves that GW is equivalent to a linear OT problem combined with an "alignment" of the measures µ and ν on the same space using a linear application P. The set F_{p,q} can be regarded as the set of matrices with fixed Schatten ℓ₂ norm, that is P ∈ F_{p,q} if ‖σ(P)‖₂ = √p where σ(P) is a vector containing the singular values of P. When p = q, any orthogonal matrix O ∈ O(p) is in F_{p,q} since ‖O‖_F = √(tr(OᵀO)) = √(tr(I_p)) = √p. More generally, when p < q, any matrix in the Stiefel manifold ∆ ∈ V_p(R^q) is an element of F_{p,q} since ∆ᵀ∆ = I_p. Interestingly enough, the problem (MaxOT) can be related to the work of Alvarez-Melis and coauthors [Alvarez-Melis 2019], where they propose a linear optimal transport problem which takes into account a latent global transformation of the measures. More precisely, they consider two probability measures (µ, ν) ∈ P(R^p) × P(R^p) (i.e. p = q) and propose to minimize the following problem:

InvOT(µ, ν) = min_{π ∈ Π(µ,ν)} min_{‖P‖_F = √p} ∫ ‖Px − y‖² dπ(x, y)  (4.14)

If we note Σ_µ = ∫ x xᵀ dµ(x) and suppose that Σ_µ = I_p (which is called the µ-whitened property in [Alvarez-Melis 2019]), then problem (4.14) is equivalent to (MaxOT).
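This equivalence hinges on ∫‖Px‖² dµ(x) being constant over the constraint set ‖P‖_F = √p under the µ-whitened assumption, as developed in the computation that follows. A small empirical check of this fact (our own sketch, for p = q, with an exactly whitened empirical measure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))

# Whiten the empirical measure so that (1/n) sum_i x_i x_i^T = I_p exactly.
Sigma = X.T @ X / n
X = X @ np.linalg.inv(np.linalg.cholesky(Sigma)).T

for _ in range(5):
    P = rng.normal(size=(p, p))
    P *= np.sqrt(p) / np.linalg.norm(P)          # enforce ||P||_F = sqrt(p)
    # Empirical value of the integral of ||Px||^2 dmu(x).
    mean_sq = np.mean(np.sum((X @ P.T) ** 2, axis=1))
    assert np.isclose(mean_sq, p)                # equals p, independent of P
```

Since the ‖Px‖² term of (4.14) is then constant, minimizing InvOT amounts to maximizing ∫⟨Px, y⟩ dπ, which is exactly (MaxOT).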
To see this, it suffices to develop ∫‖Px − y‖² dπ(x, y) as:

∫ ‖Px − y‖² dπ(x, y) = ∫ ‖Px‖² dµ(x) + ∫ ‖y‖² dν(y) − 2 ∫ ⟨Px, y⟩_p dπ(x, y)  (4.15)

Then we can check that ∫‖Px‖² dµ(x) does not depend on P since:

∫ ‖Px‖² dµ(x) = ∫ xᵀPᵀPx dµ(x) =(*) ∫ tr(xᵀPᵀPx) dµ(x) =(**) ∫ tr(PᵀP x xᵀ) dµ(x) =(***) tr(PᵀP ∫ x xᵀ dµ(x)) = tr(PᵀP) = ‖P‖²_F = p  (4.16)

where in (*) we used xᵀPᵀPx ∈ R, in (**) that the trace is invariant by cyclical permutation and in (***) the linearity of the trace. Finally we used that tr(PᵀP) = ‖P‖²_F = p by hypothesis. Note however that in general both problems may differ, since the µ-whitened property does not hold in general.

Theorem 4.2.1 is based on the following generalization of the Frobenius norm duality to the continuous setting: Lemma 4.2.2.
For any µ ∈ P(R^p), ν ∈ P(R^q) and π ∈ Π(µ, ν):

sup_{‖P‖_F = √p} ∫ ⟨Px, y⟩_q dπ(x, y) = √p ‖∫ y xᵀ dπ(x, y)‖_F  (4.17)

This supremum is achieved for P* = (√p / ‖∫ y xᵀ dπ(x, y)‖_F) ∫ y xᵀ dπ(x, y).

Proof.
We have:

∫ ⟨Px, y⟩_q dπ(x, y) = ∫ yᵀPx dπ(x, y) =(*) ∫ tr(yᵀPx) dπ(x, y) =(**) ∫ tr(P x yᵀ) dπ(x, y) =(***) tr(P ∫ x yᵀ dπ(x, y)) = ⟨∫ y xᵀ dπ(x, y), P⟩_F

where in (*) we used that yᵀPx ∈ R, in (**) we used the cyclical permutation invariance of the trace and in (***) its linearity. Hence sup_{‖P‖_F=√p} ∫⟨Px, y⟩_q dπ(x, y) = sup_{‖P‖_F=√p} ⟨P, ∫ y xᵀ dπ(x, y)⟩_F. We note V_π = ∫ y xᵀ dπ(x, y). We want to solve:

sup_{‖P‖_F=√p} ⟨P, V_π⟩_F  (4.18)

Let P be such that ‖P‖_F = √p. Then by Cauchy-Schwarz (see Memo 4.2.1), ⟨P, V_π⟩_F ≤ ‖P‖_F ‖V_π‖_F = √p ‖V_π‖_F. Hence sup_{‖P‖_F=√p} ⟨P, V_π⟩_F ≤ √p ‖V_π‖_F. Conversely, take P* = √p V_π/‖V_π‖_F. Then ‖P*‖_F = √p, so sup_{‖P‖_F=√p} ⟨P, V_π⟩_F ≥ ⟨P*, V_π⟩_F = ⟨√p V_π/‖V_π‖_F, V_π⟩_F = √p ‖V_π‖_F, which concludes the proof.

Combining Lemma 4.2.1 and Lemma 4.2.2 actually proves Theorem 4.2.1. Indeed, using Lemma 4.2.1 we see that (innerGW) is equivalent to maximizing Z(π) = 2‖∫ y xᵀ dπ(x, y)‖²_F over the couplings π ∈ Π(µ, ν), since the other terms are constant. In this way it is equivalent to maximize ‖∫ y xᵀ dπ(x, y)‖_F, which is equivalent by Lemma 4.2.2 to maximizing sup_{P ∈ F_{p,q}} ∫⟨Px, y⟩_q dπ(x, y) w.r.t. π.

Regularity of (innerGW)
OT plans
Theorem 4.2.1 proves that, in order to study the regularity of GW optimal transport plans, it is equivalent to study the problem (MaxOT). Interestingly enough, the problem (MaxOT) echoes the linear transportation problem sup_{π∈Π(µ,ν)} ∫⟨x, y⟩_p dπ(x, y) when µ, ν ∈ P(R^p) × P(R^p), which is widely studied in the literature and can be tackled using tools from convex analysis such as the Legendre transform. The following result due to McCann is particularly useful in this case:

Theorem 4.2.2 ([McCann 1995]). Let µ ∈ P(R^p), ν ∈ P(R^p). Suppose that µ is absolutely continuous with respect to the Lebesgue measure; then there exists a convex function u : R^p → R whose gradient ∇u pushes µ forward to ν, i.e. ∇u#µ = ν. Moreover ∇u is unique µ-a.e.

By noticing that, for all x, y ∈ R^p × R^p, ⟨x, y⟩_p ≤ u(x) + u*(y), where u* is the Legendre transform of the convex function u, the result of McCann proves that the map ∇u defines an optimal coupling γ = (id × ∇u)#µ for the problem sup_{π∈Π(µ,ν)} ∫⟨x, y⟩_p dπ(x, y) between (µ, ν) ∈ P(R^p) × P(R^p). Indeed, for any coupling π ∈ Π(µ, ν):

∫ ⟨x, y⟩_p dπ(x, y) ≤ ∫ u(x) + u*(y) dπ(x, y) = ∫ u(x) dµ(x) + ∫ u*(y) dν(y)  (4.20)

Using that ∇u pushes µ forward to ν implies:

∫ ⟨x, y⟩_p dπ(x, y) ≤ ∫ u(x) dµ(x) + ∫ u*(∇u(x)) dµ(x) = ∫ ⟨x, ∇u(x)⟩_p dµ(x)  (4.21)

since for any convex function u, u(x) + u*(∇u(x)) = ⟨x, ∇u(x)⟩. Overall, ∫⟨x, y⟩_p dπ(x, y) ≤ ∫⟨x, y⟩_p dγ(x, y) for any coupling π, which proves that γ is optimal. The idea is to use the same reasoning to find an optimal solution of (MaxOT). In order to invoke McCann's theorem we will need the regularity of the probability measure P#µ for P ∈ F_{p,q}:

Memo 4.2.1 (Cauchy-Schwarz inequality). Let Ω be a vector space associated with an inner product ⟨·, ·⟩ which defines a norm ‖·‖ through ‖x‖ = √⟨x, x⟩ for x ∈ Ω.
The Cauchy-Schwarz inequality reads:

∀ x, y ∈ Ω, |⟨x, y⟩| ≤ ‖x‖‖y‖  (4.19)

Proposition 4.2.1.
Let µ ∈ P(R^p) be regular with respect to the Lebesgue measure in R^p and let l : R^p → R^q be a linear map associated with a matrix L ∈ R^{q×p}.
• If p < q (we go from lower to higher dimension), then l#µ ∈ P(R^q) is not regular with respect to the Lebesgue measure on R^q.
• If p ≥ q (we go from higher to lower dimension), then l#µ ∈ P(R^q) is regular with respect to the Lebesgue measure on R^q if and only if the linear map l is surjective, that is rank(L) = q.

Proof. We have l#µ(l(R^p)) = µ(l^{-1}(l(R^p))) = µ(R^p) = 1, so that l#µ gives measure 1 to the image of l. For the first point, the image of l is a strict linear subspace of R^q and therefore has Lebesgue measure zero. Using the Radon-Nikodym theorem, this implies that l#µ cannot have a density with respect to the Lebesgue measure on R^q. For the second point, suppose that l is surjective. Then rank(L) = q, which implies that LLᵀ is invertible, so that det(LLᵀ) ≠ 0. We define J = √det(LLᵀ). Let g be the density of µ with respect to the Lebesgue measure in R^p. Then by the coarea formula the density of l#µ with respect to the Lebesgue measure on R^q is:

h(y) = ∫_{l^{-1}(y)} (g(x)/J) dV_{l^{-1}(y)}(x)  (4.22)

where dV_{l^{-1}(y)}(x) denotes the volume element. Conversely, if l is not surjective then rank(L) < q. Then the image of l is a strict linear subspace of R^q with Lebesgue measure zero, and therefore l#µ cannot have a density with respect to the Lebesgue measure on R^q.

In the light of the previous results, we can give the following sufficient condition so that the (innerGW) problem admits an optimal transport plan supported on a deterministic function: Theorem 4.2.3.
Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$ with $\int \|x\|^2 \, d\mu(x) < +\infty$, $\int \|y\|^2 \, d\nu(y) < +\infty$. Suppose that $p \ge q$ and that $\mu$ is regular with respect to the Lebesgue measure on $\mathbb{R}^p$. Suppose that there exists an optimal solution $(\pi^*, P^*)$ of (MaxOT) with $P^*$ surjective. Then there exists $u : \mathbb{R}^q \to \mathbb{R}$ convex such that $\nabla u \circ P^*$ pushes $\mu$ forward to $\nu$. Moreover the coupling $\gamma = (id \times \nabla u \circ P^*)_{\#}\mu$ is optimal for (innerGW). In particular Problem 1 holds.

Proof. Let $(\pi^*, P^*)$ be maximizers of (MaxOT) with $P^*$ surjective. Using Proposition 4.2.1 we know that $P^*_{\#}\mu$ is regular with respect to the Lebesgue measure on $\mathbb{R}^q$. Using Theorem 4.2.2, there exists $u : \mathbb{R}^q \to \mathbb{R}$ convex such that $\nabla u_{\#}(P^*_{\#}\mu) = \nu$, or equivalently $\nabla u \circ P^*$ pushes $\mu$ forward to $\nu$. Moreover we have:

$$\begin{aligned}
\int \langle P^* x, y \rangle_q \, d\pi^*(x,y) &\overset{(1)}{\le} \int (u(P^* x) + u^*(y)) \, d\pi^*(x,y) = \int u(P^* x) \, d\mu(x) + \int u^*(y) \, d\nu(y) \\
&\overset{(2)}{=} \int u(P^* x) \, d\mu(x) + \int u^*(y) \, d((\nabla u \circ P^*)_{\#}\mu)(y) = \int u(P^* x) \, d\mu(x) + \int u^*(\nabla u(P^* x)) \, d\mu(x) \\
&\overset{(3)}{=} \int \langle P^* x, \nabla u(P^* x) \rangle_q \, d\mu(x)
\end{aligned}$$

where in (1) we used that $u$ is convex, which implies $u(Px) + u^*(y) \ge \langle Px, y \rangle_q$ by the Fenchel–Young inequality; in (2) we used that $\nabla u \circ P^*$ pushes $\mu$ forward to $\nu$; in (3) we used that for any $x$ and convex function $u$, $u(x) + u^*(\nabla u(x)) = \langle x, \nabla u(x) \rangle$. If we define $T = \nabla u \circ P^*$ and $\gamma = (id \times T)_{\#}\mu \in \Pi(\mu,\nu)$, we can deduce from (3) that:

$$\sup_{\pi \in \Pi(\mu,\nu)} \sup_{P \in F_{p,q}} \int \langle Px, y \rangle_q \, d\pi(x,y) \le \int \langle P^* x, y \rangle_q \, d\gamma(x,y)$$

By suboptimality the converse inequality is also true (since $\gamma = (id \times T)_{\#}\mu \in \Pi(\mu,\nu)$).
In this way we have:

$$\sup_{\pi \in \Pi(\mu,\nu)} \sup_{P \in F_{p,q}} \int \langle Px, y \rangle_q \, d\pi(x,y) = \int \langle P^* x, y \rangle_q \, d\gamma(x,y)$$

Overall the couple $(\gamma, P^*)$ is optimal for the problem (MaxOT), and $\gamma$ is optimal for (innerGW) using Theorem 4.2.1, which concludes the proof.

The last result indicates that Monge maps of the form $\nabla u \circ P$, where $P$ is a linear map, may be of interest for studying optimal solutions of (innerGW). When $p = q$ and $\mu,\nu \in \mathcal{P}(\mathbb{R}^p) \times \mathcal{P}(\mathbb{R}^p)$, such maps can recover the optimal Monge map $\nabla u$ of the linear optimal transport problem $\sup_{\pi \in \Pi(\mu,\nu)} \int \langle x,y \rangle_p \, d\pi(x,y)$ by taking $P = I_p$. Note that a couple $(u, P)$ which satisfies both that $\nabla u \circ P$ pushes $\mu$ onto $\nu$ and that $\gamma = (id \times \nabla u \circ P)_{\#}\mu$ is optimal for (innerGW) is not guaranteed to be unique. This difference should be put in perspective with the theory of linear transportation with ground cost $c(x,y) = \|x - y\|^2$. Indeed, in that context, if one finds a mapping $T = \nabla u$ for some convex function $u$, then Brenier's theorem states that this mapping is optimal, since there is a unique Monge map satisfying this property. This unicity result is particularly interesting, e.g., in order to prove that the linear transport between Gaussian measures admits a closed-form expression [Takatsu 2011]. However, in our context the map $\nabla u \circ P$ may fail to be unique, as shown in the following example:

Example 4.2.1.
Consider $p = q$, a source measure $\mu \in \mathcal{P}(\mathbb{R}^p)$ and a target measure $\nu \in \mathcal{P}(\mathbb{R}^p)$ whose support is invariant by rotation, i.e. such that $O_{\#}\nu = \nu$ for any $O \in \mathcal{O}(p)$. We can consider e.g. any isotropic Gaussian measure $\mathcal{N}(m_\nu, \sigma^2 I_p)$. Since $O$ preserves the angles, the problem (innerGW) is invariant by $O$, which implies that any optimal map defined in Theorem 4.2.3 will fail to be unique. More precisely, suppose that $\nabla u \circ P$ pushes $\mu$ forward to $\nu$ and that $\gamma = (id \times \nabla u \circ P)_{\#}\mu$ is optimal for (innerGW). Then for any $O \in \mathcal{O}(p)$, $O^T \circ \nabla u \circ P$ also pushes $\mu$ forward to $\nu$, since $(\nabla u \circ P)_{\#}\mu = \nu = O_{\#}\nu$ by rotational invariance of the support of $\nu$; this implies $(O^T \circ \nabla u \circ P)_{\#}\mu = \nu$. Moreover the map $\tilde{\gamma} = (id \times O^T \circ \nabla u \circ P)_{\#}\mu$ is also optimal, with the same cost as $\gamma$. Indeed:

$$\begin{aligned}
\iint \left( \langle x, x' \rangle_p - \langle y, y' \rangle_p \right)^2 \, d\tilde{\gamma}(x,y) \, d\tilde{\gamma}(x',y') &= \iint \left( \langle x, x' \rangle_p - \langle O^T \nabla u(Px), O^T \nabla u(Px') \rangle_p \right)^2 \, d\mu(x) \, d\mu(x') \\
&\overset{(*)}{=} \iint \left( \langle x, x' \rangle_p - \langle \nabla u(Px), \nabla u(Px') \rangle_p \right)^2 \, d\mu(x) \, d\mu(x') \\
&= \iint \left( \langle x, x' \rangle_p - \langle y, y' \rangle_p \right)^2 \, d\gamma(x,y) \, d\gamma(x',y')
\end{aligned}$$

where in (*) we used that $O \in \mathcal{O}(p)$.

The condition of Theorem 4.2.3 seems reasonable and not too restrictive: it suffices that one optimal solution of (MaxOT) with $P^*$ surjective exists in order to prove that an optimal coupling of (innerGW) is supported by a deterministic map. We believe that some simple assumptions can be made on $\mu, \nu$ in order to meet this condition. In particular it is satisfied when $\mu, \nu$ are 1D distributions, as detailed below.

Application of Theorem 4.2.3 for 1D probability distributions
The sufficient condition given in Theorem 4.2.3 can be used to derive a closed-form expression for the problem (innerGW) between 1D probability distributions. Let us consider $(\mu,\nu) \in \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R})$, let $F_\mu$ and $F_\nu$ be the cumulative distribution functions of $\mu$ and $\nu$, and $F_\mu^{-1}$, $F_\nu^{-1}$ their pseudo-inverses (see Chapter 2). We suppose that $\mu$ is regular with respect to the Lebesgue measure on $\mathbb{R}$. In this case the linear map $P$ in (MaxOT) reduces to a scalar $p \in \mathbb{R}$:

$$\sup_{\pi \in \Pi(\mu,\nu)} \sup_{|p| = 1} \int (px) \, y \, d\pi(x,y) \quad (4.23)$$

In this way any optimal solution $(\pi^*, p^*)$ of (4.23) satisfies $p^* \in \{-1, 1\}$, so that $p^*$ defines a surjective linear map. Then, by applying Theorem 4.2.3, there exists $u : \mathbb{R} \to \mathbb{R}$ convex such that $u' \circ p^*$ pushes $\mu$ forward to $\nu$ and $\gamma = (id \times u' \circ p^*)_{\#}\mu$ is optimal for (innerGW). In other words, there exists $f = u'$ non-decreasing such that $\gamma = (id \times f \circ p^*)_{\#}\mu$ is optimal for (innerGW). However, we know from linear transport theory that there is a unique non-decreasing map $T_{asc} : \mathbb{R} \to \mathbb{R}$ such that $T_{asc\#}\mu = \nu$, and it is given by $T_{asc}(x) = F_\nu^{-1}(F_\mu(x))$ (see Theorem 2.5 in [Santambrogio 2015]). This proves that if $p^* = 1$ then $f \circ p^* = T_{asc}$, so that $\gamma = (id \times T_{asc})_{\#}\mu$ is optimal for (innerGW). If $p^* = -1$, then $f \circ p^*$ is non-increasing and pushes $\mu$ forward to $\nu$, which is equivalent to saying that $f$ is non-decreasing and pushes $\tilde{\mu}$ forward to $\nu$, where "$d\tilde{\mu}(x) = d\mu(-x)$". This discussion leads to the following result:

Theorem 4.2.4 (Closed-form expression for (innerGW) between 1D distributions). Let $(\mu,\nu) \in \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R})$ with $\mu$ regular with respect to the Lebesgue measure. Let $F_\mu^{\nearrow}(x) = \mu(]-\infty, x])$ be the cumulative distribution function and $F_\mu^{\searrow}(x) = \mu([x, +\infty[)$ be the anti-cumulative distribution function.
Let $T_{asc} : \mathbb{R} \to \mathbb{R}$ be defined by $T_{asc}(x) = F_\nu^{-1}(F_\mu^{\nearrow}(x))$ and $T_{desc} : \mathbb{R} \to \mathbb{R}$ by $T_{desc}(x) = F_\nu^{-1}(F_\mu^{\searrow}(x))$. Then an optimal solution of (innerGW) is achieved either by the coupling $\gamma = (id \times T_{asc})_{\#}\mu$ or by the coupling $\gamma = (id \times T_{desc})_{\#}\mu$.

Theorem 4.2.4 proves that it suffices to compute the CDF or the anti-CDF of the distributions to recover an optimal coupling. It can be put in light of the results of Section 4.1, where we proved that for discrete probability measures with uniform weights and the same number of atoms, an optimal coupling for the GW problem with squared Euclidean distances can be found in the diagonal or the anti-diagonal coupling when the samples are sorted. As such, the previous theorem is stronger, since it applies to general 1D probability distributions when considering inner-product similarities. Can we find other examples of optimal couplings for (innerGW), for example when the dimension is larger than 1? The next discussion gives another example which answers this question.
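To make the 1D closed form concrete, here is a minimal numerical sketch for discrete samples with uniform weights and equal sizes, where $T_{asc}$ and $T_{desc}$ reduce to monotone rearrangements (sorting). The function names are ours, not from the thesis:

```python
import numpy as np

def innergw_1d_maps(x, y):
    """Discrete analogue of Theorem 4.2.4 for two 1D samples with
    uniform weights and the same size n: the candidate optimal maps
    send the i-th smallest x either to the i-th smallest y (T_asc,
    CDF composition) or to the i-th largest y (T_desc, anti-CDF)."""
    order = np.argsort(x)
    y_sorted = np.sort(y)
    t_asc = np.empty(len(x))
    t_desc = np.empty(len(x))
    t_asc[order] = y_sorted          # ascending rearrangement
    t_desc[order] = y_sorted[::-1]   # descending rearrangement
    return t_asc, t_desc

def innergw_cost(x, t):
    """(innerGW) cost of the deterministic coupling x_i -> t_i with
    uniform weights: mean of (<x_i, x_k> - <t_i, t_k>)^2 over pairs."""
    return np.mean((np.outer(x, x) - np.outer(t, t)) ** 2)
```

On samples where $\nu$ is a shuffled copy of $\mu$ (resp. of $-\mu$), the ascending (resp. descending) rearrangement yields a zero-cost coupling; in general one keeps whichever of the two candidates achieves the better (innerGW) objective.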
Construction of an optimal couple $(u, P)$. As seen in the previous discussion, the couples $(u, P)$ where $u$ is convex and $\nabla u \circ P$ pushes $\mu$ forward to $\nu$ may lead to an optimal map w.r.t. the Gromov-Wasserstein distance with inner-product similarities. In this part we wish to give an example of such a couple $(u, P)$ which leads to an optimal map. Moreover, this discussion will also highlight another difference between the linear OT problem and the Gromov-Wasserstein problem. It is known that when $\mu \in \mathcal{P}(\mathbb{R}^p)$ and $u : \mathbb{R}^p \to \mathbb{R}$ is convex and differentiable $\mu$-a.e., the optimal transport plan for the linear OT problem $\inf_{\pi \in \Pi(\mu,\nu)} \int \|x - y\|^2 \, d\pi(x,y)$ between $\mu$ and $\nu \overset{def}{=} \nabla u_{\#}\mu$ is given by $\gamma = (id \times \nabla u)_{\#}\mu$. In other words, if one perturbs the source measure $\mu$ with a transformation which is the gradient of a convex function, then the cheapest way (in terms of $W_2$) of moving the source measure forward to the target is by means of this transformation. To see this we can repeat the reasoning of (4.20), noticing that $\inf_{\pi \in \Pi(\mu,\nu)} \int \|x - y\|^2 \, d\pi(x,y)$ is equivalent to $\sup_{\pi \in \Pi(\mu,\nu)} \int \langle x,y \rangle_p \, d\pi(x,y)$ (see also [Santambrogio 2015, Theorem 1.48]). In our case, however, the situation is a little different, as detailed in the next proposition:

Proposition 4.2.2.
Let $\mu \in \mathcal{P}(\mathbb{R}^p)$ with $\int \|x\|^2 \, d\mu(x) < +\infty$. Let $u : \mathbb{R}^q \to \mathbb{R}$ be a convex function. We consider:

$$\sup_{Q \in F_{p,q}} \int u(Qx) \, d\mu(x) \quad (E_u)$$

Let $P \in F_{p,q}$ be a solution of $(E_u)$ with $\int \|\nabla u(Px)\|^2 \, d\mu(x) < +\infty$. Then $\gamma = (id \times \nabla u \circ P)_{\#}\mu$ is an optimal solution of (innerGW) between $\mu$ and $\nu \overset{def}{=} (\nabla u \circ P)_{\#}\mu$.

Proof. Let $(\pi^*, P^*)$ be maximizers of (MaxOT). We have:

$$\begin{aligned}
\int \langle P^* x, y \rangle_q \, d\pi^*(x,y) &\overset{(1)}{\le} \int (u(P^* x) + u^*(y)) \, d\pi^*(x,y) = \int u(P^* x) \, d\mu(x) + \int u^*(y) \, d\nu(y) \\
&\overset{(2)}{=} \int u(P^* x) \, d\mu(x) + \int u^*(\nabla u(Px)) \, d\mu(x) \overset{(3)}{\le} \sup_{Q \in F_{p,q}} \int u(Qx) \, d\mu(x) + \int u^*(\nabla u(Px)) \, d\mu(x) \\
&= \int u(Px) \, d\mu(x) + \int u^*(\nabla u(Px)) \, d\mu(x) \overset{(4)}{=} \int \langle Px, \nabla u(Px) \rangle_q \, d\mu(x)
\end{aligned}$$

where in (1) we used that $u$ is convex, in (2) we used $(\nabla u \circ P)_{\#}\mu = \nu$, in (3) we used that $P$ maximizes $\sup_{Q \in F_{p,q}} \int u(Qx) \, d\mu(x)$, and in (4) we used that for any $x$ and convex function $u$, $u(x) + u^*(\nabla u(x)) = \langle x, \nabla u(x) \rangle$. We can deduce from (4) that:

$$\sup_{\pi \in \Pi(\mu,\nu)} \sup_{P \in F_{p,q}} \int \langle Px, y \rangle_q \, d\pi(x,y) \le \int \langle Px, y \rangle_q \, d\gamma(x,y)$$

By suboptimality the converse inequality is also true, so that $(\gamma, P)$ is an optimal solution of (MaxOT), and consequently $\gamma$ is an optimal solution of (innerGW) using Theorem 4.2.1.

This result states that if one perturbs a source measure $\mu \in \mathcal{P}(\mathbb{R}^p)$ with a map $\nabla u \circ P$, under the condition that $P : \mathbb{R}^p \to \mathbb{R}^q$ achieves $\sup_{Q \in F_{p,q}} \int u(Qx) \, d\mu(x)$, then the cheapest way w.r.t. (innerGW) of moving $\mu$ forward to $(\nabla u \circ P)_{\#}\mu$ is by means of the transformation $\nabla u \circ P$. To illustrate this proposition we can look at the simple convex transformations $u(x) = x^T U x$ where $U$ is symmetric positive semi-definite (which means that the transformation $\nabla u$ is linear). In this case the following proposition exhibits a couple $(u, P)$ which leads to an optimal map for (innerGW):

Proposition 4.2.3 (An optimal couple $(u, P)$). Let $\mu \in \mathcal{P}(\mathbb{R}^p)$. We note $\Sigma_\mu = \int x x^T \, d\mu(x)$.
Let $u : \mathbb{R}^p \to \mathbb{R}$ be the convex function defined by $u(x) = x^T U x$, where $U$ is symmetric positive semi-definite. Let $v$ be a unit eigenvector associated with the largest eigenvalue of $\Sigma_\mu$ and $w$ a unit eigenvector associated with the largest eigenvalue of $U$, and let $P = \sqrt{p} \, w v^T \in F_{p,p}$. Then the coupling $\gamma = (id \times \nabla u \circ P)_{\#}\mu$ is optimal for (innerGW) between $\mu$ and $\nu \overset{def}{=} (\nabla u \circ P)_{\#}\mu$.

Proof.
In this case the problem $\sup_{Q \in F_{p,p}} \int u(Qx) \, d\mu(x)$ reduces to $\sup_{\|Q\|_F = \sqrt{p}} \int x^T Q^T U Q x \, d\mu(x) = \sup_{\|Q\|_F = \sqrt{p}} \mathrm{tr}(Q^T U Q \Sigma_\mu)$, where $\Sigma_\mu = \int x x^T \, d\mu(x)$. By vectorizing the matrix $Q$ it is equivalent to:

$$\max_{q \in \mathbb{R}^{p^2}, \, \|q\| = \sqrt{p}} q^T M_{u,\mu} \, q \quad (4.24)$$

where $M_{u,\mu} = U \otimes_K \Sigma_\mu$, with $\otimes_K$ the Kronecker product of matrices; $M_{u,\mu}$ is symmetric positive semi-definite. We can rewrite this problem with $\tilde{q} = q / \sqrt{p}$:

$$\max_{\tilde{q}^T \tilde{q} = 1} \tilde{q}^T M_{u,\mu} \, \tilde{q} \quad (4.25)$$

which is the maximization of a Rayleigh quotient. It is well known that a solution of this problem is found at any eigenvector associated with the largest eigenvalue of $M_{u,\mu}$ (see e.g. [Anstreicher 1998]). Moreover, the eigenvalues of $M_{u,\mu}$ are given by all the products of the eigenvalues of $\Sigma_\mu$ and $U$ (see e.g. [Horn 1991]). Since they are all nonnegative, the largest eigenvalue of $M_{u,\mu}$ is the product of the largest eigenvalues of $\Sigma_\mu$ and $U$, with corresponding eigenvector the Kronecker product of the associated leading eigenvectors; unvectorizing, the optimal $Q$ is the rank-one matrix of Proposition 4.2.3. We can apply Proposition 4.2.2 to conclude.

The GW problem is usually considered with distances, as this provides a metric with respect to strong isomorphisms. The goal of this section is to study the case $c_X(x, x') = \|x - x'\|^2$, $c_Y(y, y') = \|y - y'\|^2$ with $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$. As for the inner-product case, we can prove that this problem is equivalent to another linear OT problem parametrized by a linear map. More precisely:

Theorem 4.2.5.
Let $X$ and $Y$ be compact subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively. Let $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$. Assume without loss of generality that $\mathbb{E}_{X \sim \mu}[X] = 0$ and $\mathbb{E}_{Y \sim \nu}[Y] = 0$. Then the problems:

$$\inf_{\pi \in \Pi(\mu,\nu)} \iint \left( \|x - x'\|^2 - \|y - y'\|^2 \right)^2 \, d\pi(x,y) \, d\pi(x',y') \quad \text{(sqGW)}$$

and

$$\sup_{\pi \in \Pi(\mu,\nu)} \sup_{P \in \mathbb{R}^{q \times p}} \int \left( \langle Px, y \rangle_q + \|x\|^2 \|y\|^2 \right) d\pi(x,y) - \frac{\|P\|_F^2}{8} \quad \text{(dual-sqGW)}$$

are equivalent.

To prove the previous theorem we rely on the calculus of Lemma 4.2.1 and on the observation that the cost $J(c_X, c_Y, \pi)$ is invariant by translation of the supports of the measures, so that they can be centered without loss of generality. This implies that the term $\int \left[ \|x\|^2 \langle \mathbb{E}_{Y \sim \nu}[Y], y \rangle_q + \|y\|^2 \langle \mathbb{E}_{X \sim \mu}[X], x \rangle_p \right] d\pi(x,y)$ in Lemma 4.2.1 vanishes. More precisely, we have the following result:

Lemma 4.2.3.
Let $X$ and $Y$ be compact subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively. Let $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$. We can assume without loss of generality that $\mathbb{E}_{X \sim \mu}[X] = 0$ and $\mathbb{E}_{Y \sim \nu}[Y] = 0$. In this case (sqGW) is equivalent to:

$$\sup_{\pi \in \Pi(\mu,\nu)} \int \|x\|^2 \|y\|^2 \, d\pi(x,y) + 2 \left\| \int y x^T \, d\pi(x,y) \right\|_F^2 \quad (4.26)$$

A proof of this lemma can be found in Section 6.2.8. In the following we will note $F(\pi) = \int \|x\|^2 \|y\|^2 \, d\pi(x,y) + 2 \left\| \int y x^T \, d\pi(x,y) \right\|_F^2$. To prove Theorem 4.2.5 the idea is to observe that this problem is the maximization of a convex function of $\pi$. We can use standard convex-analysis tools such as Fenchel–Moreau duality to derive the Fenchel dual of $F(\pi)$, as detailed below.

Duality in the space of measures
We suppose in the following that $X, Y$ are general compact spaces. In this case the dual space of $\mathcal{C}(X \times Y)$ is $\mathcal{M}(X \times Y)$ (see e.g. Memo 1.3 in [Santambrogio 2015]). We recall the main definitions of the Legendre–Fenchel transform:
Definition 4.2.1 (Legendre–Fenchel transform). Let $X, Y$ be compact Hausdorff spaces. For a function $F : \mathcal{M}(X \times Y) \to \mathbb{R} \cup \{+\infty\}$ we define its convex conjugate $F^* : \mathcal{C}(X \times Y) \to \mathbb{R} \cup \{+\infty\}$ and its bi-conjugate $F^{**} : \mathcal{M}(X \times Y) \to \mathbb{R} \cup \{+\infty\}$ as:

$$F^*(h) = \sup_{\pi \in \mathcal{M}(X \times Y)} \int h(x,y) \, d\pi(x,y) - F(\pi), \qquad F^{**}(\pi) = \sup_{h \in \mathcal{C}(X \times Y)} \int h(x,y) \, d\pi(x,y) - F^*(h) \quad (4.27)$$

One remarkable property of the convex conjugate $F^*$ is that it is always l.s.c. and convex: $h \mapsto \int h \, d\pi - F(\pi)$ is an affine function, hence convex and continuous, and $F^*$ is the pointwise supremum of such continuous linear functions. We denote by $\mathrm{dom}(F) = \{\pi \,|\, F(\pi) < +\infty\}$ the domain of $F$; a function $F$ is called proper if $\mathrm{dom}(F) \neq \emptyset$. A fundamental result in convex analysis, the Fenchel–Moreau theorem, states that the bi-conjugate of a proper, convex and l.s.c. function equals the original function (see e.g. [Lai 1988]). In other words, when $F$ is convex and well-behaved, one can rely on the bi-conjugate to study the original function. In our context this implies that $F(\pi) = F^{**}(\pi)$ for all $\pi \in \mathcal{M}(X \times Y)$. To compute $F^{**}(\pi)$ we will need a notion of derivative in the space of measures, as described in the next definition:

Definition 4.2.2 (Fréchet differentiability). Let $X, Y$ be compact Hausdorff spaces. A function $F : \mathcal{M}(X \times Y) \to \mathbb{R}$ is Fréchet differentiable at $\pi$ if there exists $\nabla F(\pi) \in \mathcal{C}(X \times Y)$ such that for any $\varepsilon \in \mathcal{M}(X \times Y)$, as $t \to 0$:

$$F(\pi + t\varepsilon) = F(\pi) + t \int \nabla F(\pi) \, d\varepsilon + o(t) \quad (4.28)$$

Application to the Gromov-Wasserstein problem
Let $X \subset \mathbb{R}^p$, $Y \subset \mathbb{R}^q$ be compact and $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$. Since $X, Y$ are compact and the norms are continuous, the GW distance is finite and the function $F$ defined in (4.26) is proper, convex and l.s.c. By application of the Fenchel–Moreau theorem we have:

$$\sup_{\pi \in \Pi(\mu,\nu)} F(\pi) = \sup_{\pi \in \Pi(\mu,\nu)} F^{**}(\pi) = \sup_{\pi \in \Pi(\mu,\nu)} \sup_{h \in \mathcal{C}(X \times Y)} \int h(x,y) \, d\pi(x,y) - F^*(h) \quad (4.29)$$

In the following we denote $V_\pi = \int y x^T \, d\pi(x,y)$. We can prove that the dual problem can be solved by parametrizing $h$ by a linear map $P \in \mathbb{R}^{q \times p}$ as $h(x,y) = \langle Px, y \rangle_q + \|x\|^2 \|y\|^2$. In this form the problem becomes much simpler, as we only need to optimize over a finite-dimensional space instead of maximizing over all continuous functions.

Lemma 4.2.4. If $\pi^*$ is a solution of the primal problem $\sup_{\pi \in \Pi(\mu,\nu)} F(\pi)$, then there exist $P \in \mathbb{R}^{q \times p}$ and $h^* \in \mathcal{C}(X \times Y)$ of the form $h^*(x,y) = \langle Px, y \rangle_q + \|x\|^2 \|y\|^2$ such that $(\pi^*, h^*)$ is a solution of the dual problem (4.29). Moreover, when $h^*$ is of this form we have $F^*(h^*) = \|P\|_F^2 / 8$.

A proof can be found in Section 6.2.9. The previous Lemma 4.2.4 can be used to prove Theorem 4.2.5. Indeed, with the previous notations, $h$ can be written in the form $h(x,y) = \langle Px, y \rangle_q + \|x\|^2 \|y\|^2$; plugging the conjugate $F^*(h) = \|P\|_F^2 / 8$ into (4.29) gives the desired result.

Regularity of (sqGW) optimal plans
The Fenchel dual problem is more difficult to analyse than the dual problem of the inner-product case. Indeed, the cost $c : (x,y) \mapsto \langle Px, y \rangle_q + \|x\|^2 \|y\|^2$ hardly relates to a "standard" linear OT problem. In particular, it does not satisfy the Twist condition (see Chapter 2) in general, and an approach with convexity tools is trickier due to the term $\|x\|^2 \|y\|^2$. We will show in the following that if $\nabla u \circ P$ is "not too far" from an isometry, then it defines an optimal coupling. We note $\Gamma^1(\mathbb{R}_+, \mathbb{R}_+)$ the set of derivable convex functions from $\mathbb{R}_+$ to $\mathbb{R}_+$ and we define the following set:

Definition 4.2.3.
We define $H(\mu,\nu)$, the set of push-forwards between $\mu$ and $\nu$, by:

$$H(\mu,\nu) = \left\{ T_{\#}\mu = \nu \;\middle|\; \exists f \in \Gamma^1(\mathbb{R}_+, \mathbb{R}_+), \ \|T(x)\|^2 = f'(\|x\|^2) \ \mu\text{-a.e.} \right\} \quad (4.30)$$

When $p = q$ this set encompasses all the linear push-forwards of the form $T = cO$ where $c > 0$, $O \in \mathcal{O}(p)$ (take $f(s) = c^2 s^2 / 2$). It turns out that this condition is also sufficient when considering only linear push-forwards (see Lemma 6.2.11). We have the following sufficient condition if we look at Monge maps of the form $\nabla u \circ P$ with $u$ convex and $P \in \mathbb{R}^{q \times p}$:

Proposition 4.2.4.
Let $X$ and $Y$ be compact subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively, with $p \ge q$. Let $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$. Assume that $\mu$ is regular w.r.t. the Lebesgue measure on $\mathbb{R}^p$ and that $\mathbb{E}_{X \sim \mu}[X] = 0$ and $\mathbb{E}_{Y \sim \nu}[Y] = 0$, without loss of generality. Let $(\pi^*, P^*)$ be an optimal solution of (dual-sqGW). If $P^*$ is surjective, there exists $u : \mathbb{R}^q \to \mathbb{R}$ convex such that $\nabla u \circ P^*$ pushes $\mu$ forward to $\nu$. Moreover, if $\nabla u \circ P^* \in H(\mu,\nu)$ then the coupling $\gamma = (id \times \nabla u \circ P^*)_{\#}\mu$ is optimal for (sqGW).

Proof. Let $(\pi^*, P^*)$ be an optimal solution of (dual-sqGW) with $P^*$ surjective. We have seen previously that $P^*_{\#}\mu$ is regular with respect to the Lebesgue measure on $\mathbb{R}^q$ using Proposition 4.2.1, so that there exists $u : \mathbb{R}^q \to \mathbb{R}$ convex such that $\nabla u \circ P^*$ pushes $\mu$ forward to $\nu$. Moreover:

$$\begin{aligned}
\int \langle P^* x, y \rangle_q + \|x\|^2 \|y\|^2 \, d\pi^*(x,y) &\overset{(1)}{\le} \int (u(P^* x) + u^*(y)) \, d\pi^*(x,y) + \int \|x\|^2 \|y\|^2 \, d\pi^*(x,y) \\
&= \int u(P^* x) \, d\mu(x) + \int u^*(y) \, d\nu(y) + \int \|x\|^2 \|y\|^2 \, d\pi^*(x,y) \\
&\overset{(2)}{=} \int u(P^* x) \, d\mu(x) + \int u^*(\nabla u(P^* x)) \, d\mu(x) + \int \|x\|^2 \|y\|^2 \, d\pi^*(x,y) \\
&\overset{(3)}{=} \int \langle P^* x, \nabla u(P^* x) \rangle_q \, d\mu(x) + \int \|x\|^2 \|y\|^2 \, d\pi^*(x,y)
\end{aligned}$$

In (1) we used the convexity of $u$, in (2) we used that $\nabla u \circ P^*$ is a push-forward, and in (3) we used $u(x) + u^*(\nabla u(x)) = \langle x, \nabla u(x) \rangle$. Now suppose that $T = \nabla u \circ P^* \in H(\mu,\nu)$. Then there exists $f \in \Gamma^1(\mathbb{R}_+, \mathbb{R}_+)$ with $\|T(x)\|^2 = f'(\|x\|^2)$ $\mu$-a.e. Moreover, we have $\int \|x\|^2 \|y\|^2 \, d\pi^*(x,y) \le \int f(\|x\|^2) + f^*(\|y\|^2) \, d\pi^*(x,y)$ by Young's inequality (see Memo 4.2.2).
In this way:

$$\begin{aligned}
\int \langle P^* x, y \rangle_q + \|x\|^2 \|y\|^2 \, d\pi^*(x,y) &\overset{(4)}{\le} \int \langle P^* x, \nabla u(P^* x) \rangle_q \, d\mu(x) + \int f(\|x\|^2) \, d\mu(x) + \int f^*(\|y\|^2) \, d\nu(y) \\
&\overset{(5)}{=} \int \langle P^* x, \nabla u(P^* x) \rangle_q \, d\mu(x) + \int f^*(\|\nabla u(P^* x)\|^2) + f(\|x\|^2) \, d\mu(x) \\
&\overset{(6)}{=} \int \langle P^* x, \nabla u(P^* x) \rangle_q \, d\mu(x) + \int \|\nabla u(P^* x)\|^2 \|x\|^2 \, d\mu(x) \\
&\overset{(7)}{=} \int \langle P^* x, y \rangle_q + \|x\|^2 \|y\|^2 \, d\gamma(x,y)
\end{aligned}$$

In (5) we used that $\nabla u \circ P^*$ is a push-forward. In (6) we used that $f$ satisfies $\|\nabla u(P^* x)\|^2 \in \partial f(\|x\|^2) = \{f'(\|x\|^2)\}$ by definition of $H(\mu,\nu)$, which implies $f^*(\|\nabla u(P^* x)\|^2) + f(\|x\|^2) = \|\nabla u(P^* x)\|^2 \|x\|^2$ (see Memo 2.1.2). In (7), $\gamma$ is defined as $\gamma = (id \times \nabla u \circ P^*)_{\#}\mu$. Overall, $(P^*, \gamma)$ is optimal for (dual-sqGW). Using Theorem 4.2.5, this proves that $\gamma$ is optimal for (sqGW), which concludes the proof.

The condition $\nabla u \circ P \in H(\mu,\nu)$ is quite strong compared to the conditions of Theorem 4.2.3. For example, in the case where $\nabla u$ is linear, only orthogonal transformations are admissible (modulo a scaling). We believe that this result can be improved, and we leave this study for future work. In general, and without any further assumption on $\mu$ and $\nu$, we postulate that there might be degenerate cases in which $\mu$ is regular but there is no deterministic optimal coupling.

In this section we provide numerical solutions for Gromov-Wasserstein problems in Euclidean spaces. We rely on the equivalent formulations defined in Theorem 4.2.1 and Theorem 4.2.5. We consider two discrete probability measures $\mu = \sum_{i=1}^n a_i \delta_{x_i} \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j} \in \mathcal{P}(\mathbb{R}^q)$ with $a \in \Sigma_n$, $b \in \Sigma_m$.

Numerical solution for (innerGW). We note $X = (x_i)_{i=1}^n \in \mathbb{R}^{n \times p}$, $Y = (y_j)_{j=1}^m \in \mathbb{R}^{m \times q}$. As seen in Theorem 4.2.1, computing (innerGW) can be achieved by solving $\sup_{\pi \in \Pi(a,b)} \sup_{\|P\|_F = \sqrt{p}} \langle X P^T Y^T, \pi \rangle_F$, or equivalently:

$$\min_{\pi \in \Pi(a,b)} \min_{\|P\|_F = \sqrt{p}} \langle -X P^T Y^T, \pi \rangle_F \quad (4.33)$$

The resulting problem is convex w.r.t. $P$ and $\pi$ separately (but not jointly convex).
We propose to solve equation (4.33) using Block Coordinate Descent (BCD), which alternates between minimizing w.r.t. $P$ and w.r.t. $\pi$.

Memo 4.2.2 (Young's inequality). Let $a, b \in \mathbb{R}_+$ and $p, q$ real numbers greater than $1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then:

$$ab \le \frac{a^p}{p} + \frac{b^q}{q} \quad (4.31)$$

More generally, if $f$ is a convex function and $f^*$ is its Legendre transform, then:

$$ab \le f(a) + f^*(b) \quad (4.32)$$

which is a consequence of the celebrated Fenchel–Young inequality (see Memo 2.1.2).

Algorithm 7
Gromov-Wasserstein with inner products
Require: $\mu = \sum_{i=1}^n a_i \delta_{x_i} \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j} \in \mathcal{P}(\mathbb{R}^q)$
Set $X = (x_i)_{i=1}^n \in \mathbb{R}^{n \times p}$, $Y = (y_j)_{j=1}^m \in \mathbb{R}^{m \times q}$. Initialize $\pi = a b^T$.
while not converged do
  $P \leftarrow \frac{\sqrt{p}}{\|Y^T \pi^T X\|_F} Y^T \pi^T X$ // (maximize (MaxOT) w.r.t. $P$)
  $\pi \leftarrow \arg\min_{\pi \in \Pi(a,b)} \langle -X P^T Y^T, \pi \rangle_F$ // (maximize (MaxOT) w.r.t. $\pi$ ∼ linear OT)
end while
return $(\pi, P)$

Interestingly enough, the minimization in $\pi$ with $P$ fixed is a linear transportation problem with ground cost $-X P^T Y^T$, which can be computed using standard solvers (see Chapter 2). The minimization w.r.t. $P$ with $\pi$ fixed reads $\sup_{\|P\|_F = \sqrt{p}} \langle X P^T Y^T, \pi \rangle_F$ and has a closed-form solution based on Lemma 4.2.2. More precisely it reads:

$$P \leftarrow \frac{\sqrt{p}}{\|Y^T \pi^T X\|_F} Y^T \pi^T X \quad (4.34)$$

This procedure is presented in Algorithm 7. The complexity is driven by the linear OT problem, which is $O(n^3 \log(n))$ when $n = m$. The complexity of computing $P$ at each iteration is $O(mn(q + p) + pq)$, with $O(mn(q + p))$ for computing $Y^T \pi^T X$ and $O(pq)$ for $\|Y^T \pi^T X\|_F$. The overall complexity when $n = m$ is then $O(n^3 \log(n) + n^2(q + p) + pq)$.

Numerical solution for (sqGW). We note $\bar{x} = \mathrm{Diag}(X X^T) \in \mathbb{R}^n$, $\bar{y} = \mathrm{Diag}(Y Y^T) \in \mathbb{R}^m$. As seen in Theorem 4.2.5, computing (sqGW) can be achieved by first subtracting the means of the measures and then solving:

$$\min_{\pi \in \Pi(a,b)} \min_{P \in \mathbb{R}^{q \times p}} \langle -(X P^T Y^T + \bar{x} \bar{y}^T), \pi \rangle_F + \frac{1}{8} \|P\|_F^2 \quad (4.35)$$

This problem is also convex w.r.t. $\pi$ and $P$ separately (but not jointly convex), and we can rely on a BCD procedure to find a local minimum. The minimization w.r.t. $\pi$ with $P$ fixed is a linear OT problem with ground cost $-(X P^T Y^T + \bar{x} \bar{y}^T)$, and the minimization w.r.t. $P$ with $\pi$ fixed reads:

$$\min_{P \in \mathbb{R}^{q \times p}} \langle -X P^T Y^T, \pi \rangle_F + \frac{1}{8} \|P\|_F^2 \overset{def}{=} \min_{P \in \mathbb{R}^{q \times p}} G(P) \quad (4.36)$$

which is an unconstrained QP that can be solved in closed form.
Indeed, the gradient reads $\nabla G(P) = -Y^T \pi^T X + \frac{P}{4}$, and by the first-order condition a solution is found at $\nabla G(P) = 0$. In this way the update for $P$ is:

$$P \leftarrow 4 \, Y^T \pi^T X \quad (4.37)$$

This procedure is presented in Algorithm 8. Computing all the terms $\|x^{(i)}\|^2 \|y^{(j)}\|^2$, where $x^{(i)}$ is the $i$-th row of $X$, has a naive complexity of $O(np + mq + mn)$, and one also needs to compute $4 Y^T \pi^T X$ at a $O(mn(q + p))$ price. The overall complexity is then $O(n^3 \log(n) + (p + q) n^2)$ when $n = m$.

Runtimes comparison
We perform a comparison between the runtimes of GW using different algorithmic solutions. We consider the Gromov-Wasserstein problem between 10 realizations of a random 2D source probability measure with $n \in \{10, \dots, 30000\}$ points and a random 3D target measure with $m = 100$

Algorithm 8
Gromov-Wasserstein with squared Euclidean distances
Require: $\mu = \sum_{i=1}^n a_i \delta_{x_i} \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j} \in \mathcal{P}(\mathbb{R}^q)$
Subtract the means: $x_i \leftarrow x_i - \frac{1}{n} \sum_k x_k$, $y_j \leftarrow y_j - \frac{1}{m} \sum_k y_k$.
Set $X = (x_i)_{i=1}^n \in \mathbb{R}^{n \times p}$, $Y = (y_j)_{j=1}^m \in \mathbb{R}^{m \times q}$, $\bar{x} = \mathrm{Diag}(X X^T) \in \mathbb{R}^n$, $\bar{y} = \mathrm{Diag}(Y Y^T) \in \mathbb{R}^m$. Initialize $\pi = a b^T$.
while not converged do
  $P \leftarrow 4 \, Y^T \pi^T X$ // (maximize (dual-sqGW) w.r.t. $P$)
  $\pi \leftarrow \arg\min_{\pi \in \Pi(a,b)} \langle -(X P^T Y^T + \bar{x} \bar{y}^T), \pi \rangle_F$ // (maximize (dual-sqGW) w.r.t. $\pi$ ∼ linear OT)
end while
return $(\pi, P)$
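As a complement, here is a minimal runnable sketch of the BCD iteration of Algorithm 8 for uniform weights and n = m atoms, in which case the linear OT step reduces to an assignment problem (solved below with SciPy; in the general case one would call an exact OT solver, e.g. `ot.emd` from the POT library). The function name and the fixed iteration count are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bcd_sqgw(X, Y, n_iter=20):
    """BCD for (sqGW) with uniform weights and n = m atoms: alternates
    the closed-form update P <- 4 Y^T pi^T X with a linear OT step for
    the ground cost -(X P^T Y^T + xbar ybar^T); with uniform weights
    the OT step is an assignment problem."""
    n = X.shape[0]
    X = X - X.mean(axis=0)              # center both measures
    Y = Y - Y.mean(axis=0)
    xbar = np.sum(X ** 2, axis=1)       # ||x_i||^2
    ybar = np.sum(Y ** 2, axis=1)       # ||y_j||^2
    a = np.full(n, 1.0 / n)
    pi = np.outer(a, a)                 # initialize pi = a b^T
    for _ in range(n_iter):
        P = 4.0 * Y.T @ pi.T @ X                      # P-step, cf. (4.37)
        C = -(X @ P.T @ Y.T + np.outer(xbar, ybar))   # ground cost
        rows, cols = linear_sum_assignment(C)         # pi-step (exact OT)
        pi = np.zeros((n, n))
        pi[rows, cols] = 1.0 / n
    return pi, P
```

When the target is a rotated copy of the source, the returned coupling should reach a near-zero (sqGW) cost, since rotations preserve pairwise distances.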
Figure 4.7: Runtimes comparison of GW using the BCD approach of Algorithm 8 (GW BCD), the Frank-Wolfe approach (GW FW) and entropic-GW between two random distributions whose number of source points varies from 10 to 30000, in log-log scale. The time does not include the calculation of the pair-to-pair distances but only the time of the different loops. The variance is computed among the 10 realizations.

points. We compute the Gromov-Wasserstein distance using squared Euclidean distances as $c_X, c_Y$. We compare the timings of the Frank-Wolfe algorithm (see Chapter 3), the entropic regularized approach with $\varepsilon \in \{5, 10, 100\}$ (see Chapter 2), and the BCD approach of Algorithm 8, using the same initialization $\pi = a b^T$ for all methods. The result is depicted in Figure 4.7. Please note that the timings are calculated without taking into account the time needed for computing the matrices $C_1, C_2$ or $X, Y$: they are only based on the loops of the different algorithms. The entropic-GW could not be computed with $\varepsilon \le 10$ in reasonable time for the largest problems. One drawback of entropic-GW is the fact that the regularization parameter $\varepsilon$ is inversely proportional to the gradient step [Peyré 2016]. Thus only high values of $\varepsilon$ are computable in reasonable time, and for $\varepsilon < 100$ the iterations become prohibitively slow, which explains the low variance of entropic-GW with $\varepsilon = 100$.

Figure 4.8: Comparison of the FW solver and the BCD solver for computing the Gromov-Wasserstein distance. (left) $c_X, c_Y$ are squared Euclidean distances. (right) $c_X, c_Y$ are inner products. The $x$-axis is the GW cost obtained by the BCD approach and the $y$-axis the GW cost obtained with the FW algorithm, for respectively (top) $n = 2000$ samples and (bottom) $n = 500$ samples in each distribution. The dashed line represents the diagonal $y = x$.
We can see in Figure 4.7 that the BCD approach is slightly faster than the FW approach, suggesting that the BCD may converge faster to a local minimum than FW, which is, of course, data-dependent. Overall, both methods are comparably fast on this example.

Costs comparison
In this experiment we compare the ability of the BCD approach to find a better solution than the FW approach. We consider the Gromov-Wasserstein problem with both inner-product similarities and squared Euclidean distances. We compute 200 distances using both algorithms, where each distance is calculated as follows: (1) we draw $n \in \{500, 2000\}$ samples from two 10-dimensional normal distributions; (2) we associate to these points random weights $(a, b) \in \Sigma_n \times \Sigma_n$; (3) we initialize the algorithms with the same random coupling matrix $\pi$. The initialization is computed by sampling a random matrix with positive entries and scaling it with the Sinkhorn algorithm in order to obtain the prescribed marginals $a, b$. Results are depicted in Figure 4.8: we plot the GW cost obtained by the BCD approach vs the cost obtained by the FW approach after convergence of each algorithm. As seen in Figure 4.8, there is no strong difference between the FW approach and the BCD approach when squared Euclidean distances are considered (left part of the figure): the FW algorithm finds a better solution in 51% of the cases when $n = 2000$ and 56% when $n = 500$. Surprisingly, both algorithms seem to lead to the same solution when inner-product similarities are used (right part of the figure): in 99% of the cases the costs are identical up to numerical precision.

Conclusion on the experiments
As seen in these experiments, there are no significant differences between the FW approach and the BCD procedure. This is not surprising for the runtimes comparison, since both methods have a theoretical cubic complexity. The cost-comparison experiment may also suggest that the BCD procedure and FW are equivalent in the case of inner-product similarities, that is, that the iterations of the BCD are the same as the iterations of the FW. Further studies could be conducted to examine the high-dimensional setting, where we postulate that the BCD approach may lead to better results than the FW procedure, i.e. may converge faster and produce a better solution.
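For completeness, the random initialization used in the costs-comparison experiment above — a positive random matrix rescaled to the prescribed marginals with the Sinkhorn algorithm — can be sketched as follows (the function name and iteration count are ours):

```python
import numpy as np

def random_coupling(a, b, n_iter=500, seed=0):
    """Draw a random positive matrix and rescale it with Sinkhorn /
    IPF iterations so that its row sums are a and its column sums
    are b, giving a (generically dense) coupling in Pi(a, b)."""
    rng = np.random.default_rng(seed)
    K = rng.uniform(0.1, 1.0, size=(len(a), len(b)))
    v = np.ones(len(b))
    for _ in range(n_iter):
        u = a / (K @ v)       # match the row marginals
        v = b / (K.T @ u)     # match the column marginals
    return u[:, None] * K * v[None, :]
```

Since the sampled matrix is strictly positive, the Sinkhorn iterations converge and the resulting coupling has (up to numerical tolerance) the prescribed marginals.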
As seen in Section 4.1.2, the Gromov-Wasserstein problem is also closely related to the Gromov-Monge problem [Mémoli 2018], which is defined in Euclidean spaces for $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$ as:

Definition 4.2.4 (Gromov-Monge). The Gromov-Monge problem aims at finding:

$$GM(\mu,\nu) = \inf_{T_{\#}\mu = \nu} J(T) = \inf_{T_{\#}\mu = \nu} \iint \left( \|x - x'\|^2 - \|T(x) - T(x')\|^2 \right)^2 d\mu(x) \, d\mu(x') \quad \text{(GM)}$$

with the understanding that $GM(\mu,\nu) = +\infty$ when $\{T : \mathbb{R}^p \to \mathbb{R}^q \,|\, T_{\#}\mu = \nu\} = \emptyset$.

The problem (GM) is the exact counterpart for the Gromov-Wasserstein distance of the Monge problem for the Wasserstein distance. If we note $GW(\mu,\nu)$ the Gromov-Wasserstein distance with squared Euclidean costs, then we have $GW(\mu,\nu) \le GM(\mu,\nu)$, but in general both problems are not equivalent (see [Mémoli 2018]). Note that when Problem 1 holds, i.e. when the Gromov-Wasserstein problem admits an optimal transport plan supported on a deterministic map, then (GM) and (sqGW) are equivalent. Moreover, we have seen previously in Theorem 4.1.2 that when $\mu$ and $\nu$ are discrete probability measures with the same number of atoms and uniform weights, then (GM) is also equivalent to the Gromov-Wasserstein problem (sqGW), so that $GW(\mu,\nu) = GM(\mu,\nu)$.

We propose to study the Gromov-Monge problem further in this section. In particular, we consider the special case of Gaussian measures $\mu = \mathcal{N}(m_\mu, \Sigma_\mu)$, $\nu = \mathcal{N}(m_\nu, \Sigma_\nu)$. It is motivated by the linear OT theory where, when $p = q$, there is a closed-form solution for $W_2$. In this case the optimal Monge map is linear and given by [Takatsu 2011] $T : x \mapsto m_\nu + A(x - m_\mu)$ where:

$$A = \Sigma_\mu^{-1/2} \left( \Sigma_\mu^{1/2} \Sigma_\nu \Sigma_\mu^{1/2} \right)^{1/2} \Sigma_\mu^{-1/2} \quad (4.38)$$

Can we derive the same type of result for the Gromov-Monge geometry? We will prove that, when restricted to linear push-forwards and in the special case $p = q$, the problem also admits a closed-form expression.
In this way we consider the following linear Gromov-Monge problem:

LGM(μ, ν) = inf_{T#μ=ν, T linear} J(T) = inf_{T#μ=ν, T linear} ∫ (‖x − x′‖² − ‖T(x) − T(x′)‖²)² dμ(x) dμ(x′)   (LGM)

We recall that V_p(R^q) is the Stiefel manifold defined by V_p(R^q) = {B ∈ R^{q×p} | B^T B = I_p}. The main result of this section is the following theorem:

Theorem 4.2.6.
Let μ = N(0, Σ_μ) ∈ P(R^p), ν = N(0, Σ_ν) ∈ P(R^q), centered without loss of generality. Let Σ_μ = V_μ D_μ V_μ^T, Σ_ν = V_ν D_ν V_ν^T be the diagonalizations of the covariance matrices such that the eigenvalues of D_μ and D_ν are ordered nondecreasingly. When p ≤ q we have:

LGM(μ, ν) = 4(tr(Σ_μ) − tr(Σ_ν))² + 8(tr(Σ_μΣ_μ) + tr(Σ_νΣ_ν)) + 16 min_{B ∈ V_p(R^q)} −tr(D_μ B^T D_ν B)   (4.39)

When p = q, an optimal linear Monge map is given by T(x) = Ax where:

A = V_ν D_ν^{1/2} D_μ^{−1/2} V_μ^T = Σ_ν^{1/2} V_ν V_μ^T Σ_μ^{−1/2}   (4.40)

so that:

LGM(μ, ν) = 4(tr(Σ_μ) − tr(Σ_ν))² + 8(tr(Σ_μΣ_μ) + tr(Σ_νΣ_ν)) − 16 tr(D_μ D_ν)   (4.41)

Sketch of proof.
We can show that when considering only linear push-forwards the problem (GM) can be recast as an orthogonality-constrained quadratic program (QPOC) which, when p = q, admits a closed form. This is made possible thanks to the Gaussian assumption, which allows computing the 4th-order moments of the distributions using Isserlis' theorem [Isserlis 1918], which proves that 4th-order moments can be computed from the 2nd-order ones. We give the full proof in Section 6.2.11.

Geometric interpretations of LGM
Interestingly enough, the optimal linear map can be related, inter alia, to the optimal linear map of the classical Monge problem. In the following we consider p = q so that, using Theorem 4.2.6:

LGM(μ, ν) = 4(tr(Σ_μ) − tr(Σ_ν))² + 8(tr(Σ_μΣ_μ) + tr(Σ_νΣ_ν)) − 16 tr(D_μ D_ν)   (4.42)

which corresponds to A = V_ν D_ν^{1/2} D_μ^{−1/2} V_μ^T = Σ_ν^{1/2} V_ν V_μ^T Σ_μ^{−1/2}.

• Case Σ_μ = Σ_ν. When the covariances are equal, i.e. Σ_μ = Σ_ν, we can conclude that LGM(μ, ν) = 0 + 8(tr(Σ_μΣ_μ) + tr(Σ_μΣ_μ)) − 16 tr(D_μ D_μ) = 16 tr(Σ_μΣ_μ) − 16 tr(Σ_μΣ_μ) = 0, which corresponds to A = I. This implies that an optimal way of transferring the masses so that, on average, the pair-to-pair distances are preserved is by means of the identity mapping.

• Case Σ_μ = kΣ_ν. If there is a scaling factor between the covariances, Σ_μ = kΣ_ν with k > 0, then A = √k I. With a little calculus one can check that LGM(μ, ν) = 4(k − 1)²(tr(Σ_ν)² + 2 tr(Σ_νΣ_ν)), so that LGM(μ, ν) is minimal, equal to zero, when k = 1, which corresponds to the previous case. When k increases above 1 the distance increases quadratically, and when k ∈ ]0, 1] the distance decreases as k goes to 1.

• Rotation.
Another interesting case is when we rotate the samples of the distribution and compute the linear Gromov-Monge distance between the original distribution and its rotated counterpart. This case corresponds to Σ_ν = OΣ_μO^T where O ∈ O(p), and we can easily check that LGM(μ, ν) = 0 with an optimal map given by the rotation. This behavior is intuitive since the Gromov-Monge problem with linear maps is invariant to rotations.

• Commuting covariances.
Finally, when the covariances commute, i.e. Σ_μΣ_ν = Σ_νΣ_μ, we can relate to the linear transportation of classical OT. In this situation both matrices are simultaneously diagonalizable and the eigenspaces coincide. We recall that the optimal linear map for the Wasserstein distance with d(x, y) = ‖x − y‖² is A_W = Σ_μ^{−1/2}(Σ_μ^{1/2}Σ_νΣ_μ^{1/2})^{1/2}Σ_μ^{−1/2}, which reduces to A_W = Σ_ν^{1/2}Σ_μ^{−1/2} when the covariances commute. Moreover, since the matrices share the same eigenspaces, we can take V_μ = V_ν for the linear Gromov-Monge. In this way A reduces to A = Σ_ν^{1/2}Σ_μ^{−1/2} = A_W. This proves that when the covariances commute, the optimal map of Wasserstein is an optimal map for the linear Gromov-Monge.
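The closed-form map of Theorem 4.2.6 can be checked numerically: since A Σ_μ A^T = V_ν D_ν V_ν^T = Σ_ν, the map pushes N(0, Σ_μ) exactly onto N(0, Σ_ν). The following is a minimal NumPy sanity check (the function name `gm_linear_map` is ours):

```python
import numpy as np

def gm_linear_map(cov_mu, cov_nu):
    """Closed-form optimal linear Gromov-Monge map between the centered
    Gaussians N(0, cov_mu) and N(0, cov_nu) when p = q (Theorem 4.2.6):
    A = V_nu D_nu^{1/2} D_mu^{-1/2} V_mu^T, with eigenvalues sorted in
    nondecreasing order (np.linalg.eigh returns them that way)."""
    d_mu, v_mu = np.linalg.eigh(cov_mu)
    d_nu, v_nu = np.linalg.eigh(cov_nu)
    return v_nu @ np.diag(np.sqrt(d_nu / d_mu)) @ v_mu.T

rng = np.random.default_rng(0)
M1, M2 = rng.standard_normal((2, 3, 3))
cov_mu = M1 @ M1.T + np.eye(3)   # random SPD covariance of the source
cov_nu = M2 @ M2.T + np.eye(3)   # random SPD covariance of the target
A = gm_linear_map(cov_mu, cov_nu)
print(np.allclose(A @ cov_mu @ A.T, cov_nu))  # True: A pushes mu onto nu
```

The push-forward identity A Σ_μ A^T = Σ_ν follows directly from V_μ^T V_μ = V_ν^T V_ν = I, which is what the check above exercises.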
Figure 4.9: Example of linear Gromov-Monge mapping estimation between empirical distributions. (left) Source and target samples. (middle) Resulting linear Wasserstein mapping A_W. (right) Resulting linear mapping using the linear Gromov-Monge map A. Note that in this case the mapped samples are arbitrarily rotated.

An illustration of the map A is given in Figure 4.9, where we compute LGM between two 2D empirical distributions that we consider Gaussian in first approximation. The optimal map A gives a different behaviour than the Wasserstein map A_W, which seems to better grasp the transformation of the samples. This is somewhat natural since the Gromov-Monge cost is less rigid than the Wasserstein one and only forces the samples to be isometrically distributed on average. However, note that the target samples are arbitrarily rotated in the case of Gromov-Monge, since LGM is invariant to rotation of the samples.
Solving the problem when p ≠ q

As described in Theorem 4.2.6, the situation is more delicate when p ≠ q, i.e. when the Gaussian measures are not part of the same ground Euclidean space. In this case there is no closed form anymore and one needs to solve the following Quadratic Program with Orthogonality Constraints (QPOC):

min_{B ∈ V_p(R^q)} −tr(B^T D_ν B D_μ) =: min_{B ∈ V_p(R^q)} F(B)   (QPOC)

This problem is a special case of optimizing a smooth function over the Stiefel manifold, which is non-convex in general but for which various methods have been proposed over the years (see e.g. [Jiang 2014, Wen 2010, Abrudan 2008, Absil 2009]). A standard approach for solving these types of problems is to leverage the Riemannian structure of V_p(R^q) and use a gradient descent method on the manifold. Informally, the idea is to start from an arbitrary point B₀ ∈ V_p(R^q), then iteratively move in a search direction D(B) defined by a tangent vector while staying on V_p(R^q), until a critical point is found. In general this procedure requires computing the so-called exponential map, which is quite difficult. A cheaper alternative is a retraction map which approximates the exponential map (see e.g. [Liu 2016] and references therein). We propose to solve (QPOC) using the Geoopt library [Kochurov 2020]. We illustrate the resulting optimal map in Figure 4.10, where we solve the LGM problem between two empirical distributions. The source distribution is 3-dimensional while the target distribution is 2-dimensional. We observe that the optimal Gromov-Monge map manages to capture the overall transformation of the source distribution.
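The retraction-based Riemannian descent just described can be sketched in a few lines. The following is a minimal NumPy illustration (not the Geoopt-based implementation used in our experiments; the function name and hyper-parameters are ours): a Euclidean gradient step, a projection on the tangent space of the Stiefel manifold, and a QR retraction.

```python
import numpy as np

def solve_qpoc(D_mu, D_nu, lr=0.02, n_iter=3000, seed=0):
    """Minimize F(B) = -tr(B^T D_nu B D_mu) over the Stiefel manifold
    V_p(R^q) = {B in R^{q x p} : B^T B = I_p} by Riemannian gradient
    descent with a QR retraction (a cheap substitute for the
    exponential map)."""
    p, q = D_mu.shape[0], D_nu.shape[0]
    rng = np.random.default_rng(seed)
    B, _ = np.linalg.qr(rng.standard_normal((q, p)))  # random feasible start
    for _ in range(n_iter):
        G = -2.0 * D_nu @ B @ D_mu                    # Euclidean gradient of F
        sym = (B.T @ G + G.T @ B) / 2.0
        G_tan = G - B @ sym                           # tangent-space projection
        Q, R = np.linalg.qr(B - lr * G_tan)           # retract onto V_p(R^q)
        B = Q * np.sign(np.diagonal(R))               # fix QR sign ambiguity
    return B

D_mu = np.diag([1.0, 2.0])          # p = 2
D_nu = np.diag([1.0, 2.0, 3.0])     # q = 3
B = solve_qpoc(D_mu, D_nu)
print(-np.trace(B.T @ D_nu @ B @ D_mu))  # close to -8 = -(2*3 + 1*2)
```

For diagonal costs with distinct entries, the stationary points pair eigenvalues of D_μ with eigenvalues of D_ν; from a generic start the descent reaches the sorted pairing, here pairing 2 with 3 and 1 with 2.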
Figure 4.10: Example of linear Gromov-Monge mapping estimation between two empirical distributions. (left)
3D source and 2D target distributions. (right) Resulting linear mapping using the linear Gromov-Monge map A.

In our opinion, the previous results raise interesting questions and pave the way for a deeper understanding of the Gromov-Wasserstein and Gromov-Monge geometries in Euclidean spaces. Our results on the regularity of optimal couplings are, to the best of our knowledge, the first on this subject, and we believe they could be improved in order to bridge the theoretical gap between Wasserstein and Gromov-Wasserstein.
Gromov-Wasserstein in Euclidean spaces
In the inner product case we have seen that, among the set of solutions of (MaxOT), the couples with P* surjective are of particular interest. For example, is it possible, under some mild conditions on μ, ν, to prove that there always exists an optimal couple (π*, P*) of (MaxOT) such that P* is surjective? If this result holds, it would imply that, if the source measure is regular, the Gromov-Wasserstein problem with inner products is equivalent to its Monge counterpart:

inf_{T#μ=ν} ∫ (⟨x, x′⟩_p − ⟨T(x), T(x′)⟩_q)² dμ(x) dμ(x′)   (4.43)

The equivalence between the Gromov-Monge and Gromov-Wasserstein problems would be a nice generalization of Brenier's theorem to the Gromov-Wasserstein geometry. Related to this problem, the uniqueness of the couples (u, P) is also interesting further work. More precisely, can we find suitable conditions on μ, ν so that there is a unique ∇u ∘ P which pushes μ forward to ν optimally? When the supports of the measures are not “too symmetric”, it seems reasonable that all the solutions are somehow unique modulo a class of isometries such as rotations. As described in this section, Brenier's theorem states that there is a unique optimal push-forward in the case of the Wasserstein geometry. This property allows deriving closed-form expressions in the Gaussian case, and we believe that similar reasoning could be applied if uniqueness (up to some rotations) holds for (MaxOT). Finally, can we extend the previous results about the inner product case to kernel similarities? More precisely, when c_X(x, x′) = k_X(x, x′) defines a kernel (and similarly for c_Y), we know from the kernel trick that there exists an inner product space V_X and a feature map φ : X → V_X such that c_X(x, x′) = ⟨φ(x), φ(x′)⟩_{V_X}.
In this case, can we find similar results regarding the regularity of the optimal transport plans of the Gromov-Wasserstein distance?

The squared Euclidean distance case seems more delicate to handle, as it does not echo classical OT costs. We believe that the dual problem could be used to find a closed-form expression for 1D discrete probability measures with possibly different numbers of atoms and non-uniform weights, as done for the inner product case. This would be an interesting, stronger result than the case considered in Section 4.1.2. We postulate that, in this case, an optimal coupling is also given by a monotone rearrangement which is increasing or decreasing. For general probability measures, assessing the regularity of optimal transport plans seems more complicated. The sufficient condition of Proposition 4.2.4 may be improved into a necessary condition in order to characterize optimal couplings under some hypotheses on μ, ν. Another interesting approach for the squared Euclidean distance case would be to consider optimal couplings when the target measure is a small perturbation of the source measure. More precisely, we believe the following holds:

Problem 2.
Let μ ∈ P(R^p). Let ν = (id + εu)#μ ∈ P(R^p), with u : R^p → R^p, be a small perturbation of μ. Then the coupling γ = (id × T)#μ where T = id + εu is optimal for:

inf_{π ∈ Π(μ,ν)} ∫ (‖x − x′‖² − ‖y − y′‖²)² dπ(x, y) dπ(x′, y′)   (4.44)

Chapter 5. CO-Optimal Transport

“You are wrong to read the newspapers; it congests you.”
– André Gide,
Les Faux-Monnayeurs
Summary of the contributions
This chapter is based on the paper [Redko 2020] and addresses the problem of optimal transport on incomparable spaces. The original formulation of the optimal transport problem relies on the existence of a cost function between the samples of the two distributions, which makes it impractical for comparing data distributions supported on different topological spaces. To circumvent this limitation, we propose a novel OT problem, named COOT for CO-Optimal Transport, that aims to simultaneously optimize two transport maps between both samples and features. This is different from other approaches that either discard the individual features by focusing on pairwise distances (e.g. Gromov-Wasserstein) or need to model explicitly the relations between the features. COOT leads to interpretable correspondences between both samples and feature representations and holds metric properties. We provide a thorough theoretical analysis of our framework and establish rich connections with the Gromov-Wasserstein distance. We demonstrate its versatility with two machine learning applications in heterogeneous domain adaptation and co-clustering/data summarization, where COOT leads to performance improvements over the competing state-of-the-art methods.
The problem of comparing two sets of samples arises in many fields in machine learning, such as manifold alignment [Cui 2014], image registration [Haker 2001], and unsupervised word and sentence translation [Rapp 1995], among others. When correspondences between the sets are known a priori, one can align them with a global transformation of the features, e.g. with the widely used
Procrustes analysis [Gower 2004, Goodall 1991]. For unknown correspondences, other popular alternatives to this methodinclude correspondence free manifold alignment procedure [Wang 2009], soft assignment coupled witha Procrustes matching [Rangarajan 1997a] or Iterative closest point and its variants for 3D shapes[Besl 1992, Yang 2020]. When one models the considered sets of samples as empirical probabilitydistributions, the optimal transport framework provides a solution to find, without supervision, a soft-correspondence map between them given by an optimal coupling . OT-based approaches have been usedwith success in numerous applications such as embeddings’ alignments [Alvarez-Melis 2019, Grave 2019]and Domain Adaptation (DA) [Courty 2017] to name a few. However, one important limit of using OT forsuch tasks is that the two sets are assumed to lie in the same space so that the cost between samples acrossthem can be computed. This major drawback does not allow OT to handle correspondences’ estimationacross heterogeneous spaces, preventing its application in problems such as, for instance, heterogeneous DA(HDA). To circumvent this restriction, one may rely on the Gromov-Wasserstein distance [Memoli 2011]:a non-convex quadratic OT problem that finds the correspondences between two sets of samples based ontheir pairwise intra-domain similarity (or distance) matrices. Such an approach was successfully appliedto sets of samples that do not lie in the same Euclidean space, e.g for shapes [Solomon 2016], wordembeddings [Alvarez-Melis 2018a] and HDA [Yan 2018] mentioned previously. One important limit of GWis that it finds the samples’ correspondences but discards the relations between the features by consideringpairwise similarities only. Another line of works [Alvarez-Melis 2019, Grave 2019] considers the problem ofmatching sets of points with respect to a global transformations of the features, usually modeled by alinear transformation such as a rotation. 
These approaches differ from the work proposed here, where we consider instead a probabilistic coupling of the features, as described below.

In this chapter, we propose a novel OT approach called CO-Optimal Transport (COOT) that simultaneously infers the correspondences between the samples and the features of two arbitrary sets. Our new formulation includes GW as a special case, and has the extra advantage of working with raw data directly, without needing to compute, store and choose the computationally demanding similarity measures required for the latter. Moreover, COOT provides a meaningful mapping between both instances and features across the two datasets, thus having the virtue of being interpretable. We thoroughly analyze the proposed problem, derive an optimization procedure for it and highlight several insightful links to other approaches. On the practical side, we provide evidence of its versatility in machine learning by putting forward two applications in HDA and co-clustering where our approach achieves state-of-the-art results.

The rest of this chapter is organized as follows. We introduce the COOT problem in Section 5.2, state its mathematical properties in Section 5.3 and give an optimization routine for solving it efficiently in Section 5.4. In Section 5.5, we show how COOT is related to other OT-based distances and recover efficient solvers for some of them in particular cases. Finally, in Section 5.6.1 and Section 5.6.2, we present an experimental study providing highly competitive results in HDA and co-clustering compared to several baselines.
We consider two datasets represented by matrices X = [x₁, …, x_n]^T ∈ R^{n×d} and X′ = [x′₁, …, x′_{n′}]^T ∈ R^{n′×d′}, where in general we assume that n ≠ n′ and d ≠ d′. In what follows, the rows of the datasets are denoted as samples and their columns as features. We endow the samples (x_i)_{i∈[[n]]} and (x′_i)_{i∈[[n′]]} with weights w = [w₁, …, w_n]^T ∈ Σ_n and w′ = [w′₁, …, w′_{n′}]^T ∈ Σ_{n′} that both lie in the simplex, so as to define empirical distributions supported on (x_i)_{i∈[[n]]} and (x′_i)_{i∈[[n′]]}. In addition to these distributions, we similarly associate weights given by vectors v ∈ Σ_d and v′ ∈ Σ_{d′} with the features. Note that when no additional information is available about the data, all the weight vectors can be set as uniform.

We define the CO-Optimal Transport problem as follows:

min_{π^s ∈ Π(w, w′), π^v ∈ Π(v, v′)} Σ_{i,j,k,l} L(X_{i,k}, X′_{j,l}) π^s_{i,j} π^v_{k,l} = min_{π^s ∈ Π(w, w′), π^v ∈ Π(v, v′)} ⟨L(X, X′) ⊗ π^s, π^v⟩_F   (COOT)

where L : R × R → R₊ is a divergence measure between 1D variables, L(X, X′) is the d × d′ × n × n′ tensor of all pairwise divergences between the elements of X and X′, and Π(·, ·) is the set of linear transport constraints. Note that problem (COOT) seeks a simultaneous transport π^s between samples and a transport π^v between features across distributions. In the following, we write COOT(X, X′, w, w′, v, v′) (or COOT(X, X′) when it is clear from the context) to denote the objective value of the optimization problem (COOT).

Entropic regularization
Equation (COOT) can also be extended to the entropic regularized case favoured in the OT community for remedying the heavy computational burden of OT and reducing its sample complexity [Cuturi 2013, Altschuler 2017, Genevay 2019]. This leads to the following problem:

min_{π^s ∈ Π(w, w′), π^v ∈ Π(v, v′)} ⟨L(X, X′) ⊗ π^s, π^v⟩_F + Ω(π^s, π^v)   (5.1)

where, for ε₁, ε₂ > 0, the regularization term writes Ω(π^s, π^v) = ε₁ H(π^s | ww′^T) + ε₂ H(π^v | vv′^T), with H(π^s | ww′^T) = Σ_{i,j} log(π^s_{i,j} / (w_i w′_j)) π^s_{i,j} being the relative entropy. Note that, similarly to OT [Cuturi 2013] and GW [Peyré 2016], adding the regularization term can lead to a more robust estimation of the transport matrices but prevents them from being sparse.
Figure 5.1: Illustration of COOT between MNIST and USPS datasets. (left) Samples from MNIST and USPS datasets; (center left)
Transport matrix π^s between samples sorted by class; (center) USPS image with pixels colored w.r.t. their 2D position; (center right) transported colors on an MNIST image using π^v, where black pixels correspond to non-informative MNIST pixels always at 0; (right) transported colors on an MNIST image using π^v with entropic regularization.

In order to illustrate our proposed COOT method and to explain the intuition behind it, we solve the optimization problem (COOT) using the algorithm described in Section 5.4 between two classical digit recognition datasets: MNIST and USPS. We choose these particular datasets for our illustration as they contain images of different resolutions (USPS is 16 ×
16 and MNIST is 28 ×
28) that belong to the same classes (digits between 0 and 9). Additionally, the digits are also slightly differently centered, as illustrated by the examples in the left part of Figure 5.1. Altogether, this means that without specific pre-processing, the images do not lie in the same topological space and thus cannot be compared directly using conventional distances. We randomly select 300 images per class in each dataset, normalize the magnitudes of pixels to [0,
1] and consider digit images as samples, while each pixel acts as a feature, leading to 256 and 784 features for USPS and MNIST respectively. We use uniform weights for w, w′ and normalize the average values of each pixel for v, v′ in order to discard non-informative pixels that are always equal to 0.

The result of solving problem (COOT) is reported in Figure 5.1. In the center-left part, we provide the coupling π^s between the samples, i.e. the different images, sorted by class, and observe that 67% of mappings occur between samples from the same class, as indicated by the block-diagonal structure of the coupling matrix. The coupling π^v, in its turn, describes the relations between the features, i.e. the pixels, in both domains. To visualize it, we color-code the pixels of the source USPS image and use π^v to transport the colors onto a target MNIST image, so that its pixels are defined as convex combinations of colors from the former with coefficients given by π^v. The corresponding results are shown in the right part of Figure 5.1 for both the original COOT and its entropic regularized counterpart. From these two images, we can observe that colored pixels appear only in the central areas and exhibit a strong spatial coherency, despite the fact that the geometric structure of the image is totally unknown to the optimization problem, as each pixel is treated as an independent feature. COOT has recovered a meaningful spatial transformation between the two datasets in a completely unsupervised way, different from the trivial rescaling of images that one may expect when aligning USPS digits occupying the full image space and MNIST digits lying in the middle of it.

For further evidence, Figures 5.2 and 5.3 illustrate different images of both datasets obtained by transporting pixels from USPS (resp. MNIST) to MNIST (resp.
USPS) using the optimal coupling π^v. Notably, the case USPS → MNIST shows that transporting the pixels through π^v leads to a better spatial coherency than a simple rescaling of the image.

Figure 5.2: Linear mapping from USPS to MNIST using π^v. (First row) Original USPS samples, (Second row)
Samples resized to target resolution, (Third row)
Samples mapped using π^v, (Fourth row) Samples mapped using π^v with entropic regularization.

Figure 5.3: Linear mapping from MNIST to USPS using π^v. (First row) Original MNIST samples, (Second row)
Samples resized to target resolution, (Third row)
Samples mapped using π^v, (Fourth row) Samples mapped using π^v with entropic regularization.

COOT as a bilinear program
COOT is a special case of a Quadratic Program (QP) with linear constraints called a Bilinear Program (BP). More precisely, it is an indefinite BP problem [Gallo 1977]. It was proved (e.g. in [Pardalos 1987, Horst 1996]) that there exists an optimal solution lying on extremal points of the polytopes Π(w, w′) and Π(v, v′). When n = n′, d = d′ and the weights w = w′ = 1_n/n, v = v′ = 1_d/d are uniform, Birkhoff's theorem [Birkhoff 1946] states that the sets of extremal points of Π(1_n/n, 1_n/n) and Π(1_d/d, 1_d/d) are the sets of permutation matrices, so that there exists an optimal solution (π^{s*}, π^{v*}) whose transport maps are supported on two permutations σ^{s*}, σ^{v*} ∈ S_n × S_d.

The BP problem is also related to the Bilinear Assignment Problem (BAP), where π^s and π^v are searched for in the set of permutation matrices. The latter was shown to be NP-hard if d = O(r√n) for fixed r, and solvable in polynomial time if d = O(√(log n)) [Custic 2016]. In this case, we look for the best permutations of the rows and columns of our datasets that lead to the smallest cost. COOT provides a tight convex relaxation of the BAP by 1) relaxing the constraint set of permutations into the convex set of doubly stochastic matrices and 2) ensuring that the two problems are equivalent, i.e., one can always find a pair of permutations that minimizes (COOT), as explained in the paragraph above.

Finding a meaningful similarity measure between datasets is useful in many machine learning tasks, as pointed out e.g. in [Alvarez-Melis 2020]. Interestingly enough, COOT induces a notion of distance between datasets X and X′. More precisely, it vanishes iff they are the same up to a permutation of rows
Algorithm 9: BCD for COOT
π^s(0) ← ww′^T, π^v(0) ← vv′^T, k ← 0
while k < maxIt and err > 0 do
  π^s(k) ← argmin_{π^s ∈ Π(w, w′)} ⟨L(X, X′) ⊗ π^v(k−1), π^s⟩_F  // linear OT problem on the samples
  π^v(k) ← argmin_{π^v ∈ Π(v, v′)} ⟨L(X, X′) ⊗ π^s(k), π^v⟩_F  // linear OT problem on the features
  err ← ‖π^v(k−1) − π^v(k)‖_F
  k ← k + 1
end while

and columns, as established below:

Proposition 5.3.1 (COOT is a distance). Suppose L = |·|^p, p ≥ 1, n = n′, d = d′, and that the weights w, w′, v, v′ are uniform. Then COOT(X, X′) = 0 iff there exist a permutation of the samples σ₁ ∈ S_n and of the features σ₂ ∈ S_d such that, ∀ i, k, X_{i,k} = X′_{σ₁(i),σ₂(k)}. Moreover, it is symmetric and satisfies the triangle inequality as long as L satisfies the triangle inequality, i.e., COOT(X, X″) ≤ COOT(X, X′) + COOT(X′, X″).

Note that in the general case when n ≠ n′, d ≠ d′, positivity and the triangle inequality still hold but COOT(X, X′) > 0. The proof can be found in Section 6.3. Interestingly, our result generalizes the metric property proved in [Faliszewski 2019] for the election isomorphism problem, this latter result being valid only for the BAP case (for a discussion on the connection between COOT and the work of [Faliszewski 2019], see Section 6.3.7). Finally, we note that this metric property means that COOT can be used as a divergence in a large number of potential applications as, for instance, in generative learning [Bunne 2019].
Even though solving COOT exactly is NP-hard, in practice computing a solution can be done rather efficiently. To this end, we propose to use Block Coordinate Descent (BCD), which consists in iteratively solving the problem for π^s or π^v with the other kept fixed. Interestingly, this boils down to solving at each step a linear OT problem that requires O(n³ log(n)) operations with a network simplex algorithm, as detailed in the pseudo-code given in Algorithm 9. This approach, also known as the “mountain climbing procedure” [Konno 1976a] in the BP literature, was proved to decrease the loss at each iteration and thus to converge within a finite number of iterations [Horst 1996]. We also note that at each iteration one needs to compute the equivalent cost matrix L(X, X′) ⊗ π^(·), which has a complexity of O(ndn′d′). However, one can reduce it using Proposition 1 from [Peyré 2016] for the case when L is the squared Euclidean distance |·|² or the Kullback-Leibler divergence. In this case, the overall computational complexity becomes O(min{(n + n′)dd′ + n²n′; (d + d′)nn′ + d²d′}). We refer the interested reader to Section 6.3.2 for further details.

Finally, we can use the same BCD procedure for the entropic regularized version of COOT (5.1), where at each iteration an entropic regularized OT problem can be solved efficiently using Sinkhorn's algorithm [Cuturi 2013] with several possible improvements [Altschuler 2017, Altschuler 2019, Alaya 2019]. Note that this procedure can be easily adapted in the same way to include unbalanced OT problems [Chizat 2017] as well.

Relation with other OT distances

The COOT problem is defined for arbitrary matrices X ∈ R^{n×d}, X′ ∈ R^{n′×d′} and so can readily be used to compare pairwise similarity matrices between the samples, C = (c(x_i, x_j))_{i,j} ∈ R^{n×n} and C′ = (c′(x′_k, x′_l))_{k,l} ∈ R^{n′×n′}, for some c, c′. To avoid redundancy, we use the term “similarity” for both similarity and distance functions in what follows.
This situation arises in applications dealing with relational data, e.g. in a graph context (see Chapter 3), or in deep metric alignment [Ezuz 2017]. These problems have been successfully tackled using the Gromov-Wasserstein distance (see Chapter 2). We recall that given C ∈ R^{n×n} and C′ ∈ R^{n′×n′}, the GW distance is defined by:

GW(C, C′, w, w′) = min_{π^s ∈ Π(w, w′)} ⟨L(C, C′) ⊗ π^s, π^s⟩_F   (5.2)

As suggested by the similar objective functions and constraints, GW and COOT are linked in multiple ways. Below, we make explicit the link between GW and COOT using a reduction of a concave QP to an associated BP problem established in [Konno 1976b], and show that they are equivalent when working with squared Euclidean distance matrices C ∈ R^{n×n}, C′ ∈ R^{n′×n′} or with inner product similarities (see Chapter 4). More precisely, this latter equivalence follows from [Konno 1976b], where it was shown that a concave QP can be solved by a reduction to an associated BP problem:

Theorem 5.5.1 (Adapted from [Konno 1976b]). If Q is a negative definite matrix, then the problems:

min_x f(x) = c^T x + x^T Q x   s.t. Ax = b, x ≥ 0   (5.3)
min_{x,y} g(x, y) = ½ c^T x + ½ c^T y + x^T Q y   s.t. Ax = b, Ay = b, x, y ≥ 0   (5.4)

are equivalent. More precisely, if x* is an optimal solution for (5.3), then (x*, x*) is a solution for (5.4), and if (x*, y*) is optimal for (5.4), then both x* and y* are optimal for (5.3).

Using this principle, one can link GW with the COOT problem when working on intra-domain similarity matrices C ∈ R^{n×n}, C′ ∈ R^{n′×n′} thanks to the next proposition:

Proposition 5.5.1.
Let L = |·|² and suppose that C ∈ R^{n×n}, C′ ∈ R^{n′×n′} are squared Euclidean distance matrices such that C = x1_n^T + 1_n x^T − 2XX^T, C′ = x′1_{n′}^T + 1_{n′} x′^T − 2X′X′^T with x = diag(XX^T), x′ = diag(X′X′^T). Then the GW problem can be written as a concave quadratic program (QP) whose Hessian reads Q = −8 ∗ XX^T ⊗_K X′X′^T. If C ∈ R^{n×n}, C′ ∈ R^{n′×n′} are inner product similarities, i.e. such that C = XX^T, C′ = X′X′^T, then GW is also a concave quadratic program (QP) whose Hessian reads Q = −2 ∗ XX^T ⊗_K X′X′^T.

When working with arbitrary similarity matrices, COOT provides a lower bound for GW, and using Proposition 5.5.1 we can prove that both problems become equivalent in the cases of squared Euclidean distances and inner product similarities.
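The concavity claim above can be illustrated numerically (a sketch, working up to the positive constant in front of the Hessian): a Kronecker product of two positive semidefinite matrices is positive semidefinite, so a negative multiple of XX^T ⊗_K X′X′^T is negative semidefinite.

```python
import numpy as np

# Sketch: up to a positive constant, the Hessian of the GW objective with
# inner product similarities is Q = -(XX^T) kron (X'X'^T). Since both Gram
# matrices are PSD, their Kronecker product is PSD, hence Q is negative
# semidefinite and the QP is concave.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Xp = rng.standard_normal((4, 2))
Q = -np.kron(X @ X.T, Xp @ Xp.T)   # 20 x 20 Hessian (up to a constant)
eigs = np.linalg.eigvalsh(Q)
print(eigs.max() <= 1e-10)          # True: all eigenvalues are <= 0
```

The same check applies to the squared Euclidean case, whose Hessian involves the same Kronecker product with a different positive constant.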
Proposition 5.5.2.
Let C ∈ R^{n×n}, C′ ∈ R^{n′×n′} be any symmetric matrices; then:
COOT(C, C′, w, w′, w, w′) ≤ GW(C, C′, w, w′).
Algorithm 10
DC algorithm for solving COOT and GW with squared Euclidean matrices or inner product similarities
π(0) ← ww′^T, k ← 0
while k < maxIt and err > 0 do
  π(k) ← argmin_{π ∈ Π(w, w′)} ⟨L(C, C′) ⊗ π(k−1), π⟩_F  // linear OT problem
  err ← ‖π(k−1) − π(k)‖_F
  k ← k + 1
end while
return π(k) for GW and (π(k), π(k)) for COOT

The converse is also true under the hypotheses of Proposition 5.5.1. In this case, if (π^{s*}, π^{v*}) is an optimal solution of (COOT), then both π^{s*} and π^{v*} are solutions of (5.2). Conversely, if π^{s*} is an optimal solution of (5.2), then (π^{s*}, π^{s*}) is an optimal solution for (COOT).

Equivalence of algorithms
Under the hypotheses of Proposition 5.5.1, we know that there exists an optimal solution of the COOT problem of the form (π*, π*), where π* is an optimal solution of the GW problem. This gives a conceptually very simple fixed-point procedure, where one iterates over only one coupling in order to compute an optimal solution of GW, as described in Algorithm 10. Interestingly enough, the iterations of the fixed-point method are exactly equivalent to the Frank-Wolfe procedure described in Chapter 3, since, in the concave setting, the line-search step can be fixed to 1 [Maron 2018] (see Section 6.3.6 for more details). Also note that the steps of Algorithm 10 are iterations of the Difference of Convex Algorithm (DCA) [Tao 2005, Yuille 2003], where the concave function is approximated at each iteration by its linear majorization. When applying the same procedure to entropic regularized COOT, the resulting DCA recovers exactly the projected gradient iterations proposed in [Peyré 2016] for solving the entropic regularized version of GW.

We would like to stress that COOT is much more than a generalization of GW, for multiple reasons. First, it can be used on raw data without requiring to choose or compute the similarity matrices, which can be costly, for instance when dealing with shortest-path distances in graphs, and to store them (O(n² + n′²) overhead). Second, it can take into account additional information given by the feature weights v, v′ and provides an interpretable mapping between them across two heterogeneous datasets. Finally, contrary to GW, COOT is invariant neither to feature rotations nor to changes of sign, leading to a more informative samples' coupling compared to GW in some applications. One such example is given in the previous MNIST-USPS transfer task (Figure 5.1), for which the coupling matrix obtained via GW (given in Figure 5.4) exhibits important flaws in respecting class memberships when aligning samples.
In [Alvarez-Melis 2019], the authors consider a scenario where the OT problem is used to align measures supported on sets of points for which meaningful pairwise distances are hard or impossible to compute. This may happen, for instance, when some latent transformation has been applied to the features. The main underlying idea of their approach is to find an assignment of the points and to compute a transformation that matches the features. More precisely, for two datasets X_1, X_2 sharing the same feature space R^d,
Figure 5.4: Comparison between the coupling matrices obtained via GW and COOT on MNIST-USPS (MNIST samples as rows, USPS samples as columns; left: coupling matrix for GW, right: coupling matrix for COOT).

the corresponding objective function is:
InvOT(X_1, X_2) = min_{π ∈ Π(w_1, w_2)} min_{P ∈ F_d} ⟨M_P, π⟩_F   (5.5)

where M_P(i, j) = ‖x_i − P x_j‖² and F_d is the space of matrices F_d = {P ∈ R^{d×d} : ‖P‖_F = √d}. As noted by the authors in their Lemma 4.3, equation (5.5) can be related to the GW problem when C_1, C_2 are computed using inner-product similarities and when X_2 is w_2-whitened, i.e. X_2^T diag(w_2) X_2 = I_d. In this case, the authors show that GW and InvOT are equivalent, namely a solution of GW is a solution of InvOT and conversely. Since the GW problem with cosine similarities is actually concave, we have proven that COOT and GW are also equivalent in this case, which proves the following proposition:
Proposition 5.5.3.
Using previous notations, let L = |·|², d_1 = d_2 = d, and consider inner-product similarities C_1 = X_1 X_1^T, C_2 = X_2 X_2^T. Suppose that X_2 is w_2-whitened, i.e. X_2^T diag(w_2) X_2 = I_d. Then InvOT(X_1, X_2), COOT(C_1, C_2) and GW(C_1, C_2) are equivalent, namely any optimal coupling of one of these problems is a solution to the others.

Another way of proving this result is to consider Theorem 4.2.1 of Chapter 4, where we proved that the Gromov-Wasserstein distance, when considering inner-product similarities, is equivalent to the problem:

max_{π ∈ Π(w_1, w_2)} max_{P ∈ F_d} Σ_{i,j} ⟨x_i, P x_j⟩ π_{ij}   (MaxOT)

When X_2 is w_2-whitened, one can check easily, by developing the terms in M_P(i, j) = ‖x_i − P x_j‖², that InvOT is equivalent to (MaxOT) (see Chapter 4). InvOT was further used as a building block for aligning clustered datasets in [Lee 2019], where the authors applied it as a divergence measure between the clusters, thus leading to an approach different from ours. Finally, in [Yurochkin 2019] the authors proposed a hierarchical OT distance, defined as an OT problem whose costs are precomputed Wasserstein distances, but with no global features' mapping, contrary to COOT, which optimises the couplings of the features and of the samples simultaneously.
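The w_2-whitening hypothesis of Proposition 5.5.3 is easy to enforce numerically. The helper below is our own illustration (the name `whiten` does not come from the text): it multiplies X by the inverse square root of its weighted Gram matrix, assuming that matrix is invertible.

```python
import numpy as np

def whiten(X, w):
    """Return Xw such that Xw.T @ diag(w) @ Xw = I_d (w-whitening).
    Assumes S = X.T @ diag(w) @ X is symmetric positive definite."""
    S = X.T @ (w[:, None] * X)                     # X^T diag(w) X
    vals, vecs = np.linalg.eigh(S)                 # S = V diag(vals) V^T
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return X @ S_inv_sqrt

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.standard_normal((n, d))
w = np.full(n, 1.0 / n)                            # uniform sample weights
Xw = whiten(X, w)
```

After this transformation, Xw satisfies the whitening identity exactly (up to numerical precision), so the equivalence of the proposition applies to the similarities built from Xw.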
The election isomorphism problem mentioned earlier has recently been introduced in [Faliszewski 2019] to compare two elections given by the preference orders of voters over candidates. The authors express their
problem quite similarly to COOT and also seek correspondences between voters and candidates across two elections where the preferences of each voter are known (which is unrealistic in modern democracies). They focus on the setting where both elections have exactly the same numbers of voters and candidates, and search for an optimal permutation via a Linear Integer Program. It is interesting to see that their problem of aligning voters using the Spearman distance is actually equivalent to solving COOT with L = |·| on the voters' preferences. To that extent, COOT is a more general approach, as it is applicable with general loss functions L, contrary to the Spearman distance used in [Faliszewski 2019], and it generalizes to the cases where the numbers of voters and of candidates differ across the two elections. A more detailed comparison is given in Section 6.3.7.

In the next section, we highlight two possible applications of COOT in a machine learning context: HDA and co-clustering. We consider these two particular tasks because 1) OT-based methods are a strong baseline in DA; 2) COOT is a natural match for co-clustering, as it allows for soft assignments of both data samples and features to co-clusters.
In a classification context, the problem of domain adaptation (DA) arises whenever one has to perform classification on a set of data X_t = {x_i^t}_{i=1}^{N_t} (usually called the target domain) with only few or no associated labels. Given a source domain X_s = {x_i^s}_{i=1}^{N_s} with associated labels Y_s = {y_i^s}_{i=1}^{N_s}, one would like to leverage this knowledge to train a classifier in the target domain. Unfortunately, the direct use of the source information usually leads to poor results because of the discrepancy between the source and target distributions. Several works, e.g. [Courty 2017], use OT to perform this adaptation. However, when the data do not belong to the same metric space (X_s ∈ R^{N_s × d_s} and X_t ∈ R^{N_t × d_t} with d_s ≠ d_t), the problem becomes harder, as the domain probability distributions can no longer be compared or aligned in a straightforward way. This instance of the DA problem, known as Heterogeneous Domain Adaptation (HDA), has received less attention in the literature, partly due to the lack of appropriate divergence measures that can be used in such a context. State-of-the-art HDA methods include Canonical Correlation Analysis [Yeh 2014], its kernelized version, and a more recent approach based on the Gromov-Wasserstein discrepancy [Yan 2018]. We investigate here the use of COOT both for semi-supervised HDA, where one has access to a small number n_t of labelled samples per class in the target domain, and for unsupervised HDA, with n_t = 0; the latter setting is rarely considered, as it is a much more difficult problem.

Solving HDA with COOT
In order to solve the HDA problem, we compute COOT(X_s, X_t) between the two domains and use the π^s matrix, which provides a transport/correspondence between samples (as illustrated in Figure 5.1), to estimate the labels in the target domain via label propagation [Redko 2019]. Assuming uniform sample weights and one-hot encoded labels, a class prediction ˆY_t for the target samples can be obtained by computing ˆY_t = π^{s⊤} Y_s. When labelled target samples are available, we further prevent source samples from being mapped to target samples of a different class by adding a high cost in the cost matrix for every such source sample, as suggested in [Courty 2017, Section 4.2].
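In matrix form, this label-propagation step is a single matrix product followed by a row-wise argmax. The sketch below uses toy values and our own orientation convention (π^s stored as an N_s × N_t array):

```python
import numpy as np

def propagate_labels(pi_s, Ys_onehot):
    """Target class scores obtained by pushing the one-hot source labels
    through the sample coupling; predictions are the row-wise argmax."""
    scores = pi_s.T @ Ys_onehot        # (N_t, K): mass received per class
    return scores.argmax(axis=1)

# toy example: two source samples of different classes, two target samples
pi_s = np.array([[0.4, 0.1],
                 [0.1, 0.4]])          # coupling, N_s = N_t = 2
Ys = np.array([[1, 0],                 # source sample 0 -> class 0
               [0, 1]])               # source sample 1 -> class 1
y_hat = propagate_labels(pi_s, Ys)     # -> array([0, 1])
```

Here most of the mass of each target sample comes from the source sample of the matching class, so the propagated labels follow the diagonal of the coupling.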
Table 5.1: Semi-supervised HDA for n_t = 3 from Decaf to GoogleNet task. Rows: the nine A/C/W domain pairs plus their mean; columns: No-adaptation baseline, CCA, KCCA, EGW, SGW, COOT (mean accuracy ± standard deviation). The last row reports, for each competing method, a p-value < .001 in a pairwise comparison with COOT.

Datasets
We choose to test our method on the classical Caltech-Office dataset [Saenko 2010], which is dedicated to object recognition in images from several domains. These domains exhibit variability in terms of presence/absence of background, lighting conditions and image quality, which induces distribution shifts between them. Among the available domains, we select the following three: Amazon (A), the Caltech-256 image collection (C) and Webcam (W). Ten classes shared by the domains are used, and two different deep feature representations of the images in each domain are obtained using two different neural networks, namely the Decaf [Donahue 2014] and GoogleNet [Szegedy 2015] architectures. In both cases, we extract the image representations as the activations of the last fully-connected layer, yielding sparse 4096- and 1024-dimensional vectors, respectively. The heterogeneity comes from these two very different representations.
Competing methods and experimental settings
We consider four baselines: CCA; its kernelized version KCCA [Yeh 2014], with a Gaussian kernel whose width parameter is set to the inverse of the dimension of the input vectors; EGW, the entropic version of GW; and SGW [Yan 2018], which incorporates labelled target data into two regularization terms. For EGW and SGW, the entropic regularization term was set to 0.1, and the two other regularization hyperparameters for the semi-supervised case were set as in [Yan 2018]. We use COOT with entropic regularization on the feature mapping, with parameter ε = 1 in all experiments. For all OT methods, we use label propagation to obtain target labels as the maximum entry of ˆY_t in each row. For all non-OT methods, classification was conducted with a k-NN classifier with k = 3. We run the experiment in a semi-supervised setting with n_t = 3, i.e., 3 samples per class were labelled in the target domain. The baseline score is the result of classification obtained by considering only the labelled samples in the target domain as the training set. For each pair of domains, we selected 20 samples per class to form the learning sets. We run this random selection process 10 times and consider the mean accuracy over the runs as the performance measure.
Table 5.2: Unsupervised HDA for n_t = 0 from Decaf to GoogleNet task. Rows: the nine A/C/W domain pairs plus their mean; columns: CCA, KCCA, EGW, COOT (mean accuracy ± standard deviation). The last row reports, for each competing method, a p-value < .001 in a pairwise comparison with COOT.

Results
We first provide in Table 5.1 the results for the semi-supervised case, where we perform adaptation from Decaf to GoogleNet features (results in the opposite direction, for several values of n_t, are reported in Section 6.3.8). We see that COOT surpasses all the other state-of-the-art methods in terms of mean accuracy. This result is confirmed by a p-value lower than 0.001 in a pairwise comparison of each method with COOT using a Wilcoxon signed-rank test. SGW provides the second best result, while CCA and EGW have a below-average performance. Finally, KCCA performs better than the latter two methods, but still fails most of the time to surpass the no-adaptation baseline score given by a classifier learned on the available labelled target data. Results for the unsupervised case can be found in Table 5.2. This setting is rarely considered in the literature, as unsupervised HDA is regarded as a very difficult problem. In this table, we do not provide scores for the no-adaptation baseline and SGW, as they require labelled data.

As one can expect, most of the methods fail to obtain good classification accuracies in this setting, despite having access to discriminative feature representations. Yet, COOT succeeds in providing a meaningful mapping in some cases. The overall superior performance of COOT highlights its strengths and underlines the limits of other HDA methods. First, COOT does not depend on approximating empirical quantities from the data, contrary to CCA and KCCA, which rely on the estimation of the cross-covariance matrix, known to be flawed for high-dimensional data with few samples [Song 2016]. Second, COOT takes into account the features of the raw data, which are more informative than the pairwise distances used in EGW. Finally, COOT avoids the sign invariance issue discussed previously, which hinders GW's capability to recover classes without supervision, as illustrated on the MNIST-USPS problem before.
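The pairwise significance test used here is the standard paired Wilcoxon signed-rank test over the per-task scores, available in SciPy as a single call. The accuracy vectors below are illustrative placeholders of ours, not the thesis results:

```python
import numpy as np
from scipy.stats import wilcoxon

# hypothetical per-domain-pair accuracies for COOT and one baseline
acc_coot     = np.array([91.2, 93.5, 95.1, 94.0, 92.2, 90.8, 93.3, 89.5, 88.7])
acc_baseline = np.array([78.4, 92.1, 93.6, 93.2, 80.3, 87.5, 82.9, 77.1, 75.4])

# paired two-sided signed-rank test over the 9 domain pairs
stat, p = wilcoxon(acc_coot, acc_baseline)
```

With all nine paired differences positive, the exact two-sided p-value is well below the usual significance thresholds, which is the situation reported in the tables.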
While clustering methods are an important discovery tool for data analysis, one of their main limitations is to completely discard the potential relationships that may exist between the features describing the data samples. For instance, in recommendation systems, where each user is described in terms of his or her preferences for some products, clustering algorithms may benefit from knowledge about the correlation between different products, revealing their probability of being recommended to the same users.
Table 5.3: Size (n × d), number of co-clusters (g × m), degree of overlapping ([+] for well-separated and [++] for ill-separated co-clusters) and the proportions of co-clusters for the simulated data sets D1–D4.

This idea is the cornerstone of co-clustering [Hartigan 1972], where the goal is to perform clustering of both samples and features simultaneously. More precisely, given a data matrix X ∈ R^{n×d} and numbers of sample (row) and feature (column) clusters denoted by g ≤ n and m ≤ d, respectively, we seek to find X_c ∈ R^{g×m} that summarizes X in the best way possible.

COOT-clustering
We look for an X_c which is as close as possible to the original X w.r.t. COOT, by solving:

min_{X_c} COOT(X, X_c) = min_{π^s, π^v, X_c} ⟨L(X, X_c) ⊗ π^s, π^v⟩_F   (5.6)

with potentially an entropic regularization. More precisely, we set w_1, w_2, v_1, v_2 as uniform, initialize X_c with random values and apply the BCD algorithm over (π^s, π^v, X_c) by alternating between the following steps:
1. Obtain π^s and π^v by solving COOT(X, X_c);
2. Set X_c to gm π^{s⊤} X π^v.
This second step is a least-squares estimation when L = |·|² and corresponds to minimizing the COOT objective w.r.t. X_c. In practice, we observed that a few iterations of this procedure are enough to ensure convergence. Once solved, we use the soft assignments provided by the coupling matrices π^s ∈ R^{n×g}, π^v ∈ R^{d×m} to assign data points and features to clusters, by taking the index of the maximum element in each row of π^s and π^v, respectively.
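The BCD procedure above can be sketched end-to-end in NumPy. This is a didactic sketch under simplifying choices of ours (uniform weights, L = |·|², and a small hand-written Sinkhorn solver for the two inner OT problems, in the spirit of the entropic variant mentioned above); for L = |·|² the sample cost expands as M^s[i,k] = Σ_j X[i,j]² v_1[j] + Σ_l X_c[k,l]² v_2[l] − 2 (X π^v X_c^T)[i,k], and symmetrically for the features.

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.05, n_iter=500):
    """Entropic OT plan between histograms a, b for cost M (Sinkhorn)."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def coot_clustering(X, g, m, n_iter=15, seed=0):
    """BCD over (pi_s, pi_v, Xc) for the COOT-clustering problem (5.6)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w1, w2 = np.full(n, 1/n), np.full(g, 1/g)   # uniform sample weights
    v1, v2 = np.full(d, 1/d), np.full(m, 1/m)   # uniform feature weights
    Xc = rng.random((g, m))
    pi_v = np.outer(v1, v2)
    for _ in range(n_iter):
        # step 1a: samples coupling for the expanded squared-loss cost
        Ms = ((X**2) @ v1)[:, None] + ((Xc**2) @ v2)[None, :] - 2 * X @ pi_v @ Xc.T
        pi_s = sinkhorn(w1, w2, Ms)
        # step 1b: features coupling, symmetric expression
        Mv = ((X**2).T @ w1)[:, None] + ((Xc**2).T @ w2)[None, :] - 2 * X.T @ pi_s @ Xc
        pi_v = sinkhorn(v1, v2, Mv)
        # step 2: least-squares update Xc <- g m pi_s^T X pi_v
        Xc = g * m * pi_s.T @ X @ pi_v
    return pi_s, pi_v, Xc
```

Row/column argmax of `pi_s` and `pi_v` then gives the sample and feature partitions, as described above.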
We follow [Laclau 2017], where four scenarios with different numbers of co-clusters, degrees of separation and sizes were considered (for details, see the supplementary materials). We choose to evaluate COOT on simulated data as it provides us with the ground truth for the feature clusters, which is often unavailable for real-world data sets. As in [Laclau 2017], we use the same co-clustering baselines, including ITCC [Dhillon 2003], Double K-Means (DKM) [Rocci 2008], Orthogonal Nonnegative Matrix Tri-Factorizations (ONTMF) [Ding 2006], the Gaussian Latent Block Models (GLBM) [Nadif 2008] and Residual Bayesian Co-Clustering (RBC) [Shan 2010], as well as K-means and NMF run on both modes of the data matrix as clustering baselines. The performance of all methods is measured using the co-clustering error (CCE), defined as follows [Patrikainen 2006]:

CCE((z, w), (ẑ, ŵ)) = e(z, ẑ) + e(w, ŵ) − e(z, ẑ) × e(w, ŵ)   (5.7)

where ẑ and ŵ are the partitions of samples and features estimated by the algorithm; z and w are the true partitions, and e(z, ẑ) (resp. e(w, ŵ)) denotes the error rate, i.e. the proportion of misclassified instances (resp. features).

For all configurations, we generate 100 data sets and present the mean and standard deviation of the CCE over all sets for all baselines in Table 5.4. Table 5.3 summarizes the characteristics of the simulated data sets used in our experiment.
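The CCE of equation (5.7) is easy to compute once each estimated cluster index has been matched to a ground-truth one; a common choice (ours here, the matching rule is not specified in the text) is the optimal matching given by the Hungarian algorithm on the contingency table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def error_rate(y_true, y_pred):
    """Proportion of misclassified items after optimally matching predicted
    cluster indices to true ones (Hungarian algorithm)."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    C = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1                              # contingency table
    rows, cols = linear_sum_assignment(-C)        # maximize matched counts
    return 1.0 - C[rows, cols].sum() / len(y_true)

def cce(z, z_hat, w, w_hat):
    """Co-clustering error of Eq. (5.7)."""
    ez, ew = error_rate(z, z_hat), error_rate(w, w_hat)
    return ez + ew - ez * ew

z, z_hat = np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])  # rows, relabeled
w, w_hat = np.array([0, 1, 1]),    np.array([0, 1, 1])     # columns, exact
# cce(z, z_hat, w, w_hat) -> 0.0 (a pure relabeling of clusters costs nothing)
```

Note that the CCE is zero only when both partitions are recovered exactly up to relabeling, and that an error on a single mode already contributes to the total.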
Table 5.4: Mean (± standard deviation) of the co-clustering error (CCE) obtained for all configurations D1–D4 (columns: K-means, NMF, DKM, Tri-NMF, GLBM, ITCC, RBC, CCOT, CCOT-GW, COOT). "–" indicates that the algorithm cannot find a partition with the requested number of co-clusters. All the baseline results (first 9 columns) are from [Laclau 2017].
Figure 5.5: Co-clustering with COOT on the Olivetti faces dataset. (left) Example images from the dataset, (center) centroids estimated by COOT, (right) clustering of the pixels estimated by COOT, where each color represents a cluster.
Based on these results, we see that our algorithm outperforms all the other baselines on the D1, D2 and D4 data sets, while being behind CCOT-GW, proposed by [Laclau 2017], on D3. This result is rather strong, as our method relies on the original data matrix, while CCOT-GW relies on its kernel representation and thus benefits from the non-linear information captured by it. Finally, we note that while both competing methods rely on OT, they remain very different: the CCOT-GW approach is based on detecting the positions and the number of jumps in the scaling vectors of the entropic regularized GW solution, while our method relies on the coupling matrices to obtain the partitions.
Olivetti Face dataset
As a first application of COOT to the co-clustering problem on real data, we propose to run the algorithm on the well-known Olivetti faces dataset [Samaria 1994]. We take 400 images normalized between 0 and 1 and run our algorithm with g = 9 image clusters and m = 40 feature (pixel) clusters. As before, we consider the empirical distributions supported on images and features, respectively. The resulting reconstructed image clusters are given in Figure 5.5, and the pixel clusters are illustrated in its rightmost part. We can see that, despite the high variability in the data set, we still manage to recover detailed centroids, whereas ℓ2-based clustering methods such as standard NMF or k-means, based on an ℓ2-norm cost function, are known to provide blurry estimates in this case. Finally, as in the MNIST-USPS example, COOT recovers spatially localized pixel clusters with no prior information about the pixel relations.

MovieLens
We now evaluate our approach on the benchmark MovieLens-100K data set (https://grouplens.org/datasets/movielens/100k/), which provides 100,000 user-movie ratings, on a scale of one to five, collected from 943 users on 1682 movies. The main goal of our algorithm here is to summarize the initial data matrix so that X_c reveals the blocks (co-clusters) of movies and users that share similar tastes. We set the numbers of user and film clusters to g = 10 and m = 20, respectively, as in [Banerjee 2007]. The obtained results exhibit a first movie cluster consisting of films with high ratings (3.92 on average), while the last movie cluster includes movies with very low ratings (1.92 on average). Among those, we show the 5 best/worst rated movies of these two clusters in Table 5.5. Overall, our algorithm manages to find a coherent co-clustering structure in MovieLens-100K and obtains results similar to those provided in [Laclau 2017, Banerjee 2007].

M1                            M20
Shawshank Redemption (1994)   Police Story 4: Project S (Chao ji ji hua) (1993)
Schindler's List (1993)       Eye of Vichy, The (Oeil de Vichy, L') (1993)
Casablanca (1942)             Promise, The (Versprechen, Das) (1994)
Rear Window (1954)            To Cross the Rubicon (1991)
Usual Suspects, The (1995)    Daens (1992)

Table 5.5: Top 5 movies in clusters M1 and M20. The average rating of the top 5 rated movies in M1 is 4.42, while for M20 it is 1.
Conclusion

In this chapter, we presented a novel variant of the optimal transport problem which aims at comparing distributions supported on different spaces. To this end, two optimal transport maps, one acting on the sample space and the other on the feature space, are optimized to connect two heterogeneous distributions. We showed that this novel problem has connections with the bilinear assignment problem and provided algorithms to solve it. We demonstrated its usefulness and versatility on two difficult machine learning problems, heterogeneous domain adaptation and co-clustering/data summarization, where promising results were obtained.

Numerous follow-ups of this work are expected. Beyond the potential applications of the method in various contexts, such as statistical matching, data analysis or even losses in deep learning settings, one immediate and intriguing question lies in the generalization of this framework to the continuous setting, and in the potential connections with duality theory. This might lead to stochastic optimization schemes enabling large-scale solvers for this problem.

Chapter 6. Proofs of claims and additional results
Contents
GM and GW in the discrete case
6.2.3 Computing GW in the 1d case
6.2.4 Proof of Theorem 4.1.3 – Properties of SGW
6.2.5 Additional results – SW∆ and RISW
6.2.6 Proof of Lemma 4.2.1 – Reductions of the GW costs for inner products and squared Euclidean distance matrices
6.2.7 Proof of the existence and finiteness of (innerGW) and (MaxOT)
6.2.8 Proof of Lemma 4.2.3 – Reduction of the GW cost
6.2.9 Proof of Lemma 4.2.4
6.2.10 Characterization of H(μ, ν) for linear push-forward
6.2.11 Proof of Theorem 4.2.6 – Closed form expression of the linear Gromov-Monge problem between Gaussian measures
This section contains all the proofs of the claims and additional results of Chapter 3. In the following we will denote by H_q the Wasserstein loss and by J_q the Gromov-Wasserstein loss. More precisely, with the notation of Chapter 3:

H_q(π) = ∫ d(a, b)^q dπ(a, b)   (6.1)

J_q(d_X, d_Y, π) = ∫∫ L(x, y, x', y')^q dπ(x, y) dπ(x', y') = ∫∫ |d_X(x, x') − d_Y(y, y')|^q dπ(x, y) dπ(x', y')   (6.2)

E_{p,q,α}(π) = ∫∫ ((1 − α) d(a, b)^q + α |d_X(x, x') − d_Y(y, y')|^q)^p dπ((x, a), (y, b)) dπ((x', a'), (y', b'))   (6.3)

We note P_i#π the projection of π on its i-th marginal.

Cross validation results
During the nested cross-validation, we divided each dataset into 10 folds and used 9 folds for training, where α is chosen within [0, 1] via a 10-fold cross-validation, and 1 fold for testing, with the best value of α (the one with the best average accuracy over the 10 folds) previously selected. The experiment is repeated 10 times for each dataset, except for MUTAG and PTC where it is repeated 50 times. Tables 6.1 and 6.2 report the average number of times α was chosen within ]0, 1[, excluding 0 and 1, which correspond to the Wasserstein and Gromov-Wasserstein distances respectively. Results suggest that both the structure and the feature pieces of information are necessary, as α is consistently selected inside ]0, 1[, except for PTC and COX2.
Table 6.1: Percentage of α chosen in ]0, 1[ compared to {0, 1} for discrete labeled graphs.

Discrete attr.   MUTAG   NCI1   PTC
FGW raw sp       100%    100%   98%
FGW wl h=2 sp    100%    100%   88%
FGW wl h=4 sp    100%    100%   88%

Table 6.2: Percentage of α chosen in ]0, 1[ compared to {0, 1} for vector attributed graphs.

Vector attributes   BZR    COX2   CUNEIFORM   ENZYMES   PROTEIN   SYNTHETIC
FGW sp              100%   90%    100%        100%      100%      100%
Nested CV results
We report in Tables 6.3 and 6.4 the average classification accuracies of the nested classification procedure when taking W and GW instead of FGW (i.e. when taking α = 0 or α = 1), compared to the second best score. The significance is based on a Wilcoxon signed-rank test between the best method and the second one. The results illustrate that FGW encompasses the two cases of W and GW, as the scores of FGW are usually greater than or equal to the scores of both W and GW on every dataset, and, when this is not the case, the difference is not statistically significant.

Table 6.3: Average classification accuracy on the graph datasets with discrete attributes (columns: MUTAG, NCI1, PTC-MR; rows: the FGW variants together with W and GW).

Table 6.4: Average classification accuracy on the graph datasets with vector attributes (columns: BZR, COX2, CUNEIFORM, ENZYMES, PROTEIN, SYNTHETIC; rows: FGW sp, W, GW sp).

Timings
In this paragraph we provide some timings for the discrete attributed datasets. Table 6.5 displays the average time for computing FGW between a pair of graphs.

Table 6.5: Average timings for the computation of FGW between a pair of graphs.

Discrete attr.   MUTAG    NCI1     PTC-MR
FGW              2.5 ms   7.3 ms   3.7 ms
We recall the proposition:
Proposition (Comparison between FGW, GW and W).
• The following inequalities hold:

FGW_{α,p,q}(μ, ν) ≥ (1 − α) W_{pq}(μ_A, ν_B)^q   (6.4)
FGW_{α,p,q}(μ, ν) ≥ α GW_{pq}(μ_X, ν_Y)^q   (6.5)

• Let us suppose that the structure spaces (X, d_X), (Y, d_Y) are part of a single ground space (Z, d_Z) (i.e. X, Y ⊂ Z and d_X = d_Y = d_Z). We consider the Wasserstein distance between μ and ν for the following distance on Z × Ω:

d̃((x, a), (y, b)) = (1 − α) d(a, b) + α d_Z(x, y).

Then:

FGW_{α,p,1}(μ, ν) ≤ 2 W_p(μ, ν).   (6.6)

Proof.
For the two inequalities (6.4) and (6.5), let π be an optimal coupling for the Fused Gromov-Wasserstein distance between μ and ν. Clearly:

FGW_{α,p,q}(μ, ν) = ( ∫_{(X×Ω×Y×Ω)²} ((1 − α) d(a, b)^q + α L(x, y, x', y')^q)^p dπ((x, a), (y, b)) dπ((x', a'), (y', b')) )^{1/p}
≥ ( ∫_{X×Ω×Y×Ω} (1 − α)^p d(a, b)^{pq} dπ((x, a), (y, b)) )^{1/p}
= (1 − α) ( ∫_{Ω×Ω} d(a, b)^{pq} d(P_{2,4}#π)(a, b) )^{1/p}.

Since π ∈ Π(μ, ν), the coupling P_{2,4}#π is in Π(μ_A, ν_B). So by suboptimality:

FGW_{α,p,q}(μ, ν) ≥ (1 − α) (W_{pq}(μ_A, ν_B))^q,

which proves equation (6.4). The same reasoning is used for equation (6.5).

For the last inequality (6.6), let π ∈ Π(μ, ν) be any admissible coupling. By suboptimality:

FGW_{α,p,1}(μ, ν) ≤ ( ∫_{(X×Ω×Y×Ω)²} ((1 − α) d(a, b) + α |d_Z(x, x') − d_Z(y, y')|)^p dπ((x, a), (y, b)) dπ((x', a'), (y', b')) )^{1/p}
(∗) ≤ ( ∫_{(X×Ω×Y×Ω)²} ((1 − α) d(a, b) + α d_Z(x, y) + α d_Z(x', y'))^p dπ dπ )^{1/p}
≤ ( ∫_{(X×Ω×Y×Ω)²} ((1 − α) d(a, b) + α d_Z(x, y) + (1 − α) d(a', b') + α d_Z(x', y'))^p dπ dπ )^{1/p}
(∗∗) ≤ 2 ( ∫_{X×Ω×Y×Ω} ((1 − α) d(a, b) + α d_Z(x, y))^p dπ((x, a), (y, b)) )^{1/p},

where (∗) is the triangle inequality of d_Z and (∗∗) is the Minkowski inequality. Since this inequality is true for any admissible coupling π, we can apply it with the optimal coupling for the Wasserstein distance defined in the proposition, and the claim follows. We recall the theorem:
Theorem (Metric properties). Let p, q ≥ 1, α ∈ ]0, 1[ and (μ, ν) ∈ S_{pq}(Ω) × S_{pq}(Ω). The functional π → E_{p,q,α}(π) always achieves an infimum π* in Π(μ, ν), s.t. FGW_{α,p,q}(μ, ν) = E_{p,q,α}(π*) < +∞. Moreover:
• FGW_{α,p,q} is symmetric and, for q = 1, satisfies the triangle inequality. For q ≥ 1, the triangle inequality is relaxed by a factor 2^{q−1}.
• For α ∈ ]0, 1[, FGW_{α,p,q}(μ, ν) = 0 if and only if there exists a bijective function φ = (φ_1, φ_2) : supp(μ) → supp(ν) such that:

φ#μ = ν
∀(x, a) ∈ supp(μ), φ_2(x, a) = a
∀(x, a), (x', a') ∈ supp(μ), d_X(x, x') = d_Y(φ_1(x, a), φ_1(x', a'))

• If (μ, ν) are generalized labeled graphs, then FGW_{α,p,q}(μ, ν) = 0 if and only if (X × Ω, d_X, μ) and (Y × Ω, d_Y, ν) are (II)-strongly isomorphic.

We propose to prove the theorem point by point: first the existence, then the triangle inequality statement and finally the equality relation.
Proposition 6.1.1 (Existence of an optimal coupling for the FGW distance). For p, q ≥ 1, π → E_{p,q,α}(π) always achieves an infimum π* in Π(μ, ν) such that FGW_{α,p,q}(μ, ν) = E_{p,q,α}(π*) < +∞.

Proof. Since X × Ω and Y × Ω are Polish spaces, we know that Π(μ, ν) ⊂ P(X × Ω × Y × Ω) is compact (Theorem 1.7 in [Santambrogio 2015]), so by applying the Weierstrass theorem we can conclude that the infimum is attained at some π* ∈ Π(μ, ν) if π → E_{p,q,α}(π) is l.s.c.

We use Lemma 2.2.1 to prove that the functional is l.s.c. on Π(μ, ν). If we consider W = X × Ω × Y × Ω, which is a metric space endowed with the distance d_X ⊗ d ⊗ d_Y ⊗ d, and f(w = (x, a, y, b), w' = (x', a', y', b')) = ((1 − α) d(a, b)^q + α L(x, y, x', y')^q)^p, then f is l.s.c. by continuity of d, d_X and d_Y. With the previous reasoning we can conclude that the infimum is attained.

Finally, finiteness comes from:

∫ ((1 − α) d(a, b)^q + α L(x, y, x', y')^q)^p dπ((x, a), (y, b)) dπ((x', a'), (y', b'))
(∗) ≤ ∫ 2^{p−1} (1 − α)^p d(a, b)^{qp} dμ_A(a) dν_B(b) + ∫ 2^{p−1} α^p L(x, y, x', y')^{qp} dμ_X(x) dμ_X(x') dν_Y(y) dν_Y(y')
(∗∗) < +∞   (6.7)

where in (∗) we used equation (2.44) in Memo 2.2.2 and in (∗∗) that μ, ν are in S_{pq}(Ω).

Proposition 6.1.2 (Symmetry and triangle inequality). FGW_{α,p,q} is symmetric and, for q = 1, satisfies the triangle inequality. For q ≥ 1, the triangle inequality is relaxed by a factor 2^{q−1}.

To prove this result we will use the following lemma:
Lemma 6.1.1. Let (X × Ω, d_X, μ), (Y × Ω, d_Y, β), (Z × Ω, d_Z, ν) ∈ S(Ω). For (x, a), (x', a') ∈ X × Ω, (y, b), (y', b') ∈ Y × Ω and (z, c), (z', c') ∈ Z × Ω we have:

L(x, z, x', z')^q ≤ 2^{q−1} (L(x, y, x', y')^q + L(y, z, y', z')^q)   (6.8)
d(a, c)^q ≤ 2^{q−1} (d(a, b)^q + d(b, c)^q)   (6.9)

Proof.
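For completeness, the elementary inequality of equation (2.44) in Memo 2.2.2 invoked here follows from the convexity of t ↦ t^q on R_+ (a derivation we add for the reader):

```latex
\left(\frac{s+t}{2}\right)^{q} \le \frac{s^{q}+t^{q}}{2}
\quad\Longrightarrow\quad
(s+t)^{q} \le 2^{q-1}\left(s^{q}+t^{q}\right),
\qquad s, t \ge 0,\; q \ge 1.
```

Applied with s = L(x, y, x', y') and t = L(y, z, y', z'), together with the triangle inequality |d_X(x, x') − d_Z(z, z')| ≤ |d_X(x, x') − d_Y(y, y')| + |d_Y(y, y') − d_Z(z, z')|, this yields (6.8); the same argument with the metric d gives (6.9).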
Direct consequence of equation (2.44) in Memo 2.2.2 and of the triangle inequalities of d, d_X, d_Y, d_Z.

Proof of Proposition 6.1.2.
To prove the triangle inequality of the FGW_{α,p,q} distance for arbitrary measures, we will use the Gluing lemma, which stresses the existence of couplings with a prescribed structure. Let (X × Ω, d_X, μ), (Y × Ω, d_Y, β), (Z × Ω, d_Z, ν) ∈ S(Ω).

Let π_1 ∈ Π(μ, β) and π_2 ∈ Π(β, ν) be optimal transportation plans for the Fused Gromov-Wasserstein distance between μ, β and β, ν respectively. By the Gluing lemma (see [Villani 2008] and Lemma 5.3.2 in [Ambrosio 2005]) there exists a probability measure π ∈ P((X × Ω) × (Y × Ω) × (Z × Ω)) with marginals π_1 on (X × Ω) × (Y × Ω) and π_2 on (Y × Ω) × (Z × Ω). Let π_3 be the marginal of π on (X × Ω) × (Z × Ω). By construction π_3 ∈ Π(μ, ν). So by suboptimality of π_3:

FGW_{α,p,q}(d_X, d_Z, μ, ν)
≤ ( ∫_{(X×Ω×Z×Ω)²} ((1 − α) d(a, c)^q + α L(x, z, x', z')^q)^p dπ_3((x, a), (z, c)) dπ_3((x', a'), (z', c')) )^{1/p}
= ( ∫ ((1 − α) d(a, c)^q + α L(x, z, x', z')^q)^p dπ((x, a), (y, b), (z, c)) dπ((x', a'), (y', b'), (z', c')) )^{1/p}
(∗) ≤ 2^{q−1} ( ∫ ((1 − α) d(a, b)^q + (1 − α) d(b, c)^q + α L(x, y, x', y')^q + α L(y, z, y', z')^q)^p dπ dπ )^{1/p}
(∗∗) ≤ 2^{q−1} [ ( ∫ ((1 − α) d(a, b)^q + α L(x, y, x', y')^q)^p dπ dπ )^{1/p} + ( ∫ ((1 − α) d(b, c)^q + α L(y, z, y', z')^q)^p dπ dπ )^{1/p} ]
= 2^{q−1} [ ( ∫_{(X×Ω×Y×Ω)²} ((1 − α) d(a, b)^q + α L(x, y, x', y')^q)^p dπ_1 dπ_1 )^{1/p} + ( ∫_{(Y×Ω×Z×Ω)²} ((1 − α) d(b, c)^q + α L(y, z, y', z')^q)^p dπ_2 dπ_2 )^{1/p} ]
= 2^{q−1} ( FGW_{α,p,q}(μ, β) + FGW_{α,p,q}(β, ν) ),

where (∗) comes from (6.8) and (6.9) and (∗∗) is the Minkowski inequality. So when q = 1, FGW_{α,p,q} satisfies the triangle inequality, and when q > 1, FGW_{α,p,q} satisfies a relaxed triangle inequality, so that it defines a semi-metric as described previously.
Proposition 6.1.3 (Equality relation). For $\alpha \in ]0,1[$, $FGW_{\alpha,p,q}(\mu,\nu)=0$ if and only if there exists a bijective function $\varphi=(\varphi_1,\varphi_2): \mathrm{supp}(\mu) \to \mathrm{supp}(\nu)$ such that:
\[ \varphi\#\mu = \nu \tag{6.10} \]
\[ \forall (x,a) \in \mathrm{supp}(\mu),\quad \varphi_2(x,a) = a \tag{6.11} \]
\[ \forall (x,a),(x',a') \in \mathrm{supp}(\mu),\quad d_X(x,x') = d_Y(\varphi_1(x,a),\varphi_1(x',a')) \tag{6.12} \]
Moreover, if $(\mu,\nu)$ are generalized labeled graphs then $FGW_{\alpha,p,q}(\mu,\nu)=0$ if and only if $(X\times\Omega,d_X,\mu)$ and $(Y\times\Omega,d_Y,\nu)$ are (II)-strongly isomorphic.

Proof. For the first point, let us assume that there exists a function $\varphi=(\varphi_1,\varphi_2)$ verifying (6.10), (6.11) and (6.12). We consider the map $\pi = (Id \times \varphi)\#\mu \in \Pi(\mu,\nu)$. Then:
\[
\begin{aligned}
E_{p,q,\alpha}(\pi) &= \int_{((X\times\Omega)\times(Y\times\Omega))^2} \big((1-\alpha)d(a,b)^q + \alpha L(x,y,x',y')^q\big)^p \,d\pi((x,a),(y,b))\,d\pi((x',a'),(y',b')) \\
&= \int_{(X\times\Omega)^2} \big((1-\alpha)d(a,\varphi_2(x,a))^q + \alpha L\big(x,\varphi_1(x,a),x',\varphi_1(x',a')\big)^q\big)^p \,d\mu(x,a)\,d\mu(x',a') \\
&= \int_{(X\times\Omega)^2} \big((1-\alpha)d(a,\varphi_2(x,a))^q + \alpha \big|d_X(x,x') - d_Y(\varphi_1(x,a),\varphi_1(x',a'))\big|^q\big)^p \,d\mu(x,a)\,d\mu(x',a') = 0.
\end{aligned} \tag{6.13}
\]
Conversely, suppose that $FGW_{\alpha,p,q}(\mu,\nu)=0$. To prove the existence of a map $\varphi:\mathrm{supp}(\mu)\to\mathrm{supp}(\nu)$ verifying (6.10), (6.11) and (6.12) we will use the Gromov-Wasserstein properties. We are looking for a vanishing Gromov-Wasserstein distance between the spaces $X\times\Omega$ and $Y\times\Omega$ equipped with our two measures $\mu$ and $\nu$. More precisely, we define for $\big((x,a),(x',a')\big) \in (X\times\Omega)^2$, $\big((y,b),(y',b')\big) \in (Y\times\Omega)^2$ and $\beta \in ]0,1[$:
\[ d_{X\times\Omega}\big((x,a),(x',a')\big) = (1-\beta)d_X(x,x') + \beta d(a,a') \quad\text{and}\quad d_{Y\times\Omega}\big((y,b),(y',b')\big) = (1-\beta)d_Y(y,y') + \beta d(b,b'). \]
We will prove that $d_{GW,p}(d_{X\times\Omega}, d_{Y\times\Omega}, \mu, \nu) = 0$. To show that, we bound the Gromov cost with the metrics $d_{X\times\Omega}, d_{Y\times\Omega}$ by the Gromov cost with the metrics $d_X, d_Y$ plus a Wasserstein cost. Let $\pi \in \Pi(\mu,\nu)$ be any admissible transportation plan. Then for $n \ge 1$:
\[
\begin{aligned}
J_n(d_{X\times\Omega}, d_{Y\times\Omega}, \pi) &\overset{def}{=} \int \big|d_{X\times\Omega}((x,a),(x',a')) - d_{Y\times\Omega}((y,b),(y',b'))\big|^n \,d\pi((x,a),(y,b))\,d\pi((x',a'),(y',b')) \\
&= \int \big|(1-\beta)\big(d_X(x,x') - d_Y(y,y')\big) + \beta\big(d(a,a') - d(b,b')\big)\big|^n \,d\pi\,d\pi \\
&\le (1-\beta)\int \big|d_X(x,x') - d_Y(y,y')\big|^n \,d\pi\,d\pi + \beta\int \big|d(a,a') - d(b,b')\big|^n \,d\pi\,d\pi,
\end{aligned}
\]
using Jensen's inequality with the convexity of $t\mapsto t^n$ and the subadditivity of $|\cdot|$. We note $(*)$ the first term above and $(**)$ the second term. By the triangle inequality property of $d$ we have
\[ (**) \le \beta \int \big(d(a,b) + d(a',b')\big)^n \,d\pi\,d\pi \overset{def}{=} \beta M_n(\pi), \]
such that we have shown:
\[ \forall \pi \in \Pi(\mu,\nu),\ \forall n \ge 1,\quad J_n(d_{X\times\Omega}, d_{Y\times\Omega}, \pi) \le (1-\beta)J_n(d_X,d_Y,\pi) + \beta M_n(\pi). \tag{6.14} \]
Now let $\pi^*$ be an optimal coupling for $FGW_{\alpha,p,q}$ between $\mu$ and $\nu$. By hypothesis $FGW_{\alpha,p,q}(\mu,\nu)=0$, so that:
\[ J_{qp}(d_X,d_Y,\pi^*) = 0 \tag{6.15} \]
and:
\[ H_{qp}(\pi^*) = 0. \tag{6.16} \]
Then $\int_{(X\times\Omega)\times(Y\times\Omega)} d(a,b)^{qp}\,d\pi^*((x,a),(y,b)) = 0$, which implies that $d$ is zero $\pi^*$-a.e., so that $\int d(a,b)^m\,d\pi^*((x,a),(y,b)) = 0$ for any $m \in \mathbb{N}^*$. In this way:
\[
\begin{aligned}
M_{qp}(\pi^*) &= \int \sum_{h=0}^{qp} \binom{qp}{h}\, d(a,b)^h\, d(a',b')^{qp-h} \,d\pi^*((x,a),(y,b))\,d\pi^*((x',a'),(y',b')) \\
&= \sum_{h=0}^{qp} \binom{qp}{h} \Big(\int d(a,b)^h \,d\pi^*((x,a),(y,b))\Big)\Big(\int d(a',b')^{qp-h}\,d\pi^*((x',a'),(y',b'))\Big) = 0.
\end{aligned}
\]
Using equation (6.14) we have shown $J_{qp}(d_{X\times\Omega},d_{Y\times\Omega},\pi^*) = 0$, which implies that $d_{GW,p}(d_{X\times\Omega},d_{Y\times\Omega},\mu,\nu) = 0$, realized by the coupling $\pi^*$. Thanks to the Gromov-Wasserstein properties (see Chapter 2), this states the existence of an isometry between $\mathrm{supp}(\mu)$ and $\mathrm{supp}(\nu)$. So there exists a surjective function $\varphi = (\varphi_1,\varphi_2): \mathrm{supp}(\mu) \to \mathrm{supp}(\nu)$ which verifies P.1 and:
\[ \forall ((x,a),(x',a')) \in (\mathrm{supp}(\mu))^2,\quad d_{X\times\Omega}((x,a),(x',a')) = d_{Y\times\Omega}(\varphi(x,a),\varphi(x',a')), \tag{6.17} \]
or equivalently:
\[ \forall ((x,a),(x',a')) \in (\mathrm{supp}(\mu))^2,\quad (1-\beta)d_X(x,x') + \beta d(a,a') = (1-\beta)d_Y(\varphi_1(x,a),\varphi_1(x',a')) + \beta d(\varphi_2(x,a),\varphi_2(x',a')). \tag{6.18} \]
In particular $\pi^*$ is concentrated on the graph of $\varphi$, or equivalently $\pi^* = (Id\times\varphi)\#\mu$. Injecting $\pi^*$ in (6.16) leads to:
\[ H_{qp}(\pi^*) = \int d(a,b)^{qp}\,d\pi^*((x,a),(y,b)) = \int_{X\times\Omega} d(a,\varphi_2(x,a))^{qp}\,d\mu(x,a) = 0, \tag{6.19} \]
which implies:
\[ \forall (x,a) \in \mathrm{supp}(\mu),\quad \varphi_2(x,a) = a. \tag{6.20} \]
Moreover, combining (6.20) with the equality (6.18), we can conclude that:
\[ \forall (x,a),(x',a') \in \mathrm{supp}(\mu),\quad d_X(x,x') = d_Y(\varphi_1(x,a),\varphi_1(x',a')). \tag{6.21} \]
In this way $\varphi$ verifies all the properties (6.10), (6.11), (6.12).
Moreover, suppose that $\mu$ and $\nu$ are generalized labeled graphs. In this case there exists $\ell_f: X \to \Omega$ surjective such that $\mu = (id \times \ell_f)\#\mu_X$. Then (6.21) implies that:
\[ \forall (x,x') \in \mathrm{supp}(\mu_X)^2,\quad d_X(x,x') = d_Y\big(\varphi_1(x,\ell_f(x)), \varphi_1(x',\ell_f(x'))\big). \tag{6.22} \]
We define $I: \mathrm{supp}(\mu_X) \to \mathrm{supp}(\nu_Y)$ such that $I(x) = \varphi_1(x,\ell_f(x))$. Then we have by (6.22) $d_X(x,x') = d_Y(I(x),I(x'))$ for $(x,x') \in \mathrm{supp}(\mu_X)^2$. Overall we have $\varphi(x,a) = (I(x),a)$ for all $(x,a) \in \mathrm{supp}(\mu)$. Also, since $\varphi\#\mu = \nu$, we have $I\#\mu_X = \nu_Y$. Moreover, $I$ is a surjective function. Indeed, let $y \in \mathrm{supp}(\nu_Y)$ and $b \in \mathrm{supp}(\nu_B)$ such that $(y,b) \in \mathrm{supp}(\nu)$. By surjectivity of $\varphi$ there exists $(x,a) \in \mathrm{supp}(\mu)$ such that $(y,b) = \varphi(x,a) = (I(x),a)$, so that $y = I(x)$. Overall $\varphi$ satisfies all of P.1, P.2 and P.3 if $\mu$ and $\nu$ are generalized labeled graphs. The converse is also true using the reasoning in (6.13). We recall the theorem:
Theorem (Convergence of finite samples and a concentration inequality). Let $p \ge 1$. We have:
\[ \lim_{n\to\infty} FGW_{\alpha,p,1}(\mu_n,\mu) = 0. \]
Moreover, suppose that $s > d^*_p(\mu)$. Then there exists a constant $C$ that does not depend on $n$ such that:
\[ \mathbb{E}\big[FGW_{\alpha,p,1}(\mu_n,\mu)\big] \le C\, n^{-1/s}. \]
The expectation is taken over the i.i.d. samples $(x_i,a_i)$. A particular case of this inequality is when $\alpha = 1$, so that we can use the result above to derive a concentration result for the Gromov-Wasserstein distance. More precisely, if $\nu_n = \frac1n \sum_i \delta_{x_i}$ denotes the empirical measure of $\nu \in \mathcal{P}(X)$ and if $s > d^*_p(\nu)$, we have:
\[ \mathbb{E}\big[GW_p(\nu_n,\nu)\big] \le C\, n^{-1/s}. \]
Proof. The convergence in $FGW$ derives directly from the weak convergence of the empirical measure and Lemma 2.2.1. Moreover, since $\mu_n$ and $\mu$ are both in the same ground space, we have:
\[ FGW_{\alpha,p,1}(\mu_n,\mu) \le W_p(\mu_n,\mu) \implies \mathbb{E}\big[FGW_{\alpha,p,1}(\mu_n,\mu)\big] \le \mathbb{E}\big[W_p(\mu_n,\mu)\big]. \]
We can directly apply Theorem 1 in [Weed 2017] to state the inequality.
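Since the theorem controls $FGW$ by $W_p$ on the same ground space, the decay of the expected error is easy to observe empirically in 1D, where $W_1$ between the empirical measure of samples in $[0,1]$ and the uniform law equals $\int_0^1 |F_n(t)-t|\,dt$. A small illustrative sketch (not part of the thesis), approximating this integral on a grid:

```python
import bisect
import random

def w1_to_uniform(samples, grid=2000):
    """Approximate W_1 between the empirical measure of samples in [0, 1]
    and the uniform law, via the CDF formula W_1 = int_0^1 |F_n(t) - t| dt."""
    xs = sorted(samples)
    n = len(xs)
    return sum(abs(bisect.bisect_right(xs, (k + 0.5) / grid) / n
                   - (k + 0.5) / grid)
               for k in range(grid)) / grid

def avg_w1(n, repeats=20):
    """Monte-Carlo estimate of E[W_1(mu_n, mu)] for mu uniform on [0, 1]."""
    return sum(w1_to_uniform([random.random() for _ in range(n)])
               for _ in range(repeats)) / repeats

random.seed(0)
small_n, large_n = avg_w1(30), avg_w1(1000)
# the expected distance decays as the sample size grows
assert large_n < small_n
```

This only illustrates the qualitative decay with $n$; the precise exponent $n^{-1/s}$ is governed by the upper Wasserstein dimension as in [Weed 2017].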
We recall the proposition:
Proposition (Interpolation properties). As $\alpha$ tends to zero, one recovers the Wasserstein distance between the feature information, and as $\alpha$ goes to one, one recovers the Gromov-Wasserstein distance between the structure information:
\[ \lim_{\alpha\to 0} FGW_{\alpha,p,q}(\mu,\nu) = \big(W_{pq}(\mu_A,\nu_B)\big)^q, \qquad \lim_{\alpha\to 1} FGW_{\alpha,p,q}(\mu,\nu) = \big(GW_{pq}(\mu_X,\nu_Y)\big)^q. \]
Proof. Let $\pi_{OT} \in \Pi(\mu_A,\nu_B)$ be an optimal coupling for the $pq$-Wasserstein distance between $\mu_A$ and $\nu_B$. We can use the same Gluing Lemma (Lemma 5.3.2 in [Ambrosio 2005]) to construct $\rho \in \mathcal{P}(X\times\Omega\times\Omega\times Y)$ such that $\rho \in \Pi(\mu,\nu)$ and the marginal of $\rho$ on the two middle factors $\Omega\times\Omega$ is $\pi_{OT}$. Moreover, we have:
\[ \int_{\Omega\times\Omega} d(a,b)^{pq}\,d\pi_{OT}(a,b) = \int_{X\times\Omega\times\Omega\times Y} d(a,b)^{pq}\,d\rho(x,a,b,y). \tag{6.23} \]
Let $\alpha \ge 0$ and $\pi_\alpha$ an optimal plan for the Fused Gromov-Wasserstein distance between $\mu$ and $\nu$. We can deduce that:
\[
\begin{aligned}
&FGW_{\alpha,p,q}(\mu,\nu)^p - (1-\alpha)^p\, W_{pq}(\mu_A,\nu_B)^{pq} \\
&= \int \big((1-\alpha)d(a,b)^q + \alpha L(x,y,x',y')^q\big)^p \,d\pi_\alpha\,d\pi_\alpha - \int_{\Omega\times\Omega}(1-\alpha)^p d(a,b)^{pq}\,d\pi_{OT}(a,b) \\
&\overset{(*)}{\le} \int \big((1-\alpha)d(a,b)^q + \alpha L(x,y,x',y')^q\big)^p \,d\rho(x,a,b,y)\,d\rho(x',a',b',y') - (1-\alpha)^p\int d(a,b)^{pq}\,d\rho(x,a,b,y) \\
&= \sum_{k=0}^{p-1}\binom{p}{k}(1-\alpha)^k\alpha^{p-k}\int d(a,b)^{qk}\,L(x,y,x',y')^{q(p-k)}\,d\rho\,d\rho,
\end{aligned}
\]
where $(*)$ uses the suboptimality of $\rho$ and (6.23), and the last equality is the binomial expansion whose $k=p$ term cancels with the subtracted Wasserstein term. We note $H_k = \int d(a,b)^{qk}L(x,y,x',y')^{q(p-k)}\,d\rho\,d\rho$. Using (6.4) we have shown that:
\[ (1-\alpha)\big(W_{pq}(\mu_A,\nu_B)\big)^q \le FGW_{\alpha,p,q}(\mu,\nu) \le \Big((1-\alpha)^p\big(W_{pq}(\mu_A,\nu_B)\big)^{pq} + \sum_{k=0}^{p-1}\binom{p}{k}(1-\alpha)^k\alpha^{p-k}H_k\Big)^{\frac1p}. \]
So $\lim_{\alpha\to 0} FGW_{\alpha,p,q}(\mu,\nu) = (W_{pq}(\mu_A,\nu_B))^q$. For the case $\alpha\to 1$, we take $\pi_{GW} \in \Pi(\mu_X,\nu_Y)$ an optimal coupling for the $pq$-Gromov-Wasserstein distance between $\mu_X$ and $\nu_Y$ and construct in the same way $\gamma \in \mathcal{P}(\Omega\times X\times Y\times\Omega)$ such that $\gamma \in \Pi(\mu,\nu)$ and the marginal of $\gamma$ on $X\times Y$ is $\pi_{GW}$. The same reasoning gives:
\[ \alpha\big(GW_{pq}(\mu_X,\nu_Y)\big)^q \le FGW_{\alpha,p,q}(\mu,\nu) \le \Big(\alpha^p\big(GW_{pq}(\mu_X,\nu_Y)\big)^{pq} + \sum_{k=0}^{p-1}\binom{p}{k}(1-\alpha)^{p-k}\alpha^k J_k\Big)^{\frac1p}, \tag{6.24} \]
with $J_k = \int d(a,b)^{q(p-k)}L(x,y,x',y')^{qk}\,d\gamma\,d\gamma$. In this way $\lim_{\alpha\to 1} FGW_{\alpha,p,q}(\mu,\nu) = (GW_{pq}(\mu_X,\nu_Y))^q$. We recall the theorem:
Theorem (Constant speed geodesic). Let $p \ge 1$ and $(X\times\Omega,d_X,\mu_0)$ and $(Y\times\Omega,d_Y,\mu_1)$ in $\mathbb{S}_p(\mathbb{R}^d)$. Let $\pi^*$ be an optimal coupling for the Fused Gromov-Wasserstein distance between $\mu_0,\mu_1$ and $t \in [0,1]$. We equip $\mathbb{R}^d$ with the $\ell_m$ norm for $m \ge 1$. We define $\eta_t: X\times\Omega\times Y\times\Omega \to X\times Y\times\Omega$ such that:
\[ \forall ((x,a),(y,b)) \in (X\times\Omega)\times(Y\times\Omega),\quad \eta_t(x,a,y,b) = \big(x,y,(1-t)a + tb\big). \]
Then:
\[ \big(X\times Y\times\Omega,\ (1-t)d_X \oplus t\,d_Y,\ \mu_t = \eta_t\#\pi^*\big)_{t\in[0,1]} \]
is a constant speed geodesic connecting $(X\times\Omega,d_X,\mu_0)$ and $(Y\times\Omega,d_Y,\mu_1)$ in the metric space $\big(\mathbb{S}_p(\mathbb{R}^d), FGW_{\alpha,p,1}\big)$.

Proof. We note $S_t = (X\times Y\times\Omega, d_t, \mu_t = \eta_t\#\pi^*)_{t\in[0,1]}$ where $d_t = (1-t)d_X \oplus t\,d_Y$. Let $\|\cdot\|$ be any $\ell_m$ norm for $m \ge 1$. It suffices to prove:
\[ FGW_{\alpha,p,1}(\mu_t,\mu_s) \le |t-s|\, FGW_{\alpha,p,1}(\mu_0,\mu_1). \tag{6.25} \]
To do so we consider $\Delta_{ts} = (\eta_t\times\eta_s)\#\pi^* \in \Pi(\mu_t,\mu_s)$ and the associated "diagonal" coupling $\gamma_{ts}$, obtained by charging only the pairs of structure points coupled by $\Delta_{ts}$ (6.26); since $\Delta_{ts}\in\Pi(\mu_t,\mu_s)$ we have $\gamma_{ts}\in\Pi(\mu_t,\mu_s)$. So by suboptimality:
\[
\begin{aligned}
FGW_{\alpha,p,1}(\mu_t,\mu_s)^p &\le \int \Big((1-\alpha)d(\mathbf{a},\mathbf{b}) + \alpha\big|d_t[(x,y),(x',y')] - d_s[(x,y),(x',y')]\big|\Big)^p \,d\Delta_{ts}\,d\Delta_{ts} \\
&= \int_{((X\times\Omega)\times(Y\times\Omega))^2} \Big((1-\alpha)\big\|(1-t)a+tb-(1-s)a-sb\big\| \\
&\qquad\qquad + \alpha\big|(1-t)d_X(x,x') + t\,d_Y(y,y') - (1-s)d_X(x,x') - s\,d_Y(y,y')\big|\Big)^p \,d\pi^*(x,a,y,b)\,d\pi^*(x',a',y',b') \\
&= |t-s|^p\int \Big((1-\alpha)\|a-b\| + \alpha\big|d_X(x,x') - d_Y(y,y')\big|\Big)^p \,d\pi^*(x,a,y,b)\,d\pi^*(x',a',y',b').
\end{aligned}
\]
So $FGW_{\alpha,p,1}(\mu_t,\mu_s) \le |t-s|\,FGW_{\alpha,p,1}(d_X,d_Y,\mu_0,\mu_1)$, which proves (6.25).

This section contains all the proofs of Chapter 4. We recall the following notations:
\[ J(c_X,c_Y,\pi) \overset{def}{=} \int_{X\times X}\int_{Y\times Y} \big|c_X(x,x') - c_Y(y,y')\big|^2 \,d\pi(x,y)\,d\pi(x',y') \]
\[ J(T) \overset{def}{=} \int \big(\|x-x'\|_2^2 - \|T(x)-T(x')\|_2^2\big)^2 \,d\mu(x)\,d\mu(x') \]
\[ LGM(\mu,\nu) \overset{def}{=} \inf_{\substack{T\#\mu=\nu \\ T \text{ is linear}}} J(T) = \inf_{\substack{T\#\mu=\nu \\ T \text{ is linear}}} \int \big(\|x-x'\|_2^2 - \|T(x)-T(x')\|_2^2\big)^2 \,d\mu(x)\,d\mu(x') \]
We recall the theorem:

Theorem (A new special case for the Quadratic Assignment Problem). For real numbers $x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$,
\[ \min_{\sigma\in S_n} \sum_{i,j} -(x_i-x_j)^2\,(y_{\sigma(i)}-y_{\sigma(j)})^2 \tag{6.27} \]
is achieved either by the identity permutation $\sigma(i)=i$ ($Id$) or the anti-identity permutation $\sigma(i)=n+1-i$ ($anti\text{-}Id$). In other words:
\[ \exists \sigma \in \{Id,\ anti\text{-}Id\},\quad \sigma \in \operatorname*{arg\,min}_{\sigma\in S_n} \sum_{i,j} -(x_i-x_j)^2\,(y_{\sigma(i)}-y_{\sigma(j)})^2. \tag{6.28} \]
Let us note $I = \{\mathbf{x},\mathbf{y} \in \mathbb{R}^n\times\mathbb{R}^n \mid x_1 < \cdots < x_n,\ y_1 < \cdots < y_n\}$. We consider for $\mathbf{x},\mathbf{y} \in I$:
\[ \max_{\sigma\in S_n} Z(\mathbf{x},\mathbf{y},\sigma) = \max_{\sigma\in S_n} \sum_{i,j}(x_i-x_j)^2\,(y_{\sigma(i)}-y_{\sigma(j)})^2. \tag{6.29} \]
The original problem is equivalent to maximizing $Z(\mathbf{x},\mathbf{y},\sigma)$ over $S_n$. Given $\mathbf{x},\mathbf{y}\in I$, we define $X \overset{def}{=} \sum_i x_i$ and $Y \overset{def}{=} \sum_i y_i$. Then:
\[
\begin{aligned}
\max_{\sigma\in S_n} Z(\mathbf{x},\mathbf{y},\sigma) &= \max_{\sigma\in S_n} \sum_{i,j}(x_i-x_j)^2(y_{\sigma(i)}-y_{\sigma(j)})^2 \\
&= \max_{\sigma\in S_n} \Big[\sum_{i,j}(x_i^2+x_j^2)(y_{\sigma(i)}^2+y_{\sigma(j)}^2) - 2\sum_{i,j}x_ix_j(y_{\sigma(i)}^2+y_{\sigma(j)}^2) - 2\sum_{i,j}y_{\sigma(i)}y_{\sigma(j)}(x_i^2+x_j^2) + 4\sum_{i,j}x_ix_jy_{\sigma(i)}y_{\sigma(j)}\Big] \\
&= \max_{\sigma\in S_n} \Big[2n\sum_i x_i^2y_{\sigma(i)}^2 - 4X\sum_i x_iy_{\sigma(i)}^2 - 4Y\sum_i x_i^2y_{\sigma(i)} + 4\Big(\sum_i x_iy_{\sigma(i)}\Big)^2 + 2\Big(\sum_i x_i^2\Big)\Big(\sum_i y_i^2\Big)\Big] \\
&\overset{(*)}{=} Cte + 2\Big(\max_{\sigma\in S_n} \sum_i n\,x_i^2y_{\sigma(i)}^2 - 2\sum_i\big(Xx_iy_{\sigma(i)}^2 + Yx_i^2y_{\sigma(i)}\big) + 2\Big(\sum_i x_iy_{\sigma(i)}\Big)^2\Big),
\end{aligned}
\]
where in $(*)$ we defined $Cte \overset{def}{=} 2\big(\sum_i x_i^2\big)\big(\sum_i y_i^2\big)$, the term that does not depend on $\sigma$.
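The expansion above can be verified numerically for small $n$ by comparing it with a direct evaluation of $Z$; a brute-force illustration (not part of the thesis):

```python
import itertools
import random

random.seed(0)
n = 4
x = sorted(random.gauss(0, 1) for _ in range(n))
y = sorted(random.gauss(0, 1) for _ in range(n))
X, Y = sum(x), sum(y)                                 # first moments
cte = 2 * sum(v * v for v in x) * sum(v * v for v in y)
errs = []
for s in itertools.permutations(range(n)):
    # direct evaluation of Z(x, y, sigma)
    Z = sum((x[i] - x[j]) ** 2 * (y[s[i]] - y[s[j]]) ** 2
            for i in range(n) for j in range(n))
    # expanded form from the derivation above
    expanded = (2 * n * sum(x[i] ** 2 * y[s[i]] ** 2 for i in range(n))
                - 4 * X * sum(x[i] * y[s[i]] ** 2 for i in range(n))
                - 4 * Y * sum(x[i] ** 2 * y[s[i]] for i in range(n))
                + 4 * sum(x[i] * y[s[i]] for i in range(n)) ** 2
                + cte)
    errs.append(abs(Z - expanded))
max_err = max(errs)
assert max_err < 1e-9
```

The two evaluations agree for every permutation, as expected from the algebraic identity.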
Overall we have:
\[ \forall \mathbf{x},\mathbf{y}\in I,\quad \operatorname*{arg\,max}_{\sigma\in S_n} Z(\mathbf{x},\mathbf{y},\sigma) = \operatorname*{arg\,max}_{\sigma\in S_n} \sum_i n\,x_i^2y_{\sigma(i)}^2 - 2\sum_i\big(Xx_iy_{\sigma(i)}^2 + Yx_i^2y_{\sigma(i)}\big) + 2\Big(\sum_i x_iy_{\sigma(i)}\Big)^2. \tag{6.30} \]
Since $Z$ is invariant by translation of $\mathbf{x},\mathbf{y}$, we can suppose without loss of generality that $X = Y = 0$. We consider the set $D = \{\mathbf{x},\mathbf{y}\in\mathbb{R}^n\times\mathbb{R}^n \mid x_1 < \cdots < x_n,\ y_1 < \cdots < y_n,\ \sum_i x_i = \sum_j y_j = 0\}$. We want to find for $\mathbf{x},\mathbf{y}\in D$:
\[ \max_{\sigma\in S_n} \Big(n\sum_i x_i^2y_{\sigma(i)}^2 + 2\Big(\sum_i x_iy_{\sigma(i)}\Big)^2\Big) \overset{def}{=} \max_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma). \tag{QAP} \]
We have the following result:
Lemma 6.2.1.
Let $\mathbf{x},\mathbf{y}\in D$ and consider the problem:
\[ \max_{\pi\in DS} \sum_{i,j,k,l} \big(x_i^2y_j^2 + 2x_iy_jx_ky_l\big)\,\pi_{ij}\pi_{kl}, \tag{QP} \]
where $DS$ is the set of doubly stochastic matrices. Then (QP) and (QAP) are equivalent. More precisely, if $\sigma^*$ is an optimal solution of (QAP) then $\pi_{\sigma^*}$ defined by $\pi_{\sigma^*}(i,j) = 1$ if $j=\sigma^*(i)$, else $0$, for all $(i,j)\in[\![n]\!]^2$, is an optimal solution of (QP); and if $\pi^*$ is an optimal solution of (QP) then it is supported on a permutation $\sigma^*$ which is an optimal solution of (QAP).

Proof. The problem (QAP) can be rewritten as:
\[
\max_{\substack{P_{ij}\in\{0,1\} \\ \forall j\,\sum_i P_{ij}=1,\ \forall i\,\sum_j P_{ij}=1}} \Big(n\sum_{i,j}x_i^2y_j^2P_{ij} + 2\Big(\sum_{i,j}x_iy_jP_{ij}\Big)^2\Big)
= \max_{P} \Big(n\sum_{i,j}x_i^2y_j^2P_{ij} + 2\sum_{i,j,k,l}x_ix_ky_jy_lP_{ij}P_{kl}\Big)
\overset{(*)}{=} \max_{P} \sum_{i,j,k,l}\big(x_i^2y_j^2 + 2x_iy_jx_ky_l\big)P_{ij}P_{kl}. \tag{6.31}
\]
In $(*)$ we used $\sum_{k,l}P_{kl} = n$. We consider the following relaxation of (6.31):
\[ \max_{\pi\in DS} \sum_{i,j,k,l}\big(x_i^2y_j^2 + 2x_iy_jx_ky_l\big)\,\pi_{ij}\pi_{kl}, \tag{6.32} \]
which is a maximization of a convex function. More precisely, it is a quadratic programming problem whose Hessian $\mathbf{x}\mathbf{x}^T \otimes_K \mathbf{y}\mathbf{y}^T$ is positive semi-definite. Since the problem is a maximization of a convex function, an optimal solution $\pi^*$ of (QP) necessarily lies in the extremal points of $DS$ [Rockafellar 1970], which by Birkhoff's theorem are the permutation matrices, such that both (QP) and (QAP) are equivalent: if $\pi^*$ is an optimal solution it is necessarily supported on a $\sigma^*\in S_n$ such that $\sigma^*$ is an optimal solution of (QAP), and if $\sigma^*\in S_n$ is an optimal solution of (QAP) then $\pi^*$ defined by $\pi^*_{ij}=1$ if $j=\sigma^*(i)$, else $0$, for all $(i,j)\in[\![n]\!]^2$, is an optimal solution of (QP).

Lemma 6.2.2.
Let $\mathbf{x},\mathbf{y}\in D$. For $\sigma\in S_n$ we note $C(\mathbf{x},\mathbf{y},\sigma) = 2\sum_i x_iy_{\sigma(i)}$. Let $\pi^*$ be an optimal solution of (QP) with $\sigma^*$ the permutation associated to $\pi^*$. If $C(\mathbf{x},\mathbf{y},\sigma^*) > 0$ then $\pi^* = I_n$ is the identity, and if $C(\mathbf{x},\mathbf{y},\sigma^*) < 0$ then $\pi^* = J_n$ is the anti-identity. To prove this result we will rely on the following theorem, which gives necessary conditions for being an optimal solution of (QP):
Theorem (Theorem 1.12 in [Murty 1988]). Consider the following (QP):
\[ \min_{\mathbf{x}}\ f(\mathbf{x}) = c\mathbf{x} + \mathbf{x}^TQ\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} = b,\ \mathbf{x} \ge 0. \tag{6.33} \]
Then if $\mathbf{x}^*$ is an optimal solution of (6.33), it is an optimal solution of the following (LP):
\[ \min_{\mathbf{x}}\ \big(c + \mathbf{x}_*^TQ\big)\mathbf{x} \quad \text{s.t.}\quad A\mathbf{x} = b,\ \mathbf{x} \ge 0. \tag{6.34} \]
Proof of Lemma 6.2.2. Applying the theorem of [Murty 1988] recalled above in our case gives that if $\pi^*$ is a solution of (QP), it is necessarily a solution of the following (LP):
\[ n\sum_{i,j}x_i^2y_j^2\pi^*_{ij} + \max_{\pi\in DS}\ 2\Big(\sum_{i,j}x_iy_j\pi^*_{ij}\Big)\Big(\sum_{k,l}x_ky_l\pi_{kl}\Big). \tag{6.35} \]
Since $\pi^*$ is supported on a permutation $\sigma^*$, this gives:
\[ n\sum_i x_i^2y_{\sigma^*(i)}^2 + \max_{\pi\in DS}\ C(\mathbf{x},\mathbf{y},\sigma^*)\sum_{k,l}x_ky_l\pi_{kl}, \tag{LP} \]
where $C(\mathbf{x},\mathbf{y},\sigma^*) = 2\big(\sum_i x_iy_{\sigma^*(i)}\big)$.
• If $C(\mathbf{x},\mathbf{y},\sigma^*) > 0$ then $\pi^* = I_n$. This is a consequence of the Rearrangement Inequality (see Memo 6.2.1), which states that for all permutations $\sigma \ne Id$, $\sum_i x_iy_{\sigma(i)} < \sum_i x_iy_i$ (since the $x_i$ and $y_j$ are distinct). Using the fact that an optimal solution of (LP) is supported on a permutation concludes.
• If $C(\mathbf{x},\mathbf{y},\sigma^*) < 0$ then $\pi^* = J_n$, since $\sum_i x_iy_{n+1-i} < \sum_i x_iy_{\sigma(i)}$ for every permutation $\sigma \ne anti\text{-}Id$, again by the Rearrangement Inequality.
Using both results we can prove the following proposition, which is the main ingredient to prove Theorem 4.1.1:

Proposition 6.2.1.
Let $\mathbf{x},\mathbf{y}\in D$ and $\sigma^*$ a solution of (QAP), i.e. $\sigma^* \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$. For $\sigma\in S_n$ we note $C(\mathbf{x},\mathbf{y},\sigma) = 2\sum_i x_iy_{\sigma(i)}$. If $C(\mathbf{x},\mathbf{y},\sigma^*) > 0$ then $\sigma^*$ is the identity permutation $\sigma^*(i)=i$, and if $C(\mathbf{x},\mathbf{y},\sigma^*) < 0$ then $\sigma^*$ is the anti-identity permutation $\sigma^*(i)=n+1-i$ for all $i\in[\![n]\!]$.

Proof. Let $\sigma^*$ be an optimal solution of (QAP) and $\pi^*$ defined by $\pi^*_{ij}=1$ if $j=\sigma^*(i)$, else $0$. By Lemma 6.2.1 we know that $\pi^*$ is an optimal solution of (QP). Consider the case $C(\mathbf{x},\mathbf{y},\sigma^*) > 0$. Suppose that $\sigma^*$ is not the identity; then $\pi^* \ne I_n$, which is not possible by Lemma 6.2.2 since $\pi^*$ is an optimal solution of (QP). The same applies for $C(\mathbf{x},\mathbf{y},\sigma^*) < 0$. The remaining case $C(\mathbf{x},\mathbf{y},\sigma^*) = 0$ will be handled by a perturbation argument relying on the following continuity property of $g$:

Lemma 6.2.3 (Continuity of $g$). Let $\mathbf{x},\mathbf{y}\in D$ be fixed. There exists $\varepsilon_{x,y} > 0$ such that for all $\|\mathbf{h}\| < \varepsilon_{x,y}$ we have:
\[ \operatorname*{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma) \subset \operatorname*{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma). \tag{6.37} \]

Memo 6.2.1 (Rearrangement Inequality). Let $x_1 \le \cdots \le x_n$ and $y_1 \le \cdots \le y_n$. Then we have:
\[ \forall \sigma\in S_n,\quad \sum_i x_iy_{n+1-i} \le \sum_i x_iy_{\sigma(i)} \le \sum_i x_iy_i. \tag{6.36} \]
If the numbers are distinct then the lower bound (resp. upper bound) is attained only for the permutation which reverses the order (resp. for the identity permutation).
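Memo 6.2.1 can be checked by brute force for small $n$ (an illustration, not part of the thesis): enumerating all permutations confirms that the identity maximises and the order-reversing permutation minimises $\sum_i x_iy_{\sigma(i)}$:

```python
import itertools
import random

random.seed(2)
n = 5
# sorted values, almost surely distinct
x = sorted(random.gauss(0, 1) for _ in range(n))
y = sorted(random.gauss(0, 1) for _ in range(n))
vals = {s: sum(x[i] * y[s[i]] for i in range(n))
        for s in itertools.permutations(range(n))}
identity = tuple(range(n))
reversing = tuple(range(n - 1, -1, -1))
assert max(vals, key=vals.get) == identity
assert min(vals, key=vals.get) == reversing
```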
Proof.
Let $\mathbf{x},\mathbf{y}\in D$, $\sigma^* \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$ and $\tau$ any permutation in $S_n$ such that $\tau \notin \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$. Then we have $g(\mathbf{x},\mathbf{y},\sigma^*) > g(\mathbf{x},\mathbf{y},\tau)$. Let $\eta = g(\mathbf{x},\mathbf{y},\sigma^*) - g(\mathbf{x},\mathbf{y},\tau) > 0$. For every $\beta\in S_n$, the function $g(\cdot,\mathbf{y},\beta)$ is continuous. In this way:
\[ \forall \beta\in S_n,\ \exists \varepsilon_{\mathbf{x},\mathbf{y}}(\beta,\sigma^*,\tau) > 0,\ \forall \|\mathbf{h}\| < \varepsilon_{\mathbf{x},\mathbf{y}}(\beta,\sigma^*,\tau),\quad \big|g(\mathbf{x}+\mathbf{h},\mathbf{y},\beta) - g(\mathbf{x},\mathbf{y},\beta)\big| < \frac{\eta}{4}. \tag{6.38} \]
Take $\mathbf{h}\in\mathbb{R}^n$ such that $\|\mathbf{h}\| < \min_{(\beta,\sigma,\tau)\in (S_n)^3} \varepsilon_{\mathbf{x},\mathbf{y}}(\beta,\sigma,\tau)$. By (6.38) applied to $\sigma^*$ and $\tau$:
\[
g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma^*) - g(\mathbf{x}+\mathbf{h},\mathbf{y},\tau) = \big[g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma^*) - g(\mathbf{x},\mathbf{y},\sigma^*)\big] + \big[g(\mathbf{x},\mathbf{y},\sigma^*) - g(\mathbf{x},\mathbf{y},\tau)\big] + \big[g(\mathbf{x},\mathbf{y},\tau) - g(\mathbf{x}+\mathbf{h},\mathbf{y},\tau)\big] > -\frac{\eta}{4} + \eta - \frac{\eta}{4} = \frac{\eta}{2} > 0,
\]
so $g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma^*) > g(\mathbf{x}+\mathbf{h},\mathbf{y},\tau)$, and in this way $\tau \notin \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma)$ because $\sigma^*$ leads to a strictly better cost. Overall we have proven that for any permutation $\tau$, if $\tau \notin \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$ and $\|\mathbf{h}\| < \min_{(\beta,\sigma,\tau)\in(S_n)^3} \varepsilon_{\mathbf{x},\mathbf{y}}(\beta,\sigma,\tau)$, then $\tau \notin \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma)$, which proves that $\operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\mathbf{h},\mathbf{y},\sigma) \subset \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$. Using the previous lemma we can now prove the following result:

Lemma 6.2.4.
Let $\mathbf{x},\mathbf{y}\in D$ be fixed. There exists $\boldsymbol{\varepsilon}\in\mathbb{R}^n$ such that:
\[ \operatorname*{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma) \subset \operatorname*{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma) \quad\text{and}\quad \operatorname*{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma) \subset \{Id,\ anti\text{-}Id\}. \tag{6.40} \]
Proof.
Let $\mathbf{x},\mathbf{y}\in D$. We consider $\boldsymbol{\varepsilon} = (\zeta,-\zeta,0,\ldots,0)$ with $\zeta > 0$ such that $2\zeta < x_2 - x_1$ and $\|\boldsymbol{\varepsilon}\| < \varepsilon_{x,y}$ defined in Lemma 6.2.3. We have $\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y}\in D$, since $\sum_i (x_i + \varepsilon(i)) = \sum_i x_i + \zeta - \zeta = 0$ and $x_1+\varepsilon(1) < \cdots < x_n+\varepsilon(n)$ since $x_1 + \zeta < x_2 - \zeta$.
Let $\sigma^*_\varepsilon \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma)$. By Lemma 6.2.3 we have $\sigma^*_\varepsilon \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$. Moreover we have $C(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma^*_\varepsilon) = 2\big[\sum_i x_iy_{\sigma^*_\varepsilon(i)} + \zeta\big(y_{\sigma^*_\varepsilon(1)} - y_{\sigma^*_\varepsilon(2)}\big)\big]$.
• If $\sum_i x_iy_{\sigma^*_\varepsilon(i)} = 0$ then $C(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma^*_\varepsilon) = 2\zeta\big(y_{\sigma^*_\varepsilon(1)} - y_{\sigma^*_\varepsilon(2)}\big) \ne 0$, since all the $y_i$ are distinct. We can apply Proposition 6.2.1 with $\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y}\in D$ to conclude that $\sigma^*_\varepsilon$ is either the identity or the anti-identity.
• If $\sum_i x_iy_{\sigma^*_\varepsilon(i)} \ne 0$ then $\sigma^*_\varepsilon \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$ and $C(\mathbf{x},\mathbf{y},\sigma^*_\varepsilon) \ne 0$, so by Proposition 6.2.1 with $\mathbf{x},\mathbf{y}\in D$ we can conclude that $\sigma^*_\varepsilon$ is either the identity or the anti-identity.

Corollary 6.2.1 (Theorem 4.1.1 is valid). Let $\mathbf{x},\mathbf{y}\in D$. The identity or the anti-identity is an optimal solution of (QAP).
Proof.
Let $\mathbf{x},\mathbf{y}\in D$. We consider $\boldsymbol{\varepsilon}$ defined in Lemma 6.2.4 and $\sigma^*_\varepsilon \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x}+\boldsymbol{\varepsilon},\mathbf{y},\sigma)$. Then by Lemma 6.2.4, $\sigma^*_\varepsilon$ is either the identity or the anti-identity. Moreover, by Lemma 6.2.4, $\sigma^*_\varepsilon \in \operatorname{arg\,max}_{\sigma\in S_n} g(\mathbf{x},\mathbf{y},\sigma)$, so it is an optimal solution of (QAP). This concludes that the identity or the anti-identity is an optimal solution of (QAP), which proves Theorem 4.1.1.
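Theorem 4.1.1 can be verified by brute force for small $n$ (an illustration, not part of the thesis): over random sorted inputs, the minimum of (6.27) over all of $S_n$ always coincides with the better of the identity and the anti-identity:

```python
import itertools
import random

def qap_cost(x, y, s):
    """Objective of (6.27) for the permutation s (as a tuple of indices)."""
    n = len(x)
    return sum(-((x[i] - x[j]) ** 2) * ((y[s[i]] - y[s[j]]) ** 2)
               for i in range(n) for j in range(n))

random.seed(0)
n = 5
identity = tuple(range(n))
anti = tuple(range(n - 1, -1, -1))
checks = []
for _ in range(20):
    x = sorted(random.gauss(0, 1) for _ in range(n))
    y = sorted(random.gauss(0, 1) for _ in range(n))
    best = min(qap_cost(x, y, s) for s in itertools.permutations(range(n)))
    extremes = min(qap_cost(x, y, identity), qap_cost(x, y, anti))
    checks.append(abs(best - extremes) < 1e-9)
theorem_holds = all(checks)
assert theorem_holds
```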
GM and GW in the discrete case

This paragraph aims at proving the equivalence between $GM$ and $GW$. We recall the theorem:

Theorem (Equivalence between GW and GM for discrete measures). Let $\mu\in\mathcal{P}(\mathbb{R}^p)$, $\nu\in\mathcal{P}(\mathbb{R}^q)$ be discrete probability measures with the same number of atoms and uniform weights, i.e. $\mu = \frac1n\sum_{i=1}^n \delta_{x_i}$, $\nu = \frac1n\sum_{i=1}^n \delta_{y_i}$ with $x_i\in\mathbb{R}^p$, $y_i\in\mathbb{R}^q$. For $x\in\mathbb{R}^p$ we note $\|x\|_{2,p} = \sqrt{\sum_{i=1}^p |x_i|^2}$ the $\ell_2$ norm on $\mathbb{R}^p$ (same for $\mathbb{R}^q$). Let $c_X(x,x') = \|x-x'\|_{2,p}^2$, $c_Y(y,y') = \|y-y'\|_{2,q}^2$. Then:
\[ GW(c_X,c_Y,\mu,\nu) = GM(c_X,c_Y,\mu,\nu). \tag{6.41} \]
Moreover, if $p=q=1$, i.e. $c_X(x,x') = c_Y(x,x') = |x-x'|^2$ for $x,x'\in\mathbb{R}$, and if $x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$, the optimal values are achieved by considering either the identity or the anti-identity permutation.

Proof. The proof is essentially based on theoretical results from [Maron 2018] and on Theorem 4.1.1. In [Maron 2018] the authors consider the minimizing energy problem $\min_{X\in\Pi_n} -\mathrm{tr}(BX^TAX)$, where $\Pi_n$ is the set of permutation matrices. In fact, the $GM$ problem defined in this chapter is equivalent to $\min_{X\in\Pi_n} -\mathrm{tr}(BX^TAX)$ by considering $A = (\|x_i-x_j\|_{2,p}^2)_{i,j}$ and $B = (\|y_i-y_j\|_{2,q}^2)_{i,j}$. To tackle this problem the authors propose to minimize $-\mathrm{tr}(BX^TAX)$ over the set of doubly stochastic matrices (which is the convex hull of $\Pi_n$):
\[ DS = \{X\in\mathbb{R}^{n\times n}\ \text{s.t.}\ X\mathbf{1}_n = X^T\mathbf{1}_n = \mathbf{1}_n,\ X \ge 0\}. \]
Minimizing $-\mathrm{tr}(BX^TAX)$ over $DS$ is equivalent to solving the $GW$ distance when $a_i = b_j = \frac1n$. The paper proves that when both $A$ and $B$ are conditionally positive (or negative) definite of order 1, the relaxation leads to the same optimum, so that the minimum over $DS$ is the same as the minimum over $\Pi_n$ [Maron 2018, Theorem 1]. Yet $A$ and $B$ defined previously satisfy this property as squared Euclidean distance matrices (see the examples under Definition 2 in [Maron 2018]), and so $GW$ and $GM$ coincide. Moreover, when $p=q=1$ and when the samples are sorted, we can apply Theorem 4.1.1 to prove that an optimal permutation of the $GM$ problem is found either at the identity or the anti-identity permutation, which concludes the proof.

GW in the 1D case

We recall the result:
Lemma.
The $GM$ and $GW$ costs in 1D with the same number of atoms and uniform weights can be computed in $O(n\log n)$.

Proof. As seen in Theorem 4.1.2, the optimal permutation $\sigma^*$ can be found in $O(n\log(n))$. Moreover, the final cost can be written using the binomial expansion:
\[
\begin{aligned}
\sum_{i,j}\big((x_i-x_j)^2 - (y_{\sigma^*(i)}-y_{\sigma^*(j)})^2\big)^2
&= 2n\sum_i x_i^4 - 8\Big(\sum_i x_i^3\Big)\Big(\sum_k x_k\Big) + 6\Big(\sum_i x_i^2\Big)^2 \\
&\quad + 2n\sum_i y_i^4 - 8\Big(\sum_i y_i^3\Big)\Big(\sum_k y_k\Big) + 6\Big(\sum_i y_i^2\Big)^2 \\
&\quad - 4\Big(\sum_i x_i^2\Big)\Big(\sum_k y_k^2\Big) - 4n\sum_i x_i^2y_{\sigma^*(i)}^2 \\
&\quad + 8\sum_i\Big(\Big(\sum_k x_k\Big)x_iy_{\sigma^*(i)}^2 + \Big(\sum_k y_k\Big)x_i^2y_{\sigma^*(i)}\Big) - 8\Big(\sum_i x_iy_{\sigma^*(i)}\Big)^2,
\end{aligned} \tag{6.42}
\]
which can be computed in $O(n)$ operations once $\sigma^*$ is known. We recall the theorem:
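The closed form (6.42) can be checked against the direct $O(n^2)$ evaluation. In the sketch below (an illustration, not part of the thesis), `y` stands for the samples already reordered by the optimal permutation $\sigma^*$:

```python
import random

def cost_direct(x, y):
    """Direct O(n^2) evaluation of sum_{i,j} ((x_i-x_j)^2 - (y_i-y_j)^2)^2."""
    n = len(x)
    return sum(((x[i] - x[j]) ** 2 - (y[i] - y[j]) ** 2) ** 2
               for i in range(n) for j in range(n))

def cost_closed_form(x, y):
    """Closed form (6.42): O(n) once the power sums are accumulated."""
    n = len(x)
    sx = [sum(v ** k for v in x) for k in range(5)]  # power sums of x
    sy = [sum(v ** k for v in y) for k in range(5)]  # power sums of y
    q = sum(a * b for a, b in zip(x, y))             # sum x_i y_i
    c = sum(a * a * b for a, b in zip(x, y))         # sum x_i^2 y_i
    d = sum(a * b * b for a, b in zip(x, y))         # sum x_i y_i^2
    p2 = sum((a * b) ** 2 for a, b in zip(x, y))     # sum x_i^2 y_i^2
    return (2 * n * sx[4] - 8 * sx[3] * sx[1] + 6 * sx[2] ** 2
            + 2 * n * sy[4] - 8 * sy[3] * sy[1] + 6 * sy[2] ** 2
            - 4 * sx[2] * sy[2] - 4 * n * p2
            + 8 * (sx[1] * d + sy[1] * c) - 8 * q * q)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(40)]
y = [random.gauss(0, 1) for _ in range(40)]
direct_val, closed_val = cost_direct(x, y), cost_closed_form(x, y)
assert abs(direct_val - closed_val) <= 1e-8 * max(1.0, abs(direct_val))
```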
Theorem (Properties of SGW).
• For all $\Delta$, $SGW_\Delta$ and $RISGW$ are translation invariant. $RISGW$ is also rotation invariant when $p=q$; more precisely, if $Q\in\mathbb{O}(p)$ is an orthogonal matrix, $RISGW(Q\#\mu,\nu) = RISGW(\mu,\nu)$ (same for any $Q\in\mathbb{O}(q)$ applied on $\nu$).
• $SGW$ and $RISGW$ are pseudo-distances on $\mathcal{P}(\mathbb{R}^p)$, i.e. they are symmetric, satisfy the triangle inequality, and $SGW(\mu,\mu) = RISGW(\mu,\mu) = 0$.
• Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^p)\times\mathcal{P}(\mathbb{R}^p)$ be probability distributions with compact supports. If $SGW(\mu,\nu) = 0$ then $\mu$ and $\nu$ are isomorphic for the distance induced by the $\ell_1$ norm on $\mathbb{R}^p$, i.e. $d(x,x') = \sum_{i=1}^p |x_i - x'_i|$ for $(x,x')\in\mathbb{R}^p\times\mathbb{R}^p$. In particular this implies:
\[ SGW(\mu,\nu) = 0 \implies GW(d,d,\mu,\nu) = 0. \tag{6.43} \]

The invariance by translation is clear since the costs are invariant by translation of the support of the measures. The pseudo-distance properties are straightforward thanks to the properties of $GW$. For the invariance by rotation, if $p=q$ then $\mathbb{V}_p(\mathbb{R}^p)$ is in bijection with $\mathbb{O}(p)$, so for $Q\in\mathbb{O}(p)$:
\[
\begin{aligned}
RISGW(Q\#\mu,\nu) &= \min_{\Delta\in\mathbb{V}_p(\mathbb{R}^p)} SGW_\Delta(Q\#\mu,\nu) = \min_{\Delta\in\mathbb{O}(p)} SGW_\Delta(Q\#\mu,\nu) \\
&= \min_{\Delta\in\mathbb{O}(p)} \mathbb{E}_{\theta\sim\lambda_{q-1}}\big[GW\big(d,\,P_{\Delta\theta}\#(Q\#\mu),\,P_\theta\#\nu\big)\big] \\
&= \min_{\Delta\in\mathbb{O}(p)} \mathbb{E}_{\theta\sim\lambda_{q-1}}\big[GW\big(d,\,P_{\Delta\theta}\#\mu,\,P_\theta\#\nu\big)\big] = RISGW(\mu,\nu),
\end{aligned} \tag{6.44}
\]
where the penultimate equality uses the change of variable $\Delta \leftarrow Q^T\Delta$, which leaves $\mathbb{O}(p)$ invariant. On the other side, for $\nu$, a change of variable on $\theta$ gives the result. The case $SGW = 0 \implies GW = 0$ will be a consequence of the following theorem:
Theorem 6.2.1.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^p)\times\mathcal{P}(\mathbb{R}^p)$ be probability distributions such that $\mu,\nu$ have compact supports. If for almost all $\theta\in S^{p-1}$, $P_\theta\#\mu$ and $P_\theta\#\nu$ are isomorphic, then $\mu$ and $\nu$ are isomorphic. In other words, if for almost all $\theta\in S^{p-1}$ we have:
\[ \exists T_\theta: \mathrm{supp}(P_\theta\#\mu)\subset\mathbb{R} \to \mathrm{supp}(P_\theta\#\nu)\subset\mathbb{R},\ \text{surjective, s.t.}\ T_\theta\#(P_\theta\#\mu) = P_\theta\#\nu \ \text{and}\ \forall x,x'\in\mathrm{supp}(P_\theta\#\mu),\ |T_\theta(x)-T_\theta(x')| = |x-x'|, \tag{6.45} \]
then there exists a measure preserving isometry $f$ between $\mathrm{supp}(\mu)$ and $\mathrm{supp}(\nu)$. More precisely, we have $f\#\mu = \nu$ and:
\[ \forall \mathbf{x},\mathbf{x}'\in\mathrm{supp}(\mu),\quad \|f(\mathbf{x})-f(\mathbf{x}')\|_1 = \|\mathbf{x}-\mathbf{x}'\|_1. \tag{6.46} \]
To prove this theorem we will exhibit the isometry. This result can be put in perspective with the Cramér–Wold theorem [Cramér 1936], which states that a probability measure is uniquely determined by the totality of its one-dimensional projections. Equivalently, if we consider two probability measures so that the one-dimensional measures resulting from the projections over all the hypersphere are equal, then the measures are equal. The equality relation is replaced in our theorem by the isomorphism relation.
The proof is divided into four parts. In the first one, we construct an "almost orthogonal" basis on which the projected measures are isomorphic. Building upon this result we define a sequence of functions from $\mathrm{supp}(\mu)$ to $\mathrm{supp}(\nu)$ and show that it has a convergent subsequence. We conclude by proving that the limit of the subsequence is actually a good candidate for being the isometry we are looking for. In the following, $\|\cdot\|_1$ denotes the $\ell_1$ norm, $\|\cdot\|_2$ denotes the $\ell_2$ norm, and $p \ge 2$. We recall that $\mathcal{F}\mu$ is the Fourier transform of $\mu$. We consider the following $Q_\theta$ property for $\theta\in S^{p-1}$:
\[ \exists T_\theta: \mathrm{supp}(P_\theta\#\mu)\subset\mathbb{R} \to \mathrm{supp}(P_\theta\#\nu)\subset\mathbb{R},\ \text{surjective, s.t.}\ T_\theta\#(P_\theta\#\mu) = P_\theta\#\nu \ \text{and}\ \forall x,x'\in\mathrm{supp}(P_\theta\#\mu),\ |T_\theta(x)-T_\theta(x')| = |x-x'|. \tag{$Q_\theta$} \]
Informally, if we have the $Q_\theta$ property for $\theta\in S^{p-1}$, it implies that $\mu$ and $\nu$ are isomorphic on the 1D line given by the projection w.r.t. $\theta$. We have the following lemma:

Lemma 6.2.5.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^p)\times\mathcal{P}(\mathbb{R}^p)$ and suppose that $Q_\theta$ holds for almost all $\theta\in S^{p-1}$. Let $n > p-1$. There exists a basis $(e_1(n),\ldots,e_p(n))$ of $\mathbb{R}^p$ belonging to both of the following sets:
\[ B^n_p \overset{def}{=} \Big\{(\theta_1,\ldots,\theta_p)\in(S^{p-1})^p\ \text{s.t.}\ \forall i\ne j,\ |\langle\theta_i,\theta_j\rangle| < \frac1n\Big\} \tag{6.47} \]
and
\[ \mathcal{Q} \overset{def}{=} \big\{(\theta_1,\ldots,\theta_p)\in(S^{p-1})^p\ \text{s.t.}\ \forall i\in\{1,\ldots,p\},\ Q_{\theta_i}\ \text{holds}\big\}. \tag{6.48} \]
Proof. We want to construct a basis $(e_1,\ldots,e_p)$ as orthogonal as possible such that for all $i$ we have $Q_{e_i}$. We note $\lambda^{\otimes p}_{p-1}$ the product measure $\lambda_{p-1}\otimes\cdots\otimes\lambda_{p-1}$, where $\lambda_{p-1}$ is the uniform measure on the sphere. $B^n_p$ is an open set as the inverse image by a continuous function of an open set, and it is nonempty, so $\lambda^{\otimes p}_{p-1}(B^n_p) > 0$. Moreover, since for almost all $\theta\in S^{p-1}$ we have $Q_\theta$, then $\lambda^{\otimes p}_{p-1}(\mathcal{Q}) = 1$, so $\lambda^{\otimes p}_{p-1}(B^n_p\cap\mathcal{Q}) > 0$ and we can find $(e_1(n),\ldots,e_p(n))\in B^n_p\cap\mathcal{Q}$. If $n > p-1$, the Gram matrix of $(e_1(n),\ldots,e_p(n))$ is strictly diagonally dominant, thus invertible, such that $(e_1(n),\ldots,e_p(n))$ is a basis. Note that we cannot directly consider an orthogonal basis, since the set of all orthogonal bases has measure zero.
We now express all the vectors and inner products in this new almost orthogonal basis, as stated in the following lemma:

Lemma 6.2.6.
Let $n > p-1$ and a basis $(e_1(n),\ldots,e_p(n))$ as defined in Lemma 6.2.5. Then every $\mathbf{x}\in\mathbb{R}^p$ can be written as:
\[ \mathbf{x} = \sum_{i=1}^p \big[\langle\mathbf{x},e_i(n)\rangle + R(\mathbf{x},e_i(n))\big]\,e_i(n), \tag{6.49} \]
where $|R(\mathbf{x},e_i(n))| = O(\frac1n)$. Moreover, for all $(\mathbf{x},\mathbf{y})\in\mathbb{R}^p\times\mathbb{R}^p$:
\[ \langle\mathbf{x},\mathbf{y}\rangle = \sum_{i=1}^p \langle\mathbf{x},e_i(n)\rangle\langle\mathbf{y},e_i(n)\rangle + \tilde R(\mathbf{x},\mathbf{y}), \tag{6.50} \]
where $|\tilde R(\mathbf{x},\mathbf{y})| = O(\frac1n)$.

Proof. In the following, $x_i$ denotes the $i$-th coordinate of the vector $\mathbf{x}$ in the basis $(e_1(n),\ldots,e_p(n))$, i.e. $\mathbf{x} = \sum_{i=1}^p x_ie_i(n)$. We can write $\mathbf{x} = \sum_{i=1}^p[\langle\mathbf{x},e_i(n)\rangle + R(\mathbf{x},e_i(n))]e_i(n)$ with $R(\mathbf{x},e_i(n)) \overset{def}{=} x_i - \langle\mathbf{x},e_i(n)\rangle$. We also have $|R(\mathbf{x},e_i(n))| = O(\frac1n)$. Indeed,
\[ \mathbf{x} = \sum_{i=1}^p x_ie_i \implies \forall j,\ \langle\mathbf{x},e_j\rangle = \sum_{i=1}^p x_i\langle e_i,e_j\rangle \implies x_j - \langle\mathbf{x},e_j\rangle = -\sum_{i\ne j} x_i\langle e_i,e_j\rangle \implies |R(\mathbf{x},e_j(n))| = \Big|\sum_{i\ne j} x_i\langle e_i,e_j\rangle\Big| \le \frac1n\sum_{i\ne j}|x_i|. \]
Also, in the same way, for $(\mathbf{x},\mathbf{y})\in\mathbb{R}^p\times\mathbb{R}^p$ we can rewrite their inner product:
\[ \langle\mathbf{x},\mathbf{y}\rangle = \sum_{i=1}^p \langle\mathbf{x},e_i(n)\rangle\langle\mathbf{y},e_i(n)\rangle + \tilde R(\mathbf{x},\mathbf{y}) \tag{6.51} \]
with:
\[
\begin{aligned}
\tilde R(\mathbf{x},\mathbf{y}) \overset{def}{=}\ & \langle\mathbf{x},\mathbf{y}\rangle - \sum_{i=1}^p \langle\mathbf{x},e_i(n)\rangle\langle\mathbf{y},e_i(n)\rangle \\
=\ & \sum_{i\ne j}\langle\mathbf{x},e_i(n)\rangle\langle\mathbf{y},e_j(n)\rangle\langle e_j(n),e_i(n)\rangle + \sum_{i,j}\langle\mathbf{x},e_i(n)\rangle R(\mathbf{y},e_j(n))\langle e_j(n),e_i(n)\rangle \\
&+ \sum_{i,j}\langle\mathbf{y},e_j(n)\rangle R(\mathbf{x},e_i(n))\langle e_j(n),e_i(n)\rangle + \sum_{i,j}R(\mathbf{x},e_j(n))R(\mathbf{y},e_i(n))\langle e_j(n),e_i(n)\rangle,
\end{aligned}
\]
and with the same calculation as for $R$ we have $|\tilde R(\mathbf{x},\mathbf{y})| = O(\frac1n)$.

Proposition 6.2.2.
Let $\mu,\nu\in\mathcal{P}(\mathbb{R}^p)\times\mathcal{P}(\mathbb{R}^p)$ and suppose that $Q_\theta$ holds for almost all $\theta\in S^{p-1}$ and that $\nu$ has compact support. There exists a uniformly bounded sequence $(f_n)_{n\in\mathbb{N}}$ of functions from $\mathrm{supp}(\mu)$ to $\mathrm{supp}(\nu)$ which satisfies:
\[ \forall n\in\mathbb{N},\ \forall \mathbf{x},\mathbf{x}'\in\mathrm{supp}(\mu),\quad \big|\,\|f_n(\mathbf{x})-f_n(\mathbf{x}')\|_1 - \|\mathbf{x}-\mathbf{x}'\|_1\,\big| = O\Big(\frac1n\Big), \tag{6.52} \]
\[ \forall n\in\mathbb{N},\ \forall \mathbf{s}\in\mathbb{R}^p,\quad \big|\mathcal{F}(f_n\#\mu)(\mathbf{s}) - \mathcal{F}\nu(\mathbf{s})\big| = O\Big(\frac1n\Big). \tag{6.53} \]
Proof.
In the following x i denotes the i -th coordinate of a vector x in the standard basis, i.e. a vectorwrites x = ( x , . . . , x p ). We define: ∀ n > p − , ∀ x ∈ supp( µ ) , f n ( x ) = ( T e ( n ) ( h x , e ( n ) i ) , ..., T e p ( n ) ( h x , e p ( n ) i )) (6.54)
where $(e_k(n))_{k \in [\![p]\!]}$ is the almost orthogonal basis defined in Lemma 6.2.5, and $T_{e_k(n)}$ is defined from $(Q_\theta)$ since $Q_{e_k(n)}$ holds for all $k$. It is clear from the definition that $f_n(x) \in \mathrm{supp}(\nu)$. Moreover, for $x, x' \in \mathrm{supp}(\mu)$:
$$\|f_n(x) - f_n(x')\|_1 = \sum_{k=1}^{p} |T_{e_k(n)}(\langle x, e_k(n)\rangle) - T_{e_k(n)}(\langle x', e_k(n)\rangle)| \stackrel{(*)}{=} \sum_{k=1}^{p} |\langle x, e_k(n)\rangle - \langle x', e_k(n)\rangle| = \sum_{k=1}^{p} |\langle x - x', e_k(n)\rangle|$$
where in (*) we used that $T_{e_k(n)}$ is an isometry, since $Q_{e_k(n)}$ holds and $\langle x, e_k(n)\rangle \in \mathrm{supp}(P_{e_k(n)\#}\mu)$ (idem for $x'$). In this way:
$$\big|\, \|f_n(x) - f_n(x')\|_1 - \|x - x'\|_1\, \big| = \Big| \sum_{k=1}^{p} |\langle x - x', e_k(n)\rangle| - |x^k - x'^k| \Big| \le \sum_{k=1}^{p} \big|\, |\langle x - x', e_k(n)\rangle| - |x^k - x'^k|\, \big| \stackrel{(*)}{\le} \sum_{k=1}^{p} |\langle x - x', e_k(n)\rangle - (x^k - x'^k)| = \sum_{k=1}^{p} |R(x - x', e_k(n))| = o(\tfrac{1}{n})$$
where in (*) we used the reverse triangle inequality $\big||a| - |b|\big| \le |a - b|$. Hence:
$$\big|\, \|f_n(x) - f_n(x')\| - \|x - x'\|\, \big| = o(\tfrac{1}{n}) \quad (6.55)$$
Moreover, by definition of the Fourier transform, for $s \in \mathbb{R}^p$:
$$\mathcal{F}_{f_{n\#}\mu}(s) = \int e^{-2i\pi \langle s, f_n(x)\rangle}\, d\mu(x) = \int e^{-2i\pi \sum_{k=1}^{p} s^k T_{e_k(n)}(\langle x, e_k(n)\rangle)}\, d\mu(x)$$
Moreover, using $(Q_\theta)$ we have $\mathcal{F}_{T_{e_k(n)}\# P_{e_k(n)\#}\mu}(t) = \mathcal{F}_{P_{e_k(n)\#}\nu}(t)$ for all $k \in \{1, \dots, p\}$ and any real $t \in \mathbb{R}$. This implies $\int e^{-2i\pi t\, T_{e_k(n)}(\langle e_k(n), x\rangle)}\, d\mu(x) = \int e^{-2i\pi t \langle e_k(n), y\rangle}\, d\nu(y)$.
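Before continuing, a purely numerical aside: the almost-orthogonality estimates used in these derivations can be illustrated with a toy basis. The sketch below is an assumption-laden stand-in for Lemma 6.2.5 (whose actual construction is not reproduced here): it perturbs the canonical basis of $\mathbb{R}^p$ by $O(1/n^2)$, so that the remainder of the inner-product expansion (6.50) vanishes as $n$ grows.

```python
import numpy as np

def toy_almost_orthogonal_basis(p, n, seed=0):
    # Hypothetical stand-in for Lemma 6.2.5: canonical basis of R^p
    # perturbed by O(1/n^2), rows renormalized to unit norm, so that
    # |<e_i(n), e_j(n)>| vanishes faster than 1/n for i != j.
    rng = np.random.default_rng(seed)
    E = np.eye(p) + rng.standard_normal((p, p)) / n**2
    return E / np.linalg.norm(E, axis=1, keepdims=True)  # row i is e_i(n)

def expansion_remainder(x, y, E):
    # |R~(x, y)| = |<x, y> - sum_i <x, e_i(n)><y, e_i(n)>|, cf. (6.50)
    return abs(x @ y - (E @ x) @ (E @ y))

p = 4
rng = np.random.default_rng(1)
x, y = rng.standard_normal(p), rng.standard_normal(p)
errs = [expansion_remainder(x, y, toy_almost_orthogonal_basis(p, n))
        for n in (10, 100, 1000)]
assert errs[0] > errs[1] > errs[2]  # the remainder decays as n grows
```

With this particular perturbation the remainder scales like $1/n^2$, i.e. faster than the $o(1/n)$ rate required in the proof.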
So, by applying this result with $t = s^k$, we have:
$$\int e^{-2i\pi s^k T_{e_k(n)}(\langle x, e_k(n)\rangle)}\, d\mu(x) = \int e^{-2i\pi s^k \langle e_k(n), y\rangle}\, d\nu(y) \quad (6.56)$$
Combining both results:
$$\mathcal{F}_{f_{n\#}\mu}(s) = \int e^{-2i\pi \sum_{k=1}^{p} s^k \langle e_k(n), y\rangle}\, d\nu(y) \quad (6.57)$$
We can now bound $|\mathcal{F}_{f_{n\#}\mu}(s) - \mathcal{F}_\nu(s)|$ as:
$$|\mathcal{F}_{f_{n\#}\mu}(s) - \mathcal{F}_\nu(s)| = \Big|\mathcal{F}_{f_{n\#}\mu}(s) - \int e^{-2i\pi \langle s, y\rangle}\, d\nu(y)\Big| \stackrel{(*)}{=} \Big|\mathcal{F}_{f_{n\#}\mu}(s) - \int e^{-2i\pi [\sum_{k} \langle s, e_k(n)\rangle\langle e_k(n), y\rangle + \tilde{R}(s, y)]}\, d\nu(y)\Big| \stackrel{(**)}{=} \Big|\int e^{-2i\pi \sum_{k} s^k \langle e_k(n), y\rangle}\, d\nu(y) - \int e^{-2i\pi \tilde{R}(s, y)}\, e^{-2i\pi \sum_{k} \langle s, e_k(n)\rangle\langle e_k(n), y\rangle}\, d\nu(y)\Big|$$
where in (*) we used the expansion of the inner product $\langle s, y\rangle$ in the new basis seen in Lemma 6.2.6, and in (**) we used (6.57). By injecting the expression of $s^k$ with respect to the new basis, $s^k = \langle s, e_k(n)\rangle + R(s, e_k(n))$, we have:
$$|\mathcal{F}_{f_{n\#}\mu}(s) - \mathcal{F}_\nu(s)| = \Big| \int e^{-2i\pi \sum_{k} (\langle s, e_k(n)\rangle + R(s, e_k(n)))\langle e_k(n), y\rangle}\, d\nu(y) - \int e^{-2i\pi \tilde{R}(s, y)}\, e^{-2i\pi \sum_{k} \langle s, e_k(n)\rangle\langle e_k(n), y\rangle}\, d\nu(y)\Big|$$
$$= \Big| \int e^{-2i\pi \sum_{k} \langle s, e_k(n)\rangle\langle e_k(n), y\rangle} \Big( e^{-2i\pi \sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle} - e^{-2i\pi \tilde{R}(s, y)} \Big)\, d\nu(y) \Big| \le \int \Big| e^{-2i\pi \sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle} - e^{-2i\pi \tilde{R}(s, y)} \Big|\, d\nu(y)$$
$$= \int \big| e^{-2i\pi \tilde{R}(s, y)} \big|\, \Big| e^{-2i\pi (\sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle - \tilde{R}(s, y))} - 1 \Big|\, d\nu(y) \le \int \Big| e^{-2i\pi (\sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle - \tilde{R}(s, y))} - 1 \Big|\, d\nu(y)$$
$$= \int \Big| 2i\, e^{-i\pi (\sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle - \tilde{R}(s, y))} \sin\Big(\pi\Big(\sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle - \tilde{R}(s, y)\Big)\Big) \Big|\, d\nu(y) \le 2\int \Big| \sin\Big(\pi\Big(\sum_{k} R(s, e_k(n))\langle e_k(n), y\rangle - \tilde{R}(s, y)\Big)\Big) \Big|\, d\nu(y)$$
$$\le 2\pi \int \Big( \sum_{k} |R(s, e_k(n))\langle e_k(n), y\rangle| + |\tilde{R}(s, y)| \Big)\, d\nu(y) \stackrel{(*)}{=} o(\tfrac{1}{n}) \quad (6.58)$$
where in (*) we used the fact that each term is $o(\tfrac{1}{n})$, uniformly in $y$ on the compact support of $\nu$.
In this way:
$$|\mathcal{F}_{f_{n\#}\mu}(s) - \mathcal{F}_\nu(s)| = o(\tfrac{1}{n}) \quad (6.59)$$
Moreover, $(f_n)_{n > p-1}$ is also uniformly bounded. To see this, consider $x \in \mathrm{supp}(\mu)$. For all $k \in [\![p]\!]$, $T_{e_k(n)}(\langle x, e_k(n)\rangle) \in \mathrm{supp}(P_{e_k(n)\#}\nu)$ by definition of $T_{e_k(n)}$, so there exists $y(x, n, k) \in \mathrm{supp}(\nu)$ such that $T_{e_k(n)}(\langle x, e_k(n)\rangle) = \langle y(x, n, k), e_k(n)\rangle$. In this way, $|T_{e_k(n)}(\langle x, e_k(n)\rangle)| = |\langle y(x, n, k), e_k(n)\rangle| \le \|y(x, n, k)\| \|e_k(n)\|$ by Cauchy-Schwarz. Moreover, $\|e_k(n)\| < \sqrt{1+\tfrac{1}{n}} \le \sqrt{1+\tfrac{1}{p-1}} \le \sqrt{2}$, and since $\nu$ has compact support there is a constant $M_\nu$ such that $\|y(x, n, k)\| \le M_\nu$. So we have, for all $n$ and all $x \in \mathrm{supp}(\mu)$:
$$\|f_n(x)\|_1 = \sum_{k=1}^{p} |T_{e_k(n)}(\langle x, e_k(n)\rangle)| \le \sqrt{2}\, p M_\nu$$
Since on $\mathbb{R}^p$ all norms are equivalent, this suffices to state the existence of a constant $C$ such that $\forall x \in \mathrm{supp}(\mu)$, $\forall n$, $\|f_n(x)\| \le C$, so that $(f_n)$ is uniformly bounded. Reindexing $(f_n)_{n > p-1}$ gives the desired result. $\square$

We can now prove Theorem 6.2.1.

Proof of Theorem 6.2.1.
We consider the sequence $(f_n)_{n\in\mathbb{N}}$ defined in Proposition 6.2.2 and first show that it is equicontinuous. Let $\varepsilon > 0$. Using (6.52), there exists $N \in \mathbb{N}$ such that for all $x, x' \in \mathrm{supp}(\mu)$:
$$\|f_n(x) - f_n(x')\| \le \frac{\varepsilon}{2} + \|x - x'\| \quad \text{for all } n \ge N \quad (6.60)$$
Now let $\delta < \frac{\varepsilon}{2}$ and suppose that $\|x - x'\| < \delta$. Then:
$$\|f_n(x) - f_n(x')\| < \frac{\varepsilon}{2} + \delta < \varepsilon \quad \text{for all } n \ge N \quad (6.61)$$
Without loss of generality we can reindex $(f_n)_{n\in\mathbb{N}}$ (taking $n \ge N$), so that $(f_n)_{n\in\mathbb{N}}$ is equicontinuous by the previous argument. Since $(f_n)_{n\in\mathbb{N}}$ is a uniformly bounded and equicontinuous sequence from the support of $\mu$, which is compact, to $\mathbb{R}^p$, we can apply the Arzelà-Ascoli theorem (see Memo 6.2.2), which states that $(f_n)_{n\in\mathbb{N}}$ has a uniformly convergent subsequence. We denote this subsequence by $(f_{\phi(n)})_n$, with $f_{\phi(n)} \xrightarrow[n\to\infty]{u} f$.

Moreover, equation (6.53) states that for all $s \in \mathbb{R}^p$, $\mathcal{F}_{f_{n\#}\mu}(s) \to_{n\to\infty} \mathcal{F}_\nu(s)$. In this way $(\mathcal{F}_{f_{n\#}\mu}(s))_{n\in\mathbb{N}}$ is a convergent sequence, so every subsequence converges to the same limit; hence $\mathcal{F}_{f_{\phi(n)\#}\mu}(s) \to_{n\to\infty} \mathcal{F}_\nu(s)$.

We now show that the function $f$ is a measure-preserving isometry from $\mathrm{supp}(\mu)$ to $\mathrm{supp}(\nu)$. Indeed, let $\varepsilon > 0$ and $s \in \mathbb{R}^p$. From the previous statements there exist $N_1, N_2 \in \mathbb{N}$ such that for $n \ge N_1$, $|\mathcal{F}_{f_{\phi(n)\#}\mu}(s) - \mathcal{F}_\nu(s)| < \frac{\varepsilon}{2}$, and for $n \ge N_2$, $|\mathcal{F}_{f_{\phi(n)\#}\mu}(s) - \mathcal{F}_{f\#\mu}(s)| < \frac{\varepsilon}{2}$. Let $n \ge \max(N_1, N_2)$:
$$|\mathcal{F}_{f\#\mu}(s) - \mathcal{F}_\nu(s)| \le |\mathcal{F}_{f_{\phi(n)\#}\mu}(s) - \mathcal{F}_\nu(s)| + |\mathcal{F}_{f_{\phi(n)\#}\mu}(s) - \mathcal{F}_{f\#\mu}(s)| < \varepsilon$$
As this result holds for any $\varepsilon > 0$, we have $\mathcal{F}_{f\#\mu}(s) = \mathcal{F}_\nu(s)$, and by injectivity of the Fourier transform $f_\#\mu = \nu$, so that $f$ is measure preserving.

In the same way, for any $x, x' \in \mathrm{supp}(\mu)$, any $\varepsilon > 0$ and $n$ large enough:
$$\big|\, \|f(x) - f(x')\| - \|x - x'\|\, \big| \le \big|\, \|f_{\phi(n)}(x) - f_{\phi(n)}(x')\| - \|f(x) - f(x')\|\, \big| + \big|\, \|f_{\phi(n)}(x) - f_{\phi(n)}(x')\| - \|x - x'\|\, \big| < \varepsilon$$
using $f_{\phi(n)} \xrightarrow[n\to\infty]{u} f$ and (6.52). As this result holds true for any $\varepsilon > 0$, $\|f(x) - f(x')\| = \|x - x'\|$ for any $x, x' \in \mathrm{supp}(\mu)$, which concludes. $\square$

Corollary 6.2.2.
Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^p) \times \mathcal{P}(\mathbb{R}^p)$ with compact support. If $SGW(\mu, \nu) = 0$ then $\mu$ and $\nu$ are isomorphic for the distance induced by the $\ell_1$ norm on $\mathbb{R}^p$, i.e. $d(x, x') = \sum_{i=1}^{p} |x^i - x'^i|$ for

Memo 6.2.2.
Let $(\mathcal{X}, d)$ be a compact metric space and $\|\cdot\|$ a norm on $\mathbb{R}^p$. We say that:

• A family $F \subset \mathcal{C}(\mathcal{X}, \mathbb{R}^p)$ is bounded if there exists a positive constant $M < \infty$ such that $\|f(x)\| \le M$ for all $x \in \mathcal{X}$ and $f \in F$.

• A family $F \subset \mathcal{C}(\mathcal{X}, \mathbb{R}^p)$ is equicontinuous if for every $\varepsilon > 0$ there exists $\delta > 0$ (which depends only on $\varepsilon$) such that for $x, y \in \mathcal{X}$:
$$d(x, y) < \delta \implies \|f(x) - f(y)\| < \varepsilon \quad \forall f \in F \quad (6.62)$$
If $(f_n)_{n\in\mathbb{N}}$ is a sequence in $\mathcal{C}(\mathcal{X}, \mathbb{R}^p)$ that is bounded and equicontinuous, then the Arzelà-Ascoli theorem states that it has a uniformly convergent subsequence (see Theorem 7.25 in [Rudin 1976]).
Figure 6.1: Illustration of $SW$ and $RISW$ on discrete 2D spiral datasets with varying rotations. (left) Examples of spiral distributions for source and target with different rotations. (right) Average value of $SW$ and $RISW$ with $L = 20$ as a function of the rotation angle of the target. Colored areas correspond to the 20% and 80% percentiles.

$(x, x') \in \mathbb{R}^p \times \mathbb{R}^p$. In particular this implies:
$$SGW(\mu, \nu) = 0 \implies GW(d, d, \mu, \nu) = 0 \quad (6.63)$$
Proof. If $SGW(\mu, \nu) = 0$ then, using the Gromov-Wasserstein properties, for almost all $\theta \in \mathbb{S}^{p-1}$ the projected measures are isomorphic. Moreover, since $\mu, \nu$ have compact support, they are bounded and we can directly apply Theorem 6.2.1 to state the existence of a measure-preserving application $f$ as defined in Theorem 6.2.1. We consider the coupling $\pi = (id \times f)_\#\mu \in \Pi(\mu, \nu)$, which is valid since $f_\#\mu = \nu$. Then we have:
$$\int\int |d(x, x') - d(y, y')|\, d\pi(x, y)\, d\pi(x', y') = \int\int |d(x, x') - d(f(x), f(x'))|\, d\mu(x)\, d\mu(x') = \int\int \big|\, \|x - x'\|_1 - \|f(x) - f(x')\|_1\, \big|\, d\mu(x)\, d\mu(x') = 0$$
since $f$ is an isometry. This directly implies that $GW(d, d, \mu, \nu) = 0$. $\square$

SW$_\Delta$ and RISW
Analogously to
$SGW$, we can define, for $\mu \in \mathcal{P}(\mathbb{R}^p)$ and $\nu \in \mathcal{P}(\mathbb{R}^q)$ with $p \neq q$, the quantity $SW_\Delta$ and its rotation-invariant counterpart as:
$$SW_\Delta(\mu, \nu) = \int_{\mathbb{S}^{q-1}} SW\big(P_{\theta\#}\mu_\Delta, P_{\theta\#}\nu\big)\, d\lambda_{q-1}(\theta), \qquad RISW(\mu, \nu) = \min_{\Delta \in \mathbb{V}_q(\mathbb{R}^p)} SW_\Delta(\mu, \nu) \quad (6.64)$$
where $SW$ is the Sliced-Wasserstein distance. The complexity of computing $SW_\Delta$ between two discrete probability measures with $n$ atoms and uniform weights is $O(Ln(p + q + \log(n)))$, which is exactly the same complexity as $SGW_\Delta$. With these formulations, we can perform the same experiment as for
RISGW on the spiral dataset. The optimisation on the Stiefel manifold is performed using Pymanopt asfor
SGW . Results are reported in Figure 6.1. As one can see,
RISW is rotation invariant on average whereas SW is not. One can also note that, due to the sampling process of the spiral dataset, the variance is quite large. This can be explained by the fact that, unlike SGW, the Sliced-Wasserstein distance may realign the distributions without taking the rotation into account.
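For reference, the quantities compared in Figure 6.1 can be approximated with a few lines of code. The sketch below is a generic Monte-Carlo estimator of the squared Sliced-Wasserstein distance for uniformly weighted point clouds (not the exact experimental code of the thesis): each one-dimensional Wasserstein distance is obtained in closed form by sorting the projections.

```python
import numpy as np

def sliced_wasserstein_sq(X, Y, L=20, seed=0):
    # Monte-Carlo estimate of SW_2^2 between two n-point clouds in R^p
    # with uniform weights, using L random directions on the sphere.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    thetas = rng.standard_normal((L, p))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # theta in S^{p-1}
    total = 0.0
    for theta in thetas:
        xs, ys = np.sort(X @ theta), np.sort(Y @ theta)  # 1D optimal matching
        total += np.mean((xs - ys) ** 2)
    return total / L

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle), np.cos(angle)]])  # rotation by pi/4

assert sliced_wasserstein_sq(X, X) == 0.0       # identical clouds
assert sliced_wasserstein_sq(X, X @ R.T) > 0.0  # SW is not rotation invariant
```

RISW removes this sensitivity by additionally minimizing over an orthogonal alignment $\Delta$, which is what the Pymanopt optimization over the Stiefel manifold implements in the experiment.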
We recall the lemma:
Lemma.
Suppose that there exist scalars $a, b, c$ such that $c_X(x, x') = a\|x\|^2 + b\|x'\|^2 + c\langle x, x'\rangle_p$ and $c_Y(y, y') = a\|y\|^2 + b\|y'\|^2 + c\langle y, y'\rangle_q$. Then:
$$J(c_X, c_Y, \pi) = C_{\mu,\nu} - Z(\pi) \quad (6.65)$$
where $C_{\mu,\nu} = \int c_X^2\, d\mu\, d\mu + \int c_Y^2\, d\nu\, d\nu - 4ab \int \|x\|^2 \|y\|^2\, d\mu(x)\, d\nu(y)$ and:
$$Z(\pi) = 2(a^2 + b^2) \int \|x\|^2 \|y\|^2\, d\pi(x, y) + 2c^2 \Big\| \int y x^T\, d\pi(x, y) \Big\|_F^2 + 2(a + b)c \int \big[ \|x\|^2 \langle \mathbb{E}_{Y\sim\nu}[Y], y\rangle_q + \|y\|^2 \langle \mathbb{E}_{X\sim\mu}[X], x\rangle_p \big]\, d\pi(x, y) \quad (6.66)$$
Let $\pi \in \Pi(\mu, \nu)$. We have $J(c_X, c_Y, \pi) = \int c_X^2\, d\mu\, d\mu + \int c_Y^2\, d\nu\, d\nu - 2\int\int c_X c_Y\, d\pi\, d\pi$. In this way:
$$\int\int c_X c_Y\, d\pi\, d\pi = \int\int (a\|x\|^2 + b\|x'\|^2 + c\langle x, x'\rangle_p)(a\|y\|^2 + b\|y'\|^2 + c\langle y, y'\rangle_q)\, d\pi(x, y)\, d\pi(x', y')$$
$$= \int\int \big[ (a^2\|x\|^2\|y\|^2 + ab\|x\|^2\|y'\|^2 + ac\|x\|^2\langle y, y'\rangle_q) + (ab\|x'\|^2\|y\|^2 + b^2\|x'\|^2\|y'\|^2 + bc\|x'\|^2\langle y, y'\rangle_q) + (ca\langle x, x'\rangle_p\|y\|^2 + cb\langle x, x'\rangle_p\|y'\|^2 + c^2\langle x, x'\rangle_p\langle y, y'\rangle_q) \big]\, d\pi(x, y)\, d\pi(x', y')$$
$$= (a^2 + b^2)\int \|x\|^2\|y\|^2\, d\pi(x, y) + 2ab\int \|x\|^2\|y\|^2\, d\mu(x)\, d\nu(y) + c^2 \int\int \langle x, x'\rangle_p \langle y, y'\rangle_q\, d\pi(x, y)\, d\pi(x', y') + (a + b)c \int \|x\|^2 \Big\langle \int y'\, d\pi(x', y'), y \Big\rangle_q d\pi(x, y) + (a + b)c \int \|y\|^2 \Big\langle \int x'\, d\pi(x', y'), x \Big\rangle_p d\pi(x, y)$$
Moreover:
$$\int\int \langle x, x'\rangle_p \langle y, y'\rangle_q\, d\pi\, d\pi = \int\int x^T x'\, y^T y'\, d\pi\, d\pi = \int\int \mathrm{tr}(y x^T x' y'^T)\, d\pi\, d\pi \stackrel{(*)}{=} \mathrm{tr}\Big( \Big(\int y x^T d\pi(x, y)\Big) \Big(\int x' y'^T d\pi(x', y')\Big) \Big) = \Big\| \int x y^T\, d\pi(x, y) \Big\|_F^2 = \Big\| \int y x^T\, d\pi(x, y) \Big\|_F^2$$
where in (*) we used that $x^T x' \in \mathbb{R}$, so it is equal to its transpose. $\square$

(innerGW) and (MaxOT)

In this section we prove that (innerGW) and (MaxOT) are finite and that (MaxOT) always admits a maximizer. More precisely:
Lemma 6.2.7.
Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$ with $\int \|x\|^4\, d\mu(x) < +\infty$ and $\int \|y\|^4\, d\nu(y) < +\infty$. Both (innerGW) and (MaxOT) are finite. Moreover, the set $\mathcal{F}_{p,q}$ is a compact subset of $\mathbb{R}^{q\times p}$, and the functional $\pi \mapsto \sup_{P \in \mathcal{F}_{p,q}} \int \langle Px, y\rangle_q\, d\pi(x, y)$ is continuous for the weak convergence of measures. In particular, problem (MaxOT) admits an optimal solution $\pi^* \in \Pi(\mu, \nu)$.

Proof.
In this case $\int \|x\|^2\, d\mu(x) < +\infty$ and $\int \|y\|^2\, d\nu(y) < +\infty$ by Hölder's inequality, and for all $(x, x', y, y') \in \mathcal{X}^2 \times \mathcal{Y}^2$:
$$\big( \langle x, x'\rangle_p - \langle y, y'\rangle_q \big)^2 \le 2\big( \langle x, x'\rangle_p^2 + \langle y, y'\rangle_q^2 \big) \le 2\big( \|x\|^2\|x'\|^2 + \|y\|^2\|y'\|^2 \big)$$
by Cauchy-Schwarz. In particular this implies:
$$\inf_{\pi \in \Pi(\mu,\nu)} \int\int \big( \langle x, x'\rangle_p - \langle y, y'\rangle_q \big)^2\, d\pi(x, y)\, d\pi(x', y') \le \int\int \big( \langle x, x'\rangle_p - \langle y, y'\rangle_q \big)^2\, d\mu(x)\, d\mu(x')\, d\nu(y)\, d\nu(y') \le 2\Big( \int \|x\|^2 d\mu(x) \Big)^2 + 2\Big( \int \|y\|^2 d\nu(y) \Big)^2 < +\infty$$
Moreover, for any $\pi \in \Pi(\mu, \nu)$ and $P \in \mathcal{F}_{p,q}$:
$$\int \langle Px, y\rangle_q\, d\pi(x, y) = \int \langle P, y x^T\rangle_F\, d\pi(x, y) \le \int \|P\|_F \|x y^T\|_F\, d\pi(x, y) \le \sqrt{p} \int \|x y^T\|_F\, d\pi(x, y) = \sqrt{p} \int \sqrt{\mathrm{tr}(y x^T x y^T)}\, d\pi(x, y) = \sqrt{p} \int \|x\|\|y\|\, d\pi(x, y) \stackrel{(*)}{\le} \frac{\sqrt{p}}{2} \Big( \int \|x\|^2 d\mu(x) + \int \|y\|^2 d\nu(y) \Big) < +\infty$$
where in (*) we used Young's inequality (recalled in Memo 4.2.2).

For the compactness, using the Borel-Lebesgue theorem it suffices to show that $\mathcal{F}_{p,q}$ is closed and bounded. It is clearly bounded by $\sqrt{p}$, and closed as the pre-image of the closed set $\{0\}$ by the continuous application $P \mapsto \sqrt{p} - \|P\|_F$.

We denote by $f : \Pi(\mu, \nu) \times \mathcal{F}_{p,q} \to \mathbb{R}$ the function $f(\pi, P) = -\int \langle Px, y\rangle_q\, d\pi(x, y)$. For any $\pi \in \Pi(\mu, \nu)$, $f(\pi, \cdot)$ is continuous. Indeed, suppose that $P_m \to_F P$, where $\to_F$ denotes the convergence in Frobenius norm,

Memo 6.2.3.
Let $\mathcal{X}, \mathcal{Y}$ be topological spaces with $\mathcal{Y}$ compact. If $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is continuous, then $g : x \mapsto \inf_y f(x, y)$ is well defined and continuous.

Proof. Note first that $g(x) > -\infty$, since for every $x \in \mathcal{X}$, $f(x, \cdot) : \mathcal{Y} \to \mathbb{R}$ is continuous on a compact space, so it is bounded. To prove the continuity it suffices to show that $g^{-1}(]-\infty, a[)$ and $g^{-1}(]b, +\infty[)$ are open. For the former, if we denote by $\pi_{\mathcal{X}} : \mathcal{X} \times \mathcal{Y} \to \mathcal{X}$ the canonical projection, then $g^{-1}(]-\infty, a[) = \pi_{\mathcal{X}}(f^{-1}(]-\infty, a[))$. By continuity of $f$, and since the canonical projection is an open map, we can conclude that $g^{-1}(]-\infty, a[)$ is open. For the latter, first observe that $g(x) > b \implies \forall y,\ f(x, y) > b$, which means that $g(x) > b \implies \forall y,\ (x, y) \in f^{-1}(]b, +\infty[)$. In particular, for $x \in g^{-1}(]b, +\infty[)$ and $y \in \mathcal{Y}$ there exists a neighborhood $U_{(x,y)} \times V_{(x,y)}$ contained in $f^{-1}(]b, +\infty[)$. Since $\mathcal{Y}$ is compact, there exists a finite subset $\{(x, y_i)\}_{i=1}^{k}$ such that the neighborhoods $U_{(x,y_i)} \times V_{(x,y_i)}$ cover all of $\{x\} \times \mathcal{Y}$. Overall:
$$\{x\} \times \mathcal{Y} \subset \Big( \bigcap_{i=1}^{k} U_{(x,y_i)} \Big) \times \mathcal{Y} \subset f^{-1}(]b, +\infty[) \quad (6.67)$$
Hence $g^{-1}(]b, +\infty[) = \bigcup_{x \in g^{-1}(]b, +\infty[)} \big( \bigcap_{i=1}^{k} U_{(x,y_i)} \big)$, which is open, so $g$ is continuous.
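The memo can be sanity-checked on a toy example (a sketch, assuming $f(x, y) = (x - y)^2$ and the compact set $\mathcal{Y} = [0, 1]$, discretized on a grid): the pointwise infimum over the compact set is again a continuous function, here the squared distance from $x$ to the interval.

```python
import numpy as np

ys = np.linspace(0.0, 1.0, 1001)  # grid on the compact set Y = [0, 1]

def g(x):
    # g(x) = inf_{y in Y} f(x, y) with f(x, y) = (x - y)^2, grid approximation
    return float(np.min((x - ys) ** 2))

def g_exact(x):
    # closed form: squared distance from x to the interval [0, 1]
    return max(0.0, x - 1.0) ** 2 + min(0.0, x) ** 2

for x in (-0.7, 0.0, 0.3, 1.0, 1.9):
    assert abs(g(x) - g_exact(x)) < 1e-5  # g is well defined and continuous
```

In the proof above, the compact set $\mathcal{F}_{p,q}$ plays the role of $\mathcal{Y}$ and $\pi \mapsto \inf_{P} f(\pi, P)$ the role of $g$.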
then:
$$\Big| \int \langle P_m x, y\rangle_q\, d\pi(x, y) - \int \langle Px, y\rangle_q\, d\pi(x, y) \Big| \le \int |\langle (P_m - P)x, y\rangle_q|\, d\pi(x, y) \le \|P_m - P\|_F \int \|x\|\|y\|\, d\pi(x, y) \le \frac{1}{2}\|P_m - P\|_F \Big( \int \|x\|^2 d\mu(x) + \int \|y\|^2 d\nu(y) \Big) \xrightarrow[m\to+\infty]{} 0$$
Since $\mathcal{F}_{p,q}$ is compact, $g : \pi \mapsto \inf_{P \in \mathcal{F}_{p,q}} f(\pi, P)$ is well defined and continuous for the weak convergence of measures (see Memo 6.2.3). Since it is continuous and $\Pi(\mu, \nu)$ is compact, we can apply Weierstrass' theorem to state that $\inf_{\pi \in \Pi(\mu,\nu)} g(\pi)$ is attained, which concludes the proof. $\square$

We recall the lemma:
Lemma.
Let $\mathcal{X}$ and $\mathcal{Y}$ be compact subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively, and let $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$. We can assume without loss of generality that $\mathbb{E}_{X\sim\mu}[X] = 0$ and $\mathbb{E}_{Y\sim\nu}[Y] = 0$. In this case (sqGW) is equivalent to:
$$\sup_{\pi \in \Pi(\mu,\nu)} \int \|x\|^2\|y\|^2\, d\pi(x, y) + 2\Big\| \int y x^T\, d\pi(x, y) \Big\|_F^2$$
Indeed, equation (sqGW) corresponds to the case $a = 1, b = 1, c = -2$, i.e. to:
$$\sup_{\pi \in \Pi(\mu,\nu)} 4\int \|x\|^2\|y\|^2\, d\pi(x, y) + 8\Big\| \int x y^T\, d\pi(x, y) \Big\|_F^2 - 8\int \big[ \|x\|^2 \langle \mathbb{E}_{Y\sim\nu}[Y], y\rangle_q + \|y\|^2 \langle \mathbb{E}_{X\sim\mu}[X], x\rangle_p \big]\, d\pi(x, y) \quad (6.68)$$
We can reduce this problem using the translation invariance of the GW cost, since $\|x + t - (x' + t)\| = \|x - x'\|$ for all $x, x'$ (and similarly for $y$). More precisely, we can rely on the following lemmas:

Lemma 6.2.8.
Let $f_1 : \mathbb{R}^p \to \mathbb{R}^p$ and $f_2 : \mathbb{R}^q \to \mathbb{R}^q$ be Borel maps and let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$. Then:
$$\Pi(f_{1\#}\mu, f_{2\#}\nu) = \{ (f_1 \times f_2)_\#\pi \mid \pi \in \Pi(\mu, \nu) \}$$
Proof.
This is a straightforward extension of Lemma 6 in [Paty 2019]. $\square$

Based on the previous Lemma 6.2.8 we have:
Lemma 6.2.9 (Translation invariance). Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$, and let $c_X(x, x') = \|x - x'\|^2$, $c_Y(y, y') = \|y - y'\|^2$. Let $(t_1, t_2) \in \mathbb{R}^p \times \mathbb{R}^q$ and let $f_{t_1}(x) = x + t_1$, $f_{t_2}(y) = y + t_2$ be the associated translations. Then:
$$\sup_{\pi \in \Pi(\mu,\nu)} J(c_X, c_Y, \pi) = \sup_{\pi \in \Pi(f_{t_1\#}\mu,\, f_{t_2\#}\nu)} J(c_X, c_Y, \pi) \quad (6.69)$$
Proof.
Using Lemma 6.2.8 we have:
$$\sup_{\pi \in \Pi(f_{t_1\#}\mu,\, f_{t_2\#}\nu)} J(c_X, c_Y, \pi) = \sup_{\pi \in \Pi(\mu,\nu)} J(c_X, c_Y, (f_{t_1} \times f_{t_2})_\#\pi) \stackrel{(*)}{=} \sup_{\pi \in \Pi(\mu,\nu)} J(c_X, c_Y, \pi)$$
where in (*) we used the invariance of the cost $J(c_X, c_Y, \pi)$ with respect to translations. $\square$

Using this property we can assume without loss of generality that $\mu$ and $\nu$ are centered, i.e. $\mathbb{E}_{X\sim\mu}[X] = 0$ and $\mathbb{E}_{Y\sim\nu}[Y] = 0$. Plugging this condition into $Z(\pi)$ of Lemma 4.2.1 gives the desired result.

We recall the lemma:
Lemma. If $\pi^*$ is a solution of the primal problem $\sup_{\pi \in \Pi(\mu,\nu)} F(\pi)$, then there exist $P \in \mathbb{R}^{q\times p}$ and $h^* \in \mathcal{C}(\mathcal{X} \times \mathcal{Y})$ of the form $h(x, y) = \langle Px, y\rangle_q + \|x\|^2\|y\|^2$ such that $(\pi^*, h^*)$ is a solution of the dual problem (4.26). Moreover, when $h$ is of this form we have $F^*(h) = \frac{1}{8}\|P\|_F^2$.

To prove this result we will need the following calculus:

Lemma 6.2.10.
Lemma 6.2.10.
With the previous notations, the Fréchet derivative of $F$ reads:
$$\nabla F(\pi) = (x, y) \mapsto 4\langle V_\pi x, y\rangle_q + \|x\|^2\|y\|^2 \quad (6.70)$$
Proof. For $\varepsilon \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$:
$$F(\pi + t\varepsilon) = 2\|V_\pi + tV_\varepsilon\|_F^2 + \int \|x\|^2\|y\|^2\, d(\pi + t\varepsilon)(x, y) = 2\|V_\pi\|_F^2 + \int \|x\|^2\|y\|^2\, d\pi(x, y) + 2t^2\|V_\varepsilon\|_F^2 + t\int \big( 4\langle V_\pi, y x^T\rangle_F + \|x\|^2\|y\|^2 \big)\, d\varepsilon(x, y)$$
$$= F(\pi) + t\int \big( 4\langle V_\pi, y x^T\rangle_F + \|x\|^2\|y\|^2 \big)\, d\varepsilon(x, y) + o_{t\to 0}(t) \quad (6.71)$$
Hence $\nabla F(\pi) : (x, y) \mapsto 4\langle V_\pi, y x^T\rangle_F + \|x\|^2\|y\|^2 = 4\langle V_\pi x, y\rangle_q + \|x\|^2\|y\|^2$. $\square$

Proof of Lemma 4.2.4 – Parametrization of the dual problem. If $\pi^*$ is a maximizer of the primal problem, then we know that $h^* = \nabla F(\pi^*)$ is a maximizer of $\sup_{h \in \mathcal{C}(\mathcal{X}\times\mathcal{Y})} \sup_{\pi \in \Pi(\mu,\nu)} \int h(x, y)\, d\pi(x, y) - F^*(h)$. The calculus in Lemma 6.2.10 implies that $\nabla F(\pi^*) = (x, y) \mapsto 4\langle V_{\pi^*} x, y\rangle_q + \|x\|^2\|y\|^2$. Setting $P = 4V_{\pi^*} \in \mathbb{R}^{q\times p}$ concludes for the first point.

For the second point, we have by definition $F^*(h) = \sup_{\pi \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})} \int h(x, y)\, d\pi(x, y) - F(\pi)$. We write $G(\pi) = \int h(x, y)\, d\pi(x, y) - F(\pi)$. Then $\nabla G(\pi) = h - \nabla F(\pi) = (x, y) \mapsto h(x, y) - 4\langle V_\pi x, y\rangle_q - \|x\|^2\|y\|^2$ by Lemma 6.2.10. Using the fact that $h$ is parametrized by a linear application, we have $\nabla G(\pi) = (x, y) \mapsto \langle (P - 4V_\pi)x, y\rangle_q$. Then, for $(x, y) \in \mathcal{X} \times \mathcal{Y}$:
$$\nabla G(\pi)(x, y) = 0 \iff \langle (P - 4V_\pi)x, y\rangle_q = 0 \quad (6.72)$$
We write $P = \sum_i \lambda_i v_i u_i^T$ for the SVD of $P$, and we denote by $\gamma_P \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})$ the measure $\gamma_P = \frac{1}{4}\sum_i \lambda_i \delta_{(u_i, v_i)}$ (note that we do not need $\gamma_P$ to be a probability measure). Then $\int y x^T\, d\gamma_P(x, y) = \frac{1}{4}\sum_i \lambda_i v_i u_i^T = \frac{P}{4}$, i.e. $V_{\gamma_P} = \frac{P}{4}$. By the previous calculus $\nabla G(\gamma_P) = 0$, so $\gamma_P$ satisfies the first-order condition and is a solution to $\sup_{\pi \in \mathcal{M}(\mathcal{X}\times\mathcal{Y})} \int h(x, y)\, d\pi(x, y) - F(\pi)$.
Overall:
$$F^*(h) = \int h(x, y)\, d\gamma_P(x, y) - F(\gamma_P) = \int \big( \langle Px, y\rangle_q + \|x\|^2\|y\|^2 \big)\, d\gamma_P(x, y) - \int \|x\|^2\|y\|^2\, d\gamma_P(x, y) - 2\Big\| \int y x^T\, d\gamma_P(x, y) \Big\|_F^2$$
$$= \Big\langle P, \int y x^T\, d\gamma_P(x, y) \Big\rangle_F - 2\Big\| \int y x^T\, d\gamma_P(x, y) \Big\|_F^2 = \frac{1}{4}\|P\|_F^2 - \frac{1}{8}\|P\|_F^2 = \frac{1}{8}\|P\|_F^2 \quad (6.73)$$
$\square$

$\mathcal{H}(\mu, \nu)$ for linear push-forward

We have the following result:
Lemma 6.2.11.
Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^p) \times \mathcal{P}(\mathbb{R}^p)$. If $T$ is a linear symmetric push-forward of $\mu$ to $\nu$, then:
$$T \in \mathcal{H}(\mu, \nu) \iff T = \lambda O \quad (6.74)$$
where $O \in \mathbb{O}(p)$ is an orthogonal matrix and $\lambda > 0$.

Proof. We write $T(x) = Ax$. Then $\|Ax\|^2 = x^T A^T A x$, and we can write $A^T A = P D P^T$ with $D$ diagonal and $P$ orthogonal; it is then equivalent to find a $g = f^2$ such that:
$$\|D^{1/2}v\|^2 = \sum_i \beta_i v_i^2 = g(\|v\|^2) \quad (6.75)$$
(with $\beta_i \ge 0$ and $v = P^T x$). If there is an $i$ such that $\beta_i = 0$ (the first one, without loss of generality), then for all $t \ge 0$, taking $v = (\sqrt{t}, 0, \dots, 0)$ gives $g(t) = \beta_1 t = 0$, which is not possible unless $A = 0$. In this way $\beta_i > 0$ for all $i$. Let $e_i$ be a unit eigenvector of $A^T A$; then $\|Ae_i\|^2 = e_i^T A^T A e_i = \beta_i = g(\|e_i\|^2) = g(1)$. Overall $\beta_i = g(1) =: \lambda^2$ for all $i$, so $A^T A = \lambda^2 I_p$. Overall, this implies that $A$ can be written as $A = \lambda O$ with $O$ an orthogonal matrix. Conversely, if $T = \lambda O$ then $\|T(x)\| = \lambda\|x\|$, and we can take the function $f(t) = \lambda t$, which is convex and continuous from $\mathbb{R}_+$ to $\mathbb{R}_+$. $\square$

We recall the theorem:
Theorem.
Let $\mu = \mathcal{N}(0, \Sigma_\mu) \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \mathcal{N}(0, \Sigma_\nu) \in \mathcal{P}(\mathbb{R}^q)$, centered without loss of generality. Let $\Sigma_\mu = V_\mu D_\mu V_\mu^\top$ and $\Sigma_\nu = V_\nu D_\nu V_\nu^\top$ be the diagonalizations of the covariance matrices, such that the eigenvalues of $D_\mu$ and $D_\nu$ are ordered nondecreasing. We have:
$$LGM(\mu, \nu) = 4\big(\mathrm{tr}(\Sigma_\mu) - \mathrm{tr}(\Sigma_\nu)\big)^2 + 8\big(\mathrm{tr}(\Sigma_\mu\Sigma_\mu) + \mathrm{tr}(\Sigma_\nu\Sigma_\nu)\big) + 16\min_{B \in \mathbb{V}_p(\mathbb{R}^q)} -\mathrm{tr}(D_\mu B^\top D_\nu B)$$
When $p = q$, an optimal linear Monge map is given by $T(x) = Ax$ where:
$$A = V_\nu D_\nu^{1/2} D_\mu^{-1/2} V_\mu^\top = \Sigma_\nu^{1/2} V_\nu V_\mu^\top \Sigma_\mu^{-1/2}$$
so that:
$$LGM(\mu, \nu) = 4\big(\mathrm{tr}(\Sigma_\mu) - \mathrm{tr}(\Sigma_\nu)\big)^2 + 8\big(\mathrm{tr}(\Sigma_\mu\Sigma_\mu) + \mathrm{tr}(\Sigma_\nu\Sigma_\nu)\big) - 16\,\mathrm{tr}(D_\mu D_\nu)$$
In order to prove Theorem 4.2.6 we will rely on the following result:

Proposition 6.2.3 (Proposition 3.1 in [Anstreicher 1998]). Consider two symmetric matrices $(\Sigma_1, \Sigma_2) \in \mathbb{R}^{p\times p} \times \mathbb{R}^{p\times p}$ and the following (QQP):
$$\min_{B B^T = I} \mathrm{tr}(\Sigma_1 B \Sigma_2 B^T) \quad (6.76)$$
Let $\Sigma_1 = V_1 D_1 V_1^T$ and $\Sigma_2 = V_2 D_2 V_2^T$ be the orthogonal diagonalizations of $\Sigma_1, \Sigma_2$, where the eigenvalues of $D_1$ are ordered nonincreasing and the eigenvalues of $D_2$ are ordered nondecreasing. An optimal solution of (6.76) is found at $B^* = V_1 V_2^T$ and the optimal value of (6.76) is $\mathrm{tr}(D_1 D_2)$.

We will prove that, when considering only linear push-forwards, we can recast the Gromov-Monge problem in the form of Proposition 6.2.3. We have the following result:
Lemma 6.2.12.
Let $\mu = \mathcal{N}(0, \Sigma_\mu) \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \mathcal{N}(0, \Sigma_\nu) \in \mathcal{P}(\mathbb{R}^q)$, centered without loss of generality. For a linear push-forward $T_\#\mu = \nu$ we write $T(x) = Ax$. The Gromov-Monge problem is equivalent to:
$$\min_{\substack{A \in \mathbb{R}^{q\times p} \\ A\Sigma_\mu A^\top = \Sigma_\nu}} \langle M, A^\top A\rangle_F \quad (6.77)$$
where:
$$M = \mathbb{E}_{x, x' \sim \mu}\big[ -2(x^\top x)\, x x^\top - 2(x'^\top x')\, x x^\top - 4(x^\top x')\, x' x^\top \big] \quad (6.78)$$
Proof.
We have:
$$\min_{\substack{T_\#\mu=\nu \\ T\ \text{linear}}} J(T) = \min_{\substack{T_\#\mu=\nu \\ T\ \text{linear}}} \mathbb{E}_{x,x'\sim\mu}\big[ \big( \|x - x'\|^2 - \|T(x) - T(x')\|^2 \big)^2 \big] = \min_{\substack{T_\#\mu=\nu \\ T\ \text{linear}}} \mathbb{E}_{x,x'\sim\mu}[\|x - x'\|^4] + \mathbb{E}_{x,x'\sim\mu}[\|T(x) - T(x')\|^4] - 2\,\mathbb{E}_{x,x'\sim\mu}[\|x - x'\|^2 \|T(x) - T(x')\|^2]$$
$$= \mathbb{E}_{x,x'\sim\mu}[\|x - x'\|^4] + \mathbb{E}_{y,y'\sim\nu}[\|y - y'\|^4] + 2\min_{\substack{T_\#\mu=\nu \\ T\ \text{linear}}} \tilde{J}(T)$$
with $\tilde{J}(T) \stackrel{\text{def}}{=} -\mathbb{E}_{x,x'\sim\mu}[\|x - x'\|^2 \|T(x) - T(x')\|^2]$. In this way the problem is equivalent to minimizing $\tilde{J}(T)$. Since $T_\#\mu = \nu$:
$$\tilde{J}(T) = -\mathbb{E}_{x,x'\sim\mu}\big[ \big( \|x\|^2 + \|x'\|^2 - 2\langle x, x'\rangle_p \big)\big( \|T(x)\|^2 + \|T(x')\|^2 - 2\langle T(x), T(x')\rangle_q \big) \big]$$
$$= -2\,\mathbb{E}_{x\sim\mu}[\|x\|^2\|T(x)\|^2] - 2\,\mathbb{E}_{x,x'\sim\mu}[\|x'\|^2\|T(x)\|^2] + 4\,\mathbb{E}_{x,x'\sim\mu}[\langle x, x'\rangle_p \|T(x)\|^2] + 4\,\mathbb{E}_{x,x'\sim\mu}[\langle T(x), T(x')\rangle_q \|x\|^2] - 4\,\mathbb{E}_{x,x'\sim\mu}[\langle x, x'\rangle_p \langle T(x), T(x')\rangle_q]$$
Since $T$ is linear it can be written in the form $T(x) = Ax$. The push-forward constraint in this case reads [Flamary 2019]:
$$A\Sigma_\mu A^\top = \Sigma_\nu \quad (6.79)$$
When $A$ is symmetric positive definite, this equation admits a unique solution, which is the optimal linear map for the Wasserstein problem. However, here we have no result about the regularity of $A$. Plugging $T(x) = Ax$ into $\tilde{J}(T)$ and writing the push-forward condition gives the equivalent problem:
$$\min_{\substack{A \in \mathbb{R}^{q\times p} \\ A\Sigma_\mu A^\top = \Sigma_\nu}} \tilde{J}(A) \quad (6.80)$$
where
$$\tilde{J}(A) = -2\,\mathbb{E}_{x\sim\mu}[x^\top x\, x^\top A^\top A x] - 2\,\mathbb{E}_{x,x'\sim\mu}[x'^\top x'\, x^\top A^\top A x] + 4\,\mathbb{E}_{x,x'\sim\mu}[x^\top x'\, x^\top A^\top A x] \quad (6.81)$$
$$+ 4\,\mathbb{E}_{x,x'\sim\mu}[x^\top A^\top A x'\, x^\top x] - 4\,\mathbb{E}_{x,x'\sim\mu}[x^\top x'\, x'^\top A^\top A x] \quad (6.82)$$
Using the properties of the trace of a matrix and the linearity of the inner product, one can reformulate $\tilde{J}$ as:
$$\tilde{J}(A) = \langle M, A^\top A\rangle_F \quad (6.83)$$
where
$$M = \mathbb{E}_{x,x'\sim\mu}\big[ -2(x^\top x)\, x x^\top - 2(x'^\top x')\, x x^\top + 4(x^\top x')\, x x^\top - 4(x^\top x')\, x' x^\top + 4(x^\top x)\, x' x^\top \big] \quad (6.84)$$
Indeed, for all terms we used the following reasoning:
$$\mathbb{E}_{x\sim\mu}[x^\top x\, x^\top A^\top A x] \stackrel{(*)}{=} \mathbb{E}_{x\sim\mu}[\mathrm{tr}(x^\top x\, x^\top A^\top A x)] \stackrel{(**)}{=} \mathbb{E}_{x\sim\mu}[x^\top x\, \mathrm{tr}(x^\top A^\top A x)] \stackrel{(***)}{=} \mathbb{E}_{x\sim\mu}[x^\top x\, \mathrm{tr}(x x^\top A^\top A)] = \mathbb{E}_{x\sim\mu}[\mathrm{tr}(x^\top x\, x x^\top A^\top A)] = \mathrm{tr}\big( \mathbb{E}_{x\sim\mu}[x^\top x\, x x^\top]\, A^\top A \big) = \langle \mathbb{E}_{x\sim\mu}[(x^\top x)\, x x^\top], A^\top A\rangle_F \quad (6.85)$$
where in (*) we used that $x^\top x\, x^\top A^\top A x \in \mathbb{R}$, in (**) that $x^\top x \in \mathbb{R}$, and in (***) the cyclical permutation invariance of the trace. Moreover, $\mathbb{E}_{x,x'\sim\mu}[(x^\top x)\, x' x^\top] = \mathbb{E}_{x,x'\sim\mu}[(x^\top x')\, x x^\top] = 0$ since the Gaussians are centered. Hence:
$$M = \mathbb{E}_{x,x'\sim\mu}\big[ -2(x^\top x)\, x x^\top - 2(x'^\top x')\, x x^\top - 4(x^\top x')\, x' x^\top \big] \quad (6.86)$$
We can go further in this calculus by using Isserlis' theorem, recalled here:

Theorem (Isserlis [Isserlis 1918]). If $(X_1, X_2, X_3, X_4)$ is a zero-mean multivariate normal random vector, then:
$$\mathbb{E}[X_1 X_2 X_3 X_4] = \mathbb{E}[X_1 X_2]\mathbb{E}[X_3 X_4] + \mathbb{E}[X_1 X_3]\mathbb{E}[X_2 X_4] + \mathbb{E}[X_1 X_4]\mathbb{E}[X_2 X_3] \quad (6.87)$$
Applying this theorem to $M$ in (6.77) gives the following lemma:

Lemma 6.2.13.
Let $\mu = \mathcal{N}(0, \Sigma_\mu) \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \mathcal{N}(0, \Sigma_\nu) \in \mathcal{P}(\mathbb{R}^q)$, centered without loss of generality. Then:
$$\min_{\substack{A \in \mathbb{R}^{q\times p} \\ A\Sigma_\mu A^\top = \Sigma_\nu}} \langle M, A^\top A\rangle_F = \min_{\substack{A \in \mathbb{R}^{q\times p} \\ A\Sigma_\mu A^\top = \Sigma_\nu}} \langle -4\,\mathrm{tr}(\Sigma_\mu)\,\Sigma_\mu - 8\,\Sigma_\mu^\top\Sigma_\mu,\ A^\top A\rangle_F \quad (6.88)$$
Proof.
We will use Isserlis' theorem to compute $M$ in (6.77). We have, for all $i, j$:
$$\big( \mathbb{E}_{x\sim\mu}[(x^\top x)\, x x^\top] \big)_{i,j} = \mathbb{E}_x\Big[ \sum_k x_k x_k x_i x_j \Big] = \sum_k \mathbb{E}_x[x_k x_k x_i x_j] \stackrel{(*)}{=} \sum_k \mathbb{E}_x[x_k x_k]\mathbb{E}_x[x_i x_j] + \mathbb{E}_x[x_k x_i]\mathbb{E}_x[x_k x_j] + \mathbb{E}_x[x_k x_j]\mathbb{E}_x[x_k x_i]$$
$$= \sum_k \Sigma_\mu^{k,k}\Sigma_\mu^{i,j} + 2\Sigma_\mu^{k,i}\Sigma_\mu^{k,j} = \Sigma_\mu^{i,j}\,\mathrm{tr}(\Sigma_\mu) + 2(\Sigma_\mu^\top\Sigma_\mu)_{i,j} \quad (6.89)$$
In (*) we used Isserlis' theorem. Hence $\mathbb{E}_{x\sim\mu}[(x^\top x)\, x x^\top] = \mathrm{tr}(\Sigma_\mu)\,\Sigma_\mu + 2\Sigma_\mu^\top\Sigma_\mu$. In the same way:
$$\big( \mathbb{E}_{x,x'}[(x'^\top x')\, x x^\top] \big)_{i,j} = \sum_k \mathbb{E}_{x,x'}[x'_k x'_k x_i x_j] \stackrel{(*)}{=} \sum_k \Sigma_\mu^{k,k}\Sigma_\mu^{i,j} = \mathrm{tr}(\Sigma_\mu)\,\Sigma_\mu^{i,j} \quad (6.90)$$
In (*) we used the independence of $x$ and $x'$ and the fact that the Gaussians are centered. Finally:
$$\big( \mathbb{E}_{x,x'}[(x^\top x')\, x' x^\top] \big)_{i,j} = \sum_k \mathbb{E}_{x,x'}[x_k x'_k x'_i x_j] = \sum_k \mathbb{E}_{x,x'}[x_k x_j]\mathbb{E}_{x,x'}[x'_k x'_i] = \sum_k \Sigma_\mu^{k,j}\Sigma_\mu^{k,i} = (\Sigma_\mu^\top\Sigma_\mu)_{i,j} \quad (6.91)$$
Overall, we want to solve the following optimization problem:
$$\min_{A,\ A\Sigma_\mu A^\top = \Sigma_\nu} \langle M, A^\top A\rangle_F \quad (6.92)$$
where:
$$M = -4\,\mathrm{tr}(\Sigma_\mu)\,\Sigma_\mu - 8\,\Sigma_\mu^\top\Sigma_\mu \quad (6.93)$$
$\square$

We can now prove the main result:

Proof of Theorem 4.2.6.
With the previous notations and calculus, we consider the following change of variable:
$$A = V_\nu D_\nu^{1/2} B D_\mu^{-1/2} V_\mu^\top \quad (6.94)$$
Then the push-forward equality becomes:
$$B B^\top = I_q, \quad \text{i.e. } B \in \mathbb{V}_p(\mathbb{R}^q) \quad (6.95)$$
Indeed:
$$A\Sigma_\mu A^\top = (V_\nu D_\nu^{1/2} B D_\mu^{-1/2} V_\mu^\top)(V_\mu D_\mu V_\mu^\top) A^\top = V_\nu D_\nu^{1/2} B D_\mu^{1/2} V_\mu^\top A^\top = V_\nu D_\nu^{1/2} B D_\mu^{1/2} V_\mu^\top (V_\mu D_\mu^{-1/2} B^\top D_\nu^{1/2} V_\nu^\top) = V_\nu D_\nu^{1/2} B B^\top D_\nu^{1/2} V_\nu^\top$$
so that:
$$A\Sigma_\mu A^\top = \Sigma_\nu \iff V_\nu D_\nu^{1/2} B B^\top D_\nu^{1/2} V_\nu^\top = V_\nu D_\nu V_\nu^\top \iff B B^\top = I_q \iff B \in \mathbb{V}_p(\mathbb{R}^q) \quad (6.96)$$
With this change of variable the criterion becomes:
$$\tilde{J}(B) = \langle M, (V_\mu D_\mu^{-1/2} B^\top D_\nu^{1/2} V_\nu^\top)(V_\nu D_\nu^{1/2} B D_\mu^{-1/2} V_\mu^\top)\rangle_F = \langle M, V_\mu D_\mu^{-1/2} B^\top D_\nu B D_\mu^{-1/2} V_\mu^\top\rangle_F = \langle D_\mu^{-1/2} V_\mu^\top M V_\mu D_\mu^{-1/2},\ B^\top D_\nu B\rangle_F = \langle \tilde{M}, B^\top D_\nu B\rangle_F \quad (6.97)$$
with
$$\tilde{M} = D_\mu^{-1/2} V_\mu^\top M V_\mu D_\mu^{-1/2} = D_\mu^{-1/2} V_\mu^\top \big( -4\,\mathrm{tr}(\Sigma_\mu)\,\Sigma_\mu - 8\,\Sigma_\mu^\top\Sigma_\mu \big) V_\mu D_\mu^{-1/2} = D_\mu^{-1/2}(-4\,\mathrm{tr}(\Sigma_\mu) D_\mu) D_\mu^{-1/2} + D_\mu^{-1/2}(-8 D_\mu^2) D_\mu^{-1/2} = -4\,\mathrm{tr}(\Sigma_\mu)\, I_p - 8 D_\mu \quad (6.98)$$
Overall, we have the following optimization problem:
$$\min_{B,\ BB^\top = I_q} \langle \tilde{M}, B^\top D_\nu B\rangle_F = \min_{B,\ BB^\top = I_q} \mathrm{tr}\big( (-4\,\mathrm{tr}(\Sigma_\mu)\, I_p - 8 D_\mu)\, B^\top D_\nu B \big) = \min_{B,\ BB^\top = I_q} -4\,\mathrm{tr}(\Sigma_\mu)\,\mathrm{tr}(B^\top D_\nu B) - 8\,\mathrm{tr}(D_\mu B^\top D_\nu B) = -4\,\mathrm{tr}(\Sigma_\mu)\,\mathrm{tr}(\Sigma_\nu) + 8\min_{B,\ BB^\top = I_q} -\mathrm{tr}(D_\mu B^\top D_\nu B) \quad (6.99)$$
Note that for the final cost we also need to compute:
$$\mathbb{E}_{x,x'\sim\mu}[\|x - x'\|^4] = \mathbb{E}_{x,x'\sim\mu}\big[ \big( \|x\|^2 - 2\langle x, x'\rangle_p + \|x'\|^2 \big)^2 \big] = \mathbb{E}[\|x\|^4] + \mathbb{E}[\|x'\|^4] + 2\,\mathbb{E}[\|x\|^2\|x'\|^2] + 4\,\mathbb{E}[\langle x, x'\rangle_p^2] = 4\,\mathrm{tr}(\Sigma_\mu)^2 + 8\,\mathrm{tr}(\Sigma_\mu^\top\Sigma_\mu) \quad (6.100)$$
where the cross terms vanish since the Gaussians are centered. Overall we have:
$$\min_{\substack{T_\#\mu=\nu \\ T\ \text{linear}}} J(T) = 4\,\mathrm{tr}(\Sigma_\mu)^2 + 8\,\mathrm{tr}(\Sigma_\mu^\top\Sigma_\mu) + 4\,\mathrm{tr}(\Sigma_\nu)^2 + 8\,\mathrm{tr}(\Sigma_\nu^\top\Sigma_\nu) + 2\Big( -4\,\mathrm{tr}(\Sigma_\mu)\,\mathrm{tr}(\Sigma_\nu) + 8\min_{B,\ BB^\top = I_q} -\mathrm{tr}(D_\mu B^\top D_\nu B) \Big)$$
$$= 4\big(\mathrm{tr}(\Sigma_\mu) - \mathrm{tr}(\Sigma_\nu)\big)^2 + 8\big(\mathrm{tr}(\Sigma_\mu^\top\Sigma_\mu) + \mathrm{tr}(\Sigma_\nu^\top\Sigma_\nu)\big) + 16\min_{B \in \mathbb{V}_p(\mathbb{R}^q)} -\mathrm{tr}(D_\mu B^\top D_\nu B)$$
Since the covariances are symmetric, this gives (4.39). We can use Proposition 6.2.3 to solve $\min_{B,\ BB^\top = I_q} -\mathrm{tr}(D_\mu B^\top D_\nu B)$: indeed, $-D_\mu$ is already diagonal with nonincreasing values and $D_\nu$ is already diagonal with nondecreasing values. In this way, when $p = q$, $\min_{B,\ BB^\top = I_q} -\mathrm{tr}(D_\mu B^\top D_\nu B)$ can be solved in closed form with $B = I_q$, which corresponds to $A = V_\nu D_\nu^{1/2} D_\mu^{-1/2} V_\mu^\top = \Sigma_\nu^{1/2} V_\nu V_\mu^\top \Sigma_\mu^{-1/2}$. $\square$

This section contains all the proofs of the claims and additional results of Chapter 5. We recall the notations of the chapter. Two datasets are represented by matrices $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n\times d}$ and $X' = [x'_1, \dots$
$\dots, x'_{n'}]^T \in \mathbb{R}^{n'\times d'}$. The rows of the datasets are referred to as samples and their columns as features. Let $\mu = \sum_{i=1}^{n} w_i \delta_{x_i}$ and $\mu' = \sum_{i=1}^{n'} w'_i \delta_{x'_i}$ be two empirical distributions related to the samples, where $x_i \in \mathbb{R}^d$ and $x'_i \in \mathbb{R}^{d'}$. We refer in the following to $w = [w_1, \dots, w_n]^\top$ and $w' = [w'_1, \dots, w'_{n'}]^\top$ as the sample weight vectors, which both lie in the simplex ($w \in \Delta_n$ and $w' \in \Delta_{n'}$). In addition to them, we also introduce weights for the features, stored in vectors $v \in \Delta_d$ and $v' \in \Delta_{d'}$. Finally, we let vec denote the column-stacking operator.

We recall the proposition:
Proposition (COOT is a distance). Suppose $L = |\cdot|^p$, $p \ge 1$, $n = n'$, $d = d'$ and that the weights $w, w', v, v'$ are uniform. Then $COOT(X, X') = 0$ if and only if there exist a permutation of the samples $\sigma_1 \in \mathbb{S}_n$ and a permutation of the features $\sigma_2 \in \mathbb{S}_d$ such that $\forall i, k$, $X_{i,k} = X'_{\sigma_1(i), \sigma_2(k)}$. Moreover, COOT is symmetric and satisfies the triangle inequality as long as $L$ satisfies the triangle inequality, i.e. $COOT(X, X'') \le COOT(X, X') + COOT(X', X'')$.

Proof.
The symmetry follows from the definition of COOT. To prove the triangle inequality of COOT for arbitrary measures, we will use the gluing lemma (see [Villani 2008]), which states the existence of couplings with a prescribed structure. Let $X \in \mathbb{R}^{n\times d}$, $X' \in \mathbb{R}^{n'\times d'}$, $X'' \in \mathbb{R}^{n''\times d''}$ be associated with $w \in \Sigma_n, v \in \Sigma_d$, $w' \in \Sigma_{n'}, v' \in \Sigma_{d'}$, $w'' \in \Sigma_{n''}, v'' \in \Sigma_{d''}$. Without loss of generality, we can suppose in the proof that all weights are different from zero (otherwise we can consider $\tilde{w}_i = w_i$ if $w_i > 0$ and $\tilde{w}_i = 1$ if $w_i = 0$; see the proof of Proposition 2.2 in [Peyré 2019]).

Let $(\pi_1^s, \pi_1^v)$ and $(\pi_2^s, \pi_2^v)$ be two couples of optimal solutions for the COOT problems associated with $COOT(X, X', w, w', v, v')$ and $COOT(X', X'', w', w'', v', v'')$ respectively. We define:
$$S_1 = \pi_1^s\, \mathrm{diag}\Big( \frac{1}{w'} \Big)\, \pi_2^s, \qquad S_2 = \pi_1^v\, \mathrm{diag}\Big( \frac{1}{v'} \Big)\, \pi_2^v$$
Then it is easy to check that $S_1 \in \Pi(w, w'')$ and $S_2 \in \Pi(v, v'')$ (see e.g. Proposition 2.2 in [Peyré 2019]). We now show the following:
$$COOT(X, X'', w, w'', v, v'') \stackrel{(*)}{\le} \langle L(X, X'') \otimes S_1, S_2\rangle = \Big\langle L(X, X'') \otimes \Big[ \pi_1^s\, \mathrm{diag}\Big(\tfrac{1}{w'}\Big)\, \pi_2^s \Big],\ \Big[ \pi_1^v\, \mathrm{diag}\Big(\tfrac{1}{v'}\Big)\, \pi_2^v \Big] \Big\rangle \stackrel{(**)}{\le} \Big\langle [L(X, X') + L(X', X'')] \otimes \Big[ \pi_1^s\, \mathrm{diag}\Big(\tfrac{1}{w'}\Big)\, \pi_2^s \Big],\ \Big[ \pi_1^v\, \mathrm{diag}\Big(\tfrac{1}{v'}\Big)\, \pi_2^v \Big] \Big\rangle$$
where in (*) we used the suboptimality of $S_1, S_2$ and in (**) the fact that $L$ satisfies the triangle inequality. Now note that the right-hand side equals:
$$\sum_{i,j,k,l,e,o} L(X_{i,k}, X'_{e,o})\, \frac{\pi_{1,(i,e)}^s \pi_{2,(e,j)}^s}{w'_e}\, \frac{\pi_{1,(k,o)}^v \pi_{2,(o,l)}^v}{v'_o} + \sum_{i,j,k,l,e,o} L(X'_{e,o}, X''_{j,l})\, \frac{\pi_{1,(i,e)}^s \pi_{2,(e,j)}^s}{w'_e}\, \frac{\pi_{1,(k,o)}^v \pi_{2,(o,l)}^v}{v'_o} \stackrel{(*)}{=} \sum_{i,k,e,o} L(X_{i,k}, X'_{e,o})\, \pi_{1,(i,e)}^s \pi_{1,(k,o)}^v + \sum_{j,l,e,o} L(X'_{e,o}, X''_{j,l})\, \pi_{2,(e,j)}^s \pi_{2,(o,l)}^v$$
where in (*) we used:
$$\sum_j \frac{\pi_{2,(e,j)}^s}{w'_e} = 1, \quad \sum_l \frac{\pi_{2,(o,l)}^v}{v'_o} = 1, \quad \sum_i \frac{\pi_{1,(i,e)}^s}{w'_e} = 1, \quad \sum_k \frac{\pi_{1,(k,o)}^v}{v'_o} = 1$$
Overall, from the definition of $\pi_1^s$, $\pi_1^v$ and $\pi_2^s$,
$\pi^v_2)$, we have:
$$\mathrm{COOT}(X_1, X_2, w_1, w_2, v_1, v_2) \leq \mathrm{COOT}(X_1, X_3, w_1, w_3, v_1, v_3) + \mathrm{COOT}(X_3, X_2, w_3, w_2, v_3, v_2).$$
For the identity of indiscernibles, suppose that $n_1 = n_2$, $d_1 = d_2$ and that the weights $w_1, w_2, v_1, v_2$ are uniform. Suppose that there exists a permutation of the samples $\sigma_1 \in \mathbb{S}_{n_1}$ and of the features $\sigma_2 \in \mathbb{S}_{d_1}$
such that $\forall (i,k) \in [\![n_1]\!] \times [\![d_1]\!]$, $(X_1)_{i,k} = (X_2)_{\sigma_1(i),\sigma_2(k)}$. We define the couplings $\pi^s, \pi^v$ supported on the graphs of the permutations $\sigma_1, \sigma_2$ respectively, i.e. $\pi^s = (\mathrm{id} \times \sigma_1)$ and $\pi^v = (\mathrm{id} \times \sigma_2)$. These couplings have the prescribed marginals and lead to a zero cost, hence they are optimal.
Conversely, as described in the chapter, there always exists an optimal solution of (COOT) which lies on extremal points of the polytopes $\Pi(w_1, w_2)$ and $\Pi(v_1, v_2)$. When $n_1 = n_2$, $d_1 = d_2$ and uniform weights are used, Birkhoff's theorem [Birkhoff 1946] states that the sets of extremal points of $\Pi(\frac{\mathbf{1}_{n_1}}{n_1}, \frac{\mathbf{1}_{n_1}}{n_1})$ and $\Pi(\frac{\mathbf{1}_{d_1}}{d_1}, \frac{\mathbf{1}_{d_1}}{d_1})$ are the sets of permutation matrices, so there exists an optimal solution $(\pi^{s*}, \pi^{v*})$ supported on $\sigma^{s*}, \sigma^{v*}$ respectively, with $(\sigma^{s*}, \sigma^{v*}) \in \mathbb{S}_{n_1} \times \mathbb{S}_{d_1}$. Then, if $\mathrm{COOT}(X_1, X_2) = 0$, it implies that $\sum_{i,k} L\big((X_1)_{i,k}, (X_2)_{\sigma^{s*}(i),\sigma^{v*}(k)}\big) = 0$. If $L = |\cdot|^p$, then $(X_1)_{i,k} = (X_2)_{\sigma^{s*}(i),\sigma^{v*}(k)}$, which gives the desired result. If $n_1 \neq n_2$ or $d_1 \neq d_2$, the COOT cost is always strictly positive, as there exists a strictly positive element outside the "diagonal". We recall the result:
Lemma.
The overall computational complexity of computing the value of COOT when $L = |\cdot|^2$ is $O\big(\min\{(n_1+n_2)d_1d_2 + n_1n_2;\ (d_1+d_2)n_1n_2 + d_1d_2\}\big)$.

Proof. As mentioned in [Peyré 2016], if $L$ can be written as $L(a,b) = f_1(a) + f_2(b) - h_1(a)h_2(b)$, then we have that
$$L(X_1, X_2) \otimes \pi^s = C_{X_1,X_2} - h_1(X_1)^\top \pi^s h_2(X_2), \quad \text{where } C_{X_1,X_2} = f_1(X_1)^\top w_1 \mathbf{1}_{d_2}^\top + \mathbf{1}_{d_1} w_2^\top f_2(X_2),$$
so that the latter can be computed in $O(n_1d_1d_2 + n_2d_1d_2) = O((n_1+n_2)d_1d_2)$. To compute the final cost, we must also calculate the inner product with $\pi^v$, which can be done in $O(n_1n_2)$, making the complexity of $\langle L(X_1, X_2) \otimes \pi^s, \pi^v \rangle$ equal to $O((n_1+n_2)d_1d_2 + n_1n_2)$. Finally, since the cost is symmetric w.r.t. $\pi^s, \pi^v$, we obtain the overall complexity of $O\big(\min\{(n_1+n_2)d_1d_2 + n_1n_2;\ (d_1+d_2)n_1n_2 + d_1d_2\}\big)$.

As pointed out in [Konno 1976b], we can relate the solutions of a QAP and of a BAP. We will prove the following equivalent result (a maximization version of Theorem 5.5.1):
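As an aside (not part of the original manuscript), the factorization used in this proof can be checked numerically for $L = |a-b|^2$, i.e. $f_1(a) = a^2$, $f_2(b) = b^2$, $h_1(a) = a$, $h_2(b) = 2b$. The sketch below, with purely illustrative sizes and data, compares the naive $O(n_1n_2d_1d_2)$ sum with the factored computation:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d1, d2 = 5, 6, 3, 4
X1, X2 = rng.normal(size=(n1, d1)), rng.normal(size=(n2, d2))
pi_s = np.full((n1, n2), 1.0 / (n1 * n2))  # sample coupling (independent one)
pi_v = np.full((d1, d2), 1.0 / (d1 * d2))  # feature coupling

# Naive O(n1*n2*d1*d2) evaluation of <L(X1, X2) tensor pi_s, pi_v> for L = |a-b|^2.
naive = sum((X1[i, k] - X2[j, l]) ** 2 * pi_s[i, j] * pi_v[k, l]
            for i in range(n1) for j in range(n2)
            for k in range(d1) for l in range(d2))

# Factored evaluation: [L tensor pi_s]_{k,l} = sum_i X1[i,k]^2 (w1)_i
# + sum_j X2[j,l]^2 (w2)_j - 2 [X1^T pi_s X2]_{k,l}, with w1, w2 the marginals.
w1, w2 = pi_s.sum(axis=1), pi_s.sum(axis=0)
M = ((X1 ** 2).T @ w1)[:, None] + ((X2 ** 2).T @ w2)[None, :] \
    - 2.0 * X1.T @ pi_s @ X2
fast = float((M * pi_v).sum())
assert np.isclose(naive, fast)
```

The factored form never materializes the $n_1 \times n_2 \times d_1 \times d_2$ tensor, which is the source of the complexity gain stated in the lemma.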
Theorem. If $Q$ is a positive semi-definite matrix, then the problems
$$\max_{x}\ f(x) = c^\top x + \tfrac{1}{2}x^\top Q x \quad \text{s.t. } Ax = b,\ x \geq 0 \qquad (6.101)$$
$$\max_{x,y}\ g(x,y) = \tfrac{1}{2}c^\top x + \tfrac{1}{2}c^\top y + \tfrac{1}{2}x^\top Q y \quad \text{s.t. } Ax = b,\ Ay = b,\ x \geq 0,\ y \geq 0 \qquad (6.102)$$
are equivalent. More precisely, if $x^*$ is an optimal solution for (6.101), then $(x^*, x^*)$ is a solution for (6.102), and if $(x^*, y^*)$ is optimal for (6.102), then both $x^*$ and $y^*$ are optimal for (6.101).

Proof. This proof follows the proof of Theorem 2.2 in [Konno 1976b]. Let $z^*$ be optimal for (6.101) and $(x^*, y^*)$ be optimal for (6.102). Then, by definition, for all $x$ satisfying the constraints of (6.101), $f(z^*) \geq f(x)$. In particular, $f(z^*) \geq f(x^*) = g(x^*, x^*)$ and $f(z^*) \geq f(y^*) = g(y^*, y^*)$. Also, $g(x^*, y^*) \geq \max_{x:\ Ax=b,\ x \geq 0} g(x, x) = f(z^*)$.
To prove the theorem, it suffices to prove that
$$f(y^*) = f(x^*) = g(x^*, y^*), \qquad (6.103)$$
since, in this case, $g(x^*, y^*) = f(x^*) \geq f(z^*)$ and $g(x^*, y^*) = f(y^*) \geq f(z^*)$.
Let us prove (6.103). Since $(x^*, y^*)$ is optimal, we have:
$$0 \leq g(x^*, y^*) - g(x^*, x^*) = \tfrac{1}{2}c^\top(y^* - x^*) + \tfrac{1}{2}x^{*\top}Q(y^* - x^*),$$
$$0 \leq g(x^*, y^*) - g(y^*, y^*) = \tfrac{1}{2}c^\top(x^* - y^*) + \tfrac{1}{2}y^{*\top}Q(x^* - y^*).$$
By adding these inequalities we obtain:
$$(x^* - y^*)^\top Q (x^* - y^*) \leq 0.$$
Since $Q$ is positive semi-definite, this implies that $Q(x^* - y^*) = 0$. So, using the previous inequalities, we have $c^\top(x^* - y^*) = 0$, hence $g(x^*, y^*) = g(x^*, x^*) = g(y^*, y^*)$ as required. Note also that this result holds when we add a constant term to the cost function. We recall the proposition:
Proposition.
Let $L = |\cdot|^2$ and suppose that $C_1 \in \mathbb{R}^{n_1 \times n_1}$, $C_2 \in \mathbb{R}^{n_2 \times n_2}$ are squared Euclidean distance matrices such that $C_1 = x_1\mathbf{1}_{n_1}^\top + \mathbf{1}_{n_1}x_1^\top - 2X_1X_1^\top$, $C_2 = x_2\mathbf{1}_{n_2}^\top + \mathbf{1}_{n_2}x_2^\top - 2X_2X_2^\top$ with $x_1 = \operatorname{diag}(X_1X_1^\top)$, $x_2 = \operatorname{diag}(X_2X_2^\top)$. Then the GW problem can be written as a concave quadratic program (QP) whose Hessian reads $Q = -8\,X_1X_1^\top \otimes_K X_2X_2^\top$.
If $C_1 \in \mathbb{R}^{n_1 \times n_1}$, $C_2 \in \mathbb{R}^{n_2 \times n_2}$ are inner product similarities, i.e. such that $C_1 = X_1X_1^\top$, $C_2 = X_2X_2^\top$, then GW is also a concave quadratic program (QP) whose Hessian reads $Q = -2\,X_1X_1^\top \otimes_K X_2X_2^\top$. This result is a consequence of the following lemma.
Lemma 6.3.1.
When $C_1, C_2$ are squared Euclidean distance matrices as defined previously, the GW problem can be formulated as:
$$GW(C_1, C_2, w_1, w_2) = \min_{\pi^s \in \Pi(w_1, w_2)} -\operatorname{vec}(M)^\top \operatorname{vec}(\pi^s) - \operatorname{vec}(\pi^s)^\top Q \operatorname{vec}(\pi^s) + \mathrm{Cte}$$
with
$$M = 4\big(x_1x_2^\top - 2\,x_1w_2^\top X_2X_2^\top - 2\,X_1X_1^\top w_1x_2^\top\big), \qquad Q = 8\,X_1X_1^\top \otimes_K X_2X_2^\top,$$
$$\mathrm{Cte} = \sum_{i,j} \|(X_1)_{i,:} - (X_1)_{j,:}\|^4 (w_1)_i(w_1)_j + \sum_{i,j} \|(X_2)_{i,:} - (X_2)_{j,:}\|^4 (w_2)_i(w_2)_j - 4\,(w_1^\top x_1)(w_2^\top x_2).$$
When $C_1, C_2$ are inner product similarities as defined in Proposition 5.5.1, the GW problem can be formulated as:
$$GW(C_1, C_2, w_1, w_2) = \min_{\pi^s \in \Pi(w_1, w_2)} -\operatorname{vec}(\pi^s)^\top Q' \operatorname{vec}(\pi^s) + \mathrm{Cte}', \quad \text{with } Q' = 2\,X_1X_1^\top \otimes_K X_2X_2^\top.$$
Proof.
Using the results in [Peyré 2016] for $L = |\cdot|^2$, we have $L(C_1, C_2) \otimes \pi^s = c_{C_1,C_2} - 2\,C_1\pi^s C_2$ with $c_{C_1,C_2} = (C_1)^{\circ 2} w_1 \mathbf{1}_{n_2}^\top + \mathbf{1}_{n_1} w_2^\top \big((C_2)^{\circ 2}\big)^\top$, where $(C)^{\circ 2} = (C_{i,j}^2)_{i,j}$ denotes the element-wise square. We now have that:
$$
\begin{aligned}
\langle C_1\pi^s C_2, \pi^s \rangle &= \operatorname{tr}\big[\pi^{s\top}(x_1\mathbf{1}_{n_1}^\top + \mathbf{1}_{n_1}x_1^\top - 2X_1X_1^\top)\,\pi^s\,(x_2\mathbf{1}_{n_2}^\top + \mathbf{1}_{n_2}x_2^\top - 2X_2X_2^\top)\big] \\
&= \operatorname{tr}\big[(\pi^{s\top}x_1\mathbf{1}_{n_1}^\top + w_2x_1^\top - 2\pi^{s\top}X_1X_1^\top)(\pi^s x_2\mathbf{1}_{n_2}^\top + w_1x_2^\top - 2\pi^s X_2X_2^\top)\big] \\
&= \operatorname{tr}\big[\pi^{s\top}x_1w_2^\top x_2\mathbf{1}_{n_2}^\top + \pi^{s\top}x_1x_2^\top - 2\pi^{s\top}x_1w_2^\top X_2X_2^\top + w_2x_1^\top\pi^s x_2\mathbf{1}_{n_2}^\top + w_2x_1^\top w_1x_2^\top \\
&\quad\ - 2w_2x_1^\top\pi^s X_2X_2^\top - 2\pi^{s\top}X_1X_1^\top\pi^s x_2\mathbf{1}_{n_2}^\top - 2\pi^{s\top}X_1X_1^\top w_1x_2^\top + 4\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top\big] \\
&\overset{(*)}{=} \operatorname{tr}\big[\pi^{s\top}x_1w_2^\top(x_2\mathbf{1}_{n_2}^\top + \mathbf{1}_{n_2}x_2^\top) + \pi^{s\top}x_1x_2^\top + w_2x_1^\top w_1x_2^\top - 2\pi^{s\top}x_1w_2^\top X_2X_2^\top - 2w_2x_1^\top\pi^s X_2X_2^\top \\
&\quad\ - 2\pi^{s\top}X_1X_1^\top\pi^s x_2\mathbf{1}_{n_2}^\top - 2\pi^{s\top}X_1X_1^\top w_1x_2^\top + 4\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top\big],
\end{aligned}
$$
where in (*) we used: $\operatorname{tr}(w_2x_1^\top\pi^s x_2\mathbf{1}_{n_2}^\top) = \operatorname{tr}(\mathbf{1}_{n_2}^\top w_2x_1^\top\pi^s x_2) = \operatorname{tr}(\pi^{s\top}x_1w_2^\top\mathbf{1}_{n_2}x_2^\top)$. Moreover, since
$$\operatorname{tr}(\pi^{s\top}X_1X_1^\top\pi^s x_2\mathbf{1}_{n_2}^\top) = \operatorname{tr}(\mathbf{1}_{n_2}^\top\pi^{s\top}X_1X_1^\top\pi^s x_2) = \operatorname{tr}(w_1^\top X_1X_1^\top\pi^s x_2) = \operatorname{tr}(\pi^{s\top}X_1X_1^\top w_1x_2^\top)$$
and $\operatorname{tr}(w_2x_1^\top\pi^s X_2X_2^\top) = \operatorname{tr}(\pi^{s\top}x_1w_2^\top X_2X_2^\top)$, we can simplify the last expression to obtain:
$$\langle C_1\pi^s C_2, \pi^s \rangle = \operatorname{tr}\big[\pi^{s\top}x_1w_2^\top(x_2\mathbf{1}_{n_2}^\top + \mathbf{1}_{n_2}x_2^\top) + \pi^{s\top}x_1x_2^\top + w_2x_1^\top w_1x_2^\top - 4\pi^{s\top}x_1w_2^\top X_2X_2^\top - 4\pi^{s\top}X_1X_1^\top w_1x_2^\top + 4\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top\big].$$
Finally, we have that
$$
\begin{aligned}
\langle C_1\pi^s C_2, \pi^s \rangle &= \operatorname{tr}\big[\pi^{s\top}x_1w_2^\top x_2\mathbf{1}_{n_2}^\top + \pi^{s\top}x_1w_2^\top\mathbf{1}_{n_2}x_2^\top + \pi^{s\top}x_1x_2^\top + w_2x_1^\top w_1x_2^\top - 4\pi^{s\top}x_1w_2^\top X_2X_2^\top - 4\pi^{s\top}X_1X_1^\top w_1x_2^\top + 4\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top\big] \\
&= \operatorname{tr}\big[2\,w_2x_1^\top w_1x_2^\top + 2\,\pi^{s\top}x_1x_2^\top - 4\pi^{s\top}x_1w_2^\top X_2X_2^\top - 4\pi^{s\top}X_1X_1^\top w_1x_2^\top + 4\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top\big] \\
&= 2\,(w_1^\top x_1)(w_2^\top x_2) + 2\,\big\langle x_1x_2^\top - 2\,x_1w_2^\top X_2X_2^\top - 2\,X_1X_1^\top w_1x_2^\top,\ \pi^s\big\rangle + 4\operatorname{tr}(\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top).
\end{aligned}
$$
The term $2\,(w_1^\top x_1)(w_2^\top x_2)$ is constant since it does not depend on the coupling. Also, we can verify that $\langle c_{C_1,C_2}, \pi^s \rangle$ does not depend on $\pi^s$, as follows:
$$\langle c_{C_1,C_2}, \pi^s \rangle = \sum_{i,j} \|(X_1)_{i,:} - (X_1)_{j,:}\|^4 (w_1)_i(w_1)_j + \sum_{i,j} \|(X_2)_{i,:} - (X_2)_{j,:}\|^4 (w_2)_i(w_2)_j,$$
implying that:
$$\langle c_{C_1,C_2} - 2\,C_1\pi^s C_2, \pi^s \rangle = \mathrm{Cte} - 4\,\big\langle x_1x_2^\top - 2\,x_1w_2^\top X_2X_2^\top - 2\,X_1X_1^\top w_1x_2^\top,\ \pi^s\big\rangle - 8\operatorname{tr}(\pi^{s\top}X_1X_1^\top\pi^s X_2X_2^\top).$$
We can rewrite this equation as stated in the lemma using the $\operatorname{vec}$ operator. Writing the objective in a standard QP form $c^\top z + z^\top \tilde{Q} z$ with $z = \operatorname{vec}(\pi^s)$, $c = -\operatorname{vec}(M)$ and $\tilde{Q} = -8\,X_1X_1^\top \otimes_K X_2X_2^\top$, we see that the Hessian is negative semi-definite, as the opposite of a Kronecker product of the positive semi-definite matrices $X_1X_1^\top$ and $X_2X_2^\top$.
The inner product case follows from the same computation; only the constant term changes and $M = 0$. Note that both computations were also carried out in Chapter 4, precisely in Lemma 4.2.1. We recall the proposition:
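As a quick numerical sanity check of this concavity claim (a sketch with arbitrary point clouds, not taken from the thesis): along any direction $H$ with zero marginals, the quadratic part of the GW objective is $-2\operatorname{tr}(H^\top C_1HC_2)$, and for squared Euclidean matrices $\operatorname{tr}(H^\top C_1HC_2) = 4\|X_1^\top HX_2\|_F^2 \geq 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
X1, X2 = rng.normal(size=(6, 3)), rng.normal(size=(5, 4))

def sq_dist(Z):
    """Squared Euclidean distance matrix x 1^T + 1 x^T - 2 Z Z^T."""
    x = (Z ** 2).sum(axis=1)
    return x[:, None] + x[None, :] - 2.0 * Z @ Z.T

C1, C2 = sq_dist(X1), sq_dist(X2)

# A feasible direction of the coupling polytope has zero marginals.
H = rng.normal(size=(6, 5))
H -= H.mean(axis=0, keepdims=True)   # column sums become zero
H -= H.mean(axis=1, keepdims=True)   # row sums become zero (columns stay at zero)

# Quadratic part of pi -> <L(C1,C2) tensor pi, pi> along H is -2 tr(H^T C1 H C2);
# for squared Euclidean matrices tr(H^T C1 H C2) = 4 ||X1^T H X2||_F^2 >= 0,
# so the objective is concave along every feasible direction.
quad = np.trace(H.T @ C1 @ H @ C2)
assert np.isclose(quad, 4.0 * np.sum((X1.T @ H @ X2) ** 2))
assert quad >= 0.0
```

This only checks concavity along feasible directions (conditional concavity); the lemma's QP rewriting strengthens this to global concavity of the equivalent objective.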
Proposition.
Let $C_1 \in \mathbb{R}^{n_1 \times n_1}$, $C_2 \in \mathbb{R}^{n_2 \times n_2}$ be any symmetric matrices; then:
$$\mathrm{COOT}(C_1, C_2, w_1, w_2, w_1, w_2) \leq GW(C_1, C_2, w_1, w_2).$$
The converse is also true under the hypothesis of Proposition 5.5.1. In this case, if $(\pi^{s*}, \pi^{v*})$ is an optimal solution of COOT, then both $\pi^{s*}$ and $\pi^{v*}$ are solutions of GW. Conversely, if $\pi^{s*}$ is an optimal solution of GW, then $(\pi^{s*}, \pi^{s*})$ is an optimal solution for COOT.

Proof.
The inequality follows from the fact that any admissible coupling $\pi^s$ of the GW problem yields an admissible pair $(\pi^s, \pi^s)$ for the COOT problem with the same cost; taking $\pi^s$ optimal for GW, the inequality follows by suboptimality of this pair for COOT. For the equality part, by following the same computation as in the proof of Lemma 6.3.1, we can verify that:
$$\mathrm{COOT}(C_1, C_2, w_1, w_2, w_1, w_2) = \min_{\pi^s, \pi^v \in \Pi(w_1, w_2)} -\tfrac{1}{2}\operatorname{vec}(M)^\top\operatorname{vec}(\pi^s) - \tfrac{1}{2}\operatorname{vec}(M)^\top\operatorname{vec}(\pi^v) - \operatorname{vec}(\pi^s)^\top Q \operatorname{vec}(\pi^v) + \mathrm{Cte},$$
with $M, Q$ as defined in Lemma 6.3.1. Since $-Q$ is negative semi-definite, we can apply Theorem 5.5.1 to prove that both problems are equivalent, lead to the same cost, and that every optimal solution of GW is an optimal solution of COOT and vice versa. The same applies to the inner product case. We recall the result:
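For intuition, this equivalence can be verified by brute force on a tiny instance (a sketch with arbitrary data, not from the thesis). With uniform weights, both problems admit optimal solutions supported on permutations, so one can enumerate all permutation couplings and check that letting the feature coupling differ from the sample coupling does not improve the optimal value in the squared Euclidean (concave) case:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
Z1, Z2 = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))

def sq_dist(Z):
    x = (Z ** 2).sum(axis=1)
    return x[:, None] + x[None, :] - 2.0 * Z @ Z.T

C1, C2 = sq_dist(Z1), sq_dist(Z2)
perms = list(permutations(range(4)))

def cost(ps, pv):
    # COOT cost (up to the common 1/n^2 normalization) of a permutation pair.
    return sum((C1[i, k] - C2[ps[i], pv[k]]) ** 2
               for i in range(4) for k in range(4))

coot = min(cost(ps, pv) for ps in perms for pv in perms)  # independent couplings
gw = min(cost(p, p) for p in perms)                       # GW forces ps = pv
assert coot <= gw + 1e-9        # COOT <= GW always holds
assert np.isclose(coot, gw)     # equality in the concave (squared Euclidean) case
```

The enumeration is justified by Birkhoff's theorem: with uniform weights, extreme points of the coupling polytope are (rescaled) permutation matrices.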
Proposition.
When $X_1 = C_1$, $X_2 = C_2$ are squared Euclidean distance matrices or inner product similarities, the iterations of Algorithm 10 are the same as the iterations of the FW procedure defined in Chapter 3 for solving GW (provided that the initialization is the same).

Proof. Using Proposition 5.5.2, we know that when $X_1 = C_1$, $X_2 = C_2$ are squared Euclidean distance matrices or inner product similarities, there is an optimal solution of the form $(\pi^*, \pi^*)$. In this case, we can set $\pi^{s(k)} = \pi^{v(k)}$ during the iterations of Algorithm 9 to obtain an optimal solution for both COOT and GW. This reduces to Algorithm 10, which corresponds to a DC algorithm where the quadratic form is replaced by its linear upper bound. Below, we prove that this DC algorithm for solving GW problems is equivalent to the Frank-Wolfe (FW) based algorithm presented in Chapter 3 and recalled in Algorithm 11, when $L = |\cdot|^2$ and for squared Euclidean distance matrices $C_1, C_2$.

Algorithm 11
FW Algorithm for GW (see Chapter 3)
Input: maxIt, thd
1: $\pi^{s(0)} \leftarrow w_1w_2^\top$
2: while $k <$ maxIt and err $>$ thd do
3:   $G \leftarrow$ gradient of equation (5.2) w.r.t. $\pi^{s(k-1)}$
4:   $\tilde{\pi}^{s(k)} \leftarrow \mathrm{OT}(w_1, w_2, G)$
5:   $z_k(\tau) \leftarrow \pi^{s(k-1)} + \tau(\tilde{\pi}^{s(k)} - \pi^{s(k-1)})$ for $\tau \in (0,1)$
6:   $\tau^{(k)} \leftarrow \operatorname{argmin}_{\tau \in (0,1)} \langle L(C_1, C_2) \otimes z_k(\tau), z_k(\tau) \rangle$
7:   $\pi^{s(k)} \leftarrow (1 - \tau^{(k)})\pi^{s(k-1)} + \tau^{(k)}\tilde{\pi}^{s(k)}$
8:   err $\leftarrow \|\pi^{s(k-1)} - \pi^{s(k)}\|_F$
9:   $k \leftarrow k + 1$
10: end while

The cases where $L = |\cdot|^2$ and $C_1, C_2$ are squared Euclidean distance matrices or inner product similarities have interesting implications in practice, since in these cases the resulting GW problem is a concave QP (as explained in this chapter and shown in Lemma 6.3.1). In [Maron 2018], the authors investigated the solution of QPs with conditionally concave energies using a FW algorithm and showed that in this case the line-search step of FW is always 1. Moreover, as shown in Proposition 6.3.1, the GW problem
can be written as a concave QP with concave energy and thus, a fortiori, minimizes a conditionally concave energy. Consequently, the line-search of the FW algorithm proposed in Chapter 3 and described in Algorithm 11 always leads to an optimal step of 1. In this case, Algorithm 11 is equivalent to Algorithm 12 given below, since $\tau^{(k)} = 1$ for all $k$.

Algorithm 12
FW Algorithm for GW with squared Euclidean distance matrices or inner product similarities
Input: maxIt, thd
1: $\pi^{s(0)} \leftarrow w_1w_2^\top$
2: while $k <$ maxIt and err $>$ thd do
3:   $G \leftarrow$ gradient of equation (5.2) w.r.t. $\pi^{s(k-1)}$
4:   $\pi^{s(k)} \leftarrow \mathrm{OT}(w_1, w_2, G)$
5:   err $\leftarrow \|\pi^{s(k-1)} - \pi^{s(k)}\|_F$
6:   $k \leftarrow k + 1$
7: end while

Finally, by noticing that in step 3 of Algorithm 12 the gradient of (5.2) w.r.t. $\pi^{s(k-1)}$ is $2\,L(C_1, C_2) \otimes \pi^{s(k-1)}$, which gives the same OT solution as the OT problem in step 3 of Algorithm 10, we can conclude that the iterations of both algorithms are equivalent.

This section shows that the COOT approach can be used to solve the election isomorphism problem defined in [Faliszewski 2019], as follows: let $E = (C, V)$ and $E' = (C', V')$ be two elections, where $C = \{c_1, \ldots, c_m\}$ (resp. $C'$) denotes a set of candidates and $V = (v_1, \ldots, v_n)$ (resp. $V'$) denotes a set of voters, where each voter $v_i$ has a preference order, also denoted by $v_i$. The two elections $E = (C, V)$ and $E' = (C', V')$, where $|C| = |C'|$, $V = (v_1, \ldots, v_n)$ and $V' = (v'_1, \ldots, v'_n)$, are said to be isomorphic if there exists a bijection $\sigma: C \to C'$ and a permutation $\nu \in S_n$ such that $\sigma(v_i) = v'_{\nu(i)}$ for all $i \in [n]$. The authors further propose a distance underlying this problem, defined as follows:
$$\text{d-ID}(E, E') = \min_{\nu \in S_n} \min_{\sigma \in \Pi(C, C')} \sum_{i=1}^n d\big(\sigma(v_i), v'_{\nu(i)}\big),$$
where $S_n$ denotes the set of all permutations over $\{1, \ldots, n\}$, $\Pi(C, C')$ is the set of bijections between candidates and $d$ is an arbitrary distance between preference orders. The authors of [Faliszewski 2019] compute $\text{d-ID}(E, E')$ in practice by expressing it as the following Integer Linear Programming problem over the tensor $P_{i,j,k,l} = N_{i,j}M_{k,l}$, where $N \in \mathbb{R}^{n \times n}$, $M \in \mathbb{R}^{m \times m}$:
$$\min_{P, N, M} \sum_{i,j,k,l} P_{i,j,k,l}\, \big|\mathrm{pos}_{v_i}(c_k) - \mathrm{pos}_{v'_j}(c'_l)\big| \quad \text{s.t.}$$
$$(N\mathbf{1}_n)_i = 1\ \forall i, \quad (N^\top\mathbf{1}_n)_j = 1\ \forall j, \qquad (6.104)$$
$$(M\mathbf{1}_m)_k = 1\ \forall k, \quad (M^\top\mathbf{1}_m)_l = 1\ \forall l,$$
$$P_{i,j,k,l} \leq N_{i,j}, \quad P_{i,j,k,l} \leq M_{k,l}\ \forall i,j,k,l, \quad \sum_{i,k} P_{i,j,k,l} = 1\ \forall j,l, \qquad (6.105)$$
where $\mathrm{pos}_{v_i}(c_k)$ denotes the position of candidate $c_k$ in the preference order of voter $v_i$. Let us now define two matrices $X_1$ and $X_2$ such that $(X_1)_{i,k} = \mathrm{pos}_{v_i}(c_k)$ and $(X_2)_{j,l} = \mathrm{pos}_{v'_j}(c'_l)$, and denote by $(\pi^{s*}, \pi^{v*})$ a minimizer of $\mathrm{COOT}(X_1, X_2, \mathbf{1}_n/n, \mathbf{1}_n/n, \mathbf{1}_m/m, \mathbf{1}_m/m)$ with $L = |\cdot|$, and by $N^*, M^*$ the minimizers of problem (6.104), respectively.
As shown in the chapter, there exists an optimal solution for $\mathrm{COOT}(X_1, X_2)$ given by permutation matrices, as solutions of Monge-Kantorovich problems between uniform distributions supported on the same number of elements. Then, one may show that the solutions of the two problems coincide up to a multiplicative factor, i.e., $\pi^{s*} = \frac{1}{n}N^*$ and $\pi^{v*} = \frac{1}{m}M^*$ are optimal, since $|C| = |C'|$ and $|V| = |V'|$. For $\pi^{s*}$ (the same reasoning holds for $\pi^{v*}$ as well), we have that
$$(\pi^{s*})_{i,j} = \begin{cases} \frac{1}{n}, & j = \nu^*(i) \\ 0, & \text{otherwise}, \end{cases}$$
where $\nu^*$ is a permutation of the voters in the two sets. The only difference between the two solutions $\pi^{s*}$ and $N^*$ thus stems from the marginal constraints (6.104). To conclude, we note that COOT is a more general approach, as it is applicable to general loss functions $L$, contrary to the Spearman distance used in [Faliszewski 2019], and generalizes to the cases where $n \neq n'$ and $m \neq m'$.

Here, we present the results for the heterogeneous domain adaptation experiment not included in Section 5.6.1. Table 6.6 follows the same experimental protocol as in the chapter but shows the two cases where $n_t = 1$ and $n_t = 5$.
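To illustrate the election reduction above, here is a toy instance (hypothetical data, for illustration only) checking that two isomorphic elections are at zero COOT cost when optimizing over permutation couplings, which is exactly how the reduction operates:

```python
import numpy as np
from itertools import permutations

def pos_matrix(votes):
    """X[i, k] = position of candidate k in the preference order of voter i."""
    n, m = len(votes), len(votes[0])
    X = np.empty((n, m), dtype=int)
    for i, order in enumerate(votes):
        for rank, cand in enumerate(order):
            X[i, cand] = rank
    return X

votes1 = [(0, 1, 2), (2, 0, 1), (1, 2, 0)]
sigma = (2, 0, 1)                                    # candidate bijection
votes2 = [tuple(sigma[c] for c in votes1[i])         # relabel candidates...
          for i in (1, 2, 0)]                        # ...and shuffle voters

X1, X2 = pos_matrix(votes1), pos_matrix(votes2)

# Brute-force COOT over permutation couplings (Birkhoff extreme points), L = |.|:
best = min(
    sum(abs(int(X1[i, k]) - int(X2[ps[i], pv[k]]))
        for i in range(3) for k in range(3))
    for ps in permutations(range(3)) for pv in permutations(range(3)))
assert best == 0   # isomorphic elections are at zero distance
```

The optimal pair of permutations recovers the voter permutation $\nu$ and the candidate bijection $\sigma$ used to build the second election.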
Table 6.7 and Table 6.8 contain the results for the adaptation from GoogleNet to Decaf features, in the semi-supervised and unsupervised scenarios, respectively. Overall, the results are coherent with those from the chapter: in both settings, when $n_t = 5$, one can see that the performance difference between SGW and COOT is rather significant.
[Table 6.6 here: classification accuracies (mean ± standard deviation) of Baseline, CCA, KCCA, EGW, SGW and COOT for each pair of domains, with $n_t = 1$ and $n_t = 5$.]

Table 6.6: Semi-supervised Heterogeneous Domain Adaptation results for adaptation from Decaf to GoogleNet representations with different values of $n_t$.

[Table 6.7 here: classification accuracies (mean ± standard deviation) of Baseline, CCA, KCCA, EGW, SGW and COOT for each pair of domains, with $n_t = 1$, $n_t = 3$ and $n_t = 5$.]

Table 6.7: Semi-supervised Heterogeneous Domain Adaptation results for adaptation from GoogleNet to Decaf representations with different values of $n_t$.
[Table 6.8 here: classification accuracies (mean ± standard deviation) of CCA, KCCA, EGW and COOT for each pair of domains.]

Table 6.8: Unsupervised Heterogeneous Domain Adaptation results for adaptation from GoogleNet to Decaf representations.

Chapter 7

Conclusion
Le soleil est noyé. - C'est le soir - dans le port
Le navire bercé sur ses câbles, s'endort
– Tristan Corbière, Les Amours jaunes
This thesis presents a set of optimal transport tools for dealing with probability distributions on incomparable spaces, or equivalently probability distributions whose supports do not lie in a common metric space. We explained how this problem occurs, e.g., when one needs to consider some structural knowledge about the data or when the data come from heterogeneous sources. As a first instance of probability distributions on incomparable spaces, we studied the setting of structured data such as labeled graphs, time series or any "relational" data whose structure can be modelled through a notion of cost or similarity. We showed how to describe them as probability distributions and how to compare them using the so-called Fused Gromov-Wasserstein distance, which builds upon the Wasserstein and Gromov-Wasserstein distances. This new optimal transport distance was successfully applied in a graph context, where it finds applications for the classification, clustering and summarization of labeled graphs.

As the main building block of FGW, the Gromov-Wasserstein distance is a central notion of this thesis. We attempted to bridge the gap between the understanding of the Wasserstein distance and that of GW by considering the special case of Euclidean spaces. This setting allowed us to derive a sliced approach, based on the first closed-form expression for GW between 1D probability distributions. We called it Sliced Gromov-Wasserstein, akin to the Sliced Wasserstein distance, which has recently found many applications in machine learning. The Euclidean setting also proved to be a good starting point for analysing the regularity of GW optimal transport plans and, as such, we partially answered the following question: can we find guarantees on the probability measures so that an optimal transport plan for GW can be expressed through a deterministic function?
This question, central in linear optimal transport theory, can be tackled with the celebrated Brenier theorem in the context of the Wasserstein distance, but was still quite under-addressed when dealing with GW. Although GW is a powerful tool for comparing probability distributions on incomparable spaces, it is limited in its ability to find correspondences between the features of the samples of these distributions.
This limitation originates from the method itself, which discards the feature information by focusing only on pair-to-pair distance matrices. To circumvent this constraint, we proposed a novel optimal transport distance which finds the correspondences between both the samples and the features of the distributions. This work is based on the CO-Optimal Transport framework, which computes two optimal transport plans directly on the raw data, unlike GW, which requires pre-computed pair-to-pair distance or similarity matrices. We showed that it is particularly suited for problems such as Heterogeneous Domain Adaptation and Co-clustering. In the light of the previous results, we drew interesting connections between COOT and the GW distance, the latter being a special case of the former in the concave regime and with data described by distance or similarity matrices.

• [Vayer 2020b] T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Fused Gromov-Wasserstein distance for structured objects. Journal. In: Algorithms (2020).
• [Vayer 2019a] T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Optimal Transport for structured data with application on graphs. International conference. In: International Conference on Machine Learning (ICML) (2019).
• [Vayer 2019b] T. Vayer, R. Flamary, R. Tavenard, L. Chapel, and N. Courty. Sliced Gromov-Wasserstein. International conference. In: Advances in Neural Information Processing Systems (NeurIPS) (2019).
• T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Fused Gromov Wasserstein distance. National conference. In: Conférence sur l'Apprentissage automatique (CAp) (2018).
• T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Transport Optimal pour les Signaux sur Graphes. National conference. In: GRETSI (2019).
• [Redko 2020] I. Redko, T. Vayer, R. Flamary, and N. Courty. CO-Optimal Transport. In: arXiv:2002.03731. Submitted to NeurIPS (2020).
• [Vayer 2020a] T. Vayer, L. Chapel, N. Courty, R. Flamary, Y. Soullard, and R. Tavenard. Time Series Alignment with Global Invariances. In: arXiv:2002.03848 (2020).
There are many possible extensions and improvements of the work developed here. To conclude, we discuss potential limitations of, and further work on, the methods proposed in this thesis, from the OT perspective to the machine learning point of view.
What are the improvements and limitations of the different results of this thesis from the OT perspective?
The work on GW on Euclidean spaces suggests, in our opinion, interesting further directions. The slicing approach for GW admits natural extensions inspired by the Wasserstein distance. An immediate one is the max-sliced approach [Deshpande 2019], where only the projection which maximizes the loss is retained. This setting is directly applicable in our case without changing the theoretical properties of SGW stated in Theorem 4.1.3. Moreover, one limitation of SGW in the formulation that we proposed is the map $\Delta$, which has to be chosen or optimized when one wants to compute SGW when dimensions differ, or in order to retrieve the invariants of GW. The max-sliced approach could be an interesting remedy for this situation. We can, for example, draw one line in each space without requiring any map $\Delta$, but at the price of losing the divergence property (i.e., $\max\text{-}SGW(\mu, \mu)$ could be different from zero). Another potential direction would be to consider many lines in each space and then find their correspondences using, e.g., Wasserstein, which could define a proper metric w.r.t. isomorphisms but would come with a cubic complexity. Interestingly enough, this approach would be exactly the COOT distance, but with the matching of the features replaced by a sliced Wasserstein. In any case, there is room for improvement in the way SGW is defined when the dimensions of the supports of the probability measures differ.

Regarding the theoretical aspects of SGW, another interesting line of research, in our opinion, would be to draw connections with the statistical properties of the Sliced Wasserstein distance, which is known to behave well in terms of convergence of finite samples. We believe that similar studies for the Sliced Gromov-Wasserstein could be promising.

From a computational OT point of view, the CO-Optimal Transport framework also seems quite suited to large-scale datasets. Indeed, since the BCD procedure used for solving COOT alternates between linear optimal transport problems, another line of work could be to rely on the dual or semi-dual formulations of linear OT, used for large-scale OT, in order to compute COOT for large-scale datasets whose samples lie in incomparable spaces. Finally, from a theoretical point of view, a continuous formulation of COOT could also be of interest for studying the properties of GW, and especially the regularity of its optimal plans. Indeed, we have proven that an optimal solution for COOT can be found via permutation matrices, whose continuous counterparts are deterministic push-forwards. As such, can we find a natural extension of COOT which preserves this property in the continuous setting? Also, if the problem is concave, such as when GW is considered with squared Euclidean distances, can we preserve the result stating that COOT and GW are equivalent, so that an optimal solution of the first is an optimal solution of the second? In this way, can we conclude that there exists an optimal plan for GW that is supported on a Monge map when the problem is concave? We believe, enthusiastically, that these questions are worth investigating.

How can other fields contribute to the frameworks developed in this thesis?
One of the most challenging improvements of the work developed in this thesis is perhaps to lighten the computational complexity of calculating FGW, which is driven by the complexity of the GW distance. In its current form, the computational complexity of FGW limits this framework to small- and medium-scale scenarios and, in particular, makes it inapplicable to very large graphs. Moreover, one cannot hope to rely directly on the works about GW on Euclidean spaces to derive tractable formulations, since the structures of labeled graphs are usually highly non-Euclidean. However, it is worth pointing out that approximating $C_1, C_2$ with Euclidean matrices can be done in several ways [Borg 2005, Glunt 1990, Alfakih 1999, Liberti 2014]. We believe that a kind of no-free-lunch theorem applies here: if we faithfully model the structure of the data, it is likely to result in a more precise notion of distance between structured data, but also in a more challenging optimization problem than if we approximate it for better scalability. Needless to say, since GW is the backbone of our method, any further improvement in computational efficiency for non-convex quadratic programs, and by way of consequence for GW, would directly benefit FGW. In this way, progress in graph matching could definitely contribute to efficient algorithms for solving FGW, as both are intrinsically related. Another interesting study would concern the design of the matrices $C_1, C_2$ representative of the structures for FGW. We considered in this thesis that these matrices are given, such as shortest-path matrices; yet, one could argue that this is a strong prior on the structures of the graphs and that no perfect choice arises in practice. The problem of finding "good" structure matrices is closely related to the metric learning field [Bellet 2013], and further appealing work for FGW could be to go in this direction and add $C_1, C_2$ to the learning process. Related to this problem, we believe that very insightful connections between FGW and the field of graph signal processing [Ortega 2018] can be drawn, and that both can benefit from each other. As one example: can we build upon the spectral tools from graph signal processing in order to find adequate measures of structure $C_1, C_2$ of the graphs?

How can the above-mentioned optimal transport tools be used, or not, for machine learning?
As mentioned throughout this thesis, having a adequate measure for comparing data lyingon incomparable spaces can be useful for a wide range of machine learning tasks. For applications involvingstructured data we proposed the
F GW distance based on optimal transport. Note, however, that manyways of comparing structured data have been proposed in the literature: from the design of dedicatedkernels (see [Kriege 2020] for a comprehensive survey on this topic) to other choices of distances betweengraphs [Bento 2019, Wills 2020] including the popular graph-edit distance [Willett 1998, Raymond 2002].More recently, end-to-end approaches [Li 2019, Bai 2020, Riba 2018, Sun 2020] attempt to learn a similaritymeasure function between graphs based on graph neural networks, i.e. to learn a neural network basedfunction that takes two graphs as input and outputs the desired similarity. The question of finding a“good” similarity measure for structured data is far from being closed and is naturally dependent ofthe application. The choice of
FGW can be motivated by its metric properties, which allow detecting whether two graphs are isomorphic; as such, it could be used as an alternative to the graph-edit distance. However, despite its appealing properties, one could question the optimal transport framework on which this method is based. As described in this thesis, the coupling matrices are used to find a probabilistic matching between all the nodes of two graphs. Conversely, and depending on the application, one might want to match only a small portion of the nodes of two graphs, which is not possible using the FGW framework since it considers global matchings. In this way, in a scenario where only the local structure is important, the FGW machinery appears to be quite disproportionate, and kernel methods which are built on local structures may be better suited. The use of the FGW distance for structured data such as time series is also questionable. Indeed, the set of couplings itself is not constrained, so that a matching "back in time" is possible. On the contrary, it may be more interesting to enforce that the points matched to the target sequence at time t can only depend on the source sequence up to time t, as done e.g. with Dynamic Time Warping distances [Cuturi 2017, Sakoe 1978]. To remedy this problem, the recent causal optimal transport framework [Lassalle 2018, Veraguas 2016, Acciaio 2020, Zalashko 2017] may be better suited for this type of problem. Apart from these limitations, the FGW framework could be useful for other applications than those mentioned in this thesis. For instance, it could be used as a way of learning graph neural networks (GNN). Given a set of graphs we could, for example, minimize a triplet loss [Chechik 2010] or a graph similarity score [Bai 2019] which imposes that the distances in the embedding of the GNN are close to the
FGW distances, so as to force the GNN to produce similar embeddings for similar graphs. Moreover, the properties of the Fréchet barycenter with FGW could be further elaborated, especially the projection onto a smaller graph using FGW. A perspective on this topic would be to draw connections with graph signal processing and more standard coarsening procedures (see [Loukas 2019] and references therein) by questioning the ability of the FGW projection to reduce the
F GW projection to reduce thegraph without altering too much its spectral properties.Finally the results of this thesis suggests that COOT may be suited for Heterogeneous DomainAdaptation. An interesting further work on this topic would be to see if we can confirm these empiricalresults with theoretical guarantees such as bounds for Heterogeneous Domain Adaptation by relying onthe COOT framework. For example can we prove that the adaptation is easier when the datasets are .2. Perspective for further works 163 close from each other w.r.t. the COOT distance than if they are far from each other? To the best of ourknowledge the question of finding bounds for HDA problems is still quite under-addressed [Zhou 2019]and COOT may lead to promising further works on this topic.All these considered we hope that the works proposed in this thesis will pave the path for positiveand interesting studies on various and interdisciplinary topics in machine learning and that it will alsocontribute to the richness of the optimal transport theory, which is far from being extinguished. ibliography [Abrudan 2008] T. E. Abrudan, J. Eriksson et V. Koivunen.
Steepest Descent Algorithms for Optimization Under Unitary Matrix Constraint. IEEE Transactions on Signal Processing, vol. 56, no. 3, pages 1134–1147, 2008. (Cited on page 97.)[Absil 2009] P-A Absil, Robert Mahony et Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009. (Cited on pages 74, 75 and 97.)[Acciaio 2020] Beatrice Acciaio, Julio Backhoff-Veraguas et Junchao Jia.
Cournot-Nash equilibrium andoptimal transport in a dynamic setting , 2020. (Cited on page 162.)[Agueh 2011] Martial Agueh et Guillaume Carlier.
Barycenters in the Wasserstein space . SIAM Journalon Mathematical Analysis, vol. 43, no. 2, pages 904–924, 2011. (Cited on page 25.)[Alaya 2019] Mokhtar Z. Alaya, Maxime Berar, Gilles Gasso et Alain Rakotomamonjy.
Screening SinkhornAlgorithm for Regularized Optimal Transport . In Advances in Neural Information Processing Systems32, pages 12169–12179. Curran Associates, Inc., 2019. (Cited on page 106.)[Alfakih 1999] Abdo Y. Alfakih, Amir Khandani et Henry Wolkowicz.
Solving Euclidean DistanceMatrix Completion Problems Via Semidefinite Programming . Computational Optimization andApplications, vol. 12, no. 1, pages 13–30, Janvier 1999. (Cited on page 161.)[Altschuler 2017] Jason Altschuler, Jonathan Weed et Philippe Rigollet.
Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1961–1971, 2017. (Cited on pages 3, 1, 22, 103 and 106.)[Altschuler 2019] Jason Altschuler, Francis Bach, Alessandro Rudi et Jonathan Niles-Weed.
Massivelyscalable Sinkhorn distances via the Nyström method . In Advances in Neural Information ProcessingSystems 32, pages 4427–4437. Curran Associates, Inc., 2019. (Cited on page 106.)[Alvarez-Melis 2018a] David Alvarez-Melis et Tommi Jaakkola.
Gromov-Wasserstein Alignment of WordEmbedding Spaces . In Proceedings of the 2018 Conference on Empirical Methods in NaturalLanguage Processing, 2018. (Cited on pages 39 and 102.)[Alvarez-Melis 2018b] David Alvarez-Melis, Tommi S. Jaakkola et Stefanie Jegelka.
Structured OptimalTransport . In AISTATS, 2018. (Cited on pages 5, 3 and 43.)[Alvarez-Melis 2019] David Alvarez-Melis, Stefanie Jegelka et Tommi S. Jaakkola.
Towards OptimalTransport with Global Invariances . In AISTATS, volume 89, pages 1870–1879, 2019. (Cited onpages 33, 82, 102 and 108.)[Alvarez-Melis 2020] David Alvarez-Melis et Nicolò Fusi.
Geometric Dataset Distances via OptimalTransport , 2020. (Cited on page 105.)
[Ambrosio 2005] L. Ambrosio, N. Gigli et G. Savare. Gradient flows: In metric spaces and in the space of probability measures. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel, 2005. (Cited on pages 28, 33, 122 and 125.)[Andoni 2015] Alexandr Andoni, Assaf Naor et Ofer Neiman.
Snowflake universality of Wassersteinspaces . Annales Scientifiques de l’Ecole Normale Superieure, vol. 51, 09 2015. (Cited on page 12.)[Anstreicher 1998] Kurt Anstreicher et Henry Wolkowicz.
On Lagrangian Relaxation of Quadratic MatrixConstraints . SIAM Journal on Matrix Analysis and Applications, vol. 22, 07 1998. (Cited onpages 88 and 144.)[Arjovsky 2017] M. Arjovsky, S. Chintala et L. Bottou.
Wasserstein Generative Adversarial Networks . InInternational Conference on Machine Learning, volume 70, pages 214–223, 2017. (Cited on pages 3,1, 15, 24, 43 and 70.)[Bach 2007] F. Bach et Z. Harchaoui.
Image Classification with Segmentation Graph Kernels . In CVPR,volume 00, pages 1–8, 06 2007. (Cited on pages 5, 3 and 42.)[Bai 2019] Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun et Wei Wang.
SimGNN: ANeural Network Approach to Fast Graph Similarity Computation . In Proceedings of the TwelfthACM International Conference on Web Search and Data Mining, WSDM ’19, page 384–392, NewYork, NY, USA, 2019. Association for Computing Machinery. (Cited on page 162.)[Bai 2020] Yunsheng Bai, Hao Ding, Ken Gu, Yizhou Sun et Wei Wang.
Learning-Based Efficient GraphSimilarity Computation via Multi-Scale Convolutional Set Matching . Proceedings of the AAAIConference on Artificial Intelligence, vol. 34, pages 3219–3226, 04 2020. (Cited on page 162.)[Bakir 2007] Gükhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskaret S. V. N. Vishwanathan. Predicting structured data (neural information processing). The MITPress, 2007. (Cited on page 42.)[Banerjee 2007] Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu et Dharmendra S.Modha.
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Ap-proximation . Journal of Machine Learning Research, vol. 8, pages 1919–1986, 2007. (Cited onpage 115.)[Barbe 2020] Amélie Barbe, Marc Sebban, Paulo Gonçalves, Pierre Borgnat et Rémi Gribonval.
GraphDiffusion Wasserstein Distances . In European Conference on Machine Learning and Principlesand Practice of Knowledge Discovery in Databases, Ghent, Belgium, Septembre 2020. (Cited onpage 57.)[Battaglia 2018] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Ma-linowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard,J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra,P. Kohli, M. Botvinick, O. Vinyals, Y. Li et R. Pascanu.
Relational inductive biases, deep learning,and graph networks . ArXiv e-prints, Juin 2018. (Cited on pages 4, 2, 3 and 42.)[Bellet 2013] Aurélien Bellet, Amaury Habrard et Marc Sebban.
A Survey on Metric Learning for Feature Vectors and Structured Data, 2013. (Cited on page 162.)[Benamou 2000] Jean-David Benamou et Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, vol. 84, no. 3, pages 375–393, Jan 2000. (Cited on page 65.)[Benamou 2015] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna et Gabriel Peyré.
Iterative bregman projections for regularized transportation problems . SIAM Journal on ScientificComputing, vol. 37, no. 2, pages A1111–A1138, 2015. (Cited on pages 22, 23 and 26.)[Bento 2019] José Bento et Stratis Ioannidis.
A family of tractable graph metrics . Applied NetworkScience, vol. 4, no. 1, page 107, Novembre 2019. (Cited on page 162.)[Berg 2005] Alexander Berg, Tamara Berg et Jitendra Malik.
Shape Matching and Object RecognitionUsing Low Distortion Correspondences. volume 1, pages 26–33, 01 2005. (Cited on page 35.)[Bernard 2018] Florian Bernard, Christian Theobalt et Michael Moeller.
DS*: Tighter Lifting-Free ConvexRelaxations for Quadratic Matching Problems . pages 4310–4319, 06 2018. (Cited on page 35.)[Bertsimas 1997] D. Bertsimas et J.N. Tsitsiklis. Introduction to linear optimization. Athena Scientific,1997. (Cited on page 20.)[Besl 1992] Paul J. Besl et Neil D. McKay.
A Method for Registration of 3-D Shapes.
IEEE Transactionson Pattern Analysis and Machine Intelligence, vol. 14, pages 239–256, 1992. (Cited on page 102.)[Bigot 2019a] Jérémie Bigot, Elsa Cazelles et Nicolas Papadakis.
Data-driven regularization of Wassersteinbarycenters with an application to multivariate density registration . Information and Inference,vol. 8, no. 4, pages 719–755, 2019. (Cited on page 27.)[Bigot 2019b] Jérémie Bigot, Elsa Cazelles et Nicolas Papadakis.
Penalization of Barycenters in theWasserstein Space . SIAM Journal on Mathematical Analysis, vol. 51, no. 3, pages 2261–2285, 2019.(Cited on pages 23 and 27.)[Billingsley 1999] Patrick Billingsley. Convergence of probability measures. Wiley Series in Probabilityand Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second édition, 1999.A Wiley-Interscience Publication. (Cited on page 28.)[Birkhoff 1946] Garrett Birkhoff.
Tres observaciones sobre el algebra lineal . Univ. Nac. Tucumán Rev.Ser. A, 1946. (Cited on pages 20, 105 and 150.)[Blondel 2020] Mathieu Blondel, André F.T. Martins et Vlad Niculae.
Learning with Fenchel-Young losses .Journal of Machine Learning Research, vol. 21, no. 35, pages 1–69, 2020. (Cited on pages 5 and 3.)[Bokhari 1981] S. H. Bokhari.
On the Mapping Problem . IEEE Trans. Comput., vol. 30, no. 3, page207–214, Mars 1981. (Cited on page 35.)[Bonneel 2011] Nicolas Bonneel, Michiel van de Panne, Sylvain Paris et Wolfgang Heidrich.
DisplacementInterpolation Using Lagrangian Mass Transport . In Proceedings of the 2011 SIGGRAPH AsiaConference, SA ’11, pages 158:1–158:12, New York, NY, USA, 2011. ACM. (Cited on page 65.)[Bonneel 2015] Nicolas Bonneel, Julien Rabin, Gabriel Peyré et Hanspeter Pfister.
Sliced and RadonWasserstein Barycenters of Measures . Journal of Mathematical Imaging and Vision, vol. 1, no. 51,pages 22–45, 2015. (Cited on pages 25, 27, 71 and 73.)
[Bonneel 2019] Nicolas Bonneel et David Coeurjolly.
SPOT: Sliced Partial Optimal Transport . ACMTrans. Graph., vol. 38, no. 4, Juillet 2019. (Cited on page 25.)[Bonnotte 2013] Nicolas Bonnotte.
Unidimensional and Evolution Methods for Optimal Transportation .PhD thesis, 2013. (Cited on pages 24 and 71.)[Borg 2005] I. Borg et P.J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications.Springer, 2005. (Cited on page 161.)[Borgwardt 2005] Karsten M. Borgwardt et Hans-Peter Kriegel.
Shortest-Path Kernels on Graphs . InICDM, ICDM ’05, pages 74–81, Washington, DC, USA, 2005. IEEE Computer Society. (Cited onpages 51 and 52.)[Bourgain 1986] Jean Bourgain.
The metrical interpretation of superreflexivity in banach spaces . IsraelJournal of Mathematics, vol. 56, pages 222–230, 1986. (Cited on page 12.)[Brenier 1991] Yann Brenier.
Polar factorization and monotone rearrangement of vector-valued functions .Communications on Pure and Applied Mathematics, vol. 44, no. 4, pages 375–417, 1991. (Cited onpages 9, 8 and 16.)[Bronstein 2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam et Pierre Vandergheynst.
Geometric deep learning: going beyond euclidean data . IEEE Signal Processing Magazine, vol. 34,no. 4, pages 18–42, 2017. (Cited on page 42.)[Bunne 2019] Charlotte Bunne, David Alvarez-Melis, Andreas Krause et Stefanie Jegelka.
LearningGenerative Models across Incomparable Spaces . In International Conference on Machine Learning,volume 97, 2019. (Cited on pages 39, 70, 76, 77 and 106.)[Burago 2001] D. Burago, I.D. Burago, I.U.D. Burago, I.U.D. Burago, J.D. Burago, Y. Burago, Û.D.Burago, D. Burago, I.U.D. Burago, S. Ivanov et others. A Course in Metric Geometry. CrmProceedings & Lecture Notes. American Mathematical Society, 2001. (Cited on page 33.)[Bures 1969] Donald Bures.
An Extension of Kakutani’s Theorem on Infinite Product Measures to theTensor Product of Semifinite w*-Algebras . Transactions of the American Mathematical Society,vol. 135, pages 199–212, 1969. (Cited on page 18.)[Burkard 1996] Rainer E. Burkard, Bettina Klinz et Rüdiger Rudolf.
Perspectives of Monge Propertiesin Optimization . Discrete Appl. Math., vol. 70, no. 2, page 95–161, Septembre 1996. (Cited onpage 20.)[Burkard 1999] Rainer E. Burkard et Eranda Çela. Linear assignment problems and extensions, pages75–149. Springer US, Boston, MA, 1999. (Cited on page 19.)[Béthune 2020] Louis Béthune, Yacouba Kaloga, Pierre Borgnat, Aurélien Garivier et Amaury Habrard.
Hierarchical and Unsupervised Graph Representation Learning with Loukas’s Coarsening , 2020.(Cited on page 57.)[Caetano 2009] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le et A. J. Smola.
Learning Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 6, pages 1048–1058, 2009. (Cited on page 35.)[Cang 2020] Zixuan Cang et Qing Nie.
Inferring spatial and signaling relationships between cells fromsingle cell transcriptomic data . Nature Communications, vol. 11, no. 1, page 2084, Avril 2020.(Cited on page 57.)[Carlier 2015] Guillaume Carlier, Adam Oberman et Edouard Oudet.
Numerical methods for matchingfor teams and Wasserstein barycenters . ESAIM: Mathematical Modelling and Numerical Analysis,vol. 49, no. 6, pages 1621–1642, Novembre 2015. (Cited on page 26.)[Carrière 2017] Mathieu Carrière, Marco Cuturi et Steve Oudot.
Sliced Wasserstein Kernel for PersistenceDiagrams . In Proceedings of the 34th International Conference on Machine Learning - Volume 70,ICML’17, page 664–673. JMLR.org, 2017. (Cited on page 25.)[Çela 2011] Eranda Çela, Nina S. Schmuck, Shmuel Wimer et Gerhard J. Woeginger.
The Wienermaximum quadratic assignment problem . Discrete Optimization, vol. 8, pages 411–416, 2011.(Cited on page 35.)[Çela 2018] Eranda Çela, Vladimir Deineko et Gerhard J. Woeginger.
New special cases of the QuadraticAssignment Problem with diagonally structured coefficient matrices . European journal of operationalresearch, vol. 267, no. 3, pages 818–834, 2018. (Cited on page 35.)[Çela 2013] Eranda Çela. The quadratic assignment problem: Theory and algorithms, volume 1. SpringerScience & Business Media, 2013. (Cited on page 35.)[Çela 2015] Eranda Çela, Vladimir G. Deineko et Gerhard J. Woeginger.
Well-solvable cases of the QAPwith block-structured matrices . Discrete applied mathematics, vol. 186, pages 56–65, 2015. (Citedon page 35.)[Charlier 2018] Benjamin Charlier, Jean Feydy et Joan Glaunes.
Kernel Operations on the GPU, withautodiff, without memory overflows . https://github.com/getkeops/keops, 2018. (Cited on page 75.)[Chechik 2010] Gal Chechik, Varun Sharma, Uri Shalit et Samy Bengio.
Large Scale Online Learningof Image Similarity Through Ranking . Journal of Machine Learning Research, vol. 11, pages1109–1135, 2010. (Cited on page 162.)[Chen 2018] Xinlei Chen, Li-Jia Li, Li Fei-Fei et Abhinav Gupta.
Iterative Visual Reasoning BeyondConvolutions.
In CVPR, pages 7239–7248. IEEE Computer Society, 2018. (Cited on pages 5 and 3.)[Chizat 2017] Lenaïc Chizat.
Unbalanced Optimal Transport : Models, Numerical Methods, Applications .Thesiss, PSL Research University, Novembre 2017. (Cited on pages 25, 27 and 106.)[Chizat 2018] Lénaïc Chizat et Francis Bach.
On the Global Convergence of Gradient Descent forOver-parameterized Models using Optimal Transport . In S. Bengio, H. Wallach, H. Larochelle,K. Grauman, N. Cesa-Bianchi et R. Garnett, editeurs, Advances in Neural Information ProcessingSystems 31, pages 3036–3046. Curran Associates, Inc., 2018. (Cited on page 65.)[Chowdhury 2019a] Samir Chowdhury et Facundo Mémoli.
The Gromov–Wasserstein distance betweennetworks and stable network invariants . Information and Inference: A Journal of the IMA, vol. 8,no. 4, pages 757–787, 2019. (Cited on pages 27, 28, 32 and 38.)[Chowdhury 2019b] Samir Chowdhury et Tom Needham.
Gromov-Wasserstein Averaging in a RiemannianFramework , 2019. (Cited on page 38.)
[Cohen 2016] Taco Cohen et Max Welling.
Group Equivariant Convolutional Networks . In Maria FlorinaBalcan et Kilian Q. Weinberger, editeurs, Proceedings of The 33rd International Conference onMachine Learning, volume 48 of
Proceedings of Machine Learning Research , pages 2990–2999, NewYork, New York, USA, 20–22 Jun 2016. PMLR. (Cited on pages 5 and 3.)[Collins 2002] Michael Collins.
Discriminative Training Methods for Hidden Markov Models: Theoryand Experiments with Perceptron Algorithms . In Proceedings of the ACL-02 Conference onEmpirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, page 1–8, USA,2002. Association for Computational Linguistics. (Cited on pages 5 and 3.)[Courty 2017] Nicolas Courty, Rémi Flamary, Devis Tuia et Alain Rakotomamonjy.
Optimal transportfor domain adaptation . IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39,no. 9, pages 1853–1865, 2017. (Cited on pages 3, 5, 1, 24, 43, 70, 102 and 110.)[Cramér 1936] H. Cramér et H. Wold.
Some Theorems on Distribution Functions . Journal of the LondonMathematical Society, vol. s1-11, no. 4, pages 290–294, 1936. (Cited on page 134.)[Csiszar 1975] I. Csiszar. I -Divergence Geometry of Probability Distributions and Minimization Problems .The Annals of Probability, vol. 3, no. 1, pages 146–158, 1975. (Cited on pages 6 and 4.)[Cui 2014] Zhen Cui, Hong Chang, Shiguang Shan et Xilin Chen. Generalized Unsupervised ManifoldAlignment . In NIPS, pages 2429–2437. 2014. (Cited on page 102.)[Custic 2016] Ante Custic, Vladyslav Sokol, Abraham Punnen et Binay Bhattacharya.
The Bilinear As-signment Problem: Complexity and polynomially solvable special cases . Mathematical Programming,vol. 166, 2016. (Cited on page 105.)[Cuturi 2013] Marco Cuturi.
Sinkhorn distances: Lightspeed computation of optimal transport . In NIPS,pages 2292–2300, 2013. (Cited on pages 3, 1, 21, 103 and 106.)[Cuturi 2014] Marco Cuturi et Arnaud Doucet.
Fast Computation of Wasserstein Barycenters . In Eric P.Xing et Tony Jebara, editeurs, Proceedings of the 31st International Conference on MachineLearning, volume 32 of
Proceedings of Machine Learning Research , pages 685–693, Bejing, China,22–24 Jun 2014. PMLR. (Cited on pages 26 and 49.)[Cuturi 2017] Marco Cuturi et Mathieu Blondel.
Soft-DTW: a Differentiable Loss Function for Time-Series . In Proceedings of the ICML, volume 70, pages 894–903. PMLR, 06–11 Aug 2017. (Citedon pages 42 and 162.)[Cuturi 2018] Marco Cuturi et Gabriel Peyré.
Semi-dual Regularized Optimal Transport . SIAM Review,vol. 60, pages 941–965, 2018. (Cited on page 26.)[Dantzig 1997] George B. Dantzig et Mukund N. Thapa. Linear programming 1: Introduction. Springer-Verlag, Berlin, Heidelberg, 1997. (Cited on pages 19 and 20.)[Day 1985] William HE Day.
Optimal algorithms for comparing trees with labeled leaves . Journal ofclassification, vol. 2, no. 1, pages 7–28, 1985. (Cited on pages 4, 3 and 42.)[De Condorcet 1781] Nicolas De Condorcet.
Sur les déblais et les remblais, 1781. (Cited on pages 3 and 1.)[Debnath 1991] Asim Kumar Debnath, Rosa L. Lopez de Compadre, Gargi Debnath, Alan J. Shusterman et Corwin Hansch.
Structure-activity relationship of mutagenic aromatic and heteroaromatic nitrocompounds. Correlation with molecular orbital energies and hydrophobicity . Journal of MedicinalChemistry, vol. 34, no. 2, pages 786–797, 1991. (Cited on page 51.)[Defferrard 2016] Michaël Defferrard, Xavier Bresson et Pierre Vandergheynst.
Convolutional NeuralNetworks on Graphs with Fast Localized Spectral Filtering . In NIPS, pages 3844–3852. 2016. (Citedon page 42.)[Demetci 2020] Pinar Demetci, Rebecca Santorella, Björn Sandstede, William Stafford Noble et Ritamb-hara Singh.
Gromov-Wasserstein optimal transport to align single-cell multi-omics data . bioRxiv,2020. (Cited on page 39.)[Deshpande 2018] Ishan Deshpande, Ziyu Zhang et Alexander G. Schwing.
Generative Modeling Using theSliced Wasserstein Distance . In IEEE Conference on Computer Vision and Pattern Recognition,pages 3483–3491, 2018. (Cited on pages 24, 71 and 77.)[Deshpande 2019] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, SanmiKoyejo, Zhizhen Zhao, David Forsyth et Alexander G. Schwing.
Max-Sliced Wasserstein Distanceand Its Use for GANs . In The IEEE Conference on Computer Vision and Pattern Recognition(CVPR), June 2019. (Cited on pages 24, 74 and 160.)[Dhillon 2003] Inderjit S. Dhillon, Subramanyam Mallela et Dharmendra S. Modha.
Information-theoreticCo-clustering . In SIGKDD, pages 89–98, 2003. (Cited on page 113.)[Ding 2006] C. Ding, T. Li, W. Peng et H. Park.
Orthogonal Nonnegative Matrix Tri-factorizations forClustering . In Proceedings ACM SIGKDD, pages 126–135, 2006. (Cited on page 113.)[Donahue 2014] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng et T. Darrell.
DeCAF:A Deep Convolutional Activation Feature for Generic Visual Recognition . In ICML, 2014. (Citedon page 111.)[Dudley 1969] R. M. Dudley.
The Speed of Mean Glivenko-Cantelli Convergence . Ann. Math. Statist.,vol. 40, no. 1, pages 40–50, 02 1969. (Cited on page 18.)[Dym 2017] Nadav Dym, Haggai Maron et Yaron Lipman.
DS++: A Flexible, Scalable and Provably Tight Relaxation for Matching Problems. ACM Trans. Graph., vol. 36, no. 6, Novembre 2017. (Cited on page 35.)[Džeroski 2001] Sašo Džeroski, Luc De Raedt et Kurt Driessens. Relational Reinforcement Learning. Machine Learning, vol. 43, no. 1, pages 7–52, Apr 2001. (Cited on pages 5 and 3.)[Elshafei 1977] Alwalid N. Elshafei.
Hospital Layout as a Quadratic Assignment Problem . OperationalResearch Quarterly (1970-1977), vol. 28, no. 1, pages 167–179, 1977. (Cited on page 35.)[Ezuz 2017] Danielle Ezuz, Justin Solomon, Vladimir G. Kim et Mirela Ben-Chen.
GWCNN: A MetricAlignment Layer for Deep Shape Analysis . Computer Graphics Forum, vol. 36, no. 5, pages 49–57,2017. (Cited on pages 39, 70 and 107.)[Faliszewski 2019] P. Faliszewski, P. Skowron, A. Slinko, S. Szufa et N. Talmon.
How Similar Are TwoElections?
In AAAI, pages 1909–1916, 2019. (Cited on pages 106, 109, 110, 154 and 155.)
[Fatras 2020] Kilian Fatras, Younes Zine, Rémi Flamary, Rémi Gribonval et Nicolas Courty.
Learning withminibatch Wasserstein : asymptotic and gradient properties . In Silvia Chiappa et Roberto Calandra,editeurs, Proceedings of the Twenty Third International Conference on Artificial Intelligence andStatistics, volume 108 of
Proceedings of Machine Learning Research , pages 2131–2141, Online,26–28 Aug 2020. PMLR. (Cited on pages 19 and 24.)[Feragen 2013] Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne et Karsten Borgwardt.
Scalable kernels for graphs with continuous attributes . In Advances in Neural Information ProcessingSystems 26, pages 216–224. 2013. (Cited on pages 51 and 52.)[Ferradans 2014] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré et Jean-François Aujol.
Regularizeddiscrete optimal transport . SIAM Journal on Imaging Sciences, vol. 7, no. 3, pages 1853–1882, 2014.(Cited on pages 25 and 47.)[Fey 2020] Matthias Fey, Jan E. Lenssen, Christopher Morris, Jonathan Masci et Nils M. Kriege.
DeepGraph Matching Consensus . In International Conference on Learning Representations, 2020. (Citedon page 39.)[Feydy 2019] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve etGabriel Peyré.
Interpolating between Optimal Transport and MMD using Sinkhorn Divergences . InKamalika Chaudhuri et Masashi Sugiyama, editeurs, Proceedings of Machine Learning Research,volume 89 of
Proceedings of Machine Learning Research , pages 2681–2690. PMLR, 16–18 Apr 2019.(Cited on page 23.)[Figalli 2010] Alessio Figalli.
Regularity of optimal transport maps [after Ma-Trudinger-Wang and Loeper]. In Séminaire Bourbaki : volume 2008/2009 exposés 997-1011 - Avec table par noms d’auteurs de 1848/49 à 2008/09, numéro 332 de Astérisque, pages 341–368. Société mathématique de France, 2010. talk:1009. (Cited on page 16.)[Flamary 2014] R. Flamary, N. Courty, D. Tuia et A. Rakotomamonjy. Optimal transport with Laplacian regularization: Applications to domain adaptation and shape matching. 2014. (Cited on page 47.)[Flamary 2017] Rémi Flamary et Nicolas Courty.
POT Python Optimal Transport library , 2017. (Citedon pages 49 and 76.)[Flamary 2019] Rémi Flamary, Karim Lounici et André Ferrari.
Concentration bounds for linear Mongemapping estimation and optimal transport domain adaptation , 2019. (Cited on pages 18 and 145.)[Frisch 2002] Uriel Frisch, Sabino Matarrese, Roya Mohayaee et Andrei Sobolevski.
A reconstruction ofthe initial conditions of the Universe by optimal mass transportation . Nature, vol. 417, pages 260–2,06 2002. (Cited on pages 3 and 1.)[Frogner 2015] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya et Tomaso A Poggio.
Learning with a Wasserstein Loss . In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama etR. Garnett, editeurs, Advances in Neural Information Processing Systems 28, pages 2053–2061.Curran Associates, Inc., 2015. (Cited on pages 3 and 1.)[Frogner 2019] Charlie Frogner, Farzaneh Mirzazadeh et Justin Solomon.
Learning Embeddings into Entropic Wasserstein Spaces. ICLR, 2019. (Cited on page 12.)[Gallo 1977] Giorgio Gallo et Aydin Ülkücü.
Bilinear Programming: An Exact Algorithm . MathematicalProgramming, vol. 12, pages 173–194, 1977. (Cited on page 105.)[Gangbo 1996] Wilfrid Gangbo et Robert J. McCann.
The geometry of optimal transportation . ActaMath., vol. 177, no. 2, pages 113–161, 1996. (Cited on page 16.)[Gärtner 2003] Thomas Gärtner, Peter Flach et Stefan Wrobel.
On graph kernels: Hardness results andefficient alternatives . In IN: CONFERENCE ON LEARNING THEORY, pages 129–143, 2003.(Cited on page 52.)[Gelfand 2005] Natasha Gelfand, Niloy J. Mitra, Leonidas J. Guibas et Helmut Pottmann.
Robust GlobalRegistration . In Mathieu Desbrun et Helmut Pottmann, editeurs, Eurographics Symposium onGeometry Processing 2005. The Eurographics Association, 2005. (Cited on page 37.)[Genevay 2016] Aude Genevay, Marco Cuturi, Gabriel Peyré et Francis Bach.
Stochastic Optimizationfor Large-scale Optimal Transport . In Advances in Neural Information Processing Systems, pages3440–3448, 2016. (Cited on pages 3, 1 and 23.)[Genevay 2018] Aude Genevay, Gabriel Peyre et Marco Cuturi.
Learning Generative Models with SinkhornDivergences . In Amos Storkey et Fernando Perez-Cruz, editeurs, Proceedings of the Twenty-FirstInternational Conference on Artificial Intelligence and Statistics, volume 84 of
Proceedings ofMachine Learning Research , pages 1608–1617, Playa Blanca, Lanzarote, Canary Islands, 09–11Apr 2018. PMLR. (Cited on pages 3, 1 and 23.)[Genevay 2019] Aude Genevay, Lénaïc Chizat, Francis Bach, Marco Cuturi et Gabriel Peyré.
SampleComplexity of Sinkhorn Divergences . In ICML, pages 1574–1583, 2019. (Cited on pages 23 and 103.)[Gens 2014] Robert Gens et Pedro M Domingos.
Deep Symmetry Networks . In Z. Ghahramani, M. Welling,C. Cortes, N. D. Lawrence et K. Q. Weinberger, editeurs, Advances in Neural Information ProcessingSystems 27, pages 2537–2545. Curran Associates, Inc., 2014. (Cited on pages 5 and 3.)[Geoffrion 1976] A. M. Geoffrion et G. W. Graves.
Scheduling Parallel Production Lines with ChangeoverCosts: Practical Application of a Quadratic Assignment/LP Approach . Operations Research, vol. 24,no. 4, pages 595–610, 1976. (Cited on page 35.)[Givens 1984] Clark R. Givens et Rae Michael Shortt.
A class of Wasserstein metrics for probabilitydistributions.
Michigan Math. J., vol. 31, no. 2, pages 231–240, 1984. (Cited on page 17.)[Glunt 1990] W. Glunt, T. L. Hayden, S. Hong et J. Wells.
An Alternating Projection Algorithm forComputing the Nearest Euclidean Distance Matrix . SIAM J. Matrix Anal. Appl., vol. 11, no. 4,page 589–600, Septembre 1990. (Cited on page 161.)[Goldfeld 2020] Ziv Goldfeld et Kristjan Greenewald.
Gaussian-Smoothed Optimal Transport: MetricStructure and Statistical Efficiency . In Silvia Chiappa et Roberto Calandra, editeurs, Proceedingsof the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108of
Proceedings of Machine Learning Research , pages 3327–3337, Online, 26–28 Aug 2020. PMLR.(Cited on page 19.)[Goodall 1991] Colin Goodall.
Procrustes Methods in the Statistical Analysis of Shape . Journal of theRoyal Statistical Society: Series B (Methodological), vol. 53, no. 2, pages 285–321, 1991. (Cited onpage 102.)
[Gordaliza 2019] Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa et Jean-Michel Loubes.
ObtainingFairness using Optimal Transport Theory . In Kamalika Chaudhuri et Ruslan Salakhutdinov,editeurs, Proceedings of the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research , pages 2357–2365, Long Beach, California, USA, 09–15Jun 2019. PMLR. (Cited on pages 3, 1 and 25.)[Gower 2004] John C. Gower et Garmt B. Dijksterhuis. Procrustes problems, volume 30 of
Oxford Statistical Science Series. Oxford University Press, 2004. (Cited on page 102.)
[Gramfort 2015] A. Gramfort, G. Peyré et M. Cuturi. Fast Optimal Transport Averaging of Neuroimaging Data. In Sebastien Ourselin, Daniel C. Alexander, Carl-Fredrik Westin et M. Jorge Cardoso, editeurs, Information Processing in Medical Imaging, pages 261–272, Cham, 2015. Springer International Publishing. (Cited on page 25.)
[Grave 2019] Edouard Grave, Armand Joulin et Quentin Berthet. Unsupervised Alignment of Embeddings with Wasserstein Procrustes. In AISTATS, pages 1880–1890, 2019. (Cited on pages 33 and 102.)
[Gretton 2007] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf et Alex J. Smola. A Kernel Method for the Two-Sample-Problem. In B. Schölkopf, J. C. Platt et T. Hoffman, editeurs, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007. (Cited on pages 6 and 4.)
[Gromov 1999] M. Gromov, J. Lafontaine et P. Pansu. Metric Structures for Riemannian and Non-Riemannian Spaces. Progress in Mathematics. Birkhäuser, 1999. (Cited on page 33.)
[Gulrajani 2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin et Aaron C. Courville. Improved Training of Wasserstein GANs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan et R. Garnett, editeurs, Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017. (Cited on page 24.)
[Haker 2001] S. Haker et A. Tannenbaum. Optimal mass transport and image registration. In Proceedings IEEE Workshop on Variational and Level Set Methods in Computer Vision, pages 29–36, 2001. (Cited on pages 3, 1 and 102.)
[Hartigan 1972] J. A. Hartigan. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, vol. 67, no. 337, pages 123–129, 1972. (Cited on page 113.)
[Hinton 2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath et Brian Kingsbury. Deep Neural Networks for Acoustic Modeling in Speech Recognition. Signal Processing Magazine, 2012. (Cited on pages 5 and 3.)
[Hjort 2010] N. Hjort, C. Holmes, P. Mueller et S. Walker. Bayesian nonparametrics: Principles and practice. Cambridge University Press, Cambridge, UK, 2010. (Cited on pages 5 and 3.)
[Horn 1991] Roger A. Horn et Charles R. Johnson. Topics in matrix analysis. Cambridge University Press, 1991. (Cited on page 88.)
[Horst 1996] R. Horst et H. Tuy. Global optimization: Deterministic approaches. Springer Berlin Heidelberg, 1996. (Cited on pages 105 and 106.)
[Huang 2016] G. Huang, C. Guo, M. Kusner, Y. Sun, F. Sha et K. Weinberger.
Supervised Word Mover’s Distance. In Advances in Neural Information Processing Systems, pages 4862–4870, 2016. (Cited on pages 43 and 70.)
[Hull 1994] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pages 550–554, 1994. (Cited on pages 5 and 3.)
[Isserlis 1918] L. Isserlis. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, vol. 12, no. 1-2, pages 134–139, 1918. (Cited on pages 96 and 146.)
[Jaggi 2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Sanjoy Dasgupta et David McAllester, editeurs, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427–435, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. (Cited on pages 47 and 48.)
[Jianbo Shi 2000] Jianbo Shi et J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pages 888–905, 2000. (Cited on pages 5 and 3.)
[Jiang 2014] Bo Jiang et Yu-Hong Dai. A Framework of Constraint Preserving Update Schemes for Optimization on Stiefel Manifold. Mathematical Programming, 09 2014. (Cited on page 97.)
[Kantorovich 1942] L. Kantorovich. On the translocation of masses. C.R. (Doklady) Acad. Sci. URSS (N.S.), vol. 37, pages 199–201, 1942. (Cited on page 8.)
[Karcher 2014] Hermann Karcher. Riemannian Center of Mass and so called Karcher mean, 2014. (Cited on page 25.)
[Kersting 2016] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel et Marion Neumann. Benchmark Data Sets for Graph Kernels, 2016. (Cited on page 51.)
[Kipf 2016] Thomas N. Kipf et Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. CoRR, vol. abs/1609.02907, 2016. (Cited on page 42.)
[Kochurov 2020] Max Kochurov, Rasul Karimov et Serge Kozlukov. Geoopt: Riemannian Optimization in PyTorch, 2020. (Cited on page 97.)
[Kolouri 2016] Soheil Kolouri, Yang Zou et Gustavo K. Rohde. Sliced Wasserstein Kernels for Probability Distributions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. (Cited on pages 24 and 71.)
[Kolouri 2017] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev et G. K. Rohde. Optimal Mass Transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, vol. 34, no. 4, pages 43–59, 2017. (Cited on pages 3 and 1.)
[Kolouri 2019a] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau et Gustavo Rohde. Generalized Sliced Wasserstein Distances. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox et R. Garnett, editeurs, Advances in Neural Information Processing Systems 32, pages 261–272. Curran Associates, Inc., 2019. (Cited on pages 24 and 25.)
[Kolouri 2019b] Soheil Kolouri, Phillip E. Pope, Charles E. Martin et Gustavo K. Rohde. Sliced Wasserstein Auto-Encoders. In International Conference on Learning Representations, 2019. (Cited on page 71.)
[Kondor 2018] Risi Kondor et Shubhendu Trivedi. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups, 2018. (Cited on pages 5 and 3.)
[Konno 1976a] Hiroshi Konno. A Cutting Plane Algorithm for Solving Bilinear Programs. Math. Program., vol. 11, no. 1, pages 14–27, 1976. (Cited on page 106.)
[Konno 1976b] Hiroshi Konno. Maximization of a convex quadratic function under linear constraints. Mathematical Programming, vol. 11, no. 1, pages 117–127, 1976. (Cited on pages 107 and 150.)
[Koopmans 1957] Tjalling Koopmans et Martin J. Beckmann. Assignment Problems and the Location of Economic Activities. Econometrica: Journal of the Econometric Society, pages 53–76, 1957. (Cited on pages 34 and 71.)
[Korba 2018] Anna Korba, Alexandre Garcia et Florence d’Alché-Buc. A Structured Prediction Approach for Label Ranking. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi et R. Garnett, editeurs, Advances in Neural Information Processing Systems 31, pages 8994–9004. Curran Associates, Inc., 2018. (Cited on pages 5 and 3.)
[Kriege 2016] Nils M. Kriege, Pierre-Louis Giscard et Richard C. Wilson. On Valid Optimal Assignment Kernels and Applications to Graph Classification. CoRR, vol. abs/1606.01141, 2016. (Cited on pages 4, 3, 42, 51 and 52.)
[Kriege 2018] Nils Kriege, Matthias Fey, Denis Fisseler, Petra Mutzel et Frank Weichert. Recognizing Cuneiform Signs Using Graph Based Methods. In International Workshop on Cost-Sensitive Learning (COST), 2018. (Cited on pages 51 and 53.)
[Kriege 2020] Nils M. Kriege, Fredrik D. Johansson et Christopher Morris. A survey on graph kernels. Applied Network Science, vol. 5, no. 1, page 6, Janvier 2020. (Cited on page 162.)
[Ktena 2017] Sofia Ira Ktena, Sarah Parisot, Enzo Ferrante, Martin Rajchl, Matthew Lee, Ben Glocker et Daniel Rueckert. Distance metric learning using graph convolutional networks: Application to functional brain networks. In MICCAI, pages 469–477, 2017. (Cited on pages 4, 3 and 42.)
[Kusner 2015] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin et Kilian Q. Weinberger. From Word Embeddings to Document Distances. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 957–966. JMLR.org, 2015. (Cited on pages 3, 7, 1 and 5.)
[Kwon 2020] Oh-Hyun Kwon et Kwan-Liu Ma. A Deep Generative Model for Graph Layout. IEEE Transactions on Visualization and Computer Graphics, vol. 26, 01 2020. (Cited on page 39.)
[Laclau 2017] Charlotte Laclau, Ievgen Redko, Basarab Matei, Younès Bennani et Vincent Brault. Co-clustering through Optimal Transport. In ICML, pages 1955–1964, 2017. (Cited on pages 113, 114 and 115.)
[Lacoste-Julien 2016] Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016. (Cited on page 48.)
[Lafferty 2001] John D. Lafferty, Andrew McCallum et Fernando C. N. Pereira.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. (Cited on pages 5 and 3.)
[Lai 1988] Hang-Chin Lai et Lai-Jui Lin. The Fenchel-Moreau Theorem for Set Functions. Proceedings of the American Mathematical Society, vol. 103, no. 1, pages 85–90, 1988. (Cited on page 89.)
[Lai 2014] Rongjie Lai et Hongkai Zhao. Multi-scale Non-Rigid Point Cloud Registration Using Robust Sliced-Wasserstein Distance via Laplace-Beltrami Eigenmap. SIAM Journal on Imaging Sciences, vol. 10, 06 2014. (Cited on page 74.)
[Laporte 1988] Gilbert Laporte et Hélène Mercure. Balancing hydraulic turbine runners: A quadratic assignment problem. European Journal of Operational Research, vol. 35, no. 3, pages 378–381, 1988. (Cited on page 35.)
[Lassalle 2018] Rémi Lassalle. Causal transport plans and their Monge–Kantorovich problems. Stochastic Analysis and Applications, vol. 36, no. 3, pages 452–484, 2018. (Cited on page 162.)
[LeCun 2010] Yann LeCun et Corinna Cortes. MNIST handwritten digit database. 2010. (Cited on pages 5 and 3.)
[Lee 2019] John Lee, Max Dabagia, Eva Dyer et Christopher Rozell. Hierarchical Optimal Transport for Multimodal Distribution Alignment. In NeurIPS, pages 13474–13484. Curran Associates, Inc., 2019. (Cited on page 109.)
[Lei 2020] Jing Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv: Statistics Theory, pages 767–798, 2020. (Cited on page 19.)
[Li 2019] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals et Pushmeet Kohli. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Kamalika Chaudhuri et Ruslan Salakhutdinov, editeurs, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 3835–3845. PMLR, 2019. (Cited on page 162.)
[Liberti 2014] Leo Liberti, Carlile Lavor, Nelson Maculan et Antonio Mucherino. Euclidean Distance Geometry and Applications. SIAM Review, vol. 56, no. 1, pages 3–69, 2014. (Cited on page 161.)
[Lin 2020] Tianyi Lin, Zeyu Zheng, Elynn Y. Chen, Marco Cuturi et Michael I. Jordan. On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification, 2020. (Cited on pages 19 and 24.)
[Liu 2016] Huikang Liu, Weijie Wu et Anthony Man-Cho So. Quadratic Optimization with Orthogonality Constraints: Explicit Lojasiewicz Exponent and Linear Convergence of Line-Search Methods. In Maria Florina Balcan et Kilian Q. Weinberger, editeurs, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1158–1167, New York, New York, USA, 20–22 Jun 2016. PMLR. (Cited on page 97.)
[Liutkus 2019] Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus et Fabian-Robert Stöter. Sliced-Wasserstein Flows: Nonparametric Generative Modeling via Optimal Transport and Diffusions. In Kamalika Chaudhuri et Ruslan Salakhutdinov, editeurs, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4104–4113, Long Beach, California, USA, 09–15 Jun 2019. PMLR. (Cited on page 71.)
[Loiola 2007] Eliane Loiola, Nair Abreu, Paulo Boaventura-Netto, Peter Hahn et Tania Querido.
A survey of the quadratic assignment problem. European Journal of Operational Research, vol. 176, pages 657–690, 2007. (Cited on pages 34, 35 and 47.)
[Loukas 2019] Andreas Loukas. Graph Reduction with Spectral and Cut Guarantees. Journal of Machine Learning Research, vol. 20, no. 116, pages 1–42, 2019. (Cited on pages 54 and 162.)
[Luss 2007] Ronny Luss et Alexandre d’Aspremont. Support Vector Machine Classification with Indefinite Kernels. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 953–960, 2007. (Cited on page 52.)
[Lyu 2020] Boyang Lyu, Thao Pham, Giles Blaney, Zachary Haga, Sergio Fantini et Shuchin Aeron. Domain Adaptation for Robust Workload Classification using fNIRS, 2020. (Cited on page 57.)
[Lyzinski 2016] V. Lyzinski, D. E. Fishkind, M. Fiori, J. T. Vogelstein, C. E. Priebe et G. Sapiro. Graph Matching: Relax at Your Own Risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pages 60–73, 2016. (Cited on page 35.)
[Lévy 2018] Bruno Lévy et Erica L. Schwindt. Notions of optimal transport theory and how to implement them on a computer. Computers & Graphics, vol. 72, pages 135–148, 2018. (Cited on page 25.)
[Ma 2005] Xinan Ma, Neil Trudinger et Xu-Jia Wang. Regularity of Potential Functions of the Optimal Transportation Problem. Archive for Rational Mechanics and Analysis, vol. 177, pages 151–183, 2005. (Cited on page 16.)
[Maclaurin 2015] Dougal Maclaurin, David Duvenaud et Ryan P. Adams. Autograd: Effortless gradients in numpy. In ICML 2015 AutoML Workshop, 2015. (Cited on page 75.)
[Maron 2018] Haggai Maron et Yaron Lipman. (Probably) Concave Graph Matching. In Advances in Neural Information Processing Systems, pages 408–418, 2018. (Cited on pages 35, 72, 108, 132 and 153.)
[McCann 1995] Robert J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Math. J., vol. 80, no. 2, pages 309–323, 1995. (Cited on page 83.)
[McCann 1997] Robert J. McCann. A Convexity Principle for Interacting Gases. Advances in Mathematics, vol. 128, no. 1, pages 153–179, Juin 1997. (Cited on pages 17, 18 and 27.)
[Meghwanshi 2018] Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai et Bamdev Mishra. McTorch, a manifold optimization library for deep learning. arXiv preprint arXiv:1810.01811, 2018. (Cited on page 75.)
[Memoli 2011] Facundo Memoli. Gromov-Wasserstein Distances and the Metric Approach to Object Matching. Foundations of Computational Mathematics, pages 1–71, 2011. (Cited on pages 3, 2, 7, 27, 33, 37, 39, 43 and 102.)
[Mémoli 2018] Facundo Mémoli et Tom Needham.
Gromov-Monge quasi-metrics and distance distributions. arXiv:1810.09646, 2018. (Cited on pages 72 and 95.)
[Mensch 2018] Arthur Mensch et Mathieu Blondel. Differentiable Dynamic Programming for Structured Prediction and Attention. In Jennifer Dy et Andreas Krause, editeurs, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3462–3471, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. (Cited on pages 5 and 3.)
[Mikolov 2013a] Tomas Mikolov, Kai Chen, Greg S. Corrado et Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, 2013. (Cited on pages 5 and 3.)
[Mikolov 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado et Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani et K. Q. Weinberger, editeurs, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013. (Cited on pages 5 and 3.)
[Monge 1781] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences, pages 666–704, 1781. (Cited on pages 3, 1, 7 and 20.)
[Murty 1988] K. G. Murty. Linear Complementarity, Linear and Nonlinear Programming. Sigma series in applied mathematics. Heldermann, 1988. (Cited on page 129.)
[Muzellec 2018] Boris Muzellec et Marco Cuturi. Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi et R. Garnett, editeurs, Advances in Neural Information Processing Systems 31, pages 10237–10248. Curran Associates, Inc., 2018. (Cited on page 18.)
[Nadif 2008] M. Nadif et G. Govaert. Algorithms for Model-based Block Gaussian Clustering. In DMIN’08, the 2008 International Conference on Data Mining, 2008. (Cited on page 113.)
[Nadjahi 2019] Kimia Nadjahi, Alain Durmus, Umut Simsekli et Roland Badeau. Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance. In Advances in Neural Information Processing Systems 32, pages 250–260. Curran Associates, Inc., 2019. (Cited on page 24.)
[Nadjahi 2020] Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour et Umut Simsekli. Statistical and Topological Properties of Sliced Probability Divergences, 03 2020. (Cited on page 24.)
[Nenna 2016] Luca Nenna. Numerical Methods for Multi-Marginal Optimal Transportation. PhD thesis, 12 2016. (Cited on pages 23 and 25.)
[Neumann 2016] Marion Neumann, Roman Garnett, Christian Bauckhage et Kristian Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, vol. 102, no. 2, pages 209–245, Feb 2016. (Cited on page 52.)
[Niepert 2016] Mathias Niepert, Mohamed Ahmed et Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. In ICML, volume 48 of Proceedings of Machine Learning Research, pages 2014–2023, New York, New York, USA, 2016. PMLR. (Cited on page 52.)
[Nikolentzos 2017] Giannis Nikolentzos, Polykarpos Meladianos et Michalis Vazirgiannis.
Matching Node Embeddings for Graph Similarity. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2429–2435, 2017. (Cited on page 43.)
[Niles-Weed 2019] Jonathan Niles-Weed et Philippe Rigollet. Estimation of Wasserstein distances in the Spiked Transport Model. 09 2019. (Cited on page 19.)
[Nowicki 2001] Krzysztof Nowicki et Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, vol. 96, no. 455, pages 1077–1087, 2001. (Cited on pages 56 and 57.)
[Ortega 2018] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura et P. Vandergheynst. Graph Signal Processing: Overview, Challenges, and Applications. Proceedings of the IEEE, vol. 106, no. 5, pages 808–828, 2018. (Cited on page 162.)
[Pardalos 1987] Panos M. Pardalos et J. Ben Rosen, editeurs. Bilinear programming methods for nonconvex quadratic problems, pages 75–83. Springer Berlin Heidelberg, 1987. (Cited on page 105.)
[Paszke 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga et Adam Lerer. Automatic differentiation in PyTorch. 2017. (Cited on page 75.)
[Patrikainen 2006] A. Patrikainen et M. Meila. Comparing subspace clusterings. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 7, pages 902–916, 2006. (Cited on page 113.)
[Paty 2019] François-Pierre Paty et Marco Cuturi. Subspace Robust Wasserstein Distances. In Kamalika Chaudhuri et Ruslan Salakhutdinov, editeurs, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5072–5081, Long Beach, California, USA, 09–15 Jun 2019. PMLR. (Cited on pages 24, 25, 74 and 142.)
[Pearl 1986] J. Pearl. Fusion, Propagation, and Structuring in Belief Networks. Artif. Intell., vol. 29, no. 3, pages 241–288, Septembre 1986. (Cited on pages 5 and 3.)
[Pearl 2009] Judea Pearl. Causality: Models, reasoning and inference. Cambridge University Press, New York, NY, USA, 2nd édition, 2009. (Cited on pages 5 and 3.)
[Pennington 2014] Jeffrey Pennington, Richard Socher et Christopher Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, Octobre 2014. Association for Computational Linguistics. (Cited on pages 5 and 3.)
[Petersen 2012] K. B. Petersen et M. S. Pedersen. The Matrix Cookbook, nov 2012. Version 20121115. (Cited on page 36.)
[Peyré 2016] Gabriel Peyré, Marco Cuturi et Justin Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML, pages 2664–2672, 2016. (Cited on pages 23, 34, 35, 36, 37, 38, 43, 47, 48, 49, 62, 71, 76, 93, 103, 106, 108, 150 and 151.)
[Peyré 2019] Gabriel Peyré et Marco Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning, vol. 11, pages 355–607, 2019. (Cited on pages 7, 11, 16, 20, 22, 24, 25, 43, 73 and 149.)
[Puy 2020] Gilles Puy, Alexandre Boulch et Renaud Marlet.
FLOT: Scene Flow on Point Clouds Guided by Optimal Transport, 2020. (Cited on page 57.)
[Rabin 2011] Julien Rabin, Gabriel Peyré, Julie Delon et Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011. (Cited on pages 24 and 71.)
[Rabin 2012] Julien Rabin, Gabriel Peyré, Julie Delon et Marc Bernot. Wasserstein Barycenter and Its Application to Texture Mixing. In Alfred M. Bruckstein, Bart M. ter Haar Romeny, Alexander M. Bronstein et Michael M. Bronstein, editeurs, Scale Space and Variational Methods in Computer Vision, pages 435–446, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. (Cited on page 25.)
[Rangarajan 1997a] Anand Rangarajan, Haili Chui et Fred L. Bookstein. The softassign Procrustes matching algorithm. In Information Processing in Medical Imaging, pages 29–42, 1997. (Cited on page 102.)
[Rangarajan 1997b] Anand Rangarajan, Alan L. Yuille, Steven Gold et Eric Mjolsness. A Convergence Proof for the Softassign Quadratic Assignment Algorithm. In M. C. Mozer, M. I. Jordan et T. Petsche, editeurs, Advances in Neural Information Processing Systems 9, pages 620–626. MIT Press, 1997. (Cited on page 37.)
[Rangarajan 1999] Anand Rangarajan, A. Yuille et Eric Mjolsness. Convergence Properties of the Softassign Quadratic Assignment Algorithm. Neural Computation, vol. 11, pages 1455–1474, 09 1999. (Cited on page 37.)
[Rapp 1995] Reinhard Rapp. Identifying Word Translations in Non-Parallel Texts. In ACL, pages 320–322, 1995. (Cited on page 102.)
[Raymond 2002] John Raymond, Eleanor Gardiner et Peter Willett. RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. Comput. J., vol. 45, pages 631–644, 04 2002. (Cited on page 162.)
[Redko 2019] I. Redko, N. Courty, R. Flamary et D. Tuia. Optimal Transport for Multi-source Domain Adaptation under Target Shift. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. (Cited on page 110.)
[Redko 2020] Ievgen Redko, Titouan Vayer, Rémi Flamary et Nicolas Courty. CO-Optimal Transport, 2020. (Cited on pages 10, 5, 102 and 160.)
[Riba 2018] P. Riba, A. Fischer, J. Llados et A. Fornés. Learning Graph Distances with Message Passing Neural Networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2239–2244, 2018. (Cited on page 162.)
[Rocci 2008] R. Rocci et M. Vichi. Two-mode multi-partitioning. Computational Statistics and Data Analysis, vol. 52, no. 4, pages 1984–2003, 2008. (Cited on page 113.)
[Rockafellar 1970] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series. Princeton University Press, Princeton, N. J., 1970. (Cited on pages 13 and 129.)
[Rubner 1998] Y. Rubner, C. Tomasi et L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, pages 59–66, 1998. (Cited on pages 3 and 1.)
[Rubner 2000] Yossi Rubner, Carlo Tomasi et Leonidas J. Guibas.
The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision, vol. 40, no. 2, pages 99–121, Novembre 2000. (Cited on pages 7 and 5.)
[Rudin 1976] W. Rudin. Principles of Mathematical Analysis. International series in pure and applied mathematics. McGraw-Hill, 1976. (Cited on page 138.)
[Rustamov 2013] Raif M. Rustamov, Maks Ovsjanikov, Omri Azencot, Mirela Ben-Chen, Frédéric Chazal et Leonidas Guibas. Map-based exploration of intrinsic shape differences and variability. ACM Transactions on Graphics (TOG), vol. 32, no. 4, page 72, 2013. (Cited on page 76.)
[Saenko 2010] K. Saenko, B. Kulis, M. Fritz et T. Darrell. Adapting Visual Category Models to New Domains. In ECCV, LNCS, pages 213–226, 2010. (Cited on page 111.)
[Sakoe 1978] Hiroaki Sakoe et Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pages 43–49, 1978. (Cited on page 162.)
[Samaria 1994] Ferdinando S. Samaria et Andy C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE workshop on applications of computer vision, pages 138–142. IEEE, 1994. (Cited on page 114.)
[Santambrogio 2015] Filippo Santambrogio. Optimal Transport for Applied Mathematicians, 2015. (Cited on pages 7, 10, 13, 14, 16, 28, 86, 87, 89 and 121.)
[Sato 2020] Ryoma Sato, Marco Cuturi, Makoto Yamada et Hisashi Kashima. Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces, 2020. (Cited on page 37.)
[Schellewald 2001] Christian Schellewald, Stefan Roth et Christoph Schnörr. Evaluation of Convex Optimization Techniques for the Weighted Graph-Matching Problem in Computer Vision. In Bernd Radig et Stefan Florczyk, editeurs, Pattern Recognition, pages 361–368, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. (Cited on page 35.)
[Schiebinger 2019] Geoffrey Schiebinger, Jian Shu, Marcin Tabaka, Brian Cleary, Vidya Subramanian, Aryeh Solomon, Joshua Gould, Siyan Liu, Stacie Lin, Peter Berube, Lia Lee, Jenny Chen, Justin Brumbaugh, Philippe Rigollet, Konrad Hochedlinger, Rudolf Jaenisch, Aviv Regev et Eric S. Lander. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, vol. 176, no. 4, pages 928–943.e22, 2019. (Cited on pages 3 and 1.)
[Schmitzer 2013] Bernhard Schmitzer et Christoph Schnörr. Modelling Convex Shape Priors and Matching Based on the Gromov-Wasserstein Distance. Journal of Mathematical Imaging and Vision, vol. 46, no. 1, pages 143–159, Mai 2013. (Cited on page 39.)
[Schmitzer 2016] Bernhard Schmitzer. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems. SIAM Journal on Scientific Computing, vol. 41, no. 3, pages A1443–A1481, 2016. (Cited on pages 22 and 76.)
[Schrodinger 1931] E. Schrodinger. Über die Umkehrung der Naturgesetze. Sitzungsberichte Preuss. Akad. Wiss. Berlin. Phys. Math., 144, pages 876–879, 1931. (Cited on page 21.)
[Seguy 2018] Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet et Mathieu Blondel.
Large Scale Optimal Transport and Mapping Estimation. In International Conference on Learning Representations, 2018. (Cited on pages 23 and 24.)
[Shan 2010] Hanhuai Shan et Arindam Banerjee. Residual Bayesian Co-clustering for Matrix Approximation. In SDM, pages 223–234, 2010. (Cited on page 113.)
[Shervashidze 2009] Nino Shervashidze, S. V. N. Vishwanathan, Tobias H. Petri, Kurt Mehlhorn et al. Efficient graphlet kernels for large graph comparison, 2009. (Cited on pages 52 and 53.)
[Shervashidze 2011] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn et Karsten M. Borgwardt. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res., vol. 12, pages 2539–2561, Novembre 2011. (Cited on page 42.)
[Siglidis 2018] Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis et Michalis Vazirgianis. GraKeL: A Graph Kernel Library in Python. arXiv e-prints, Juin 2018. (Cited on page 52.)
[Sinkhorn 1967] Richard Sinkhorn et Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math., vol. 21, no. 2, pages 343–348, 1967. (Cited on page 22.)
[Solomon 2014] Justin Solomon, Raif Rustamov, Leonidas Guibas et Adrian Butscher. Wasserstein Propagation for Semi-Supervised Learning. In Eric P. Xing et Tony Jebara, editeurs, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 306–314, Beijing, China, 22–24 Jun 2014. PMLR. (Cited on pages 3 and 1.)
[Solomon 2015] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du et Leonidas Guibas. Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains. ACM Trans. Graph., vol. 34, no. 4, Juillet 2015. (Cited on pages 25 and 27.)
[Solomon 2016] Justin Solomon, Gabriel Peyré, Vladimir G. Kim et Suvrit Sra. Entropic metric alignment for correspondence problems. ACM Transactions on Graphics, vol. 35, no. 4, pages 1–13, 2016. (Cited on pages 35, 39, 76 and 102.)
[Song 2016] Yang Song, Peter J. Schreier, David Ramírez et Tanuj Hasija. Canonical correlation analysis of high-dimensional data with very small sample support. Signal Process., vol. 128, pages 449–458, 2016. (Cited on page 112.)
[Srivastava 2015] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh et David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Guy Lebanon et S. V. N. Vishwanathan, editeurs, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, California, USA, 09–12 May 2015. PMLR. (Cited on page 25.)
[Sturm 2012] Karl-Theodor Sturm. The space of spaces: curvature bounds and gradient flows on the space of metric measure spaces. arXiv e-prints, page arXiv:1208.0434, 2012. (Cited on pages 7, 27, 28, 32, 34, 66 and 80.)
[Sun 2020] Yizhou Sun.
Graph Neural Networks for Graph Search. In Proceedings of the 3rd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA’20, New York, NY, USA, 2020. Association for Computing Machinery. (Cited on page 162.)
[Sutherland 2003] Jeffrey J. Sutherland, Lee A. O’Brien et Donald F. Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pages 1906–1915, 2003. (Cited on page 51.)
[Sutskever 2014] Ilya Sutskever, Oriol Vinyals et Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence et K. Q. Weinberger, editeurs, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014. (Cited on pages 5 and 3.)
[Szegedy 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke et Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015. (Cited on page 111.)
[Takatsu 2011] Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka J. Math., vol. 48, no. 4, pages 1005–1026, 12 2011. (Cited on pages 17, 85 and 95.)
[Tao 2005] Pham Dinh Tao et al. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, vol. 133, no. 1-4, pages 23–46, 2005. (Cited on page 108.)
[Taskar 2005] Ben Taskar, Vassil Chatalbashev, Daphne Koller et Carlos Guestrin. Learning Structured Prediction Models: A Large Margin Approach. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 896–903, New York, NY, USA, 2005. Association for Computing Machinery. (Cited on pages 5 and 3.)
[Thorpe 2017] Matthew Thorpe, Serim Park, Soheil Kolouri, Gustavo K. Rohde et Dejan Slepčev. A Transportation L^p Distance for Signal Analysis. Journal of Mathematical Imaging and Vision, vol. 59, no. 2, pages 187–210, 2017. (Cited on pages 43 and 63.)
[Townsend 2016] James Townsend, Niklas Koep et Sebastian Weichwald. Pymanopt: A python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research, vol. 17, no. 1, pages 4755–4759, 2016. (Cited on page 75.)
[Vayer 2019a] Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard et Nicolas Courty. Optimal Transport for structured data with application on graphs. In International Conference on Machine Learning, volume 97, 2019. (Cited on pages 7, 5, 42 and 160.)
[Vayer 2019b] Titouan Vayer, Rémi Flamary, Nicolas Courty, Romain Tavenard et Laetitia Chapel. Sliced Gromov-Wasserstein. In Advances in Neural Information Processing Systems 32, pages 14753–14763. Curran Associates, Inc., 2019. (Cited on pages 9, 5, 70 and 160.)
[Vayer 2020a] Titouan Vayer, Laetitia Chapel, Nicolas Courty, Rémi Flamary, Yann Soullard et Romain Tavenard. Time Series Alignment with Global Invariances, 2020. (Cited on pages 7, 5 and 160.)
[Vayer 2020b] Titouan Vayer, Laetitia Chapel, Remi Flamary, Romain Tavenard et Nicolas Courty.
FusedGromov-Wasserstein Distance for Structured Objects . Algorithms, vol. 13, no. 9, page 212, Aug2020. (Cited on pages 7, 5, 42 and 160.)[Veraguas 2016] Julio Backhoff Veraguas, Mathias Beiglböck, Yiqing Lin et Anastasiia Zalashko.
Causaltransport in discrete time and applications , 2016. (Cited on page 162.)[Verma 2017] Saurabh Verma et Zhi-Li Zhang.
Hunt For The Unique, Stable, Sparse And Fast FeatureLearning On Graphs . In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathanet R. Garnett, editeurs, NIPS, pages 88–98. Curran Associates, Inc., 2017. (Cited on page 44.)[Villani 2003] C. Villani.
Topics in Optimal Transportation. Graduate Studies in Mathematics, vol. 58, American Mathematical Society, 2003. (Cited on pages 10 and 12.)
[Villani 2008] Cédric Villani. Optimal Transport: Old and New. Springer, 2008. (Cited on pages 7, 11, 14, 15, 66, 122 and 149.)
[Vishwanathan 2010] S. V. N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor et Karsten M. Borgwardt.
Graph Kernels. J. Mach. Learn. Res., vol. 11, pages 1201–1242, 2010. (Cited on pages 42 and 52.)
[Wale 2008] Nikil Wale, Ian A. Watson et George Karypis.
Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, vol. 14, no. 3, pages 347–375, Mar 2008. (Cited on page 51.)
[Wang 1987] Yuchung J. Wang et George Y. Wong.
Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, vol. 82, no. 397, pages 8–19, 1987. (Cited on pages 56 and 57.)
[Wang 2009] Chang Wang et Sridhar Mahadevan.
Manifold Alignment without Correspondence. In IJCAI, pages 1273–1278, 2009. (Cited on page 102.)
[Wang 2011] Chang Wang et Sridhar Mahadevan.
Heterogeneous Domain Adaptation Using Manifold Alignment. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1541–1546. AAAI Press, 2011. (Cited on pages 5 and 3.)
[Wang 2018] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta et Kaiming He.
Non-local Neural Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018. (Cited on pages 5 and 3.)
[Weed 2017] Jonathan Weed et Francis Bach.
Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017. (Cited on pages 18, 64 and 125.)
[Weed 2019a] Jonathan Weed et Quentin Berthet.
Estimation of smooth densities in Wasserstein distance. In Alina Beygelzimer et Daniel Hsu, editeurs, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3118–3119, Phoenix, USA, 25–28 Jun 2019. PMLR. (Cited on page 19.)
[Weed 2019b] Jonathan Daniel Weed.
Statistical Problems in Transport and Alignment. PhD thesis, 04 2019. (Cited on page 19.)
[Wen 2010] Zaiwen Wen et Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, vol. 142, 12 2010. (Cited on page 97.)
[West 2000] Douglas B. West. Introduction to graph theory. Prentice Hall, 2nd edition, September 2000. (Cited on page 46.)
[Willett 1998] Peter Willett, John M. Barnard et Geoffrey M. Downs.
Chemical similarity searching. J. Chem. Inf. Comput. Sci., pages 983–996, 1998. (Cited on page 162.)
[Wills 2020] Peter Wills et François G. Meyer.
Metrics for graph comparison: A practitioner's guide. PLOS ONE, vol. 15, no. 2, pages 1–54, 02 2020. (Cited on page 162.)
[Wilson 1969] A. G. Wilson.
The Use of Entropy Maximising Models, in the Theory of Trip Distribution, Mode Split and Route Split. Journal of Transport Economics and Policy, vol. 3, no. 1, pages 108–126, 1969. (Cited on page 21.)
[Wu 2019] Jiqing Wu, Zhiwu Huang, Dinesh Acharya, Wen Li, Janine Thoma, Danda Pani Paudel et Luc Van Gool.
Sliced Wasserstein Generative Models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. (Cited on page 71.)
[Wu 2020] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang et P. S. Yu.
A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020. (Cited on pages 5, 3 and 42.)
[Xia 2020] Haifeng Xia et Zhengming Ding.
Structure Preserving Generative Cross-Domain Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. (Cited on page 39.)
[Xu 2019a] Hongteng Xu, Dixin Luo et Lawrence Carin.
Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching. In Advances in Neural Information Processing Systems 32, pages 3052–3062. Curran Associates, Inc., 2019. (Cited on page 39.)
[Xu 2019b] Hongteng Xu, Dixin Luo, Hongyuan Zha et Lawrence Carin Duke.
Gromov-Wasserstein Learning for Graph Matching and Node Embedding. In Kamalika Chaudhuri et Ruslan Salakhutdinov, editeurs, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6932–6941, Long Beach, California, USA, 09–15 Jun 2019. PMLR. (Cited on page 39.)
[Xu 2020] Hongteng Xu, Dixin Luo, Ricardo Henao, Svati Shah et Lawrence Carin.
Learning Autoencoders with Relational Regularization, 2020. (Cited on page 57.)
[Yan 2018] Yuguang Yan, Wen Li, Hanrui Wu, Huaqing Min, Mingkui Tan et Qingyao Wu.
Semi-Supervised Optimal Transport for Heterogeneous Domain Adaptation. In International Joint Conference on Artificial Intelligence, pages 2969–2975, 2018. (Cited on pages 39, 70, 102, 110 and 111.)
[Yanardag 2015] Pinar Yanardag et S.V.N. Vishwanathan.
Deep Graph Kernels. In ACM, editeur, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015. (Cited on pages 4, 3, 42 and 51.)
[Yang 2020] Heng Yang, Jingnan Shi et Luca Carlone.
TEASER: Fast and Certifiable Point Cloud Registration, 2020. (Cited on page 102.)
[Yeh 2014] Yi-Ren Yeh, Chun-Hao Huang et Yu-Chiang Frank Wang.
Heterogeneous domain adaptation and classification by exploiting the correlation subspace. IEEE Transactions on Image Processing, vol. 23, no. 5, pages 2009–2018, 2014. (Cited on pages 5, 3, 110 and 111.)
[Yuille 2003] Alan L. Yuille et Anand Rangarajan.
The concave-convex procedure. Neural Computation, vol. 15, no. 4, pages 915–936, 2003. (Cited on page 108.)
[Yurochkin 2019] Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh et Justin M. Solomon.
Hierarchical Optimal Transport for Document Representation. In NeurIPS, pages 1599–1609, 2019. (Cited on page 109.)
[Zalashko 2017] Anastasiia Zalashko. Causal optimal transport: theory and applications. Wien, 2017. (Cited on page 162.)
[Zaslavskiy 2009] M. Zaslavskiy, F. Bach et J. Vert.
A Path Following Algorithm for the Graph Matching Problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pages 2227–2242, 2009. (Cited on page 35.)
[Zhang 2018] Ruiyi Zhang, Changyou Chen, Chunyuan Li et Lawrence Carin Duke.
Policy Optimization as Wasserstein Gradient Flows. In Jennifer Dy et Andreas Krause, editeurs, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5741–5750, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. (Cited on page 65.)
[Zhou 2014] Joey Tianyi Zhou, Ivor W. Tsang, Sinno Jialin Pan et Mingkui Tan.
Heterogeneous Domain Adaptation for Multiple Classes. In volume 33 of Proceedings of Machine Learning Research, pages 1095–1103, Reykjavik, Iceland, 22–25 Apr 2014. PMLR. (Cited on pages 5 and 3.)
[Zhou 2019] Joey Tianyi Zhou, Ivor W. Tsang, Sinno Jialin Pan et Mingkui Tan.
Multi-class Heterogeneous Domain Adaptation. Journal of Machine Learning Research, vol. 20, no. 57, pages 1–31, 2019. (Cited on page 163.)

Titre :
Une contribution au Transport Optimal sur des espaces incomparables
Mots clés :
Transport Optimal, Données Structurées, Wasserstein et Gromov-Wasserstein
Résumé :
Le Transport Optimal est une théorie permettant de définir des notions géométriques de distance entre des distributions de probabilité et de trouver des correspondances, des relations, entre des ensembles de points. De cette théorie, à la frontière entre les mathématiques et l'optimisation, découlent de nombreuses applications en machine learning. Cette thèse propose d'étudier le scénario, complexe, dans lequel les différentes données appartiennent à des espaces incomparables. En particulier nous abordons les questions suivantes : comment définir et appliquer le transport optimal entre des graphes, entre des données structurées ? Comment l'adapter lorsque les données sont variées et ne font pas partie d'un même espace métrique ? Cette thèse propose un ensemble d'outils de Transport Optimal pour ces différents cas. Un important volet est notamment consacré à l'étude de la distance de Gromov-Wasserstein, dont les propriétés permettent de définir d'intéressants problèmes de transport sur des espaces incomparables. Plus largement, nous analysons les propriétés mathématiques des différents outils proposés, nous établissons des solutions algorithmiques pour les calculer et nous étudions leur applicabilité dans de nombreux scenarii de machine learning qui couvrent, notamment, la classification, la simplification, le partitionnement de données structurées, ainsi que l'adaptation de domaines hétérogènes.
Title:
A contribution to Optimal Transport on incomparable spaces
Keywords:
Optimal Transport, Structured Data, Wasserstein and Gromov-Wasserstein
Abstract:
Optimal Transport is a theory that allows one to define geometric notions of distance between probability distributions and to find correspondences, relationships, between sets of points. Many machine learning applications are derived from this theory, at the frontier between mathematics and optimization. This thesis proposes to study the complex scenario in which the different data belong to incomparable spaces. In particular we address the following questions: how to define and apply optimal transport between graphs, between structured data? How can it be adapted when the data are varied and not embedded in the same metric space? This thesis proposes a set of Optimal Transport tools for these different cases. An important part is notably devoted to the study of the Gromov-Wasserstein distance, whose properties allow us to define interesting transport problems on incomparable spaces. More broadly, we analyze the mathematical properties of the various proposed tools, we establish algorithmic solutions to compute them, and we study their applicability in numerous machine learning scenarios covering, in particular, classification, simplification and partitioning of structured data, as well as heterogeneous domain adaptation.