Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions
Boris Muzellec
CREST, ENSAE [email protected]
Marco Cuturi
Google Brain and CREST, ENSAE [email protected]
Abstract
Embedding complex objects as vectors in low dimensional spaces is a longstanding problem in machine learning. We propose in this work an extension of that approach, which consists in embedding objects as elliptical probability distributions, namely distributions whose densities have elliptical level sets. We endow these measures with the 2-Wasserstein metric, with two important benefits: (i) For such measures, the squared 2-Wasserstein metric has a closed form, equal to a weighted sum of the squared Euclidean distance between means and the squared Bures metric between covariance matrices. The latter is a Riemannian metric between positive semi-definite matrices, which turns out to be Euclidean on a suitable factor representation of such matrices, which is valid on the entire geodesic between these matrices. (ii) The 2-Wasserstein distance boils down to the usual Euclidean metric when comparing Diracs, and therefore provides a natural framework to extend point embeddings. We show that for these reasons Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. In particular, and unlike previous work based on the KL geometry, we learn elliptical distributions that are not necessarily diagonal. We demonstrate the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.
Introduction

One of the holy grails of machine learning is to compute meaningful low-dimensional embeddings for high-dimensional complex data. That ability has recently proved crucial to tackle more advanced tasks, such as for instance: inference on texts using word embeddings [Mikolov et al., 2013b, Pennington et al., 2014, Bojanowski et al., 2017], improved image understanding [Norouzi et al., 2014], representations for nodes in large graphs [Grover and Leskovec, 2016]. Such embeddings have been traditionally recovered by seeking isometric embeddings in lower dimensional Euclidean spaces, as studied in [Johnson and Lindenstrauss, 1984, Bourgain, 1985]. Given $n$ input points $x_1, \dots, x_n$, one seeks as many embeddings $y_1, \dots, y_n$ in a target space $Y = \mathbb{R}^d$ whose pairwise distances $\|y_i - y_j\|$ do not depart too much from the original distances $d_X(x_i, x_j)$ in the input space. Note that when $d$ is restricted to be 2 or 3, these embeddings $(y_i)_i$ provide a useful way to visualize the entire dataset. Starting with metric multidimensional scaling (mMDS) [De Leeuw, 1977, Borg and Groenen, 2005], several approaches have refined this intuition [Tenenbaum et al., 2000, Roweis and Saul, 2000, Hinton and Roweis, 2003, Maaten and Hinton, 2008]. More general criteria, such as reconstruction error [Hinton and Salakhutdinov, 2006, Kingma and Welling, 2014]; co-occurrence [Globerson et al., 2007]; or relational knowledge, be it in metric learning [Weinberger and Saul, 2009] or between words [Mikolov et al., 2013b], can be used to obtain vector embeddings. In such cases, distances $\|y_i - y_j\|$ between embeddings, or alternatively their dot-products $\langle y_i, y_j \rangle$, must comply with sophisticated desiderata. Naturally, more general and flexible approaches in which the embedding space $Y$ needs not be Euclidean can be considered, for instance in generalized MDS on the sphere [Maron et al., 2010], on surfaces [Bronstein et al., 2006], in spaces of trees [Bădoiu et al., 2007, Fakcharoenphol et al., 2003] or, more recently, computed in the Poincaré hyperbolic space [Nickel and Kiela, 2017].
Probabilistic Embeddings. Our work belongs to a recent trend, pioneered by Vilnis and McCallum, who proposed to embed data points as probability measures in $\mathbb{R}^d$ [2015], and therefore generalize point embeddings. Indeed, point embeddings can be regarded as a very particular (and degenerate) case of probabilistic embedding, in which the uncertainty is infinitely concentrated on a single point (a Dirac). Probability measures can be more spread out, or even multimodal, and therefore provide an opportunity for additional flexibility. Naturally, such an opportunity can only be exploited by defining a metric, divergence or dot-product on the space (or a subspace thereof) of probability measures. Vilnis and McCallum proposed to embed words as Gaussians endowed either with the Kullback-Leibler (KL) divergence or the expected likelihood kernel [Jebara et al., 2004]. The Kullback-Leibler divergence and expected likelihood kernel on measures have, however, an important drawback: these geometries do not coincide with the usual Euclidean metric between point embeddings when the variances of these Gaussians collapse. Indeed, the KL divergence and the $\ell_2$ distance between two Gaussians diverge to $\infty$ or saturate when the variances of these Gaussians become small. To avoid numerical instabilities arising from this degeneracy, Vilnis and McCallum must restrict their work to diagonal covariance matrices. In a concurrent approach, Singh et al. represent words as distributions over their contexts in the optimal transport geometry [Singh et al., 2018].
Contributions. We propose in this work a new framework for probabilistic embeddings, in which point embeddings are seamlessly handled as a particular case. We consider arbitrary families of elliptical distributions, which subsume Gaussians, and also include uniform elliptical distributions, which are arguably easier to visualize because of their compact support. Our approach uses the 2-Wasserstein distance to compare elliptical distributions. The latter can handle degenerate measures, and both its value and its gradients admit closed forms [Gelbrich, 1990], either in their natural Riemannian formulation or in a more amenable local Euclidean parameterization. We provide numerical tools to carry out the computation of elliptical embeddings in different scenarios, both to optimize them with respect to metric requirements (as is done in multidimensional scaling) or with respect to dot-products (as shown in our applications to word embeddings for entailment, similarity and hypernymy tasks), for which we introduce a proxy using a polarization identity.
Notations. $S_d^{++}$ (resp. $S_d^+$) is the set of positive (resp. semi-)definite $d \times d$ matrices. For two vectors $x, y \in \mathbb{R}^d$ and a matrix $M \in S_d^+$, we write the squared Mahalanobis norm induced by $M$ as $\|x - c\|_M^2 = (x - c)^T M (x - c)$, and $|M|$ for $\det(M)$. For $V$ an affine subspace of dimension $m$ of $\mathbb{R}^d$, $\lambda_V$ is the Lebesgue measure on that subspace. $M^\dagger$ is the pseudo-inverse of $M$.

We recall in this section basic facts about elliptical distributions in $\mathbb{R}^d$. We adopt a general formulation that can handle measures supported on subspaces of $\mathbb{R}^d$ as well as Dirac (point) measures. That level of generality is needed to provide a seamless connection with usual vector embeddings, seen in the context of this paper as Dirac masses. We recall results from the literature showing that the squared 2-Wasserstein distance between two distributions from the same family of elliptical distributions is equal to the squared Euclidean distance between their means plus the squared Bures metric between their scale parameters, scaled by a suitable constant.
Elliptically Contoured Densities. In their simplest form, elliptical distributions can be seen as generalizations of Gaussian multivariate densities in $\mathbb{R}^d$: their level sets describe concentric ellipsoids, shaped following a scale parameter $C \in S_d^{++}$, and centered around a mean parameter $c \in \mathbb{R}^d$ [Cambanis et al., 1981]. The density at a point $x$ of such distributions is $f(\|x - c\|^2_{C^{-1}}) / \sqrt{|C|}$, where the generator function $f$ is such that $\int_{\mathbb{R}^d} f(\|x\|^2)\,\mathrm{d}x = 1$. Gaussians are recovered with $f = g$, $g(\cdot) \propto e^{-\cdot/2}$, while uniform distributions on full rank ellipsoids result from $f = u$, $u(\cdot) \propto \mathbf{1}_{\cdot \leq 1}$. Because the norm induced by $C^{-1}$ appears in the formula above, the scale parameter $C$ must have full rank for these definitions to be meaningful. Cases where $C$ does not have full rank can however appear when a probability measure is supported on an affine subspace of $\mathbb{R}^d$, such as lines in $\mathbb{R}^2$, or even possibly a space of null dimension when the measure is supported on a single point (a Dirac measure), in which case its scale parameter $C$ is $0$. We provide in what follows a more general approach to handle these degenerate cases.
Elliptical Distributions. To lift this limitation, several reformulations of elliptical distributions have been proposed to handle degenerate scale matrices $C$ of rank $\mathrm{rk}\,C < d$. Gelbrich [1990, Theorem 2.4] defines elliptical distributions as measures with a density w.r.t. the Lebesgue measure of dimension $\mathrm{rk}\,C$, in the affine space $c + \mathrm{Im}\,C$, where the image of $C$ is $\mathrm{Im}\,C \stackrel{\text{def}}{=} \{Cx,\ x \in \mathbb{R}^d\}$. This approach is intuitive, in that it reduces to describing densities in their relevant subspace. A more elegant approach uses the parameterization provided by characteristic functions [Cambanis et al., 1981, Fang et al., 1990]. In a nutshell, recall that the characteristic function of a multivariate Gaussian is equal to $\varphi(t) = e^{i t^T c}\, g(t^T C t)$ where, as in the paragraph above, $g(\cdot) = e^{-\cdot/2}$. A natural generalization to other elliptical distributions is therefore to consider, instead of $g$, other functions $h$ of positive type [Ushakov, 1999, Theorem 1.8.9], such as the indicator function $u$ above, and still apply them to the same argument $t^T C t$. Such functions are called characteristic generators and fully determine, along with a mean $c$ and a scale parameter $C$, an elliptical measure. This parameterization does not require the scale parameter $C$ to be invertible, and therefore allows to define probability distributions that do not necessarily have a density w.r.t. the Lebesgue measure in $\mathbb{R}^d$. Both constructions are relatively complex, and we refer the interested reader to these references for a rigorous treatment.
Rank Deficient Elliptical Distributions and their Variances. For the purpose of this work, we will only require the following result: the variance of an elliptical measure is equal to its scale parameter $C$ multiplied by a scalar that only depends on its characteristic generator. Indeed, given a mean vector $c \in \mathbb{R}^d$, a scale semi-definite matrix $C \in S_d^+$ and a characteristic generator function $h$, we define $\mu_{h,c,C}$ to be the measure with characteristic function $t \mapsto e^{i t^T c}\, h(t^T C t)$. In that case, one can show that the covariance matrix of $\mu_{h,c,C}$ is equal to its scale parameter $C$ times a constant $\tau_h$ that only depends on $h$, namely

$$\mathrm{var}(\mu_{h,c,C}) = \tau_h\, C. \quad (1)$$

For Gaussians, the scale parameter $C$ and the covariance matrix coincide, that is $\tau_g = 1$. For uniform elliptical distributions, one has $\tau_u = 1/(d+2)$: the covariance of a uniform distribution on the volume $\{c + C^{1/2} x,\ x \in \mathbb{R}^d,\ \|x\| \leq 1\}$, such as those represented in Figure 1, is equal to $C/(d+2)$.

Figure 1: Five measures from the family of uniform elliptical distributions in $\mathbb{R}^3$. Each measure has a mean (location) and scale parameter. In this carefully selected example, the reference measure (with scale parameter $A$) is equidistant (according to the 2-Wasserstein metric) to the four remaining measures, whose scale parameters $B_0, B_1, B_2, B_3$ have ranks equal to their indices (here $B_1 = vv^T$ for a fixed vector $v \in \mathbb{R}^3$).
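As a quick numerical illustration of (1) for the uniform family, the following Monte-Carlo check (ours, not part of the paper's experiments) samples a uniform measure on an ellipsoid and verifies that its empirical covariance is close to $C/(d+2)$, i.e., $\tau_u = 1/(d+2)$:

    # Monte-Carlo sanity check of Eq. (1) for uniform elliptical measures.
    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(0)
    d = 3
    F = rng.standard_normal((d, d))
    C = F @ F.T                        # a full-rank scale parameter
    c = rng.standard_normal(d)

    # Uniform samples from the unit ball: uniform direction, radius ~ U^{1/d}.
    n = 200_000
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x *= rng.random((n, 1)) ** (1.0 / d)

    y = c + x @ sqrtm(C).real.T        # push the ball through C^{1/2}
    emp_cov = np.cov(y, rowvar=False)
    print(np.allclose(emp_cov, C / (d + 2), atol=5e-2))  # True up to MC error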
The 2-Wasserstein Bures Metric. A natural metric for elliptical distributions arises from optimal transport (OT) theory. We refer interested readers to [Santambrogio, 2015, Peyré and Cuturi, 2018] for exhaustive surveys on OT. Recall that for two arbitrary probability measures $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, their squared 2-Wasserstein distance is equal to

$$W_2^2(\mu, \nu) \stackrel{\text{def}}{=} \inf_{X \sim \mu,\, Y \sim \nu} \mathbb{E}\,\|X - Y\|^2.$$

This formula rarely has a closed form. However, in the footsteps of Dowson and Landau [1982], who proved it for Gaussians, Gelbrich [1990] showed that for $\alpha \stackrel{\text{def}}{=} \mu_{h,a,A}$ and $\beta \stackrel{\text{def}}{=} \mu_{h,b,B}$ in the same family $\mathcal{P}_h = \{\mu_{h,c,C},\ c \in \mathbb{R}^d,\ C \in S_d^+\}$, one has

$$W_2^2(\alpha, \beta) = \|a - b\|^2 + \mathfrak{B}^2(\mathrm{var}\,\alpha, \mathrm{var}\,\beta) = \|a - b\|^2 + \tau_h\, \mathfrak{B}^2(A, B), \quad (2)$$

where $\mathfrak{B}^2$ is the (squared) Bures metric on $S_d^+$, proposed in quantum information geometry by Bures [1969] and studied recently in [Bhatia et al., 2018, Malagò et al., 2018]:

$$\mathfrak{B}^2(X, Y) \stackrel{\text{def}}{=} \mathrm{Tr}\left(X + Y - 2\,(X^{1/2} Y X^{1/2})^{1/2}\right). \quad (3)$$

The factor $\tau_h$ next to the rightmost term $\mathfrak{B}^2$ in (2) arises from the homogeneity of $\mathfrak{B}^2$ in its arguments (3), which is leveraged using the identity in (1). (For instance, the random variable $Y$ in $\mathbb{R}^2$ obtained by duplicating the same normal random variable $X$ in $\mathbb{R}$, $Y = [X, X]$, is supported on a line in $\mathbb{R}^2$ and has no density w.r.t. the Lebesgue measure in $\mathbb{R}^2$.)
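For concreteness, here is a small NumPy sketch (ours, not the paper's code) of the closed form (2)-(3). The helper `psd_sqrtm` uses an eigendecomposition for clarity; the paper instead advocates Newton-Schulz iterations (Algorithm 1 below):

    # Closed-form squared 2-Wasserstein distance between elliptical measures.
    import numpy as np

    def psd_sqrtm(M):
        w, V = np.linalg.eigh(M)                        # M assumed symmetric PSD
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

    def bures_sq(A, B):
        """Squared Bures metric Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}), Eq. (3)."""
        rA = psd_sqrtm(A)
        return np.trace(A) + np.trace(B) - 2.0 * np.trace(psd_sqrtm(rA @ B @ rA))

    def w2_sq(a, A, b, B, tau_h=1.0):
        """Eq. (2): ||a - b||^2 + tau_h * Bures^2(A, B); tau_h=1 is the Gaussian case."""
        return np.sum((a - b) ** 2) + tau_h * bures_sq(A, B)

    # Two Diracs (zero scale parameters) recover the squared Euclidean distance:
    a, b = np.array([0.0, 1.0]), np.array([3.0, 5.0])
    Z = np.zeros((2, 2))
    print(w2_sq(a, Z, b, Z))   # 25.0, i.e. ||a - b||^2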
A few remarks. (i) When both scale matrices $A = \mathrm{diag}\, d_A$ and $B = \mathrm{diag}\, d_B$ are diagonal, $W_2^2(\alpha, \beta)$ is the sum of two terms: the usual squared Euclidean distance between their means, plus $\tau_h$ times the squared Hellinger metric between the diagonals $d_A, d_B$: $H^2(d_A, d_B) \stackrel{\text{def}}{=} \|\sqrt{d_A} - \sqrt{d_B}\|^2$. (ii) The distance $W_2$ between two Diracs $\delta_a, \delta_b$ is equal to the usual distance between vectors, $\|a - b\|$. (iii) The squared distance $W_2^2$ between a Dirac $\delta_a$ and a measure $\mu_{h,b,B}$ in $\mathcal{P}_h$ reduces to $\|a - b\|^2 + \tau_h\, \mathrm{Tr}\, B$. The distance between a point and an elliptical distribution therefore always increases as the scale parameter of the latter increases. Although this point makes sense from the quadratic viewpoint of $W_2$ (in which the quadratic contribution $\|a - x\|^2$ of points $x$ in the ellipsoid that stand further away from $a$ than $b$ will dominate that brought by points $x$ that are closer, see Figure 3), this may be counterintuitive for applications to visualization, an issue that will be addressed in Section 4. (iv) The $W_2$ distance between two elliptical distributions in the same family $\mathcal{P}_h$ is always finite, no matter how degenerate they are. This is illustrated in Figure 1, in which a uniform measure $\mu_{a,A}$ is shown to be exactly equidistant to four other uniform elliptical measures, some of which are degenerate. However, as can be hinted by the simple example of the Hellinger metric, that distance may not be differentiable for degenerate measures (in the same sense that $(\sqrt{x} - \sqrt{y})^2$ is defined at $x = 0$ but not differentiable w.r.t. $x$). (v) Although we focus in this paper on uniform elliptical distributions, notably because they are easier to plot and visualize, considering any other elliptical family simply amounts to changing the constant $\tau_h$ next to the Bures metric in (2). Alternatively, increasing (or tuning) that parameter $\tau_h$ simply amounts to considering elliptical distributions with increasingly heavier tails.

Our goal in this paper is to use the set of elliptical distributions endowed with the $W_2$ distance as an embedding space. To optimize objective functions involving $W_2$ terms, we study in this section several parameterizations of the parameters of elliptical distributions. Location parameters only appear in the computation of $W_2$ through their Euclidean metric, and offer therefore no particular challenge. Scale parameters are trickier to handle since they are constrained to lie in $S_d^+$. Rather than keeping track of scale parameters, we advocate optimizing directly on factors (square roots) of such parameters, which results in simple Euclidean (unconstrained) updates, reviewed below.
Geodesics for Elliptical Distributions. When $A$ and $B$ have full rank, the geodesic from $\alpha$ to $\beta$ is a curve of measures in the same family of elliptical distributions, characterized by location and scale parameters $c(t), C(t)$, where

$$c(t) = (1 - t)\,a + t\,b; \quad C(t) = \left((1 - t)I + t\,T_{AB}\right) A \left((1 - t)I + t\,T_{AB}\right), \quad (4)$$

and where the matrix $T_{AB}$ is such that $x \mapsto T_{AB}(x - a) + b$ is the so-called optimal transportation map of Brenier [1987] from $\alpha$ to $\beta$, given in closed form as

$$T_{AB} \stackrel{\text{def}}{=} A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2}, \quad (5)$$

and is the unique positive semi-definite matrix such that $B = T_{AB}\, A\, T_{AB}$ [Peyré and Cuturi, 2018, Remark 2.30]. When $A$ is degenerate, such a curve still exists as long as $\mathrm{Im}\,B \subset \mathrm{Im}\,A$, in which case the expression above is still valid using pseudo-inverse square roots $A^{\dagger 1/2}$ in place of the usual inverse square root.
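The following sketch (ours) instantiates the map (5) and the geodesic (4) for full-rank scale matrices, reusing `psd_sqrtm` from the earlier snippet:

    import numpy as np

    def transport_map(A, B):
        """T_AB = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}, Eq. (5)."""
        rA = psd_sqrtm(A)
        irA = np.linalg.inv(rA)
        return irA @ psd_sqrtm(rA @ B @ rA) @ irA

    def geodesic_params(a, A, b, B, t):
        """Location and scale parameters of mu_t along the geodesic, Eq. (4)."""
        T = transport_map(A, B)
        M = (1.0 - t) * np.eye(len(a)) + t * T
        return (1.0 - t) * a + t * b, M @ A @ M

    rng = np.random.default_rng(0)
    X, Y = rng.standard_normal((2, 2, 2))
    A, B = X @ X.T + np.eye(2), Y @ Y.T + np.eye(2)
    T = transport_map(A, B)
    print(np.allclose(T @ A @ T, B))   # True: defining property of T_AB
    print(np.allclose(geodesic_params(np.zeros(2), A, np.ones(2), B, 1.0)[1], B))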
Differentiability in Riemannian Parameterization. Scale parameters are restricted to lie on the cone $S_d^+$. For such problems, a direct gradient-and-project based optimization on scale parameters would prove too expensive. A natural remedy to this issue is to perform manifold optimization [Absil et al., 2009]. Indeed, as in any Riemannian manifold, the Riemannian gradient $\mathrm{grad}_x\, d^2(x, y)$ is given by $-2\log_x y$ [Lee, 1997]. Using the expressions of the exp and log maps given in [Malagò et al., 2018], we can show that minimizing $\mathfrak{B}^2(A, B)$ using Riemannian gradient descent corresponds to making updates of the form, with step length $\eta$,

$$A' = \left((1 - \eta)I + \eta\, T_{AB}\right) A \left((1 - \eta)I + \eta\, T_{AB}\right). \quad (6)$$

When $0 \leq \eta \leq 1$, this corresponds to considering a new point $A'$ closer to $B$ along the Bures geodesic between $A$ and $B$. When $\eta$ is negative or larger than $1$, $A'$ no longer lies on this geodesic but is guaranteed to remain PSD, as can be seen from (6). Figure 2 shows a $W_2$ geodesic between two measures $\mu_0$ and $\mu_1$, as well as its extrapolation following exactly the formula given in (4). That figure illustrates that $\mu_t$ is not necessarily geodesic outside of the boundaries $[0, 1]$ w.r.t. three relevant measures, because its metric derivative is smaller than 1 [Ambrosio et al., 2006, Theorem 1.1.2]. When negative steps are taken (for instance when the $W_2$ distance needs to be increased), this lack of geodesicity has proved difficult to handle numerically for a simple reason: such updates may lead to degenerate scale parameters $A'$, as illustrated shortly after time $t = 1$ on the curve in Figure 2. Another obvious drawback of Riemannian approaches is that they are not as well studied as simpler non-constrained Euclidean problems, for which a plethora of optimization techniques are available. This observation motivates an alternative Euclidean parameterization, detailed in the next paragraph.

Figure 2: (left) Interpolation $(\mu_t)_t$ between two measures $\mu_0$ and $\mu_1$ following the geodesic equation (4). The same formula can be used to interpolate on the left and right of times $0, 1$; ellipses are displayed at several times in and outside $[0, 1]$. Note that geodesicity is not ensured outside of the boundaries $[0, 1]$. This is illustrated in the right plot, which displays normalized metric derivatives $|\mathrm{d}W_2/\mathrm{d}t|\, /\, W_2(\mu_0, \mu_1)$ of the curve $\mu_t$ to four relevant points: $\mu_0, \mu_1, \mu_{-1}, \mu_2$. The curve $\mu_t$ is not always locally geodesic, as can be seen by the fact that the metric derivative is strictly smaller than 1 in several cases.
Differentiability in Euclidean Parameterization. A canonical way to handle a PSD constraint for $A$ is to rewrite it in factor form $A = LL^T$. In the particular case of the Bures metric, we show that this simple parameterization comes without losing the geometric interest of manifold optimization, while benefiting from simpler additive updates. Indeed, one can show (see supplementary material) that the squared Bures metric has the following gradient:

$$\nabla_L\, \mathfrak{B}^2(A, B) = 2\left(I - T_{AB}\right) L, \quad \text{with updates} \quad L' = \left((1 - \eta)I + \eta\, T_{AB}\right) L. \quad (7)$$
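In code, a step on the factor is a single matrix product. A minimal sketch (ours), assuming `transport_map` as defined earlier:

    # Factor update (7): a step of length eta towards B acting linearly on L,
    # while A' = L'L'^T stays PSD by construction.
    import numpy as np

    def bures_factor_step(L, B, eta):
        A = L @ L.T
        T = transport_map(A, B)
        return ((1.0 - eta) * np.eye(L.shape[0]) + eta * T) @ L

    L = np.array([[1.0, 0.0], [0.5, 2.0]])
    B = np.array([[2.0, 0.3], [0.3, 1.0]])
    L1 = bures_factor_step(L, B, eta=1.0)   # full step: L1 L1^T should equal B
    print(np.allclose(L1 @ L1.T, B))        # True

A full step ($\eta = 1$) gives $L' = T_{AB} L$, so $L'L'^T = T_{AB} A T_{AB} = B$, matching the defining property of $T_{AB}$.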
Links between Euclidean and Riemannian Parameterizations. The factor updates in (7) are exactly equivalent to the Riemannian ones (6), in the sense that $A' = L'L'^T$. Therefore, by using a factor parameterization we carry out updates that stay on the Riemannian geodesic yet only require linear updates on $L$, independently of the factor $L$ chosen to represent $A$ (given a factor $L$ of $A$, any right-side multiplication of that matrix by a unitary matrix remains a factor of $A$). When considering a general loss function $\mathcal{L}$ that takes as arguments squared Bures distances, one can also show that $\mathcal{L}$ is geodesically convex w.r.t. the scale matrices $A$ if and only if it is convex in the usual sense with respect to $L$, where $A = LL^T$. Write now $L_B = T_{AB} L$. One can recover that $L_B L_B^T = B$. Therefore, expanding the expression $\mathfrak{B}^2$ for the right term below, we obtain

$$\mathfrak{B}^2(A, B) = \mathfrak{B}^2\left(LL^T, L_B L_B^T\right) = \mathfrak{B}^2\left(LL^T, T_{AB} L \left(T_{AB} L\right)^T\right) = \|L - T_{AB} L\|_F^2.$$

Indeed, the Bures distance simply reduces to the Frobenius distance between two factors of $A$ and $B$. However, these factors need to be carefully chosen: given $L$ for $A$, the factor for $B$ must be computed according to the optimal transport map $T_{AB}$.
Polarization between Elliptical Distributions. Some of the applications we consider, such as the estimation of word embeddings, are inherently based on dot-products. By analogy with the polarization identity $\langle x, y \rangle = (\|x - 0\|^2 + \|y - 0\|^2 - \|x - y\|^2)/2$, we define a Wasserstein-Bures pseudo-dot-product, where $\delta_0 = \mu_{h, \mathbf{0}_d, \mathbf{0}_{d \times d}}$ is the Dirac mass at $0$:

$$[\mu_{a,A} : \mu_{b,B}] \stackrel{\text{def}}{=} \tfrac{1}{2}\left(W_2^2(\mu_{a,A}, \delta_0) + W_2^2(\mu_{b,B}, \delta_0) - W_2^2(\mu_{a,A}, \mu_{b,B})\right) = \langle a, b \rangle + \mathrm{Tr}\left(A^{1/2} B A^{1/2}\right)^{1/2}.$$

$[\cdot : \cdot]$ is not an actual inner product since the Bures metric is not Hilbertian, unless we restrict ourselves to diagonal covariance matrices, in which case it is the inner product between $(a, \sqrt{d_A})$ and $(b, \sqrt{d_B})$. We use $[\mu_{a,A} : \mu_{b,B}]$ as a similarity measure which has, however, some regularity: one can show that when $a, b$ are constrained to have equal norms and $A$ and $B$ equal traces, then $[\mu_{a,A} : \mu_{b,B}]$ is maximal when $a = b$ and $A = B$. Differentiating all three terms in that sum, the gradient of this pseudo-dot-product w.r.t. $A$ reduces to $\nabla_A [\mu_{a,A} : \mu_{b,B}] = \tfrac{1}{2} T_{AB}$.
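A small sketch (ours) of this pseudo-dot-product, together with a numerical check of the polarization identity it is built from, reusing `psd_sqrtm` and `w2_sq` from the snippets above:

    import numpy as np

    def wb_dot(a, A, b, B):
        """[mu_{a,A} : mu_{b,B}] = <a, b> + Tr (A^{1/2} B A^{1/2})^{1/2}."""
        rA = psd_sqrtm(A)
        return a @ b + np.trace(psd_sqrtm(rA @ B @ rA))

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((2, 3))
    X, Y = rng.standard_normal((2, 3, 3))
    A, B = X @ X.T, Y @ Y.T
    z, Z = np.zeros(3), np.zeros((3, 3))        # the Dirac mass at 0
    lhs = 0.5 * (w2_sq(a, A, z, Z) + w2_sq(b, B, z, Z) - w2_sq(a, A, b, B))
    print(np.isclose(lhs, wb_dot(a, A, b, B)))  # True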
Computational Aspects. The computational bottleneck of gradient-based Bures optimization lies in the matrix square root and inverse square root operations that arise when instantiating transport maps $T$ as in (5). A naive method using eigenvector decomposition is far too time-consuming, and there is not yet, to the best of our knowledge, a straightforward way to perform it in batches on a GPU. We propose to use Newton-Schulz iterations (Algorithm 1, see [Higham, 2008, Ch. 6]) to approximate these root computations. These iterations produce both a root and an inverse root approximation and, relying exclusively on matrix-matrix multiplications, stream efficiently on GPUs. Another problem lies in the fact that numerous roots and inverse roots are required to form the map $T$. To solve this, we exploit an alternative formula for $T_{AB}$ (proof in the supplementary material):

$$T_{AB} = A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2} = B^{1/2}\left(B^{1/2} A B^{1/2}\right)^{-1/2} B^{1/2}. \quad (8)$$

In a gradient update, both the loss and the gradient of the metric are needed. In our case, we can reuse the matrix roots computed during the loss evaluation and leverage the identity above to compute on a budget the gradients with respect to either scale matrix $A$ or $B$. Indeed, a naive computation of $\nabla_A\, \mathfrak{B}^2(A, B)$ and $\nabla_B\, \mathfrak{B}^2(A, B)$ would require the knowledge of six roots, $A^{1/2}$, $B^{1/2}$, $(A^{1/2} B A^{1/2})^{1/2}$, $(B^{1/2} A B^{1/2})^{1/2}$, $A^{-1/2}$, and $B^{-1/2}$, to compute the following transport maps,

$$T_{AB} = A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2}, \quad T_{BA} = B^{-1/2}\left(B^{1/2} A B^{1/2}\right)^{1/2} B^{-1/2},$$

namely four matrix roots and two matrix inverse roots. We can avoid computing those six matrices using identity (8) and limit ourselves to two runs of Algorithm 1, to obtain the same quantities as

$$\{Y_1 = A^{1/2},\ Z_1 = A^{-1/2}\}, \quad \{Y_2 = (A^{1/2} B A^{1/2})^{1/2},\ Z_2 = (A^{1/2} B A^{1/2})^{-1/2}\},$$
$$T_{AB} = Z_1 Y_2 Z_1, \quad T_{BA} = Y_1 Z_2 Y_1.$$
Algorithm 1: Newton-Schulz iterations.
  Input: PSD matrix $A$, $\epsilon > 0$
  $Y \leftarrow \frac{A}{(1+\epsilon)\|A\|}$, $Z \leftarrow I$
  while not converged do
    $T \leftarrow (3I - ZY)/2$
    $Y \leftarrow YT$, $Z \leftarrow TZ$
  end while
  $Y \leftarrow \sqrt{(1+\epsilon)\|A\|}\; Y$, $Z \leftarrow \frac{Z}{\sqrt{(1+\epsilon)\|A\|}}$
  Output: square root $Y$, inverse square root $Z$

When computing the gradients of $n \times m$ squared Wasserstein distances $W_2^2(\alpha_i, \beta_j)$ in parallel, one only needs to run $n$ Newton-Schulz algorithms (in parallel) to compute the matrices $(Y_i^1, Z_i^1)_{i \leq n}$, and then $n \times m$ Newton-Schulz algorithms to recover the cross matrices $Y_{i,j}^2, Z_{i,j}^2$. On the other hand, using an automatic differentiation framework would require an additional backward computation of the same complexity as the forward pass evaluating the roots and inverse roots, hence requiring roughly twice as many operations per batch.
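The following NumPy transcription of Algorithm 1 (ours; the paper targets batched GPU execution, and we use the Frobenius norm for the rescaling) illustrates the coupled iterations:

    # Coupled Newton-Schulz iterations returning approximations of A^{1/2}
    # and A^{-1/2} using only matrix products.
    import numpy as np

    def newton_schulz(A, eps=1e-6, n_iter=20):
        d = A.shape[0]
        s = (1.0 + eps) * np.linalg.norm(A)    # rescale so the iteration converges
        Y, Z = A / s, np.eye(d)
        for _ in range(n_iter):
            T = 0.5 * (3.0 * np.eye(d) - Z @ Y)
            Y, Z = Y @ T, T @ Z
        return np.sqrt(s) * Y, Z / np.sqrt(s)  # ~ A^{1/2}, ~ A^{-1/2}

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    Y, Z = newton_schulz(A)
    print(np.allclose(Y @ Y, A), np.allclose(Z @ Y, np.eye(2)))  # True True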
Avoiding Rank Deficiency at Optimization Time. Although $\mathfrak{B}(A, B)$ is defined for rank deficient matrices $A$ and $B$, it is not differentiable with respect to these matrices when they are rank deficient. Indeed, as mentioned earlier, this can be compared to the non-differentiability of the Hellinger metric $(\sqrt{x} - \sqrt{y})^2$ when $x$ or $y$ becomes $0$, at which point it stops being differentiable. If $\mathrm{Im}\,B \not\subset \mathrm{Im}\,A$, which is notably the case if $\mathrm{rk}\,B > \mathrm{rk}\,A$, then $\nabla_A\, \mathfrak{B}^2(A, B)$ no longer exists. However, even in that case, $\nabla_B\, \mathfrak{B}^2(A, B)$ exists iff $\mathrm{Im}\,A \subset \mathrm{Im}\,B$. Since it would be cumbersome to account for these subtleties in a large scale optimization setting, we propose to add a small common regularization term to all the factor products considered for our embeddings, and set $A_\varepsilon = LL^T + \varepsilon I$, where $\varepsilon > 0$ is a hyperparameter. This ensures that all matrices are full rank, and thus that all gradients exist. Most importantly, all our derivations still hold with this regularization, and the gradients w.r.t. $L$ can be shown to be unchanged, namely equal to $2(I - T_{A_\varepsilon B}) L$.
Experiments

We discuss in this section several applications of elliptical embeddings. We first consider a simple mMDS type visualization task, in which elliptical distributions in $d = 2$ are used to embed isometrically points in high dimension. We argue that for such purposes, a more natural way to visualize ellipses is to use their precision matrices. This is due to the fact that the human eye somewhat acts in the opposite direction to the Bures metric, as discussed in Figure 3. We follow with more advanced experiments in which we consider the task of computing word embeddings on large corpora as a testing ground, and equal or improve on the state-of-the-art.

Figure 3: (left) Three points on the plane. (middle) Isometric elliptical embedding with the Bures metric: ellipses of a given color have the same respective distances as points on the left. Although the mechanics of optimal transport indicate that the blue ellipsoid is far from the two others, in agreement with the left plot, the human eye tends to focus on those areas that overlap (below the ellipsoid center) rather than those far away areas (north-east area) that contribute more significantly to the $W_2$ distance. (right) The precision matrix visualization, obtained by considering ellipses with the same axes but inverted eigenvalues, agrees better with intuition, since overlap and extension of an ellipse along an axis mean on the contrary that this axis contributes less to the increase of the metric.

Figure 4: Toy experiment: visualization of a dataset of 10 PISA scores for 35 countries in the OECD. (left) MDS embeddings of these countries on the plane. (right) Elliptical embeddings on the plane using the precision visualization discussed in Figure 3. The normalized stress with standard MDS is 0.62. The stress with elliptical embeddings is close to zero after 1000 gradient iterations, with random initializations for scale matrices (following a standard Wishart with 4 degrees of freedom) and initial means located on the MDS solution.
Visualizing Datasets Using Ellipsoids. Multidimensional scaling [De Leeuw, 1977] aims at embedding points $x_1, \dots, x_n$ of a finite metric space in a lower dimensional one by minimizing the stress $\sum_{ij} \left(\|x_i - x_j\| - \|y_i - y_j\|\right)^2$. In our case, this translates to the minimization of

$$\mathcal{L}_{\mathrm{MDS}}(a_1, \dots, a_n, A_1, \dots, A_n) = \sum_{ij} \left(\|x_i - x_j\| - W_2(\mu_{a_i, A_i}, \mu_{a_j, A_j})\right)^2.$$

This objective can be crudely minimized with a simple gradient descent approach operating on factors, as advocated in Section 3 and sketched below, and is illustrated in a toy example (Figure 4) carried out using data from OECD's PISA study (http://pisadataexplorer.oecd.org/ide/idepisa/).
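A hypothetical sketch (ours) of one gradient step on this stress, combining the analytic gradients of Section 3 ($\nabla_{a_i} W_2^2 = 2(a_i - a_j)$ and $\nabla_{L_i} W_2^2 = 2(I - T_{A_i A_j}) L_i$) with the chain rule through the square root, and reusing `w2_sq` and `transport_map` from the earlier snippets; `means` has shape $(n, d)$ and `factors` shape $(n, d, d)$, and all target distances are assumed positive:

    import numpy as np

    def mds_step(D, means, factors, lr=0.01, reg=1e-6):
        n, d = means.shape
        g_a, g_L = np.zeros_like(means), np.zeros_like(factors)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                Ai = factors[i] @ factors[i].T + reg * np.eye(d)
                Aj = factors[j] @ factors[j].T + reg * np.eye(d)
                w = np.sqrt(w2_sq(means[i], Ai, means[j], Aj))
                coef = 1.0 - D[i, j] / w          # chain rule through sqrt(W2^2)
                T = transport_map(Ai, Aj)
                g_a[i] += coef * 2.0 * (means[i] - means[j])
                g_L[i] += coef * 2.0 * (np.eye(d) - T) @ factors[i]
        return means - lr * g_a, factors - lr * g_L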
Word Embeddings. The skipgram model [Mikolov et al., 2013a] computes word embeddings in a vector space by maximizing the log-probability of observing surrounding context words given an input central word. Vilnis and McCallum [2015] extended this approach to diagonal Gaussian embeddings, using an energy whose overall principles we adopt here, adapted to elliptical distributions with full covariance matrices in the 2-Wasserstein space. For every word $w$, we consider an input (as a word) and an output (as a context) representation as an elliptical measure, denoted respectively $\mu_w$ and $\nu_w$, both parameterized by a location vector and a scale parameter (stored in factor form). Given a set $\mathcal{R}$ of positive word/context pairs of words $(w, c)$, and for each input word a set $N(w)$ of $n$ negative context words sampled randomly, we adapt Vilnis and McCallum's loss function to the $W_2$ distance and minimize the following hinge loss:

$$\sum_{(w,c) \in \mathcal{R}} \left( M - [\mu_w : \nu_c] + \frac{1}{n} \sum_{c' \in N(w)} [\mu_w : \nu_{c'}] \right)_+,$$

where $M > 0$ is a margin parameter. We train our embeddings on the concatenated ukWaC and WaCkypedia corpora [Baroni et al., 2009], consisting of about 3 billion tokens, on which we keep only the tokens appearing more than 100 times in the text (for a total number of 261,583 different words). We train our embeddings using adagrad [Duchi et al., 2011], sampling one negative context per positive context and, in order to prevent the norms of the embeddings from being too highly correlated with the corresponding word frequencies (see figures in the supplementary material), we use two distinct sets of embeddings for the input and context words.

Figure 5: Precision matrix visualization of trained embeddings of a set of words on the plane spanned by the two principal eigenvectors of the covariance matrix of "Bach".

We compare our full elliptical embeddings to diagonal Gaussian embeddings trained using the methods described in [Vilnis and McCallum, 2015] on a collection of similarity datasets, by computing the Spearman rank correlation between the similarity scores provided in the data and the scores we compute based on our embeddings. Note that these results are obtained using context ($\nu_w$) rather than input ($\mu_w$) embeddings. For a fair comparison across methods, we set dimensions by ensuring that the number of free parameters remains the same: because of the symmetry of the covariance matrix, elliptical embeddings in dimension $d$ have $d + d(d+1)/2$ free parameters ($d$ for the means, $d(d+1)/2$ for the covariance matrices), as compared with $2d$ for diagonal Gaussians. For elliptical embeddings, we use the common practice of using some form of normalized quantity (a cosine) rather than the direct dot product. We implement this here by computing the mean of two cosine terms, each corresponding separately to mean and covariance contributions (a sketch is given in the code below):

$$S_B[\mu_{a,A}, \mu_{b,B}] := \frac{1}{2}\left( \frac{\langle a, b \rangle}{\|a\|\,\|b\|} + \frac{\mathrm{Tr}\left(A^{1/2} B A^{1/2}\right)^{1/2}}{\sqrt{\mathrm{Tr}\,A\ \mathrm{Tr}\,B}} \right).$$

Using this similarity measure rather than the Wasserstein-Bures dot product is motivated by the fact that the norms of the embeddings show some dependency on word frequencies (see figures in the supplementary material) and become dominant when comparing words with different frequency scales. An alternative could have been obtained by normalizing the Wasserstein-Bures dot product in a more standard way that pools together means and covariances. However, as discussed in the supplementary material, this choice makes it harder to deal with the variations in scale of the means and covariances, therefore decreasing performance.

Table 1: Results for elliptical embeddings (Ell/12/CM, evaluated using our cosine mixture) compared to diagonal Gaussian embeddings (W2G/45/C) trained with the seomoz package (evaluated using expected likelihood cosine similarity as recommended by Vilnis and McCallum). Spearman rank correlations (x100); entries lost in extraction are marked with a dash.

  Dataset      W2G/45/C   Ell/12/CM
  SimLex       -          -
  WordSim-R    61.70      -
  WordSim-S    48.99      -
  MEN          65.16      -
  MC           59.48      -
  RG           -          -

We also evaluate our embeddings on the Entailment dataset [Baroni et al., 2012], on which we obtain results roughly comparable to those of [Vilnis and McCallum, 2015]. Note that contrary to the similarity experiments, in this framework using the (unsymmetrical) KL divergence makes sense and possibly gives an advantage, as it is possible to choose the order of the arguments in the KL divergence between the entailing and entailed words.

Table 2: Entailment benchmark: we evaluate our embeddings on the Entailment dataset using average precision (AP) and F1 scores. The threshold for F1 is chosen to be the best at test time.

  Model            AP     F1
  W2G/45/Cosine    0.70   0.74
  W2G/45/KL        0.72   0.74
  Ell/12/CM        0.70   0.73
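The cosine mixture $S_B$ used above can be sketched as follows (ours, reusing `psd_sqrtm` from earlier):

    import numpy as np

    def cosine_mixture(a, A, b, B):
        """Mean of a location cosine and a trace-normalized Bures cross term."""
        rA = psd_sqrtm(A)
        cos_loc = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        cos_cov = np.trace(psd_sqrtm(rA @ B @ rA)) / np.sqrt(np.trace(A) * np.trace(B))
        return 0.5 * (cos_loc + cos_cov)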
Hypernymy
In this experiment, we use the framework of [Nickel and Kiela, 2017] on hypernymy relationships to test our embeddings. A word A is said to be a hypernym of a word B if any B is a type of A (e.g., any dog is a type of mammal), thus constituting a tree-like structure on nouns. The WORDNET dataset [Miller, 1995] features a transitive closure of 743,241 hypernymy relations on 82,115 distinct nouns, which we consider as an undirected graph of relations $\mathcal{R}$. Similarly to the skipgram model, for each noun $u$ we sample a fixed number $n$ of negative examples and store them in a set $N(u)$, to optimize the following loss (sketched in the code below):

$$\sum_{(u,v) \in \mathcal{R}} \log \frac{e^{[\mu_u : \mu_v]}}{e^{[\mu_u : \mu_v]} + \sum_{v' \in N(u)} e^{[\mu_u : \mu_{v'}]}}.$$
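A hypothetical sketch (ours) of this objective as a negative log-softmax to be minimized; `emb` maps a noun to its (location, scale) pair, `negatives` to its sampled set $N(u)$, and `wb_dot` is the pseudo-dot-product sketched earlier:

    import numpy as np

    def hypernymy_loss(pairs, negatives, emb):
        loss = 0.0
        for u, v in pairs:
            pos = np.exp(wb_dot(*emb[u], *emb[v]))
            neg = sum(np.exp(wb_dot(*emb[u], *emb[w])) for w in negatives[u])
            loss -= np.log(pos / (pos + neg))
        return loss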
We train the model using SGD with only one set of embeddings. The embeddings are then evaluated on a link reconstruction task: we embed the full tree and rank the similarity of each positive hypernym pair $(u, v)$ among all negative pairs $(u, v')$, and compute the mean rank thus achieved as well as the mean average precision (MAP), using the Wasserstein-Bures dot product as the similarity measure. Elliptical embeddings consistently outperform Poincaré embeddings for dimensions above a small threshold, as shown in Figure 6, which confirms our intuition that the addition of a notion of variance or uncertainty to point embeddings allows for a richer and more significant representation of words.

Figure 6: Reconstruction performance of our embeddings against Poincaré embeddings (reported from [Nickel and Kiela, 2017], as we were not able to reproduce scores comparable to these values), evaluated by mean retrieved rank (lower is better) and MAP (higher is better).
Conclusion

We have proposed to use the space of elliptical distributions endowed with the $W_2$ metric to embed complex objects. This latest iteration of probabilistic embeddings, in which an object is represented as a probability measure, can consider elliptical measures (including Gaussians) with arbitrary covariance matrices. The $W_2$ metric provides a natural and seamless generalization of point embeddings in $\mathbb{R}^d$. Each embedding is described with a location $c$ and a scale parameter $C$, the latter being represented in practice using a factor matrix $L$, from which $C$ is recovered as $LL^T$. The visualization part of this work is still subject to open questions. One may seek a different method than that proposed here using precision matrices, and ask whether one can include more advanced constraints on these embeddings, such as inclusions or the presence (or absence) of intersections across ellipses. Handling multimodality using mixtures of Gaussians could be pursued. In that case a natural upper bound on the $W_2$ distance can be computed by solving the OT problem between these mixtures of Gaussians using a simpler proxy: consider them as discrete measures putting Dirac masses in the space of Gaussians endowed with the $W_2$ metric as a ground cost, and use the optimal cost of that proxy as an upper bound of their Wasserstein distance. Finally, note that the set of elliptical measures $\mu_{c,C}$ endowed with the Bures metric can also be interpreted, given that $C = LL^T$, $L \in \mathbb{R}^{d \times k}$, and writing $\tilde{l}_i = l_i - \bar{l}$ for the centered column vectors of $L$, as a discrete point cloud $(c + \sqrt{k}\,\tilde{l}_i)_i$ endowed with a $W_2$ metric only looking at their first and second order moments. These $k$ points, whose mean and covariance matrix match $c$ and $C$, can therefore fully characterize the geometric properties of the distribution $\mu_{c,C}$, and may provide a simple form of multimodal embedding.

References
P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer, 2006.
Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226, September 2009.
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23-32. ACL, 2012.
Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146, 2017.
Ingwer Borg and Patrick J. F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, 2005.
Jean Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1):46-52, 1985.
Yann Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math, 305(19):805-808, 1987.
Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168-1172, 2006.
Elia Bruni, Nam Khanh Tran, and Marco Baroni. Multimodal distributional semantics. J. Artif. Int. Res., 49(1):1-47, January 2014.
Mihai Bădoiu, Piotr Indyk, and Anastasios Sidiropoulos. Approximation algorithms for embedding general metrics into trees. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 512-521. Society for Industrial and Applied Mathematics, 2007.
Donald Bures. An extension of Kakutani's theorem on infinite product measures to the tensor product of semifinite w*-algebras. Transactions of the American Mathematical Society, 135:199-212, 1969.
Stamatis Cambanis, Steel Huang, and Gordon Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368-385, 1981.
Jan De Leeuw. Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics, 1977.
D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450-455, 1982.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 448-455. ACM, 2003.
K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman and Hall/CRC, 1990.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: the concept revisited. ACM Trans. Inf. Syst., 20(1):116-131, 2002.
Matthias Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185-203, 1990.
Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8(Oct):2265-2295, 2007.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855-864. ACM, 2016.
Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1406-1414, New York, NY, USA, 2012. ACM.
Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.
Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with genuine similarity estimation. Comput. Linguist., 41(4):665-695, December 2015.
Geoffrey E. Hinton and Sam T. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 857-864, 2003.
Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Journal of Machine Learning Research, 5:819-844, 2004.
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189-206. Amer. Math. Soc., Providence, RI, 1984.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
J. M. Lee. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics. Springer New York, 1997.
Minh-Thang Luong, Richard Socher, and Christopher D. Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein-Riemannian geometry of positive-definite matrices. arXiv preprint arXiv:1801.09269, 2018.
Yariv Maron, Michael Lamar, and Elie Bienenstock. Sphere embedding: An application to part-of-speech induction. In Advances in Neural Information Processing Systems, pages 1567-1575, 2010.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013a.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013b.
George A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39-41, November 1995.
George A. Miller and Walter G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1991.
Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6341-6350. Curran Associates, Inc., 2017.
Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proceedings of the International Conference on Learning Representations, 2014.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 337-346, New York, NY, USA, 2011. ACM.
Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
Herbert Rubenstein and John B. Goodenough. Contextual correlates of synonymy. Commun. ACM, 8(10):627-633, October 1965.
Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhauser, 2015.
Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and Martin Jaggi. Context mover's distance & barycenters: Optimal transport of contexts for building representations. arXiv preprint arXiv:1808.09663, 2018.
Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
Nikolai G. Ushakov. Selected Topics in Characteristic Functions. Walter de Gruyter, 1999.
Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In Proceedings of the International Conference on Learning Representations, 2015. arXiv preprint arXiv:1412.6623.
K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207-244, 2009.
Dongqiang Yang and David M. W. Powers. Measuring semantic similarity in the taxonomy of WordNet. In Proceedings of the Twenty-Eighth Australasian Conference on Computer Science - Volume 38, ACSC '05, pages 315-322, Darlinghurst, Australia, 2005. Australian Computer Society, Inc.

Supplementary Material

Equivalent formulations of $T_{AB}$

$T_{AB}$ is defined as the unique PSD matrix verifying $T_{AB}\, A\, T_{AB} = B$. Using this definition, we derive two equivalent formulations for $T_{AB}$:

$$T_{AB} = A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2} = B^{1/2}\left(B^{1/2} A B^{1/2}\right)^{-1/2} B^{1/2}.$$

The first is derived as in [Malagò et al., 2018]:

$$T_{AB}\, A\, T_{AB} = B \implies A^{1/2} T_{AB} A^{1/2}\, A^{1/2} T_{AB} A^{1/2} = A^{1/2} B A^{1/2} \implies \left(A^{1/2} T_{AB} A^{1/2}\right)^2 = A^{1/2} B A^{1/2}$$
$$\implies A^{1/2} T_{AB} A^{1/2} = \left(A^{1/2} B A^{1/2}\right)^{1/2} \implies T_{AB} = A^{-1/2}\left(A^{1/2} B A^{1/2}\right)^{1/2} A^{-1/2}.$$

We then adapt this derivation to obtain the second formulation of $T_{AB}$:

$$T_{AB}\, A\, T_{AB} = B \implies \left(T_{AB}\right)^{-1} B \left(T_{AB}\right)^{-1} = A \implies B^{1/2}\left(T_{AB}\right)^{-1} B^{1/2}\, B^{1/2}\left(T_{AB}\right)^{-1} B^{1/2} = B^{1/2} A B^{1/2}$$
$$\implies \left(B^{1/2}\left(T_{AB}\right)^{-1} B^{1/2}\right)^2 = B^{1/2} A B^{1/2} \implies B^{1/2}\left(T_{AB}\right)^{-1} B^{1/2} = \left(B^{1/2} A B^{1/2}\right)^{1/2}$$
$$\implies T_{AB} = B^{1/2}\left(B^{1/2} A B^{1/2}\right)^{-1/2} B^{1/2}.$$

Derivation of the Riemannian gradient updates
From [Malagò et al., 2018], we have that the exp and log maps of the Riemannian Bures metric are given by

$$\exp_C(V) = \left(\mathcal{L}_C(V) + I\right) C \left(\mathcal{L}_C(V) + I\right), \qquad \log_C(B) = \left(T_{CB} - I\right) C + C \left(T_{CB} - I\right),$$

where $\mathcal{L}_C(V)$ is the solution of the Lyapunov equation $\mathcal{L}_C(V)\, C + C\, \mathcal{L}_C(V) = V$. One can show that the $\mathcal{L}_C$ operator is linear, and that the following identity holds: $\mathcal{L}_C(XC + CX) = X$. In particular, $\mathcal{L}_C(\log_C B) = T_{CB} - I$.

From this, since $\mathrm{grad}_A\, \mathfrak{B}^2(A, B) = -2 \log_A B$ (the factor 2 being absorbed into the step size $\eta_t$), the Riemannian gradient update is given by

$$A_{t+1} = \exp_{A_t}\left(\eta_t \log_{A_t} B\right) = \left(\eta_t \mathcal{L}_{A_t}(\log_{A_t} B) + I\right) A_t \left(\eta_t \mathcal{L}_{A_t}(\log_{A_t} B) + I\right) = \left((1 - \eta_t) I + \eta_t T_{A_t B}\right) A_t \left((1 - \eta_t) I + \eta_t T_{A_t B}\right).$$

Derivation of the Euclidean gradient
Notations: $\otimes$ is the Kronecker product of matrices. Recall that

$$[B^\top \otimes A]\, \mathrm{vec}(X) = \mathrm{vec}(AXB), \qquad [A \otimes B][C \otimes D] = [AC \otimes BD].$$

In the following, we will often omit $\mathrm{vec}(\cdot)$ and treat matrices as vectors when the context makes it clear. We will make use of the following identities:

$$\partial_X (f \circ g)(X) = \partial_{g(X)} f\ \partial_X g(X), \qquad \partial_X (fg)(X) = [g(X)^\top \otimes I]\, \partial_X f(X) + [I \otimes f(X)]\, \partial_X g(X),$$

and

$$\partial_X\, X^{1/2} = \left[X^{1/2} \otimes I + I \otimes X^{1/2}\right]^{-1}.$$

Let $f(A, B) = \mathrm{Tr}\left(B^{1/2} A B^{1/2}\right)^{1/2}$. Let us differentiate $f$ w.r.t. $A$:

$$\nabla_A f(A, B) = \left[\partial_A \left(B^{1/2} A B^{1/2}\right)^{1/2}\right]^\top \mathrm{vec}(I)$$
$$= \left[\left[\left(B^{1/2} A B^{1/2}\right)^{1/2} \otimes I + I \otimes \left(B^{1/2} A B^{1/2}\right)^{1/2}\right]^{-1} \partial_A \left(B^{1/2} A B^{1/2}\right)\right]^\top \mathrm{vec}(I)$$
$$= \left[B^{1/2} \otimes B^{1/2}\right] \left[\left(B^{1/2} A B^{1/2}\right)^{1/2} \otimes I + I \otimes \left(B^{1/2} A B^{1/2}\right)^{1/2}\right]^{-1} \mathrm{vec}(I)$$
$$= \left[B^{1/2} \otimes B^{1/2}\right] \mathrm{vec}\left(\tfrac{1}{2}\left(B^{1/2} A B^{1/2}\right)^{-1/2}\right) = \tfrac{1}{2}\, B^{1/2}\left(B^{1/2} A B^{1/2}\right)^{-1/2} B^{1/2}.$$

Therefore, using identity (8), $\nabla_A f(A, B) = \tfrac{1}{2} T_{AB}$.

Let now $A = LL^\top$, and let us differentiate w.r.t. $L$:

$$\nabla_L f(LL^\top, B) = \left[\partial_L A\right]^\top \left[\partial_A \left(B^{1/2} A B^{1/2}\right)^{1/2}\right]^\top \mathrm{vec}(I) = \left[L^\top \otimes I\right]\left[I + T_{n,n}\right] \mathrm{vec}\left(\tfrac{1}{2} T_{AB}\right) = \mathrm{vec}\left(T_{AB} L\right),$$

where $T_{n,n}$ is the transposition tensor, such that $\forall X \in \mathbb{R}^{n \times n}$, $T_{n,n}\, \mathrm{vec}(X) = \mathrm{vec}(X^\top)$; the last equality uses the symmetry of $T_{AB}$.

Therefore $\nabla_L f(LL^\top, B) = T_{AB} L$. Using the same calculations, one can see that if $A = LL^\top + \varepsilon I$, then we still have $\nabla_L f(LL^\top + \varepsilon I, B) = T_{AB} L$, since $\partial_L \left[LL^\top + \varepsilon I\right] = \partial_L \left[LL^\top\right]$.
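A finite-difference check (ours, not part of the paper) of the factor gradient derived above, using `scipy.linalg.sqrtm` so the snippet is self-contained:

    # For f(L) = Tr (B^{1/2} (LL^T) B^{1/2})^{1/2}, we expect grad_L f = T_AB L.
    import numpy as np
    from scipy.linalg import sqrtm

    def f(L, B):
        rB = sqrtm(B).real
        return np.trace(sqrtm(rB @ (L @ L.T) @ rB).real)

    rng = np.random.default_rng(0)
    d = 3
    L = rng.standard_normal((d, d))
    Y = rng.standard_normal((d, d)); B = Y @ Y.T + np.eye(d)
    A = L @ L.T
    rA = sqrtm(A).real; irA = np.linalg.inv(rA)
    analytic = irA @ sqrtm(rA @ B @ rA).real @ irA @ L     # T_AB @ L

    num = np.zeros((d, d)); h = 1e-6
    for i in range(d):
        for j in range(d):
            E = np.zeros((d, d)); E[i, j] = h
            num[i, j] = (f(L + E, B) - f(L - E, B)) / (2 * h)
    print(np.allclose(num, analytic, atol=1e-5))   # True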
Model Hyperparameters and Training Details

Word Embeddings. We train our embeddings on the concatenated ukWaC and WaCkypedia corpora [Baroni et al., 2009], consisting of about 3 billion tokens, on which we keep only the tokens appearing more than 100 times in the text after lowercasing and removal of all punctuation (for a total number of 261,583 different words). We optimize for 5 epochs using adagrad [Duchi et al., 2011]. We use a window size of 10 (i.e., positive examples consist of the first 5 preceding and first 5 succeeding words), fix the margin $M$, sample one negative context per positive context and, in order to prevent the norms of the embeddings from being too highly correlated with the corresponding word frequencies (see Figure 7), we use two distinct sets of embeddings for the input and context words. In order to use as much parallelization as possible, we use batches of size 10000, but believe that smaller batches would lead to improved performances. We limit matrix square root approximations to 6 Newton-Schulz iterations and add $\varepsilon I$ to the covariances to ensure non-singularity.

To generate batches, we use the same sampling tricks as in [Mikolov et al., 2013b], namely sub-sampling the frequent terms (using a threshold of $10^{-5}$, as recommended for large datasets) and smoothing the negative distribution by using probabilities $\{f_i^{3/4}/Z\}$, where $f_i$ is the frequency of word $i$, for sampling negative contexts $\{c'_i\}$.

We then evaluate our embeddings on the following datasets: SimLex [Hill et al., 2015], WordSim [Finkelstein et al., 2002], MEN [Bruni et al., 2014], MC [Miller and Charles, 1991], RG [Rubenstein and Goodenough, 1965], YP [Yang and Powers, 2005], MTurk [Radinsky et al., 2011, Halawi et al., 2012], RW [Luong et al., 2013], using the context embeddings and the Wasserstein-Bures cosine as a similarity measure.
Hypernymy. We train our embeddings on the transitive closure of the WORDNET dataset [Miller, 1995], which features 743,241 hypernymy relations on 82,115 distinct nouns. For disambiguation, note that if $(u, v)$ is a hypernymy relation with $u \neq v$, then $(v, u)$ is in general not a positive relation, but $(u, u)$ is, as a noun is always its own hypernym.

We perform our optimization using SGD with batches of 1000 relations, a learning rate of 0.02 for dimensions 3 and 4 and 0.01 for higher dimensions, sample 50 negative examples per positive relation, use 6 square root iterations and add $\varepsilon I$ to the covariances. Contrary to the skipgram experiment, we use a single set of embeddings, and use the Wasserstein-Bures dot product as a similarity measure.
Wasserstein-Bures Cosine

Figure 7: Log-log plot of the traces of the embeddings' covariances vs. word frequency: (a) inputs (Pearson r = 0.94), (b) contexts (Pearson r = 0.5). The sizes of the input embeddings follow a power law, whereas context embeddings give less importance to very frequent words and emphasize medium frequency words.

[Truncated figure, panel (a) inputs: log-log plot of the norms of the location embeddings vs. word frequency; Pearson r = 0.91.]