Unsupervised Sentence-embeddings by Manifold Approximation and Projection
Subhradeep Kayal
Prosus N.V., Amsterdam, Netherlands. [email protected]
Abstract
The concept of unsupervised universal sentence encoders has gained traction recently, wherein pre-trained models generate effective task-agnostic fixed-dimensional representations for phrases, sentences and paragraphs. Such methods are of varying complexity, from simple weighted-averages of word vectors to complex language-models based on bidirectional transformers. In this work we propose a novel technique to generate sentence-embeddings in an unsupervised fashion by projecting the sentences onto a fixed-dimensional manifold with the objective of preserving local neighbourhoods in the original space. To delineate such neighbourhoods we experiment with several set-distance metrics, including the recently proposed Word Mover's distance, while the fixed-dimensional projection is achieved by employing a scalable and efficient manifold approximation method rooted in topological data analysis. We test our approach, which we term EMAP, or Embeddings by Manifold Approximation and Projection, on six publicly available text-classification datasets of varying size and complexity. Empirical results show that our method consistently performs similar to or better than several alternative state-of-the-art approaches.
Dense vector representations of words, or word-embeddings, form the backbone of most modern NLP applications and can be constructed using context-free (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014) or contextualized methods (Peters et al., 2018; Devlin et al., 2019). Given that practical systems often benefit from having representations for sentences and documents, in addition to word-embeddings (Palangi et al., 2016; Yan et al., 2016), a simple trick is to use the weighted average over some or all of the embeddings of words in a sentence or document. Although sentence-embeddings constructed this way often lose information because of the disregard for word-order during averaging, they have been found to be surprisingly performant (Aldarmaki and Diab, 2018).
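As an illustration, the following minimal sketch (ours, not from the paper; it assumes a plain Python mapping from words to pre-trained NumPy vectors, e.g. Word2Vec or GloVe) shows the unweighted version of this averaging trick:

```python
import numpy as np

def average_embedding(sentence, word_vectors, dim=300):
    """Average the embeddings of the in-vocabulary words of a sentence.

    `word_vectors` is assumed to map word -> np.ndarray of shape (dim,).
    """
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vectors:  # no known words: fall back to the zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```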
More sophisticated methods focus on jointly learning the embeddings of sentences and words using models similar to Word2Vec (Le and Mikolov, 2014; Chen, 2017), using encoder-decoder approaches that reconstruct the surrounding sentences of an encoded passage (Kiros et al., 2015), or training bi-directional LSTM models on large external datasets (Conneau et al., 2017). Meaningful sentence-embeddings have also been constructed by fine-tuning pre-trained bidirectional transformers (Devlin et al., 2019) using a Siamese architecture (Reimers and Gurevych, 2019).

In parallel to the approaches mentioned above, a stream of methods has emerged recently which exploits the inherent geometric properties of the structure of sentences, by treating them as sets or sequences of word-embeddings. For example, Arora et al. (2017) propose the construction of sentence-embeddings based on weighted word-embedding averages with the removal of the dominant singular vector, while Rücklé et al. (2018) produce sentence-embeddings by concatenating several power-means of word-embeddings corresponding to a sentence. Very recently, spectral decomposition techniques were used to create sentence-embeddings, which produced state-of-the-art results when used in concatenation with averaging (Kayal and Tsatsaronis, 2019; Almarwani et al., 2019).

Our work is most related to that of Wu et al. (2018), who use Random Features (Rahimi and Recht, 2008) to learn document embeddings which preserve the properties of an explicitly-defined kernel based on the Word Mover's Distance (Kusner et al., 2015). Where Wu et al. predefine the nature of the kernel, our proposed approach can learn the similarity-preserving manifold for a given set-distance metric, offering increased flexibility.

A simple way to form sentence-embeddings is to compute the dimension-wise arithmetic mean of the embeddings of the words in a particular sentence. Even though this approach incurs information loss by disregarding the fact that sentences are sequences (or, at the very least, sets) of word vectors, it works well in practice. This already provides an indication that there is more information in the sentences to be exploited.

Kusner et al. (2015) aim to use more of the information available in a sentence by representing sentences as a weighted point cloud of embedded words.
Rooted in transportation theory, their Word Mover's distance (WMD) is the minimum amount of distance that the embedded words of a sentence need to travel to reach the embedded words of another sentence. The approach achieves state-of-the-art results for sentence classification when combined with a k-NN classifier (Cover and Hart, 1967). Since their work, other distance metrics have been suggested (Singh et al., 2019; Wang et al., 2019), also motivated by how transportation problems are solved.

Considering that sentences are sets of word vectors, a large variety of methods exist in the literature that can be used to calculate the distance between two sets, in addition to the ones based on transport theory. Thus, as a first contribution, we compare alternative metrics to measure distances between sentences.
The metrics we suggest, namely the Hausdorff distance and the Energy distance, are intuitive to explain and reasonably fast to calculate. The choice of these particular distances is motivated by their differing origins and their general usefulness in their respective application domains.

Once calculated, these distances can be used in conjunction with k-nearest neighbours for classification tasks, and k-means for clustering tasks. However, these learning algorithms are rather simplistic, and state-of-the-art machine learning algorithms require a fixed-length feature representation as input. Moreover, having fixed-length representations for sentences (sentence-embeddings) also provides a large degree of flexibility for downstream tasks, as compared to having only relative distances between them. With this as motivation, the second contribution of this work is to produce sentence-embeddings that approximately preserve the topological properties of the original sentence space. We propose to do so using an efficient, scalable manifold-learning algorithm termed UMAP (McInnes et al., 2018) from topological data analysis. Empirical results show that this process yields sentence-embeddings that deliver near state-of-the-art classification performance with a simple classifier.
In this work, we experiment with three different distance measures to determine the distance between sentences. The first measure (Energy distance) is motivated by a useful linkage criterion from hierarchical clustering (Rokach and Maimon, 2005), while the second one (Hausdorff distance) is an important metric from algebraic topology that has been successfully used in document indexing (Tsatsaronis et al., 2012). The final metric (Word Mover's distance) is a recent extension of an existing distance measure between distributions that is particularly suited for use with word-embeddings (Kusner et al., 2015).

Prior to defining the distances used in this work, we first outline the notation that we will use to describe them.
Let W ∈ R^{N×d} denote a word-embedding matrix, such that the vocabulary corresponding to it consists of N words, and each word in it, w_i ∈ R^d, is d-dimensional. This word-embedding matrix and its constituent words may come from pre-trained representations such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), in which case d = 300.

Let S be a set of sentences and s, s' be two sentences from this set. Each such sentence can be viewed as a set of word-embeddings, {w} ∈ s. Additionally, let the length of a sentence, s, be denoted as |s|, and the cardinality of the set, S, be denoted by |S|.

Let e(w_i, w_j) denote the distance between two word-embeddings, w_i, w_j. In the context of this paper, this distance is Euclidean:

    e(w_i, w_j) = \lVert w_i - w_j \rVert    (1)

Finally, D(s, s') denotes the distance between two sentences.

Energy distance is a statistical distance between probability distributions, based on the inter- and intra-distribution variance, that satisfies all the criteria of being a metric (Székely and Rizzo, 2013). Using the notation defined earlier, we write it as:

    D(s, s') = \frac{2}{|s||s'|} \sum_{w_i \in s} \sum_{w_j \in s'} e(w_i, w_j)
             - \frac{1}{|s|^2} \sum_{w_i \in s} \sum_{w_j \in s} e(w_i, w_j)
             - \frac{1}{|s'|^2} \sum_{w_i \in s'} \sum_{w_j \in s'} e(w_i, w_j)    (2)

The original conception of the energy distance was inspired by the gravitational potential energy of celestial objects. Looking closely at Equation 2, it can be quickly observed that it has two parts: the first term resembles the attraction or repulsion between two objects (or, in our case, sentences), while the second and third terms indicate the self-coherence of the respective objects. As shown by Székely and Rizzo (2013), energy distance is scale equivariant, which would make it sensitive to contextual changes in sentences, and therefore useful in NLP applications.
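To make Equation 2 concrete, here is a small sketch (our illustration, not code from the paper) that computes the energy distance between two sentences represented as arrays of word vectors, using scipy for the pairwise Euclidean distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(s, s_prime):
    """Energy distance (Equation 2) between two sentences.

    `s` and `s_prime` are arrays of shape (num_words, d) holding the
    word-embeddings of each sentence.
    """
    cross = cdist(s, s_prime).mean()          # 1/(|s||s'|) * sum of cross-sentence distances
    within_s = cdist(s, s).mean()             # 1/|s|^2  * sum of within-sentence distances
    within_s_prime = cdist(s_prime, s_prime).mean()
    return 2.0 * cross - within_s - within_s_prime
```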
Given two subsets of a metric space, the Hausdorff distance is the maximum distance of the points in one subset to the nearest point in the other. Significant work has gone into making it fast to calculate (Atallah, 1983) so that it can be applied to real-world problems, such as shape-matching in computer vision (Dubuisson and Jain, 1994).

To calculate it, the distance between each point from one set and the closest point from the other set is determined first. Then, the Hausdorff distance is calculated as the maximal point-wise distance. Considering sentences {s, s'} as subsets of the word-embedding space, the directed Hausdorff distance can be given as:

    h(s, s') = \max_{w_i \in s} \min_{w_j \in s'} e(w_i, w_j)    (3)

such that the symmetric Hausdorff distance is:

    D(s, s') = \max\{ h(s, s'), h(s', s) \}    (4)

In addition to the representation of a sentence as a set of word-embeddings, a sentence s can also be represented as an N-dimensional normalized term-frequency vector, where n_{si} is the number of times word w_i occurs in sentence s, normalized by the total number of words in s:

    n_{si} = \frac{c_{si}}{\sum_{k=1}^{N} c_{sk}}    (5)

where c_{si} is the number of times word w_i appears in sentence s.

The goal of the Word Mover's distance (WMD) (Kusner et al., 2015) is to construct a sentence similarity metric based on the distances between the individual words within each sentence, given by Equation 1. In order to calculate the distance between two sentences, WMD introduces a transport matrix, T ∈ R^{N×N}, such that each element in it, T_{ij}, denotes how much of n_{si} should be transported to n_{s'j}. Then, the WMD between two sentences is given as the solution of the following minimization problem:

    D(s, s') = \min_{T \geq 0} \sum_{i,j=1}^{N} T_{ij} \, e(i, j)
    subject to \sum_{j=1}^{N} T_{ij} = n_{si} and \sum_{i=1}^{N} T_{ij} = n_{s'j}    (6)

Thus, the WMD between two sentences is defined as the minimum distance required to transport the words from one sentence to another.
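As an illustrative sketch (again ours, not the paper's reference implementation), both distances are available through standard libraries: scipy provides the directed Hausdorff distance, while Equation 6 is a standard optimal-transport problem, solved below with the POT library. For simplicity this sketch assumes uniform word weights instead of the term-frequency vectors of Equation 5, and treats every occurrence of a word as its own point:

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.spatial.distance import cdist, directed_hausdorff

def hausdorff_distance(s, s_prime):
    """Symmetric Hausdorff distance (Equations 3 and 4) between two
    sentences given as (num_words, d) arrays of word-embeddings."""
    h_forward, _, _ = directed_hausdorff(s, s_prime)
    h_backward, _, _ = directed_hausdorff(s_prime, s)
    return max(h_forward, h_backward)

def word_movers_distance(s, s_prime):
    """Word Mover's distance (Equation 6) with uniform word weights."""
    n, m = len(s), len(s_prime)
    weights_s = np.full(n, 1.0 / n)          # stands in for n_{si}
    weights_s_prime = np.full(m, 1.0 / m)    # stands in for n_{s'j}
    cost = cdist(s, s_prime)                 # pairwise Euclidean costs e(i, j)
    return ot.emd2(weights_s, weights_s_prime, cost)  # optimal transport cost
```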
In this work, we propose to construct sentence-embeddings which preserve the neighbourhood around sentences delineated by the relative distances between them. We posit that preserving the local neighbourhoods will serve as a proxy for preserving the original topological properties.

In order to learn a topology-preserving fixed-dimensional manifold, we seek inspiration from methods in non-linear dimensionality-reduction (Lee and Verleysen, 2007) and the topological data analysis literature (Carlsson, 2009). When broadly categorized, these techniques consist of methods, such as Locally Linear Embedding (Roweis and Saul, 2000), that preserve local distances between points,
or those, like Stochastic Neighbour Embedding (Hinton and Roweis, 2003; van der Maaten and Hinton, 2008), that preserve the conditional probabilities of points being neighbours. However, existing manifold-learning algorithms suffer from two shortcomings: they are computationally expensive and are often restricted in the number of output dimensions.
In our work we use a method termed Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018), which is scalable and has no computational restrictions on the output embedding dimension.

The building block of UMAP is a particular type of simplicial complex, known as the Vietoris-Rips complex. Recalling that a k-simplex is a k-dimensional polytope which is the convex hull of its k+1 vertices, and that a simplicial complex is a set of simplices of various orders, the Vietoris-Rips simplicial complex is a collection of 0- and 1-simplices. In essence, this is a means of building a simple neighbourhood graph connecting the original data points.

Figure 1: A simple example of the embedding algorithm. On the left is the original sentence-space, approximated by the nearest-neighbours graph formed by the Vietoris-Rips complex. Instead of points and edges, our simplicial complex has sets of points and edges between them, formed by one of the distance metrics mentioned in Section 2.1. In this example, four sentences, denoted by S1 through S4, form two simplices. The sentences are denoted by colored ellipses, while the high-dimensional embedding of each word in a sentence is depicted by a point having the same color as the parent sentence ellipse. The UMAP algorithm is then employed to find a similarity-preserving Euclidean embedding-space, shown on the right, by minimizing the cross-entropy between the two representations.

A key difference, in this work, to the original formulation is that an individual data sample (i.e., the vertex of a simplex) is not a d-dimensional point but a set of d-dimensional words that make up a sentence. By using any of the distance metrics defined in Section 2.1, it is possible to construct the simplicial complex that UMAP needs in order to build the topological representation of the original sentence space. An illustration can be found in Figure 1.

As per the formulation laid out for UMAP, the similarity between sentences s' and s is defined as:

    v_{s'|s} = \exp\left( -\frac{D(s, s') - \rho_s}{\sigma_s} \right)    (7)

where σ_s is a normalisation factor selected based on an empirical heuristic (see Algorithm 3 in the work of McInnes et al. 2018), D(s, s') is the distance between two sentences as given by Equation 2, 4 or 6, and ρ_s is the distance of s from its nearest neighbour. It is worth mentioning that, for scalability, v_{s'|s} is calculated only for a predefined number of approximate nearest neighbours, a user-defined input parameter to the UMAP algorithm, using the efficient nearest-neighbour descent algorithm (Dong et al., 2011).

The similarity depicted in Equation 7 is asymmetric, and symmetrization is carried out by a fuzzy set union using the probabilistic t-conorm:

    v_{ss'} = (v_{s'|s} + v_{s|s'}) - v_{s'|s} \cdot v_{s|s'}    (8)

As UMAP builds a Vietoris-Rips complex governed by Equation 7, it can take advantage of the nerve theorem (Borsuk, 1948), which makes this construction a homotope of the original topological space. In our case, this implies that we can build a simple nearest-neighbours graph from a given corpus of sentences, which has certain guarantees of approximating the original topological space, as defined by the aforementioned distance metrics.
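The following sketch (ours, deliberately simplified: the reference UMAP implementation fits each σ_s by binary search and restricts the computation to approximate nearest neighbours, whereas here σ is taken as given and the matrix is dense) shows how the similarities of Equations 7 and 8 could be computed from a precomputed sentence-distance matrix:

```python
import numpy as np

def fuzzy_similarities(dist, sigma):
    """Directed similarities (Eq. 7) and their fuzzy-union symmetrization (Eq. 8).

    `dist` is a (num_sentences, num_sentences) symmetric distance matrix with a
    zero diagonal; `sigma` is a vector of per-sentence normalisation factors.
    """
    rho = np.partition(dist, 1, axis=1)[:, 1]            # distance to nearest non-self neighbour
    v = np.exp(-(dist - rho[:, None]) / sigma[:, None])  # Equation 7, row-wise
    np.fill_diagonal(v, 0.0)                             # no self-similarity
    return v + v.T - v * v.T                             # Equation 8 (probabilistic t-conorm)
```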
The next step is to define a similar nearest-neighbours graph in a fixed low-dimensional Euclidean space. Let s_E, s'_E ∈ R^{d_E} be the corresponding d_E-dimensional sentence-embeddings. Then the low-dimensional similarities are given by:

    w_{ss'} = \left( 1 + a \lVert s_E - s'_E \rVert^{2b} \right)^{-1}    (9)

where \lVert s_E - s'_E \rVert is the Euclidean distance between the d_E-dimensional embeddings, and a and b are input parameters, set as per the original implementation.
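For illustration, Equation 9 amounts to the following (our sketch; note that in the umap-learn implementation the constants a and b are fitted from the min_dist and spread parameters rather than passed by hand):

```python
import numpy as np

def low_dim_similarity(s_e, s_prime_e, a, b):
    """Low-dimensional similarity between two embeddings (Equation 9)."""
    d = np.linalg.norm(s_e - s_prime_e)  # Euclidean distance in the embedding space
    return 1.0 / (1.0 + a * d ** (2 * b))
```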
Algorithm 1: Constructing sentence-Embeddings by Manifold Approximation and Projection: EMAP

Data:
A pre-trained word-embeddings matrix, W; a set of sentences, S; the desired dimension of the generated sentence-embeddings, d_E
Result:
A set of sentence-embeddings, {s_E} ∈ S_E
1. Calculate the distance matrix for the entire set of sentences, such that the distance between any two sentences is given by Equation 2, 4 or 6;
2. Using this distance matrix, calculate the nearest-neighbour graph between all input sentences, given by Equations 7 and 8;
3. Calculate the initial guess for the low-dimensional embeddings, S_E ∈ R^{|S|×d_E}, using the graph Laplacian of the original nearest-neighbour graph;
4. Until convergence, minimize the cross-entropy between the two representations (Equation 10) using stochastic gradient descent;
5. Return the set of d_E-dimensional sentence-embeddings, S_E;

The final step of the process is to optimize the low-dimensional representation to have as close a fuzzy topological representation as possible to the original space. UMAP proceeds to do so by minimizing the cross-entropy between the two representations:

    C = \sum_{s \neq s'} \left[ v_{ss'} \log\frac{v_{ss'}}{w_{ss'}} + (1 - v_{ss'}) \log\frac{1 - v_{ss'}}{1 - w_{ss'}} \right]    (10)

usually done via stochastic gradient descent.

A summary of the proposed process used to produce sentence-embeddings is provided in Algorithm 1, and pictorially presented in Figure 1.
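Putting the pieces together, a minimal end-to-end sketch of Algorithm 1 might look as follows (our illustration: it assumes the umap-learn package and a distance function such as the energy_distance sketched earlier, and uses UMAP's precomputed-metric mode to supply the sentence-distance matrix):

```python
import numpy as np
import umap  # umap-learn

def emap_embeddings(sentences, distance_fn, dim=300, n_neighbors=15):
    """Compute EMAP sentence-embeddings per Algorithm 1 (simplified sketch).

    `sentences` is a list of (num_words, d) word-embedding arrays and
    `distance_fn` one of the set-distances of Section 2.1 (energy,
    Hausdorff or Word Mover's distance).
    """
    n = len(sentences)
    dist = np.zeros((n, n))
    for i in range(n):          # step 1: symmetric sentence-distance matrix
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance_fn(sentences[i], sentences[j])
    # steps 2-5: UMAP builds the fuzzy graph (Eqs. 7-8), initializes via the
    # graph Laplacian, and minimizes the cross-entropy (Eq. 10) with SGD
    reducer = umap.UMAP(n_components=dim, n_neighbors=n_neighbors,
                        metric="precomputed")
    return reducer.fit_transform(dist)
```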
Six public datasets have been used to empirically validate the method proposed in this paper. These datasets are of varying sizes, tasks and complexities, and have been used widely in existing literature, thereby making comparisons and reporting possible (available at https://drive.google.com/open?id=1sGgAo2SBoYKhQQK_kilUp8KSToCI55jl). Information about the datasets can be found in Table 1.

Word-embeddings: We use the pre-trained set of word-embeddings provided by Mikolov et al. (2013) (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

Software implementations: We use a variety of software packages and custom-written programs to perform our experiments, the starting point being the calculation of sentence-wise distances. We calculate the Hausdorff distance using the directed implementation provided in the Scipy python library (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.directed_hausdorff.html), whereas the energy distance is calculated using dcor (https://dcor.readthedocs.io/en/latest/functions/dcor.energy_distance.html). Lastly, the word mover's distance is calculated using the implementation provided by Kusner et al. (2015) (https://github.com/mkusner/wmd). In order to produce the symmetric distance matrix for a dataset, we employ a custom parallel implementation which distributes the calculations over all available logical cores of a machine. To calculate the sentence-embeddings, the implementation of UMAP provided by McInnes et al. (2018) is used (https://umap-learn.readthedocs.io/en/latest/api.html). Finally, the classification is done via linear-kernel support vector machines from the scikit-learn library (Pedregosa et al., 2011) (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). All of the code and datasets have been packaged and released to rerun all of the experiments (https://github.com/DeepK/distance-embed).

Compute infrastructure: All experiments were run on a m4.2xlarge machine on AWS-EC2 (https://aws.amazon.com/ec2/), which has 8 virtual CPUs and 32GB of RAM.

In order to check the usefulness of our proposed approach, we benchmark its performance in two different ways.
Table 1: Dataset information. Metadata describing the datasets used in our experiments: bbcsport, classic, ohsumed (medical abstracts categorized by subject headings), reuters8, twitter and amazon.

The first, and most obvious, approach is to consider the performance of the k-NN classifier as a baseline. This is motivated by the state-of-the-art k-NN based classification accuracy reported by Kusner et al. for the word mover's distance. Thus, our embeddings need to match or surpass the performance of a k-NN based approach in order to be considered for practical use.

The second approach is to compare the classification accuracies of several state-of-the-art embedding-generation algorithms on our chosen datasets. These are:

dct (Almarwani et al., 2019): embeddings are generated by employing the discrete cosine transform on a set of word vectors.

eigensent (Kayal and Tsatsaronis, 2019): sentence representations produced via higher-order dynamic mode decomposition (Le Clainche and Vega, 2017) on a sequence of word vectors.

wmovers (Wu et al., 2018): a competing method which can learn sentence representations from the word mover's distance based on kernel learning, termed in the original work as word mover's embeddings.

p-means (Rücklé et al., 2018): produces sentence-embeddings by concatenating several power-means of word-embeddings corresponding to a sentence.

doc2vec (Le and Mikolov, 2014): embeddings produced by jointly learning the representations of sentences, together with words, as part of the word2vec procedure.

s-bert (Reimers and Gurevych, 2019): embeddings produced by fine-tuning a pre-trained BERT model using a Siamese architecture to classify two sentences as being similar or different.

Note that the results for wmovers and doc2vec are taken from Table 3 of Wu et al.'s work (2018), while all the other algorithms are explicitly tested.

Extensive experiments are performed to provide a holistic overview of our neighbourhood-preserving embedding algorithm, for various sets of input parameters. The steps involved are as follows:
1. Choose a dataset (one of the six mentioned in Section 3.1). For every word in every sentence in the train and test splits of the dataset, retrieve the corresponding word-embedding from the pre-trained embedding corpus (as stated in Section 3.2).
2. Calculate symmetric distance matrices corresponding to each of the chosen distance metrics, for all of the sets of word-embeddings from the train and test splits.
3. Apply the UMAP algorithm on the distance matrices to generate embeddings for all sentences in the train and the test splits.
4. Calculate embeddings for the competing methods outlined in Section 4.1. Embeddings are generated for various hyperparameter combinations for EMAP as well as all the compared approaches, as listed in Table 2.
5. Train a classifier on the produced embeddings to perform the dataset-specific task. In this work, we train a simple linear-kernel support vector machine (Cortes and Vapnik, 1995) for every competing method and every dataset tested (see the sketch below). The classifier is trained on the train-split of a dataset and evaluated on the test-split. The only parameter tuned for the SVM is the L2 regularization strength, varied between 0.001 and 100. The overall test accuracy has been reported as a measure of performance.
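A minimal sketch of this evaluation step (ours; the paper's released code may differ in details such as the exact grid of C values, which we assume here):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def evaluate_embeddings(train_X, train_y, test_X, test_y):
    """Fit a linear-kernel SVM, tuning only the L2 regularization strength C,
    and report the overall test accuracy."""
    grid = GridSearchCV(SVC(kernel="linear"),
                        param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]})
    grid.fit(train_X, train_y)          # tuned on the train split
    return grid.score(test_X, test_y)   # accuracy on the test split
```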
The results of all our experiments are compiled in Tables 3 and 4. All statistical tests reported are z-tests, where we compute the right-tailed p-value and call a result significantly different if p < 0.05.

Performance of the distance metrics: From Table 3 it can be observed that the word mover's distance consistently performs better than the others experimented with in this paper. WMD calculates the total effort of aligning two sentences, which seems to capture more useful information compared to the hausdorff metric's worst-case effort of alignment. As for the energy distance, it calculates pairwise potentials amongst words within and between sentences, and may suffer if there are shared commonly-occurring words in both the sentences. However, given that energy and hausdorff distances are reasonably fast to calculate and perform respectably well, they might be worth using in applications with a large number of long sentences.

Comparison versus kNN: EMAP almost always outperforms k-nearest neighbours based classification, for all the tested distance metrics. The performance boost for WMD is between a relative percentage accuracy of 0.5% to 14%. This illustrates the efficiency of the proposed manifold-learning method.
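For reference, the right-tailed z-test used in these comparisons could be implemented along these lines (a sketch under the assumption of a two-proportion z-test on classification accuracies; the paper does not spell out the exact formulation):

```python
import numpy as np
from scipy.stats import norm

def right_tailed_ztest(acc_a, acc_b, n):
    """Right-tailed two-proportion z-test: is accuracy acc_a significantly
    greater than acc_b, given n test samples for each classifier?"""
    p_pool = (acc_a + acc_b) / 2.0                   # pooled proportion
    se = np.sqrt(2.0 * p_pool * (1.0 - p_pool) / n)  # pooled standard error
    z = (acc_a - acc_b) / se
    return 1.0 - norm.cdf(z)                         # right-tailed p-value
```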
Method     Parameter      Value(s) tested
EMAP       n_neighbors
           embedding_dim  50, 100, 300, 1000
           min_dist
           spread
           n_iters
           distance       wmd, hausdorff, energy
kNN        k
           distance       wmd, hausdorff, energy
dct        components
eigensent  components
           time_lag       1, 2, 3, [1,2], [1,2,3], [1,2,3,4]
pmeans     powers         1, [1,2], [1,2,3], [1,2,3,4,5,6]
s-bert     model          bert-base-nli-mean-tokens
Table 2: Hyperparameter values tested. For EMAP, n_neighbors refers to the size of the local neighborhood used for manifold approximation, embedding_dim is the fixed dimensionality of the generated sentence-embeddings, min_dist is the minimum distance apart that points are allowed to be in the low-dimensional representation, spread determines the scale at which embedded points will be spread out, n_iters is the number of iterations that the UMAP algorithm is allowed to run, and, finally, distance is one of the metrics proposed in Section 2.1. For the spectral decomposition based algorithms, dct and eigensent, components represents the number of components to keep in the resulting decomposition, while time_lag corresponds to the window-length in the dynamic mode decomposition process. For pmeans, powers represents the different powers which are used to generate the concatenated embeddings.
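For concreteness, the EMAP hyperparameters of Table 2 map directly onto keyword arguments of the umap-learn API (a sketch under that assumption; the values below are hypothetical examples, and n_epochs is umap-learn's name for what the table calls n_iters):

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,        # size of the local neighbourhood
    n_components=300,      # embedding_dim: output dimensionality
    min_dist=0.1,          # minimum separation in the low-dim space
    spread=1.0,            # scale at which embedded points spread out
    n_epochs=500,          # n_iters: optimization iterations
    metric="precomputed",  # distance: supplied via a sentence-distance matrix
)
```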
Table 3: Comparison versus kNN. Results shown here compare the classification accuracies of k-nearest neighbour to our proposed approach for various distance metrics. For every distance, bold indicates better accuracy, while * indicates that the winning accuracy was statistically significant with respect to the compared value (i.e., EMAP vs kNN for a given distance metric). It can be observed that our method almost always outperforms k-nearest neighbour-based classification.
Table 4: Comparison versus competing methods. We compare EMAP based on the word mover's distance to various state-of-the-art approaches. The best and second-best classification accuracies are highlighted in bold and italics. We perform statistical significance tests of our method (wmd-EMAP) against all other methods, for a given dataset, and denote the outcomes by ∨ when the compared method is worse and ∧ when our method is worse, while the absence of a symbol indicates insignificant differences. In terms of absolute accuracy, we observe that our method achieves state-of-the-art results in 2 out of 6 datasets.
Table 5: Examples of best-matching sentences, from the amazon reviews dataset using wmd-EMAP.

Query: I have spent thousands of dollar's On Meyers cookware everthing from KitchenAid Anolon Prestige Faberware & Circulan just to name a few Though Meyers does manufacture very high quality pots & pans and I would recommend them to anyone it's just sad that if you have any problem with them under warranty you have to go throught the chain of command that never gets you anywhere even if you want to speak with upper management about the rudeness of the customer service department Their customer service department employees are always very rude and snotty and they act like they are doing you a favor to even talk to you about their products
Best match (cosine similarity 0.997): When I opened the box I noticed corrosion on the lid When I contacted Rival customer service via email they told me I had to purchase a new lid I called and spoke with a customer service representative and they told me that a lid was not covered under warranty When I explained that I just opened it and it was defective they told me to just return the product that there was nothing that they were going to do After being treated this way I will NOT be purchasing any more Rival products if they don't stand behind their product VERY VERY poor customer service

Query: This movie will bring up your racial prejudices in ways that most movies just elude to It demonstrates how connected we all are as people and how seperated we are by only one thing our viewpoints The acting is superb and you get one cameo appearance after another which is a treat Of course the soundtrack is terrific The ending is intense to witness one situation after another coming to an unfortunate finish
Best match (cosine similarity 0.998): I waited years for this movie to be released in the United States As far as I was concerned it wasn't about the acting as much as it was about the feeling the actors wanted to portray in which they profoundly accomplished I would recommend this movie to anyone who can reach that one step deeper into the minds of creativity and passion and appreciate the struggles of rising above and beyond the pain of broken dreams

Query: We see a phrase a lot when we visit how to sites for writers World building By this we mean the setting the characters and everything else where our story will occur For me this often means maps memories and visits since I write about where I live But if you'd like to see exactly what world building means head down to your local library and grab SALEM'S LOT by Stephen King When Stephen King mania first gripped the English speaking world I missed it I saw the film of CARRIE and hated it Years later at a guard desk on a long shift scheduled so suddenly that I hadn't had a chance to visit the library I read what was in the desk instead THINNER If I were Stephen King I'd have put a pen name on that crap as well One of King's fans brought me around She recommended THE SHINING Of course I thought of that Kubrick/Nicholson travesty No no she said read the book It's much different Yes it is It's fantastic for its perceptiveness Next up PET SEMATARY which scared the crap out of me And that my friends is not easy ON WRITING I've gushed about that enough times The films STAND BY ME and THE APT PUPIL So in the end I appreciate King and forgive him for CARRIE and I think he's forgiven himself
Best match (cosine similarity 0.955): in the possibility that Steve Berry could ever transcend his not so great debut The Amber Room Romanov Prophecy started in the right direction Third Secret was OK but I think he hit his *peak* right there

Comparison versus state-of-the-art methods: Consulting Table 4, it seems that wmovers, pmeans and s-bert form the strongest baselines as compared to our method, wmd-EMAP (EMAP with word mover's distance). Considering the statistical significance of the differences in performance between wmd-EMAP and the others, it can be seen that it is almost always equivalent to or better than the other state-of-the-art approaches. In terms of absolute accuracy, it wins in 3 out of 6 evaluations, where it has the highest classification accuracy, and comes out second-best for the others. Compared to its closest competitor, the word mover's embedding algorithm, the performance of wmd-EMAP is found to be on-par (or slightly better, by 0.8% in the case of the classic dataset) to slightly worse (3% relative p.p., in the case of the twitter dataset). Interestingly, both of the distance-based embedding approaches, wmd-EMAP and wmovers, are found to perform better than the siamese-BERT based approach, s-bert.
Thus, the overall conclusion from our empirical studies is that EMAP performs favourably as compared to various state-of-the-art approaches.

Examples of similar sentences with EMAP: We provide motivating examples of similar sentences from the amazon dataset, as deemed by our approach, in Table 5. As can be seen, our method performs quite well in matching complex sentences with varying topics and sentiments to their closest pairs. The first example pair has the theme of a customer who is unhappy about poor customer service in the context of cookware warranty, while the second one is about positive reviews of deeply-moving movies. The third example, about book reviews, is particularly interesting: in the query sentence, a reviewer talks about how she disliked the first Stephen King work she was exposed to but subsequently liked all the following ones, while in the matched sentence the reviewer describes a similar change of sentiment towards the works of another author, Steve Berry. Thus, in the last example, the similarity between the sentences is the change of sentiment, from negative to positive, towards the books of particular authors.
In this work, we propose a novel mechanism to construct unsupervised sentence-embeddings by preserving properties of local neighbourhoods in the original space, as delineated by set-distance metrics.
This method, which we term EMAP, or Embeddings by Manifold Approximation and Projection, leverages a method from topological data analysis and can be used as a framework with any distance metric that can discriminate between sets, three of which we test in this paper. Using both quantitative empirical studies, where we compare with state-of-the-art approaches, and qualitative probing, where we retrieve similar sentences based on our generated embeddings, we illustrate the efficiency of our proposed approach to be on-par with or exceeding in-use methods. This work demonstrates the successful application of topological data analysis in sentence-embedding creation, and we leave the design of better distance metrics and manifold approximation algorithms, particularly targeted towards NLP, for future research.
References
Hanan Aldarmaki and Mona Diab. 2018. Evaluation of unsupervised compositional representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2666–2677, Santa Fe, New Mexico, USA.

Nada Almarwani, Hanan Aldarmaki, and Mona Diab. 2019. Efficient sentence embedding using discrete cosine transform. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3672–3678, Hong Kong, China.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations.

Mikhail J. Atallah. 1983. A linear time algorithm for the Hausdorff distance between convex polygons. Technical report, Department of Computer Science, Purdue University.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Karol Borsuk. 1948. On the imbedding of systems of compacta in simplicial complexes. In Fundamenta Mathematicae, volume 35, pages 217–234.

Gunnar Carlsson. 2009. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308.

Minmin Chen. 2017. Efficient vector representation for documents through corruption.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Wei Dong, Charikar Moses, and Kai Li. 2011. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586.

M.-P. Dubuisson and A. K. Jain. 1994. A modified Hausdorff distance for object matching. In Proceedings of 12th International Conference on Pattern Recognition, volume 1, pages 566–568.

Geoffrey E. Hinton and Sam T. Roweis. 2003. Stochastic neighbor embedding. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 857–864.

Subhradeep Kayal and George Tsatsaronis. 2019. EigenSent: Spectral sentence embeddings using higher-order dynamic mode decomposition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4536–4546, Florence, Italy.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302.

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 957–966.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages II-1188–II-1196.

Soledad Le Clainche and José M. Vega. 2017. Higher order dynamic mode decomposition. SIAM Journal on Applied Dynamical Systems, 16(2):882–925.

John A. Lee and Michel Verleysen. 2007. Nonlinear Dimensionality Reduction, 1st edition.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

L. McInnes, J. Healy, and J. Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. UMAP: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119.

H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(4):694–707.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China.

Lior Rokach and Oded Maimon. 2005. Clustering methods. In The Data Mining and Knowledge Discovery Handbook, pages 321–352.

Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326.

Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Iryna Gurevych. 2018. Concatenated p-mean word embeddings as universal cross-lingual sentence representations. CoRR, abs/1803.01400.

Sidak Pal Singh, Andreas Hug, Aymeric Dieuleveut, and Martin Jaggi. 2019. Context mover's distance & barycenters: Optimal transport of contexts for building representations. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019.

Gábor J. Székely and Maria L. Rizzo. 2013. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272.

George Tsatsaronis, Iraklis Varlamis, and Kjetil Nørvåg. 2012. SemaFor: Semantic document indexing using semantic forests. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 1692–1696, New York, NY, USA.

Zihao Wang, Datong Zhou, Yong Zhang, Hao Wu, and Chenglong Bao. 2019. Wasserstein-Fisher-Rao document distance. CoRR, abs/1904.10294.

Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J. Witbrock. 2018. Word mover's embedding: From Word2Vec to document embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4524–4534, Brussels, Belgium.

Zhao Yan, Nan Duan, Junwei Bao, Peng Chen, Ming Zhou, Zhoujun Li, and Jianshe Zhou. 2016. DocChat: An information retrieval approach for chatbot engines using unstructured documents. In