Stochastic Neighbor Embedding under f-divergences
Daniel Jiwoong Im, Nakul Verma, and Kristin Branson
Janelia Research Campus, HHMI, Virginia; AIFounded Inc., Toronto; Columbia University, New York
November 6, 2018
Abstract
The t-distributed Stochastic Neighbor Embedding (t-SNE) is a powerful and popular method for visualizing high-dimensional data. It minimizes the Kullback-Leibler (KL) divergence between the original and embedded data distributions. In this work, we propose extending this method to other f-divergences. We analytically and empirically evaluate the types of latent structure (manifold, cluster, and hierarchical) that are well captured using both the original KL-divergence and the proposed f-divergence generalization, and find that different divergences perform better for different types of structure.

A common concern with the t-SNE criterion is that it is optimized using gradient descent, and can become stuck in poor local minima. We propose optimizing the f-divergence-based loss criteria by minimizing a variational bound. This typically performs better than optimizing the primal form, and our experiments show that it can improve upon the embedding results obtained from the original t-SNE criterion as well.

Introduction

A key aspect of exploratory data analysis is to study two-dimensional visualizations of the given high-dimensional input data. In order to gain insights about the data, one hopes that such visualizations faithfully depict salient structures that may be present in the input. t-distributed Stochastic Neighbor Embedding (t-SNE), introduced by van der Maaten and Hinton [19], is a prominent and popular visualization technique that has been applied successfully in several application domains [1, 5-8, 13].

Arguably, alongside PCA, t-SNE has now become the de facto method of choice used by practitioners for 2D visualizations to study and unravel the structure present in data. Despite its immense popularity, very little work has been done to systematically understand the power and limitations of the t-SNE method, and the quality of visualizations that it produces. Only recently have researchers shown that if the high-dimensional input data does contain prominent clusters, then the 2D t-SNE visualization will be able to successfully capture the cluster structure [12, 3]. While these results are a promising start, a more fundamental question remains unanswered: what kinds of intrinsic structures can a t-SNE visualization reveal?

Intrinsic structure in data can take many forms. While clusters are a common structure to study, there may be several other important structures, such as manifold, sparse, or hierarchical structures, present in the data as well. How does the t-SNE optimization criterion fare at discovering these other structures? Here we take a largely experimental approach to answer this question. Perhaps not surprisingly, minimizing t-SNE's KL-divergence criterion is not sufficient to discover all these important types of structure. We adopt the neighborhood-centric precision-recall analysis proposed by Venna et al. [20], which showed that KL-divergence maximizes recall at the expense of precision. We show that this is geared specifically towards revealing cluster structure and performs rather poorly when it comes to finding manifold or hierarchical structure. In order to discover these other types of structure effectively, one needs a better balance between precision and recall, and we show that this can be achieved by minimizing f-divergences other than the KL-divergence.

Table 1: A list of commonly used f-divergences (along with their generating function) and their corresponding t-SNE objective (which we refer to as f t-SNE). The last column describes what kind of distance relationship gets emphasized by different choices of f-divergence.
D_f(P || Q)                f(t)                             f t-SNE objective                                                   Emphasis
Kullback-Leibler (KL)      t log t                          \sum_{i \neq j} p_{ij} \log(p_{ij}/q_{ij})                          Local
Chi-square (χ² or CH)      (t - 1)^2                        \sum_{i \neq j} (p_{ij} - q_{ij})^2 / q_{ij}                        Local
Reverse-KL (RKL)           -log t                           \sum_{i \neq j} q_{ij} \log(q_{ij}/p_{ij})                          Global
Jensen-Shannon (JS)        -(t+1) log((1+t)/2) + t log t    (1/2) [ KL(p_{ij} || (p_{ij}+q_{ij})/2) + KL(q_{ij} || (p_{ij}+q_{ij})/2) ]   Both
Hellinger distance (HL)    (sqrt(t) - 1)^2                  \sum_{i \neq j} (\sqrt{p_{ij}} - \sqrt{q_{ij}})^2                   Both

We prescribe that data scientists create and explore low-dimensional visualizations of their data corresponding to several different f-divergences, each of which is geared toward different types of structure. To this end, we provide efficient code for finding t-SNE embeddings based on five different f-divergences (available at github.com/jiwoongim/ft-SNE). Users can even provide their own specific instantiation of an f-divergence, if needed. Our code can optimize either the standard criterion, or a variational lower bound based on the convex conjugate of the f-divergence. Empirically, we found that minimizing this dual variational form was computationally more efficient and produced better quality embeddings, even for the standard case of KL-divergence. To our knowledge, this is the first work that explicitly compares the optimization of both the primal and dual forms of f-divergences, which may be of independent interest to the reader.

Stochastic Neighbor Embedding

Given a set of m high-dimensional datapoints x_1, ..., x_m ∈ R^D, the goal of Stochastic Neighbor Embedding (SNE) is to represent these datapoints in one, two, or three dimensions in a way that faithfully captures important intrinsic structure that may be present in the given input. It aims to achieve this by first modelling neighboring pairs of points based on distance in the original, high-dimensional space. Then, SNE aims to find a low-dimensional representation of the input datapoints whose pairwise similarities induce a probability distribution that is as close to the original probability distribution as possible. More specifically, SNE computes p_ij, the probability of selecting a pair of neighboring points i and j, as

p_{ij} = \frac{p_{i|j} + p_{j|i}}{2m},

where p_{j|i} and p_{i|j} represent the probability that j is i's neighbor and i is j's neighbor, respectively. These are modeled as

p_{j|i} := \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}.

The parameters σ_i control the effective neighborhood size for the individual datapoints x_i. In practical implementations the neighborhood sizes are controlled by the so-called perplexity parameter, which can be interpreted as the effective number of neighbors for a given datapoint and is proportional to the neighborhood size [19].

The pairwise similarities between the corresponding low-dimensional datapoints y_1, ..., y_m ∈ R^d (where d is typically 1, 2, or 3) are modelled with a Student's t-distribution:

q_{ij} := \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq i} (1 + \|y_i - y_k\|^2)^{-1}}.
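To make this construction concrete, the following is a minimal NumPy sketch (ours, not the authors' released ft-SNE code) of the two similarity matrices defined above. The per-point sigma_i is found by a binary search so that the conditional distribution p_{.|i} reaches a target perplexity, following the standard t-SNE recipe; note that, as in the formula above, q_ij is normalized per row, whereas the widely used t-SNE implementation normalizes over all pairs.

import numpy as np

def squared_dists(X):
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def joint_p(X, perplexity=30.0, n_iter=50):
    """Symmetrized input similarities p_ij; beta_i = 1/(2 sigma_i^2) is set per point by a
    binary search so that the conditional distribution p_{.|i} has entropy log(perplexity)."""
    m = X.shape[0]
    D = squared_dists(X)
    P = np.zeros((m, m))
    for i in range(m):
        beta, lo, hi = 1.0, 0.0, np.inf
        d = np.delete(D[i], i)
        for _ in range(n_iter):
            p = np.exp(-(d - d.min()) * beta)      # shift for numerical stability
            p /= p.sum()
            H = -np.sum(p * np.log(p + 1e-12))     # entropy of p_{.|i}
            if H > np.log(perplexity):             # too flat: narrow the Gaussian
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
            else:                                  # too peaked: widen the Gaussian
                hi = beta
                beta = (beta + lo) / 2.0
        P[i, np.arange(m) != i] = p
    return (P + P.T) / (2.0 * m)                   # p_ij = (p_{j|i} + p_{i|j}) / (2m)

def joint_q(Y):
    """Low-dimensional Student-t similarities q_ij (one degree of freedom),
    row-normalized as written in the text."""
    num = 1.0 / (1.0 + squared_dists(Y))
    np.fill_diagonal(num, 0.0)
    return num / num.sum(axis=1, keepdims=True)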
Figure 1: Top: f-divergence loss. Bottom: gradient of f-divergence. The color limit represents the magnitude of the f-divergence (resp. gradient of the f-divergence) as a function of p_ij and q_ij.

The choice of a heavy-tailed t-distribution to model the low-dimensional similarities is deliberate and is key to circumventing the so-called crowding problem [19], hence the name t-SNE.

The locations of the mapped y_i's are determined by minimizing the discrepancy between the original high-dimensional pairwise similarity distribution P = (p_ij) and the corresponding low-dimensional distribution Q = (q_ij). t-SNE prescribes minimizing the KL-divergence (D_KL) between the distributions P and Q to find an optimal configuration of the mapped points:

J_KL(y_1, ..., y_m) := D_KL(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.

While it is reasonable to use KL-divergence to compare the pairwise distributions P and Q, there is no compelling reason why it should be preferred over other measures. In fact, we will demonstrate that using KL-divergence is restrictive for some types of structure discovery, and one should explore other divergence-based measures as well to gain a holistic understanding of the input data.

f-Divergence-based Stochastic Neighbor Embedding

KL-divergence is a special case of a broader class of divergences called f-divergences. A few popular special cases of f-divergences include the reverse KL divergence, Jensen-Shannon divergence, Hellinger distance (HL), total variation distance, and χ²-divergence. Of course, each instantiation compares the discrepancy between the distributions differently [16], and it would be instructive to study what effects, if any, these other divergences have on low-dimensional visualizations of a given input. Formally, the f-divergence between two distributions P and Q (over the same measurable space Ω) is defined as

D_f(P || Q) := \int_\Omega f\left(\frac{P(x)}{Q(x)}\right) dQ(x),

where f is a convex function such that f(1) = 0. Intuitively, the f-divergence tells us the average odds-ratio between P and Q weighted by the function f. For the t-SNE objective, the generic form of the f-divergence simplifies to

J_f(y_1, ..., y_m) := D_f(P || Q) = \sum_{i \neq j} q_{ij} f\left(\frac{p_{ij}}{q_{ij}}\right).    (1)

Table 1 shows a list of common instantiations of f-divergences and their corresponding t-SNE objectives, which we shall call f t-SNE.

Obviously, one expects different optimization objectives (i.e. different choices of f) to produce different results. A more significant question is whether these differences have any significant qualitative effects on the types of structure discovered.

An indication of why the choice of f might affect the type of structure revealed is that f-divergences are typically asymmetric, and penalize the ratio p_ij/q_ij (cf. Eq. 1) differently. KL-SNE (i.e. f taken as the KL-divergence, cf. Table 1), for instance, penalizes pairs of nearby points in the original space getting mapped far away in the embedded space more heavily than faraway points being mapped nearby (since the corresponding p_ij >> q_ij ≈ 0). Thus KL-SNE optimization prefers visualizations that do not distort local neighborhoods. In contrast, SNE with the reverse-KL-divergence criterion, RKL-SNE, as the name suggests, emphasizes the opposite, and better captures global structure in the corresponding visualizations.
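These asymmetries are easy to probe numerically. The sketch below (an illustrative re-implementation, not the released package) evaluates the generic criterion of Eq. (1) for each generator in Table 1.

import numpy as np

# Generator functions f(t) from Table 1; each is convex with f(1) = 0.
F_GENERATORS = {
    "KL":  lambda t: t * np.log(t),
    "RKL": lambda t: -np.log(t),
    "CH":  lambda t: (t - 1.0) ** 2,
    "JS":  lambda t: -(t + 1.0) * np.log((1.0 + t) / 2.0) + t * np.log(t),
    "HL":  lambda t: (np.sqrt(t) - 1.0) ** 2,
}

def ft_sne_objective(P, Q, name="KL", eps=1e-12):
    """Generic f t-SNE criterion of Eq. (1): sum_{i != j} q_ij f(p_ij / q_ij).
    P and Q are (m, m) similarity matrices with zero diagonals."""
    f = F_GENERATORS[name]
    mask = ~np.eye(P.shape[0], dtype=bool)
    t = (P[mask] + eps) / (Q[mask] + eps)
    return np.sum(Q[mask] * f(t))

For instance, ft_sne_objective(P, Q, "RKL") evaluates the reverse-KL criterion of Table 1; the eps term only guards against taking the logarithm of zero.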
A nice balance between the two extremes is achieved by JS- and HL-SNE (cf. Table 1), where JS is simply an arithmetic mean of the KL and RKL penalties, and HL is a sort of aggregated geometric mean. Meanwhile, CH-SNE can be viewed as a relative version of the (squared) L2 distance between the distributions, and is a popular choice for comparing bag-of-words models [23].

We can empirically observe how the p and q similarities are penalized by each divergence (see Figure 1). Our observations match our intuition: KL and CH are sensitive to high p and low q, whereas RKL is sensitive to low p and high q, and JS and HL are symmetric. The corresponding gradients w.r.t. q show that all divergences are generally sensitive when p is high and q is low. However, RKL, JS, and HL provide much smoother gradient signals over the p > q region. KL penalizes strictly towards high p and low q, and CH is much stricter towards the p >> q region.

A Neighborhood-level Precision-Recall Analysis.
The optimization criterion of f t-SNE is a complex non-convex function that is not conducive to a straightforward analysis without simplifying assumptions. To simplify the analysis, we consider pairs of points in a binary neighborhood setting, where, for each datapoint, other datapoints are either in its neighborhood or not in its neighborhood.

Let N_ε(x_i) and N_ε(y_i) denote the neighbors of points x_i and y_i, obtained by thresholding the pairwise similarities p_{j|i} and q_{j|i} at a fixed threshold ε, respectively. Let r_i := |N_ε(x_i)| and k_i := |N_ε(y_i)| denote the number of true and retrieved neighbors. Our simplifying binary neighborhood assumption can be formalized as

p_{ij} := a_i if x_j ∈ N_ε(x_i) and b_i otherwise;    q_{ij} := c_i if y_j ∈ N_ε(y_i) and d_i otherwise,

where a_i and c_i are large (a_i ≥ (1-δ)/r_i, c_i ≥ (1-δ)/k_i) and b_i and d_i are small (b_i ≤ δ/(m - r_i - 1), d_i ≤ δ/(m - k_i - 1)), for small δ.

Figure 2: f t-SNE embeddings obtained with interpolated divergences between KL and RKL (columns: RKL, high precision, α = 0; α = 0.01; α = 0.1; α = 0.5; KL, high recall, α = 1). (a) 3 well-separated Gaussian clusters. (b) Swiss roll manifold. The perplexity for each row corresponds to 10, 100, and 500, respectively.

In this binary formulation, we can rewrite each of the f-divergences in terms related to the embedding precision, the fraction of embedding-neighbors that are true neighbors, and the recall, the fraction of true neighbors that are also embedding-neighbors. Define n^i_TP := |N_ε(x_i) ∩ N_ε(y_i)|, n^i_FP := |N_ε(y_i) \ N_ε(x_i)|, and n^i_FN := |N_ε(x_i) \ N_ε(y_i)| to denote the number of true-positive, false-positive, and false-negative neighbors, respectively. In this notation, per-neighborhood precision is n^i_TP/k_i = 1 - n^i_FP/k_i and recall is n^i_TP/r_i = 1 - n^i_FN/r_i. This information retrieval analysis has previously been performed for KL-SNE [20]; here we extend it to other f-divergences to understand their assumptions.
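The neighborhood quantities used in this analysis (and again in the experimental criteria later) can be computed directly from the similarity matrices. The sketch below is illustrative and assumes the conditional similarities are given as dense matrices.

import numpy as np

def neighborhood_sets(S, eps):
    """Neighbor indicator matrix: j is a neighbor of i if the similarity
    S[i, j] exceeds the threshold eps (diagonal excluded)."""
    N = S > eps
    np.fill_diagonal(N, False)
    return N

def precision_recall(P_cond, Q_cond, eps):
    """Per-point neighborhood precision and recall, averaged over points.
    P_cond and Q_cond hold the conditional similarities p_{j|i} and q_{j|i}."""
    Nx = neighborhood_sets(P_cond, eps)       # true neighborhoods, sizes r_i
    Ny = neighborhood_sets(Q_cond, eps)       # embedding neighborhoods, sizes k_i
    tp = (Nx & Ny).sum(axis=1)                # n_TP^i
    precision = np.mean(tp / np.maximum(Ny.sum(axis=1), 1))
    recall = np.mean(tp / np.maximum(Nx.sum(axis=1), 1))
    return precision, recall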
Proposition 1. Under the binary-neighborhood assumption, for δ sufficiently small:

(i) J_KL ∝ \sum_i n^i_{FN}/r_i, i.e., the sum of (1 - recall) over points; minimizing it maximizes recall.

(ii) J_RKL ∝ \sum_i n^i_{FP}/k_i, i.e., the sum of (1 - precision) over points; minimizing it maximizes precision.

(iii) J_JS ∝ J_KL + J_RKL, and balances precision and recall.

(iv) The first two terms of HL-SNE balance precision and recall (their coefficients are close to 1, since δ is small). The last term forces preservation of neighborhood sizes, and strongly penalizes small embedding neighborhoods when precision is high:

J_HL ∝ \sum_i \left[ \frac{n^i_{FN}}{r_i}\left(1 - O(\sqrt{\delta r_i})\right) + \frac{n^i_{FP}}{k_i}\left(1 - O(\sqrt{\delta k_i})\right) + \frac{n^i_{TP}}{k_i}\left(\sqrt{\frac{r_i}{k_i}} - 1\right)^2 \right],

where the first two terms correspond to (1 - recall) and (1 - precision), and the last term couples precision with the neighborhood size ratio.

(v) CH-SNE is biased towards maximizing recall, since the multiplier of the recall term is much larger than that on precision. Like HL-SNE, the last term forces preservation of neighborhood sizes, and strongly penalizes small embedding neighborhoods when precision is high:

J_CH ∝ \sum_i \left[ \frac{n^i_{FN}}{r_i} \cdot \frac{m - k_i}{r_i \delta} + \frac{n^i_{FP}}{k_i} + \frac{n^i_{TP}}{k_i}\left(\frac{r_i}{k_i} - 1\right)^2 \right].

This proposition corroborates our intuition (see also Table 1), and provides a relationship between the proposed f t-SNE criteria and the types of neighborhood similarities that are preserved. KL-SNE maximizes neighborhood recall, while RKL-SNE maximizes neighborhood precision. All other criteria balance precision and recall in different ways. JS-SNE gives equal weight to precision and recall. HL-SNE gives approximately equal weight to precision and recall, with an extra term encouraging the original and embedding neighborhood sizes to match. This regularization term penalizes an embedding neighborhood that is much smaller than the original neighborhood more severely than the reverse case, and thus HL-SNE can be viewed as a regularized version of JS-SNE. CH-SNE gives more weight to maximizing recall, again with an extra term encouraging the original and embedding neighborhood sizes to match, and is thus similar to a regularized version of KL-SNE.
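As a quick numerical sanity check of claim (i), one can plug the binary-neighborhood values into the exact per-point KL term and compare it against (1 - recall) * C with C = log((1-δ)/δ), as in the proof in Appendix A. The sizes m, r, k, and n_tp below are arbitrary toy values chosen for illustration, not quantities from the paper.

import numpy as np

def binary_kl(r, k, n_tp, m, delta):
    """Exact per-point KL term under the binary-neighborhood model, together with
    the approximation (1 - recall) * C of Proposition 1(i)."""
    a, b = (1 - delta) / r, delta / (m - r - 1)          # large / small p values
    c, d = (1 - delta) / k, delta / (m - k - 1)          # large / small q values
    n_fn, n_fp = r - n_tp, k - n_tp
    n_tn = (m - 1) - n_tp - n_fn - n_fp
    exact = (n_tp * a * np.log(a / c) + n_fn * a * np.log(a / d)
             + n_fp * b * np.log(b / c) + n_tn * b * np.log(b / d))
    C = np.log((1 - delta) / delta)
    approx = (1 - n_tp / r) * C                          # (1 - recall) * C
    return exact, approx

for delta in (1e-4, 1e-8, 1e-16):
    exact, approx = binary_kl(r=50, k=40, n_tp=30, m=10_000, delta=delta)
    print(delta, exact / approx)    # the exact/approx ratio decreases toward 1 as delta shrinks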
Next, we connect these precision-recall interpretations of the various criteria to the types of intrinsic structure they preserve. Suppose the intrinsic structure within the data is clusters. A good embedding of this data would have points belonging to the same true cluster all grouped together in the visualization, but the specific locations of the embedded points within the cluster do not matter. Thus, cluster discovery requires good neighborhood recall, and one might expect KL-SNE to perform well. For neighborhood sizes similar to true cluster sizes, this argument is corroborated both theoretically by previous work [12, 3] and empirically by our experiments (Experiments section). Both theoretically and practically, the perplexity parameter, which is a proxy for neighborhood size, needs to be set so that the effective neighborhood size matches the cluster size for successful cluster discovery.

If the intrinsic structure within the data is a continuous manifold, a good embedding would preserve the smoothly varying structure, and not introduce artificial breaks in the data that lead to the appearance of clusters. Having a large neighborhood size (i.e. large perplexity) may not be conducive to this goal, again because the SNE optimization criterion does not care about the specific mapped locations of the datapoints within the neighborhood. Instead, it is preferable to have neighborhoods small enough that the manifold sections are approximately linear; one then requires high precision in these small neighborhoods. Thus one might expect RKL-SNE to fare well on manifold discovery tasks. Indeed, this is also corroborated practically in our experiments. (To the best of our knowledge, no theoretical work exists on this.)

Variational f t-SNE for practical usage and improved optimization.
The f t-SNE criteria can be optimized using gradient descent or one of its variants, e.g. stochastic gradient descent, and KL-SNE is classically optimized in this way. The proposed f t-SNE criteria (including KL-SNE) are non-convex, and gradient descent may not converge to a good solution. We explored minimizing the f t-SNE criteria by expressing them in terms of their conjugate dual [17, 16]:

D_f(P || Q) = \sum_{i \neq j} q_{ij} \left( \sup_{h \in \mathcal{H}} h((x_i, x_j)) \frac{p_{ij}}{q_{ij}} - f^*(h((x_i, x_j))) \right),

where H is the space of real-valued functions on the underlying measure space and f* is the Fenchel conjugate of f. In this equation, the supremum operator acts per data point, making optimization infeasible. Instead, we optimize the variational lower bound

D_f(P || Q) \geq \sup_{h \in \mathcal{H}} \sum_{i \neq j} \left[ h((x_i, x_j)) p_{ij} - f^*(h((x_i, x_j))) q_{ij} \right],

which is tight for a sufficiently expressive H. In practice, one uses a parametric hypothesis class H̄, and we use multilayer, fully-connected neural networks. Table 2 shows a list of common instantiations of f-divergences and their corresponding h(x) functions.

Table 2: Variational f t-SNE: generator f(t), Fenchel conjugate f*(t), and output activation h(x) for each divergence.

D_f(P || Q)                f(t)                             f*(t)                  h(x)
Kullback-Leibler (KL)      t log t                          exp(t - 1)             x
Reverse-KL (RKL)           -log t                           -1 - log(-t)           -exp(-x)
Jensen-Shannon (JS)        -(t+1) log((1+t)/2) + t log t    -log(2 - exp(t))       log(2) - log(1 + exp(-x))
Hellinger distance (HL)    (sqrt(t) - 1)^2                  t / (1 - t)            1 - exp(-x)
Chi-square (χ² or CS)      (t - 1)^2                        t^2/4 + t              x

Our variational form of the f t-SNE objective (or vf t-SNE) finally becomes the following minimax problem:

J(y_1, ..., y_m) = \min_{y_1, ..., y_m} \max_{\bar{h} \in \bar{\mathcal{H}}} \sum_{i \neq j} \left[ \bar{h}((x_i, x_j)) p_{ij} - f^*(\bar{h}((x_i, x_j))) q_{ij} \right].

We alternately optimize y_1, ..., y_m and h̄ (see Algorithm 1; more details are available in the S.M.).
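A minimal sketch of the quantity being optimized, using the conjugates and output activations of Table 2 (this is our illustration, not the released implementation):

import numpy as np

# Fenchel conjugates f*(t) and output activations h(x) from Table 2.
CONJUGATES = {
    "KL":  (lambda t: np.exp(t - 1.0),           lambda x: x),
    "RKL": (lambda t: -1.0 - np.log(-t),         lambda x: -np.exp(-x)),
    "JS":  (lambda t: -np.log(2.0 - np.exp(t)),  lambda x: np.log(2.0) - np.log1p(np.exp(-x))),
    "HL":  (lambda t: t / (1.0 - t),             lambda x: 1.0 - np.exp(-x)),
    "CS":  (lambda t: 0.25 * t ** 2 + t,         lambda x: x),
}

def variational_bound(P, Q, raw_scores, name="KL"):
    """Variational lower bound sum_{i != j} [ h(s_ij) p_ij - f*(h(s_ij)) q_ij ], where
    raw_scores[i, j] is the (unbounded) discriminator output for the pair (x_i, x_j)
    and h maps it into the domain of f*. For any scores this is <= the primal D_f(P || Q)."""
    fstar, h = CONJUGATES[name]
    mask = ~np.eye(P.shape[0], dtype=bool)
    t = h(raw_scores[mask])
    return np.sum(t * P[mask] - fstar(t) * Q[mask])

Plugging in the pointwise maximizer h = f'(p_ij/q_ij) makes the bound tight; in Algorithm 1 below, the scores instead come from the pair discriminator h̄, which is improved by gradient ascent while the embedding is updated by gradient descent.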
Algorithm 1  Variational (Adversarial) SNE Optimization

procedure Optimization(dataset {X_tr, X_vl}, learning rate η, f-divergence objective J)
    Initialize the discriminant parameter φ.
    while φ has not converged do
        for j = 1, ..., J do
            φ_{t+1} = φ_t + η ∇_φ J        (gradient ascent on the discriminator)
        for k = 1, ..., K do
            y_i^{t+1} = y_i^t - η_y ∇_{y_i} J    (gradient descent on the embedding)

Experiments

In this section, we compare the performance of the proposed f t-SNE methods in preserving different types of structure present in selected datasets. Next, we compare the efficacy of optimizing the primal versus the dual form of the f t-SNE criteria. Details about the datasets, optimization parameters, and architectures are described in the Supplementary Material (S.M.).
Table 3: Best f t-SNE method for each dataset and criterion, according to the maximum F-score in Figures 3 and 7. The K-Nearest, K-Farthest, and F-Score on X-Y columns compare the data space to the embedding; the F-Score on Z-Y column compares the class (latent) space to the embedding.

Data               Type                    K-Nearest   K-Farthest   F-Score on X-Y   F-Score on Z-Y
MNIST (Digit 1)    Manifold                RKL         RKL          RKL              -
Face               Manifold                HL, RKL     RKL          RKL              JS
MNIST              Clustering              KL          KL           CS               KL
GENE               Clustering              KL          KL           KL               KL
20 Newsgroups      Sparse & Hierarchical   CS          CS           CS               HL
ImageNet (sbow)    Sparse & Hierarchical   CS          CS           CS               KL
Datasets.
We compared the proposed f t-SNE methods on a variety of datasets with different latent structures. The MNIST dataset consists of images of handwritten digits from 0 to 9 [10]; thus the latent structure is clusters corresponding to each digit. We also tested on just the MNIST images of the digit 1, which corresponds to a continuous manifold. The Face dataset, proposed in [18], consists of rendered images of faces along a 3-dimensional manifold corresponding to up-down rotation, left-right rotation, and left-right position of the light source. The Gene dataset consists of RNA-Seq gene expression levels for patients with five different types of tumors [21], and thus has a cluster latent structure. The 20-Newsgroups dataset consists of text articles from a hierarchy of topics [9], and thus the latent structure corresponds to a hierarchical clustering. In addition, we used a bag-of-words representation of the articles, thus the feature representation is sparse (many of the features in the original representation will be 0). We also examined two synthetic datasets: the Swiss Roll dataset [18], which has a continuous manifold latent structure, and a simple dataset consisting of 3 Gaussian clusters in 2 dimensions, which has a cluster latent structure. Details of these datasets can be found in the Appendix.

f-divergences for SNE

We developed several criteria for quantifying the performance of the different f t-SNE methods. Our criteria are based on the observation that, if the local structure is well preserved, then the nearest neighbours in the original data space X should match the nearest neighbours in the embedded space Y. In addition, many of our datasets include a known latent variable, e.g. the discrete digit label for MNIST and the continuous head angle for Face. Thus, we also measure how well the embedded space captures the known structure of the latent space Z. We define the neighbors N_ε(x_i), N_ε(y_i), and N_ε(z_i) of points x_i, y_i, and z_i by thresholding the pairwise similarities p_{j|i}, q_{j|i}, and r_{j|i}, respectively, at a selected threshold ε. Here, r_{j|i} = r(z_j | z_i) is the pairwise similarity in the latent space Z. For discrete labels, we define r_{j|i} ∝ I(z_i = z_j). For continuous latent spaces, we use a t-distribution.

Using these definitions of neighbors, we can define precision and recall, considering the original X or latent Z spaces as true and the embedded space Y as the predicted:

Precision_X(ε) = \frac{1}{N} \sum_i \frac{|N_ε(y_i) \cap N_ε(x_i)|}{|N_ε(y_i)|},    Precision_Z(ε) = \frac{1}{N} \sum_i \frac{|N_ε(y_i) \cap N_ε(z_i)|}{|N_ε(y_i)|},

Recall_X(ε) = \frac{1}{N} \sum_i \frac{|N_ε(y_i) \cap N_ε(x_i)|}{|N_ε(x_i)|},    Recall_Z(ε) = \frac{1}{N} \sum_i \frac{|N_ε(y_i) \cap N_ε(z_i)|}{|N_ε(z_i)|}.

Alternatively, we can measure how well the embedded space preserves the nearest and farthest neighbor structure. Let NN_K(x_i) and NN_K(y_i) indicate the K nearest neighbors, and FN_K(x_i) and FN_K(y_i) indicate the K farthest neighbors. We define

NN-Precision(K) = \frac{1}{NK} \sum_i |NN_K(y_i) \cap NN_K(x_i)|,    FN-Precision(K) = \frac{1}{NK} \sum_i |FN_K(y_i) \cap FN_K(x_i)|.
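A small sketch of the nearest/farthest-neighbor criteria (our illustration; the thresholded precision and recall can be computed analogously, as in the earlier sketch):

import numpy as np

def knn_indices(D, K, largest=False):
    """Indices of the K nearest (or, if largest=True, K farthest) points per row
    of a pairwise distance matrix D (diagonal excluded)."""
    D = D.copy()
    np.fill_diagonal(D, -np.inf if largest else np.inf)
    order = np.argsort(-D if largest else D, axis=1)
    return order[:, :K]

def nn_fn_precision(DX, DY, K):
    """NN-Precision(K) and FN-Precision(K): average overlap between the K nearest
    (resp. farthest) neighbors in the data space X and the embedding space Y."""
    nn_x, nn_y = knn_indices(DX, K), knn_indices(DY, K)
    fn_x, fn_y = knn_indices(DX, K, largest=True), knn_indices(DY, K, largest=True)
    nn = np.mean([len(set(a) & set(b)) for a, b in zip(nn_x, nn_y)]) / K
    fn = np.mean([len(set(a) & set(b)) for a, b in zip(fn_x, fn_y)]) / K
    return nn, fn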
Figure 3: Precision-Recall curves for each of the proposed algorithms on all datasets (MNIST1, Face, MNIST, GENE, NEWS, SBOW). (a) Precision-Recall curves for X-Y (Precision_X(ε) vs. Recall_X(ε)). (b) Precision-Recall curves for Z-Y (Precision_Z(ε) vs. Recall_Z(ε)). Each row corresponds to a different quantitative criterion, each column to a different dataset, and each line to a different algorithm.

For each of the datasets, we produced Precision_X(ε)-Recall_X(ε) and Precision_Z(ε)-Recall_Z(ε) curves by varying ε, and NN-Precision(K)-FN-Precision(K) curves by varying K. Results are shown in Figures 3 and 7. Tables 4-12 summarize these results by presenting the algorithm with the highest maximum F-score per criterion. For the two manifold datasets, MNIST-Digit-1 and Face, RKL and JS outperformed KL. This reflects the analysis (see Proposition 1) that RKL and JS emphasize global structure more than KL, and global structure preservation is more important for manifolds. Conversely, KL performs best on the two cluster datasets, MNIST and GENE. Finally, CH and HL performed best on the hierarchical dataset, News (cf. [23]).

To better understand the relative strengths of KL and RKL, we qualitatively compared the embeddings resulting from interpolating between them:

α KL-SNE + (1 - α) RKL-SNE    (2)

for α = 0, 0.01, 0.1, 0.5, 1 (α = 0.5 corresponds to JS). Figure 2 presents the embedding results for two synthetic datasets, the Swiss Roll (a continuous manifold) and three Gaussian clusters, for a range of perplexity and α values. We observe that RKL worked better for manifolds with low perplexity, while KL worked better for clusters with larger perplexity (as predicted by the precision-recall analysis above). In addition, KL broke up the continuous Swiss Roll manifold into disjoint pieces, whereas RKL produced smoother embeddings under low perplexity. Finally, we did not see a continuous gradient in the embedding results as we changed α. Instead, even for small α, the Swiss Roll embedding was more similar to the discontinuous KL embedding. For this dataset, the embedding produced by JS was more similar to that produced by KL than to that produced by RKL. For the three Gaussian dataset, all algorithms separated the three clusters; however, KL and JS correctly formed circular clusters, while smaller values of α resulted in differently shaped clusters.
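For reference, the interpolated criterion of Eq. (2) can be evaluated directly from the similarity matrices. This is a sketch of one natural parameterization; the paper's released code may implement the interpolation differently.

import numpy as np

def interpolated_objective(P, Q, alpha, eps=1e-12):
    """Interpolated criterion of Eq. (2): alpha * J_KL + (1 - alpha) * J_RKL.
    Equivalently, the f t-SNE objective with generator
    f_alpha(t) = alpha * t * log(t) - (1 - alpha) * log(t)."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    p, q = P[mask] + eps, Q[mask] + eps
    j_kl = np.sum(p * np.log(p / q))
    j_rkl = np.sum(q * np.log(q / p))
    return alpha * j_kl + (1.0 - alpha) * j_rkl

Setting alpha to 1 and 0 recovers the KL and RKL criteria, respectively, and alpha = 0.5 gives the balanced (JS-like) criterion discussed above.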
Optimization of the primal versus variational form

In this section, we quantitatively and qualitatively compare the efficacy of optimizing the primal (f t-SNE) versus the variational (vf t-SNE) forms of the criteria. Quantitatively, we compared the primal f t-SNE criteria at solutions found using both methods during and after optimization.

Figure 4 shows the log primal f t-SNE criteria of the final solutions using both optimization methods for different f-divergences and different perplexities for MNIST (Supp. Fig. 13 shows results for other datasets).

Figure 4: Log of the primal f t-SNE loss for the f t-SNE and vf t-SNE algorithms for different perplexities on MNIST (KL, JS, and CH panels). The number of updates was set to J:K = 10:10 and the discriminator was a two-hidden-layer (10-20) deep ReLU neural network.

Figure 5: Comparison of the log t-SNE criterion for different parameter choices. All plots show results for the KL divergence on the MNIST dataset (perplexity 2,000); results for other divergences and datasets are in the S.M. (a) Different numbers of updates (J:K) to the discriminator and embedding weights with a fixed network architecture (2 layers of 10 and 20 hidden units). (b) Different network widths, with fixed J:K = 10:10. (c) Different network depths (widths specified in the legend), J:K = 10:10.

We found that for small perplexities vf t-SNE outperforms f t-SNE, and this difference decreases as perplexity increases: f t-SNE and vf t-SNE converge to the same loss values as the perplexity increases. However, even at the largest perplexity (2,000), vf t-SNE achieves a slightly lower loss than f t-SNE. This is surprising, since vf t-SNE minimizes a lower bound of f t-SNE, the criterion we are using for comparison, and suggests that optimizing the primal form using gradient descent can result in bad local minima.

We next evaluated the performance of the vf t-SNE algorithm as we varied some of the parameters of the method. Figure 5a compares the results as we vary the numbers of updates J and K performed on the discriminator and embedding weights (Algorithm 1). For the KL divergence, we found that optimizing the variational form performed better for all choices of J and K, both in terms of the rate of convergence and the final solution found. For the CH and JS divergences, nearly all choices of J and K resulted in faster optimization (see Supp. Fig. 11). This plot is in terms of the number of updates; wall clock time is shown in Table 13 (S.M.).

Figures 5b and 5c compare the results as we change the architecture of the discriminator. We experimented with a linear classifier and neural networks with 1-3 hidden layers of varying numbers of hidden units (network width). Figure 5b compares results as we vary network width (architectures shown in Supp. Fig. 14) and Figure 5c compares results as we change network depth (architectures shown in Supp. Fig. 15). We observed that the performance was largely consistent as we changed the network architecture. The results for JS- and CH-SNE are shown in Supp. Fig. 16a and 16b.

Discussion and Related Work
Other divergences for t-SNE optimization have been explored previously. Perhaps the first detailed study was done by Bunte et al. [4], where they explored divergences from various families (Gamma-, Bregman-, and f-divergences) and their corresponding visualizations on some image processing datasets. Yang et al. [22] and Narayan et al. [15] recently discussed how different divergences can be used to find micro and macro relationships in data. An interesting line of work by Lee et al. [11] and Najim and Lim [14] highlights the issues of trustworthy structure discovery and multi-scale visualizations to find local and global structures. The work by Amid et al. [2] is closely related, where they study α-divergences from an information retrieval perspective. Our work extends it to the general class of f-divergences and explores the relationships between data structure and the type of divergence used.

It is worth emphasizing that no previous study makes an explicit connection between the choice of divergence and the type of structure discovered. Our work makes this explicit and should help a practitioner gain better insights about their data in the data exploration phase. Our work goes a step further and attempts to ameliorate the issues of the non-convex objective function in the f t-SNE criterion. By studying the variational dual form, we can achieve better quality (locally optimal) solutions, which would be extremely beneficial to the practitioner.

References

[1] W. Abdelmoula, K. Škrášková, B. Balluff, R. Carreira, E. Tolner, B.P.F. Lelieveldt, L.J.P. van der Maaten, H. Morreau, A. van den Maagdenberg, R. Heeren, L. McDonnell, and J. Dijkstra. Automatic generic registration of mass spectrometry imaging data to histology using nonlinear stochastic embedding. Analytical Chemistry, 86(18):9204-9211, 2014.
[2] Ehsan Amid, Onur Dikmen, and Erkki Oja. Optimizing the information retrieval trade-off in data visualization using α-divergence. arXiv preprint arXiv:1505.05821, 2015.
[3] Sanjeev Arora, Wei Hu, and Pravesh Kothari. An analysis of the t-SNE algorithm for data visualization. Conference on Learning Theory (COLT), 2018.
[4] Kerstin Bunte, Sven Haase, Michael Biehl, and Thomas Villmann. Stochastic neighbor embedding (SNE) for dimension reduction and visualization using arbitrary divergences. Neurocomputing, 90:23-45, 2012.
[5] Kelvin Ch'ng, Nick Vazquez, and Ehsan Khatami. Unsupervised machine learning account of magnetic transitions in the Hubbard model. Physical Review E, 97, 2018.
[6] I. Gashi, V. Stankovic, C. Leita, and O. Thonnard. An experimental study of diversity with off-the-shelf antivirus engines. Proceedings of the IEEE International Symposium on Network Computing and Applications, pages 4-11, 2009.
[7] P. Hamel and D. Eck. Learning features from music audio with deep belief networks. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 339-344, 2010.
[8] A.R. Jamieson, M.L. Giger, K. Drukker, H. Lui, Y. Yuan, and N. Bhooshan. Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian eigenmaps and t-SNE. Medical Physics, 37(1):339-351, 2010.
[9] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, Carnegie Mellon University, Dept. of Computer Science, 1996.
[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[11] John A. Lee, Diego H. Peluffo-Ordóñez, and Michel Verleysen. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169:246-261, 2015.
[12] George C. Linderman and Stefan Steinerberger. Clustering with t-SNE, provably. arXiv preprint arXiv:1706.02582, 2017.
[13] A. Mahfouz, M. van de Giessen, L.J.P. van der Maaten, S. Huisman, M.J.T. Reinders, M.J. Hawrylycz, and B.P.F. Lelieveldt. Visualizing the spatial gene expression organization in the brain through non-linear similarity embeddings. Methods, 73:79-89, 2015.
[14] Safa A. Najim and Ik Soo Lim. Trustworthy dimension reduction for visualization different data sets. Information Sciences, 278:206-220, 2014.
[15] Karthik Narayan, Ali Punjani, and Pieter Abbeel. Alpha-beta divergences discover micro and macro structures in data. International Conference on Machine Learning (ICML), 2015.
[16] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. Proceedings of the Neural Information Processing Systems (NIPS), 2008.
[17] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
[18] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[19] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9:2579-2605, 2008.
[20] Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research (JMLR), 11:451-490, 2010.
[21] John N. Weinstein, Eric A. Collisson, Gordon B. Mills, Kenna R. Mills Shaw, Brad A. Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M. Stuart, Cancer Genome Atlas Research Network, et al. The Cancer Genome Atlas pan-cancer analysis project. Nature Genetics, 45(10):1113, 2013.
[22] Zhirong Yang, Jaakko Peltonen, and Samuel Kaski. Optimization equivalence of divergences improves neighbor embedding. International Conference on Machine Learning (ICML), 2014.
[23] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision (IJCV), 2007.

Appendix A  Precision and Recall
Proof.
For KL:

J_KL(x_i) = \sum_j p_{ji} \log\frac{p_{ji}}{q_{ji}}
          = n^i_{TP}\, a_i \log\frac{a_i}{c_i} + n^i_{FN}\, a_i \log\frac{a_i}{d_i} + n^i_{FP}\, b_i \log\frac{b_i}{c_i} + n^i_{TN}\, b_i \log\frac{b_i}{d_i},

where n^i_{TP}, n^i_{FN}, n^i_{FP}, n^i_{TN} are the numbers of true positives, false negatives (missed points), false positives, and true negatives, respectively, for point x_i. Given that δ is close to 0, the coefficients of n^i_{FN} and n^i_{FP} dominate the other terms, so

J_KL = n^i_{FN}\, a_i \log\frac{a_i}{d_i} + n^i_{FP}\, b_i \log\frac{b_i}{c_i} + O(δ)
     = \frac{n^i_{FN}(1-δ)}{r_i} \log\left(\frac{1-δ}{δ} \cdot \frac{m - k_i - 1}{r_i}\right) + \frac{n^i_{FP}\, δ}{m - r_i - 1} \log\left(\frac{δ}{1-δ} \cdot \frac{k_i}{m - r_i - 1}\right) + O(δ).

Again, \log\frac{1-δ}{δ} dominates the other logarithmic terms \log\frac{m - r_i - 1}{k_i} and \log\frac{m - k_i - 1}{r_i}, so we have

J_KL = \left(\frac{n^i_{FN}}{r_i}(1-δ) - \frac{n^i_{FP}}{m - r_i - 1}\, δ\right) \log\frac{1-δ}{δ} + O(δ)
     = \frac{n^i_{FN}}{r_i}\, C + O(δ) = (1 - Recall(i))\, C + O(δ),

where C = \log\frac{1-δ}{δ}.

For reverse KL:

J_RKL(x_i) = -\sum_j q_{ji} \log\frac{p_{ji}}{q_{ji}}
           = -n^i_{TP}\, c_i \log\frac{a_i}{c_i} - n^i_{FN}\, d_i \log\frac{a_i}{d_i} - n^i_{FP}\, c_i \log\frac{b_i}{c_i} - n^i_{TN}\, d_i \log\frac{b_i}{d_i}.

Given that δ is close to 0, the coefficients of n^i_{FN} and n^i_{FP} dominate the other terms, so

J_RKL = -n^i_{FN}\, d_i \log\frac{a_i}{d_i} - n^i_{FP}\, c_i \log\frac{b_i}{c_i} + O(δ)
      = -\frac{n^i_{FN}\, δ}{m - k_i - 1} \log\left(\frac{1-δ}{δ} \cdot \frac{m - k_i - 1}{r_i}\right) - \frac{n^i_{FP}(1-δ)}{k_i} \log\left(\frac{δ}{1-δ} \cdot \frac{k_i}{m - r_i - 1}\right) + O(δ).
The factor \log\frac{1-δ}{δ} again dominates the other logarithmic terms, so we have

J_RKL = \left(\frac{n^i_{FP}}{k_i}(1-δ) - \frac{n^i_{FN}}{m - k_i - 1}\, δ\right) \log\frac{1-δ}{δ} + O(δ)
      = \frac{n^i_{FP}}{k_i}\, C + O(δ) = (1 - Precision(i))\, C + O(δ),

where C = \log\frac{1-δ}{δ}.

For Jensen-Shannon:

J_JS(x_i) = \frac{1}{2}\sum_j p_{ji} \log\frac{2 p_{ji}}{p_{ji} + q_{ji}} + \frac{1}{2}\sum_j q_{ji} \log\frac{2 q_{ji}}{p_{ji} + q_{ji}}
          = \frac{1}{2}\left[ -\sum_j p_{ji} \log\frac{p_{ji} + q_{ji}}{p_{ji}} - \sum_j q_{ji} \log\frac{p_{ji} + q_{ji}}{q_{ji}} + \log 4 \right]
          \approx \frac{1}{2}\left[ -\sum_j p_{ji} \log\frac{q_{ji}}{p_{ji}} - \sum_j q_{ji} \log\frac{p_{ji}}{q_{ji}} + \log 4 \right]
          = \frac{1}{2}\left( J_KL(x_i) + J_RKL(x_i) + \log 4 \right).

For the Chi-square distance:

J_CS(x_i) = \sum_j q_{ji} \left(\frac{p_{ji}}{q_{ji}} - 1\right)^2
          = n^i_{TP}\, c_i \left(\frac{a_i}{c_i} - 1\right)^2 + n^i_{FN}\, d_i \left(\frac{a_i}{d_i} - 1\right)^2 + n^i_{FP}\, c_i \left(\frac{b_i}{c_i} - 1\right)^2 + n^i_{TN}\, d_i \left(\frac{b_i}{d_i} - 1\right)^2.

Given that δ is near 0, the last term n^i_{TN} d_i (b_i/d_i - 1)^2 is eliminated, so we have

J_CS(x_i) = \frac{n^i_{TP}}{k_i}(1-δ)\, C_1 + \frac{n^i_{FN}}{r_i}(1-δ)\, C_2 + \frac{n^i_{FP}}{k_i}(1-δ) + O(δ)
          = Precision(i)\, C_1 + (1 - Recall(i))\, C_2 + (1 - Precision(i)) + O(δ),

where C_1 = (r_i/k_i - 1)^2 and C_2 = \frac{1-δ}{δ} \cdot \frac{m - k_i - 1}{r_i} - 1.

The proof layout is similar for the Hellinger distance, except that it emphasizes recall and has less strict penalties:

J_HL(x_i) = n^i_{TP}\, c_i \left(\sqrt{\frac{a_i}{c_i}} - 1\right)^2 + n^i_{FN}\, d_i \left(\sqrt{\frac{a_i}{d_i}} - 1\right)^2 + n^i_{FP}\, c_i \left(\sqrt{\frac{b_i}{c_i}} - 1\right)^2 + O(δ)
          = \frac{n^i_{TP}}{k_i}(1-δ)\, C_1 + \frac{n^i_{FN}}{r_i}(1-δ)\, C_2 + \frac{n^i_{FP}}{k_i}(1-δ)\, C_3 + O(δ)
          = Precision(i)\, C_1 + (1 - Recall(i))\, C_2 + (1 - Precision(i))\, C_3 + O(δ),

where C_1 = (\sqrt{r_i/k_i} - 1)^2, C_2 = (1 - \sqrt{r_i δ/(1-δ)})^2, and C_3 = (1 - \sqrt{k_i δ/(1-δ)})^2.

Appendix B  Variational f t-SNE

It is standard to relax the optimization of the variational f t-SNE (minimax) objective by alternately optimizing the parameters φ and y_1, ..., y_m; Algorithm 1 alternately updates y_1, ..., y_m and φ. The parametric hypothesis class H̄ is parameterized by φ (for instance, φ are the weights of the deep neural network). Note that this is not guaranteed to return the same solution as the original minimax objective. Thus it is possible that Algorithm 1 finds different solutions depending on the choice of J and K and under different measures.

Appendix C  Experimental Supplementary Materials

Figure 6: Synthetic datasets. (a) Three Gaussian clusters. (b) Swiss Roll.
Datasets. Throughout the experiments, we diversified our datasets by selecting manifold, cluster, and hierarchical datasets. We first experimented with two synthetic datasets, the Swiss Roll and three Gaussian cluster datasets (see S.M. Figure 6). Then, we conducted the set of experiments on the FACE, MNIST, and 20 Newsgroups datasets. FACE and MNIST with a single digit (MNIST1) fall under manifold datasets, while MNIST and 20 Newsgroups fall under cluster and hierarchical cluster datasets.

• FACE contains 698 64 x 64 face images. The face varies smoothly with respect to light intensities and poses. This is the face dataset used in the Isomap paper [18]; we use it as a manifold dataset.

• MNIST consists of 28 x 28 handwritten digit images with digits from 0 to 9. The MNIST data points were projected down to lower-dimensional features using PCA. We used MNIST as both a clustering and a manifold dataset. For the clustering dataset, we used 6,000 examples of the first five digits (MNIST). For the manifold dataset, we used 6,000 examples of the digit one (MNIST1).

• 20 Newsgroups: we used articles from the topics rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, and talk.religion.misc. Hence, this dataset corresponds to a sparse hierarchical clustering dataset.
Figure 7: Precision-Recall curves for each of the proposed algorithms on all datasets (MNIST1, Face, MNIST, GENE, NEWS, SBOW): FN-Precision vs. NN-Precision. Each row corresponds to a different dataset, each column to a different quantitative criterion, and each line to a different algorithm.

Optimization. We use a gradient descent method with momentum to optimize the f t-SNE criteria. We decrease the learning rate and momentum over time as ε_{t+1} = ε_t / (1 + tρ) and λ_{t+1} = λ_t / (1 + tη), where ε_t and λ_t are the learning rate and momentum, and ρ and η are the learning rate decay and momentum decay parameters. t-SNE has very tiny gradients in the beginning, since all the parameters are initialized in quite a small domain (the initial embeddings are drawn from a zero-mean Normal distribution with a very small standard deviation). However, once the embedding parameters spread out, the gradients become relatively large compared to the early stage. Thus, the learning rate and momentum need to be adjusted appropriately over the different stages of optimization.
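A small sketch of this decay schedule; the initial values and decay constants below are placeholders, not the settings used in the paper's experiments.

def decayed(value0, decay, t):
    """Inverse-time decay used for both the learning rate and the momentum:
    value_{t+1} = value_t / (1 + t * decay), applied iteratively from step 0."""
    value = value0
    for step in range(t):
        value = value / (1.0 + step * decay)
    return value

# Example: learning rate and momentum after 100 updates (placeholder constants).
lr_100 = decayed(value0=0.1, decay=1e-3, t=100)
mom_100 = decayed(value0=0.9, decay=1e-4, t=100)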
C.1 More Experimental Results: Synthetic Data Experiments
Figure 8: t-SNE embeddings on three Gaussian clusters with interpolated divergences (RKL: α = 0; α = 0.01; α = 0.1; JS: α = 0.5; KL: α = 1), at perplexities 10, 100, and 500.

Figure 9: t-SNE embeddings on the Swiss Roll with the same interpolated divergences, at (a) perplexity 10, (b) perplexity 100, and (c) perplexity 500.

Table 4: F-Score on X-Y MNIST

KL      RKL     JS      HL      CH
0.3524  0.4241  0.4190  0.4155  0.0922
0.4440  0.5699  0.5395  0.5332  0.1476
0.4724  0.6057  0.5570  0.5546  0.2349
0.4603  0.5687  0.5220  0.5217  0.3007
0.4407  0.5234  0.4843  0.4843  0.3222
0.4202  0.4798  0.4491  0.4503  0.3350
0.4016  0.4413  0.4189  0.4210  0.3420
0.3836  0.4090  0.3939  0.3964  0.3449
0.3667  0.3814  0.3722  0.3745  0.3446
Table 5: F-Score on X-Y FACE
KL      RKL     JS      HL      CH
0.4019  0.4186  0.4156  0.4115  0.2112
0.5648  0.6446  0.6216  0.6197  0.2919
0.6236  0.7534  0.7200  0.7146  0.3505
0.5865  0.6970  0.6793  0.6764  0.3674
0.5354  0.6207  0.6105  0.6095  0.3689
0.4870  0.5447  0.5441  0.5412  0.3642
0.4464  0.4841  0.4862  0.4826  0.3573
Table 6: F-Score on X-Y

KL      RKL     JS      HL      CH
0.4795  0.4444  0.4693  0.4494  0.4787
0.5938  0.5805  0.6006  0.5667  0.6109
0.5466  0.5872  0.5834  0.5426  0.5772
0.4891  0.5334  0.5256  0.4895  0.5200
0.4457  0.4868  0.4783  0.4494  0.4742
0.4149  0.4495  0.4409  0.4186  0.4391
0.3925  0.4180  0.4103  0.3950  0.4112
0.3749  0.3920  0.3856  0.3757  0.3879
0.3597  0.3699  0.3657  0.3595  0.3688
Table 7: F-Score on X-Y NEWS
KL      RKL     JS      HL      CH
0.3996  0.3665  0.3922  0.3989  0.2637
0.4256  0.4108  0.4328  0.4331  0.3703
0.3820  0.3933  0.4001  0.3964  0.4466
0.3387  0.3569  0.3559  0.3526  0.4633
0.3062  0.3209  0.3188  0.3163  0.4599
0.2796  0.2899  0.2877  0.2862  0.4476
0.2562  0.2626  0.2610  0.2594  0.4297
0.2346  0.2380  0.2370  0.2357  0.4076
0.2145  0.2162  0.2157  0.2148  0.3824
Table 8: F-Score on X-Y IMAGENET SBOW
KL      RKL     JS      HL      CH
0.4317  0.3411  0.3456  0.2431  0.4395
0.4686  0.3825  0.3823  0.2977  0.4889
0.4297  0.3635  0.3638  0.3157  0.4493
0.3747  0.3259  0.3280  0.3066  0.3838
0.3265  0.2932  0.2941  0.2914  0.3320
0.2867  0.2635  0.2659  0.2728  0.2919
0.2553  0.2403  0.2421  0.2532  0.2614
0.2306  0.2218  0.2222  0.2336  0.2373
0.2111  0.2065  0.2061  0.2149  0.2177
Table 9: F-Score on X-Z FACE
KL      RKL     JS      HL      CH
0.1444  0.1461  0.1482  0.1464  0.0930
0.2220  0.2318  0.2370  0.2327  0.1434
0.3012  0.3284  0.3262  0.3217  0.2043
0.3322  0.3511  0.3585  0.3524  0.2405
0.3421  0.3401  0.3652  0.3596  0.2645
0.3431  0.3267  0.3615  0.3557  0.2807
0.3396  0.3171  0.3531  0.3486  0.2915
0.3337  0.3068  0.3423  0.3384  0.2965
0.3249  0.3015  0.3278  0.3261  0.2983
Table 10: F-Score on X-Z MNIST
KL      RKL     JS      HL      CH
0.3872  0.3503  0.3810  0.3686  0.3783
0.6137  0.5269  0.6007  0.5622  0.5967
0.7238  0.6128  0.7023  0.6584  0.6884
0.7531  0.6406  0.7246  0.6958  0.7054
0.6903  0.5974  0.6649  0.6826  0.6524
0.6027  0.5379  0.5795  0.6224  0.5698
0.5277  0.4850  0.5073  0.5537  0.4989
0.4667  0.4344  0.4472  0.4940  0.4414
0.4186  0.3923  0.3994  0.4448  0.3958
Table 11: F-Score on X-Z NEWS
KL      RKL     JS      HL      CH
0.0066  0.0076  0.0074  0.0076  0.0069
0.0036  0.0041  0.0038  0.0040  0.0034
0.0018  0.0020  0.0020  0.0020  0.0018
0.0013  0.0014  0.0015  0.0015  0.0014
0.0011  0.0011  0.0012  0.0012  0.0013
0.0011  0.0010  0.0011  0.0011  0.0011
0.0010  0.0009  0.0010  0.0009  0.0011
0.0010  0.0009  0.0009  0.0009  0.0010
0.0010  0.0009  0.0009  0.0009  0.0010
Table 12: F-Score on X-Z IMAGENET SBOW
KL      RKL     JS      HL      CH
0.0063  0.0051  0.0044  0.0052  0.0043
0.0025  0.0021  0.0025  0.0027  0.0023
0.0016  0.0014  0.0017  0.0018  0.0014
0.0012  0.0013  0.0013  0.0013  0.0011
0.0010  0.0011  0.0011  0.0011  0.0010
0.0009  0.0009  0.0009  0.0009  0.0009
0.0008  0.0008  0.0008  0.0008  0.0008
0.0007  0.0007  0.0007  0.0007  0.0007
0.0006  0.0006  0.0007  0.0006  0.0007
Table 13: Wall-clock time for vKL-SNE to achieve the same level of loss as KL-SNE.

Data               KL-SNE   vSNE (20 hids)   vSNE (10-20 hids)   vSNE (5-10-20 hids)
MNIST (Digit 1)    294s     230.1s
C.2 More Experimental Results: Optimization of the primal form versus variational form (Duality Gap) Analysis

Figure 10: Discriminant architecture.
Figure 11: Log f t-SNE criterion during optimization for different choices of J and K on MNIST, MNIST1, and NEWS. Panels: (a) CS-, JS-, and KL-SNE on MNIST; (b) Chi-Square on MNIST1 and NEWS; (c) Jensen-Shannon on MNIST1 and NEWS. A two-hidden-layer (10-20) deep ReLU neural network was used as the discriminator and the perplexity was set to 2,000.

Figure 12: f t-SNE loss (KL-, JS-, and RKL-SNE) with respect to different perplexities on the three Gaussian cluster and Swiss Roll datasets.
Figure 13: f t-SNE and vf t-SNE loss with respect to different perplexities on MNIST1 and NEWS. (a) Chi-Square. (b) Jensen-Shannon.
C.2.1 More Experimental Results with different discriminant functions
Architecture
Optimizing the vf t-SNE criterion requires a discriminant function. Throughout the experiments, we used a deep neural network as the discriminator. The architecture topology is defined as D(x_i, x_j) = g([f(x_i) + f(x_j); f(x_i) ⊙ f(x_j)]) (depicted in S.M. Figure 10), where f(·) is the neural network that encodes each datapoint of the pair, f(x_i) and f(x_j), and g(·) is the neural network that takes [f(x_i) + f(x_j); f(x_i) ⊙ f(x_j)] and outputs the score value. This architecture is invariant to the ordering of the data points (i.e., D(x_i, x_j) = D(x_j, x_i)). We used 10 hidden units and 20 hidden units for f(·) and g(·), respectively, in the experiments, except when we experimented with the expressibility of the discriminant function.
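A minimal NumPy sketch of this order-invariant discriminator (the input dimensionality and the random initialization below are placeholders; the released code trains these weights and may differ in details):

import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Encoder f(.) with 10 hidden units and scorer g(.) with 20 hidden units, as in the text.
D_in = 50                                                   # input dimensionality (placeholder)
Wf, bf = 0.1 * rng.standard_normal((D_in, 10)), np.zeros(10)
Wg, bg = 0.1 * rng.standard_normal((20, 20)), np.zeros(20)
wo, bo = 0.1 * rng.standard_normal(20), 0.0

def discriminator(x_i, x_j):
    """Order-invariant pair score D(x_i, x_j) = g([f(x_i) + f(x_j); f(x_i) * f(x_j)])."""
    fi = relu(x_i @ Wf + bf)                                # shared encoder f(.)
    fj = relu(x_j @ Wf + bf)
    z = np.concatenate([fi + fj, fi * fj])                  # symmetric combination
    h1 = relu(z @ Wg + bg)                                  # scorer g(.)
    return float(h1 @ wo + bo)                              # raw score; h(.) from Table 2 is applied afterwards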
Figure 14: Discriminant architectures of different widths.
Figure 15: Discriminant architectures of different depths (20, 10-20, and 5-10-20 hidden units).
Figure 16: Log f t-SNE criterion during optimization for different discriminator network architectures on MNIST (KL-, JS-, and CS-SNE). (a) Varying discriminator network width: two-layer networks of different widths (the legend indicates the number of units in the first and second hidden layers). (b) Varying discriminator network depth (the legend indicates the number of units in each hidden layer). The number of updates was set to J:K = 10:10 and the perplexity was set to 2,000.
Figure 17: Log f t-SNE criterion during optimization for discriminator network architectures of different widths on MNIST1 and NEWS (CS-SNE and JS-SNE).

Appendix D  Embeddings
Figure 18 presents the embeddings of KL-SNE and RKL-SNE. Note that KL-SNE generates spurious clusters on the bottom left of the embedding, whereas RKL-SNE generates smooth embeddings that capture the manifold structure. In practice, we do not want to generate such spurious clusters, because practitioners can misinterpret the visualization of the dataset.

Figure 18: Face embeddings with KL-SNE and RKL-SNE.
Figure 19: Face embeddings using f t-SNE (KL-, RKL-, JS-, HL-, and CH-SNE), perplexity = 300. (a) Coloured based on pose 1. (b) Coloured based on pose 2.
Figure 20: MNIST embeddings using f t-SNE (KL-, RKL-, JS-, HL-, and CH-SNE).