LDLE: Low Distortion Local Eigenmaps
Dhruv Kohli^{a,∗}, Alexander Cloninger^{a,b}, Gal Mishne^{b}

^{a} Department of Mathematics, University of California, San Diego, CA
^{b} Halicioğlu Data Science Institute, University of California, San Diego, CA

∗ Corresponding author. Email addresses: [email protected] (Dhruv Kohli), [email protected] (Alexander Cloninger), [email protected] (Gal Mishne)
Abstract
We present Low Distortion Local Eigenmaps (LDLE), a manifold learning technique which constructs a set of low distortion local views of a dataset in lower dimension and registers them to obtain a global embedding. The local views are constructed using the global eigenvectors of the graph Laplacian and are registered using Procrustes analysis. The choice of these eigenvectors may vary across the regions. In contrast to existing techniques, LDLE is more geometric and can embed manifolds without boundary as well as non-orientable manifolds into their intrinsic dimension.
Keywords:
Manifold learning, graph Laplacian, local correlation, Procrustes analysis, low distortion, no boundary, non-orientable.
1. Introduction
Nonlinear dimensionality reduction techniques such as Local Linear Embedding [1], Hessian eigenmaps [2], Laplacian eigenmaps [3], t-SNE [4] and UMAP [5] aim at preserving local distances as they map a manifold embedded in higher dimension into lower (possibly intrinsic) dimension. In particular, UMAP and t-SNE follow a top-down approach where they start with an initial low-dimensional global embedding and then refine it by minimizing a local distortion measure on it. In contrast, a bottom-up embedding approach can be imagined to consist of two steps: first obtaining low distortion local views of the manifold in lower dimension, and then registering them to obtain a global embedding of the manifold. These local views can be constructed using the coordinate charts of the manifold. In this paper, we take the local perspective to embed a manifold in low dimension.

Let $(\mathcal{M}, g)$ be a $d$-dimensional Riemannian manifold with finite volume. By definition, for every $x_k$ in $\mathcal{M}$, there exists a coordinate chart $(U_k, \Phi_k)$ such that $x_k \in U_k$, $U_k \subset \mathcal{M}$ and $\Phi_k$ maps $U_k$ into $\mathbb{R}^d$. One can imagine $U_k$ to be a local view of $\mathcal{M}$ in the ambient space. Using rigid transformations, these local views can be registered to recover $\mathcal{M}$. Similarly, $\Phi_k(U_k)$ can be imagined to be a local view of $\mathcal{M}$ in the $d$-dimensional embedding space $\mathbb{R}^d$. Again using rigid transformations, these local views can be registered to obtain the $d$-dimensional embedding of $\mathcal{M}$.

As there may exist multiple mappings which map $U_k$ into $\mathbb{R}^d$, a natural strategy is to choose a mapping with low distortion. This results in an embedding in which the local geodesic distances are approximately preserved. Let $d_g(x, y)$ denote the shortest geodesic distance between $x, y \in \mathcal{M}$. The distortion of $\Phi_k$ on $U_k$, as defined in [6], is given by
\[ \text{Distortion}(\Phi_k, U_k) = \|\Phi_k\|_{\text{Lip}} \, \|\Phi_k^{-1}\|_{\text{Lip}}; \tag{1} \]
here $\|\Phi_k\|_{\text{Lip}}$ is the Lipschitz norm of $\Phi_k$ given by
\[ \|\Phi_k\|_{\text{Lip}} = \sup_{\substack{x, y \in U_k \\ x \neq y}} \frac{\|\Phi_k(x) - \Phi_k(y)\|}{d_g(x, y)}, \tag{2} \]
and similarly,
\[ \|\Phi_k^{-1}\|_{\text{Lip}} = \sup_{\substack{x, y \in U_k \\ x \neq y}} \frac{d_g(x, y)}{\|\Phi_k(x) - \Phi_k(y)\|}. \tag{3} \]
If $\text{Distortion}(\Phi_k, U_k) = 1$, then $\Phi_k$ is said to have no distortion on $U_k$. It is not always possible to obtain a mapping with no distortion. For example, a locally curved region on a surface cannot be isometrically mapped into a Euclidean plane.
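The distortion in Eqs. (1)–(3) has a direct discrete analogue that is used repeatedly later (see Eq. (39)): on a finite neighborhood the suprema become maxima over point pairs. The following Python sketch computes it, approximating geodesic distances by Euclidean ones as done in Section 3; the function name and setup are illustrative, not from a reference implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distortion(X, Y):
    """Empirical distortion of a map on a neighborhood (discrete Eqs. (1)-(3)).

    X : (n, D) points of a local view in the ambient space.
    Y : (n, d) their images under the local parameterization.
    Geodesic distances are approximated by Euclidean ones, which is
    reasonable only for small neighborhoods.
    """
    dX = pdist(X)          # condensed pairwise distances in the input
    dY = pdist(Y)          # pairwise distances of the mapped points
    ratio = dY[dX > 0] / dX[dX > 0]   # guard against duplicate points
    lip = ratio.max()                 # ~ ||Phi||_Lip
    lip_inv = (1.0 / ratio).max()     # ~ ||Phi^{-1}||_Lip
    return lip * lip_inv              # = 1 iff the view is undistorted

# Example: an isometry has distortion 1; a squeeze does not.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
print(distortion(X, X @ np.array([[0.0, -1.0], [1.0, 0.0]])))  # rotation -> 1.0
print(distortion(X, X * np.array([1.0, 0.2])))                 # squeeze  -> ~5.0
```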
Various algorithms for manifold learning, including Laplacian eigenmaps [3], UMAP [5], and t-SNE [4], also focus on preserving local distances. Laplacian eigenmaps uses the eigenvectors corresponding to the $d$ smallest eigenvalues (excluding zero) of the normalized graph Laplacian to embed the manifold in $\mathbb{R}^d$. For manifolds with high aspect ratio, the distortion of the local parameterizations based on these eigenvectors can become extremely high. For example, as shown in Figure 1, the Laplacian eigenmaps embedding of a rectangle with an aspect ratio of 16 looks like a parabola. This issue is explained in detail in [7, 8, 9, 10]. To a large extent UMAP resolves this issue by first computing an embedding by applying standard spectral methods on a symmetric normalized Laplacian and then "sprinkling" white noise in it. It then refines the noisy embedding by minimizing a local distortion measure based on fuzzy set cross entropy. Although UMAP embeddings seem to be topologically correct, they occasionally tend to have twists and sharp turns which may be unwanted (see Figure 1). t-SNE takes a different approach of randomly initializing the global embedding, defining a local t-distribution in the embedding space and a local Gaussian distribution on the high dimensional points, and finally refining the embedding by minimizing the Kullback–Leibler divergence between the two sets of distributions. As shown in Figure 1, t-SNE tends to output a dissected embedding even when the manifold is connected.

Figure 1: Embeddings of a rectangle ($4 \times 0.25$) with high aspect ratio into $\mathbb{R}^2$. Columns: Input, LDLE, LDLE with $\partial\mathcal{M}$ known apriori, UMAP, t-SNE, Laplacian Eigenmaps.

Another missing feature of the above methods is the ability to embed manifolds without boundary into their intrinsic dimension. For example, a sphere in $\mathbb{R}^3$ is a 2-dimensional manifold which can be represented by a connected domain in $\mathbb{R}^2$ with boundary gluing instructions provided in the form of colors. We solve this issue in this paper (see Figure 2).

This paper takes motivation from the work of Jones, Maggioni, and Schul in [6], where they provide guarantees on the distortion of the coordinate charts of the manifold constructed using carefully chosen eigenfunctions of the Laplacian. However, this only applies to the charts for small neighborhoods on the manifold and does not provide a global embedding. In this paper, we present an approach to realize their work in the discrete setting and obtain low-dimensional low distortion local views of the given dataset. Moreover, we piece together these local views to obtain a global embedding of the manifold.

Figure 2: Embeddings of a sphere in $\mathbb{R}^3$ into $\mathbb{R}^2$. Columns: Input, LDLE, UMAP, t-SNE, Laplacian Eigenmaps. The top and bottom rows contain the same plots colored by the height and the azimuthal angle of the sphere ($0$–$2\pi$), respectively. LDLE automatically colors the boundary so that the points on the boundary which are adjacent on the sphere have the same color. The arrows are manually drawn to help the reader identify the two pieces of the boundary which are to be stitched together to recover the original sphere. UMAP and Laplacian eigenmaps squeezed the sphere into $\mathbb{R}^2$ from two different viewpoints (side or top view of the sphere). t-SNE also tore apart the sphere, but the embedding lacks interpretability as it is "unaware" of the boundary.

In this paper we present Low Distortion Local Eigenmaps (LDLE), a new manifold learning approach. The main contributions of our work are as follows:
1. We present an algorithmic realization of the construction procedure in [6] that applies to the discretized setting and yields low-dimensional low distortion views of small metric balls on the given discretized manifold (see Section 2 for a summary of their procedure).
2. We present an algorithm to obtain a global embedding of the manifold by registering its local views. The algorithm is designed so as to embed manifolds without boundary, as well as non-orientable manifolds, into their intrinsic dimension by tearing them apart. It also provides gluing instructions for the boundary of the embedding by coloring it such that the points on the boundary which are adjacent on the manifold have the same color (see Figure 2).

LDLE consists of three main steps. In the first step, we estimate the inner products of the gradients of the eigenfunctions using the local correlation between them, to construct low-dimensional low distortion parameterizations $\Phi_k$ of the small discrete balls $U_k$ on the discretized manifold. In the second step, we develop a clustering algorithm to obtain a small number of intermediate views $\widetilde{\Phi}_m(\widetilde{U}_m)$ with low distortion from the large number of smaller local views $\Phi_k(U_k)$. This makes the subsequent registration procedure faster and less prone to errors. Finally, in the third step, we register the intermediate views $\widetilde{\Phi}_m(\widetilde{U}_m)$ and obtain the final global embedding of $\mathcal{M}$. The results on a 2D rectangular strip and a 3D sphere are presented in Figures 1 and 2 to motivate our approach.

The paper organization is as follows. Section 2 provides relevant background and motivation.
In Section 3 we present the construction of low-dimensional low distortion local parameterizations. Section 4 presents our clustering algorithm to obtain intermediate views. Section 5 registers the intermediate views into a global embedding. In Section 6 we compare the embeddings produced by our algorithm with existing techniques on multiple datasets. Section 7 concludes our work and discusses directions for future work.
2. Background and Motivation
Manifold learning techniques such as Laplacian eigenmaps [3], diffusion maps [11] and UMAP [5] rely on a fixed set of eigenfunctions of the Laplacian operator to construct a low-dimensional embedding of the manifold. The following example illustrates that a mapping based on a fixed set of eigenfunctions may not provide low distortion on different $U_k$ and $U_{k'}$. This is why a postprocessing step such as in UMAP, which minimizes a local distortion measure on the embedding, becomes necessary when using the above techniques to obtain a low distortion embedding of the manifold.

Figure 3: (Left) Distortion of $\Phi^*_1$ on discs of radius $0.01$ centered at $(x, y)$ for all $x, y \in [0,1] \times [0,1]$. (Right) Mapping of the discs at various locations in the square using $\Phi^*_1$.

Consider a unit square $[0,1] \times [0,1]$ such that for every point $x_k$ in the square, $U_k$ is the disc of radius $0.01$ centered at $x_k$. Consider a mapping $\Phi^*_1$ based on the first two non-trivial eigenfunctions $\cos(\pi x)$ and $\cos(\pi y)$ of the Laplace–Beltrami operator on the square with Neumann boundary conditions, that is,
\[ \Phi^*_1(x, y) = (\cos(\pi x), \cos(\pi y)). \tag{4} \]
As shown in Figure 3, $\Phi^*_1$ maps the discs along the diagonals to other discs. The discs along the horizontal and vertical lines through the center are mapped to ellipses. The skewness of these ellipses increases as we move closer to the middle of the edges of the unit square. Thus, the distortion of $\Phi^*_1$ is low on the discs along the diagonals and high on the discs close to the middle of the edges of the square.

Now, consider a different mapping based on another set of eigenfunctions,
\[ \Phi^*_2(x, y) = (\cos(5\pi x), \cos(5\pi y)). \tag{5} \]
Compared to $\Phi^*_1$, $\Phi^*_2$ produces almost no distortion on discs of radius $0.01$ centered at points on the horizontal midline of the square (see Figure 4). Therefore, it makes sense to construct local mappings for different regions based on different sets of eigenfunctions to obtain their embeddings with low distortion.
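To make the example concrete, the sketch below numerically evaluates the distortion of $\Phi^*_1$ and $\Phi^*_2$ on small discs; the sampled centers and the disc radius are illustrative choices consistent with the discussion above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distortion(X, Y):
    r = pdist(Y) / pdist(X)          # ratios of mapped to original distances
    return r.max() * (1 / r).max()   # Eq. (1) on a finite sample

def disc(center, r=0.01, n=200, seed=0):
    """Sample n points from a disc of radius r around a 2D center."""
    rng = np.random.default_rng(seed)
    ang = rng.uniform(0, 2 * np.pi, n)
    rad = r * np.sqrt(rng.uniform(0, 1, n))
    return center + np.c_[rad * np.cos(ang), rad * np.sin(ang)]

phi1 = lambda P: np.c_[np.cos(np.pi * P[:, 0]), np.cos(np.pi * P[:, 1])]       # Eq. (4)
phi2 = lambda P: np.c_[np.cos(5 * np.pi * P[:, 0]), np.cos(5 * np.pi * P[:, 1])]  # Eq. (5)

for c in [np.array([0.3, 0.3]),    # on the diagonal: Phi*_1 is nearly isometric
          np.array([0.5, 0.99])]:  # near the middle of an edge: Phi*_1 is highly skewed
    X = disc(c)
    print(c, distortion(X, phi1(X)), distortion(X, phi2(X)))
```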
At this point we state a result from [6] which shows the existence of low distortion parameterizations of small metric balls on a Riemannian manifold of finite volume using the globally defined eigenfunctions of the Laplacian. The choice of the eigenfunctions depends on the underlying metric ball.

Figure 4: (Left) Distortion of $\Phi^*_2$ on discs of radius $0.01$ centered at $(x, y)$ for all $x, y \in [0,1] \times [0,1]$. $\Phi^*_2$ produces close to infinite distortion on the discs located in the white region. (Right) Mapping of the discs at the same locations in the square as in Figure 3, using $\Phi^*_2$.

Theorem 1 (see Theorem 2.2.1 in [6]). Let $(\mathcal{M}, g)$ be a $d$-dimensional Riemannian manifold. Let $\Delta_g$ be the Laplace–Beltrami operator on it with Dirichlet or Neumann boundary conditions. Assume that $|\mathcal{M}| = 1$, where $|\mathcal{M}|$ is the volume of $\mathcal{M}$, and that the uniform ellipticity conditions for $\Delta_g$ are satisfied. Let $x_k \in \mathcal{M}$ and let $r_k$ be less than the injectivity radius at $x_k$ (the maximum radius where the exponential map is a diffeomorphism). Then there exists a constant $\kappa > 1$, which depends on $d$ and the metric tensor $g$, such that the following hold. Let $\rho \leq r_k$ and $B_k \equiv B_{\kappa^{-1}\rho}(x_k)$, where
\[ B_\epsilon(x) = \{ y \in \mathcal{M} \mid d_g(x, y) < \epsilon \}. \tag{6} \]
Then there exist $i_1, i_2, \ldots, i_d$ such that, if we let
\[ \gamma_{ki} = \left( \frac{\int_{B_k} \phi_i(y)^2 \, dy}{|B_k|} \right)^{-1/2}, \tag{7} \]
then the map
\[ \Phi_k : B_k \to \mathbb{R}^d, \qquad x \mapsto (\gamma_{ki_1}\phi_{i_1}(x), \ldots, \gamma_{ki_d}\phi_{i_d}(x)) \tag{8} \]
is bilipschitz, such that for any $y_1, y_2 \in B_k$ it satisfies
\[ \frac{\kappa^{-1}}{\rho}\, d_g(y_1, y_2) \leq \|\Phi_k(y_1) - \Phi_k(y_2)\| \leq \frac{\kappa}{\rho}\, d_g(y_1, y_2), \tag{9} \]
where the associated eigenvalues satisfy
\[ \kappa^{-1}\rho^{-2} \leq \lambda_{i_1}, \ldots, \lambda_{i_d} \leq \kappa\rho^{-2}, \tag{10} \]
and the distortion is bounded from above by $\kappa^2$, i.e.
\[ \sup_{\substack{y_1, y_2 \in B_k \\ y_1 \neq y_2}} \frac{\|\Phi_k(y_1) - \Phi_k(y_2)\|}{d_g(y_1, y_2)} \; \sup_{\substack{y_1, y_2 \in B_k \\ y_1 \neq y_2}} \frac{d_g(y_1, y_2)}{\|\Phi_k(y_1) - \Phi_k(y_2)\|} \leq \frac{\kappa}{\rho} \cdot \frac{\rho}{\kappa^{-1}} = \kappa^2. \tag{11} \]

Construction Procedure. In [6] the authors describe a procedure to choose the eigenfunctions for $\Phi_k$, summarized below. The main aim is to identify $d$ eigenfunctions such that they are close to being locally orthogonal and their local scaling factors $\gamma_{ki_s}\|\nabla\phi_{i_s}(x_k)\|$ are close to each other.

Under the assumptions of the theorem, the authors showed that, for all $i$, $\gamma_{ki}\|\nabla\phi_i(x_k)\|$ is bounded from above, where the bound depends on $\lambda_i\rho^2$. Therefore, any choice of the eigenfunctions would result in a Lipschitz map. However, satisfying the Lipschitz condition alone is not sufficient to obtain low distortion. For example, a $\Phi_k$ which maps $B_k$ to a fixed point is trivially Lipschitz but has infinite distortion. It is necessary for $\Phi_k$ to satisfy the bilipschitz condition to at least ensure finite distortion. Since $\gamma_{ki}\|\nabla\phi_i(x_k)\|$ is already bounded from above, the bilipschitz condition for $\Phi_k$ can be satisfied by choosing the eigenfunctions for which $\gamma_{ki}\|\nabla\phi_i(x_k)\|$ is also bounded from below. Although this would result in $\Phi_k$ having finite distortion on $B_k$, one cannot guarantee it to be low. Suppose $\phi_{i_1}$ and $\phi_{i_2}$ satisfy the above condition; two scenarios are possible which can lead to high distortion:
1. If the vectors $\nabla\phi_{i_1}(x_k)$ and $\nabla\phi_{i_2}(x_k)$ are far from being orthogonal, then $\Phi_k$ would squeeze the first two dimensions of $B_k$ into a single dimension.
2. If the local scaling factors $\gamma_{ki_1}\|\nabla\phi_{i_1}(x_k)\|$ and $\gamma_{ki_2}\|\nabla\phi_{i_2}(x_k)\|$ are far from each other, then $\Phi_k$ would result in a highly skewed output along the two dimensions.
Thus, to obtain low distortion, the authors proposed the following procedure to select the eigenfunctions, so that $\Phi_k$ is bilipschitz and avoids the above scenarios as much as possible.

First, a subset $S_k$ of the eigenfunctions is chosen such that for all $i \in S_k$, $\lambda_i\rho^2$ is bounded within a specific interval. This interval eventually comes out to be $[\kappa^{-1}, \kappa]$ (see Eq. (10)). Then, a direction $p_1$ is selected at random, and subsequently $i_1 \in S_k$ is selected so that $\gamma_{ki_1}|\nabla\phi_{i_1}(x_k)^T p_1|$ is larger than $c\,\rho^{-1}$, where $c$ depends on the geometric properties of $\mathcal{M}$. Then, to find the $s$-th eigenfunction for $s \in \{2, \ldots, d\}$, a direction $p_s$ is chosen such that it is orthogonal to $\nabla\phi_{i_1}(x_k), \ldots, \nabla\phi_{i_{s-1}}(x_k)$. Subsequently, $i_s \in S_k$ is chosen so that $\gamma_{ki_s}|\nabla\phi_{i_s}(x_k)^T p_s|$ is again larger than $c\,\rho^{-1}$. The core of their work lies in proving that these $i_1, \ldots, i_d$ always exist under the assumptions of the theorem.

It is easy to discern that the above procedure encourages the chosen eigenfunctions to be locally orthogonal. The requirement that $\gamma_{ki_s}|\nabla\phi_{i_s}(x_k)^T p_s|$ be larger than $c\,\rho^{-1}$ results in the local scaling factors $\gamma_{ki_s}\|\nabla\phi_{i_s}(x_k)\|$ also being bounded from below by $c\,\rho^{-1}$. Since they are already bounded from above, the gap between the local scaling factors along any two dimensions is guaranteed to be bounded, the size of which depends on the geometric properties of $\mathcal{M}$. The authors ultimately proved that the above construction results in a bilipschitz map $\Phi_k$ on $B_k$ whose distortion is at most $\kappa^2$, a universal constant which also depends on the geometric properties of $\mathcal{M}$. The main challenge in practically realizing the above procedure lies in the estimation of $\nabla\phi_{i_s}(x_k)^T p_s$. The next section overcomes this challenge.
3. Low-dimensional Low Distortion Local Parameterization
In the procedure to construct $\Phi_k$ described above, the selection of the first eigenfunction $\phi_{i_1}$ relies on the derivative of the eigenfunctions at $x_k$ along an arbitrary direction $p_1$, that is, on $\nabla\phi_i(x_k)^T p_1$. In our algorithmic realization of the construction procedure, we take $p_1$ to be the gradient of an eigenfunction at $x_k$ itself (say $\nabla\phi_j(x_k)$). We relax the unit norm constraint on $p_1$; note that this will neither affect the math nor the output of our algorithm. Then the selection of $\phi_{i_1}$ depends on $\nabla\phi_i(x_k)^T \nabla\phi_j(x_k)$. We obtain a numerical estimate of this inner product by making use of the local correlation between the eigenfunctions, which was introduced in [12, 13]. These estimates are used to select the subsequent eigenfunctions too.

We first review the local correlation between the eigenfunctions of the Laplacian. In Theorem 2 we show that the limiting value of the scaled local correlation between two eigenfunctions equals the inner product of their gradients. Using this result, we estimate the inner product of the gradients, which will be used in Section 3.3 to obtain low distortion local parameterizations of the underlying manifold.

3.1. Inner Product of Eigenfunction Gradients using Local Correlation

Let $(\mathcal{M}, g)$ be a $d$-dimensional Riemannian manifold with or without boundary, rescaled so that $|\mathcal{M}| \leq 1$, and denote the volume element at $y$ by $\omega_g(y)$. Let $\phi_i$ and $\phi_j$ be eigenfunctions of the Laplacian operator $\Delta_g$ (see the statement of Theorem 1) with eigenvalues $\lambda_i$ and $\lambda_j$. Let $x_k \in \mathcal{M}$ and define
\[ \Psi_{kij}(y) = (\phi_i(y) - \phi_i(x_k))(\phi_j(y) - \phi_j(x_k)). \tag{12} \]
Then the local correlation between the two eigenfunctions $\phi_i$ and $\phi_j$ at the point $x_k$ at scale $\sqrt{t_k}$ (defined in [12, 13]) is given by
\[ A_{kij} = \int_{\mathcal{M}} p(t_k, x_k, y)\,\Psi_{kij}(y)\,\omega_g(y), \tag{13} \]
where $p(t, x, y)$ is the fundamental solution of the heat equation on $(\mathcal{M}, g)$. As noted in [12, 13], for $(t_k, x_k) \in \mathbb{R}_{\geq 0} \times \mathcal{M}$ fixed, we have
\[ p(t_k, x_k, y) \sim \begin{cases} t_k^{-d/2}, & d_g(x_k, y) \leq \sqrt{t_k}, \\ 0, & \text{otherwise}, \end{cases} \qquad \int_{\mathcal{M}} p(t_k, x_k, y)\,\omega_g(y) = 1. \tag{14} \]
Therefore, $p(t_k, x_k, \cdot)$ acts as a local probability measure centered at $x_k$ with scale $\sqrt{t_k}$ (see Eq. (A.3) in the appendix for a precise form of $p$). We define the scaled local correlation to be the ratio of the local correlation $A_{kij}$ and a factor of $2t_k$.

Theorem 2.
Denote the limiting value of the scaled local correlation by $\widetilde{A}_{kij}$,
\[ \widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{A_{kij}}{2t_k}. \tag{15} \]
Then $\widetilde{A}_{kij}$ equals the inner product of the gradients of the eigenfunctions $\phi_i$ and $\phi_j$ at $x_k$, that is,
\[ \widetilde{A}_{kij} = \nabla\phi_i(x_k)^T \nabla\phi_j(x_k). \tag{16} \]

A detailed proof is provided in Appendix A. In summary, we choose a sufficiently small $\epsilon_k$ and show that
\[ \lim_{t_k \to 0} A_{kij} = \lim_{t_k \to 0} \int_{B_{\epsilon_k}(x_k)} G(t_k, x_k, y)\,\Psi_{kij}(y)\,\omega_g(y), \tag{17} \]
where $B_\epsilon(x)$ is defined in Eq. (6) and
\[ G(t, x, y) = \frac{e^{-d_g(x,y)^2/4t}}{(4\pi t)^{d/2}}. \tag{18} \]
Then, by using the properties of the exponential map at $x_k$ and applying basic techniques in calculus, we show that $\lim_{t_k \to 0} A_{kij}/2t_k$ evaluates to $\nabla\phi_i(x_k)^T\nabla\phi_j(x_k)$.

3.2. $\widetilde{A}_{kij}$ in the Discrete Setting

To apply Theorems 1 and 2 in practice on data, we need an estimate of $\widetilde{A}_{kij}$ in the discrete setting. The estimation procedure is presented below.

Let $(x_k)_{k=1}^n$ be uniformly distributed points on $(\mathcal{M}, g)$. Let $d_e(x_k, x_{k'})$ be the distance between $x_k$ and $x_{k'}$. The accuracy with which $\widetilde{A}_{kij}$ can be estimated mainly depends on how well $d_e(\cdot, \cdot)$ approximates the local geodesic distances. For simplicity, we take $d_e(x_k, x_{k'})$ to be the Euclidean distance $\|x_k - x_{k'}\|$. A more accurate estimate of the local geodesic distances can be computed using the method described in [14].

We construct a sparse unnormalized graph Laplacian $L$ using Algorithm 1, where the weight matrix $K$ of the graph edges is defined using the Gaussian kernel. The bandwidth of the Gaussian kernel is set using the local scale of the neighborhoods around each point, as in self-tuning spectral clustering [15]. Let $\phi_i$ be the $i$-th non-trivial eigenvector of $L$ and denote $\phi_i(x_j)$ by $\phi_{ij}$.

Algorithm 1: Sparse Unnormalized Graph Laplacian, based on [15]
Input: $(d_e(x_k, x_{k'}))_{k,k'=1}^n$, $k_{nn}$, $k_{tune}$ where $k_{tune} \leq k_{nn}$
Output: $L$
  $N_k \leftarrow$ set of indices of the $k_{nn}$ nearest neighbours of $x_k$ based on $d_e(x_k, \cdot)$;
  $\epsilon_k \leftarrow d_e(x_k, x_{k^*})$ where $x_{k^*}$ is the $k_{tune}$-th nearest neighbor of $x_k$;
  $K_{kk} \leftarrow 1$, $K_{kk'} \leftarrow e^{-d_e(x_k, x_{k'})^2/\epsilon_k\epsilon_{k'}}$ for $k' \in N_k$;
  $D_{kk} \leftarrow \sum_{k'} K_{kk'}$, $D_{kk'} \leftarrow 0$ for $k \neq k'$;
  $L \leftarrow D - K$;
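A minimal Python sketch of Algorithm 1 using standard numpy/scipy/scikit-learn routines; symmetrizing the kNN kernel is our assumption here, since the pseudocode does not spell out the treatment of asymmetric neighborhoods.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from sklearn.neighbors import NearestNeighbors

def sparse_laplacian(X, knn=49, ktune=7):
    """Self-tuned sparse unnormalized graph Laplacian (Algorithm 1 sketch)."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=knn + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)   # column 0 is each point itself
    eps = dist[:, ktune]             # local scale: distance to ktune-th neighbor
    rows = np.repeat(np.arange(n), knn)
    cols = idx[:, 1:].ravel()
    vals = np.exp(-dist[:, 1:].ravel() ** 2 / (eps[rows] * eps[cols]))
    K = csr_matrix((vals, (rows, cols)), shape=(n, n))
    K = K.maximum(K.T)               # symmetrize the kNN graph (our choice)
    D = diags(np.asarray(K.sum(axis=1)).ravel())
    return D - K
```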
We estimate $\widetilde{A}_{kij}$ by evaluating the scaled local correlation $A_{kij}/2t_k$ at a small value of $t_k$. The limiting value of $A_{kij}$ is estimated by substituting a small $t_k$ in the finite sum approximation of the integral in Eq. (17). The sum is taken over a discrete ball of small radius $\epsilon_k$ around $x_k$ and is divided by $2t_k$ to obtain an estimate of $\widetilde{A}_{kij}$.

We start by choosing $\epsilon_k$ to be the distance to the $k$-th nearest neighbor of $x_k$, where $k$ is a hyperparameter with a small integral value. Thus,
\[ \epsilon_k = \text{distance to the } k\text{-th nearest neighbor of } x_k. \tag{19} \]
Then the value of $t_k$ is given by
\[ \sqrt{\text{chi2inv}(p, d)}\,\sqrt{2t_k} = \epsilon_k \implies t_k = \frac{1}{2}\,\frac{\epsilon_k^2}{\text{chi2inv}(p, d)}, \tag{20} \]
where chi2inv is the inverse CDF of the chi-squared distribution with $d$ degrees of freedom, evaluated at $p$. We take $p$ to be $0.99$ in our experiments. The rationale behind the above choice of $t_k$ is described in Appendix B.

Now define the discrete ball around $x_k$ as
\[ U_k = \{ x_{k'} \mid d_e(x_k, x_{k'}) \leq \epsilon_k \}. \tag{21} \]
We call $U_k$ the $k$-th local view of the data in the high dimensional ambient space. For convenience, denote the estimate of $G(t_k, x_k, x_{k'})$ by $G_{kk'}$, where $G$ is as in Eq. (18). Then
\[ G_{kk'} = \begin{cases} \dfrac{\exp(-d_e(x_k, x_{k'})^2/4t_k)}{\sum_{x \in U_k} \exp(-d_e(x_k, x)^2/4t_k)}, & x_{k'} \in U_k, \\[2mm] 0, & x_{k'} \notin U_k. \end{cases} \tag{22} \]
Finally, the estimate of $\widetilde{A}_{kij}$ is given by
\[ \widetilde{A}_{kij} = \frac{1}{2t_k} \sum_{x_{k'} \in U_k} G_{kk'}\,\Psi_{kij}(x_{k'}), \tag{23} \]
where (see Eq. (12))
\[ \Psi_{kij}(x_{k'}) = (\phi_{ik'} - \phi_{ik})(\phi_{jk'} - \phi_{jk}). \tag{24} \]
The summation in Eq. (23) corresponds to the finite sum approximation of the integral in Eq. (17), evaluated at the $t_k$ computed in Eq. (20).
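The estimate in Eqs. (19)–(24) amounts to a heat-kernel-weighted covariance of eigenvector increments over each discrete ball. A sketch follows, assuming the eigenvector matrix `phi` has already been computed from $L$ (e.g., with `scipy.sparse.linalg.eigsh`); `k0` plays the role of the hyperparameter $k$ in Eq. (19).

```python
import numpy as np
from scipy.stats import chi2

def local_grad_inner_products(X, phi, d=2, k0=25, p=0.99):
    """Estimate A~_k = [grad phi_i . grad phi_j](x_k) at every point (Eqs. 19-24).

    X   : (n, D) data points.
    phi : (n, N) the first N non-trivial Laplacian eigenvectors.
    Returns a list of (N, N) matrices, one per point.
    """
    n, N = phi.shape
    A = []
    for k in range(n):
        dist = np.linalg.norm(X - X[k], axis=1)
        eps_k = np.sort(dist)[k0]                     # Eq. (19)
        t_k = 0.5 * eps_k**2 / chi2.ppf(p, df=d)      # Eq. (20)
        U = np.flatnonzero(dist <= eps_k)             # Eq. (21): the discrete ball
        w = np.exp(-dist[U]**2 / (4 * t_k))
        w /= w.sum()                                  # Eq. (22): normalized heat weights
        dphi = phi[U] - phi[k]                        # eigenvector increments, Eq. (24)
        A.append((dphi * w[:, None]).T @ dphi / (2 * t_k))  # Eq. (23)
    return A
```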
Example. This example will follow us throughout the paper. Consider a square grid $[0,1] \times [0,1]$ with a spacing of $0.01$ in both the $x$ and $y$ directions. With $k_{nn} = 49$, $k_{tune} = 7$ and $d_e(x_i, x_j) = \|x_i - x_j\|$ as input to Algorithm 1, we construct the graph Laplacian $L$. Using $k = 25$, $d = 2$ and $p = 0.99$, we obtain the discrete balls $U_k$ and the scales $t_k$. Finally, using Eqs. (22, 23) with the first two non-trivial eigenvectors $\phi_1$ and $\phi_2$ of $L$, we obtain the estimate of $\widetilde{A}_{k12}$ as shown in Figure 5. Note that the eigenvectors are a slightly rotated version of the analytic eigenfunctions $\cos(\pi x)$ and $\cos(\pi y)$.

Figure 5: (Top) The first two non-trivial eigenvectors $\phi_1$ and $\phi_2$ of the unnormalized graph Laplacian on a square grid. (Bottom) The estimated value of $\widetilde{A}_{k12}$ using the above procedure, alongside the numerically estimated value of $\nabla\phi_1(x_k)^T\nabla\phi_2(x_k)$.

3.3. Low Distortion Local Parameterizations

Based on the construction procedure following Theorem 1, we need a set $S_k$ of eigenvectors to construct a low distortion parameterization $\Phi_k$ of $U_k$. There is no easy way to retrieve the set $S_k$ in the discrete setting as in the procedure. Therefore, we make the natural choice of using the first $N$ non-trivial eigenvectors $(\phi_i)_{i=1}^N$, corresponding to the $N$ smallest eigenvalues $(\lambda_i)_{i=1}^N$ of the unnormalized graph Laplacian $L$, as the set $S_k$; that is, we set $S_k$ to be $\{1, \ldots, N\}$. Using the estimates of $\widetilde{A}_{kij}$, we now present an algorithmic construction of $\Phi_k$. The pseudocode is provided below, followed by a full explanation of the steps.

Algorithm 2: BiLipschitz-Local-Parameterization
Input: $L$, $N$, $k$, $d$, $p$, $(\tau_s, \delta_s)_{s=1}^d$
Output: $(\Phi_k, U_k, \zeta_{kk})_{k=1}^n$
Compute $(\phi_i)_{i=1}^N$, $\lambda_1 \leq \ldots \leq \lambda_N$ by eigendecomposition of $L$;
for $k \leftarrow 1$ to $n$ do
  Compute $U_k$, $(\widetilde{A}_{kij})_{i,j=1}^N$ (Eqs. (21, 23));
  Compute $(\gamma_{ki})_{i=1}^N$ (Eq. (25));
  $\theta_1 \leftarrow \tau_1$-percentile of $(\widetilde{A}_{kii})_{i=1}^N$;
  Compute $\widetilde{S}_k$ (Eq. (26));
  Compute $i_1$ (Eq. (30));
  for $s \leftarrow 2$ to $d$ do
    Compute $H^s_{kij}$ (Eq. (32));
    $\theta_s \leftarrow \tau_s$-percentile of $(H^s_{kii})_{i \in \widetilde{S}_k}$;
    Compute $i_s$ (Eq. (37));
  end
  $\Phi_k \leftarrow (\gamma_{ki_1}\phi_{i_1}, \ldots, \gamma_{ki_d}\phi_{i_d})$ (Eq. (38));
  Compute $\zeta_{kk}$ (Eq. (39));
end

The estimated value of $\gamma_{ki}$ is based on the discrete analog of Eq. (7) and is given by
\[ \gamma_{ki} = \text{Root-Mean-Square}(\{\phi_{ij} \mid x_j \in U_k\})^{-1}. \tag{25} \]
We use $\nabla\phi_i \equiv \nabla\phi_i(x_k)$ for brevity. For the stability of our algorithm, the set $S_k$ is restricted to the indices of the eigenfunctions with sufficiently large gradient, denoted by
\[ \widetilde{S}_k = S_k \cap \{ i : \|\nabla\phi_i\|^2 \geq \theta_1 \} = S_k \cap \{ i : \widetilde{A}_{kii} \geq \theta_1 \}, \tag{26} \]
where $\theta_1$ is the $\tau_1$-percentile of the set $\{\widetilde{A}_{kii} : i \in S_k\}$ and the second equality follows from Eq. (16). Here $\tau_1$ is a hyperparameter.

Now we derive a procedure to compute $\nabla\phi_i^T p_s$ for $s \in \{1, \ldots, d\}$. The unit norm constraint on $p_s$ is relaxed; this will neither affect the math nor the output of our algorithm. We start by computing a formula for $\nabla\phi_i^T p_1$. We do not know if one exists for an arbitrary $p_1$. However, if $p_1$ is restricted to be the gradient of an eigenvector $\phi_{r_1}$, i.e. $p_1 = \nabla\phi_{r_1}$, then using Eq. (16),
\[ \nabla\phi_i^T p_1 = \nabla\phi_i^T \nabla\phi_{r_1} = \widetilde{A}_{kir_1}. \tag{27} \]
The choice of $r_1$ will determine $\phi_{i_1}$. To obtain a low frequency eigenvector, $r_1$ is chosen so that the eigenvalue $\lambda_{r_1}$ is minimal, therefore
\[ r_1 = \operatorname*{argmin}_{j \in \widetilde{S}_k} \lambda_j. \tag{28} \]
Then we need to find the eigenvector $\phi_{i_1}$ so that $\gamma_{ki_1}|\nabla\phi_{i_1}^T p_1|$ is larger than a certain threshold. We do not know what the value of this threshold would be in the discrete setting. Therefore, to obtain a low frequency eigenvector $\phi_{i_1}$ such that $\gamma_{ki_1}|\nabla\phi_{i_1}^T p_1|$ is sufficiently large, we first define the maximum possible value of $\gamma_{ki}|\nabla\phi_i^T p_1|$ using Eq. (27) as
\[ \alpha_1 = \max_{i \in \widetilde{S}_k} \gamma_{ki}|\nabla\phi_i^T p_1| = \max_{i \in \widetilde{S}_k} \gamma_{ki}|\widetilde{A}_{kir_1}|. \tag{29} \]
Then we choose $i_1$ as
\[ i_1 = \operatorname*{argmin}_{i \in \widetilde{S}_k} \{ \lambda_i : \gamma_{ki}|\nabla\phi_i^T p_1| \geq \delta_1\alpha_1 \} = \operatorname*{argmin}_{i \in \widetilde{S}_k} \{ \lambda_i : \gamma_{ki}|\widetilde{A}_{kir_1}| \geq \delta_1\alpha_1 \}, \tag{30} \]
where $\delta_1 \in (0, 1]$ is a hyperparameter.

After obtaining $\phi_{i_1}$, we obtain the $s$-th eigenvector $\phi_{i_s}$ for $s \in \{2, \ldots, d\}$, in order. First, we need a vector $p_s$ orthogonal to $\nabla\phi_{i_1}, \ldots, \nabla\phi_{i_{s-1}}$. For convenience, denote by $V_s$ the matrix with $\nabla\phi_{i_1}, \ldots, \nabla\phi_{i_{s-1}}$ as columns and let $R(V_s)$ be the range of $V_s$. Let $\phi_{r_s}$ be an eigenvector such that $\nabla\phi_{r_s} \notin R(V_s)$. To find such an $r_s$, we define
\[ H^s_{kij} = \nabla\phi_i^T \left(I - V_s(V_s^T V_s)^{-1} V_s^T\right) \nabla\phi_j \tag{31} \]
\[ = \widetilde{A}_{kij} - \begin{bmatrix} \widetilde{A}_{kii_1} & \cdots & \widetilde{A}_{kii_{s-1}} \end{bmatrix} \begin{bmatrix} \widetilde{A}_{ki_1i_1} & \cdots & \widetilde{A}_{ki_1i_{s-1}} \\ \vdots & \ddots & \vdots \\ \widetilde{A}_{ki_{s-1}i_1} & \cdots & \widetilde{A}_{ki_{s-1}i_{s-1}} \end{bmatrix}^{-1} \begin{bmatrix} \widetilde{A}_{ki_1j} \\ \vdots \\ \widetilde{A}_{ki_{s-1}j} \end{bmatrix}. \tag{32} \]
Note that $H^s_{kii}$ is the squared norm of the projection of $\nabla\phi_i$ onto the vector space orthogonal to $R(V_s)$. Clearly $\nabla\phi_i \notin R(V_s)$ if and only if $H^s_{kii} > 0$. To obtain a low frequency eigenvector $\phi_{r_s}$ such that $H^s_{kr_sr_s} > 0$, we choose
\[ r_s = \operatorname*{argmin}_{i \in \widetilde{S}_k} \{ \lambda_i : H^s_{kii} \geq \theta_s \}, \tag{33} \]
where $\theta_s$ is the $\tau_s$-percentile of the set $\{H^s_{kii} : i \in \widetilde{S}_k\}$ and $\tau_s$ is a hyperparameter. Then we take $p_s$ to be the component of $\nabla\phi_{r_s}$ which is orthogonal to $R(V_s)$,
\[ p_s = \left(I - V_s(V_s^T V_s)^{-1} V_s^T\right) \nabla\phi_{r_s}. \tag{34} \]
Note that
\[ \nabla\phi_i^T p_s = H^s_{kir_s}. \tag{35} \]
To obtain a low frequency eigenvector $\phi_{i_s}$ such that $\gamma_{ki_s}|\nabla\phi_{i_s}^T p_s|$ is sufficiently large, we first define the maximum possible value of $\gamma_{ki}|\nabla\phi_i^T p_s|$ using Eq. (35) as
\[ \alpha_s = \max_{i \in \widetilde{S}_k} \gamma_{ki}|\nabla\phi_i^T p_s| = \max_{i \in \widetilde{S}_k} \gamma_{ki}|H^s_{kir_s}|, \tag{36} \]
and then choose $i_s$ such that
\[ i_s = \operatorname*{argmin}_{i \in \widetilde{S}_k} \{ \lambda_i : \gamma_{ki}|\nabla\phi_i^T p_s| \geq \delta_s\alpha_s \} = \operatorname*{argmin}_{i \in \widetilde{S}_k} \{ \lambda_i : \gamma_{ki}|H^s_{kir_s}| \geq \delta_s\alpha_s \}, \tag{37} \]
where $\delta_s \in [0, 1]$ is a hyperparameter.

Finally, the $d$-dimensional parameterization $\Phi_k$ of $U_k$ is given by
\[ \Phi_k \equiv (\gamma_{ki_1}\phi_{i_1}, \ldots, \gamma_{ki_d}\phi_{i_d}), \quad \text{where} \quad \Phi_k(x_{k'}) = (\gamma_{ki_1}\phi_{i_1k'}, \ldots, \gamma_{ki_d}\phi_{i_dk'}) \quad \text{and} \quad \Phi_k(U_k) = (\Phi_k(x_{k'}))_{x_{k'} \in U_k}. \tag{38} \]
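The inner loop of Algorithm 2 for a single point can be condensed as below, given the estimated matrix $\widetilde{A}_k$ (denoted `A`), the scaling factors $\gamma_k$ (`gamma`), and the eigenvalues; this is a sketch of Eqs. (26)–(37) with a single $\tau$ and $\delta$ for all $s$, as in our experiments.

```python
import numpy as np

def select_eigvecs(A, gamma, lam, d=2, tau=50, delta=0.75):
    """Greedy choice of d eigenvector indices at one point (Algorithm 2 sketch).

    A     : (N, N) estimated gradient inner products A~_k at x_k.
    gamma : (N,) local scaling factors, Eq. (25).
    lam   : (N,) eigenvalues in increasing order (so index order = frequency order).
    """
    diag = np.diag(A)
    S = np.flatnonzero(diag >= np.percentile(diag, tau))   # Eq. (26)
    chosen = []
    for s in range(d):
        if s == 0:
            H = A                                   # Eq. (27): H reduces to A~
            r = S[0]                                # Eq. (28): smallest eigenvalue in S~
        else:
            V = A[np.ix_(chosen, chosen)]           # Gram matrix of chosen gradients
            H = A - A[:, chosen] @ np.linalg.solve(V, A[chosen, :])  # Eq. (32)
            hS = np.diag(H)[S]
            r = S[hS >= np.percentile(hS, tau)][0]  # Eq. (33)
        score = gamma[S] * np.abs(H[S, r])          # Eqs. (27)/(35)
        cand = S[score >= delta * score.max()]      # Eqs. (29), (36)
        chosen.append(int(cand[0]))                 # Eqs. (30), (37): lowest frequency candidate
    return chosen
```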
We call $\Phi_k(U_k)$ the $k$-th local view of the data in the $d$-dimensional embedding space. It is a matrix with $|U_k|$ rows and $d$ columns. Now, denote the distortion of $\Phi_{k'}$ on $U_k$ by $\zeta_{kk'}$. Using Eq. (1), we obtain
\[ \zeta_{kk'} = \text{Distortion}(\Phi_{k'}, U_k) = \sup_{\substack{x_l, x_{l'} \in U_k \\ x_l \neq x_{l'}}} \frac{\|\Phi_{k'}(x_l) - \Phi_{k'}(x_{l'})\|}{d_e(x_l, x_{l'})} \; \sup_{\substack{x_l, x_{l'} \in U_k \\ x_l \neq x_{l'}}} \frac{d_e(x_l, x_{l'})}{\|\Phi_{k'}(x_l) - \Phi_{k'}(x_{l'})\|}. \tag{39} \]
The obtained local parameterizations are post-processed so as to remove anomalous parameterizations having unusually high distortion. We replace the local parameterization $\Phi_k$ of $U_k$ by that of a neighbor, $\Phi_{k'}$ where $x_{k'} \in U_k$, if the distortion $\zeta_{kk'}$ produced by $\Phi_{k'}$ on $U_k$ is smaller than $\zeta_{kk}$, the distortion produced by $\Phi_k$ on $U_k$. If $\zeta_{kk'} < \zeta_{kk}$ for multiple $k'$, then we choose the parameterization which produces the least distortion on $U_k$. This procedure is repeated until no replacement is possible. The pseudocode is provided below.
Algorithm 3: Postprocess-Local-Parameterization
Input: $(d_e(x_k, x_{k'}))_{k,k'=1}^n$, $(U_k, \Phi_k, \zeta_{kk})_{k=1}^n$
Output: $(\Phi_k, \zeta_{kk})_{k=1}^n$
$N_{\text{replaced}} \leftarrow 1$;
while $N_{\text{replaced}} > 0$ do
  $N_{\text{replaced}} \leftarrow 0$;
  $\Phi^{\text{old}}_k \leftarrow \Phi_k$ for all $k \in \{1, \ldots, n\}$;
  for $k \leftarrow 1$ to $n$ do
    Compute $(\zeta_{kk'})_{x_{k'} \in U_k}$ (Eq. (39));
    $k^* \leftarrow \operatorname*{argmin}_{x_{k'} \in U_k} \zeta_{kk'}$;
    if $k^* \neq k$ then
      $\Phi_k \leftarrow \Phi^{\text{old}}_{k^*}$; $\zeta_{kk} \leftarrow \zeta_{kk^*}$; $N_{\text{replaced}} \leftarrow N_{\text{replaced}} + 1$;
    end
  end
end
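A sketch of Algorithm 3 in Python; `views` holds the index sets $U_k$, `Phis` the per-point parameterization functions, and `distortion` is the helper sketched in Section 1 — all names are illustrative.

```python
import numpy as np

def postprocess(X, views, Phis, distortion):
    """Replace each local parameterization by a neighbor's if that lowers its
    distortion on U_k; iterate until stable (Algorithm 3 sketch)."""
    n = len(views)
    zeta = np.array([distortion(X[views[k]], Phis[k](X[views[k]])) for k in range(n)])
    replaced = True
    while replaced:
        replaced = False
        old = list(Phis)                    # snapshot, as in the pseudocode
        for k in range(n):
            # distortion of each neighbor's map on U_k, Eq. (39)
            cand = [(distortion(X[views[k]], old[kp](X[views[k]])), kp)
                    for kp in views[k]]
            z_best, k_best = min(cand)
            if k_best != k and z_best < zeta[k]:
                Phis[k], zeta[k] = old[k_best], z_best
                replaced = True
    return Phis, zeta
```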
Example. We now build upon the example of the square grid at the end of Section 3.1. The values of the additional inputs are $N = 100$, $\tau_i = 50$ and $\delta_i = 0.75$ for all $i \in \{1, \ldots, d\}$. Using Algorithms 2 and 3, we obtain about $10^4$ local views $U_k$ and $\Phi_k(U_k)$, where $|U_k| = 25$ for all $k$. In the left image of Figure 6, we colored each point $x_k$ with the distortion $\zeta_{kk}$ of the local parameterization $\Phi_k$ on $U_k$. The mapped discrete balls $\Phi_k(U_k)$ for some values of $k$ are also shown in Figure F.22 in the Appendix.

Remark 1.
Note that the parameterizations of the discrete balls close to the boundary have higher distortion. This is because the injectivity radius at points close to the boundary is low, and precisely zero at points on the boundary. As a result, the size of the balls around these points exceeds the limit beyond which Theorem 1 is applicable.

At this point we note the following remark in [6].
Remark 2.
As was noted by L. Guibas, when $\mathcal{M}$ has a boundary, in the case of Neumann boundary values, one may consider the "doubled" manifold, and may apply the result in Theorem 1 for a possibly larger $r_k$.

Due to the above remark, assuming that the points on the boundary are known, we computed the distance matrix for the doubled manifold using the method described in [16]. Then we recomputed the local parameterizations $\Phi_k$, keeping all other hyperparameters the same as before. In the right image of Figure 6, we colored each point $x_k$ with the distortion of the updated parameterization $\Phi_k$ on $U_k$. Note the reduction in the distortion on the discs close to the boundary. The corners still have high distortion.

Figure 6: Distortion of the obtained local parameterizations when the points on the boundary are not known (left) versus when they are known apriori (right).
4. Clustering for Intermediate Views
Recall that the discrete balls $U_k$ are the local views of the data in the high dimensional ambient space. In the previous section, we obtained the mappings $\Phi_k$ to construct the local views $\Phi_k(U_k)$ of the data in the $d$-dimensional embedding space. In theory, one can use Generalized Procrustes Analysis (GPA) [17, 18, 19] to register these local views to recover the embedding. In practice, too many small local views (high $n$ and small $|U_k|$) result in extremely high computational complexity. Moreover, small overlaps between the local views make their registration susceptible to errors. Therefore, we perform clustering to obtain $M \ll n$ intermediate views, $\widetilde{U}_m$ and $\widetilde{\Phi}_m(\widetilde{U}_m)$, of the data in the ambient space and the embedding space, respectively. This reduces the time complexity and increases the overlaps between the views, leading to their quick and robust registration.

Our clustering algorithm transforms the notion of a local view per individual point to an intermediate view per cluster of points. It is designed so as to ensure low distortion of the parameterizations $\widetilde{\Phi}_m$ on $\widetilde{U}_m$. We first describe the notation used and then present the pseudocode, followed by a full explanation of the steps. Let $c_k$ be the index of the cluster $x_k$ belongs to. Then the set of points which belong to cluster $m$ is given by
\[ C_m = \{ x_k \mid c_k = m \}. \tag{40} \]
Denote by $c_{U_k}$ the set of indices of the neighboring clusters of $x_k$. These are the clusters to which the neighbors of $x_k$ belong. Therefore,
\[ c_{U_k} = \{ c_{k'} \mid x_{k'} \in U_k \}. \tag{41} \]
Let $\widetilde{U}_m$ be the set of neighboring points of cluster $m$. These constitute the neighbors of all the points belonging to cluster $m$, that is,
\[ \widetilde{U}_m = \bigcup_{c_k = m} U_k. \tag{42} \]
We call $\widetilde{U}_m$ the $m$-th intermediate view of the data in the ambient space. Note that
\[ C_m \subset \widetilde{U}_m \quad \text{and} \quad U_k \subseteq \widetilde{U}_{c_k}. \tag{43} \]
Let $\widetilde{\Phi}_m$ be the $d$-dimensional parameterization attached to the $m$-th cluster. This parameterization is used to map $\widetilde{U}_m$ and obtain $\widetilde{\Phi}_m(\widetilde{U}_m)$, the $m$-th intermediate view of the data in the embedding space.
Algorithm 4: Clustering
Input: $(U_k, \Phi_k)_{k=1}^n$, $\eta_{\min}$
Output: $(C_m, \widetilde{U}_m, \widetilde{\Phi}_m)_{m=1}^M$, $(c_k)_{k=1}^n$
Initialize $c_k \leftarrow k$, $C_m \leftarrow \{x_m\}$, $\widetilde{\Phi}_m \leftarrow \Phi_m$ for all $k, m \in \{1, \ldots, n\}$;
for $\eta \leftarrow 2$ to $\eta_{\min}$ do
  Compute $\text{cost}_k$ and $d_k$ for all $k \in \{1, \ldots, n\}$ (Eqs. (45, 46));
  $k \leftarrow \operatorname{argmin}_j \text{cost}_j$; $\text{cost}^* \leftarrow \text{cost}_k$;
  while $\text{cost}^* < \infty$ do
    $s \leftarrow c_k$; $C_s \leftarrow C_s - x_k$; $c_k \leftarrow d_k$; $C_{d_k} \leftarrow C_{d_k} \cup x_k$;
    $S \leftarrow \{ j \mid c_j = d_k \text{ or } d_j = d_k \text{ or } d_j = s \text{ or } s \in c_{U_j} \}$;
    Recompute $\text{cost}_k$ and $d_k$ for all $k \in S$ (Eqs. (45, 46));
    $k \leftarrow \operatorname{argmin}_j \text{cost}_j$; $\text{cost}^* \leftarrow \text{cost}_k$;
  end
end
$M \leftarrow$ the number of non-empty clusters;
Remove $C_m$, $\widetilde{\Phi}_m$ when $|C_m| = 0$, relabel clusters from $1$ to $M$, and update $c_k$;
Compute $(\widetilde{U}_m)_{m=1}^M$ (Eq. (42));

We initialize the clustering algorithm with $n$ singleton clusters, where the point $x_k$ belongs to the $k$-th cluster and the parameterization attached to the $k$-th cluster is $\Phi_k$. Thus, $c_k = k$, $C_m = \{x_m\}$ and $\widetilde{\Phi}_m = \Phi_m$ for all $k, m \in \{1, \ldots, n\}$. We then move the points around the clusters, updating $c_k$ and $C_m$, so that at the end of our procedure the size of the resulting non-empty clusters is at least $\eta_{\min}$, a hyperparameter. We iterate over $\eta$, which varies from $2$ to $\eta_{\min}$, such that at the end of each iteration the size of the non-empty clusters is at least $\eta$. The size of the non-empty clusters at the start of an iteration is at least $\eta - 1$.
In a single iteration, we say that the $m$-th cluster is small if it is non-empty and has a size less than $\eta$, that is, when $|C_m| \in (0, \eta)$. During an iteration, clusters undergo expansion or contraction as the points in small clusters are moved to neighboring clusters, until no small cluster remains. After the last iteration, we prune away all empty clusters.

In a given iteration, we start by selecting a neighboring cluster $d_k$ for each point to move into and calculating the $\text{cost}_k$ associated with moving it. The cost function is designed so that points in large clusters are not moved. If $x_k$ is in a small cluster, then the cost of moving it to a neighboring cluster is equal to the distortion produced by the parameterization attached to that cluster on the combined neighbors of $x_k$ and of the cluster. In other words, the cost function favors the movement of $x_k$ to the neighboring cluster whose parameterization results in the least distortion on the union of the neighbors of $x_k$ and the neighbors of the cluster itself. In order to avoid the situation where $x_k$ continuously oscillates between two clusters, we move $x_k$ only to a cluster which is at least as big as the cluster $x_k$ belongs to.

To be more precise, as shown in Figure 7, let $x_k$ be a point in a small cluster. Since $x_k$ belongs to cluster $c_k$, we have $|C_{c_k}| \in (0, \eta)$. Let cluster $m$ be a neighboring cluster of $x_k$, that is $m \in c_{U_k}$, such that cluster $m$ is at least as big as cluster $c_k$, so that $|C_m| \geq |C_{c_k}|$. Then the cost of moving $x_k$ to cluster $m$ is given by the distortion of $\widetilde{\Phi}_m$ over the combined neighbors of $x_k$ and the cluster $m$, that is, $\text{Distortion}(\widetilde{\Phi}_m, U_k \cup \widetilde{U}_m)$ (see Eq. (39)). In all other scenarios, the cost is infinity. In the form of an equation, the cost of moving $x_k$ to cluster $m$ is
\[ \text{cost}_{x_k \to m} = \begin{cases} \text{Distortion}(\widetilde{\Phi}_m, U_k \cup \widetilde{U}_m) & \text{if } |C_{c_k}| \in (0, \eta) \wedge m \in c_{U_k} \wedge |C_m| \geq |C_{c_k}|, \\ \infty & \text{otherwise}. \end{cases} \tag{44} \]
Then the cost of moving $x_k$ is
\[ \text{cost}_k = \min_m \text{cost}_{x_k \to m}, \tag{45} \]
and the destination cluster to move it into is
\[ d_k = \begin{cases} \operatorname{argmin}_m \text{cost}_{x_k \to m} & \text{if } \text{cost}_k < \infty, \\ c_k & \text{otherwise}. \end{cases} \tag{46} \]
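Equations (44)–(46) for a single point, as a sketch; the bookkeeping structures (`cluster_of`, `C_sizes`, `U`, `U_tilde`, `Phi_tilde`) are assumed to be maintained by the surrounding loop of Algorithm 4, and `distortion` is the helper sketched in Section 1.

```python
import numpy as np

def move_cost(k, cluster_of, C_sizes, U, U_tilde, Phi_tilde, X, eta, distortion):
    """cost_k and destination d_k for point k (Eqs. (44)-(46) sketch)."""
    s = cluster_of[k]
    if not (0 < C_sizes[s] < eta):              # only points in small clusters move
        return np.inf, s
    best_cost, best_m = np.inf, s
    for m in {cluster_of[kp] for kp in U[k]}:   # neighboring clusters, Eq. (41)
        if m == s or C_sizes[m] < C_sizes[s]:
            continue                            # Eq. (44): otherwise-infinite cases
        idx = np.union1d(U[k], U_tilde[m])      # combined neighbors U_k ∪ U~_m
        c = distortion(X[idx], Phi_tilde[m](X[idx]))
        if c < best_cost:
            best_cost, best_m = c, m
    return best_cost, best_m
```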
Figure 7: Computation of the cost of moving a point in a small cluster to a neighboring cluster. (Left) $x_k$ is a point, represented by a small red disc, in a small cluster $c_k$ enclosed by a solid red line. The dashed red line encloses $U_k$. Since cluster $c_k$ is small, $|C_{c_k}| \in (0, \eta)$. Clusters $m_1$, $m_2$, $m_3$ and $m_4$ are enclosed by solid colored lines too. Note that $m_1$, $m_2$ and $m_3$ lie in $c_{U_k}$ (the non-empty overlap between these clusters and $U_k$ indicates that). Since $m_4 \notin c_{U_k}$, the cost of moving $x_k$ to $m_4$ is infinity. Since the size of cluster $m_3$ is less than the size of cluster $c_k$, $|C_{m_3}| < |C_{c_k}|$, the cost of moving $x_k$ to cluster $m_3$ is also infinity. The costs of moving $x_k$ to clusters $m_1$ and $m_2$ are to be computed. (Right) The cost of moving $x_k$ to cluster $m_1$, $\text{cost}_{x_k \to m_1}$, is given by the distortion of $\widetilde{\Phi}_{m_1}$ on $U_k \cup \widetilde{U}_{m_1}$, where the dashed blue line encloses $\widetilde{U}_{m_1}$. If $\text{cost}_{x_k \to m_1}$ is less (greater) than $\text{cost}_{x_k \to m_2}$, then the destination of $x_k$ is cluster $m_1$ ($m_2$) and the associated cost of moving $x_k$, $\text{cost}_k$, is $\text{cost}_{x_k \to m_1}$ ($\text{cost}_{x_k \to m_2}$).

Given the costs of moving the points and their destination clusters, we start by picking the point with the minimum cost, say $x_k$. As shown in Figure 8, let $x_k$ be in cluster $s$ (note that $c_k$ is $s$ before $x_k$ is moved). We relabel $c_k$, the cluster of $x_k$, to $d_k$, and update the clusters $C_s$ and $C_{d_k}$ (see Eq. (40)). Since $x_k$ moved from cluster $s$ to $d_k$, the cost of moving a certain set of points needs to be recomputed. The set of indices of these points is given by $S$, where $j \in S$ if $x_j$ either belongs to cluster $d_k$ ($c_j = d_k$), has $d_k$ as the destination cluster ($d_j = d_k$), has $s$ as the destination cluster ($d_j = s$), or has cluster $s$ as a neighboring cluster ($s \in c_{U_j}$) (see Appendix C for details). After recomputing the cost of moving $x_j$ and its destination cluster using Eqs. (44, 45, 46) for all $j \in S$, we repeat the procedure until the minimum cost becomes infinity, indicating that no small cluster remains.

In the end, let $M$ be the number of non-empty clusters. We prune away the empty clusters and relabel the non-empty ones from $1$ to $M$ while updating $c_k$ accordingly, obtaining the clusters $(C_m)_{m=1}^M$ with attached parameterizations $(\widetilde{\Phi}_m)_{m=1}^M$. Finally, using Eq. (42), we obtain the $M$ intermediate views $(\widetilde{U}_m)_{m=1}^M$ of the data in the ambient space. Then, the intermediate views of the data in the embedding space are given by $(\widetilde{\Phi}_m(\widetilde{U}_m))_{m=1}^M$. Note that $\widetilde{\Phi}_m(\widetilde{U}_m)$ is a matrix with $|\widetilde{U}_m|$ rows and $d$ columns (see Eq. (38)).

Example. We continue with our example of the square grid, which originally contained about $10^4$ points. Therefore, before clustering we had about $10^4$ small local views $U_k$ and $\Phi_k(U_k)$, each containing $25$ points. After clustering with $\eta_{\min} = 10$, we obtained $635$ clusters and therefore that many intermediate views $\widetilde{U}_m$ and $\widetilde{\Phi}_m(\widetilde{U}_m)$, with an average size of $79$. When the points on the boundary are known, we obtained $562$ intermediate views with an average size of $90$. Note that there is a trade-off between the size of the intermediate views and the distortion of the parameterizations used to obtain them. For convenience, define $\widetilde{\zeta}_{mm}$ to be the distortion of $\widetilde{\Phi}_m$ on $\widetilde{U}_m$, using Eq. (39). Then, as the size of the views is increased (by increasing $\eta_{\min}$), the value of $\widetilde{\zeta}_{mm}$ would also increase. In Figure 9 we colored the points in cluster $m$, $C_m$, with $\widetilde{\zeta}_{mm}$. Note the increased distortion in comparison to Figure 6.

Figure 8: The solid red and blue lines enclose clusters $s$ and $d_k$. The dashed red and blue lines enclose $\widetilde{U}_s$ and $\widetilde{U}_{d_k}$. Points ($x_k$, $x_{j_1}$, $x_{j_2}$ and $x_{j_3}$) are represented by small colored discs, where the colors indicate the clusters they belong to. Note that $c_{j_1} = d_k$, $d_{j_2} = d_k$ and $s \in c_{U_{j_3}}$. Vertical bars on the left side of these points represent the cost associated with moving them to their destination clusters. (Left) $x_k$ is to be moved from the small cluster $s$ to cluster $d_k$. (Right) After moving $x_k$, the label (color of disc) of $x_k$ and the clusters (solid red and blue lines) are updated. The sets of their neighboring points (dashed red and blue lines) are implicitly updated. The destination clusters for the points $x_k$, $x_{j_1}$, $x_{j_2}$ and $x_{j_3}$, and the costs of moving them to their destination clusters, are also updated (see Appendix C). Note that $x_{j_2}$ may no longer have cluster $d_k$ as the destination.

Figure 9: Distortion of the obtained parameterizations of the clusters on the neighboring points of the clusters, when the points on the boundary of the square grid are not known (left) versus when they are known apriori (right).
5. Global Embedding using Procrustes Analysis
In this section, we present an algorithm based on Procrustes analysis to piece together the intermediate views $\widetilde{\Phi}_m(\widetilde{U}_m)$ to obtain a global embedding. The $M$ views $\widetilde{\Phi}_m(\widetilde{U}_m)$ are transformed by an orthogonal matrix $T_m$ of size $d \times d$, a $d$-dimensional translation vector $v_m$ and a positive scalar $b_m$ as a scaling component. The transformed views are given by $\widetilde{\Phi}^g_m(\widetilde{U}_m)$, such that for all $x_k \in \widetilde{U}_m$,
\[ \widetilde{\Phi}^g_m(x_k) = b_m \widetilde{\Phi}_m(x_k) T_m + v_m. \tag{47} \]
First we describe a general procedure to estimate these parameters, and its limitations. Then we present an algorithm which computes these parameters and a global embedding of the data while addressing the limitations of the general procedure. Finally, in Section 5.1, we describe a simple modification to our algorithm to tear apart manifolds without boundary.

In general, the parameters $(T_m, v_m, b_m)_{m=1}^M$ are estimated so that for all $m$ and $m'$, the two transformed views of the overlap between $\widetilde{U}_m$ and $\widetilde{U}_{m'}$, obtained using the parameterizations $\widetilde{\Phi}^g_m$ and $\widetilde{\Phi}^g_{m'}$, align with each other. To be more precise, define the overlap between the $m$-th and the $m'$-th intermediate views in the ambient space as the set of points that are neighbors of both the $m$-th and $m'$-th clusters:
\[ \widetilde{U}_{mm'} = \widetilde{U}_m \cap \widetilde{U}_{m'}. \tag{48} \]
In the ambient space, the $m$-th and the $m'$-th views are neighbors if $\widetilde{U}_{mm'}$ is non-empty. As shown in Figure 10 (left), these neighboring views trivially align on the overlap between them. It is natural to ask for a low distortion global embedding of the data. Therefore, we must ensure that the embeddings of $\widetilde{U}_{mm'}$ due to the $m$-th and the $m'$-th views in the embedding space also align with each other. Thus, the parameters $(T_m, v_m, b_m)_{m=1}^M$ are estimated so that $\widetilde{\Phi}^g_m(\widetilde{U}_{mm'})$ aligns with $\widetilde{\Phi}^g_{m'}(\widetilde{U}_{mm'})$ for all $m$ and $m'$ (see Figure 10). However, due to the distortion of the parameterizations, it is usually not possible to perfectly align the two embeddings. We can represent both embeddings of the overlap as matrices with $|\widetilde{U}_{mm'}|$ rows and $d$ columns. We then measure the error in the alignment as the squared Frobenius norm of the difference of the two matrices. The error is trivially zero if $\widetilde{U}_{mm'}$ is empty. Overall, the estimated parameters minimize the following alignment error:
\[ \mathcal{L}((T_m, v_m, b_m)_{m=1}^M) = \sum_{m, m'} \left\| \widetilde{\Phi}^g_m(\widetilde{U}_{mm'}) - \widetilde{\Phi}^g_{m'}(\widetilde{U}_{mm'}) \right\|_F^2. \tag{49} \]

Figure 10: (Left) The intermediate views $\widetilde{U}_m$ and $\widetilde{U}_{m'}$ of a 2d manifold in a possibly high dimensional ambient space. These views trivially align with each other. The red star in blue circles represents their overlap $\widetilde{U}_{mm'}$. (Middle) The $m$-th and $m'$-th intermediate views in the 2d embedding space. (Right) The views after aligning $\widetilde{\Phi}_m(\widetilde{U}_{mm'})$ with $\widetilde{\Phi}_{m'}(\widetilde{U}_{mm'})$.

In theory, one can start with $T_m$, $v_m$ and $b_m$ as $I_d$, $0$ and $1$, and directly use GPA [17, 18, 19] to obtain a local minimum of the above alignment error. This approach has two issues.
First, like most optimization algorithms, the rate of convergence to a local minimum and the quality of that minimum depend on the initialization of the parameters. With a poor initialization, the algorithm may take an enormous time to converge and may also converge to an inferior local minimum. Second, using GPA to align a view with all of its adjacent views would prevent us from tearing apart manifolds without a boundary; as an example, see Figure 11(b).

To address the first issue, we present the pseudocode of our algorithm below, followed by a full explanation of the steps. Subsequently, in Section 5.1, to overcome the second issue, we present a simple modification of step R2 in the algorithm.
Algorithm 5: Calculate-Global-Embedding
Input: $(x_k, c_k)_{k=1}^n$, $(C_m, \widetilde{\Phi}_m, \widetilde{U}_m)_{m=1}^M$, $\nu$
Output: $(T_m, b_m, v_m)_{m=1}^M$
Initialize $T_m \leftarrow I$, $v_m \leftarrow 0$; compute $b_m$ (Eq. (50));
Compute $(s_m, p_m)_{m=1}^M$ (Eqs. (D.5, D.7) in Appendix D);
for $m \leftarrow 2$ to $M$ do
  $s \leftarrow s_m$, $p \leftarrow p_m$;
  (R1) $T_s, v_s \leftarrow$ Procrustes($\widetilde{\Phi}^g_p(\widetilde{U}_{sp})$, $\widetilde{\Phi}^g_s(\widetilde{U}_{sp})$, no scaling) [20, 21];
  (R2) Compute $Z_s$ (either Eq. (51) or (53));
  (R3) $\mu_s \leftarrow$ centroid of $(\widetilde{\Phi}^g_{m'}(\widetilde{U}_{sm'}))_{m' \in Z_s}$;
  (R4) $T_s, v_s \leftarrow$ Procrustes($\mu_s$, $\widetilde{\Phi}^g_s(\cup_{m' \in Z_s}\widetilde{U}_{sm'})$, no scaling) [20, 21];
end
Refine $(T_m, b_m, v_m)_{m=1}^M$ by executing R1 to R4 for each $m \in \{1, \ldots, M\}$, several times;
Compute $(y_k)_{k=1}^n$ (Eq. (52)).

Figure 11: This example will follow us in this section. (a.1 and a.2) Nine intermediate views $(\widetilde{U}_{s_m})_{m=1}^9$ of a 2d manifold are shown when they are placed on a flat plane. In the first case, $\widetilde{U}_{s_9}$ has $\widetilde{U}_{s_7}$ and $\widetilde{U}_{s_8}$ as the only neighboring views among the nine views, as shown in (a.1). In the second case, the 2d manifold is assumed to have no boundary, in which case there will be at least one view with neighboring views lying far apart on the flat plane. We assume $\widetilde{U}_{s_9}$ to be that view, having $\widetilde{U}_{s_1}$, $\widetilde{U}_{s_7}$ and $\widetilde{U}_{s_8}$ as the neighboring views, as shown in (a.1) and (a.2). (b) The intermediate views $(\widetilde{\Phi}_{s_m}(\widetilde{U}_{s_m}))_{m=1}^9$ in the 2d embedding space, as they were passed as input to Algorithm 5. These views are scrambled in the embedding space, and Algorithm 5 will move them to the right location. (c) The transformed views after calculating the scaling $b_m$ as in Eq. (50).

We initialize with $T_m = I_d$, $v_m$ as the zero vector, and compute $b_m$ so as to bring the intermediate views $\widetilde{\Phi}_m(\widetilde{U}_m)$ to the same scale as their counterparts $\widetilde{U}_m$ in the ambient space. In turn, this brings all the views to a similar scale (see Figure 11(c)). We compute the scaling component $b_m$ to be the ratio of the median distance between unique points in $\widetilde{U}_m$ and in $\widetilde{\Phi}_m(\widetilde{U}_m)$, that is,
\[ b_m = \frac{\operatorname{median}\left\{ d_e(x_k, x_{k'}) \mid x_k, x_{k'} \in \widetilde{U}_m,\ x_k \neq x_{k'} \right\}}{\operatorname{median}\left\{ \|\widetilde{\Phi}_m(x_k) - \widetilde{\Phi}_m(x_{k'})\| \mid x_k, x_{k'} \in \widetilde{U}_m,\ x_k \neq x_{k'} \right\}}. \tag{50} \]
Then we transform the views in a sequence $(s_m)_{m=1}^M$. Let the $p_m$-th view be the parent of the $s_m$-th view. The computation of these sequences is described in Appendix D. Here $p_m$ lies in $\{s_1, \ldots, s_{m-1}\}$ and it is a neighboring view of the $s_m$-th view in the ambient space, that is, $\widetilde{U}_{s_mp_m}$ is non-empty. The first view in the sequence (the $s_1$-th view) is not transformed, so $T_{s_1}$ and $v_{s_1}$ are not updated. We then iterate over $m$, which varies from $2$ to $M$. For convenience, denote $s_m$ by $s$ and $p_m$ by $p$. To update $T_s$ and $v_s$, the following four step procedure is followed.

Figure 12: In continuation of Figure 11. (d) The transformed intermediate views $(\widetilde{\Phi}^g_{s_m}(\widetilde{U}_{s_m}))_{m=1}^8$ before the start of the iteration $m = 9$ are shown. (e) Assuming $p = s_8$, in step R1, $T_{s_9}$ and $v_{s_9}$ are computed so that the view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9s_8})$ aligns with the view $\widetilde{\Phi}^g_{s_8}(\widetilde{U}_{s_9s_8})$. The resulting transformed view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9})$ is shown. Note that step R1 results in the same output for both cases described in Figure 11.
Step R1. We compute a temporary value of $T_s$ and $v_s$ by aligning the view $\widetilde{\Phi}^g_s(\widetilde{U}_{sp})$ with $\widetilde{\Phi}^g_p(\widetilde{U}_{sp})$ on the overlap $\widetilde{U}_{sp}$, using Procrustes analysis [20, 21] without modifying $b_s$ (see Figure 12).

Step R2. Then we identify more views to align the $s$-th view with. We compute a subset $Z_s$ of $\{s_1, \ldots, s_{m-1}\}$, the indices of the already aligned $m - 1$ views: $m' \in Z_s$ if the $s$-th view and the $m'$-th view are neighbors in the ambient space (see Figure 13). Therefore,
\[ Z_s = \{ m' \mid \widetilde{U}_{sm'} \neq \emptyset \} \cap \{s_1, \ldots, s_{m-1}\}. \tag{51} \]

Figure 13: In continuation of Figures 11 and 12. In the first case, as described in the caption of Figure 11, among the nine views, $\widetilde{U}_{s_9}$ has non-empty overlaps with $\widetilde{U}_{s_7}$ and $\widetilde{U}_{s_8}$ only. Therefore, step R2 results in $Z_{s_9} = \{s_7, s_8\}$ and the $\mu_{s_9}$ obtained in step R3 is shown in black in (f.1). On the other hand, in the second case, when the manifold is without boundary, $\widetilde{U}_{s_9}$ has non-empty overlaps with $\widetilde{U}_{s_1}$, $\widetilde{U}_{s_7}$ and $\widetilde{U}_{s_8}$. Therefore, in this case, step R2 results in $Z_{s_9} = \{s_1, s_7, s_8\}$ and the $\mu_{s_9}$ obtained in step R3 is again shown in black in (f.2).

Step R3. We then compute the centroid $\mu_s$ of the views $(\widetilde{\Phi}^g_{m'}(\widetilde{U}_{sm'}))_{m' \in Z_s}$ (see Figure 13). Here $\mu_s$ is a matrix with $d$ columns and the number of rows given by the size of the set $\cup_{m' \in Z_s}\widetilde{U}_{sm'}$. A point in this set can have multiple embeddings due to the multiple parameterizations $(\widetilde{\Phi}^g_{m'})_{m' \in Z_s}$, depending on the overlaps $(\widetilde{U}_{sm'})_{m' \in Z_s}$ it lies in. The mean of these embeddings forms a row of $\mu_s$.

Step R4. Finally, we update $T_s$ and $v_s$ by aligning the view $\widetilde{\Phi}^g_s(\widetilde{U}_{sm'})$ with $\widetilde{\Phi}^g_{m'}(\widetilde{U}_{sm'})$ for all $m' \in Z_s$. This alignment is based on the approach in [17, 18] where, using Procrustes analysis [20, 21], the view $\widetilde{\Phi}^g_s(\cup_{m' \in Z_s}\widetilde{U}_{sm'})$ is aligned with the centroid $\mu_s$, without modifying $b_s$ (see Figure 14).

Figure 14: In continuation of Figures 11–13. For the first case, in step R4, $T_{s_9}$ and $v_{s_9}$ are updated so that the view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9s_7} \cup \widetilde{U}_{s_9s_8})$ aligns with $\mu_{s_9}$. The resulting view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9})$ is shown in (g.1). In the second case, when the manifold has no boundary, $T_{s_9}$ and $v_{s_9}$ are updated in step R4 so that the view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9s_1} \cup \widetilde{U}_{s_9s_7} \cup \widetilde{U}_{s_9s_8})$ aligns with $\mu_{s_9}$. The resulting view $\widetilde{\Phi}^g_{s_9}(\widetilde{U}_{s_9})$ is shown in (g.2), which is not a desired output as it sets off the distortion of the global embedding. We resolve this issue in Section 5.1.

To further refine the parameters $(T_m, v_m, b_m)_{m=1}^M$, we iterate over all the views in random order and perform steps R2, R3 and R4 of Algorithm 5. We repeat this procedure several times, first without refining $b_m$ in the Procrustes algorithm [17, 21] and then with refining $b_m$ (in this case, the Procrustes method in step R4 returns an additional output $b_s$). In the end, we compute the global embedding $y_k$ of $x_k$ by mapping $x_k$ using the transformed parameterization attached to the cluster $c_k$ it belongs to,
\[ y_k = \widetilde{\Phi}^g_{c_k}(x_k). \tag{52} \]
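Steps R1 and R4 both reduce to an orthogonal Procrustes problem: find the orthogonal $T$, translation $v$ (and optionally scale $b$) that best align two paired point sets. A standard SVD-based sketch of [20, 21] follows; it is a textbook solution, not the authors' implementation.

```python
import numpy as np

def procrustes(A, B, scaling=False):
    """Align B to A: find orthogonal T, translation v (and b if scaling)
    minimizing ||A - (b * B @ T + v)||_F, as used in steps R1 and R4.

    A, B : (n, d) matrices of corresponding points.
    """
    muA, muB = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - muA, B - muB
    U, sig, Vt = np.linalg.svd(B0.T @ A0)   # SVD of the cross-covariance
    T = U @ Vt                              # optimal orthogonal map (may reflect)
    b = sig.sum() / (B0 ** 2).sum() if scaling else 1.0
    v = muA - b * muB @ T
    return T, v, b

# Usage: recover a rotated, shifted copy exactly.
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 2))
ang = 0.7
R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
B = A @ R.T + 3.0
T, v, b = procrustes(A, B)
print(np.allclose(b * B @ T + v, A))  # True
```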
5.1. Tearing Apart Manifolds Without Boundary

When the manifold has no boundary, step R2 in Algorithm 5 may result in a set $Z_s$ containing the indices of views which are neighbors of the $s$-th view in the ambient space but are far apart from the transformed $s$-th view in the embedding space obtained right after step R1. For example, as shown in Figure 13 (f.2), $s_1 \in Z_{s_9}$ because the $s_9$-th view and the $s_1$-th view are neighbors in the ambient space (see Figure 11 (a.1, a.2)), but in the embedding space they are far apart. Due to such indices in $Z_s$, step R3 results in a centroid which, when used in step R4, results in a fallacious estimation of the parameters $T_s$ and $v_s$, giving rise to a high distortion embedding. By trying to align with all its neighbors in the ambient space, the $s$-th view is misaligned with respect to all of them (see Figure 14 (g.2)).

Therefore, we modify step R2 so as to include in the set $Z_s$ the indices of only those views which are neighbors of the $s$-th view in both the ambient space and the embedding space. We denote the overlap between the $m$-th and $m'$-th views in the embedding space by $\widetilde{U}^g_{mm'}$. There may be multiple heuristics for computing $\widetilde{U}^g_{mm'}$ which could work. In Appendix E, we describe a simple approach based on the machinery already developed in this paper. Having obtained $\widetilde{U}^g_{mm'}$, we say that the $m$-th and the $m'$-th intermediate views in the embedding space are neighbors if $\widetilde{U}^g_{mm'}$ is non-empty.

Step R2 updated. Finally, we compute $Z_s$ as
\[ Z_s = \{ m' \mid \widetilde{U}_{sm'} \neq \emptyset,\ \widetilde{U}^g_{sm'} \neq \emptyset \} \cap \{s_1, \ldots, s_{m-1}\}. \tag{53} \]
Note that if it is known apriori that the manifold can be embedded in lower dimension without tearing it apart, then we do not require the above modification. In all of our experiments except the one in Section 6.3, we do not assume that this information is available.

With this modification, the set $Z_{s_9}$ in Figure 13 (f.2) will not include $s_1$, and therefore the resulting centroid in step R3 would be the same as the one in Figure 13 (f.1). Subsequently, the transformed $s_9$-th view would be the one in Figure 14 (g.1) rather than Figure 14 (g.2).

Having knowingly torn the manifold apart, we provide at the output information on the points belonging to the tear and their neighboring points in the ambient space. To encode the "gluing" instructions along the tear in the form of colors at the output of our algorithm, we recompute $\widetilde{U}^g_{mm'}$. If $\widetilde{U}_{mm'}$ is non-empty but $\widetilde{U}^g_{mm'}$ is empty, then the $m$-th and $m'$-th views are neighbors in the ambient space but are torn apart in the embedding space. Therefore, we color the global embeddings of the points on the overlap $\widetilde{U}_{mm'}$ with the same color to indicate that, although these points are separated in the embedding space, they are adjacent in the ambient space (see Figures 19, 20 and F.23).

Example. The obtained global embeddings of our square grid, with $\nu = 3$ as in Appendix E, are shown in Figure 15. Note that the boundary of the obtained embedding is more distorted when the points on the boundary are unknown than when they are known apriori. This is because the intermediate views near the boundary have higher distortion in the former case than in the latter (see Figure 9).

Figure 15: Global embedding of the square grid when the points on the boundary are unknown (left) versus when they are known apriori (right).
6. Experimental Results
We present experiments comparing LDLE with UMAP [5], t-SNE [4] and Laplacian eigenmaps [3] on several datasets. These datasets are discretized 2d manifolds embedded in $\mathbb{R}^2$, $\mathbb{R}^3$ or $\mathbb{R}^4$, containing about $10^4$ points. These manifolds can be grouped based on the presence of a boundary and their orientability, as shown in Figures 16, 19 and 20. The inputs are shown in the figures themselves, except for the flat torus and the Klein bottle, as their 4D parameterizations cannot be plotted. Therefore, we describe their construction below.

Flat torus. A flat torus is a parallelogram whose opposite sides are identified. In our case, we construct a discrete flat torus using a rectangle with sides 2 and 0.5,

X(\theta_i, \phi_j) = \frac{1}{4\pi}\left(4\cos(\theta_i),\ 4\sin(\theta_i),\ \cos(\phi_j),\ \sin(\phi_j)\right),   (54)

where $\theta_i = 0.01 i \pi$, $\phi_j = 0.04 j \pi$, $i \in \{1, \ldots, 200\}$ and $j \in \{1, \ldots, 50\}$.

Klein bottle. A Klein bottle is a non-orientable two dimensional manifold without boundary. We construct a discrete Klein bottle using its 4D Möbius tube representation as follows,

X(\theta_i, \phi_j) = \left(R(\phi_j)\cos\theta_i,\ R(\phi_j)\sin\theta_i,\ r\sin\phi_j\cos(\theta_i/2),\ r\sin\phi_j\sin(\theta_i/2)\right),   (55)

R(\phi_j) = R + r\cos\phi_j,   (56)

where $\theta_i = i\pi/50$, $\phi_j = j\pi/50$, $i \in \{1, \ldots, 100\}$ and $j \in \{1, \ldots, 100\}$.

To embed using LDLE, for all the examples, we use the Euclidean metric, $k_{nn} = 49$, $k_{tune} = 7$, $N = 100$, $d = 2$, $p = 0.99$, $k = 25$, $\tau_i = 50$, $\delta_i = 0.9$, $\nu = 3$ and $N_r = 100$. Only the value of $\eta_{min}$ varies across the examples and is provided in Table F.1.

For UMAP, t-SNE and Laplacian eigenmaps, we use the Euclidean metric and select the hyperparameters by grid search, choosing the values which result in the best visualization quality. For UMAP, we use 500 epochs and search for the optimal n_neighbors and min_dist over a small grid of candidate values. For t-SNE, we use 1000 iterations and search for the optimal perplexity and early exaggeration. For Laplacian eigenmaps, we search over candidate values of $k_{nn}$ and $k_{tune}$. The chosen values of the hyperparameters are provided in Table F.1. We note that Laplacian eigenmaps fails to correctly embed most of the examples regardless of the choice of hyperparameters.
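For concreteness, a sketch generating the two 4D inputs of Eqs. (54)-(56); the grid ranges follow the reconstruction above, and the tube radii R and r of the Klein bottle are illustrative choices, as their values are not fixed here:

```python
import numpy as np

def flat_torus(n_theta=200, n_phi=50):
    """Grid on the flat torus of Eq. (54): a 2 x 0.5 rectangle with
    opposite sides identified, embedded in R^4."""
    theta = 0.01 * np.arange(1, n_theta + 1) * np.pi
    phi = 0.04 * np.arange(1, n_phi + 1) * np.pi
    t, p = map(np.ravel, np.meshgrid(theta, phi))
    return np.column_stack([4*np.cos(t), 4*np.sin(t),
                            np.cos(p), np.sin(p)]) / (4*np.pi)

def klein_bottle(n=100, R=1.0, r=0.3):
    """Grid on the 4D Mobius-tube Klein bottle of Eqs. (55)-(56)."""
    theta = np.arange(1, n + 1) * np.pi / (n / 2)   # covers (0, 2*pi]
    phi = np.arange(1, n + 1) * np.pi / (n / 2)
    t, p = map(np.ravel, np.meshgrid(theta, phi))
    Rp = R + r * np.cos(p)                          # Eq. (56)
    return np.column_stack([Rp*np.cos(t), Rp*np.sin(t),
                            r*np.sin(p)*np.cos(t/2), r*np.sin(p)*np.sin(t/2)])

X_torus, X_klein = flat_torus(), klein_bottle()     # ~10^4 points each
print(X_torus.shape, X_klein.shape)                 # (10000, 4) (10000, 4)
```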
In Figure 16, we show the 2d embeddings of 2d manifolds with boundary, embedded in $\mathbb{R}^2$ or $\mathbb{R}^3$. To a large extent, LDLE preserved the shape of the holes. UMAP and Laplacian eigenmaps distorted the shape of the holes and the region around them, while t-SNE produced dissected embeddings. For the sphere with a hole, which is a curved 2d manifold with boundary embedded in $\mathbb{R}^3$, both UMAP and Laplacian eigenmaps squeezed it into $\mathbb{R}^2$, while LDLE and t-SNE tore it apart. The correctness of the LDLE embedding is demonstrated in Figure F.23.

We note that the boundaries of the LDLE embeddings in Figure 16 are usually distorted. The cause of this is explained in Remark 1. When the points of the input which lie on the boundary are known a priori, the distortion near the boundary can be reduced using the doubled manifold, as discussed in Remark 2 and shown in Figure 5. The LDLE embeddings obtained when the points on the boundary are known are shown in Figure 17.

Figure 16: Embeddings of 2d manifolds with boundary (barbell, square with two holes, sphere with a hole, swiss roll with a hole, noisy swiss roll; rows: input, LDLE, UMAP, t-SNE, Laplacian eigenmaps) into $\mathbb{R}^2$. The noisy swiss roll is constructed by adding small uniform noise in all three dimensions.

Figure 17: LDLE embeddings of the barbell, the square with two holes and the swiss roll with a hole when the points on the boundary are known a priori.

To compare the embeddings quantitatively, we plot the distortion of the geodesics originating from each point of the input in Figure 18. In the discrete setting, we take the geodesic between two given points to be the shortest path between them. Using Dijkstra's algorithm over a graph of 5 nearest neighbors, with $d_e$ as the underlying metric, we compute the sequence of nodes on the shortest path between $x_k$ and $x_{k'}$. Denote the number of nodes on this path by $n_{kk'}$, the sequence of nodes by $(x_{kk's})_{s=1}^{n_{kk'}}$, and the length of the path in the input space by

L_{kk'} = \sum_{s=2}^{n_{kk'}} d_e(x_{kk's}, x_{kk'(s-1)}).   (57)

Denote the embedding of $x_k$ by $y_k$. Then the length of the shortest path between $x_k$ and $x_{k'}$ in the embedding space is given by

L^g_{kk'} = \sum_{s=2}^{n_{kk'}} d_e(y_{kk's}, y_{kk'(s-1)}).   (58)
Finally, the distortion of the paths originating from $x_k$ is given by

D_k = \left( \sup_{k'} \frac{L^g_{kk'}}{L_{kk'}} \right) \left( \sup_{k'} \frac{L_{kk'}}{L^g_{kk'}} \right).   (59)

Compared to the other algorithms, the global embedding produced by LDLE least distorts the shortest geodesics in the input space, especially when the points on the boundary are known a priori.

In Figure 19, we show the 2d embeddings of 2d manifolds without boundary, a curved torus in $\mathbb{R}^3$ and a flat torus in $\mathbb{R}^4$. LDLE produced similar representations for both inputs; none of the other methods does so. The main difference between the LDLE embeddings of the two inputs lies in the lengths of the same-colored pieces of the boundary. Note that these pieces are adjacent in the input space. For the flat torus, two pieces with the same color are almost equal in length, while for the curved torus they usually have different lengths. This is due to the difference in the curvature of the two inputs: zero everywhere for the flat torus and non-zero almost everywhere on the curved torus. The mathematical correctness of the LDLE embeddings, using a cut-and-paste argument, is shown in Figure F.23.

In Figure 20, we show the 2d embeddings of non-orientable 2d manifolds, a Möbius strip in $\mathbb{R}^3$ and a Klein bottle in $\mathbb{R}^4$. Laplacian eigenmaps produced incorrect embeddings, t-SNE produced dissected and non-interpretable embeddings, and UMAP squeezed the inputs into $\mathbb{R}^2$. On the other hand, LDLE produced mathematically correct embeddings by tearing both inputs apart to embed them into $\mathbb{R}^2$.

In Figure 21, motivated by [22], we embed a 42 dimensional synthetic dataset representing the signal strength of 42 transmitters at about $n = 6000$ receiving locations on a toy floor plan. The transmitters and the receivers are distributed uniformly across the floor. Let $(tr_k)_{k=1}^{42}$ be the transmitter locations and $r_i$ be the $i$th receiver location. Then the $i$th data point $x_i$ is given by $(e^{-\|r_i - tr_k\|})_{k=1}^{42}$. The resulting dataset is embedded into $\mathbb{R}^2$ using LDLE, UMAP, t-SNE and Laplacian eigenmaps. The hyperparameters are set in the same way as in the above examples (see Table F.1), except for $\nu$ in Algo. 5, which is set to infinity since we know a priori that the data does not represent a manifold without boundary. The obtained embeddings are shown in Figure 21. The shapes of the holes are better preserved by LDLE than by the other algorithms, but the corners of the LDLE embedding are more distorted. The reason for the distorted corners is given in Remark 1.
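A direct sketch of the distortion metric of Eqs. (57)-(59), assuming SciPy and scikit-learn; the function name is ours, and the quadratic loop favors clarity over speed:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

def geodesic_distortion(X, Y, n_neighbors=5):
    """Distortion D_k of Eq. (59) for every input point.

    X is the (n, D) input, Y the (n, d) embedding. Shortest paths are
    computed with Dijkstra on the n_neighbors-NN graph of X, Eq. (57),
    and each path is re-measured edge by edge in the embedding, Eq. (58).
    """
    n = X.shape[0]
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    L, pred = dijkstra(G, directed=False, return_predecessors=True)
    Lg = np.zeros_like(L)
    for k in range(n):
        for kp in range(n):
            s = kp
            while pred[k, s] >= 0:                 # walk back to the source
                Lg[k, kp] += np.linalg.norm(Y[s] - Y[pred[k, s]])
                s = pred[k, s]
    with np.errstate(divide='ignore', invalid='ignore'):
        # exclude k' = k and disconnected pairs via NaN
        ratio = np.where((L > 0) & np.isfinite(L), Lg / L, np.nan)
    return np.nanmax(ratio, axis=1) * np.nanmax(1.0 / ratio, axis=1)  # Eq. (59)
```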
7. Conclusions and Future Directions
We have presented a new bottom-up approach (LDLE) for manifold learning which constructs low-dimensional, low distortion local parameterizations of the data and registers them to obtain a global embedding. We demonstrated on various examples that LDLE competes with the other methods in terms of visualization quality. In particular, the embeddings produced by LDLE are geometrically more accurate than those produced by UMAP, t-SNE and Laplacian eigenmaps. We also demonstrated that LDLE can embed manifolds without boundary as well as non-orientable manifolds into their intrinsic dimension, a feature that is missing from the existing techniques.

Some future directions of our work are as follows.

• It is natural to expect a manifold learning technique to fall back to clustering when the data in high dimension does not represent a low dimensional manifold. Therefore, we aim to augment LDLE so as to obtain a global embedding of clustered data (at multiple resolutions) as in [23].

• As observed in the experimental results, when the boundary of the manifold is known a priori, the global embedding tends to have low distortion near the boundary as well. We aim to develop a method based on [24] to approximately calculate the boundary of the manifold and analyze LDLE results using the manifold doubled along the estimated boundary.

• The spectrum of the Laplacian has been used in prior work for anomaly detection [25, 26, 27, 28]. In [29, 27], subsets of Laplacian eigenvectors were identified so as to separate small clusters from a large background component. In our case, local views built using Laplacian eigenfunctions which have unusually high distortion may point to the presence of anomalous data points in those views. We aim to further investigate this direction to develop an anomaly detection technique.
Figure 18: Distortion of the geodesics originating from each point in the input (see Eq. (59)) for the rectangle (4 × 0.25), the barbell, the square with two holes, the swiss roll with a hole and the noisy swiss roll.

Figure 19: Embeddings of 2d manifolds without boundary (a curved torus, and the flat torus of Eq. (54)) into $\mathbb{R}^2$. For each manifold, the left and right columns contain the same plots colored by the two parameters of the manifold. A proof of the mathematical correctness of the LDLE embeddings is provided in Figure F.23.

Figure 20: Embeddings of 2d non-orientable manifolds (a Möbius strip, and the Klein bottle of Eqs. (55)-(56)) into $\mathbb{R}^2$. For each manifold, the left and right columns contain the same plots colored by the two parameters of the manifold. A proof of the mathematical correctness of the LDLE embeddings is provided in Figure F.23.

Figure 21: Embedding of the 42 dimensional signal strength dataset into $\mathbb{R}^2$: the true floor plan and the LDLE, UMAP, t-SNE and Laplacian eigenmaps outputs (see Section 6.3 for details).

References

[1] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323-2326.
[2] D. L. Donoho, C. Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100 (10) (2003) 5591-5596.
[3] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373-1396.
[4] L. v. d. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (Nov) (2008) 2579-2605.
[5] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426.
[6] P. W. Jones, M. Maggioni, R. Schul, Universal local parametrizations via heat kernels and eigenfunctions of the Laplacian, arXiv preprint arXiv:0709.1975.
[7] N. Saito, How can we naturally order and organize graph Laplacian eigenvectors?, in: 2018 IEEE Statistical Signal Processing Workshop (SSP), 2018, pp. 483-487. doi:10.1109/SSP.2018.8450808.
[8] Y.-C. Chen, M. Meila, Selecting the independent coordinates of manifolds with large aspect ratios, in: Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019, pp. 1088-1097. URL https://proceedings.neurips.cc/paper/2019/file/6a10bbd480e4c5573d8f3af73ae0454b-Paper.pdf
[9] C. J. Dsilva, R. Talmon, R. R. Coifman, I. G. Kevrekidis, Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study, Applied and Computational Harmonic Analysis 44 (3) (2018) 759-773. doi:10.1016/j.acha.2015.06.008. URL https://www.sciencedirect.com/science/article/pii/S1063520315000949
[10] Y. Blau, T. Michaeli, Non-redundant spectral dimensionality reduction, CoRR abs/1612.03412. arXiv:1612.03412. URL http://arxiv.org/abs/1612.03412
[11] R. R. Coifman, S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (1) (2006) 5-30.
[12] S. Steinerberger, On the spectral resolution of products of Laplacian eigenfunctions, arXiv preprint arXiv:1711.09826.
[13] A. Cloninger, S. Steinerberger, On the dual geometry of Laplacian eigenfunctions, Experimental Mathematics (2018) 1-11. doi:10.1080/10586458.2018.1538911.
[14] D. Li, D. B. Dunson, Geodesic distance estimation with spherelets, arXiv preprint arXiv:1907.00296.
[15] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: Advances in Neural Information Processing Systems, 2005, pp. 1601-1608.
[16] S. Lafon, Diffusion Maps and Geometric Harmonics, PhD Thesis (2004).
[17] F. Crosilla, A. Beinat, Use of generalised Procrustes analysis for the photogrammetric block adjustment by independent models, ISPRS Journal of Photogrammetry and Remote Sensing 56 (2002) 195-209. doi:10.1016/S0924-2716(02)00043-6.
[18] J. C. Gower, Generalized Procrustes analysis, Psychometrika 40 (1) (1975) 33-51.
[19] J. M. Ten Berge, Orthogonal Procrustes rotation for two or more matrices, Psychometrika 42 (2) (1977) 267-276.
[20] J. C. Gower, G. B. Dijksterhuis, et al., Procrustes Problems, Vol. 30, Oxford University Press on Demand, 2004.
[21] MATLAB, Procrustes analysis, Statistics and Machine Learning Toolbox, The MathWorks, Natick, MA, USA, 2018.
[22] E. Peterfreund, O. Lindenbaum, F. Dietrich, T. Bertalan, M. Gavish, I. G. Kevrekidis, R. R. Coifman, LOCA: LOcal Conformal Autoencoder for standardized data coordinates, arXiv preprint arXiv:2004.07234.
[23] A. Y. Ng, M. I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS'01, MIT Press, Cambridge, MA, USA, 2001, pp. 849-856.
[24] T. Berry, T. Sauer, Density estimation on manifolds with boundary, Computational Statistics & Data Analysis 107 (2017) 1-17.
[25] A. Cloninger, W. Czaja, Eigenvector localization on data-dependent graphs, in: 2015 International Conference on Sampling Theory and Applications (SampTA), 2015, pp. 608-612. doi:10.1109/SAMPTA.2015.7148963.
[26] G. Mishne, I. Cohen, Multiscale anomaly detection using diffusion maps, IEEE Journal of Selected Topics in Signal Processing 7 (1) (2013) 111-123. doi:10.1109/JSTSP.2012.2232279.
[27] X. Cheng, G. Mishne, Spectral embedding norm: Looking deep into the spectrum of the graph Laplacian, SIAM Journal on Imaging Sciences 13 (2) (2020) 1015-1048.
[28] G. Mishne, U. Shaham, A. Cloninger, I. Cohen, Diffusion nets, Applied and Computational Harmonic Analysis 47 (2) (2019) 259-285. doi:10.1016/j.acha.2017.08.007.
[29] G. Mishne, R. R. Coifman, M. Lavzin, J. Schiller, Automated cellular structure extraction in biological images with applications to calcium imaging data, bioRxiv. doi:10.1101/313981.
[30] Y. Canzani, Analysis on Manifolds via the Laplacian, lecture notes.
[31] P. H. Schönemann, A generalized solution of the orthogonal Procrustes problem, Psychometrika 31 (1) (1966) 1-10.
Appendix A. Proof of Theorem 2
Choose $\epsilon > 0$ such that $\exp_x : T_xM \to M$ is a well-defined diffeomorphism on $B_\epsilon \subset T_xM$, where $T_xM$ is the tangent space to $M$ at $x$, $\exp_x(0) = x$ and

B_\epsilon = \{ v \in T_xM \mid \|v\| < \epsilon \}.   (A.1)

Then, using [30, lem. 48, prop. 50, th. 51], for all $y \in B_\epsilon(x)$, where

B_\epsilon(x) = \{ y \in M \mid d_g(x, y) < \epsilon \},   (A.2)

we have

p(t, x, y) = G(t, x, y)\,(u_0(x, y) + t\,u_1(x, y) + O(t^2)),   (A.3)

where

G(t, x, y) = \frac{e^{-d_g(x,y)^2/4t}}{(4\pi t)^{d/2}},   (A.4)

u_0(x, y) = 1 + O(\|v\|^2), \quad y = \exp_x(v), \ v \in T_xM,   (A.5)

and for $f \in C^\infty(M)$ the following hold:

f(x) = \lim_{t \to 0} \int_M p(t, x, y) f(y)\, \omega_g(y)   (A.6)
     = \lim_{t \to 0} \int_{B_\epsilon(x)} p(t, x, y) f(y)\, \omega_g(y),   (A.7)

f(x) = \lim_{t \to 0} \int_{B_\epsilon(x)} G(t, x, y) f(y)\, \omega_g(y),   (A.8)

u_0(x, x) f(x) = \lim_{t \to 0} \int_{B_\epsilon(x)} G(t, x, y)\, u_0(x, y) f(y)\, \omega_g(y).   (A.9)

Using the above equations and the definitions of $\Psi_{kij}(y)$ in Eq. (12) and $A_{kij}$ in Eq. (13), we compute the limiting value of the scaled local correlation (see Eq. (16)),

\widetilde{A}_{kij} = \lim_{t \to 0} \frac{A_{kij}}{2t}   (A.10)
                    = \lim_{t \to 0} \frac{1}{2t} \int_M p(t, x_k, y)\, \Psi_{kij}(y)\, \omega_g(y),   (A.11)

which will turn out to be the inner product of the gradients of the eigenfunctions $\phi_i$ and $\phi_j$ at $x_k$. We start by choosing $\epsilon_k > 0$ such that $\exp_{x_k}$ is a well-defined diffeomorphism on $B_{\epsilon_k} \subset T_{x_k}M$. Using Eq. (A.7), we change the region of integration from $M$ to $B_{\epsilon_k}(x_k)$,

\widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}(x_k)} p(t_k, x_k, y)\, \Psi_{kij}(y)\, \omega_g(y).   (A.12)

Substitute $p(t_k, x_k, y)$ from Eq. (A.3) and simplify using Eqs. (A.8, A.9) and the fact that $\Psi_{kij}(x_k) = 0$ to get

\widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}(x_k)} G(t_k, x_k, y)\,(u_0(x_k, y) + t_k u_1(x_k, y) + O(t_k^2))\, \Psi_{kij}(y)\, \omega_g(y)
  = \lim_{t_k \to 0} \left( \frac{1}{2t_k} \int_{B_{\epsilon_k}(x_k)} G(t_k, x_k, y)\, u_0(x_k, y)\, \Psi_{kij}(y)\, \omega_g(y) + \frac{t_k u_1(x_k, x_k)\Psi_{kij}(x_k) + O(t_k^2)\Psi_{kij}(x_k)}{2t_k} \right)
  = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}(x_k)} G(t_k, x_k, y)\, u_0(x_k, y)\, \Psi_{kij}(y)\, \omega_g(y).   (A.13)

Replace $y \in B_{\epsilon_k}(x_k)$ by $\exp_{x_k}(v)$, where $v \in B_{\epsilon_k} \subset T_{x_k}M$ and $\|v\| = d_g(x_k, y)$. Denote the Jacobian of the change of variables by $J(v)$, i.e. $J(v) = \frac{d}{dv} \exp_{x_k}(v)$. Note that $\exp_{x_k}(0) = x_k$ and $J(0) = I$. Using the Taylor expansions of $\phi_i$ and $\phi_j$ about $0$, we obtain

\phi_s(y) = \phi_s(\exp_{x_k}(v)) = \phi_s(\exp_{x_k}(0)) + \nabla\phi_s(\exp_{x_k}(0))^T J(0)\, v + O(\|v\|^2)
          = \phi_s(x_k) + \nabla\phi_s(x_k)^T v + O(\|v\|^2), \quad s = i, j.   (A.14)

Substituting the above equation in the definition of $\Psi_{kij}(y)$ (see Eq. (12)), we get

\Psi_{kij}(y) = \Psi_{kij}(\exp_{x_k}(v)) = v^T \nabla\phi_i \nabla\phi_j^T v + (\nabla\phi_i^T v + \nabla\phi_j^T v)\, O(\|v\|^2) + O(\|v\|^4),   (A.15)

where $\nabla\phi_s \equiv \nabla\phi_s(x_k)$, $s = i, j$. Now we substitute Eqs. (A.15, A.4, A.5) in Eq. (A.13), replacing the variable $y$ with $\exp_{x_k}(v)$ with Jacobian $J(v)$ as before, to get

\widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\,(1 + O(\|v\|^2))\, \Psi_{kij}(\exp_{x_k}(v))\, J(v)\, dv = L_1 + L_2,   (A.16)

where $L_1$ and $L_2$ are the terms obtained by expanding $1 + O(\|v\|^2)$ in the integrand. We will show that $L_2 = 0$ and $\widetilde{A}_{kij} = L_1 = \nabla\phi_i^T \nabla\phi_j$.
L_2 = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, O(\|v\|^2) \left( \mathrm{tr}(\nabla\phi_i \nabla\phi_j^T v v^T) + (\nabla\phi_i^T v + \nabla\phi_j^T v)\, O(\|v\|^2) + O(\|v\|^4) \right) J(v)\, dv
    = \lim_{t_k \to 0} \frac{1}{2t_k} \left( O(t_k^2) + 0 + 0 + O(t_k^3) \right) = 0.   (A.17)

Therefore,

\widetilde{A}_{kij} = L_1 = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, \Psi_{kij}(\exp_{x_k}(v))\, J(v)\, dv   (A.18)
  = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}} \left( v^T \nabla\phi_i \nabla\phi_j^T v + (\nabla\phi_i^T v + \nabla\phi_j^T v)\, O(\|v\|^2) + O(\|v\|^4) \right) J(v)\, dv
  = \lim_{t_k \to 0} \left( \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, v^T \nabla\phi_i \nabla\phi_j^T v\, J(v)\, dv + 0 + 0 + \frac{O(t_k^2)}{2t_k} \right)
  = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}} \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, v^T \nabla\phi_i \nabla\phi_j^T v\, J(v)\, dv.   (A.19)

Substituting $t_k = 0$ leads to the indeterminate form $\frac{0}{0}$. Therefore, we apply L'Hospital's rule and then the Leibniz integral rule to get

\widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{1}{2} \int_{B_{\epsilon_k}} \left( \frac{\|v\|^2}{4t_k^2} - \frac{d}{2t_k} \right) \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, v^T \nabla\phi_i \nabla\phi_j^T v\, J(v)\, dv
  = \mathrm{tr}\left( \nabla\phi_i \nabla\phi_j^T \lim_{t_k \to 0} \frac{1}{2} \int_{B_{\epsilon_k}} \left( \frac{\|v\|^2}{4t_k^2} - \frac{d}{2t_k} \right) \frac{e^{-\|v\|^2/4t_k}}{(4\pi t_k)^{d/2}}\, v v^T J(v)\, dv \right)
  = \mathrm{tr}\left( \nabla\phi_i \nabla\phi_j^T \left( \lim_{t_k \to 0} \frac{1}{2} \left( \frac{(12 + 4(d-1))\, t_k^2}{4 t_k^2} - \frac{2 t_k d}{2 t_k} \right) I + O(t_k)\, I \right) \right)
  = \nabla\phi_i^T \nabla\phi_j.   (A.20)

Finally, note that Eq. (A.18) is the same as the following equation with $y$ replaced by $\exp_{x_k}(v)$,

\widetilde{A}_{kij} = \lim_{t_k \to 0} \frac{1}{2t_k} \int_{B_{\epsilon_k}(x_k)} G(t_k, x_k, y)\, \Psi_{kij}(y)\, \omega_g(y).   (A.21)

We used the above equation to estimate $\widetilde{A}_{kij}$ in Section 3.1.
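As a numerical sanity check (not part of the original proof), the following sketch estimates $\widetilde{A}_{kij}$ via Eq. (A.21) on the unit circle, where $\phi_i = \cos\theta$, $\phi_j = \sin\theta$ and $\nabla\phi_i^T \nabla\phi_j = -\sin\theta\cos\theta$ are known in closed form; the $1/2t_k$ scaling follows the reconstruction above:

```python
import numpy as np

# Verify the limit in Eq. (A.20) on the unit circle (d = 1), where the
# eigenfunctions and their gradients are known analytically.
n, t = 20000, 1e-5
theta = np.linspace(0, 2*np.pi, n, endpoint=False)
dtheta = 2*np.pi / n
phi_i, phi_j = np.cos(theta), np.sin(theta)        # Laplacian eigenfunctions
grad_prod_true = -np.sin(theta) * np.cos(theta)    # phi_i' * phi_j'

k = 1234                                           # index of base point x_k
d_g = np.abs(theta - theta[k])
d_g = np.minimum(d_g, 2*np.pi - d_g)               # geodesic distance on S^1
G = np.exp(-d_g**2 / (4*t)) / np.sqrt(4*np.pi*t)   # Gaussian kernel, Eq. (A.4)
Psi = (phi_i - phi_i[k]) * (phi_j - phi_j[k])      # Psi_kij, Eq. (12)
A_tilde = (G * Psi).sum() * dtheta / (2*t)         # Eq. (A.21)

print(A_tilde, grad_prod_true[k])                  # the two should nearly agree
```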
Appendix B. Rationale behind the choice of t_k in Eq. (20)

Since $|M| \leq 1$, we note that

\epsilon_k \leq \frac{\Gamma(d/2 + 1)^{1/d}}{\sqrt{\pi}},   (B.1)

where the maximum is achieved when $M$ is a $d$-dimensional ball of unit volume. Then we take the limiting value of $t_k$ as in Eq. (20), where chi2inv is the inverse cdf of the chi-squared distribution with $d$ degrees of freedom evaluated at $p$. Since the covariance matrix of $G(t_k, x, y)$ is $2t_k I$ (see Eq. (18)), this value of $t_k$ ensures that a fraction $p$ of the probability mass lies in $B_{\epsilon_k}(x_k)$. We take $p$ to be $0.99$ in our experiments.
Also, using Eq. (B.1) and Eq. (20), we have

t_k \leq \frac{\Gamma(d/2 + 1)^{2/d}}{2\pi\, \mathrm{chi2inv}(p, d)} \ll 1, \quad \text{when } p = 0.99.   (B.2)

Using the above inequality with $p = 0.99$, for $d = 2$, $10$, $100$ and $1000$, the upper bound on $t_k$ is indeed a small value close to $0$.
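A small sketch evaluating this bound, with SciPy's chi2.ppf in the role of MATLAB's chi2inv; the closed form of Eq. (B.2) is the reconstruction above:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def t_k_upper_bound(d, p=0.99):
    """Upper bound on t_k from Eq. (B.2)."""
    log_gamma_term = (2.0 / d) * gammaln(d / 2.0 + 1.0)  # Gamma(d/2+1)^(2/d)
    return np.exp(log_gamma_term) / (2.0 * np.pi * chi2.ppf(p, d))

for d in (2, 10, 100, 1000):
    print(d, t_k_upper_bound(d))   # all values are small, close to 0
```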
Appendix C. Construction of S in Algo. 4

In Algo. 4, the set $S$ contains the indices of the points for which the cost of moving and the destination clusters are recomputed at the end of an iteration. These indices correspond to the points which belong to the cluster $d_k$, have $d_k$ as their destination cluster, or have cluster $s$ as a neighboring cluster. We now provide the reason for recomputing the cost of moving for the points with these indices.

Case 1. After moving $x_k$ to $d_k$, if $|C_{d_k}|$ becomes greater than $\eta$, then the cluster $d_k$ was not small prior to the move. Therefore, $x_k$ will not be moved anymore and we must update $cost_k$ to infinity. If $|C_{d_k}|$ becomes $\eta$, then the cluster $d_k$ was small prior to the move but is not small anymore. Therefore, we need to update the cost of all the points in cluster $d_k$ to infinity, as none of these points will be moved anymore. If $|C_{d_k}|$ is less than $\eta$, then the cluster $d_k$ is still small, so $x_k$ may be moved again, and its destination cluster and cost of moving must be updated. Therefore, the cost of moving is updated for all the points in cluster $d_k$.

Case 2. Suppose $x_j$ is a point in a small cluster whose destination cluster is $d_k$, that is, $d_j = d_k$. Prior to moving $x_k$ to $d_k$, the cost of moving $x_j$ to $d_k$ was $\mathrm{Distortion}(\widetilde{\Phi}_{d_k}, U_j \cup \widetilde{U}_{d_k})$ (see Eq. (44)). After moving $x_k$ to cluster $d_k$, the neighborhood of the cluster $d_k$ expands to $\widetilde{U}_{d_k} \cup U_k$. Then the new cost of moving $x_j$ to $d_k$ is $\mathrm{Distortion}(\widetilde{\Phi}_{d_k}, U_j \cup \widetilde{U}_{d_k} \cup U_k)$. Therefore, the cost of moving all the points whose destination cluster is $d_k$ must be updated.

Case 3. Suppose that after moving $x_k$ out of cluster $s$ it becomes empty. Since $|C_s| = 0$, the cost of moving any other point to cluster $s$ is infinity, as the third condition in Eq. (44) no longer holds. Therefore, we must recompute the cost of moving all those points whose destination cluster is cluster $s$. These points form a subset of the points which have cluster $s$ as a neighboring cluster.

Case 4. Suppose that after moving $x_k$ out of cluster $s$ it is still non-empty. Also, suppose that $x_j$ is a point in a small cluster with cluster $s$ as a neighboring cluster. Prior to moving $x_k$ out of cluster $s$, the cost of moving $x_j$ to cluster $s$ was $\mathrm{Distortion}(\widetilde{\Phi}_s, U_j \cup \widetilde{U}_s)$ (see Eq. (44)). After moving $x_k$ out of cluster $s$, $\widetilde{U}_s$ will possibly contract (see Eq. (42)). Therefore, the cost of moving all the points with cluster $s$ as a neighboring cluster must be updated.
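A sketch of the resulting recompute set $S$, assuming clusters and destinations are stored in plain dictionaries (the names are illustrative, not from Algo. 4):

```python
def recompute_set(s, d_k, clusters, dest, neighbor_clusters):
    """Indices whose move costs are recomputed after x_k moves
    from cluster s to cluster d_k (Cases 1-4 above).

    clusters          -- dict: cluster index -> set of point indices
    dest              -- dict: point index -> destination cluster
    neighbor_clusters -- dict: point index -> set of neighboring clusters
    """
    S = set(clusters[d_k])                                      # Case 1
    S |= {j for j, dj in dest.items() if dj == d_k}             # Case 2
    S |= {j for j, nc in neighbor_clusters.items() if s in nc}  # Cases 3-4
    return S
```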
Appendix D. Computation of (s_m, p_m)_{m=1}^{M} in Algo. 5

Algo. 5 aligns the intermediate views in a sequence. The computation of this sequence is motivated by the necessary and sufficient conditions for a unique solution to the standard orthogonal Procrustes problem. We start with a brief review of a variant of the orthogonal Procrustes problem. Given two matrices $A$ and $B$ of the same size with $d$ columns, one asks for an orthogonal matrix $T$ of size $d \times d$ and a $d$-dimensional column vector $v$ which most closely map $A$ to $B$, that is,

T, v = \operatorname*{argmin}_{\Omega, \omega} \| A\Omega + \mathbf{1}_n \omega^T - B \|_F^2 \quad \text{such that } \Omega^T \Omega = I.   (D.1)

Here $\mathbf{1}_n$ is the $n$-dimensional column vector of ones. Equating the derivative of the objective with respect to $\omega$ to zero, we obtain the following condition on $\omega$,

\omega = -\frac{1}{n}\,(A\Omega - B)^T \mathbf{1}_n.   (D.2)

Substituting this back into Eq. (D.1), we reduce the above problem to the standard orthogonal Procrustes problem,

T = \operatorname*{argmin}_{\Omega} \| \overline{A}\,\Omega - \overline{B} \|_F^2,   (D.3)

where

\overline{X} = \left( I - \frac{\mathbf{1}_n \mathbf{1}_n^T}{n} \right) X   (D.4)

for any matrix $X$. This is equivalent to subtracting the mean of the rows of $X$ from each row of $X$. As proved in [31], the above problem, and therefore the variant, has a unique solution if and only if the square matrix $\overline{A}^T \overline{B}$ has full rank $d$. Denote by $\sigma_d(X)$ the smallest (that is, the $d$th largest) singular value of $X$. Then $\overline{A}^T \overline{B}$ has full rank if $\sigma_d(\overline{A}^T \overline{B})$ is non-zero; otherwise there exist multiple $T$ which minimize Eq. (D.1).

Now we come back to the computation of the sequences $(s_m, p_m)_{m=1}^{M}$. The first view in the sequence corresponds to the largest cluster,

s_1 = \operatorname*{argmax}_m |C_m|.   (D.5)

For convenience, we denote $s_m$ by $s$, $p_m$ by $p$, and $\widetilde{\Phi}^g_m(\widetilde{U}_{mm'})$ by $V_{mm'}$ whenever needed. Then the sequences $(s_m)_{m=2}^{M}$ and $(p_m)_{m=2}^{M}$ are computed so that $(T_s, v_s)$ can be estimated without any ambiguity, to align the view $\widetilde{\Phi}^g_s(\widetilde{U}_{sp})$ with the view $\widetilde{\Phi}^g_p(\widetilde{U}_{sp})$. The alignment is done by minimizing the objective in Eq. (D.1) with $A$ and $B$ replaced by $V_{sp}$ and $V_{ps}$. An ambiguity would arise if multiple values of $(T_s, v_s)$ could align the two views. This happens when $\sigma_d(\overline{V}_{sp}^T \overline{V}_{ps})$ is zero.

We therefore quantify the ambiguity in aligning the $m$th and the $m'$th intermediate views on their overlap, that is, $V_{mm'}$ and $V_{m'm}$, by

W_{mm'} = \sigma_d(\overline{V}_{mm'}^T \overline{V}_{m'm}).   (D.6)

Note that $W_{mm'} = W_{m'm}$. Finally, we compute the sequences $(s_m)_{m=2}^{M}$ and $(p_m)_{m=2}^{M}$ so that $W_{s_m p_m}$ is high for all $m$. To achieve this, we obtain a maximum spanning tree $T$ rooted at $s_1$ of the graph with $M$ nodes and $W$ as the adjacency matrix (if the graph is disconnected, then either $k$ or $\eta_{min}$ should be increased). Denote the parent of node $k$ in $T$ by $\rho_k$. Then $(s_m)_{m=2}^{M}$ is the sequence in which a breadth-first search visits the nodes of $T$ starting from $s_1$, and $p_m$ is the parent of $s_m$ in the tree. Thus,

(s_m)_{m=2}^{M} = \text{Breadth-First-Search}(T, s_1) \quad \text{and} \quad p_m = \rho_{s_m}.   (D.7)

Appendix E. Computation of $\widetilde{U}^g_{mm'}$ in Eq. (53)

For each transformed intermediate view $\widetilde{\Phi}^g_m(\widetilde{U}_m)$ in the embedding space, we first construct a secondary view $\widetilde{U}^g_m$ in the same manner as an intermediate view $\widetilde{U}_m$ is constructed in the ambient space. Then, just like Eq. (48), we define the overlap of the $m$th and the $m'$th intermediate views in the embedding space as the overlap of their secondary counterparts, given by

\widetilde{U}^g_{mm'} = \widetilde{U}^g_m \cap \widetilde{U}^g_{m'}.   (E.1)

To construct $\widetilde{U}^g_m$, we first define the discrete balls in the embedding space in the same way as in the ambient space (see Eq. (21)), as

U^g_k = \{ y_{k'} \mid d_e(y_k, y_{k'}) < \epsilon^g_k \},   (E.2)

where $\epsilon^g_k$ is the distance to the $\nu k_{nn}$th nearest neighbor of $y_k$. Here the hyperparameter $\nu \geq 1$ accounts for a possibly increased separation between the points in the embedding space compared to the ambient space, due to the distortion of the parameterizations $\widetilde{\Phi}^g_m$. Then we define the clusters in the embedding space using the same cluster labels $c_k$ as in the ambient space; that is, the embedding $y_k$ of $x_k$ also belongs to cluster $c_k$. Now, just as $\widetilde{U}_m$ was constructed in Eq. (42), we construct $\widetilde{U}^g_m$ as

\widetilde{U}^g_m = \cup_{c_k = m}\, U^g_k.   (E.3)
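A sketch of Eqs. (E.1)-(E.3), assuming the embedding $Y$ and cluster labels $c_k$ are given; taking the $\nu k_{nn}$ nearest neighbors approximates the open ball of Eq. (E.2), and the helper names are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embedding_views(Y, c, M, k_nn=49, nu=3):
    """Secondary views U~^g_m in the embedding space, Eqs. (E.2)-(E.3)."""
    n = Y.shape[0]
    nbrs = NearestNeighbors(n_neighbors=nu * k_nn).fit(Y)
    _, idx = nbrs.kneighbors(Y)             # (nu*k_nn)-NN of each y_k
    U_g = [set(idx[k]) for k in range(n)]   # discrete ball around y_k, Eq. (E.2)
    U_g_m = [set() for _ in range(M)]       # union over each cluster, Eq. (E.3)
    for k in range(n):
        U_g_m[c[k]] |= U_g[k]
    return U_g_m

def overlap(U_g_m, m, mp):
    """Eq. (E.1): views m and m' are neighbors in the embedding space
    iff this intersection is non-empty."""
    return U_g_m[m] & U_g_m[mp]
```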
Appendix F. Supplementary Tables and Figures

Table F.1: The values of the hyperparameters for all the algorithms (LDLE: $\eta_{min}$, $k_{nn}$, $k_{tune}$; UMAP; t-SNE; Laplacian eigenmaps) and all the examples (rectangle, barbell, square with two holes, sphere with a hole, swiss roll with a hole, noisy swiss roll, sphere, curved torus, flat torus, Möbius strip, Klein bottle, and the 42-dim signal strength data). For Laplacian eigenmaps, for all the examples except the square with two holes, all the searched values of the hyperparameters result in similar plots.

Figure F.22: (first column) Input square grid where the points are colored by the distortion of the obtained parameterizations on the neighborhood surrounding them; a local view $U_k$ is also shown in black. (second column) The mapped square $\Psi_k(M)$ is shown in red, and the local view in the embedding space $\Psi_k(U_k)$ is shown in black. (third column) The first of the two eigenvectors chosen for the local parameterization, with $U_k$ colored in black. (fourth column) The second chosen eigenvector, with $U_k$ colored in black. Note that the two chosen eigenvectors are close to being orthogonal at $x_k$.

Figure F.23: LDLE embeddings with arrows, and the derived cut and paste diagrams demonstrating the mathematical correctness of the embeddings.