Multidimensional Scaling, Sammon Mapping, and Isomap: Tutorial and Survey
To appear as a part of an upcoming academic book on dimensionality reduction and manifold learning.
Benyamin Ghojogh (BGHOJOGH@UWATERLOO.CA)
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Ali Ghodsi (ALI.GHODSI@UWATERLOO.CA)
Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science, Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray (KARRAY@UWATERLOO.CA)
Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley (MCROWLEY@UWATERLOO.CA)
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Abstract
Multidimensional Scaling (MDS) is one of the first fundamental manifold learning methods. It can be categorized into several methods, i.e., classical MDS, kernel classical MDS, metric MDS, and non-metric MDS. Sammon mapping and Isomap can be considered as special cases of metric MDS and kernel classical MDS, respectively. In this tutorial and survey paper, we review the theory of MDS, Sammon mapping, and Isomap in detail. We explain all the mentioned categories of MDS. Then, Sammon mapping, Isomap, and kernel Isomap are explained. Out-of-sample embedding for MDS and Isomap using eigenfunctions and kernel mapping is introduced. Then, the Nyström approximation and its use in landmark MDS and landmark Isomap are introduced for big data embedding. We also provide some simulations for illustrating the embedding by these methods.
1. Introduction
Multidimensional Scaling (MDS) (Cox & Cox, 2008), first proposed in (Torgerson, 1952), is one of the earliest proposed manifold learning methods. It can be used for manifold learning, dimensionality reduction, and feature extraction (Ghojogh et al., 2019c). The idea of MDS is to preserve the similarity (Torgerson, 1965) or dissimilarity/distances (Beals et al., 1968) of points in the low-dimensional embedding space. Hence, it fits the data locally to capture the global structure of data (Saul & Roweis, 2003). MDS can be categorized into classical MDS, metric MDS, and non-metric MDS.

In later approaches, Sammon mapping (Sammon, 1969) was proposed, which is a special case of the distance-based metric MDS. One can consider Sammon mapping as the first proposed nonlinear manifold learning method (Ghojogh et al., 2019b). The disadvantage of Sammon mapping is its iterative solution of optimization, which makes this method a little slow.

The classical MDS can be generalized to kernel classical MDS, in which any valid kernel can be used. Isomap (Tenenbaum et al., 2000) is a special case of the kernel classical MDS which uses a kernel constructed from geodesic distances between points. Because of the nonlinearity of the geodesic distance, Isomap is also a nonlinear manifold learning method.

MDS and its special cases, Sammon mapping and Isomap, have had different applications (Young, 2013). For example, MDS has been used for facial expression recognition (Russell & Bullock, 1985; Katsikitis, 1997). Kernel Isomap has also been used for this application (Zhao & Zhang, 2011).

The goal is to embed the high-dimensional input data $\{x_i\}_{i=1}^n$ into the lower dimensional embedded data $\{y_i\}_{i=1}^n$, where $n$ is the number of data points. We denote the dimensionality of the input and embedding spaces by $d$ and $p \le d$, respectively, i.e., $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}^p$. We also denote $\mathbb{R}^{d \times n} \ni X := [x_1, \dots, x_n]$ and $\mathbb{R}^{p \times n} \ni Y := [y_1, \dots, y_n]$.

The remainder of this paper is organized as follows. Section 2 explains MDS and its different categories, i.e., classical MDS, generalized classical MDS (kernel classical MDS), metric MDS, and non-metric MDS. Sammon mapping and Isomap are introduced in Sections 3 and 4, respectively. Section 5 introduces the methods for out-of-sample extension of MDS and Isomap. Landmark MDS and landmark Isomap, for big data embedding, are explained in Section 6. Some simulations for illustrating the results of embedding are provided in Section 7. Finally, Section 8 concludes the paper.
2. Multidimensional Scaling
MDS, first proposed in (Torgerson, 1952), can be divided into several different categories (Cox & Cox, 2008; Borg & Groenen, 2005), i.e., classical MDS, metric MDS, and non-metric MDS. Note that the results of these are different (Jung, 2013). In the following, we explain all three categories.
2.1. Classical MDS

2.1.1. Classical MDS with Euclidean Distance

The classical MDS is also referred to as Principal Coordinates Analysis (PCoA), Torgerson Scaling, or Torgerson-Gower scaling (Gower, 1966). The goal of classical MDS is to preserve the similarity of data points in the embedding space as it was in the input space (Torgerson, 1965). One way to measure similarity is the inner product. Hence, we can minimize the difference of similarities in the input and embedding spaces:

$$\underset{\{y_i\}_{i=1}^n}{\text{minimize}} \quad c := \sum_{i=1}^n \sum_{j=1}^n (x_i^\top x_j - y_i^\top y_j)^2, \quad (1)$$

whose matrix form is:

$$\underset{Y}{\text{minimize}} \quad c = \|X^\top X - Y^\top Y\|_F^2, \quad (2)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, and $X^\top X$ and $Y^\top Y$ are the Gram matrices of the original data $X$ and the embedded data $Y$, respectively.

The objective function in Eq. (2) is simplified as:

$$\|X^\top X - Y^\top Y\|_F^2 = \text{tr}\big[(X^\top X - Y^\top Y)^\top (X^\top X - Y^\top Y)\big] = \text{tr}\big[(X^\top X - Y^\top Y)^2\big],$$

where $\text{tr}(\cdot)$ denotes the trace of a matrix. If we decompose $X^\top X$ and $Y^\top Y$ using eigenvalue decomposition (Ghojogh et al., 2019a), we have:

$$X^\top X = V \Delta V^\top, \quad (3)$$
$$Y^\top Y = Q \Psi Q^\top, \quad (4)$$

where eigenvectors are sorted from leading (largest eigenvalue) to trailing (smallest eigenvalue). Note that, rather than the eigenvalue decomposition of $X^\top X$ and $Y^\top Y$, one can decompose $X$ and $Y$ using Singular Value Decomposition (SVD) and take the right singular vectors of $X$ and $Y$ as $V$ and $Q$, respectively. The matrices $\Delta$ and $\Psi$ are then obtained by squaring the singular values (raising them to the power 2). See (Ghojogh & Crowley, 2019, Proposition 1) for a proof.

The objective function can be further simplified as:

$$\|X^\top X - Y^\top Y\|_F^2 = \text{tr}\big[(V \Delta V^\top - Q \Psi Q^\top)^2\big] \overset{(a)}{=} \text{tr}\big[(V \Delta V^\top - V V^\top Q \Psi Q^\top V V^\top)^2\big] = \text{tr}\Big[\big(V (\Delta - V^\top Q \Psi Q^\top V) V^\top\big)^2\Big] = \text{tr}\big[V (\Delta - V^\top Q \Psi Q^\top V)^2 V^\top\big] \overset{(b)}{=} \text{tr}\big[V^\top V (\Delta - V^\top Q \Psi Q^\top V)^2\big] \overset{(c)}{=} \text{tr}\big[(\Delta - V^\top Q \Psi Q^\top V)^2\big],$$

where $(a)$ and $(c)$ hold because $V^\top V = V V^\top = I$ since $V$ is a non-truncated (square) orthogonal matrix ($I$ denotes the identity matrix), and $(b)$ follows from the cyclic property of the trace.

Let $\mathbb{R}^{n \times n} \ni M := V^\top Q$, so:

$$\|X^\top X - Y^\top Y\|_F^2 = \text{tr}\big[(\Delta - M \Psi M^\top)^2\big].$$

Therefore:

$$\underset{Y}{\text{minimize}} \; \|X^\top X - Y^\top Y\|_F^2 \;\equiv\; \underset{M,\, \Psi}{\text{minimize}} \; \text{tr}\big[(\Delta - M \Psi M^\top)^2\big].$$

The objective function is:

$$c = \text{tr}\big[(\Delta - M \Psi M^\top)^2\big] = \text{tr}\big(\Delta^2 + (M \Psi M^\top)^2 - 2 \Delta M \Psi M^\top\big) = \text{tr}(\Delta^2) + \text{tr}\big((M \Psi M^\top)^2\big) - 2\, \text{tr}(\Delta M \Psi M^\top).$$
Taking the derivative of the cost with respect to the first objective variable, $M$, and setting it to zero gives:

$$\mathbb{R}^{n \times n} \ni \frac{\partial c}{\partial M} = 2 (M \Psi M^\top) M \Psi - 2 \Delta M \Psi \overset{\text{set}}{=} 0 \implies (M \Psi M^\top)(M \Psi) = \Delta (M \Psi) \overset{(a)}{\implies} M \Psi M^\top = \Delta, \quad (5)$$

where $(a)$ is because $M \Psi \neq 0$.

For the derivative with respect to the second objective variable, $\Psi$, we simplify the objective function a little bit:

$$c = \text{tr}(\Delta^2) + \text{tr}\big((M \Psi M^\top)^2\big) - 2\, \text{tr}(\Delta M \Psi M^\top) \overset{(a)}{=} \text{tr}(\Delta^2) + \text{tr}\big((M^\top M \Psi)^2\big) - 2\, \text{tr}(M^\top \Delta M \Psi),$$

where $(a)$ is because of the cyclic property of the trace. Taking the derivative with respect to $\Psi$ gives:

$$\mathbb{R}^{n \times n} \ni \frac{\partial c}{\partial \Psi} = 2 M^\top (M \Psi M^\top) M - 2 M^\top \Delta M \overset{\text{set}}{=} 0 \implies M^\top (M \Psi M^\top) M = M^\top \Delta M \overset{(a)}{\implies} M \Psi M^\top = \Delta, \quad (6)$$

where $(a)$ is because $M \neq 0$. Both Eqs. (5) and (6) state that $M \Psi M^\top = \Delta$, one possible solution of which is:

$$M = I, \quad (7)$$
$$\Psi = \Delta, \quad (8)$$

which means that the minimum value of the non-negative objective function $\text{tr}\big((\Delta - M \Psi M^\top)^2\big)$ is zero.

We had $M = V^\top Q$. Therefore, according to Eq. (7), we have:

$$V^\top Q = I \implies Q = V. \quad (9)$$

According to Eq. (4), we have:

$$Y^\top Y = Q \Psi Q^\top \overset{(a)}{=} Q \Psi^{1/2} \Psi^{1/2} Q^\top \implies Y = \Psi^{1/2} Q^\top \overset{\text{(8), (9)}}{\implies} Y = \Delta^{1/2} V^\top, \quad (10)$$

where $(a)$ can be done because $\Psi$ does not include any negative entry, as the Gram matrix $Y^\top Y$ is positive semi-definite by definition.

In summary, for embedding $X$ using classical MDS, the eigenvalue decomposition of $X^\top X$ is obtained as in Eq. (3). Then, using Eq. (10), $Y \in \mathbb{R}^{n \times n}$ is obtained. Truncating this $Y$ to $Y \in \mathbb{R}^{p \times n}$, i.e., keeping its first (top) $p$ rows, gives us the $p$-dimensional embedding of the $n$ points. Note that the leading $p$ rows are used because the eigenvalues (equivalently, the squared singular values of the SVD that can be used for Eq. (3)) are sorted from largest to smallest.
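To make this procedure concrete, the following is a minimal NumPy sketch of classical MDS with the linear Gram matrix (our illustration, not the paper's code); the function name and the toy data are ours.

```python
import numpy as np

def classical_mds(X, p):
    """Classical MDS sketch: X is (d, n) with columns as data points; returns (p, n) embedding."""
    G = X.T @ X                                    # Gram matrix X^T X, shape (n, n)
    eigvals, V = np.linalg.eigh(G)                 # eigendecomposition G = V Delta V^T
    order = np.argsort(eigvals)[::-1]              # sort from leading to trailing eigenvalue
    eigvals, V = eigvals[order], V[:, order]
    delta_sqrt = np.sqrt(np.clip(eigvals[:p], 0, None))
    return delta_sqrt[:, None] * V[:, :p].T        # Y = Delta^{1/2} V^T, top p rows (Eq. (10))

# Toy usage
X = np.random.RandomState(0).randn(5, 30)          # d = 5, n = 30
X = X - X.mean(axis=1, keepdims=True)              # center the data
Y = classical_mds(X, p=2)
print(Y.shape)                                     # (2, 30)
```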
2.1.2. Generalized Classical MDS (Kernel Classical MDS)

If $d_{ij}^2 = \|x_i - x_j\|_2^2$ is the squared Euclidean distance between $x_i$ and $x_j$, we have:

$$d_{ij}^2 = \|x_i - x_j\|_2^2 = (x_i - x_j)^\top (x_i - x_j) = x_i^\top x_i - x_i^\top x_j - x_j^\top x_i + x_j^\top x_j = x_i^\top x_i - 2 x_i^\top x_j + x_j^\top x_j = G_{ii} - 2 G_{ij} + G_{jj},$$

where $\mathbb{R}^{n \times n} \ni G := X^\top X$ is the Gram matrix. If $\mathbb{R}^n \ni g := [g_1, \dots, g_n]^\top = [G_{11}, \dots, G_{nn}]^\top = \text{diag}(G)$, we have:

$$d_{ij}^2 = g_i - 2 G_{ij} + g_j,$$
$$D = g 1^\top - 2 G + 1 g^\top,$$

where $1$ is the vector of ones and $D$ is the distance matrix with the squared Euclidean distances ($d_{ij}^2$) as its elements. Let $\mathbb{R}^{n \times n} \ni H := I - \frac{1}{n} 1 1^\top$ denote the centering matrix. We double-center the matrix $D$ as follows (Oldford, 2018):

$$H D H = (I - \tfrac{1}{n} 1 1^\top)(g 1^\top - 2 G + 1 g^\top)(I - \tfrac{1}{n} 1 1^\top) = \underbrace{(I - \tfrac{1}{n} 1 1^\top)\, g 1^\top (I - \tfrac{1}{n} 1 1^\top)}_{=\,0} - 2 (I - \tfrac{1}{n} 1 1^\top) G (I - \tfrac{1}{n} 1 1^\top) + \underbrace{(I - \tfrac{1}{n} 1 1^\top)\, 1 g^\top (I - \tfrac{1}{n} 1 1^\top)}_{=\,0} = -2 H G H$$

$$\therefore \quad H G H = H X^\top X H = -\frac{1}{2} H D H. \quad (11)$$

Note that $(I - \frac{1}{n} 1 1^\top) 1 = 0$ and $1^\top (I - \frac{1}{n} 1 1^\top) = 0$ because removing the mean of $1$ and of $1^\top$ results in the zero vectors, respectively.

If the data $X$ are already centered, i.e., the mean has been removed ($X \leftarrow X H$), Eq. (11) becomes:

$$X^\top X = -\frac{1}{2} H D H. \quad (12)$$
Corollary 1. If using Eq. (3), i.e., the Gram matrix, the classical MDS uses the Euclidean distance as its metric. Because of using the Euclidean distance, the classical MDS using the Gram matrix is a linear subspace learning method.

Proof.
Eq. (3) in classical MDS is the eigenvalue decomposition of the Gram matrix $X^\top X$. According to Eq. (12), this Gram matrix can be restated as an expression based on the squared Euclidean distances. Hence, the classical MDS with Eq. (3) uses the Euclidean distance and, consequently, is linear.

In Eq. (11) or (12), we can use a general kernel matrix (Hofmann et al., 2008), rather than the double-centered Gram matrix, to have (Cox & Cox, 2008):

$$\mathbb{R}^{n \times n} \ni K = -\frac{1}{2} H D H. \quad (13)$$

Note that the classical MDS with Eq. (3) uses the linear kernel $X^\top X$ for its kernel. This is another reason why the classical MDS with Eq. (3) is a linear method. It is also noteworthy that Eq. (13) can be used for unifying the spectral dimensionality reduction methods as special cases of kernel principal component analysis with different kernels. See (Ham et al., 2004; Bengio et al., 2004a) and (Strange & Zwiggelaar, 2014, Table 2.1) for more details.

Comparing Eqs. (11), (12), and (13) with Eq. (3) shows that we can use a general kernel matrix, such as the Radial Basis Function (RBF) kernel, in classical MDS to have the generalized classical MDS. In summary, for embedding $X$ using the generalized classical MDS, the eigenvalue decomposition of the kernel matrix $K$ is obtained similarly to Eq. (3):

$$K = V \Delta V^\top. \quad (14)$$

Then, using Eq. (10), $Y \in \mathbb{R}^{n \times n}$ is obtained. It is noteworthy that, in this case, we are replacing $X^\top X$ with the kernel $K = \Phi(X)^\top \Phi(X)$ and then, according to Eqs. (10) and (14), we have:

$$K = Y^\top Y. \quad (15)$$

Truncating the $Y$ obtained from Eq. (10) to $Y \in \mathbb{R}^{p \times n}$, with the first (top) $p$ rows, gives us the $p$-dimensional embedding of the $n$ points. It is noteworthy that, because of the use of a kernel, one can name the generalized classical MDS the kernel classical MDS.
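A minimal sketch of the generalized (kernel) classical MDS of Eqs. (13), (14), and (10) could look as follows; it assumes the input is a matrix of squared distances, but any other valid kernel matrix could be used directly in place of K. The names are ours, not from the paper.

```python
import numpy as np

def kernel_classical_mds(D_squared, p):
    """Generalized classical MDS sketch: D_squared is the (n, n) matrix of squared distances."""
    n = D_squared.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    K = -0.5 * H @ D_squared @ H                   # Eq. (13)
    eigvals, V = np.linalg.eigh(K)                 # K = V Delta V^T (Eq. (14))
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    delta_sqrt = np.sqrt(np.clip(eigvals[:p], 0, None))
    return delta_sqrt[:, None] * V[:, :p].T        # Eq. (10), top p rows

# With squared Euclidean distances this recovers classical MDS (up to rotation and sign).
rng = np.random.RandomState(1)
X = rng.randn(3, 20)                               # d = 3, n = 20
D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
Y = kernel_classical_mds(D2, p=2)
print(Y.shape)                                     # (2, 20)
```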
2.1.3. Equivalence of PCA and Kernel PCA with Classical MDS and Generalized Classical MDS, Respectively
Proposition 1.
Classical MDS with Euclidean distance is equivalent to Principal Component Analysis (PCA). Moreover, the generalized classical MDS is equivalent to kernel PCA.

Proof.
On one hand, Eq. (3) can be obtained by the SVD of $X$. The data projected onto the classical MDS subspace are obtained by Eq. (10), which is $\Delta^{1/2} V^\top$. On the other hand, according to (Ghojogh & Crowley, 2019, Eq. 42), the data projected onto the PCA subspace are $\Delta^{1/2} V^\top$, where $\Delta$ and $V^\top$ are from the SVD of $X$. Comparing these shows that classical MDS is equivalent to PCA.

Moreover, Eq. (14) is the eigenvalue decomposition of the kernel matrix. The data projected onto the generalized classical MDS subspace are obtained by Eq. (10), which is $\Delta^{1/2} V^\top$. According to (Ghojogh & Crowley, 2019, Eq. 62), the data projected onto the kernel PCA subspace are $\Delta^{1/2} V^\top$, where $\Delta$ and $V^\top$ are from the eigenvalue decomposition of the kernel matrix; see (Ghojogh & Crowley, 2019, Eq. 61). Comparing these shows that the generalized classical MDS is equivalent to kernel PCA.

2.2. Metric MDS

Recall that the classical MDS tries to preserve the similarities of points in the embedding space. In later approaches after classical MDS, the cost function was changed to preserve the distances rather than the similarities (Lee & Verleysen, 2007; Bunte et al., 2012).
Metric MDS has this opposite view and tries to preserve the distances of points in the embedding space (Beals et al., 1968). For this, it minimizes the difference of the distances of points in the input and embedding spaces (Ghodsi, 2006). The cost function in metric MDS is usually referred to as the stress function (Mardia, 1978; De Leeuw, 2011). This method is named metric MDS because it uses a distance metric in its optimization. The optimization in metric MDS is:

$$\underset{\{y_i\}_{i=1}^n}{\text{minimize}} \quad c := \left( \frac{\sum_{i=1}^n \sum_{j=1, j<i}^n \big(d_x(x_i, x_j) - d_y(y_i, y_j)\big)^2}{\sum_{i=1}^n \sum_{j=1, j<i}^n d_x^2(x_i, x_j)} \right)^{1/2}, \quad (16)$$

where $d_x(x_i, x_j)$ and $d_y(y_i, y_j)$ denote the distances between points in the input and embedding spaces, respectively.
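As an illustration (ours, not the paper's implementation), the stress of Eq. (16) can be evaluated and minimized with a generic optimizer; the metric MDS literature typically uses specialized iterative algorithms instead, but the sketch below conveys the idea. Here points are stored as rows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def stress(y_flat, x_dist, n, p):
    """Normalized stress (Eq. (16)) between input-space distances and embedding distances."""
    Y = y_flat.reshape(n, p)
    y_dist = pdist(Y)                              # pairwise Euclidean distances, i < j
    return np.sqrt(((x_dist - y_dist) ** 2).sum() / (x_dist ** 2).sum())

rng = np.random.RandomState(0)
X = rng.randn(40, 5)                               # n = 40 points in d = 5
x_dist = pdist(X)
n, p = X.shape[0], 2
y0 = rng.randn(n * p) * 1e-2                       # random initial embedding
res = minimize(stress, y0, args=(x_dist, n, p), method="L-BFGS-B")
Y = res.x.reshape(n, p)
print("final stress:", res.fun)
```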
3. Sammon Mapping
Sammon mapping (Sammon, 1969) is a special case of metric MDS; hence, it is a nonlinear method. It is probably correct to call this method the first proposed nonlinear method for manifold learning (Ghojogh et al., 2019b). This method has different names in the literature, such as Sammon's nonlinear mapping, Sammon mapping, and Nonlinear Mapping (NLM) (Lee & Verleysen, 2007). Sammon originally named it NLM (Sammon, 1969). Its most well-known name is Sammon mapping.

The optimization problem in Sammon mapping is almost a weighted version of Eq. (16), formulated as:

$$\underset{\{y_i\}_{i=1}^n}{\text{minimize}} \quad c := \frac{1}{a} \sum_{i=1}^n \sum_{j=1, j<i}^n \frac{\big(d_x(x_i, x_j) - d_y(y_i, y_j)\big)^2}{d_x(x_i, x_j)},$$

where $a := \sum_{i=1}^n \sum_{j=1, j<i}^n d_x(x_i, x_j)$ is a normalization factor, and $d_x(\cdot, \cdot)$ and $d_y(\cdot, \cdot)$ are the distances in the input and embedding spaces, respectively. This optimization problem is solved iteratively, using the gradient and the second derivative of the cost function given below.
The gradient of the cost function $c$ with respect to $y_{i,k}$, the $k$-th dimension of $y_i$, is (Sammon, 1969; Lee & Verleysen, 2007):

$$\frac{\partial c}{\partial y_{i,k}} = -\frac{2}{a} \sum_{j=1, j \neq i}^n \frac{d_x(x_i, x_j) - d_y(y_i, y_j)}{d_x(x_i, x_j)\, d_y(y_i, y_j)} (y_{i,k} - y_{j,k}). \quad (26)$$

Proof. The proof is according to (Lee & Verleysen, 2007). According to the chain rule, we have:

$$\frac{\partial c}{\partial y_{i,k}} = \sum_{j=1, j \neq i}^n \frac{\partial c}{\partial d_y(y_i, y_j)} \times \frac{\partial d_y(y_i, y_j)}{\partial y_{i,k}}.$$

The first derivative is:

$$\frac{\partial c}{\partial d_y(y_i, y_j)} = -\frac{2}{a} \, \frac{d_x(x_i, x_j) - d_y(y_i, y_j)}{d_x(x_i, x_j)},$$

and, with $d_y(y_i, y_j) = \|y_i - y_j\|_2$, the second derivative is:

$$\frac{\partial d_y(y_i, y_j)}{\partial y_{i,k}} = \frac{y_{i,k} - y_{j,k}}{d_y(y_i, y_j)}.$$

Multiplying these two derivatives and summing over $j$ gives Eq. (26), and the proof is complete.
The second derivative of the cost function $c$ with respect to $y_{i,k}$ is (Sammon, 1969; Lee & Verleysen, 2007):

$$\frac{\partial^2 c}{\partial y_{i,k}^2} = -\frac{2}{a} \sum_{j=1, j \neq i}^n \frac{1}{d_x(x_i, x_j)\, d_y(y_i, y_j)} \Bigg[ \big(d_x(x_i, x_j) - d_y(y_i, y_j)\big) - \frac{(y_{i,k} - y_{j,k})^2}{d_y(y_i, y_j)} \bigg(1 + \frac{d_x(x_i, x_j) - d_y(y_i, y_j)}{d_y(y_i, y_j)}\bigg) \Bigg].$$

Proof. We have:

$$\frac{\partial^2 c}{\partial y_{i,k}^2} = \frac{\partial}{\partial y_{i,k}} \Big( \frac{\partial c}{\partial y_{i,k}} \Big),$$

where $\partial c / \partial y_{i,k}$ is Eq. (26). Differentiating Eq. (26) once more with respect to $y_{i,k}$, using the product rule and $\partial d_y(y_i, y_j) / \partial y_{i,k} = (y_{i,k} - y_{j,k}) / d_y(y_i, y_j)$, gives the stated expression, and the proof is complete.

Having the gradient and the second derivative, Sammon mapping updates the embedding iteratively by a diagonal quasi-Newton step (Sammon, 1969):

$$y_{i,k}^{(\nu+1)} := y_{i,k}^{(\nu)} - \eta \, \frac{\partial c / \partial y_{i,k}}{\big| \partial^2 c / \partial y_{i,k}^2 \big|},$$

where $\nu$ indexes the iteration and $\eta$ (Sammon's so-called magic factor, typically around 0.3) is the learning rate.
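A minimal sketch of this iterative update is given below (our illustration under the formulas above, not Sammon's original code); the step size alpha and the initialization are illustrative choices, and points are stored as rows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon(X, p=2, n_iter=300, alpha=0.3, eps=1e-12, seed=0):
    """Sammon mapping via the diagonal quasi-Newton update. X: (n, d); returns (n, p)."""
    n = X.shape[0]
    Dx = squareform(pdist(X))                      # input-space distances d_x(x_i, x_j)
    a = Dx[np.triu_indices(n, 1)].sum()            # normalization factor a
    np.fill_diagonal(Dx, 1.0)                      # avoid division by zero on the diagonal
    rng = np.random.RandomState(seed)
    Y = rng.randn(n, p) * 1e-2                     # random initial embedding
    for _ in range(n_iter):
        Dy = squareform(pdist(Y))
        np.fill_diagonal(Dy, 1.0)
        delta = Dx - Dy                            # d_x - d_y (zero on the diagonal)
        for i in range(n):
            diff = Y[i] - Y                        # rows are y_i - y_j, shape (n, p)
            w = 1.0 / (Dx[i] * Dy[i] + eps)
            # gradient dc/dy_{i,k} (Eq. (26))
            grad = -(2.0 / a) * ((w * delta[i])[:, None] * diff).sum(axis=0)
            # diagonal second derivative d^2 c / dy_{i,k}^2
            hess = -(2.0 / a) * (w[:, None] * (
                delta[i][:, None]
                - (diff ** 2 / Dy[i][:, None]) * (1.0 + delta[i][:, None] / Dy[i][:, None])
            )).sum(axis=0)
            Y[i] = Y[i] - alpha * grad / (np.abs(hess) + eps)   # quasi-Newton update
    return Y
```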
4. Isomap
Isomap (Tenenbaum et al., 2000) is a special case of the generalized classical MDS, explained in Section 2.1.2. Rather than the Euclidean distance, Isomap uses an approximation of the geodesic distance. As was explained, the classical MDS is linear; hence, it cannot capture the nonlinearity of the manifold. Isomap makes use of the geodesic distance to make the generalized classical MDS nonlinear.
4.1.1. Geodesic Distance

The geodesic distance is the length of the shortest path between two points on the (possibly curvy) manifold. It is ideal to use the geodesic distance; however, calculation of the geodesic distance is very difficult because it requires traversing from one point to another on the manifold. This calculation requires differential geometry and Riemannian manifold calculations (Aubin, 2001). Therefore, Isomap approximates the geodesic distance by piece-wise Euclidean distances. It finds the $k$-Nearest Neighbors ($k$NN) graph of the dataset. Then, the shortest path between two points, through their neighbors, is found using a shortest-path algorithm such as the Dijkstra algorithm or the Floyd-Warshall algorithm (Cormen et al., 2009). A Python function for this has been "graph_shortest_path" from the package "sklearn.utils.graph_shortest_path" (in recent scikit-learn versions this utility is no longer available, and "scipy.sparse.csgraph.shortest_path" can be used instead). Note that the approximated geodesic distance is also referred to as the curvilinear distance (Lee et al., 2002). The approximated geodesic distance can be formulated as (Bengio et al., 2004b):

$$D^{(g)}_{ij} := \min_{r} \sum_{i=2}^{l} \|r_i - r_{i-1}\|_2, \quad (30)$$
where $l \ge 2$ is the length of the sequence of points $r_i \in \{x_i\}_{i=1}^n$ along the path and $D^{(g)}_{ij}$ denotes the $(i, j)$-th element of the geodesic distance matrix $D^{(g)} \in \mathbb{R}^{n \times n}$.

Figure 1. An example of the Euclidean distance, the geodesic distance, and the approximated geodesic distance using piece-wise Euclidean distances.

An example of the Euclidean distance, the geodesic distance, and the approximated geodesic distance using piece-wise Euclidean distances can be seen in Fig. 1. A real-world example is the distance between Toronto and Athens. The Euclidean distance corresponds to digging through the Earth from Toronto to reach Athens directly. The geodesic distance corresponds to moving from Toronto to Athens on the curvy Earth by the shortest path between the two cities. The approximated geodesic distance corresponds to digging through the Earth from Toronto to London in the UK, then from London to Frankfurt in Germany, then from Frankfurt to Rome in Italy, and then from Rome to Athens. Calculating the lengths of the paths in the approximated geodesic distance is much easier than calculating the geodesic distance.

4.1.2. Isomap Formulation
As mentioned before, Isomap is a special case of the generalized classical MDS, with the geodesic distance used. Hence, Isomap uses Eq. (13) as:

$$\mathbb{R}^{n \times n} \ni K = -\frac{1}{2} H D^{(g)} H, \quad (31)$$

where, following Eq. (13), the elements of $D^{(g)}$ are the squared (approximated) geodesic distances. It then uses Eqs. (14) and (10) to embed the data. As Isomap uses the nonlinear geodesic distance in its kernel calculation, it is a nonlinear method.

4.2. Kernel Isomap

The kernel of Eq. (31) is not necessarily positive semi-definite, because the approximated geodesic distances are not necessarily Euclidean. Kernel Isomap (Choi & Choi, 2004) resolves this by a constant-shifting method. Consider $K(D^2)$ to be Eq. (13), i.e., the double centering of the element-wise squared distance matrix, and also define:

$$\mathbb{R}^{n \times n} \ni K(D) = -\frac{1}{2} H D H, \quad (32)$$

where $D$ is the geodesic distance matrix, defined by Eq. (30). Kernel Isomap uses the shifted kernel:

$$\mathbb{R}^{n \times n} \ni K' := K(D^2) + 2 c\, K(D) + \frac{1}{2} c^2 H. \quad (33)$$

According to (Cailliez, 1983), $K'$ is guaranteed to be positive semi-definite for $c \ge c^*$, where $c^*$ is the largest eigenvalue of the following matrix:

$$\begin{bmatrix} 0 & 2 K(D^2) \\ -I & -4 K(D) \end{bmatrix} \in \mathbb{R}^{2n \times 2n}. \quad (34)$$

Kernel Isomap (Choi & Choi, 2004) chooses a value $c \ge c^*$ and uses $K'$ in Eq. (14) and then uses Eq. (10) for embedding the data.
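The whole Isomap pipeline can be sketched as follows (an illustration, not the authors' implementation): the kNN graph and graph shortest paths approximate the geodesic distances of Eq. (30), which are then squared and double-centered as in Eq. (31) before the eigendecomposition of Eqs. (14) and (10). The scikit-learn and SciPy calls used here are standard; the rest of the naming is ours.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, p=2, k=10):
    """Isomap sketch: X is (n, d) with points as rows; returns the (n, p) embedding."""
    # 1) kNN graph weighted by Euclidean distances
    W = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # 2) approximate geodesic distances as graph shortest paths (Eq. (30))
    D_geo = shortest_path(W, method="D", directed=False)
    # 3) kernel from double-centered squared geodesic distances (Eq. (31))
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (D_geo ** 2) @ H
    # 4) eigendecomposition and embedding (Eqs. (14) and (10))
    eigvals, V = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    return V[:, :p] * np.sqrt(np.clip(eigvals[:p], 0, None))

# Toy usage on a noisy circle embedded in 3D
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(t), np.sin(t), 0.05 * np.random.RandomState(0).randn(200)]
Y = isomap(X, p=2, k=8)
print(Y.shape)                                     # (200, 2)
```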
5. Out-of-sample Extensions for MDS and Isomap
So far, we have embedded the training dataset $\{x_i \in \mathbb{R}^d\}_{i=1}^n$, or $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, to obtain the embedding $\{y_i \in \mathbb{R}^p\}_{i=1}^n$, or $Y = [y_1, \dots, y_n] \in \mathbb{R}^{p \times n}$. Assume we now have some out-of-sample (test) data, denoted by $\{x^{(t)}_i \in \mathbb{R}^d\}_{i=1}^{n_t}$ or $X_t = [x^{(t)}_1, \dots, x^{(t)}_{n_t}] \in \mathbb{R}^{d \times n_t}$. We want to find their embedding $\{y^{(t)}_i \in \mathbb{R}^p\}_{i=1}^{n_t}$, or $Y_t = [y^{(t)}_1, \dots, y^{(t)}_{n_t}] \in \mathbb{R}^{p \times n_t}$, after the training phase.

5.1. Out-of-sample Embedding Using Eigenfunctions

5.1.1. Eigenfunctions
Consider a Hilbert space $\mathcal{H}_p$ of functions with the inner product $\langle f, g \rangle = \int f(x) g(x) p(x)\, dx$, where $p(x)$ is a density function. In this space, we can consider the kernel function $K_p$:

$$(K_p f)(x) = \int K(x, y) f(y) p(y)\, dy, \quad (35)$$

where the density function can be approximated empirically. The eigenfunction decomposition is defined to be (Bengio et al., 2004a;b):

$$(K_p f_k)(x) = \delta'_k f_k(x), \quad (36)$$

where $f_k(x)$ is the $k$-th eigenfunction and $\delta'_k$ is the corresponding eigenvalue. If we have the eigenvalue decomposition (Ghojogh et al., 2019a) of the kernel matrix $K$, we have $K v_k = \delta_k v_k$ (see Eq. (14)), where $v_k$ is the $k$-th eigenvector and $\delta_k$ is the corresponding eigenvalue. According to (Bengio et al., 2004b, Proposition 1), we have $\delta'_k = (1/n)\, \delta_k$.

5.1.2. Embedding Using Eigenfunctions
Proposition 4. If $v_{ki}$ is the $i$-th element of the $n$-dimensional eigenvector $v_k$ and $k(x, x_i)$ is the kernel between the vectors $x$ and $x_i$, the eigenfunctions for the point $x$ and for the $i$-th training point $x_i$ are:

$$f_k(x) = \frac{\sqrt{n}}{\delta_k} \sum_{i=1}^n v_{ki}\, \breve{k}_t(x_i, x), \quad (37)$$
$$f_k(x_i) = \sqrt{n}\, v_{ki}, \quad (38)$$

respectively, where $\breve{k}_t(x_i, x)$ is the centered kernel between the training set and the out-of-sample point $x$.

Let the MDS or Isomap embedding of the point $x$ be $\mathbb{R}^p \ni y(x) = [y_1(x), \dots, y_p(x)]^\top$. The $k$-th dimension of this embedding is:

$$y_k(x) = \sqrt{\delta_k}\, \frac{f_k(x)}{\sqrt{n}} = \frac{1}{\sqrt{\delta_k}} \sum_{i=1}^n v_{ki}\, \breve{k}_t(x_i, x). \quad (39)$$

Proof.
This proposition is taken from (Bengio et al., 2004b, Proposition 1). For its proof, refer to (Bengio et al., 2004a, Proposition 1), (Bengio et al., 2006, Proposition 1), and (Bengio et al., 2003b, Proposition 1 and Theorem 1). More complete proofs can be found in (Bengio et al., 2003a).

If we have a set of $n_t$ out-of-sample data points, $\breve{k}_t(x_i, x)$ is an element of the centered out-of-sample kernel (see (Ghojogh & Crowley, 2019, Appendix C)):

$$\mathbb{R}^{n \times n_t} \ni \breve{K}_t = K_t - \frac{1}{n} 1_{n \times n} K_t - \frac{1}{n} K\, 1_{n \times n_t} + \frac{1}{n^2} 1_{n \times n} K\, 1_{n \times n_t}, \quad (40)$$

where $1 := [1, 1, \dots, 1]^\top$, $1_{a \times b}$ denotes the $a \times b$ matrix of ones, $K_t \in \mathbb{R}^{n \times n_t}$ is the (not necessarily centered) out-of-sample kernel, and $K \in \mathbb{R}^{n \times n}$ is the training kernel.
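A minimal sketch of Eqs. (39) and (40) is given below (ours, assuming the kernel matrices are supplied); with the linear kernel $K = X^\top X$ and $K_t = X^\top X_t$ it gives the out-of-sample MDS embedding.

```python
import numpy as np

def out_of_sample_embed(K, K_t, V, eigvals, p):
    """Out-of-sample embedding via Eqs. (39) and (40).
    K: (n, n) training kernel; K_t: (n, n_t) kernel between training and test points;
    V: eigenvectors of K (columns); eigvals: eigenvalues of K; p: target dimension."""
    n, n_t = K_t.shape
    ones_n = np.ones((n, n)) / n
    # centered out-of-sample kernel, Eq. (40)
    K_t_c = K_t - ones_n @ K_t - K @ np.ones((n, n_t)) / n + ones_n @ K @ np.ones((n, n_t)) / n
    Y_t = np.empty((p, n_t))
    for k in range(p):
        # Eq. (39): y_k(x) = (1 / sqrt(delta_k)) * sum_i v_{ki} * centered kernel(x_i, x)
        Y_t[k] = (V[:, k] @ K_t_c) / np.sqrt(eigvals[k])
    return Y_t

# Toy usage with the linear kernel of MDS
rng = np.random.RandomState(0)
X, X_t = rng.randn(3, 50), rng.randn(3, 7)
K, K_t = X.T @ X, X.T @ X_t
eigvals, V = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Y_t = out_of_sample_embed(K, K_t, V, eigvals, p=2)
print(Y_t.shape)                                   # (2, 7)
```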
5.1.3. Out-of-sample Embedding

One can use Eq. (39) to embed the $i$-th out-of-sample data point $x^{(t)}_i$. For this purpose, $x^{(t)}_i$ should be used in place of $x$ in Eq. (39).

Note that Eq. (39) requires Eq. (40). In MDS and Isomap, $K$ is obtained by the linear kernel $X^\top X$ and by Eq. (31), respectively. Also, the out-of-sample kernel $K_t$ in MDS is obtained by the linear kernel between the training and out-of-sample data, i.e., $X^\top X_t$. In Isomap, the kernel $K_t$ is obtained by centering the geodesic distance matrix (see Eq. (31)), where the geodesic distance matrix between the training and out-of-sample data is used. In the calculation of this geodesic distance matrix, merely the training data points, and not the test points, should be used as the intermediate points in the paths (Bengio et al., 2004b).

It is shown in (Bengio et al., 2004b, Corollary 1) that using the geodesic distance with only the training data as intermediate points, for the sake of out-of-sample embedding, results in the following formula, which is used in the landmark Isomap method (De Silva & Tenenbaum, 2003):

$$y_k(x) = \frac{1}{2 \sqrt{\delta_k}} \sum_{i=1}^n v_{ki} \big( D^{(g)}_{\text{avg}}(x_i) - D^{(g)}_t(x_i, x) \big), \quad (41)$$

where $D^{(g)}_{\text{avg}}(x_i)$ denotes the average geodesic distance between the training point $x_i$ and the other training points, and $D^{(g)}_t(x_i, x)$ is the geodesic distance between the $i$-th training point $x_i$ and the out-of-sample point $x$, in which the training set is used for the intermediate points. Hence, one can use Eq. (41) for out-of-sample embedding in Isomap.

It is noteworthy that, in addition to the out-of-sample extension using eigenfunctions (Bengio et al., 2004b), there exist some other methods for out-of-sample extension of MDS and Isomap (Bunte et al., 2012; Strange & Zwiggelaar, 2011), which we pass by in this paper.

5.2. Out-of-sample Embedding Using Kernel Mapping

There is a kernel mapping method (Gisbrecht et al., 2012; 2015) to embed the out-of-sample data in Isomap, kernel Isomap, and MDS. We introduce this method here.

We define a map which maps any data point as $x \mapsto y(x)$, where:

$$\mathbb{R}^p \ni y(x) := \sum_{j=1}^n \alpha_j \frac{k(x, x_j)}{\sum_{\ell=1}^n k(x, x_\ell)}, \quad (42)$$

and $\alpha_j \in \mathbb{R}^p$, and $x_j$ and $x_\ell$ denote the $j$-th and $\ell$-th training data points. The $k(x, x_j)$ is a kernel such as the Gaussian kernel:

$$k(x, x_j) = \exp\Big( \frac{-\|x - x_j\|_2^2}{\sigma_j^2} \Big), \quad (43)$$

where $\sigma_j$ is calculated as (Gisbrecht et al., 2015):

$$\sigma_j := \gamma \times \min_i \big( \|x_j - x_i\|_2 \big), \quad (44)$$

where $\gamma$ is a small positive number.

Assume we have already embedded the training data points using MDS (see Section 2), Isomap (see Section 4), or kernel Isomap (see Section 4.2); therefore, the set $\{y_i\}_{i=1}^n$ is available. If we map the training data points, we want to minimize the following least-squares cost function in order to get $y(x_i)$ close to $y_i$ for the $i$-th training point:

$$\underset{\alpha_j\text{'s}}{\text{minimize}} \quad \sum_{i=1}^n \|y_i - y(x_i)\|_2^2, \quad (45)$$

where the summation is over the training data points. We can write this cost function in matrix form as below:

$$\underset{A}{\text{minimize}} \quad \|Y - K'' A\|_F^2, \quad (46)$$

where $\mathbb{R}^{n \times p} \ni Y := [y_1, \dots, y_n]^\top$ and $\mathbb{R}^{n \times p} \ni A := [\alpha_1, \dots, \alpha_n]^\top$. The $K'' \in \mathbb{R}^{n \times n}$ is the kernel matrix whose $(i, j)$-th element is defined to be:

$$K''(i, j) := \frac{k(x_i, x_j)}{\sum_{\ell=1}^n k(x_i, x_\ell)}. \quad (47)$$

The Eq. (46) is always non-negative; thus, its smallest value is zero.
Therefore, the solution to this equation is:

$$Y - K'' A = 0 \implies Y = K'' A \overset{(a)}{\implies} A = K''^{\dagger} Y, \quad (48)$$

where $K''^{\dagger}$ is the pseudo-inverse of $K''$:

$$K''^{\dagger} = (K''^\top K'')^{-1} K''^\top, \quad (49)$$

and $(a)$ is because $K''^{\dagger} K'' = I$.

Finally, the mapping of Eq. (42) for the $n_t$ out-of-sample data points is:

$$Y_t = K''_t A, \quad (50)$$

where the $(i, j)$-th element of the out-of-sample kernel matrix $K''_t \in \mathbb{R}^{n_t \times n}$ is:

$$K''_t(i, j) := \frac{k(x^{(t)}_i, x_j)}{\sum_{\ell=1}^n k(x^{(t)}_i, x_\ell)}, \quad (51)$$

where $x^{(t)}_i$ is the $i$-th out-of-sample data point, and $x_j$ and $x_\ell$ are the $j$-th and $\ell$-th training data points.
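The kernel mapping of Eqs. (42) to (51) can be sketched as follows (our illustration; the value of gamma, the function names, and the placeholder embedding are ours).

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_map_fit(X, Y, gamma=0.3):
    """Fit the kernel mapping of Eqs. (42)-(49). X: (n, d) training points (rows);
    Y: (n, p) their existing embedding (e.g., from MDS or Isomap)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    sigma = gamma * D.min(axis=1)                  # Eq. (44), one sigma per training point
    np.fill_diagonal(D, 0.0)
    K = np.exp(-D ** 2 / sigma[None, :] ** 2)      # Gaussian kernel, Eq. (43)
    K = K / K.sum(axis=1, keepdims=True)           # normalized kernel K'', Eq. (47)
    A = np.linalg.pinv(K) @ Y                      # Eqs. (48)-(49)
    return A, sigma

def kernel_map_transform(X_t, X, A, sigma):
    """Embed out-of-sample points X_t with Eqs. (50)-(51)."""
    K_t = np.exp(-cdist(X_t, X) ** 2 / sigma[None, :] ** 2)
    K_t = K_t / K_t.sum(axis=1, keepdims=True)     # Eq. (51)
    return K_t @ A                                 # Eq. (50)

# Usage: Y_train is an existing (n, p) embedding; X_t are out-of-sample points.
rng = np.random.RandomState(0)
X, X_t = rng.randn(100, 5), rng.randn(10, 5)
Y_train = rng.randn(100, 2)                        # placeholder for a real embedding
A, sigma = kernel_map_fit(X, Y_train)
Y_t = kernel_map_transform(X_t, X, A, sigma)
print(Y_t.shape)                                   # (10, 2)
```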
6. Landmark MDS and Landmark Isomap for Big Data Embedding
The Nyström approximation, introduced below, can be used to make spectral methods such as MDS and Isomap scalable and suitable for big data embedding.
The Nyström approximation is a technique used to approximate a positive semi-definite matrix using merely a subset of its columns (or rows) (Williams & Seeger, 2001). Consider a positive semi-definite matrix $\mathbb{R}^{n \times n} \ni K \succeq 0$ whose parts are:

$$\mathbb{R}^{n \times n} \ni K = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}, \quad (52)$$

where $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{m \times (n - m)}$, and $C \in \mathbb{R}^{(n - m) \times (n - m)}$, in which $m \ll n$.

The Nyström approximation says that if we have the small parts of this matrix, i.e., $A$ and $B$, we can approximate $C$ and thus the whole matrix $K$. The intuition is as follows. Assume $m = 2$ (containing two points, a and b) and $n = 5$ (containing five points, a, b, c, d, and e). If we know the similarity (or distance) of points a and b with each other, resulting in matrix $A$, as well as the similarity (or distance) of points c, d, and e from a and b, resulting in matrix $B$, we cannot have much freedom in the location of c, d, and e, which is captured by the matrix $C$. This is because of the positive semi-definiteness of the matrix $K$. The points selected in the submatrix $A$ are named landmarks. Note that the landmarks can be selected randomly from the columns/rows of the matrix $K$ and, without loss of generality, they can be put together to form a submatrix at the top-left corner of the matrix.

As the matrix $K$ is positive semi-definite, by definition, it can be written as $K = O^\top O$. If we take $O = [R, S]$, where $R$ contains the selected columns (landmarks) of $O$ and $S$ contains the other columns of $O$, we have:

$$K = O^\top O = \begin{bmatrix} R^\top \\ S^\top \end{bmatrix} [R, S] \quad (53)$$
$$= \begin{bmatrix} R^\top R & R^\top S \\ S^\top R & S^\top S \end{bmatrix} \overset{(52)}{=} \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}. \quad (54)$$

Hence, we have $A = R^\top R$. The eigenvalue decomposition (Ghojogh et al., 2019a) of $A$ gives:

$$A = U \Sigma U^\top \quad (55)$$
$$\implies R^\top R = U \Sigma U^\top \implies R = \Sigma^{1/2} U^\top. \quad (56)$$

Moreover, we have $B = R^\top S$, so:

$$B = (\Sigma^{1/2} U^\top)^\top S = U \Sigma^{1/2} S \overset{(a)}{\implies} U^\top B = \Sigma^{1/2} S \implies S = \Sigma^{-1/2} U^\top B, \quad (57)$$

where $(a)$ is because $U$ is orthogonal (in the eigenvalue decomposition). Finally, we have:

$$C = S^\top S = B^\top U \Sigma^{-1/2} \Sigma^{-1/2} U^\top B = B^\top U \Sigma^{-1} U^\top B \overset{(55)}{=} B^\top A^{-1} B. \quad (58)$$

Therefore, Eq. (52) becomes:

$$K \approx \begin{bmatrix} A & B \\ B^\top & B^\top A^{-1} B \end{bmatrix}. \quad (59)$$
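A minimal numerical sketch of the Nyström approximation of Eqs. (52) to (59) (ours, not from the paper) is given below; the check illustrates that the approximation is exact when the rank of $K$ is at most $m$.

```python
import numpy as np

def nystrom_approx(A, B):
    """Approximate the full PSD matrix from its landmark blocks (Eq. (59)).
    A: (m, m) landmark-landmark block; B: (m, n-m) landmark-rest block."""
    C_approx = B.T @ np.linalg.pinv(A) @ B         # Eq. (58)
    return np.block([[A, B], [B.T, C_approx]])

# Sanity check on a random PSD matrix of rank m (approximation should be exact)
rng = np.random.RandomState(0)
m, n = 5, 50
O = rng.randn(m, n)
K = O.T @ O                                        # PSD, rank m
K_hat = nystrom_approx(K[:m, :m], K[:m, m:])
print(np.allclose(K, K_hat, atol=1e-8))            # True
```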
Proposition 5. By increasing $m$, the approximation of Eq. (59) becomes more accurate. If the rank of $K$ is at most $m$, this approximation is exact.

Proof. In Eq. (58), we have the inverse of $A$. In order for this inverse to exist, the matrix $A$ must not be singular. For having a full-rank $A \in \mathbb{R}^{m \times m}$, the rank of $A$ should be $m$. This results in $m$ being an upper bound on the rank of $K$ and a lower bound on the number of landmarks. In practice, it is recommended to use more landmarks for a more accurate approximation, but there is a trade-off with speed.
Corollary 2. As we usually have $m \ll n$, the Nyström approximation works well especially for low-rank matrices (Kishore Kumar & Schneider, 2017). Usually, because of the manifold hypothesis, data fall on a submanifold; hence, the kernel (similarity) matrix or the distance matrix usually has a low rank. Therefore, the Nyström approximation works well for many kernel-based or distance-based manifold learning methods.

Consider Eq. (52) or (59) as the partitions of the kernel matrix $K$. Note that the (Mercer) kernel matrix is positive semi-definite, so the Nyström approximation can be applied to kernels.

Recall that Eq. (14) decomposes the kernel matrix into eigenvectors and then Eq. (10) embeds the data. However, for big data, the eigenvalue decomposition of the kernel matrix is intractable. Therefore, using Eq. (55), we decompose only an $m \times m$ submatrix of the kernel. Comparing Eqs. (15) and (53) shows that:

$$\mathbb{R}^{m \times n} \ni Y = [R, S] \overset{(a)}{=} [\Sigma^{1/2} U^\top, \; \Sigma^{-1/2} U^\top B], \quad (60)$$

where $(a)$ is because of Eqs. (56) and (57), and the terms $U$ and $\Sigma$ are obtained from Eq. (55). Eq. (60) gives the approximately embedded data, with a good approximation. This is the embedding in landmark MDS (De Silva & Tenenbaum, 2003; 2004). Truncating this matrix to $Y \in \mathbb{R}^{p \times n}$, with the top $p$ rows, gives the $p$-dimensional embedding of the $n$ points.

Comparing Eq. (60) with Eq. (10) shows that the formulae for the embedding of the landmarks, $R$, and for the embedding of the whole data (without the Nyström approximation) are similar to each other, but one uses only the landmarks and the other uses the whole data.

If $D_{ij}$ denotes the $(i, j)$-th element of the distance matrix and $c_i$ is the $i$-th element of a weight vector $c$, Eq. (13) can be restated as (Platt, 2005):

$$K_{ij} = -\frac{1}{2} \Big( D_{ij} - \sum_{p} c_p D_{pj} - \sum_{q} c_q D_{iq} + \sum_{p, q} c_p c_q D_{pq} \Big), \quad (61)$$

where $\sum_i c_i = 1$.

Let the partitions of the distance matrix be:

$$\mathbb{R}^{n \times n} \ni D = \begin{bmatrix} E & F \\ F^\top & G \end{bmatrix}, \quad (62)$$

where $E \in \mathbb{R}^{m \times m}$, $F \in \mathbb{R}^{m \times (n - m)}$, and $G \in \mathbb{R}^{(n - m) \times (n - m)}$, in which $m \ll n$.
Figure 2. Embedding of the training data in (a) classical MDS, (b) PCA, (c) kernel classical MDS (with cosine kernel), (d) Isomap, (e) kernel Isomap, and (f) Sammon mapping.

Comparing Eqs. (52) and (62) shows that the partitions of the kernel matrix can be obtained from the partitions of the distance matrix as (Platt, 2005):

$$A_{ij} = -\frac{1}{2} \Big( E_{ij} - \frac{1}{m} \sum_{p} E_{pj} - \frac{1}{m} \sum_{q} E_{iq} + \frac{1}{m^2} \sum_{p, q} E_{pq} \Big), \quad (63)$$

$$B_{ij} = -\frac{1}{2} \Big( F_{ij} - \frac{1}{m} \sum_{q} F_{qj} - \frac{1}{m} \sum_{q} E_{iq} \Big), \quad (64)$$

and $C_{ij}$ can be obtained from Eq. (58).

In landmark MDS and landmark Isomap, the partitions (submatrices) $E$ and $F$ of the Euclidean and geodesic distance matrices are calculated, respectively (see Eq. (62)). Then, Eqs. (63), (64), and (58) give us the partitions of the kernel matrix. Eqs. (55) and (60) provide the embedded data.

It is noteworthy that the paper (Platt, 2005) shows that different landmark MDS methods, such as Landmark MDS (LMDS) (De Silva & Tenenbaum, 2003; 2004),
FastMap (Faloutsos & Lin, 1995), and MetricMap (Wang et al., 1999), are reduced to the landmark MDS introduced here. The landmark MDS is also referred to as sparse MDS (De Silva & Tenenbaum, 2004). Moreover, Landmark Isomap (L-Isomap) (De Silva & Tenenbaum, 2003) is reduced to the landmark Isomap method explained here (see (Bengio et al., 2004b, Corollary 1) for a proof). In other words, these large-scale manifold learning methods make use of the Nyström approximation (Talwalkar et al., 2008).
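A minimal sketch of landmark MDS along the lines of Eqs. (60) to (64) is given below (our illustration, with random landmark selection). The constant column terms of Eq. (64) are dropped because the top eigenvectors of the centered landmark block are orthogonal to the all-ones vector, so they do not affect the embedding.

```python
import numpy as np
from scipy.spatial.distance import cdist

def landmark_mds(X, m=20, p=2, seed=0):
    """Landmark MDS sketch: X is (n, d) with points as rows; m landmarks; p target dimensions."""
    n = X.shape[0]
    rng = np.random.RandomState(seed)
    idx = rng.choice(n, size=m, replace=False)          # random landmark selection
    rest = np.setdiff1d(np.arange(n), idx)
    E = cdist(X[idx], X[idx]) ** 2                      # (m, m) squared distances (Eq. (62))
    F = cdist(X[idx], X[rest]) ** 2                     # (m, n - m) squared distances
    Hm = np.eye(m) - np.ones((m, m)) / m
    A = -0.5 * Hm @ E @ Hm                              # double-centered landmark block (Eq. (63))
    B = -0.5 * (F - E.mean(axis=1, keepdims=True))      # Eq. (64), constant column terms dropped
    eigvals, U = np.linalg.eigh(A)                      # A = U Sigma U^T (Eq. (55))
    order = np.argsort(eigvals)[::-1][:p]
    Sigma, U = np.clip(eigvals[order], 1e-12, None), U[:, order]
    Y = np.empty((p, n))
    Y[:, idx] = np.sqrt(Sigma)[:, None] * U.T           # landmarks: Sigma^{1/2} U^T (Eq. (60))
    Y[:, rest] = (U.T @ B) / np.sqrt(Sigma)[:, None]    # others: Sigma^{-1/2} U^T B (Eq. (60))
    return Y

# Toy usage
X = np.random.RandomState(1).randn(300, 10)
Y = landmark_mds(X, m=30, p=2)
print(Y.shape)                                          # (2, 300)
```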
7. Simulations
For the simulations, we used the MNIST dataset (LeCun et al.), which includes 60,000 training images and 10,000 test images of size 28 × 28 pixels. It includes 10 classes for the 10 digits, 0 to 9. For tractability of the eigenvalue problem, we used a subset of 2000 training points (200 per class) and 500 test points (50 per class).

7.2.1. Classical MDS and Comparison to PCA

The embedding of the training data by classical MDS is shown in Fig. 2. As can be seen, this embedding is interpretable because, for example, the digits (7 and 9), (6 and 8), and (5 and 6), which can be converted to each other by slight changes, are embedded close to one another.

Figure 2 also depicts the embedding of the training data by PCA. As can be seen, the embedding of PCA is equivalent to the embedding of classical MDS, because rotation and flipping do not matter in manifold learning. This validates the claim of equivalence of PCA and classical MDS, stated in Section 2.1.3.
Figure 3. Embedding of the out-of-sample data in (a) classical MDS and (b) Isomap. The transparent points indicate the embedding of the training data.

7.2.2. Kernel Classical MDS, Isomap, Kernel Isomap, and Sammon Mapping
The embedding by kernel classical MDS, i.e., the generalized classical MDS (with the cosine kernel), is also shown in Fig. 2. This figure also includes the embedding by Isomap. It is empirically observed that the embeddings by Isomap usually look like the legs of an octopus (Ghojogh et al., 2019c). In this embedding, one can see two legs, one of which is bigger than the other. Figure 2 also shows the embedding by kernel Isomap. Note that kernel Isomap still uses the kernel calculated using the geodesic distance. Finally, the embedding by Sammon mapping, with 1000 iterations, is also illustrated in Fig. 2. The embeddings by all these methods are meaningful because the more similar digits have been embedded close to each other.

An important fact about the embeddings is that the mean is zero in the embeddings by classical MDS, kernel classical MDS, Isomap, and kernel Isomap. This is because of the double centering of the distance matrices in these methods (see Eqs. (12), (13), (31), and (32)).
The out-of-sample embedding by classical MDS and Isomap can be seen in Fig. 3. For the out-of-sample embeddings by classical MDS and Isomap, we used Eqs. (39) and (41), respectively. In the Isomap method, as it is difficult to implement the geodesic distance matrix calculated from only the training points as the intermediate points, we used an approximation in which the test points can also be used as intermediate points. A slight shift in the mean of the out-of-sample embedding in the Isomap result is because of this approximation.
The Python code implementations of the simulations can be found in the repositories of the following GitHub profile: https://github.com/bghojogh
8. Conclusion
This tutorial and survey paper was about MDS, Sammon mapping, and Isomap. Classical MDS, kernel classical MDS, metric MDS, and non-metric MDS were explained as the categories of MDS. Sammon mapping and Isomap were also explained as special cases of metric MDS and kernel classical MDS, respectively. Kernel Isomap was also introduced. Out-of-sample extensions of these methods using eigenfunctions and kernel mapping were provided. Landmark MDS and landmark Isomap using the Nyström approximation were also covered in this paper. Finally, some simulations were provided to show the embeddings.

Some specific methods based on MDS and Isomap were not covered in this paper for the sake of brevity. Some examples of these methods are supervised Isomap (Wu & Chan, 2004), robust kernel Isomap (Choi & Choi, 2007), and kernel Isomap for noisy data (Choi & Choi, 2005).
References
Agarwal, Sameer, Wills, Josh, Cayton, Lawrence, Lanckriet, Gert, Kriegman, David, and Belongie, Serge. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pp. 11–18, 2007.

Aubin, Thierry. A course in differential geometry, volume 27. American Mathematical Society, Graduate Studies in Mathematics, 2001.

Beals, Richard, Krantz, David H, and Tversky, Amos. Foundations of multidimensional scaling. Psychological Review, 75(2):127, 1968.

Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Delalleau, O, Ouimet, M, and Le Roux, N. Learning eigenfunctions of similarity: linking spectral clustering and kernel PCA. Technical report, Université de Montréal, 2003a.

Bengio, Yoshua, Vincent, Pascal, Paiement, Jean-François, Delalleau, O, Ouimet, M, and Le Roux, N. Spectral clustering and kernel PCA are learning eigenfunctions, volume 1239. Citeseer, 2003b.

Bengio, Yoshua, Delalleau, Olivier, Roux, Nicolas Le, Paiement, Jean-François, Vincent, Pascal, and Ouimet, Marie. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219, 2004a.
Bengio, Yoshua, Paiement, Jean-François, Vincent, Pascal, Delalleau, Olivier, Roux, Nicolas L, and Ouimet, Marie. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, pp. 177–184, 2004b.

Bengio, Yoshua, Delalleau, Olivier, Le Roux, Nicolas, Paiement, Jean-François, Vincent, Pascal, and Ouimet, Marie. Spectral dimensionality reduction. In Feature Extraction, pp. 519–550. Springer, 2006.

Borg, Ingwer and Groenen, Patrick JF. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005.

Bunte, Kerstin, Biehl, Michael, and Hammer, Barbara. A general framework for dimensionality-reducing data visualization mapping. Neural Computation, 24(3):771–804, 2012.

Cailliez, Francis. The analytical solution of the additive constant problem. Psychometrika, 48(2):305–308, 1983.

Choi, Heeyoul and Choi, Seungjin. Kernel Isomap. Electronics Letters, 40(25):1612–1613, 2004.

Choi, Heeyoul and Choi, Seungjin. Kernel Isomap on noisy manifold. In Proceedings of the 4th International Conference on Development and Learning, 2005, pp. 208–213. IEEE, 2005.

Choi, Heeyoul and Choi, Seungjin. Robust kernel Isomap. Pattern Recognition, 40(3):853–862, 2007.

Cormen, Thomas H, Leiserson, Charles E, Rivest, Ronald L, and Stein, Clifford. Introduction to Algorithms. MIT Press, 2009.

Cox, Michael AA and Cox, Trevor F. Multidimensional scaling. In Handbook of Data Visualization, pp. 315–347. Springer, 2008.

De Leeuw, Jan. Multidimensional scaling. Technical report, University of California Los Angeles, 2011.

De Silva, Vin and Tenenbaum, Joshua B. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems, pp. 721–728, 2003.

De Silva, Vin and Tenenbaum, Joshua B. Sparse multidimensional scaling using landmark points. Technical report, Stanford University, 2004.

Faloutsos, Christos and Lin, King-Ip. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 163–174, 1995.

Ghodsi, Ali. Dimensionality reduction: a short tutorial. Technical report, Department of Statistics and Actuarial Science, University of Waterloo, Ontario, Canada, 2006.

Ghojogh, Benyamin and Crowley, Mark. Unsupervised and supervised principal component analysis: Tutorial. arXiv preprint arXiv:1906.03148, 2019.

Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019a.

Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Roweis discriminant analysis: A generalized subspace learning method. arXiv preprint arXiv:1910.05437, 2019b.

Ghojogh, Benyamin, Samad, Maria N, Mashhadi, Sayema Asif, Kapoor, Tania, Ali, Wahab, Karray, Fakhri, and Crowley, Mark. Feature selection and feature extraction in pattern analysis: A literature review. arXiv preprint arXiv:1905.02845, 2019c.

Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Quantile-quantile embedding for distribution transformation, manifold embedding, and image embedding with choice of embedding distribution. arXiv preprint arXiv:2006.11385, 2020.

Gisbrecht, Andrej, Lueks, Wouter, Mokbel, Bassam, and Hammer, Barbara. Out-of-sample kernel extensions for nonparametric dimensionality reduction. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, volume 2012, pp. 531–536, 2012.

Gisbrecht, Andrej, Schulz, Alexander, and Hammer, Barbara. Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing, 147:71–82, 2015.
Gower, John C. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3-4):325–338, 1966.

Ham, Jihun, Lee, Daniel D, Mika, Sebastian, and Schölkopf, Bernhard. A kernel view of the dimensionality reduction of manifolds. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 47, 2004.
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, Alexander J. Kernel methods in machine learning. The Annals of Statistics, pp. 1171–1220, 2008.

Holland, Steven M. Non-metric multidimensional scaling (MDS). Technical report, Department of Geology, University of Georgia, 2008.

Jung, Sungkyu. Lecture: Multidimensional scaling, advanced applied multivariate analysis. Lecture notes, Department of Statistics, University of Pittsburgh, 2013.

Katsikitis, Mary. The classification of facial expressions of emotion: A multidimensional-scaling approach. Perception, 26(5):613–626, 1997.

Kishore Kumar, N and Schneider, Jan. Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra, 65(11):2212–2244, 2017.

Kruskal, J. Non-metric multidimensional scaling: a numerical method. Psychometrika, 29(1):1, 1964a.

Kruskal, Joseph B. Multidimensional scaling by optimising goodness-of-fit to non-metric hypotheses. Psychometrika, 29(1):115–129, 1964b.

LeCun, Yann, Cortes, Corinna, and Burges, Christopher J.C. MNIST handwritten digits dataset. http://yann.lecun.com/exdb/mnist/. Accessed: 2019.

Lee, John A and Verleysen, Michel. Nonlinear dimensionality reduction. Springer Science & Business Media, 2007.

Lee, John Aldo, Lendasse, Amaury, Verleysen, Michel, et al. Curvilinear distance analysis versus Isomap. In European Symposium on Artificial Neural Networks, volume 2, pp. 185–192, 2002.

Mardia, Kanti V. Some properties of classical multi-dimensional scaling. Communications in Statistics - Theory and Methods, 7(13):1233–1241, 1978.

Oldford, Wayne. Lecture: Recasting principal components. Lecture notes for Data Visualization, Department of Statistics and Actuarial Science, University of Waterloo, 2018.

Platt, John. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In AISTATS, 2005.

Russell, James A and Bullock, Merry. Multidimensional scaling of emotional facial expressions: similarity from preschoolers to adults. Journal of Personality and Social Psychology, 48(5):1290, 1985.

Sammon, John W. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 100(5):401–409, 1969.

Saul, Lawrence K and Roweis, Sam T. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4(Jun):119–155, 2003.

Schlesinger, Itzchak M and Guttman, Louis. Smallest space analysis of intelligence and achievement tests. Psychological Bulletin, 71(2):95, 1969.

Strange, Harry and Zwiggelaar, Reyer. A generalised solution to the out-of-sample extension problem in manifold learning. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 471–476, 2011.

Strange, Harry and Zwiggelaar, Reyer. Open Problems in Spectral Dimensionality Reduction. Springer, 2014.
Talwalkar, Ameet, Kumar, Sanjiv, and Rowley, Henry. Large-scale manifold learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2008.

Tenenbaum, Joshua B, De Silva, Vin, and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Torgerson, Warren S. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.
Torgerson, Warren S. Multidimensional scaling of similarity. Psychometrika, 30(4):379–393, 1965.

Wang, Jason Tsong-Li, Wang, Xiong, Lin, King-Ip, Shasha, Dennis, Shapiro, Bruce A, and Zhang, Kaizhong. Evaluating a class of distance-mapping algorithms for data mining and clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 307–311, 1999.

Williams, Christopher KI and Seeger, Matthias. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pp. 682–688, 2001.
Wu, Yiming and Chan, Kap Luk. An extended Isomap algorithm for learning multi-class manifold. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826), volume 6, pp. 3429–3433. IEEE, 2004.

Young, Forrest W. Multidimensional scaling: History, theory, and applications. Psychology Press, 2013.

Zhao, Xiaoming and Zhang, Shiqing. Facial expression recognition based on local binary patterns and kernel discriminant Isomap. Sensors, 11(10):9573–9588, 2011.