Learning Isometric Separation Maps
Nikolaos Vasiloglou, Alexander G. Gray, David V. Anderson
Georgia Institute of Technology, Atlanta, GA
This work was sponsored by Google grants.
ABSTRACT
Maximum Variance Unfolding (MVU) and its variants have been very successful in embedding data manifolds in lower-dimensional spaces, often revealing the true intrinsic dimension. In this paper we show how to also incorporate supervised class information into an MVU-like method without breaking its convexity. We call this method the Isometric Separation Map and we show that the resulting kernel matrix can be used as a binary/multiclass Support Vector Machine-like method in a semi-supervised (transductive) framework. We also show that the method always finds a kernel matrix that linearly separates the training data exactly, without projecting them into infinite-dimensional spaces. In traditional SVMs we choose a kernel and hope that the data become linearly separable in the kernel space. In this paper we show how the hyperplane can be chosen ad hoc and the kernel trained so that the data are always linearly separable. Comparisons with large-margin SVMs show comparable performance.
1. INTRODUCTION
Support Vector Machines have been quite successful in separating classes of data that are not linearly separable. The kernel trick lifts the data into a high-dimensional Hilbert space, usually of infinite dimension [1]. Embedding datasets in infinite-dimensional spaces makes it possible to separate with linear hyperplanes, in the lifted space, classes that were not separable in the original space. So far it is not clear how the dimensionality of the kernel affects the performance of SVMs, nor is it known how many dimensions are sufficient for separating the classes. It is very likely that the minimum dimension required for linear separability is much smaller than the original dimension of the data, because the data may already lie on a manifold with redundant dimensions. Maximum Variance Unfolding (MVU) [2], along with other manifold learning methods, has addressed the problem
of reducing the dimensionality of the data by preserving local distances. Most of the time the data end up living in a lower-dimensional space. MVU explicitly finds the optimal kernel matrix for the data by solving a semidefinite program. As a remark, MVU usually gives the most compact spectrum [2], revealing the true intrinsic dimensionality of the dataset very well. The authors of MVU point out, though, that it performs very poorly when the kernel matrix is used for SVM classification [2], as it does not include any information about the linear separability of the classes. For example, in figure 1b we show two classes on a Swiss roll manifold. After unfolding with MVU (figure 1d), the classes are still not linearly separable.

In this paper we introduce a variation of MVU that takes into consideration the linear separability of the classes. The result is a new algorithm, the Isometric Separation Map (ISM), that gives an unfolding which preserves the class structure of the manifold. The algorithm can be seen as a transductive (semi-supervised) SVM, since it requires the test data during training. Transductive SVMs have been studied by several researchers. When the choice of the kernel is ad hoc, the problem becomes very difficult, as it boils down to mixed integer programming [3]. In [4] and [5] the authors train the kernel matrix over a set of predefined kernels. Although this gives higher flexibility in forming the kernel, it may still require a large number of predefined kernels; for example, if one of the choices were the Gaussian, it would be necessary to keep a large number of them with different sigmas (bandwidths). It is widely known that kernel methods perform poorly if the bandwidth of the Gaussian is too wide or too narrow. This technique usually leads to full-rank semidefinite programs that are computationally hard. Finally, in [6] the Laplacian Eigenmap framework is used for training SVMs. Laplacian Eigenmaps are another dimensionality reduction method based on the Gaussian kernel; they also try to capture the local geometry and take advantage of it in SVM training. Our technique does not make any assumption on the kernel function; the only requirement is to preserve isometry on the data.

The paper is organized in the following way. In section 2 we give an overview of MVU along with the variants that make it scalable. In section 3 we present the ISM algorithm. Some examples of embedding manifolds with ISM are presented in section 4. In section 5 we present a transductive SVM based on ISM.
2. MAXIMUM VARIANCE UNFOLDING, MAXIMUM FURTHEST NEIGHBOR UNFOLDING
Weinberger formulated the problem of isometric unfolding as a semidefinite programming algorithm [2]. Given a set of data $X \in \Re^{N \times d}$, where $N$ is the number of points and $d$ is the dimensionality, the dot product or Gram matrix is defined as $G = XX^T$. The goal is to find a new Gram matrix $K$ such that $\mathrm{rank}(K) < \mathrm{rank}(G)$, in other words $K = \hat{X}\hat{X}^T$ where $\hat{X} \in \Re^{N \times d'}$ and $d' < d$. Now the dataset is represented by $\hat{X}$, which has fewer dimensions than $X$. The requirement of isometric unfolding is that the Euclidean distances in $\Re^{d'}$, for a given neighborhood around every point, have to be the same as in $\Re^{d}$. This is expressed as:

$$K_{ii} + K_{jj} - K_{ij} - K_{ji} = G_{ii} + G_{jj} - G_{ij} - G_{ji}, \quad \forall j \in I_i$$

where $I_i$ is the set of indices of the neighbors of the $i$-th point. From all the admissible $K$ matrices, MFNU chooses the one that maximizes the distances between furthest-neighbor pairs, and MVU the one that maximizes the variance of the set (equivalently, the distances of the points from the origin). So the algorithm is posed as the SDP:

$$\max_{K} \sum_{i=1}^{N} B_i \bullet K \qquad (1)$$
$$\text{subject to } A^{ij} \bullet K = d_{ij}, \ \forall j \in I_i, \qquad \sum_{i,j} K_{ij} = 0, \qquad K \succeq 0$$

where $A \bullet B = \mathrm{Trace}(AB^T)$ is the dot product between matrices. $A^{ij}$ is zero everywhere except for the entries

$$A^{ij}_{ii} = A^{ij}_{jj} = 1, \qquad A^{ij}_{ij} = A^{ij}_{ji} = -1 \qquad (2)$$

and

$$d_{ij} = G_{ii} + G_{jj} - G_{ij} - G_{ji}. \qquad (3)$$

$B_i$ has the same structure as $A^{ij}$ and computes the distance $d_{ij}$ of the $i$-th point to its furthest neighbor for MFNU, while for MVU it is just the identity matrix (it computes the distances of the points from the origin). The last condition is just a centering constraint for the covariance matrix. The new lower-dimensional representation of the data, $\hat{X}$, is found in the eigenvectors of $K$. In general MVU/MFNU gives Gram matrices with a compact spectrum, at least more compact than traditional linear Principal Component Analysis (PCA), and MFNU behaves as well as MVU in this respect. Unfortunately this method can handle datasets of no more than a few hundred points because of its complexity.

Kulis and Vasiloglou showed how the algorithm can be made more scalable [7, 8] by replacing the constraint $K \succeq 0$ with an explicit rank constraint $K = RR^T$ [9]. The problem becomes non-convex and is reformulated as:

$$\max_{R} \sum_{i=1}^{N} B_i \bullet RR^T \qquad (4)$$
$$\text{subject to } A^{ij} \bullet RR^T = d_{ij}$$

In [9] Burer proved that the above formulation has the same global optimum as the convex one. In this form the algorithm scales better. The problem can be solved with the augmented Lagrangian method [9]:

$$L = -\sum_{i=1}^{N} B_i \bullet RR^T - \sum_{i=1}^{N} \sum_{\forall j \in I_i} \lambda_{ij}\left(A^{ij} \bullet RR^T - d_{ij}\right) + \frac{\sigma}{2} \sum_{i=1}^{N} \sum_{\forall j \in I_i} \left(A^{ij} \bullet RR^T - d_{ij}\right)^2$$

Our goal is to minimize the Lagrangian; that is why the objective term is $-\sum_i B_i \bullet RR^T$ and not $\sum_i B_i \bullet RR^T$. The solution is typically found with the L-BFGS method [9].
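To make the low-rank, penalty-based formulation concrete, the following is a minimal sketch (not the authors' implementation) of the quadratic-penalty version of the augmented Lagrangian, minimized with L-BFGS via SciPy. The multipliers $\lambda_{ij}$ are omitted and the penalty weight $\sigma$ is kept fixed, so this corresponds to a single inner iteration of the method; a faithful implementation would also update the multipliers and increase $\sigma$ between L-BFGS calls, as in [9]. The helper names (`build_constraints`, `distance_penalty`, `unfold`) and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial import cKDTree

def build_constraints(X, k=5):
    """Collect (i, j, d_ij) for the k nearest neighbors of every point, where
    d_ij is the squared Euclidean distance that has to be preserved."""
    _, idx = cKDTree(X).query(X, k=k + 1)        # first hit is the point itself
    return [(i, j, float(np.sum((X[i] - X[j]) ** 2)))
            for i in range(X.shape[0]) for j in idx[i, 1:]]

def distance_penalty(R, cons, sigma):
    """Quadratic penalty (sigma/2) * sum_ij (A_ij . RR^T - d_ij)^2 and its gradient."""
    f, g = 0.0, np.zeros_like(R)
    for i, j, d in cons:
        diff = R[i] - R[j]
        viol = diff @ diff - d                   # A_ij . RR^T - d_ij
        f += 0.5 * sigma * viol ** 2
        grad = 2.0 * sigma * viol * diff
        g[i] += grad
        g[j] -= grad
    return f, g

def unfold(X, new_dim=3, k=5, sigma=10.0, seed=0):
    """Penalty-only sketch of low-rank MVU: maximize the variance of the centered
    embedding R (B_i = I) subject to the local-distance constraints on K = RR^T."""
    N = X.shape[0]
    cons = build_constraints(X, k)
    R0 = 1e-3 * np.random.default_rng(seed).standard_normal((N, new_dim))

    def objective(r):
        R = r.reshape(N, new_dim)
        R = R - R.mean(axis=0)                   # centering constraint
        fd, gd = distance_penalty(R, cons, sigma)
        f = -np.sum(R ** 2) + fd                 # minus the variance: we minimize
        g = -2.0 * R + gd
        g -= g.mean(axis=0)                      # chain rule through the centering
        return f, g.ravel()

    res = minimize(objective, R0.ravel(), jac=True, method="L-BFGS-B")
    return res.x.reshape(N, new_dim)
```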
3. ISOMETRIC SEPARATION MAPS (ISM)
Although MVU and its variant MFNU give low-rank kernel matrices, experiments [2] show that they perform poorly when it comes to SVM classification. In this section we show that MVU/MFNU can be modified so that the kernel matrix can be used for classification too.

In traditional SVMs the kernel is chosen ad hoc and the goal is to find a hyperplane that can linearly separate the classes. The kernel is chosen so that it lifts the data into a high-dimensional space, hoping that the data will become linearly separable there. In our approach the hyperplane is given and we try to find the kernel matrix that separates the data along that hyperplane. Finding a kernel matrix that satisfies this condition alone is trivial: it suffices to add one extra dimension to the data that is either -1 or 1. What is more interesting is to find a mapping to a (higher- or lower-dimensional) space that keeps the data points linearly separable and preserves the local isometry. As we will see later, depending on the structure of the classes, it is likely that we end up in a higher-dimensional space; we are interested in the minimum dimension of that space.

The solution to the problem is the following. We pick one of the data points $x_A$ to be normal to the separating hyperplane. The choice of the point does not matter, since it only changes the orientation of the points in space. The manifold consists of two classes $C_1$ and $C_2$. Let $x_i \in C_1$ be the points that belong to the same class as $x_A$; then $k(x_A, x_i) \geq 0$, where $k(x_A, x_i)$ is the generalized dot product between $x_A$ and $x_i$. For points that belong to the opposite class, $x_i \in C_2$, $k(x_A, x_i) \leq 0$. Now the problem of MVU/MFNU with linear separability constraints can be cast as the following semidefinite program:

$$\max_{K} \sum_{i=1}^{N} B_i \bullet K \qquad (5)$$
$$\text{subject to } A^{ij} \bullet K = d_{ij}, \ \forall j \in I_i$$
$$K_{A,i} \geq 0, \ \forall i \in C_1, \qquad K_{A,i} \leq 0, \ \forall i \in C_2, \qquad K \succeq 0$$

Using the same formulation as in [8], we can solve the above problem in a non-convex framework that scales better. Extending the problem to more classes is straightforward: the only modification is to use more anchor points that serve as normal vectors to the separating hyperplanes. The problem is always feasible provided that $k \ll N$ (as long as the $k$ neighbors belong to the tangent space and the manifold is smooth, a folding, i.e. a locally isometric transform, of the manifold along a hyperplane always exists [10]). If all pairwise distances are given, then the Gram matrix is uniquely defined and the problem may be infeasible. In the trivial case where $k = 1$, meaning that each point has exactly one neighbor, the problem is always feasible. In general there is a maximum $k$ beyond which the problem may become infeasible. This means there is always a $k$ for which the training error is zero; equivalently, we can always find a space of some dimension in which the manifold is embedded isometrically while the classes are linearly separated.

If some of the data points are labeled (training data) and some are not (test data), then the above method can be used as an SVM-like classifier that always achieves zero training error, in contrast to the other algorithms for learning the kernel in SVMs mentioned in section 1, where the kernel is learnt as a convex combination of preselected kernels. This might sound like over-fitting on the training data.
In reality, though, this is not the case, since the test data participate during training, glued to the training data by the distance constraints. Another remark about ISM is that it is not a max-margin classifier, because it does not regularize the norm of the normal vector; it cannot do so, since the local distances must also be preserved.
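In the scalable low-rank setting, the sign constraints $K_{A,i} = R_A \cdot R_i \gtrless 0$ can be handled with the same penalty machinery as the distance constraints. The sketch below is one assumed, illustrative way of doing this (quadratic hinge penalties with a fixed weight `c`), not the authors' code; it only defines the extra term that would be added to the objective and gradient of the unfolding sketch in section 2. Points whose label is set to 0 contribute nothing, which is convenient for the transductive use in section 5.

```python
import numpy as np

def separation_penalty(R, anchor, labels, c=10.0):
    """Quadratic hinge penalty for the ISM sign constraints on K_{A,i} = R[anchor] . R[i].
    labels[i] is +1 for the anchor's class, -1 for the other class, 0 for 'no constraint';
    only points on the wrong side of the hyperplane (labels[i] * K_{A,i} < 0) are penalized."""
    a = R[anchor]
    k_ai = R @ a                                  # K_{A,i} for every point i
    viol = np.minimum(labels * k_ai, 0.0)         # negative exactly where the sign is wrong
    f = 0.5 * c * np.sum(viol ** 2)
    coeff = c * viol * labels                     # zero wherever the constraint holds
    g = np.outer(coeff, a)                        # d/dR[i] = c * viol_i * labels[i] * a
    g[anchor] += R.T @ coeff                      # d/da accumulates over all constraints
    return f, g
```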
4. DIMENSIONALITY MINIMIZATION WITH ISM
In order to verify ISM for dimensionality adjustment we tested it on the swiss roll dataset (1500 points). Two classes were defined on the swiss roll that were not linearly separable, and ISM was run on the dataset. Embedding in 2 dimensions was not possible, as the isometry could not be preserved (the algorithm terminated with 2% error on the local distances). Embedding was possible, though, in 3 dimensions, where the algorithm terminated with 0.01% error on the local distance constraints. In both cases the classification error was zero. As we see in fig. 1, MVU unfolds the dataset into a strip where the classes are not linearly separable. ISM, on the other hand, transforms the manifold into a set that preserves the local distances (k-neighborhood = 5) and divides the two classes in a linearly separable way.

To demonstrate the power of ISM further, we tested it on two even more complex cases. In figure 2 we generated 3 classes on a swiss roll. Clearly MVU/MFNU unfolds the manifold in a non-separable way, whereas ISM was able to map the swiss roll to a 12-dimensional space where the 3 classes are completely linearly separable. In figure 3 the Principal Component Analysis (PCA) spectrum of the 12-dimensional swiss roll is shown; the spectrum is quite rich. ISM can handle even more complicated cases: in figure 4 we show 3 classes lying randomly on a swiss roll. ISM was able to map the manifold to a 12-dimensional space keeping the 3 classes linearly separable, and the corresponding PCA spectrum is also depicted in figure 3. The algorithm terminated with a very low feasibility error, 0.4% for distance preservation and 0.16% for linear separability. Further improvement of the feasibility error was possible, but L-BFGS becomes slow as it gets close to the optimum. In general the algorithm converges very quickly to 1% feasibility error; further improvement is possible but takes time.
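A compact driver tying the pieces together might look as follows; it reuses `build_constraints`, `distance_penalty`, and `separation_penalty` from the earlier sketches and reports the two quantities quoted above (relative distance error and classification error). The swiss-roll generator and the class split at the end are illustrative guesses, not the exact data behind the figures, and `ism_unfold` with its defaults is an assumption about one possible implementation.

```python
import numpy as np
from scipy.optimize import minimize

def ism_unfold(X, labels, anchor=0, new_dim=3, k=5, sigma=10.0, c=10.0, seed=0):
    """Illustrative ISM driver: distance penalties plus separation penalties, minimized
    with L-BFGS.  Returns the embedding, the relative distance error, and the fraction
    of points falling on the wrong side of the anchor hyperplane."""
    N = X.shape[0]
    labels = np.asarray(labels, dtype=float)
    labels = labels * labels[anchor]             # make the anchor's own class the +1 class
    cons = build_constraints(X, k)
    R0 = 1e-3 * np.random.default_rng(seed).standard_normal((N, new_dim))

    def objective(r):
        R = r.reshape(N, new_dim)
        R = R - R.mean(axis=0)
        fd, gd = distance_penalty(R, cons, sigma)
        fs, gs = separation_penalty(R, anchor, labels, c)
        f = -np.sum(R ** 2) + fd + fs
        g = -2.0 * R + gd + gs
        g -= g.mean(axis=0)                      # chain rule through the centering
        return f, g.ravel()

    R = minimize(objective, R0.ravel(), jac=True, method="L-BFGS-B").x.reshape(N, new_dim)
    R = R - R.mean(axis=0)
    dist_err = np.sqrt(np.mean([(np.sum((R[i] - R[j]) ** 2) / d - 1.0) ** 2
                                for i, j, d in cons]))
    class_err = float(np.mean(labels * (R @ R[anchor]) < 0))
    return R, dist_err, class_err

# Hypothetical two-class swiss roll; the class split below is an illustrative guess.
rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1 + 2 * rng.random(1500))
X = np.column_stack([t * np.cos(t), 21 * rng.random(1500), t * np.sin(t)])
y = np.where(np.sin(2 * t) > 0, 1, -1)
R, dist_err, class_err = ism_unfold(X, y, new_dim=3, k=5)
print(f"distance error {dist_err:.4f}, classification error {class_err:.4f}")
```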
5. TRANSDUCTIVE SVMS
The method described above can also be used as a transductive SVM in a semi-supervised setting. Transductive SVMs are in general difficult problems: if the kernel is preselected, a mixed-integer problem has to be solved, and if the kernel is learnt from the data then, as we mentioned earlier, no guarantee can be given that the training data are linearly separable. In ISM the kernel is trained over all the data, using all the neighborhood information. After solving the optimization problem, the classification information for the test data lies in the sign of $K_{A,i}, \forall i \in T$, where $T$ is the test set. At this point we would like to highlight the difference between SVMs and ISM. In figure 5 we see how SVMs and ISM would classify points: SVMs keep the points fixed and try to find the optimal curve that separates them, while ISM picks the hyperplane and moves the points around it (always keeping them connected) so that they are correctly classified.

Fig. 1. a) A three-dimensional swiss roll painted with a color gradient. b) The same swiss roll with two classes on it, black and green. c) The swiss roll of (a) unfolded with MVU/MFNU (no class information); the color gradient shows that the local distances have been preserved. d) The swiss roll of (b) unfolded with MVU/MFNU; the two classes are not linearly separable. e, f) Views of the swiss roll of (a) after ISM, with the class structure taken from (b); the intention of this figure is to show how the points are mapped so that the local neighborhoods are preserved. g, h) Views of the manifold of (b) after ISM; the points are now painted with the class colors to show that they are linearly separable.

Fig. 2. Left: Three classes lying on a swiss roll. Right: After unfolding with MVU the classes are not linearly separable. Isometric Separation Maps mapped this manifold to a 12-dimensional space such that the classes were linearly separable by 3 hyperplanes 100% of the time and the 5-neighborhood distances were preserved with 0.1% relative root mean square error.

Fig. 3. Left: The PCA (SVD) spectrum of the unfolded swiss roll of figure 4; as we can see, it is quite rich. Despite the bad structure of the classes, the ISM algorithm was able to map it to a 12-dimensional space. Right: The PCA (SVD) spectrum of the unfolded swiss roll of figure 2, which is quite rich too.

Fig. 4. Top: Three classes lying randomly on a swiss roll. Bottom: After unfolding with MVU the classes are not linearly separable. Isometric Separation Maps mapped this manifold to a 12-dimensional space such that the classes were linearly separable by 3 hyperplanes. The optimization algorithm terminated with feasibility error 0.4% for 5-neighborhood distance preservation, while 99.83% of the points were correctly classified. The goal of this experiment was to verify experimentally that ISM can lift an arbitrarily structured dataset to a high-dimensional space in which the classes are linearly separable.

Fig. 5. Top: A simple two-class one-dimensional manifold. Bottom left: ISM picks a straight line and folds the points around it so that the classes remain separable. Bottom right: The traditional SVM keeps the points fixed and finds the curve that best separates the classes.
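The transductive use can be expressed directly on top of the earlier sketches: unlabeled points take part in the distance constraints but impose no sign constraint, and their labels are read off the sign of $K_{A,i}$ afterwards. Giving unlabeled points a zero label is an implementation convenience assumed here, not something prescribed by the paper; `transductive_ism` and its defaults are likewise illustrative.

```python
import numpy as np

def transductive_ism(X, train_idx, train_labels, anchor=None, **kw):
    """Semi-supervised ISM sketch: every point (train and test) enters the distance
    constraints, but only the labeled training points get sign constraints; test
    labels are then read from the sign of K_{A,i} = R[anchor] . R[i]."""
    labels = np.zeros(X.shape[0])                # 0 => no sign penalty for that point
    labels[train_idx] = train_labels
    anchor = int(train_idx[0]) if anchor is None else anchor
    R, _, _ = ism_unfold(X, labels, anchor=anchor, **kw)
    pred = np.where(R @ R[anchor] >= 0, 1, -1)
    return (labels[anchor] * pred).astype(int)   # map back to the original +/-1 labels
```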
In order to evaluate ISM as an SVM-like classifier, we chose a publicly available dataset and compared it against traditional SVMs in two different modes. We used the publicly available SVM-light software for traditional SVM classification. In the first experiment we picked 1000 points from the MAGIC gamma telescope dataset, publicly available at the UCI repository. We chose 50 points as training points and used the other 950 as test points. For traditional SVM classification we tested the linear, Gaussian and polynomial kernels, with different values for the bandwidth and the polynomial order. We also tuned the regularization factor so that the test error was minimized; in other words, we pushed traditional SVMs to their best performance. The critical parameter for the ISM SVM is the k-neighborhood: usually small values of k allow embedding in lower-dimensional spaces, while large k leads to higher-dimensional ones. The results are summarized in tables 1 and 2. We tested several k-neighborhoods for ISM and different kernels for traditional SVMs. From the results we observe that ISM behaves slightly better than SVM (73.68% versus 70.32%). This is mainly because the training set is small and the SVM cannot capture the geometry very well.

In the second experiment we used the whole dataset. The training set contains 12080 data points and the test set 6340. Although the dataset is 10-dimensional, it is possible to reduce its dimension with MVU/MFNU down to 5; in order to make it linearly separable with ISM, though, it was necessary to use more than 10 dimensions. The results are summarized in tables 3 and 4. As we can see, SVM performs slightly better than ISM (83.28% versus 81.00%).
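For reference, a rough reconstruction of the first experiment's protocol could look like the following. SVM-light was the actual baseline; scikit-learn's SVC is used here only as an illustrative stand-in, the synthetic X and y are placeholders for the real MAGIC data, and the bandwidth-to-gamma mapping and the k = 30, dimension = 40 setting (taken from Table 1) are assumptions about how one might reproduce the numbers, not the authors' script.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Placeholder data standing in for the 1000-point MAGIC sample (10 features, +/-1 labels).
X = rng.standard_normal((1000, 10))
y = np.where(X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5, 1, -1)

perm = rng.permutation(1000)
train, test = perm[:50], perm[50:]               # 50 training points, 950 test points

# Traditional SVM baseline: sweep the Gaussian bandwidth sigma (gamma = 1 / (2 sigma^2)).
for sigma in [0.1, 0.5, 1.0, 2.0, 4.0, 5.0, 8.0]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2)).fit(X[train], y[train])
    print(f"Gaussian sigma={sigma}: {clf.score(X[test], y[test]):.4f}")

# ISM used transductively: all 1000 points enter the distance constraints, only the
# 50 training labels impose sign constraints (sketches above).  The pure-Python
# penalty loop is written for clarity, not speed, so this setting is slow.
pred = transductive_ism(X[perm], train_idx=np.arange(50), train_labels=y[train],
                        k=30, new_dim=40)
print("ISM accuracy:", np.mean(pred[50:] == y[test]))
```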
Another remark in both cases is that ISM always behaves better than the linear kernel. The Gaussian kernel gives the best performance; this is expected, since Gaussian kernel matrices are usually full rank. ISM uses kernel matrices of much smaller rank and achieves equivalent performance. The results do not necessarily demonstrate a big difference between SVMs and ISM. We also experimented with some toy datasets, such as the half-moon dataset presented in [6] and a swiss roll where one data point is given per class; ISM obviously behaves better than SVM there, but this is a trivial and not a fair comparison. In general the differences between ISM and traditional SVMs are at the same levels as the results reported in other transductive SVM papers [6, 4]. In practice ISM is slower than SVMs, since it solves semidefinite problems whereas SVMs solve quadratic problems. It is interesting, though, that ISM provides a tool for associating the dimensionality of the dataset with the classification score and linear separability: the more we increase the dimensionality of the dataset with ISM, the better the classification score. In fact k acts as a regularizer; large values of k correspond to better generalization, as the test error drops.

Table 1. ISM SVM Classification Score versus k-neighborhood for the First Experiment

k-NEIGHBORS  DIMENSION  SCORE
30           40         73.68%

Table 2. Traditional SVM Classification Score versus Kernel for the First Experiment

KERNEL       PARAMETER  SCORE
Gaussian     0.1        69.89%
Gaussian     0.5        70.00%
Gaussian     1.0        70.11%
Gaussian     1.5        70.00%
Gaussian     2.0        70.11%
Gaussian     4.0        70.21%
Gaussian     5.0        70.32%
Gaussian     6.0        70.21%
Gaussian     8.0        70.11%
linear       -          69.89%
polynomial   1          69.89%
polynomial   2          69.68%
polynomial   3          69.58%
polynomial   4          69.84%
polynomial   5          68.84%
polynomial   6          68.84%
polynomial   8          68.95%

Table 3. ISM SVM Classification Score versus k-neighborhood for the Whole Dataset

k-NEIGHBORS  DIMENSION  SCORE
12           30         80.22%
12           35         79.97%
12           40         80.47%
12           45         79.76%
12           50         79.81%
12           55         81.00%
15           40         80.39%
15           45         79.40%
15           50         79.07%
15           55         79.82%
20           40         78.96%
20           45         79.68%
20           50         80.13%
20           55         78.42%

Table 4. Traditional SVM Classification Score versus Kernel for the Whole Dataset

KERNEL       PARAMETER  SCORE
Gaussian     8          83.28%
Gaussian     6          82.77%
linear       -          78.64%
polynomial   2          81.62%
polynomial   3          82.07%
polynomial   5          81.26%

6. SUMMARY

In this paper we presented a new manifold learning method, the Isometric Separation Map. This method is ideal for reducing the dimension of manifolds that have class information associated with them. We also showed how ISM can be used as a semi-supervised (transductive) classifier. Although it does not have superior performance compared to traditional max-margin SVMs, it is a useful tool for determining the dimensionality of the kernel space that is necessary for achieving linear separability. We believe that some improvement of the objective function is necessary so that generalization is improved; probably a term minimizing the norm of the vector normal to the hyperplane (as in SVMs) can be used.
7. REFERENCES

[1] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, New York, NY, USA, 2004.
[2] K.Q. Weinberger, F. Sha, and L.K. Saul, "Learning a kernel matrix for nonlinear dimensionality reduction," in ICML, ACM, New York, NY, USA, 2004.
[3] K. Bennett and A. Demiriz, "Semi-supervised support vector machines," in NIPS, 1999.
[4] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan, "Learning the kernel matrix with semidefinite programming," JMLR, vol. 5, pp. 27–72, 2004.
[5] S.J. Kim, A. Zymnis, A. Magnani, K. Koh, and S. Boyd, "Learning the kernel via convex optimization," in ICASSP, 2008, pp. 1997–2000.
[6] M. Belkin, P. Niyogi, and V. Sindhwani, "On manifold regularization," in AISTATS, 2005.
[7] B. Kulis, A.C. Surendran, and J.C. Platt, "Fast low-rank semidefinite programming for embedding and clustering," in AISTATS, 2007.
[8] N. Vasiloglou, A. Gray, and D. Anderson, "Scalable semidefinite manifold learning," in MLSP, 2008.
[9] S. Burer and R.D.C. Monteiro, "A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization," Mathematical Programming, vol. 95, no. 2, 2003.
[10] J.M. Lee,