Fast SVM training using approximate extreme points
FFast SVM training using approximate extreme points
Fast SVM Training Using Approximate Extreme Points
Manu Nandan [email protected]
Department of Computer and Information Science and EngineeringUniversity of FloridaGainesville, FL 32611, USA
Pramod P. Khargonekar [email protected]
Department of Electrical and Computer EngineeringUniversity of Florida,Gainesville, FL 32611, USA
Sachin S. Talathi [email protected]
Department of Pediatrics, Division of NeurologyDepartment of Biomedical EngineeringDepartment of NeuroscienceUniversity of FloridaGainesville, FL 32611, USA
Editor:
Abstract
Applications of non-linear kernel Support Vector Machines (SVMs) to large datasets isseriously hampered by its excessive training time. We propose a modification, called theapproximate extreme points support vector machine (AESVM), that is aimed at overcomingthis burden. Our approach relies on conducting the SVM optimization over a carefullyselected subset, called the representative set, of the training dataset. We present analyticalresults that indicate the similarity of AESVM and SVM solutions. A linear time algorithmbased on convex hulls and extreme points is used to compute the representative set inkernel space. Extensive computational experiments on nine datasets compared AESVMto LIBSVM (Chang and Lin, 2001b), CVM (Tsang et al., 2005) , BVM (Tsang et al.,2007), LASVM (Bordes et al., 2005), SVM perf (Joachims and Yu, 2009), and the randomfeatures method (Rahimi and Recht, 2007). Our AESVM implementation was found totrain much faster than the other methods, while its classification accuracy was similar tothat of LIBSVM in all cases. In particular, for a seizure detection dataset, AESVM trainingwas almost 10 times faster than LIBSVM and LASVM and more than forty times fasterthan CVM and BVM. Additionally, AESVM also gave competitively fast classificationtimes. Keywords: support vector machines, convex hulls, large scale classification, non-linearkernels, extreme points
1. Introduction
Several real world applications require solutions of classification problems on large datasets.Even though SVMs are known to give excellent classification results, their application toproblems with large datasets is impeded by the burdensome training time requirements. a r X i v : . [ c s . L G ] A p r andan, Khargonekar, and Talathi Recently, much progress has been made in the design of fast training algorithms (Fan et al.,2008; Shalev-Shwartz et al., 2011) for SVMs with the linear kernel (linear SVMs). However,many applications require SVMs with non-linear kernels for accurate classification. Trainingtime complexity for SVMs with non-linear kernels is typically quadratic in the size of thetraining dataset (Shalev-Shwartz and Srebro, 2008). The difficulty of the long trainingtime is exacerbated when grid search with cross-validation is used to derive the optimalhyper-parameters, since this requires multiple SVM training runs. Another problem thatsometimes restricts the applicability of SVMs is the long classification time. The timecomplexity of SVM classification is linear in the number of support vectors and in someapplications the number of support vectors is found to be very large (Guo et al., 2005).In this paper, we propose a new approach for fast SVM training. Consider a two classdataset of N data vectors, X = { x i : x i ∈ R D , i = 1 , , ..., N } , and the corresponding targetlabels Y = { y i : y i ∈ [ − , , i = 1 , , ..., N } . The SVM primal problem can be representedas the following unconstrained optimization problem (Teo et al., 2010; Shalev-Shwartz et al.,2011): min w ,b F ( w , b ) = 12 (cid:107) w (cid:107) + CN N (cid:88) i =1 l ( w , b, φ ( x i )) (1)where l ( w , b, φ ( x i )) = max { , − y i ( w T φ ( x i ) + b ) } , ∀ x i ∈ X and φ : R D → H , b ∈ R , and w ∈ H , a Hilbert spaceHere l ( w , b, φ ( x i )) is the hinge loss of x i . Note that SVM formulations where the penaltyparameter C is divided by N have been used extensively (Sch¨olkopf et al., 2000; Franc andSonnenburg, 2008; Joachims and Yu, 2009). These formulations enable better analysis ofthe scaling of C with N (Joachims, 2006). The problem in (1) requires optimization over N variables. In general, for SVM training algorithms the training time will reduce if thesize of the training dataset is reduced. In this paper, we present an alternative to (1), called approximate extreme points supportvector machines (AESVM), that requires optimization over only a subset of the trainingdataset.
The AESVM formulation is:min w ,b F ( w , b ) = 12 (cid:107) w (cid:107) + CN M (cid:88) t =1 β t l ( w , b, φ ( x t )) (2)where x t ∈ X ∗ , w ∈ H , and b ∈ R Here M is the number of vectors in the selected subset of X , called the representative set X ∗ . The constants β t are defined in (10). We will prove in Section 3.2 that: • F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ C √ C(cid:15) , where ( w ∗ , b ∗ ) and ( w ∗ , b ∗ ) are the solutions of (1)and (2) respectively • Under the assumptions given in corollary 4, F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ C √ C(cid:15) • The AESVM problem minimizes an upper bound of a low rank Gram matrix approx-imation of the SVM objective function ast SVM training using approximate extreme points Based on these results we claim that solving the problem in (2) yields a solution closeto that of (1). As a by-product of the reduction in size of the training set, AESVM is alsoobserved to result in fast classification. Considering that the representative set will haveto be computed several times if grid search is used to find the optimum hyper-parametercombination, we also propose fast algorithms to compute Z ∗ . In particular, we presentan algorithm of time complexity O ( N ) and an alternative algorithm of time complexity O ( N log NP ) to compute Z ∗ , where P is a predefined large integer.The main contributions of this work can be summarized as follows: • Theoretical:
Theorems 1 and 2, and Corollaries 3 to 5 give rationale for the use ofAESVM as a computationally less demanding alternative to the SVM formulation. • Algorithmic:
The algorithm DeriveRS, described in Section 4, computes the represen-tative set in linear time. • Experimental:
Our extensive experiments on nine datasets of varying characteristics,illustrate the suitability of applying AESVM to classification on large datasets.This paper is organized as follows: in Section 2, we briefly discuss recent research onfast SVM training that is closely related to this work. Next, we provide the definition ofthe representative set and discuss properties of AESVM. In section 4, we present efficientalgorithms to compute the representative set and analyze its computational complexity.Section 5 describes the results of our computational experiments. We compared AESVMto the widely used LIBSVM library, core vector machines (CVM), ball vector machines(BVM), LASVM, SVM perf , and the random features method by Rahimi and Recht (2007).Our experiments used eight publicly available datasets and a data set on EEG from ananimal model of epilepsy (Talathi et al., 2008; Nandan et al., 2010). We conclude with adiscussion of the results of this paper in Section 6.
2. Related Work
Several methods have been proposed to efficiently solve the SVM optimization problem.SVMs require special algorithms, as standard optimization algorithms such as interior pointmethods (Boyd and Vandenberghe, 2004; Shalev-Shwartz et al., 2011) have large memoryand training time requirements that make it infeasible for large datasets. In the followingsections we discuss the most widely used strategies to solve the SVM optimization problem.We present a comparison of some of these methods to AESVM in Section 6. SVM solverscan be broadly divided into two categories as described below.
The SVM primal problem is a convex optimization problem with strong duality (Boyd andVandenberghe, 2004). Hence its solution can be arrived at by solving its dual formulation andan, Khargonekar, and Talathi given below: max α L ( α ) = N (cid:88) i =1 α i − N (cid:88) i =1 N (cid:88) j =1 α i α j y i y j K ( x i , x j ) (3)subject to 0 ≤ α i ≤ CN and N (cid:88) i =1 α i y i = 0Here K ( x i , x j ) = φ ( x i ) T φ ( x j ), is the kernel product (Sch¨olkopf and Smola, 2001) of thedata vectors x i and x j , and α is a vector of all variables α i . Solving the dual problem iscomputationally simpler, especially for non-linear kernels and a majority of the SVM solversuse dual optimization. Some of the major dual optimization algorithms are discussed below. Decomposition methods (Osuna et al., 1997) have been widely used to solve (3). Thesemethods optimize over a subset of the training dataset, called the ‘working set’, at each al-gorithm iteration. SVM light (Joachims, 1999) and SMO (Platt, 1999) are popular examplesof decomposition methods. Both these methods have a quadratic time complexity for linearand non-linear SVM kernels (Shalev-Shwartz and Srebro, 2008). Heuristics such as shrink-ing and caching (Joachims, 1999) enable fast convergence of decomposition methods andreduce their memory requirements. LIBSVM (Chang and Lin, 2001b) is a very popular im-plementation of SMO. A dual coordinate descent (Hsieh et al., 2008) SVM solver computesthe optimal α value by modifying one variable α i per algorithm iteration. Dual coordinatedescent SVM solvers, such as LIBLINEAR (Fan et al., 2008), have been proposed primarilyfor the linear kernel. Approximations of the Gram matrix (Fine and Scheinberg, 2002; Drineas and Mahoney,2005), have been proposed to increase training speed and reduce memory requirements ofSVM solvers. The Gram matrix is the N x N square matrix composed of the kernel products K ( x i , x j ) , ∀ x i , x j ∈ X . Training set selection methods attempt to reduce the SVM trainingtime by optimizing over a selected subset of the training set. Several distinct approacheshave been used to select the subset. Some methods use clustering based approaches (Pavlovet al., 2000) to select the subsets. In Yu et al. (2003), hierarchical clustering is performedto derive a dataset that has more data vectors near the classification boundary than awayfrom it. Minimum enclosing ball clustering is used in Cervantes et al. (2008) to remove datavectors that are unlikely to contribute to the SVM training.
Random sampling of training data is another approach followed by approximate SVMsolvers. Lee and Mangasarian (2001) proposed reduced support vector machines (RSVM),in which only a random subset of the training dataset is used. They solve a modifiedformulation of the L2-SVM that minimizes the l -norm of ξ instead of its l -norm. Bordeset al. (2005) proposed the LASVM algorithm that uses active selection techniques to trainSVMs on a subset of the training dataset.A core set (Clarkson, 2010) can be loosely defined as the subset of X for which thesolution of an optimization problem such as (3) has a solution similar to that for the entiredataset X . Tsang et al. (2005) proved that the L2-SVM is a reformulation of the minimumenclosing ball problem for some kernels. They proposed core vector machine (CVM) thatapproximately solves the L2-SVM formulation using core sets. A simplified version of CVMcalled ball vector machine (BVM) was proposed in Tsang et al. (2007), where only an ast SVM training using approximate extreme points enclosing ball is computed. G¨artner and Jaggi (2009) proposed an algorithm to solve theL1-SVM problem, by computing the shortest distance between two polytopes (Bennett andBredensteiner, 2000) using core sets. However, there are no published results on solvingL1-SVM with non-linear kernels using their algorithm.Another method used to approximately solve the SVM problem is to map the datavectors into a randomized feature space that is relatively low dimensional compared to thekernel space H (Rahimi and Recht, 2007). Inner products of the projections of the datavectors are approximations of their kernel product. This effectively reduces the non-linearSVM problem into the simpler linear SVM problem, enabling the use of fast linear SVMsolvers. This method is referred as RfeatSVM in the following sections of this document. In recent years, linear SVMs are finding increased use in applications with high-dimensionaldatasets. This has led to a surge in publications on efficient primal SVM solvers, which aremostly used for linear SVMs. To overcome the difficulties caused by the non-differentiabilityof the primal problem, the following methods are used.
Stochastic sub-gradient descent (Zhang, 2004) uses the sub-gradient computed at somedata vector x i to iteratively update w . Shalev-Shwartz et al. (2011) proposed a stochasticsub-gradient descent SVM solver, Pegasos, that is reported to be among the fastest linearSVM solvers. Cutting plane algorithms (Kelley, 1960) solve the primal problem by succes-sively tightening a piecewise linear approximation. It was employed by Joachims (2006)to solve linear SVMs with their implementation SVM perf . This work was generalized inJoachims and Yu (2009) to include non-linear SVMs by approximately estimating w witharbitrary basis vectors using the fix-point iteration method (Sch¨olkopf and Smola, 2001).Teo et al. (2010) proposed a related method for linear SVMs, that corrected some stabilityissues in the cutting plane methods.
3. Analysis of AESVM
As mentioned in the introduction, AESVM is an optimization problem on a subset of thetraining dataset called the representative set. In this section we first define the representa-tive set. Then we present some properties of AESVM. These results are intended to providetheoretical justifications for the use of AESVM as an approximation to the SVM problem(1). We denote the cardinality of a set S by | S | . The convex hull of a set X is the smallest convex set containing X (Rockafellar, 1996) andcan be obtained by taking all possible convex combinations of elements of X . Assuming X is finite, the convex hull is a polygon. The extreme points of X , EP ( X ), are defined to bethe vertices of the convex polygon formed by the convex hull of X . Any vector x i in X canbe represented as a convex combination of vectors in EP ( X ): x i = (cid:88) x t ∈ EP ( X ) π it x t , where 0 ≤ π it ≤
1, and (cid:88) x t ∈ EP ( X ) π it = 1 andan, Khargonekar, and Talathi We can see that functions of any data vector in X can be computed using only EP ( X )and the convex combination weights { π it } . The design of AESVM is motivated by theintuition that the use of extreme points may provide computational efficiency. However,extreme points are not useful in all cases, as for some kernels all data vectors are extremepoints in kernel space. For example, for the Gaussian kernel, K ( x i , x i ) = φ ( x i ) T φ ( x i ) = 1.This implies that all the data vectors lie on the surface of the unit ball in the Gaussian kernelspace and therefore are extreme points. Hence, we introduce the concept of approximateextreme points .Consider the set of transformed data vectors: Z = { z i : z i = φ ( x i ) , ∀ x i ∈ X } (4)Here, the explicit representation of vectors in kernel space is only for the ease of under-standing and all the computations are performed using kernel products. Let V be a positiveinteger that is much smaller than N and (cid:15) be a small positive real number. For notationalsimplicity, we assume N is divisible by V . Let Z l be subsets of Z for l = 1 , , ..., ( NV ), suchthat Z = ∪ l Z l and Z l ∩ Z m = ∅ for l (cid:54) = m , where m = 1 , , ..., ( NV ). We require that thesubsets Z l satisfy | Z l | = V, ∀ l and ∀ z i , z j ∈ Z l , we have y i = y j (5)Let Z ql denote an arbitrary subset of Z l . Next, for any z i ∈ Z l we define: f ( z i , Z ql ) = min µ i (cid:107) z i − (cid:88) z t ∈ Z ql µ it z t (cid:107) (6)s.t. 0 ≤ µ it ≤
1, and (cid:88) z t ∈ Z ql µ it = 1Consider the collection of subsets Z (cid:15) := { Z ql : max z i ∈ Z l f ( z i , Z ql ) ≤ (cid:15) } A set of approximate extreme points of Z l is denoted by Z ∗ l , and is defined as follows Z ∗ l ∈ argmin Z ql ∈Z (cid:15) | Z ql | (7)It can be seen that µ it for z t ∈ Z ∗ l are analogous to the convex combination weights π it for x t ∈ EP ( X ). The representative set Z ∗ of Z is the union of the sets of approximate extremepoints of its subsets Z l . Z ∗ = NV ∪ l =1 Z ∗ l
1. The properties derived for AESVM in Section 3.2 are valid for any Z ql . The requirement for the smallest Z ql is made only for the sake of a computationally simpler AESVM problem ast SVM training using approximate extreme points The representative set has properties that are similar to EP ( X ). Given any z i ∈ Z , wecan find Z l such that z i ∈ Z l . Let γ it = { µ it for z t ∈ Z ∗ l and z i ∈ Z l , and 0 otherwise } . Nowusing (6), we can write: z i = (cid:88) z t ∈ Z ∗ γ it z t + τ i (8)Here τ i is a vector that accounts for the approximation error f ( z i , Z ql ) in (6). From (6)-(8)we can conclude that: (cid:107) τ i (cid:107) ≤ (cid:15) ∀ z i ∈ Z (9)Since (cid:15) will be set to a very small positive constant, we can infer that τ i is a very smallvector. The weights γ it are used to define β t in (2) as: β t = N (cid:88) i =1 γ it (10)For ease of notation, we refer to the set X ∗ := { x t : z t ∈ Z ∗ } as the representativeset of X in the remainder of this paper. For the sake of simplicity, we assume that all γ it , β t , X , and X ∗ are arranged so that X ∗ is positioned as the first M vectors of X , where M = | Z ∗ | . Consider the following optimization problem. min w ,b F ( w , b ) = 12 (cid:107) w (cid:107) + CN N (cid:88) i =1 l ( w , b, u i ) (11)where u i = M (cid:88) t =1 γ it z t , z t ∈ Z ∗ , w ∈ H , and b ∈ R We use the problem in (11) as an intermediary between (1) and (2). The intermediateproblem (11) has a direct relation to the AESVM problem, as given in the following theorem.The properties of the max function given below are relevant to the following discussion: max (0 , A + B ) ≤ max (0 , A ) + max (0 , B ) (12) max (0 , A − B ) ≥ max (0 , A ) − max (0 , B ) (13) N (cid:88) i =1 max (0 , c i A ) = max (0 , A ) N (cid:88) i =1 c i (14)for A, B, c i ∈ R and c i ≥ Theorem 1
Let F ( w , b ) and F ( w , b ) be as defined in (11) and (2) respectively. Then, F ( w , b ) ≤ F ( w , b ) , ∀ w ∈ H and b ∈ R andan, Khargonekar, and Talathi Proof
Let L ( w , b, X ∗ ) = CN M (cid:80) t =1 l ( w , b, z t ) N (cid:80) i =1 γ it and L ( w , b, X ∗ ) = CN N (cid:80) i =1 l ( w , b, u i ), where u i = M (cid:80) t =1 γ it z t . From the properties of γ it in (6), and from (5) we get: L ( w , b, X ∗ ) = CN N (cid:88) i =1 max (cid:34) , (cid:40) − y i ( w T M (cid:88) t =1 γ it z t + b ) (cid:41)(cid:35) (15)= CN N (cid:88) i =1 max (cid:34) , M (cid:88) t =1 γ it (cid:8) − y t ( w T z t + b ) (cid:9)(cid:35) Using properties (12) and (14) we get: L ( w , b, X ∗ ) ≤ CN N (cid:88) i =1 M (cid:88) t =1 max (cid:2) , γ it (cid:8) − y t ( w T z t + b ) (cid:9)(cid:3) = CN M (cid:88) t =1 max (cid:2) , − y t ( w T z t + b ) (cid:3) N (cid:88) i =1 γ it = L ( w , b, X ∗ )Adding (cid:107) w (cid:107) to both sides of the inequality above we get F ( w , b ) ≤ F ( w , b )The following theorem gives a relationship between the SVM problem and the intermediateproblem. Theorem 2
Let F ( w , b ) and F ( w , b ) be as defined in (1) and (11) respectively. Then, − CN N (cid:88) i =1 max (cid:8) , y i w T τ i (cid:9) ≤ F ( w , b ) − F ( w , b ) ≤ CN N (cid:88) i =1 max (cid:8) , − y i w T τ i (cid:9) ∀ w ∈ H and b ∈ R , where τ i ∈ H is the vector defined in (8). Proof
Let L ( w , b, X ) = CN N (cid:80) i =1 l ( w , b, z i ), denote the average hinge loss that is minimizedin (1) and L ( w , b, X ∗ ) be as defined in Theorem 1. Using (8) and (1) we get: L ( w , b, X ) = CN N (cid:88) i =1 max (cid:8) , − y i ( w T z i + b ) (cid:9) = CN N (cid:88) i =1 max (cid:40) , − y i ( w T ( M (cid:88) t =1 γ it z t + τ i ) + b ) (cid:41) ast SVM training using approximate extreme points From the properties of γ it in (6), and from (5) we get: L ( w , b, X ) = CN N (cid:88) i =1 max (cid:40) , M (cid:88) t =1 γ it (1 − y t ( w T z t + b )) − y i w T τ i (cid:41) (16)Using (12) on (16), we get: L ( w , b, X ) ≤ CN N (cid:88) i =1 max (cid:34) , M (cid:88) t =1 γ it (cid:8) − y t ( w T z t + b ) (cid:9)(cid:35) + CN N (cid:88) i =1 max (cid:8) , − y i w T τ i (cid:9) = L ( w , b, X ∗ ) + CN N (cid:88) i =1 max (cid:8) , − y i w T τ i (cid:9) Using (13) on (16), we get: L ( w , b, X ) ≥ CN N (cid:88) i =1 max (cid:34) , M (cid:88) t =1 γ it (cid:8) − y t ( w T z t + b ) (cid:9)(cid:35) − CN N (cid:88) i =1 max (cid:8) , y i w T τ i (cid:9) = L ( w , b, X ∗ ) − CN N (cid:88) i =1 max (cid:8) , y i w T τ i (cid:9) From the two inequalities above we get, L ( w , b, X ∗ ) − CN N (cid:88) i =1 max (cid:8) , y i w T τ i (cid:9) ≤ L ( w , b, X ) ≤ L ( w , b, X ∗ )+ CN N (cid:88) i =1 max (cid:8) , − y i w T τ i (cid:9) Adding (cid:107) w (cid:107) to the inequality above we get F ( w , b ) − CN N (cid:88) i =1 max (cid:8) , y i w T τ i (cid:9) ≤ F ( w , b ) ≤ F ( w , b ) + CN N (cid:88) i =1 max (cid:8) , − y i w T τ i (cid:9) Using the above theorems we derive the following corollaries. These results provide thetheoretical justification for AESVM.
Corollary 3
Let ( w ∗ , b ∗ ) be the solution of (1) and ( w ∗ , b ∗ ) be the solution of (2). Then, F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ C √ C(cid:15)
Proof
It is known that (cid:107) w ∗ (cid:107) ≤ √ C (refer Theorem 1 in Shalev-Shwartz et al. (2011)). Itis straight forward to see that the same result also applies to AESVM, (cid:107) w ∗ (cid:107) ≤ √ C . Basedon (9) we know that (cid:107) τ i (cid:107) ≤ √ (cid:15) . From Theorem 2 we get: F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ CN N (cid:88) i =1 max (cid:8) , − y i w ∗ T τ i (cid:9) ≤ CN N (cid:88) i =1 (cid:107) w ∗ (cid:107)(cid:107) τ i (cid:107)≤ CN N (cid:88) i =1 √ C(cid:15) = C √ C(cid:15) andan, Khargonekar, and Talathi Since ( w ∗ , b ∗ ) is the solution of (1), F ( w ∗ , b ∗ ) ≤ F ( w ∗ , b ∗ ). Using this property andTheorem 1 in the inequality above, we get: F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ C √ C(cid:15) (17)Now we demonstrate some properties of AESVM using the dual problem formulationsof AESVM and the intermediate problem. The dual form of AESVM is given by: max ˆ α L ( ˆ α ) = M (cid:88) t =1 ˆ α t − M (cid:88) t =1 M (cid:88) s =1 ˆ α t ˆ α s y t y s z Tt z s (18)subject to 0 ≤ ˆ α t ≤ CN N (cid:88) i =1 γ it and M (cid:88) t =1 ˆ α t y t = 0The dual form of the intermediate problem is given by: max ˘ α L ( ˘ α ) = N (cid:88) i =1 ˘ α i − N (cid:88) i =1 N (cid:88) j =1 ˘ α i ˘ α j y i y j u Ti u j (19)subject to 0 ≤ ˘ α i ≤ CN and N (cid:88) i =1 ˘ α i y i = 0Consider the mapping function h : R N → R M , defined as h ( ˘ α ) = { ˜ α t : ˜ α t = N (cid:88) i =1 γ it ˘ α i } (20)It can be seen that the objective functions L ( h ( ˘ α )) and L ( ˘ α ) are identical. L ( h ( ˘ α )) = M (cid:88) t =1 ˜ α t − M (cid:88) t =1 M (cid:88) s =1 ˜ α t ˜ α s y t y s z Tt z s = N (cid:88) i =1 ˘ α i − N (cid:88) i =1 N (cid:88) j =1 ˘ α i ˘ α j y i y j u Ti u j = L ( ˘ α )It is also straight forward to see that, for any feasible ˘ α of (19), h ( ˘ α ) is a feasible point of(18) as it satisfies the constraints in (18). However, the converse is not always true. Withthat clarification, we present the following corollary. Corollary 4
Let ( w ∗ , b ∗ ) be the solution of (1) and ( w ∗ , b ∗ ) be the solution of (2). Let ˆ α bethe dual variable corresponding to ( w ∗ , b ∗ ) . Let h ( ˘ α ) be as defined in (20). If there existsan ˘ α such that h ( ˘ α ) = ˆ α and ˘ α is a feasible point of (19), then, F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ C √ C(cid:15) ast SVM training using approximate extreme points Proof
Let ( w ∗ , b ∗ ) be the solution of (11) and ˘ α the solution of (19). We know that L ( ˘ α ) = L ( ˆ α ) = F ( w ∗ , b ∗ ) and L ( ˘ α ) = F ( w ∗ , b ∗ ). Since L ( ˘ α ) ≥ L ( ˘ α ), we get F ( w ∗ , b ∗ ) ≥ F ( w ∗ , b ∗ )But, from Theorem 1 we know F ( w ∗ , b ∗ ) ≤ F ( w ∗ , b ∗ ) ≤ F ( w ∗ , b ∗ ). Hence F ( w ∗ , b ∗ ) = F ( w ∗ , b ∗ )From the above result we get F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ − CN N (cid:88) i =1 max (cid:8) , y i w ∗ T τ i (cid:9) ≤ F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) (22) F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ CN N (cid:88) i =1 max (cid:8) , − y i w ∗ T τ i (cid:9) (23)Adding (22) and (23) we get: F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ R + CN N (cid:88) i =1 (cid:2) max (cid:8) , − y i w ∗ T τ i (cid:9) + max (cid:8) , y i w ∗ T τ i (cid:9)(cid:3) (24)where R = F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ). Using (21) and the properties (cid:107) w ∗ (cid:107) ≤ √ C and (cid:107) w ∗ (cid:107) ≤√ C in (24): F ( w ∗ , b ∗ ) − F ( w ∗ , b ∗ ) ≤ CN N (cid:88) i =1 (cid:2) max (cid:8) , − y i w ∗ T τ i (cid:9) + max (cid:8) , y i w ∗ T τ i (cid:9)(cid:3) ≤ CN N (cid:88) i =1 (cid:107) w ∗ (cid:107)(cid:107) τ i (cid:107) + (cid:107) w ∗ (cid:107)(cid:107) τ i (cid:107)≤ CN N (cid:88) i =1 √ C(cid:15) = 2 C √ C(cid:15)
Now we prove a relationship between AESVM and the Gram matrix approximationmethods mentioned in Section 2.1.
Corollary 5
Let L ( α ) , L ( ˘ α ) , and F ( w , b ) be the objective functions of the SVM dual(3), intermediate dual (19) and AESVM (2) respectively. Let z i , τ i , and u i be as definedin (4), (8), and (11) respectively. Let G and ˜ G be the N x N matrices with G ij = y i y j z Ti z j and ˜ G ij = y i y j u Ti u j respectively. Then for any feasible ˘ α, α, w , and b : andan, Khargonekar, and Talathi
1. Rank of ˜ G = M, L ( α ) = N (cid:80) i =1 α i − α G α T , L ( ˘ α ) = N (cid:80) i =1 ˘ α i − ˘ α ˜ G ˘ α T , and Trace( G − ˜ G ) ≤ N (cid:15) + 2 M (cid:88) t =1 z Tt N (cid:88) i =1 γ it τ i F ( w , b ) ≥ L ( ˘ α ) Proof
Using G , the SVM dual objective function L ( α ) can be represented as: L ( α ) = N (cid:88) i =1 α i − α G α T Similarly, L ( ˘ α ) can be represented using ˜ G as: L ( ˘ α ) = N (cid:88) i =1 ˘ α i −
12 ˘ α ˜ G ˘ α T Applying u i = M (cid:80) t =1 γ it z t , ∀ z t ∈ Z ∗ to the definition of ˜ G , we get:˜ G = Γ A Γ T Here A is the M x M matrix comprised of A ts = y t y s z Tt z s , ∀ z t , z s ∈ Z ∗ and Γ is the N x M matrix with the elements Γ it = γ it . Hence the rank of ˜ G = M and intermediate dualproblem (19) is a low rank approximation of the SVM dual problem (3).The Gram matrix approximation error can be quantified using (8) and (9) as:Trace( G − ˜ G ) = N (cid:88) i =1 (cid:34) z Ti z i − ( M (cid:88) t =1 γ it z t ) T ( M (cid:88) s =1 γ is z s ) (cid:35) = N (cid:88) i =1 (cid:34) τ Ti τ i + 2 M (cid:88) t =1 γ it z Tt τ i (cid:35) ≤ N (cid:15) + 2 M (cid:88) t =1 z Tt N (cid:88) i =1 γ it τ i By the principle of duality, we know that F ( w , b ) ≥ L ( ˘ α ) , ∀ w ∈ H and b ∈ R , where˘ α is any feasible point of (19). Using Theorem 1 on the inequality above, we get F ( w , b ) ≥ L ( ˘ α ) , ∀ w ∈ H , b ∈ R and feasible ˘ α Thus the AESVM problem minimizes an upper bound ( F ( w , b )) of a rank M Gram matrixapproximation of L ( α ).Based on the theoretical results in this section, it is reasonable to suggest that for smallvalues of (cid:15) , the solution of AESVM is close to the solution of SVM. ast SVM training using approximate extreme points
4. Computation of the representative set
In this section, we present algorithms to compute the representative set.
The AESVMformulation can be solved with any standard SVM solver such as SMO and hence we donot discuss methods to solve it . As described in Section 3.1, we require an algorithm tocompute approximate extreme points in kernel space. Osuna and Castro (2002) proposedan algorithm to derive extreme points of the convex hull of a dataset in kernel space.Their algorithm is computationally intensive, with a time complexity of O ( N S ( N )), andis unsuitable for large datasets as S ( N ) typically has a super-linear dependence on N. Thefunction S ( N ) denotes the time complexity of a SVM solver (required by their algorithm),to train on a dataset of size N. We next propose two algorithms leveraging the work byOsuna and Castro (2002) to compute the representative set in kernel space Z ∗ with muchsmaller time complexities.We followed the divide and conquer approach to develop our algorithms. The dataset isfirst divided into subsets X q , q = 1 , , .., Q , where | X q | < P , Q ≥ NP and X = { X , X , .., X Q } .The parameter P is a predefined large integer. It is desired that each subset X q containsdata vectors that are more similar to each other than data vectors in other subsets. Ournotion of similarity of data vectors in a subset, is that the distances between data vectorswithin a subset is less than the distances between data vectors in distinct subsets. Thisfirst level of segregation is followed by another level of segregation. We can regard the firstlevel of segregation as coarse segregation and the second as fine segregation. Finally, theapproximate extreme points of the subsets obtained after segregation, are computed. Thetwo different algorithms to compute the representative set differ only in the first level ofsegregation as described in the following sections. We propose the methods, FLS1 and FLS2 given below to perform a first level of segregation.In the following description we use arrays ∆ (cid:48) and ∆ (cid:48) of N elements. Each element of ∆ (cid:48) (∆ (cid:48) ), δ i ( δ i ) , contains the index in X of the last data vector of the subset to which x i belongs. It is straight forward to replace this N element array with a smaller array of sizeequal to the number of subsets. We use a N element array for ease of description.
1. FLS1( X (cid:48) , P ) For some applications, such as anomaly detection on sequential data, data vectors arefound to be homogeneous within intervals. For example, the atmospheric conditions typ-ically do not change within a few minutes and hence weather data is homogeneous for ashort span. For such datasets it is enough to segregate the data vectors based on its positionin the training dataset. The same method can also be used on very large datasets withoutany homogeneity, in order to reduce computation time. The complexity of this method is O ( N (cid:48) ), where N (cid:48) = | X (cid:48) | .
2. FLS2( X (cid:48) , P ) When the dataset is not homogeneous within intervals or it is not excessively large weuse the more sophisticated algorithm, FLS2, of time complexity O ( N (cid:48) log N (cid:48) P ) given below.In step 1 of FLS2, the distance d i in kernel space of all x i ∈ X (cid:48) from x j is computed as d i = (cid:107) φ ( x i ) − φ ( x j ) (cid:107) = k ( x i , x i ) + k ( x j , x j ) − k ( x i , x j ). The algorithm FLS2( X (cid:48) , P ), in andan, Khargonekar, and Talathi [ X (cid:48) ,∆ (cid:48) ] = FLS1( X (cid:48) , P )1. For outerIndex = 1 t o ceiling( | X (cid:48) | P )2. For innerIndex = (outerIndex - 1) P t o min((outerIndex) P , | X (cid:48) | )3. Set δ innerIndex = min (( outerIndex ) P, | X (cid:48) | )effect builds a binary search tree, with each node containing the data vector x k selected instep 2 that partitions a subset of the dataset into two. The size of the subsets successivelyhalve, on downward traversal from the root of the tree to the other nodes. When the size ofall the subsets at a level become ≤ P the algorithm halts. The complexity of FLS2 can bederived easily when the algorithm is considered as an incomplete binary search tree buildingmethod. The last level of such a tree will have O ( N (cid:48) P ) nodes and consequently the heightof the tree is O (log N (cid:48) P ). At each level of the tree the calls to the BFPRT algorithm (Blumet al., 1973) and the rearrangement of the data vectors in steps 2 and 3 are of O ( N (cid:48) ) timecomplexity. Hence the overall time complexity of FLS2( X (cid:48) , P ) is O ( N (cid:48) log N (cid:48) P ).[ X (cid:48) ,∆ (cid:48) ] = FLS2( X (cid:48) , P )1. Compute distance d i in kernel space of all x i ∈ X (cid:48) from the first vector x j in X (cid:48)
2. Select x k such that there exists | X (cid:48) | data vectors x i ∈ X (cid:48) with d i < d k , using thelinear time BFPRT algorithm3. Using x k , rearrange X (cid:48) as X (cid:48) = { X , X } , where X = { x i : d i < d k , x i ∈ X (cid:48) } and X = { x i : x i ∈ X (cid:48) and x i (cid:54)∈ X }
4. If | X (cid:48) | ≤ P For i where x i ∈ X , set δ i = index of last data vector in X .For i where x i ∈ X , set δ i = index of last data vector in X .5. If | X (cid:48) | > P Run FLS2( X , P ) and FLS2( X , P ) After the initial segregation, another method SLS( X (cid:48) , V, ∆ (cid:48) ) is used to further segregate eachset X q into smaller subsets X q r of maximum size V , X q = { X q , X q , ...., X q R } , where V ispredefined ( V < P ) and R = ceiling ( | X q | V ). The algorithm SLS( X (cid:48) , V, ∆ (cid:48) ) is given below.In step 2.b, x t is the data vector in X q that is farthest from the origin in the space of thedata vectors. For some kernels, such as the Gaussian kernel, all data vectors are equidistantfrom the origin in kernel space. If the algorithm chooses a l in step 2.b based on distances insuch kernel spaces, the choice would be arbitrary and such a situation is avoided here. Eachiteration of the For loop in step 2 involves several runs of the BFPRT algorithm, with each ast SVM training using approximate extreme points run followed by a rearrangement of X q . Specifically, the BFPRT algorithm is first run on P data vectors, then on P − V data vectors, then on P − V data vectors and so on. The timecomplexity of each iteration of the For loop including the BFPRT algorithm run and therearrangement of data vectors is: O ( P + ( P − V ) + ( P − V ) + .. + V ) ⇒ O ( P V ). The overallcomplexity of SLS( X (cid:48) , V, ∆ (cid:48) ) considering the Q For loop iterations is O ( N (cid:48) P P V ) ⇒ O ( N (cid:48) PV ),since Q = O ( N (cid:48) P ).[ X (cid:48) ,∆ (cid:48) ] = SLS( X (cid:48) , V, ∆ (cid:48) )1. Initialize l = 12. For q = 1 t o Q (a) Identify subset X q of X (cid:48) using ∆ (cid:48) (b) Set a l = φ ( x t ), where x t ∈ argmax i (cid:107) x i (cid:107) , x i ∈ X q (c) Compute distance d i in kernel space of all x i ∈ X q from a l (d) Select x k such that, there exists V data vectors x i ∈ X q with d i < d k , using theBFPRT algorithm(e) Using x k , rearrange X q as X q = { X , X } , where X = { x i : d i < d k , x i ∈ X q } and X = { x i : x i ∈ X q and x i (cid:54)∈ X } (f) For i where x i ∈ X , set δ i = index of last data vector in X , where δ i is the i th element of ∆ (cid:48) (g) Remove X from X q (h) If | X | > V Set: l = l + 1 and a l = x k Repeat steps 2.c to 2.h(i) If | X | ≤ V For i where x i ∈ X , set δ i = index of last data vector in X After computing the subsets X q r , the algorithm DeriveAE is applied to each X q r to computeits approximate extreme points. The algorithm DeriveAE is described below. DeriveAE usesthree routines. SphereSet( X q r ) returns all x i ∈ X q r that lie on the surface of the smallesthypersphere in kernel space that contains X q r . It computes the hypersphere as a hardmargin support vector data descriptor (SVDD) (Tax and Duin, 2004). SphereSort( X q r )returns data vectors x i ∈ X q r sorted in descending order of distance in the kernel spacefrom the center of the SVDD hypersphere. CheckPoint( x i , Ψ) returns TRUE if x i is anapproximate extreme point of the set Ψ in kernel space. The operator A \ B indicates aset operation that returns the set of the members of A excluding A ∩ B . The matrix X ∗ q r contains the approximate extreme points of X q r and β q r is a | X ∗ q r | sized vector. andan, Khargonekar, and Talathi [ X ∗ q r , β q r ] = DeriveAE( X q r )1. Initialize: X ∗ q r = SphereSet( X q r ) and Ψ = ∅
2. Set ζ = SphereSort( X q r \ X ∗ q r )3. For each x i taken in order from ζ , call the routine CheckPoint( x i , X ∗ q r ∪ Ψ)If it returns
F ALSE , then set Ψ = Ψ ∪ x i
4. For each x i ∈ Ψ, execute CheckPoint( x i , X ∗ q r ∪ { Ψ \ x i } )If it returns F ALSE , set X ∗ q r = X ∗ q r ∪ x i
5. Initialize a matrix Γ of size | X q r | x | X ∗ q r | with all elements set to 0Set µ kk = 1 ∀ x k ∈ X ∗ q r , where µ ij is the element in the i th row and j th column of Γ6. For each x i ∈ X q r and x i (cid:54)∈ X ∗ q r , execute CheckPoint( x i , X ∗ q r )Set the i th row of Γ = µ i , where µ i is the result of CheckPoint( x i , X ∗ q r )7. For j = 1 t o | X ∗ q r | Set β jq r = | X qr | (cid:80) k =1 µ kj CheckPoint( x i , Ψ) is computed by solving the following quadratic optimization problem: min µ i p ( x i , Ψ) = (cid:107) φ ( x i ) − | Ψ | (cid:88) t =1 µ it φ ( x t ) (cid:107) s.t. x t ∈ Ψ , ≤ µ it ≤ | Ψ | (cid:88) t =1 µ it = 1where (cid:107) φ ( x i ) − | Ψ | (cid:80) t =1 µ it φ ( x t ) (cid:107) = K ( x t , x t ) + | Ψ | (cid:80) t =1 | Ψ | (cid:80) s =1 µ it µ is K ( x t , x s ) − | Ψ | (cid:80) t =1 µ it K ( x i , x t ). If theoptimized value of p ( x i , Ψ) ≤ (cid:15) , CheckPoint( x i , Ψ) returns TRUE and otherwise it returnsFALSE. It can be seen that the formulation of p ( x i , Ψ) is similar to (6). The value of µ i computed by CheckPoint( z i , Ψ ), is used in step 6 of DeriveAE.Now we compute the time complexity of DeriveAE. We use the fact that the optimizationproblem in CheckPoint( x i , Ψ) is essentially the same as the dual optimization problem ofSVM given in (3). Since DeriveAE solves several SVM training problems in steps 1,3,4,and 6, it is necessary to know the training time complexity of a SVM. As any SVM solvermethod can be used, we denote the training time complexity of each step of DeriveAE thatsolves an SVM problem as O ( S ( A q r )) . Here A q r is the largest value of X ∗ q r ∪ Ψ during therun of DeriveAE( X q r ). This enables us to derive a generic expression for the complexity ofDeriveAE, independent of the SVM solver method used. Hence the time complexity of step 1is O ( S ( A q r )). The time complexity of steps 3, 4 and 6 are O ( V S ( A q r )), O ( A q r S ( A q r )), and
2. For SMO based implementations, such as the implementation we used for Section 5, S ( A ) = O ( A ) ast SVM training using approximate extreme points O ( A q r S ( A q r )) respectively. The time complexity of step 2 is O ( V | Ψ | + V log V ), whereΨ = SphereSet( X q r ). Hence the time complexity of DeriveAE is O ( V | Ψ | + V log V + V S ( A q r ) + A q r S ( A q r )). Since | Ψ | is typically very small and A q r ≤ V , we denote thetime complexity of DeriveAE by O ( V log V + V S ( A q r )). X ∗ To derive X ∗ , it is required to first rearrange X , so that data vectors from each classare grouped together as X = { X + , X − } . Here X + = { x i : y i = 1 , x i ∈ X } and X − = { x i : y i = − , x i ∈ X } . Then the selected segregation methods are run on X + and X − separately. The algorithm DeriveRS given below, combines all the algorithms definedearlier in this section with a few additional steps, to compute the representative set of X . The complexity of DeriveRS can easily be computed by summing the complexitiesof its steps. The complexity of steps 1 and 6 is O(N). The complexity of step 2 is O ( N )if FLS1 is run or O ( N log NP ) if FLS2 is run. In step 3, the O ( NPV ) method SLS is run.In steps 4 and 5, DeriveAE is run on all the subsets X q r giving a total complexity of O ( N log V + V Q (cid:80) q =1 R (cid:80) r =1 S ( A q r )). Here we use the fact that the number of subsets X q r is O ( NV ). Thus the complexity of DeriveRS is O ( N ( PV + log V ) + V Q (cid:80) q =1 R (cid:80) r =1 S ( A q r )) when FLS1is used and O ( N (log NP + PV + log V ) + V Q (cid:80) q =1 R (cid:80) r =1 S ( A q r )) when FLS2 is used.[ X ∗ , Y ∗ , β ] = DeriveRS( X , Y ,P,V)1. Set X + = { x i : x i ∈ X , y i = 1 } and X − = { x i : x i ∈ X , y i = − }
2. Run [ X + , ∆ + ] = FLS( X + ,P) and [ X − , ∆ − ] = FLS( X − ,P), where FLS is FLS1 orFLS23. Run [ X + , ∆ +2 ] = SLS( X + ,V,∆ + ) and [ X − , ∆ − ] = SLS( X − ,V,∆ − )4. Using ∆ +2 , identify each subset X q r of X + and run [ X ∗ q r , β q r ] = DeriveAE( X q r )Set N + ∗ = sum of number of data vectors in all X ∗ q r derived from X +
5. Using ∆ − , identify each subset X q r of X − and run [ X ∗ q r , β q r ] = DeriveAE( X q r )Set N −∗ = sum of number of data vectors in all X ∗ q r derived from X −
6. Combine in the same order, all X ∗ q r to obtain X ∗ and all β q r to obtain β Set Y ∗ = { y i : y i = 1 for i = 1 , , .., N + ∗ ; and y i = − i = 1 + N + ∗ , N + ∗ , .., N −∗ + N + ∗ }
3. We present DeriveRS as one algorithm in spite of its two variants that use FLS1 or FLS2, for simplicityand to conserve space. andan, Khargonekar, and Talathi
5. Experiments
We focused our experiments on an SMO (Fan et al., 2005) based implementation of AESVMand DeriveRS. We evaluated the classification performance of AESVM using the ninedatasets, described below. Next, we present an evaluation of the algorithm DeriveRS,followed by an evaluation of AESVM.
Nine datasets of varied size, dimensionality and density were used to evaluate DeriveRS andour AESVM implementation. For datasets D2, D3 and D4, we performed five fold crossvalidation. We did not perform five fold cross-validation on the other datasets, becausethey have been widely used in their native form with a separate training and testing set.
D1:
KDD’99 intrusion detection dataset - This dataset is available as a training set of4898431 data vectors and a testing set of 311027 data vectors, with forty one features( D = 41). As described in Tavallaee et al. (2009), a huge portion of this dataset iscomprised of repeated data vectors. Experiments were conducted only on the distinctdata vectors. The number of distinct training set vectors was N = 1074974 and thenumber of distinct testing set vectors was N = 77216. The training set density =33%. D2:
Localization data for person activity - This dataset has been used in a study onagent-based care for independent living (Kaluza et al., 2010). It has N = 164860data vectors of seven features. It is comprised of continuous recordings from sensorsattached to five people and can be used to predict the activity that was performed byeach person at the time of data collection. In our experiments we used this datasetto validate a binary problem of classifying the activities ’lying’ and ’lying down’ fromthe other activities. Features 3 and 4, that gives the time information, were not usedin our experiments. Hence for this dataset D = 5. The dataset density = 96%. D3:
Seizure detection dataset - This dataset has N = 982863 data vectors, three features( D = 3) and density = 100%. It is comprised of continuous EEG recordings from ratsinduced with status epilepticus and is used to evaluate algorithms that classify seizureevents from seizure-free EEG. An important characteristic of this dataset is that itis highly unbalanced, the total number of data vectors corresponding to seizures isminuscule compared to the remaining data. Details of the dataset can be found inNandan et al. (2010), where it is used as dataset A. D4:
Forest cover type dataset - This dataset has N = 581012 data vectors and fifty fourfeatures ( D = 54) and density = 22%. It is used to classify the forest cover of areasof 30mx30m size into one of seven types. We followed the method used in Collobertet al. (2002), where a classification of forest cover type 2 from the other cover typeswas performed. http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data http://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity http://archive.ics.uci.edu/ml/datasets/Covertype ast SVM training using approximate extreme points D5 :
IJCNN1 dataset - This dataset was used in IJCNN 2001 generalization ability chal-lenge (Chang and Lin, 2001a). The training set and testing set have 49990 ( N =49990) and 91701 data vectors respectively. It has 22 features ( D = 22) and trainingset density = 59% D6 :
Adult income dataset - This dataset derived from the 1994 Census database, was usedto classify incomes over $50000 from those below it. The training set has N = 32561with D = 123 and density = 11%, while the testing set has 16281 data vectors. Thedata is pre-processed as described in Platt (1999). D7 :
Epsilon dataset - This is a dataset that was used for 2008 Pascal large scale learningchallenge and in Yuan et al. (2011). It is comprised of 400000 data vectors that are100% dense with D = 2000. Since this is too large for our experiments, we usedthe first 10% of the training set giving N = 40000. The testing set has 100000 datavectors. D8 :
MNIST character recognition dataset - The widely used dataset (Lecun et al., 1998)of hand written characters has a training set of N = 60000, D = 780 and density =19%. We performed the binary classification task of classifying the character ’0’ fromthe others. The testing set has 10000 data vectors. D9 : w8a dataset - This artificial dataset used in Platt (1999) was randomly generatedand has D = 300 features. The training set has N = 49749 with a density = 4% andthe testing set has 14951 data vectors. We began our experiments with an evaluation of the algorithm DeriveRS, described inSection 4. The performances of the two methods FLS1 and FLS2 were compared first. Weran DeriveRS on D1, D2, D4 and D5 with the parameters P = 10 , V = 10 , (cid:15) = 10 − , and g = [2 − , − , − , ..., ], first with FLS1 and then FLS2. For D2, DeriveRS was run on theentire dataset for this particular experiment, instead of performing five fold cross-validation.This was done because, D2 is a small dataset and the difference between the two first levelsegregation methods can be better observed when the dataset is as large as possible. Therelatively small value of P = 10 was also chosen considering the small size of D2 and D5.To evaluate the effectiveness of FLS1 and FLS2, we also ran DeriveRS with FLS1 and FLS2after randomly reordering each dataset. The results are shown in Figure 1.For datasets D1 and D5, FLS2 gave smaller representative sets in a shorter durationthan FLS1. As expected, for the relatively homogeneous dataset D2, FLS1 and FLS2 gavesimilar results, with FLS2 giving slightly larger representative sets. Dataset D4 was seen tohave much smaller representative sets with FLS1 than with FLS2. The results of DeriveRSobtained after randomly rearranging the datasets, indicate the utility of FLS2. For all the andan, Khargonekar, and Talathi Figure 1: Performance of variants of DeriveRS with g = [2 − , − , − , ..., ], for datasetsD1, D2, D4, and D5. The results of DeriveRS with FLS1 and FLS2, after ran-domly reordering the datasets are shown as Random+FLS1 and Random+FLS2,respectivelydatasets, the results of FLS2 after random reordering was seen to be significantly betterthan the results of FLS1 after random rearrangement. Hence we can infer that the goodresults obtained with FLS2 are not caused by any pre-existing order in the datasets. AfterD2 and D4 were randomly rearranged a sharp increase was observed in representative setsizes and computation times for DeriveRS with FLS1. This indicates the importance ofdataset homogeneity to the performance of FLS1. The results indicated for randomizedexperiments on DeriveRS are the averages of five repetitions.Next we investigated the impact of changes in the values of the parameters P and V on the performance of DeriveRS. All combinations of P = { , , , } and V = { , , , , } were used to compute the representative set of D1. Thecomputations were performed for (cid:15) = 10 − and g = 1. The method FLS2 was used for thefirst level segregation in DeriveRS. The results are shown in Table 1. As expected for analgorithm of time complexity O ( N (log NP + PV + log V ) + V Q (cid:80) q =1 R (cid:80) r =1 S ( A q r )), the computationtime was generally observed to increase for an increase in the value of V or P . It should benoted that our implementation of DeriveRS was based on SMO and hence S ( A q r ) = O ( A q r ).In some cases the computation time decreased when P or V increased. 
This is caused by adecrease in the value of O ( Q (cid:80) q =1 R (cid:80) r =1 A q r ), which is inferred from the observed decrease of the ast SVM training using approximate extreme points size of the representative set M ( M ≈ Q (cid:80) q =1 R (cid:80) r =1 A q r ). A sharp decrease in M was observedwhen V was increased. The impact of increasing P on the size of the representative set wasfound to be less drastic. This observation indicates that DeriveAE selects fewer approximateextreme points when V is larger. MN x100% (Computation time in seconds) P V = 10 V = 5x10 V = 10 V = 2x10 V = 3x10 P and V on the result of DeriveRSAs described in Section 5.3, we compared several SVM training algorithms with ourimplementation of AESVM. We performed a grid search with all combinations of the SVMhyper-parameters C (cid:48) = { − , − , ..., , } and g = { − , − , − , ..., , } . The hyper-parameter C (cid:48) is related to the hyper-parameter C as C (cid:48) = CN . We represent the grid interms of C (cid:48) as it is used in several SVM solvers such as LIBSVM, LASVM, CVM andBVM. Furthermore, the use of C (cid:48) enables the application of the same hyper-parameter gridto all datasets. To train AESVM with all the hyper-parameter combinations in the grid,the representative set has to be computed using DeriveRS for all values of kernel hyper-parameter g in the grid. This is because the kernel space varies when the value of g isvaried. For all the computations, the input parameters were set as P = 10 and V = 10 .The first level segregation in DeriveRS was performed using FLS2. Three values of thetolerance parameter (cid:15) were investigated, (cid:15) = 10 − , − or 10 − .The results of the computation for datasets D1 - D5, are shown in the Table 2. Thepercentage of data vectors in the representative set was found to increase with increasingvalues of g . This is intuitive, as when g increases the distance between the data vectors inkernel space increases. With increased distances, more data vectors x i become approximateextreme points. The increase in the number of approximate extreme points with g causesthe rising trend of computation time shown in Table 2. For a decrease in the value of (cid:15) , M increases. This is because, for smaller (cid:15) fewer x i would satisfy the condition: optimized p ( x i , Ψ) ≤ (cid:15) in CheckPoint( x i , Ψ). This results in the selection of a larger number ofapproximate extreme points in DeriveAE.The results of applying DeriveRS to the high-dimensional datasets D6-D9 are shown inTable 3. It was observed that MN was much larger for D6-D9 than for the other datasets.We computed the representative set with (cid:15) = 10 − only, as for smaller values of (cid:15) we expectthe representative set to be close to 100% of the training set. The increasing trend of thesize of the representative set with increasing g values can be observed in Table 3 also. 
andan, Khargonekar, and Talathi MN x100% (Computation time in seconds) (cid:15) Dataset g = g = g = g = g = 1 g = 2 g = 2 − D1 1.5(98) 1.9(104) 2.4(110) 3.2(119) 4.3(132) 5.9(148) 8.1(168)D2 1.2(7) 1.5(8) 2(9) 2.8(11) 4.1(15) 6(18) 9.2(23)D3 0.6(37) 0.6(37) 0.6(36) 0.6(36) 0.5(37) 0.6(37) 0.6(39)D4 4.3(45) 6.4(57) 9.4(74) 13.9(103) 20.7(139) 30.7(178) 44.8(216)D5 4.5(7) 8.3(9) 14(11) 21.8(14) 31.8(18) 43.7(21) 54.9(22)10 − D1 3(136) 4(159) 5.3(191) 7.2(240) 9.9(297) 13.3(362) 17.4(435)D2 2.8(12) 3.8(18) 5(27) 6.8(37) 9.3(44) 13.5(44) 19.9(82)D3 0.5(36) 0.6(37) 0.6(38) 0.7(39) 0.8(41) 0.9(43) 1.1(47)D4 13.5(135) 18.3(211) 24.9(300) 34.2(400) 47.7(493) 63.5(513) 74.4(445)D5 20.1(16) 27.9(22) 37.4(27) 47.6(31) 57.3(34) 66(34) 74(34)10 − D1 7(316) 9.3(425) 12.2(552) 15.7(726) 19.6(926) 24.2(1112) 28.9(1235)D2 6.2(59) 7.8(87) 9.8(98) 13(109) 18.3(138) 25.6(187) 34.3(235)D3 0.7(39) 0.8(42) 0.9(45) 1.1(50) 1.4(59) 1.7(73) 2.2(100)D4 30.7(607) 39.5(814) 51.9(1051) 66(1171) 75.1(1044) 77.8(839) 78.4(649)D5 43.3(50) 51.8(58) 60.3(62) 67.7(63) 73.8(59) 78.7(52) 81.8(44)Table 2: The percentage of the data vectors in X ∗ (given by MN x100) and its computationtime for datasets D1-D5 MN x100% (Computation time in seconds)Dataset g = g = g = g = g = 1 g = 2 g = 2 D6 69.3(19) 70.4(19) 73.4(19) 80.3(14) 83.9(9) 84(8) 87.9(8)D7 84.4(1077) 84.6(1089) 84.9(1069) 85.6(1085) 86.9(1079) 89.9(1032) 94.7(818)D8 90(131) 96.6(94) 98.8(78) 99.5(72) 100(70) 100(71) 100(63)D9 60.8(34) 62.9(36) 67(30) 70.8(21) 72.7(16) 75.2(14) 76.7(15)Table 3: The percentage of data vectors in X ∗ and its computation time for datasets D6-D9with (cid:15) = 10 − To judge the accuracy and efficiency of AESVM, its classification performance was comparedwith the SMO implementation in LIBSVM, ver. 3.1. We chose LIBSVM because it is a state-of-the-art SMO implementation that is routinely used in similar comparison studies. Tocompare the efficiency of AESVM to other popular approximate SVM solvers we chose CVM,BVM, LASVM, SVM perf , and RfeatSVM. A description of these methods is given in Section2. We chose these methods because they are widely cited, their software implementationsare freely available and other studies (Shalev-Shwartz et al., 2011) have reported fast SVMtraining using some of these methods. LASVM is also an efficient method for online SVM ast SVM training using approximate extreme points training. However, since we do not investigate online SVM learning in this paper, we did nottest the online SVM training performance of LASVM. We compared AESVM with CVMand BVM even though they are L2-SVM solvers, as they has been reported to be fasteralternatives to SVM implementations such as LIBSVM.The implementation of AESVM and DeriveRS were built upon the LIBSVM implemen-tation. All methods except SVM perf were allocated a cache of size 600 MB. The parametersfor DeriveRS were P = 10 and V = 10 , and the first level segregation was performedusing FLS2. To reflect a typical SVM training scenario, we performed a grid search withall eighty four combinations of the SVM hyper-parameters C (cid:48) = { − , − , ..., , } and g = { − , − , − , ..., , } . As mentioned earlier, for datasets D2, D3 and D4, five foldcross-validation was performed. The results of the comparison have been split into sub-sections given below, due to the large number of SVM solvers and datasets used. 
First, we present the results of the performance comparison for D2 in Figures 2 and 3. For ease of representation, only the grid points corresponding to a reduced set of combinations of C′ (six values) and g (four values) are shown in Figures 2 and 3. Figure 2 shows the graph between training time and classification accuracy for the five algorithms. Figure 3 shows the graph between the number of support vectors and classification accuracy. We present classification accuracy as the ratio of the number of correct classifications to the total number of classifications performed. Since the classification time of an SVM algorithm is directly proportional to the number of support vectors, we represent it in terms of the number of support vectors. It can be seen that AESVM generally gave more accurate results in a fraction of the training time of the other algorithms, and also resulted in lower classification time. The training and classification times of AESVM increased when ε was reduced. This is expected, given the inverse relation of M to ε shown in Tables 2 and 3. The variation in accuracy with ε is not very noticeable.

Figure 2: Plot of training time (log scale) against classification accuracy of the SVM algorithms (AESVM with ε = 10^-3, 10^-4, 10^-5, CVM, BVM, LASVM, and LIBSVM) on D2

Figure 3: Plot of classification time, represented by the number of support vectors (log scale), against classification accuracy of the SVM algorithms on D2

Figures 2 and 3 indicate that AESVM gave better results than the other algorithms for SVM training and classification on D2, in terms of standard metrics. To present a more quantitative and easily interpretable comparison of the algorithms, we define the five performance metrics given below; a short computational sketch follows the list. These metrics combine the results of all runs of each algorithm into a single value for each dataset. For these metrics we take LIBSVM as the baseline of comparison, as it gives the most accurate solution among the tested methods. Furthermore, an important objective of these experiments is to show the similarity of the results of AESVM and LIBSVM. In the description below, F can refer to any approximate SVM algorithm, such as AESVM, CVM, LASVM, etc.

1. Root mean squared error of classification accuracy, RMSE: The similarity of the solution of F to LIBSVM, in terms of its classification accuracy, is indicated by:

   \mathrm{RMSE} = \left( \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \left( C^{L}_{rs} - C^{F}_{rs} \right)^{2} \right)^{1/2}.

   Here C^{L}_{rs} and C^{F}_{rs} are the classification accuracies of LIBSVM and F respectively, in the s-th cross-validation fold with the r-th set of hyper-parameters of the grid search.

2. Expected training time speedup, ETS: The expected speedup in training time is indicated by:

   \mathrm{ETS} = \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \frac{T^{L}_{rs}}{T^{F}_{rs}}.

   Here T^{L}_{rs} and T^{F}_{rs} are the training times of LIBSVM and F respectively.

3. Overall training time speedup, OTS: It indicates the overall training time speedup for the entire grid search with cross-validation, including the time taken to compute the representative set. The total time taken by DeriveRS to compute the representative set for all values of g is represented as T_{X^{*}}. For methods other than AESVM, T_{X^{*}} = 0.

   \mathrm{OTS} = \frac{\sum_{r=1}^{R} \sum_{s=1}^{S} T^{L}_{rs}}{\sum_{r=1}^{R} \sum_{s=1}^{S} T^{F}_{rs} + T_{X^{*}}}.

4. Expected classification time speedup, ECS: The expected speedup in classification time is indicated by:

   \mathrm{ECS} = \frac{1}{RS} \sum_{r=1}^{R} \sum_{s=1}^{S} \frac{N^{L}_{rs}}{N^{F}_{rs}}.

   Here N^{L}_{rs} and N^{F}_{rs} are the numbers of support vectors in the solutions of LIBSVM and F respectively.

5. Overall classification time speedup, OCS: The overall speedup in classification time is indicated by:

   \mathrm{OCS} = \frac{\sum_{r=1}^{R} \sum_{s=1}^{S} N^{L}_{rs}}{\sum_{r=1}^{R} \sum_{s=1}^{S} N^{F}_{rs}}.
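To make the definitions above concrete, the following sketch computes all five metrics from per-fold grid-search results. The array names are hypothetical and assume that the accuracies, training times and support vector counts of LIBSVM and of the method F have already been collected, one entry per hyper-parameter combination and cross-validation fold.

```python
import numpy as np

def comparison_metrics(acc_L, acc_F, t_L, t_F, nsv_L, nsv_F, t_Xstar=0.0):
    """Compute RMSE, ETS, OTS, ECS and OCS as defined above.

    Every array has shape (R, S): R hyper-parameter combinations and S
    cross-validation folds.  acc_* holds classification accuracies,
    t_* training times and nsv_* support vector counts; the suffix L
    refers to LIBSVM and F to the approximate solver being compared.
    t_Xstar is the total DeriveRS time (0 for methods other than AESVM).
    """
    acc_L, acc_F = np.asarray(acc_L, float), np.asarray(acc_F, float)
    t_L, t_F = np.asarray(t_L, float), np.asarray(t_F, float)
    nsv_L, nsv_F = np.asarray(nsv_L, float), np.asarray(nsv_F, float)

    rmse = np.sqrt(np.mean((acc_L - acc_F) ** 2))   # root mean squared accuracy error
    ets = np.mean(t_L / t_F)                         # expected training time speedup
    ots = t_L.sum() / (t_F.sum() + t_Xstar)          # overall training time speedup
    ecs = np.mean(nsv_L / nsv_F)                     # expected classification time speedup
    ocs = nsv_L.sum() / nsv_F.sum()                  # overall classification time speedup
    return {"RMSE": rmse, "ETS": ets, "OTS": ots, "ECS": ecs, "OCS": ocs}
```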
The results of the classification performance comparison on datasets D1-D5 are shown in Table 4. It was observed that, for all tested values of ε, AESVM resulted in large reductions in training and classification times when compared to LIBSVM, for a very small difference in classification accuracy.

Metric        Dataset  AESVM ε=10^-3  AESVM ε=10^-4  AESVM ε=10^-5  CVM     BVM     LASVM
RMSE (×10²)   D1       0.28           0.16           0.21           0.44    0.6     0.12
              D2       2.56           1.81           1.19           26.59   24.06   2.18
              D3       0.16           0.10           0.05           0.33    0.39    55.2
              D4       1.08           0.82           0.74           9.4     9.44    −
              D5       0.99           0.39           0.23           0.74    0.84    0.13
ETS           D1       451.5          145            41.7           8.9     28.6    0.8
              D2       1614.7         289.6          62.8           0.7     0.8     0.2
              D3       28012.3        14799.3        7573.8         60.4    76.8    0.9
              D4       103.1          13.8           3.4            8       6.6     −
              D5       40.2           5              2              0.3     0.5     0.6
OTS           D1       92.1           34.2           9.5            6.2     21.6    0.8
              D2       148.6          45.5           14.3           0.5     0.5     0.1
              D3       968.5          800.6          514.4          23.9    22.8    0.5
              D4       11.9           4.1            2.2            6.2     4.4     −
              D5       5.2            2.5            1.5            0.2     0.3     0.5
ECS           D1       4.8            3.6            2.8            1.2     2       1.1
              D2       35.9           15.5           7.9            4.7     5       1
              D3       48.7           25.8           13.4           0.4     0.6     0.6
              D4       8.4            3.3            1.8            12.4    12.1    −
              D5       4.3            1.9            1.4            0.8     1       1
OCS           D1       3.8            3.1            2.5            1.1     1.9     1
              D2       23.4           10.9           6.1            4.5     4.4     1
              D3       32.2           16.1           9              0.3     0.5     0.2
              D4       5.4            2.7            1.7            12      10.7    −
              D5       2.8            1.8            1.4            0.8     1       1

Table 4: Performance comparison of AESVM (with ε = 10^-3, 10^-4, 10^-5), CVM, BVM, LASVM and LIBSVM on datasets D1-D5

Most notably, for D3 the expected and overall training time speedups were of the order of 10^4 and 10^3 respectively, which is outstanding. Comparing the results of AESVM for different ε values, we see that RMSE generally improves (decreases) when ε decreases, while the speedup metrics improve (increase) when ε increases. The increase in ETS and OTS is of a larger order than the increase in RMSE when ε increases. Comparing AESVM to CVM, BVM and LASVM, we see that AESVM in general gave the lowest values of RMSE and the largest values of ETS, OTS, ECS and OCS. In a few cases LASVM gave low RMSE values. However, in all our experiments LASVM took longer to train than the other algorithms, including LIBSVM. We could not complete the evaluation of LASVM for D4 due to its large training time, which was more than 40 hours for some hyper-parameter combinations. It was also found that LASVM sometimes resulted in a larger classification time than the other algorithms, including LIBSVM. CVM and BVM generally gave high values of RMSE.

Table 4 compares the classification accuracy of CVM, BVM, LASVM and AESVM to the exact SVM solution given by LIBSVM. Another way to compare the algorithms is in terms of the maximum classification accuracy, and the mean and standard deviation of the classification accuracies, without using LIBSVM as a reference point. Such a comparison for datasets D1-D5 is given in Table 5. The five algorithms under comparison were found to give similar maximum classification accuracies except for D2 and D4, where CVM and BVM gave significantly smaller values. Another interesting result is that for D3, the mean and standard deviation of accuracy of LASVM were found to be widely different from the other algorithms. For all the tested values of ε, the maximum, mean and standard deviation of the classification accuracies of AESVM were found to be similar.

Accuracy          Dataset  AESVM ε=10^-3  AESVM ε=10^-4  AESVM ε=10^-5  CVM        BVM        LASVM       LIBSVM
Maximum (×10²)    D1       93.4           93.8           93.6           94.1       94.4       94.3        93.9
                  D2       77.1           77.2           77.8           70.3       67.1       78.1        78.2
                  D3       99.9           99.9           99.9           99.9       99.9       99.9        99.9
                  D4       68.3           68.3           68.3           63.7       62.3       −
Mean, standard    D1       92.2, 0.7      92.3, 0.8      92.3, 0.8      92.7, 0.8  92.6, 0.9  92.5, 0.8   92.4, 0.8
deviation (×10²)  D2       72.3, 3.6      73.2, 3.7      73.6, 3.7      52.2, 0.8  54.6, 0.7  73.5, 0.5   74.1, 3.5
                  D3       99.8, 0        99.8, 0.1      99.8, 0.1      99.8, 0.2  99.8, 0.2  69.3, 29.9  99.8, 0.1
                  D4       61.3, 3.1      61, 3.1        61, 3.1        55.5, 3.1  54.9, 3.4  −

Table 5: Comparison of classification accuracies of AESVM (with ε = 10^-3, 10^-4, 10^-5), CVM, BVM, LASVM and LIBSVM on datasets D1-D5

Next, we present the results of the performance comparison of CVM, BVM, LASVM, AESVM, and LIBSVM on the high-dimensional datasets D6-D9. As described in Section 5.2, DeriveRS was run with only ε = 10^-3 for these datasets. The results of the performance comparison are shown in Tables 6 and 7. CVM was found to take longer than 40 hours to train on D6, D7 and D8 with some hyper-parameter values, and hence we could not complete its evaluation for those datasets. BVM also took longer than 40 hours to train on D7, and it was therefore not evaluated for D7. AESVM consistently reported ETS, OTS, ECS and OCS values larger than 1, unlike the other algorithms. Similar to the results in Table 4, LASVM and BVM resulted in very large RMSE values for some datasets. The results in Table 7 are similar to those in Table 5, with similar maximum accuracies for all algorithms and significantly lower mean and higher standard deviation of accuracy for BVM and LASVM on some datasets.
Metric        Dataset  AESVM ε=10^-3  CVM    BVM     LASVM
RMSE (×10²)   D6       0.21           -      7.8     0.85
              D7       1.37           -      -       2.37
              D8       0.02           -      17.55   0
              D9       0.15           1      0.89    27.5
ETS           D6       1.8            -      0.6     0.8
              D7       1.4            -      -       0.9
              D8       1.1            -      4.7     1
              D9       1.6            1.4    17.5    0.6
OTS           D6       1.5            -      0.6     0.5
              D7       1.2            -      -       0.7
              D8       1.1            -      2.6     0.9
              D9       1.3            1.2    16.9    0.5
ECS           D6       1.2            -      1.5     1
              D7       1.16           -      -       1
              D8       1              -      3.2     1
              D9       1.2            1.8    4.9     2.3
OCS           D6       1.1            -      1.5     1
              D7       1.1            -      -       1
              D8       1              -      2.6     1
              D9       1.1            1.9    5.2     1.1

Table 6: Performance comparison of AESVM (with ε = 10^-3), CVM, BVM, LASVM and LIBSVM on datasets D6-D9

Accuracy          Dataset  AESVM ε=10^-3  CVM        BVM         LASVM       LIBSVM
Maximum (×10²)    D6       85.2           -          85.2        85          85.1
                  D7       88.3           -          -           88.4        88.6
                  D8       99.7           -          99.7        99.7        99.7
                  D9       99.3           99.5       99.5        99.5        99.5
Mean, standard    D6       81.3, 2.8      -          80.2, 8.9   81.1, 2.9   81.4, 2.8
deviation (×10²)  D7       85.3, 5.7      -          -           85.2, 6.2   85.7, 4.8
                  D8       92.3, 3.6      -          88.5, 18.1  92.3, 3.6   92.3, 3.6
                  D9       98.7, 0.8      98.9, 0.8  98.9, 0.8   85.5, 23.9  98.8, 0.8

Table 7: Comparison of classification accuracies of AESVM (with ε = 10^-3), CVM, BVM, LASVM and LIBSVM on datasets D6-D9

5.3.2 SVM^perf

SVM^perf differs from the other SVM solvers in its ability to compute a solution close to the SVM solution for a given number of support vectors (k). The algorithm complexity is directly proportional to the parameter k, but with a decrease in k the approximation becomes worse and the difference between the solutions of SVM^perf and SVM increases. We used a value of k = 1000 for our experiments, as it has been reported to give good performance (Joachims and Yu, 2009). SVM^perf was tested on datasets D1, D4, D5, D6, D8 and D9, with the Gaussian kernel and the same hyper-parameter grid as described earlier. The results of the grid search are presented in Table 8. The results of our experiments on AESVM (with ε = 10^-3) and LIBSVM are repeated in Table 8 for ease of reference. The maximum, mean and standard deviation of classification accuracies are represented as max. Acc., mean Acc., and std. Acc. respectively.

Dataset  Solver  RMSE (×10²)  ETS    OTS   ECS  OCS  max. Acc. (×10²)  mean Acc. (×10²)  std. Acc. (×10²)
D1       AESVM   0.28         451.5  92.1  4.8  3.8  93.4              92.2              0.7

Table 8: Performance comparison of SVM^perf, AESVM (with ε = 10^-3), and LIBSVM

SVM^perf was found to generally give higher RMSE values than AESVM. In particular, for the high dimensional datasets (D6, D8 and D9), the RMSE values were significantly higher. The training time speedups of SVM^perf were much lower than those of AESVM, except for D8. As expected, the classification time speedups of SVM^perf were significantly higher than those of AESVM. The maximum accuracies of all the algorithms were similar. However, the mean and standard deviation of the accuracies of SVM^perf were very different from those of AESVM and LIBSVM for the high dimensional datasets D6, D8 and D9.

12. We used the software parameters '-t 2 -w 9 -i 2 -b 0 -k 1000' as suggested on the author's website.

5.3.3 RfeatSVM
Rahimi and Recht (2007) proposed a promising method to approximate non-linear kernel SVM solutions using simpler linear kernel SVMs. This is accomplished by first projecting the training dataset into a randomized feature space and then using any SVM solver with the linear kernel on the projected dataset. We concentrated our experiments on investigating the accuracy of the RfeatSVM solution and its similarity to the SVM solution. LIBSVM with the linear kernel was used to compute the RfeatSVM solution on the projected datasets. We used LIBSVM, in spite of the availability of faster linear SVM implementations, as it is an exact SVM solver. Hence only the performance metrics related to accuracy were used to compare the performance of AESVM, LIBSVM and RfeatSVM. The random Fourier features method, described in Algorithm 1 of Rahimi and Recht (2007), was used to project the datasets D1, D5, D6 and D9 into a randomized feature space of dimension E. The results of the accuracy comparison are given in Table 9. We used a smaller hyper-parameter grid of all twenty-four combinations of C′ (six values) and g (four values) for our experiments. The results reported in Table 9 for AESVM and LIBSVM were computed for this smaller grid. We used the same number of dimensions (E) of the randomized feature space for D1 and D6 as in Rahimi and Recht (2007). The RMSE values of RfeatSVM were significantly higher than those of AESVM for most datasets, especially for D1 and D6. The maximum accuracy of RfeatSVM was found to be much lower than that of AESVM and LIBSVM for all datasets. The time taken to compute the randomized feature space is not reported because it was found to be negligibly small. Another important observation was that the projected datasets were found to be almost 100% dense. The training time of an SVM solver is typically linearly proportional to the density of the dataset, and hence a highly dense dataset can take a significant training time even with fast linear SVMs. Dense datasets also have large memory requirements.
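As a rough sketch of this pipeline (not the implementation used in our experiments), the code below draws random Fourier features for the Gaussian kernel, assuming the common parameterization k(x, x′) = exp(−g‖x − x′‖²), and trains a linear SVM on the dense projected data. The feature dimension E, the kernel parameter g, and the choice of scikit-learn's LinearSVC are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def random_fourier_features(X, E, g, seed=0):
    """Project X (N x D) onto E random Fourier features approximating the
    Gaussian kernel k(x, x') = exp(-g * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Frequencies sampled from the kernel's Fourier transform: N(0, 2g I).
    W = rng.normal(scale=np.sqrt(2.0 * g), size=(D, E))
    b = rng.uniform(0.0, 2.0 * np.pi, size=E)
    return np.sqrt(2.0 / E) * np.cos(X @ W + b)   # dense N x E feature matrix

# RfeatSVM-style pipeline: a linear SVM trained on the projected dataset.
# Z_train = random_fourier_features(X_train, E=1000, g=2.0 ** -4)  # E and g illustrative
# linear_clf = LinearSVC(C=1.0).fit(Z_train, y_train)
```

Note that the projected matrix is dense regardless of the sparsity of the input, which is the memory and training-time concern raised above.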
To validate our proposal of AESVM as a fast alternative to SVM for all non-linear kernels, we performed a few experiments with the polynomial kernel, k(x_i, x_j) = (1 + x_i^T x_j)^d. The hyper-parameter grid composed of all twelve combinations of C′ (four values) and d = {2, 3, 4} was used to compute the solutions of AESVM and LIBSVM on the datasets D1, D4 and D6. The results of the computation of the representative set using DeriveRS are shown in Table 10. The parameters for DeriveRS were P = 10, V = 10 and ε = 10^-3, and the first level segregation was performed using FLS2. The performance comparison of AESVM and LIBSVM with the polynomial kernel is shown in Table 11. As in the case of the Gaussian kernel, we found that AESVM gave results similar to LIBSVM with the polynomial kernel, while taking shorter training and classification times.
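For reference, this polynomial kernel corresponds to the standard LIBSVM parameterization (γ·x_i^T x_j + r)^d with γ = 1 and r = 1. The minimal sketch below uses scikit-learn's SVC (a LIBSVM wrapper, not the authors' modified implementation) to set up solvers for the degrees listed in Table 10; the value C = 1.0 is an arbitrary placeholder.

```python
from sklearn.svm import SVC

# (1 + x_i^T x_j)^d corresponds to gamma=1, coef0=1 in LIBSVM's
# (gamma * <x_i, x_j> + coef0)^degree polynomial kernel.
polynomial_svms = {
    d: SVC(kernel="poly", degree=d, gamma=1.0, coef0=1.0, C=1.0)
    for d in (2, 3, 4)   # degrees used in Table 10
}
# for d, clf in polynomial_svms.items():
#     clf.fit(X_train, y_train)   # X_train, y_train assumed to be loaded
```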
Dataset  Solver               RMSE (×10²)  max. Acc. (×10²)  mean Acc. (×10²)  std. Acc. (×10²)  Original density %  Density after projection %
D1       AESVM                0.24         93.6              92.2              0.9
         RfeatSVM (E = 100)   56.18        37.8              36.1              1.3               33                  100
         LIBSVM                            93.6              92.3              0.9
D5       AESVM                0.9          98.6              95.7              2.8
         RfeatSVM (E = 100)   5.3          94.7              91.6              1.4               59                  100
         LIBSVM                            98.9              96.2              2.7
D6       AESVM                0.16         85.1              81.2              2.9
         RfeatSVM (E = 1000)  4            81.6              78                2.2               11                  100
         LIBSVM                            85                81.3              3
D9       AESVM                0.15         99.3              98.6              0.8
         RfeatSVM (E = 1000)  0.6          98.7              97.4              0.6               4                   95.8
         LIBSVM                            99.5              98.8              0.9

Table 9: Performance comparison of RfeatSVM, AESVM (with ε = 10^-3), and LIBSVM. The density of the datasets before and after projection into randomized feature spaces is also shown

(M/N) × 100% (computation time in seconds in parentheses)
Dataset  d = 2       d = 3        d = 4
D1       6.6 (410)   14.2 (1329)  22.5 (3696)
D4       30.3 (752)  57.7 (1839)  76.5 (2246)
D6       69 (20)     69.7 (21)    70.4 (22)

Table 10: Results of DeriveRS for the polynomial kernel
6. Discussion
AESVM is a new problem formulation that is almost identical to, but less complex than, the SVM primal problem. AESVM optimizes over only a subset of the training dataset, called the representative set, and consequently is expected to give fast convergence with most SVM solvers. In contrast, the other studies mentioned in Section 2 are mostly algorithms that solve the SVM primal or related problems. Methods such as RSVM also use different problem formulations. However, they require special algorithms to solve, unlike AESVM. In fact, AESVM can be solved using many of the methods in Section 2. As described in Corollary 5, there are some similarities between AESVM and the Gram matrix approximation methods discussed earlier. It would be interesting to see a comparison of AESVM with the core set based method proposed by Gärtner and Jaggi (2009). However, due to the lack of availability of a software implementation and of published results on L1-SVM with non-linear kernels using their approach, the authors find such a comparison study beyond the scope of this paper.

Dataset  Solver  RMSE (×10²)  ETS   OTS  ECS  OCS  max. Acc. (×10²)  mean Acc. (×10²)  std. Acc. (×10²)
D1       AESVM   0.15         31.2  2    3.1  3.1  94                93.5              0.4
         LIBSVM                                    94.1              93.5              0.4
D4       AESVM   2.04         3.3   1.5  2    1.9  64.3              60.8              2.5
         LIBSVM                                    64.5              60.7              2.5
D6       AESVM   0.6          2.7   1.9  1.5  1.5  84.5              80.5              2.5
         LIBSVM                                    84.6              81                2.3

Table 11: Performance comparison of AESVM (with ε = 10^-3) and LIBSVM with the polynomial kernel

Figure 4: Plot of RMSE (×10²) values for all SVM solvers (AESVM with ε = 10^-3, CVM, BVM, LASVM, SVM^perf and RfeatSVM) on datasets D1-D9
The theoretical and experimental results presented in this paper demonstrate that the solutions of AESVM and SVM are similar in terms of the resulting classification accuracy. A summary of the experiments in Section 5, which compared an SMO based AESVM implementation, CVM, BVM, LASVM, LIBSVM, SVM^perf and RfeatSVM, is presented in Figures 4 to 7.

Figure 5: Plot of maximum classification accuracy (×10²) for all SVM solvers (AESVM with ε = 10^-3, CVM, BVM, LASVM, SVM^perf, RfeatSVM and LIBSVM) on datasets D1-D9
It can be seen that AESVM typically gave the lowest approximation error (RMSE), while giving the highest overall training time speedup (OTS). AESVM also gave competitively high overall classification time speedup (OCS) in comparison with the other algorithms, except SVM^perf. It was found that the maximum classification accuracies of all the algorithms except RfeatSVM were similar. RfeatSVM, and in some cases CVM and BVM, gave lower maximum classification accuracies. Though the results of RfeatSVM illustrated in Figures 4 and 5 were computed for a smaller hyper-parameter grid (refer to Section 5.3.3), we believe they indicate the overall performance of the method. Apart from the excellent experimental results for AESVM with the Gaussian kernel, AESVM also gave good results with the polynomial kernel, as described in Section 5.4.

The algorithm DeriveRS was generally found to be efficient, especially for the lower dimensional datasets D1-D5. For the high dimensional datasets D6-D9, the representative set was almost the same size as the training dataset, resulting in small gains in training and classification time speedups for AESVM. In particular, for D8 (the MNIST dataset) the representative set computed by DeriveRS was almost 100% of the training set. A similar result was reported for this dataset in Beygelzimer et al. (2006), where a divide and conquer method was used to speed up nearest neighbor search. Dataset D8 is reported to have resulted in nearly no speedup, compared to a speedup of almost one thousand for other datasets, when their method was used. Their analysis found that the data vectors in D8 were very distant from each other in comparison with the other datasets. This observation can explain the performance of DeriveRS on D8, as data vectors that are very distant from each other are expected to have large representative sets. It should be noted that, irrespective of the dimensionality of the datasets, AESVM always resulted in excellent performance in terms of classification accuracy. There seems to be no relation between dataset density and the performance of DeriveRS and AESVM.

The authors will provide the software implementation of AESVM and DeriveRS upon request. Based on the presented results, we suggest the parameters ε = 10^-3, P = 10 and V = 10 for DeriveRS. A possible extension of this paper is to apply the idea of the representative set to other SVM variants and to support vector regression (SVR). It is straightforward to see that the theorems in Section 3.2 can be extended to SVR. It would be interesting to investigate AESVM solvers implemented using methods other than SMO. Modifications to DeriveRS using the methods in Section 2 might improve its performance on high dimensional datasets. The authors will investigate improvements to DeriveRS and the application of AESVM to the linear kernel in their future work.

13. This is indicated by the large expansion constant for D8 illustrated in Beygelzimer et al. (2006).

Figure 6: Plot of overall training time speedup (compared to LIBSVM) for all SVM solvers (AESVM with ε = 10^-3, CVM, BVM, LASVM and SVM^perf) on datasets D1-D9; the OTS value of AESVM for D3 is 968.5, above the plotted range

Figure 7: Plot of overall classification time speedup (compared to LIBSVM) for all SVM solvers (AESVM with ε = 10^-3, CVM, BVM, LASVM and SVM^perf) on datasets D1-D9; the OCS value of SVM^perf for D4 is 186.8, above the plotted range

Acknowledgments
Dr. Khargonekar acknowledges support from the Eckis professor endowment at the University of Florida. Dr. Talathi was partially supported by the Children's Miracle Network and the Wilder Center of Excellence in Epilepsy Research. The authors acknowledge Mr. Shivakeshavan R. Giridharan for providing assistance with computational resources.
References
K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 57-64, 2000.

A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 97-104, 2006.

M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7:448-461, August 1973.

A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579-1619, December 2005.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

J. Cervantes, X. Li, W. Yu, and K. Li. Support vector machine classification for large data sets via minimum enclosing ball clustering. Neurocomputing, 71:611-619, January 2008.

C. C. Chang and C. J. Lin. IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 1031-1036, 2001a.

C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. Software available at , 2001b.

K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4), September 2010.

R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computing, 14(5):1105-1114, 2002.

P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, December 2005.

R. E. Fan, P. H. Chen, and C. J. Lin. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889-1918, 2005.

R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, June 2008.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2002.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 320-327, 2008.

B. Gärtner and M. Jaggi. Coresets for polytope distance. In Proceedings of the 25th Annual Symposium on Computational Geometry, pages 33-42, 2009.

J. Guo, N. Takahashi, and T. Nishi. A learning algorithm for improving the classification speed of support vector machines. In Proceedings of the 2005 European Conference on Circuit Theory and Design, volume 3, pages 381-384, 2005.

C. J. Hsieh, K. W. Chang, C. J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408-415, 2008.

T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, pages 169-184. MIT Press, 1999.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.

T. Joachims and C. N. J. Yu. Sparse kernel SVMs via cutting-plane training. Machine Learning, 76:179-193, September 2009.

B. Kaluza, V. Mirchevska, E. Dovgan, M. Lustrek, and M. Gams. An agent-based approach to care in independent living. In Proceedings of AmI'2010, 2010.

J. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703-712, 1960.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Y. J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the SIAM International Conference on Data Mining, 2001.

M. Nandan, S. S. Talathi, S. Myers, W. L. Ditto, P. P. Khargonekar, and P. R. Carney. Support vector machines for seizure detection in an animal model of chronic epilepsy. Journal of Neural Engineering, 7(3), 2010.

E. Osuna and O. Castro. Convex hull in feature space for support vector machines. In Proceedings of the 8th Ibero-American Conference on AI: Advances in Artificial Intelligence, IBERAMIA 2002, pages 411-419, 2002.

E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 130-136, 1997.

D. Pavlov, D. Chudova, and P. Smyth. Towards scalable support vector machines using squashing. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 295-299. ACM, 2000.

J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185-208. MIT Press, 1999.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 2007.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1996.

B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, May 2000. ISSN 0899-7667.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 928-935, 2008.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127:3-30, March 2011.

S. S. Talathi, D. U. Hwang, M. L. Spano, J. Simonotto, M. D. Furman, S. M. Myers, J. T. Winters, W. L. Ditto, and P. R. Carney. Non-parametric early seizure detection in an animal model of temporal lobe epilepsy. Journal of Neural Engineering, 5:85-98, 2008.

M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pages 53-58, 2009.

D. Tax and R. Duin. Support vector data description. Machine Learning, 54(1):45-66, 2004.

C. H. Teo, S. V. N. Vishwanthan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311-365, 2010.

I. W. Tsang, J. T. Kwok, P. Cheung, and N. Cristianini. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005.

I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 911-918, 2007.

H. Yu, J. Yang, and J. Han. Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2003.

G. X. Yuan, C. H. Ho, and C. J. Lin. An improved GLMNET for l1-regularized logistic regression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 33-41, 2011.

T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, 2004.