Accelerating Kernel Classifiers Through Borders Mapping
Peter Mills [email protected]
July 27, 2018
Abstract
Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 9 of them by factors as high as 88 times or more. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.
Keywords: class borders, multi-dimensional root-finding, adaptive Gaussian filtering, binary classifier, multi-class classifier, non-parametric statistics, variable kernel density estimation

List of symbols

symbol - definition (first used)
g - decision function in linear classifier (1)
~v - hyperplane normal of decision boundary (1)
b - constant value defining location of decision boundary (1)
~x - test point (1)
p̃ - kernel density estimator (2)
K - kernel function taking two vector arguments (2)
{~x_i} - set of training samples (2)
~κ - parameters in vector kernel (2)
N - kernel normalization coefficient (2)
n - number of training samples (2)
~x = {x_i} - point in the feature space (3)
c - class label (4)
y_i - class label of the ith training sample (4)
p(c|~x) - conditional probability (Section 2)
p(c, ~x) - joint probability (Section 2)
K - kernel function taking a single scalar argument (5)
σ - "bandwidth": sets the size of a scalar kernel (5)
P(~x) - true probability density (6)
D - number of dimensions in feature space (6)
W - sum of the kernels at each training sample (7)
~w = {w_i} - coefficients in weighted kernel estimator (8)
~φ - theoretical expanded feature space in kernel-based SVM (9)
g_svm - raw decision function in binary SVM (11)
C - cost parameter in SVM for reducing over-fitting (20)
r - difference in conditional probabilities (21)
r̃ - decision function which is an estimator of r (21)
r̃_kern - kernel estimator for r (22)
r̃_vb - variable kernel estimator for r (23)
{~b_i} - set of vectors defining the class border (27)
{~v_i} - set of normals to the class border (27)
n_b - number of border vectors (27)
g_border - raw decision function for border classification (27)
d - distance in feature space (27)
K̇ - derivative of a scalar kernel, K (27)
r̃_svm - LIBSVM estimator for r (29)
A - coefficient used in r̃_svm (29)
B - coefficient used in r̃_svm (29)
r̃_border - borders estimator for r (30)
p_i = p(i|~x) - conditional probability of the ith class (31)
n_c - number of classes (31)
λ - Lagrange multiplier in multi-class problem (32)
V - volume of the feature space occupied by training data (33)
n - number of training samples needed for good accuracy (33)
n_b - number of border samples needed for good accuracy (33)
γ = 1/(2σ²) - parameter used for Gaussian kernels in SVM (Section 5)
f - fraction of data used for testing (Section 5)
{η_ij} - confusion matrix (34)
n_test - number of test points (34)
H_i - entropy of the prior distribution (36)
H(i|j) - entropy of the posterior distribution (37)
U(i|j) - uncertainty coefficient (normalized channel capacity) (39)
n_i - size of the ith class (40)
α - subsampling fraction as a function of class size (40)
ζ - exponent in subsampling function α (41)
C - coefficient in subsampling function α (41)

1 Introduction

Linear classifiers are well studied in the literature. Methods such as the perceptron, Fisher discriminant, logistic regression and now linear support vector machines (SVM) (Michie et al., 1994) are often appropriate for relatively simple, binary classification problems in which both classes are closely clustered or are well separated. An obvious extension for more complex problems is a piecewise linear classifier in which the decision boundary is built up from a series of linear classifiers.
Piecewise linear classifiers enjoyed some popularity during the early development of the field of machine learning (Osborne, 1977; Sklansky and Michelotti, 1980; Lee and Richards, 1984, 1985) and because of their versatility, generality and simplicity there has been recent renewed interest (Bagirov, 2005; Kostin, 2006; Gai and Zhang, 2010; Webb, 2012; Wang and Saligrama, 2013; Pavlidis et al., 2016).

A linear classifier takes the form:

g(~x) = ~v · ~x + b    (1)

where ~x is a test point in the feature space, ~v is a normal to the decision hyper-surface, b determines the location of the decision boundary along the normal and g is the decision function which we use to estimate the class of the test point through its sign.

A piecewise linear classifier collects a set of such linear classifiers: {~v_i} = {~v_1, ~v_2, ~v_3, ...}; {b_i} = {b_1, b_2, b_3, ...}. The two challenges here are, first, how to efficiently train each of the decision boundaries and, second, the related problem of how to partition the feature space to determine which linear decision boundary is used for a given test point.

In Bagirov (2005), for instance, the decision function is defined by partitioning the set of linear classifiers and maximizing the minimum linear decision value in each partition. To train the classifier, a cost function is defined in terms of this decision function and directly minimized using an analog to the derivative for non-smooth functions (Bagirov, 1999). Naturally, such an approach will be quite computationally costly.

Partitioning of the feature space can be separate from the discrimination borders (Huang et al., 2013) but more normally the discrimination borders are themselves sufficient to partition the feature space (Osborne, 1977; Lee and Richards, 1984; Bagirov, 2005; Kostin, 2006). This means that all or a significant fraction of the component linear classifiers must be evaluated. In Kostin (2006), for instance, the linear classifiers form a decision tree.

In the method described in this paper, the constant term, b_i, is changed to a vector and the partitioning is accomplished through a nearest neighbour search over these vectors. Thus the zone of influence for each hyperplane will be described by the Voronoi tessellation (Kohonen, 2000). If the class domains are simply connected and don't curve back on themselves, then the partitions will also be shaped as hyper-pyramids, with the axes of the pyramids roughly perpendicular to the decision border. A dot product with each of the vectors must be calculated, similar to a linear classifier, but afterwards only a single linear decision function is evaluated.

There seems to be some tension in the literature between training the decision boundary through simultaneous optimization (Bagirov, 2005; Wang and Saligrama, 2013) or through methods that are more piecemeal (Gai and Zhang, 2010; Herman and Yeung, 1992; Kostin, 2006). Obviously, simultaneous optimization will be more accurate but also much more computationally expensive. In addition, finding global minima for cost functions involving more than a handful of hyper-surfaces will be all but impossible. There is also the issue of separability. Many of the current crop of methods seem to be designed with disjoint classes in mind (Herman and Yeung, 1992), for instance Gai and Zhang (2010), who stick the hyper-plane borders between neighbouring pairs of opposite classes.
Yet there is no reason why a piecewise linear classifier cannot be just as effective for overlapping classes.

The technique under discussion in this paper mitigates all of these issues because it is not a stand-alone method but requires estimates of the conditional probabilities. It is used to improve the time performance of kernel methods, or for that matter, any binary classifier that returns a continuous decision function that can approximate a conditional probability. This is done while maintaining, in all but a few cases, most of the accuracy.

Several of the piecewise linear techniques found in the literature work by positioning each hyperplane between pairs of clusters or pairs of training samples of opposite class (Sklansky and Michelotti, 1980; Tenmoto et al., 1998; Kostin, 2006; Gai and Zhang, 2010). Other unsupervised or semi-supervised classifiers work by placing the hyperplanes in regions of minimum density (Pavlidis et al., 2016). The method described in this paper in some senses combines these two techniques by finding the root of the difference in conditional probabilities along a line between two points of opposite class. It will be tested on two kernel-based classifiers, a support vector machine (SVM) (Michie et al., 1994; Müller et al., 2001) and a simpler, "pointwise estimator" (Terrell and Scott, 1992; Mills, 2011), and evaluated based on how well it improves classification speed and at what cost to accuracy.

Section 2 describes the theory of support vector machines and the pointwise estimator ("adaptive Gaussian filtering") as well as the piecewise linear classifier or "borders" classifier that will be trained on the two kernel estimators. Section 3 describes the software and test datasets, then in Section 4 we analyze the different classification algorithms on a simple, synthetic dataset. Section 5 outlines the results for 17 case studies while in Section 6 we discuss the results. Section 7 concludes the paper.

2 Theory

A kernel is a scalar function of two vectors that can be used for non-parametric density estimation. A typical "kernel-density estimator" looks like this:

p̃(~x) = (1/(nN)) Σ_{i=1}^{n} K(~x, ~x_i, ~κ)    (2)

where p̃ is an estimator for the density, P, K is the kernel function, {~x_i | i ∈ [1, n]} are a set of training samples, ~x is the test point, and ~κ is a set of parameters. The normalization coefficient, N, normalizes K:

N = ∫_V K(~x, ~x', ~κ) d~x'    (3)

The method can be used for statistical classification by comparing results from the different classes:

c = arg max_i Σ_{j | y_j = i} K(~x, ~x_j, ~κ)    (4)

where y_j is the class of the jth sample. Similarly, the method can also return estimates of the joint (p(c, ~x)) and conditional probabilities (p(c|~x)) by dividing the sum in (4) by nN or by the sum of all the kernels, respectively.

If the same kernel is used for every sample and every test point, the estimator may be sub-optimal, particularly in regions of very high or very low density. There are at least two ways to address this problem. In a "variable-bandwidth" estimator, the coefficients, ~κ, depend in some way on the density itself. Since the actual density is normally unavailable, the estimated density can be used as a proxy (Terrell and Scott, 1992; Mills, 2011).

Let the kernel function take the following form:

K(~x, ~x', σ) = K(|~x − ~x'| / σ)    (5)

where σ is the "bandwidth". In Mills (2011), σ is made proportional to the density:

σ ∝ P^{−1/D} ≈ p̃^{−1/D}    (6)

where D is the dimension of the feature space. Since the normalization coefficient, N, must include the factor σ^D, some rearrangement shows that:

Σ_i K(|~x − ~x_i| / σ) = W = const.    (7)

This is a generalization of the k-nearest-neighbours scheme in which the free parameter, W, takes the place of k (Mills, 2009, 2011). The bandwidth, σ, can be solved for using any numerical, one-dimensional root-finding algorithm. The bandwidth is determined uniquely for a given test point but is held constant for that one, which makes this a "balloon" estimator. Contrast a "point-wise" estimator in which bandwidths are different for each training point but need only be determined once (Terrell and Scott, 1992).
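As an illustration of equations (4) and (7), the following is a minimal sketch (Python with NumPy and SciPy; it is not the libAGF implementation) of the "balloon" estimator: for each test point the bandwidth σ is found by one-dimensional root finding so that the summed Gaussian kernels equal the free parameter W, and the per-class kernel sums then give the class estimate and conditional probabilities. The function names and the bracketing interval passed to the root finder are illustrative assumptions.

import numpy as np
from scipy.optimize import brentq

def balloon_bandwidth(x, samples, W, sig_lo=1e-3, sig_hi=1e3):
    """Solve sum_i K(|x - x_i|/sigma) = W for sigma (Eq. 7) with a Gaussian kernel."""
    d = np.linalg.norm(samples - x, axis=1)            # distances to all training points
    f = lambda sigma: np.sum(np.exp(-0.5 * (d / sigma) ** 2)) - W
    return brentq(f, sig_lo, sig_hi)                   # assumes the root is bracketed by [sig_lo, sig_hi]

def kernel_classify(x, samples, labels, W):
    """Class estimate and conditional probabilities from per-class kernel sums (Eq. 4)."""
    sigma = balloon_bandwidth(x, samples, W)
    k = np.exp(-0.5 * (np.linalg.norm(samples - x, axis=1) / sigma) ** 2)
    classes = np.unique(labels)
    scores = np.array([k[labels == c].sum() for c in classes])
    return classes[np.argmax(scores)], scores / scores.sum()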
Another method of improving the performance of a kernel-density estimator is to multiply each kernel by a coefficient:

p̃(~x) = Σ_i w_i K(~x, ~x_i, ~κ)    (8)

The coefficients, {w_i}, are found through an optimization procedure designed to minimize the error (Chen et al., 2015). In the most popular form of this kernel method, support vector machines (SVM), the coefficients are the result of a complex, dual optimization procedure which minimizes the classification error. We will briefly outline this procedure.

2.2 Support Vector Machines

The basic "trick" of kernel-based SVM methods is to replace a dot product with the kernel function in the assumption that it can be rewritten as a dot product of a transformed and expanded feature space:

K(~x, ~x') = ~φ(~x) · ~φ(~x')    (9)

For simplicity we have omitted the kernel parameters. ~φ is a vector function of the feature space. The simplest example of a kernel function that has a closed, analytical and finite-dimensional ~φ is the square of the dot product:

K(~x, ~x') = (~x · ~x')²    (10)
          = (x_1², x_2², ..., √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_{D−1} x_D) · (x_1'², x_2'², ..., √2 x_1' x_2', √2 x_1' x_3', ..., √2 x_{D−1}' x_D')

but it should be noted that in more complex cases, there is no need to actually construct ~φ since it is replaced by the kernel function, K, in the final analysis.

In a binary SVM classifier, the classes are separated by a single hyperplane defined by ~v and b. In a kernel-based SVM, this hyperplane bisects not the regular feature space, but the theoretical, transformed space defined by the function ~φ(~x). The decision value is calculated via a dot product:

g_svm(~x) = ~v · ~φ(~x) + b    (11)

and the class determined, as before, by the sign of the decision value:

c(~x) = g_svm(~x) / |g_svm(~x)|    (12)

where for convenience, the class labels are given by c ∈ {−1, +1}.

In the first step of the minimization procedure, the magnitude of the border normal, ~v, is minimized subject to the constraint that there are no classification errors:

min_{~v, b} |~v|²   subject to   g_svm(~x_i) y_i ≥ 1

Introducing a set of coefficients, {w_i}, as Lagrange multipliers on the constraints:

min_{~v, b} { (1/2)|~v|² − Σ_i w_i [g_svm(~x_i) y_i − 1] }    (13)

generates the following pair of analytic expressions:

Σ_i w_i y_i = 0    (14)
~v = Σ_i w_i y_i ~φ(~x_i)    (15)

through setting the derivatives w.r.t. the minimizers to zero. Substituting the second equation, (15), into the decision function in (11) produces the following:

g_svm(~x) = Σ_i w_i y_i K(~x, ~x_i) + b    (16)

Thus, the final, dual, quadratic optimization problem looks like this:

max_{w_i} { Σ_i w_i − (1/2) Σ_{i,j} w_i w_j y_i y_j K(~x_i, ~x_j) }    (17)
w_i ≥ 0    (18)
Σ_i w_i y_i = 0    (19)

There are a number of refinements that can be applied to the optimization problem in (17)-(19), chiefly to reduce over-fitting and to add some "margin" to the decision border to allow for the possibility of classification errors. For instance, substitute the following for (18):

0 ≤ w_i ≤ C    (20)

where C is the cost (Müller et al., 2001). Mainly we are concerned here with the decision function in (16) since the initial fitting will be done with an external software package, namely LIBSVM (Chang and Lin, 2011).

Two things should be noted. First, the function ~φ appears in neither the final decision function, (16), nor in the optimization problem, (17). Second, while the use of ~φ implies that the time complexity of the decision function could be O(1) as in a parametric statistical model, in actual fact it is dependent on the number of non-zero values in {w_i}. While the coefficient set, {w_i}, does tend to be sparse, nonetheless in most real problems the number of non-zero coefficients is proportional to the number of samples, n, producing a time complexity of O(n). Thus for large problems, calculating the decision value will be slow, just as in other kernel estimation problems.

The advantage of SVM lies chiefly in its accuracy since it is minimizing the classification error whereas a more basic kernel method is more ad hoc and does little more than sum the number of samples of a given class, weighted by distance.
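As a concrete illustration of equation (16), the following minimal sketch (Python/NumPy; the names are illustrative and this is not LIBSVM's API) evaluates the raw decision value given a set of support vectors, their dual coefficients w_i y_i, the bias b, and a Gaussian kernel written as exp(−γ |~x − ~x_i|²):

import numpy as np

def g_svm(x, support_vectors, dual_coef, b, gamma):
    """Raw kernel-SVM decision value: sum_i w_i y_i K(x, x_i) + b (Eq. 16), Gaussian kernel."""
    d2 = np.sum((support_vectors - x) ** 2, axis=1)    # squared distances to support vectors
    return np.dot(dual_coef, np.exp(-gamma * d2)) + b

The cost of one evaluation is clearly proportional to the number of support vectors, which is the point made above about O(n) time complexity.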
2.3 Borders classification

In kernel SVM, the decision border exists only implicitly in a hypothetical, abstract space. Even in linear SVM, if the software is generalized to recognize the simple dot product as only one among many possible kernels, then the decision function may be built up, as in (16), through a sum of weighted kernels. This is the case for LIBSVM. The advantage of an explicit decision border as in (1) or (11) is that it is fast. The problem with a linear border is that, except for a small class of problems, it is not very accurate.

In the binary classification method described in Mills (2011), a non-linear decision border is built up piece-wise from a collection of linear borders. It is essentially a root-finding procedure for a decision function, such as g_svm in (16). Let r̃ be a decision function that approximates the difference in conditional probabilities:

r̃(~x) ≈ r(~x) = p(1|~x) − p(−1|~x)    (21)

where p(c|~x) represents the conditional probabilities of a binary classifier having labels c ∈ {−1, +1}. For a simple kernel estimator, for instance, r is estimated as follows:

r̃_kern(~x) = Σ_i y_i K(~x, ~x_i) / Σ_i K(~x, ~x_i)    (22)

where y_i ∈ {−1, +1}. For the variable bandwidth kernel estimator defined by (7), this works out to:

r̃_vb(~x) = (1/W) Σ_i y_i K(|~x − ~x_i| / σ(~x))    (23)

A variable bandwidth kernel-density estimator with a Gaussian kernel,

K(~x, ~x', σ) = exp(−|~x − ~x'|² / (2σ²))    (24)

we will refer to as an "Adaptive Gaussian Filter" or AGF for short. This kernel will also be used for SVM where it's often called a "radial basis function" or RBF for short.

The procedure is as follows: pick a pair of points on either side of the decision boundary (where the decision function has opposite signs). Good candidates are one random training sample from each class. Then, zero the decision function along the line between the points. This can be done as many times as needed to build up a good representation of the decision boundary. We now have a set of points, {~b_i}, such that r̃(~b_i) = 0 for every i ∈ [1..n_b], where n_b is the number of border samples.

Along with the border samples, {~b_i}, we also collect a series of normal vectors, {~v_i}, such that:

~v_i = ∇_~x r̃ |_{~x = ~b_i}    (25)
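The following is a minimal sketch of the border-sampling step just described (Python/NumPy/SciPy; not the libAGF implementation). `decision` stands for any estimator of r, such as (22), (23) or the SVM-based estimator of the next subsection; for brevity the normal is taken by a numerical central difference, whereas the paper uses (semi-)analytic gradients. The pair-selection strategy and parameter names are illustrative assumptions, and the routine returns at most n_b border samples since pairs without a sign change are skipped.

import numpy as np
from scipy.optimize import brentq

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        g[j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

def sample_border(decision, x_minus, x_plus, n_b, rng):
    """Collect border points b_i with decision(b_i) ~ 0 and their normals v_i (Eq. 25)."""
    borders, normals = [], []
    for _ in range(n_b):
        a = x_minus[rng.integers(len(x_minus))]        # random training sample of class -1
        c = x_plus[rng.integers(len(x_plus))]          # random training sample of class +1
        f = lambda t: decision(a + t * (c - a))        # decision value along the segment a -> c
        if f(0.0) * f(1.0) >= 0:                       # no sign change: skip this pair
            continue
        t0 = brentq(f, 0.0, 1.0)                       # 1-D root along the line between the points
        b = a + t0 * (c - a)
        borders.append(b)
        normals.append(numerical_gradient(decision, b))
    return np.array(borders), np.array(normals)

Here rng would be, for example, numpy.random.default_rng().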
With this system, determining the class is a two-step process. First, the nearest border sample to the test point is found. Second, we define a new decision function, g_border, equivalent to (1), through a dot product with the normal:

i = arg min_j |~x − ~b_j|    (26)
g_border(~x) = ~v_i · (~x − ~b_i)    (27)

The class is determined by the sign of the decision function as in (12). The time complexity is completely independent of the number of training samples; rather, it is linearly proportional to the number of border vectors, n_b, a tunable parameter. The number required for accurate classification is dependent on the complexity of the decision border.

The gradient of the variable-bandwidth kernel estimator in (23) is:

∂r̃_vb/∂x_j = (1/(σW)) Σ_i y_i K̇(d_i/σ) [ (x_j − x_ij)/d_i − d_i ( Σ_k K̇(d_k/σ) (x_j − x_kj)/d_k ) / ( Σ_k d_k K̇(d_k/σ) ) ]

where d_i = |~x − ~x_i| is the distance between the test point and the ith sample and K̇ is the derivative of K. For AGF, this works out to:

∂r̃_vb/∂x_j = (1/(σ²W)) Σ_i y_i K(d_i/σ) [ (x_ij − x_j) − d_i² ( Σ_k K(d_k/σ) (x_kj − x_j) ) / ( Σ_k K(d_k/σ) d_k² ) ]    (28)

where K(x) = exp(−x²/2) (Mills, 2011).
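The classification step of equations (26) and (27) is simple enough to show as a short sketch (Python/NumPy; names are illustrative): find the nearest border sample and project the offset from it onto the stored normal.

import numpy as np

def classify_border(x, borders, normals):
    """Nearest border sample (Eq. 26) and linear decision value (Eq. 27)."""
    i = np.argmin(np.linalg.norm(borders - x, axis=1))   # nearest border sample
    g = np.dot(normals[i], x - borders[i])                # raw decision value
    return np.sign(g), g   # class in {-1, +1}; g can be mapped through tanh (Eq. 30 below) for probabilities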
In LIBSVM, conditional probabilities are estimated by applying logistic regression (Michie et al., 1994) to the raw SVM decision function, g_svm, in (16):

r̃_svm(~x) = tanh[ −(A g_svm(~x) + B) / 2 ]    (29)

(Chang and Lin, 2011), where A and B are coefficients derived from the training data via a nonlinear fitting technique (Platt, 1999; Lin et al., 2007). The gradient of the revised SVM decision function, above, is:

∇_~x r̃_svm = −(A/2) [ 1 − r̃_svm(~x)² ] Σ_i w_i y_i ∇_~x K(~x, ~x_i)

Gradients of the initial decision function are useful not just to derive normals to the decision boundary, but also as an aid to root finding when searching for border samples. If the decision function used to compute the border samples represents an estimator for the difference in conditional probabilities, then the raw decision value, g_border, derived from the border sampling technique in (27) can also return estimates of the conditional probabilities with little extra effort and little loss of accuracy, also using a sigmoid function:

r̃_border(~x) = tanh[ g_border(~x) ]    (30)

This assumes that the class posterior probabilities, p(~x|c), are approximately Gaussian near the border (Mills, 2011).

The border classification algorithm returns an estimator, r̃_border, for the difference in conditional probabilities of a binary classifier using equations (27) and (30). It can be trained with the functions r̃_kern in (22), r̃_vb in (23), r̃_svm in (29), or any other continuous, differentiable, non-parametric estimator for the difference in conditional probabilities, r. At the cost of a small reduction in accuracy, it has the potential to drastically reduce classification time for kernel estimators and other non-parametric statistical classifiers, especially for large training datasets, since it has O(n_b) time complexity instead of O(n) complexity, where n_b, the number of border samples, is a free parameter. The actual number chosen can trade off between speed and accuracy with rapidly diminishing returns beyond a certain point. One hundred border samples (n_b = 100) is usually sufficient. The computation of r̃_border also involves very simple operations (floating point addition, multiplication and numerical comparison, with no transcendental functions except for the very last step, which can be omitted), so the coefficient for the time complexity will be small.

A border classifier trained with AGF will be referred to as an "AGF-borders" classifier while a border classifier trained with SVM estimates will be referred to as an "SVM-borders" classifier or an "accelerated" SVM classifier.

2.4 Multi-class classification

The border classification algorithm, like SVM, only works for binary classification problems. It is quite easy to generalize a binary classifier to perform multi-class classifications by using several of them, and the number of ways of doing so grows exponentially with the number of classes. Since LIBSVM uses the "one-versus-one" method (Hsu and Lin, 2002) of multi-class classification, this is the one we will adopt.

A major advantage of the borders classifier is that it returns probability estimates. These estimates have many uses, including measuring the confidence of, as well as recalibrating, the class estimates (Mills, 2009, 2011).
Thus the multi-class method should also solve for the conditional probabilities in addition to returning the class label.

In a one-vs.-one scheme, the multi-class conditional probabilities can be related to those of the binary classifiers as follows:

r_ij(~x) = [ p(j|~x) − p(i|~x) ] / [ p(i|~x) + p(j|~x) ]    (31)

where i ∈ [1..n_c − 1], j ∈ [1..n_c], n_c is the number of classes, j > i, and r_ij is the difference in conditional probabilities of the binary classifier that discriminates between the ith and jth classes. Wu et al. (2004) transform this problem into the following linear system:

p_i Σ_{k ≠ i} (r_ki + 1) + Σ_{j ≠ i} p_j (1 − r_ij) + λ = 0    (32)
Σ_j p_j = 1

where p_i = p(i|~x) is the ith multi-class conditional probability and λ is a Lagrange multiplier. They also show that the constraints not included in the problem, that the probabilities are all positive, are always satisfied, and describe an algorithm for solving it iteratively, although a simple matrix solver is sufficient.
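The following is a minimal sketch (Python/NumPy; not the libAGF implementation) of recovering the multi-class probabilities from the pairwise estimates of equation (31). Rather than transcribing equation (32) directly, it solves the underlying pairwise relations p_i (1 + r_ij) = p_j (1 − r_ij), together with the constraint that the probabilities sum to one, in the least-squares sense; this is a simple stand-in for the iterative scheme of Wu et al. (2004). R is assumed to be an antisymmetric matrix with R[i, j] = r_ij.

import numpy as np

def couple_probabilities(R):
    """Multi-class conditional probabilities from pairwise differences r_ij (Eq. 31)."""
    n_c = R.shape[0]
    rows, rhs = [], []
    for i in range(n_c):
        for j in range(i + 1, n_c):
            row = np.zeros(n_c)
            row[i] = 1.0 + R[i, j]       # from r_ij = (p_j - p_i) / (p_i + p_j)
            row[j] = -(1.0 - R[i, j])
            rows.append(row)
            rhs.append(0.0)
    rows.append(np.ones(n_c))            # probabilities sum to one
    rhs.append(1.0)
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    p = np.clip(p, 0.0, None)            # guard against small negative values
    return p / p.sum()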
LIBSVM is a machine learning software library for support vector machines developed by Chih-Chung Chang and Chih-Jen Lin of the National Taiwan University, Taipei, Taiwan (Chang and Lin, 2011). It includes statistical classification using two regularization methods for minimizing over-fitting: C-SVM and ν-SVM. It also includes code for nonlinear regression and density estimation or "one-class SVM". SVM models were trained using the svm-train command while classifications were done with svm-predict. LIBSVM can be found at:

Similar to LIBSVM, libAGF is a machine learning library but for variable kernel estimation (Mills, 2011; Terrell and Scott, 1992) rather than SVM. Like LIBSVM, it supports statistical classification, nonlinear regression and density estimation. It supports both Gaussian kernels and k-nearest neighbours. It was written by Peter Mills and can be found at https://github.com/peteysoft/libmsci.

Except for training and classifying the SVM models, all calculations in this paper were done with the libAGF library. To convert a LIBSVM model to a borders model, the single command, svm_accelerate, can be used. Classifications are then performed with classify_m.

The borders classification algorithm was tested on a total of 17 different datasets. These will be briefly described in this section. The collection covers a fairly broad range of sizes and types of problems, numbers of classes and numbers and types of attributes, but with the focus on larger datasets where the borders technique is actually useful. Four of the datasets are from the "Statlog" project (Michie et al., 1994; King et al., 1995) and are nicknamed "heart", "shuttle", "sat" and "segment". The heart disease ("heart") dataset contains thirteen attributes of 270 patients along with one of two class labels denoting either the presence or absence of heart disease. The dataset comes originally from the Cleveland Clinic Foundation and two versions are stored on the machine learning database of U. C. Irvine (Lichman, 2013).

The shuttle dataset is interesting because the classes have a very uneven distribution, meaning that multi-class classifiers with a symmetric breakdown of the classes, such as one-vs.-one, tend to perform poorly. The shuttle dataset comes originally from NASA and was taken from an actual space shuttle flight. The classes describe actions to be taken at different flight configurations.

Table 1: Summary of datasets used in the numerical trials.

Dataset     D    Type     n_c   Train   Test    Reference
heart       13   real     2     270     -       (Lichman, 2013)
shuttle     9    real     7     43500   14500   (King et al., 1995)
sat         36   real     6     4435    2000    (King et al., 1995)
segment     19   real     7     2310    -       (King et al., 1995)
dna         180  binary   3     2000    1186    (Michie et al., 1994)
splice      60   cat      2     1000    2175    (Michie et al., 1994)
codrna      8    mixed    2     59535   271617  (Uzilov et al., 2006)
letter      16   integer  26    20000   -       (Frey and Slate, 1991)
pendigits   16   integer  10    7494    3498    (Alimoglu, 1996)
usps        256  real     10    7291    2001    (Hull, 1994)
mnist       665  integer  10    60000   10000   (LeCun et al., 1998)
ijcnn1      22   real     2     49990   91701   (Feldkamp and Puskorius, 1998)
madelon     500  integer  2     2000    600     (Guyon et al., 2004)
seismic     50   real     2     78823   19705   (Duarte and Hu, 2004)
mushrooms   112  binary   2     8124    -       (Iba et al., 1988)
phishing    68   binary   2     11055   -       (Mohammad et al., 2014)
humidity    7    real     8     86400   -       (Mills, 2009)
The satellite ("sat") dataset is a satellite remote-sensing land classification problem. The attributes represent 3-by-3 segments of pixels in a Landsat image with the class corresponding to the type of land cover in the central pixel. The segmentation ("segment") dataset is also an image classification dataset consisting of 3-by-3 pixel sets from outdoor images.

The DNA dataset is concerned with classifying a 60 base-pair sequence of DNA into one of three values: an intron-exon boundary, an exon-intron boundary or neither of those two. That is, during protein creation, part of the sequence is spliced out, with the sections kept being the exons and those spliced out being the introns. There are two versions of it: one called "splice" with the original sequence of 4 nucleotide bases but only two classes and one called "dna" in which the features data has been reprocessed so that the 60 base values are transformed to 180 binary attributes but keeping the original three classes (Michie et al., 1994). Another dataset from the field of microbiology is the "codrna" dataset which deals with detection of non-coding RNA sequences (Uzilov et al., 2006).

There are four text-classification datasets: "letter", "pendigits", "usps" and "mnist". The "letter" dataset is a text-recognition problem concerned with classifying a character into one of the 26 letters of the alphabet based on processed attributes of the isolated character (Frey and Slate, 1991). The pendigits dataset is similar to the letter dataset except for classifying numbers instead of letters (Alimoglu, 1996). The "usps" dataset deals with classifying text for the purpose of mailing letters (Hull, 1994). The "mnist" dataset uses 28 by 28 pixel images to classify text into one of ten different characters (LeCun et al., 1998). Pixels that always take on the same value were removed.

Two of the datasets are machine-learning competition challenges. The "ijcnn1" dataset is from the International Joint Conference on Neural Networks neural network competition (Feldkamp and Puskorius, 1998) while the "madelon" dataset comes from the International Conference on Neural Information Processing Systems Feature Selection Challenge (Guyon et al., 2004).

The "seismic" dataset deals with vehicle classification from seismic data (Duarte and Hu, 2004). The "mushrooms" dataset classifies wild mushrooms into poisonous and non-poisonous types based on their physical characteristics (Iba et al., 1988). The "phishing" dataset uses characteristics of a web address to predict whether or not a website is being used for nefarious purposes (Mohammad et al., 2014).

The final dataset, the "humidity" dataset, comprises simulated satellite radiometer radiances across 7 different frequencies in the microwave range. Corresponding to each instance is a value for relative humidity at a single vertical level. These humidity values have been discretized into 8 ranges to convert it into a statistical classification problem. A full description of the genesis of this dataset as well as a rationale for treatment using statistical classification is contained in Mills (2009). The statistical classification methods discussed in this paper were originally devised specifically for this problem.

Most of the datasets have been supplied already divided into a "test" set and a "training" set. If this is the case, then it is noted in the summary in Table 1 and the data has been used as given, with the training set used for training and the test set used for testing.
If the data is provided all in one lump, then it was randomly divided into test and training sets with the division different for each of the ten numerical trials.

To provide the best idea of when the technique is effective and when it is not, results from all 17 datasets will be shown. All datasets were pre-processed in the same way: by taking the averages and standard deviations of each feature from the training data and subtracting the averages from both the test and training data and dividing by the standard deviations. Features that took on the same value in the training data were removed.

Figure 1: Support vectors for a pair of synthetic test classes.
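The pre-processing just described amounts to the following minimal sketch (Python/NumPy; function name is illustrative): statistics are computed from the training data only and then applied to both sets, and constant features are dropped.

import numpy as np

def standardize(train, test):
    """Subtract training means, divide by training standard deviations, drop constant features."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    keep = std > 0                                   # features that vary in the training data
    return ((train[:, keep] - mean[keep]) / std[keep],
            (test[:, keep] - mean[keep]) / std[keep])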
We use the pair of synthetic test classes defined in Mills (2011) to illustrate the difference between support vectors and border vectors and between border vectors derived from AGF and from a LIBSVM model. Figure 1 shows a realization of the two sample classes in red and blue, comprising 300 samples total, along with the support vectors derived from a LIBSVM model. The support vectors are a subset of the training samples and while they tend to cluster around the border, they do not define it. For reference, the border between the two classes is also shown. This has been derived from the border-classification method described in Section 2.3 using the mathematical definition of the classes, hence it represents the "true" border to within a very small numerical error.

The true border is also compared with those derived from AGF and LIBSVM probability estimates in Figure 2. The classes are again shown for reference.

Figure 2: Borders mapped by the border-classification method starting with probabilities from the class definitions, adaptive Gaussian filtering (AGF), and a support vector machine (SVM).

Figure 3: Classification accuracy and uncertainty coefficient for border-classification starting with probabilities from the class definitions, adaptive Gaussian filtering (AGF), and a support vector machine (SVM). Average of 20 trials.

Figure 4: Number of support vectors against number of training vectors for a pair of synthetic test classes. The fitted curve returns an exponent of 0.94 and a multiplication coefficient of 0.38.

While these borders contain several hundred samples for a clear view of where they are located using each method, in fact the method works well with surprisingly few samples. Figure 3 shows a plot of the skill versus the number of border samples, where
U.C. stands for uncertainty coefficient. Note that the scores saturate at only about 20 samples, meaning that for this problem at least, very fast classifications are possible.

Unlike support vectors, the number of border samples required is approximately independent of the number of training samples. In addition to skill as a function of border samples for both AGF- and SVM-trained border-classifiers, Figure 3 also shows results for a border classifier trained from the mathematical definition of the classes themselves. The skill scores of this latter curve do not level off significantly faster than the other two. So long as the complexity of the problem does not increase, adding new training samples does not increase the number of border samples required for maximum accuracy.

Figure 4 shows the number of support vectors versus the number of training samples. The fitted curve is approximately linear, with an exponent of 0.94 and a multiplication coefficient of 0.38. In other words, for this problem there will be approximately 38% as many support vectors as there are training vectors.

Figure 5: Classification accuracy and uncertainty coefficient for a support vector machine (SVM) trained with different numbers of samples. Error bars represent the standard deviation of 20 trials.

Of course it's possible to speed up an SVM by sub-sampling the training data or the resulting support vectors. In such a case, the sampling must be done carefully so as not to reduce the accuracy of the result. Figure 5 shows the effect on classification skill for the synthetic test classes when the number of training samples is reduced. Skill scores start to saturate at between 200 and 300 samples. By contrast, Figure 3 implies that you need only 20 border samples for good accuracy, so even with only 200 training samples you will still have improved efficiency by using the borders technique.

This suggests a simple scaling law. The number of training samples required for good accuracy, and hence the number of support vectors, should be proportional to the approximate volume occupied by the training data in the feature space: n ∝ V, where n is the minimum number of training vectors and V is the volume. Then the number of border vectors should be proportional to the volume raised to the power of one less than the dimension of the feature space over the dimension: n_b ∝ V^{(D−1)/D}. Putting it together, we can relate the two as follows:

n_b ∝ n^{(D−1)/D}    (33)

where n_b is the minimum number of border vectors required for good accuracy.

In other words, provided the class borders are not fractal (Ott, 1993), mapping only the border between classes should always be faster than techniques that map the entirety of the class locations. This includes kernel density methods such as SVM as well as similar methods such as learning vector quantization (LVQ) (Kohonen, 2000; Kohonen et al., 1995) that attempt to create an idealized representation of the classes through a set of "codebook" vectors.

To make this more concrete, Figure 6 plots the classification time versus the number of support vectors for a SVM while Figure 7 plots the classification time versus the number of border samples for a border classifier. Classification times are for a single test point.

Figure 6: Classification time for a SVM for a single test point versus number of support vectors.

Figure 7: Classification time for a border classifier for a single test point versus number of border samples.

Figure 8: Number of border samples versus number of support vectors for equal classification times.
Fitted straight lines are overlaid for each and the slope and intercept printed in the subtitle.

Figure 8 plots the number of border vectors versus the number of support vectors at the "break even" point: that is, where the classification time is the same for each method. This graph was simply derived from the fitted coefficients of the previous two graphs. It is somewhat optimistic since LIBSVM has a larger overhead than the border classifiers. This overhead would be less significant for larger problems, with the "rule of thumb" suggested by the slope that the number of border vectors should be less than three times the number of support vectors for a reasonable gain in efficiency.

Unfortunately the graph is not general: while the borders method scales linearly with the number of classes, in LIBSVM there is some shared calculation for multi-class problems. That is, some of the support vectors are shared between classes; moreover the number will be different for each problem. Model size comparisons between the two methods should ideally be between the total number of support vectors versus the total number of border samples.

Table 2: Summary of the parameters used in the numerical trials for each of the four methods: KNN (k-nearest-neighbours), AGF (adaptive Gaussian filtering), SVM (support vector machine) and ACC ("accelerated" SVM). If the number of trials has been starred, some operations received only a single trial either because processing times were too long or because the dataset came with test and training sets already separated. See text.

Dataset   trials   f     KNN: k   AGF: W   AGF: k   AGF: n_b   ACC: n_b   SVM: γ   SVM: C
heart     10       0.4   11       10       -        100        100        0.01     0.5

Four classification models were tested on each of the 17 datasets described in Section 3.3: k-nearest-neighbours (KNN), a borders model derived from adaptive Gaussian filters (AGF), a support vector machine (SVM) and an "accelerated" SVM trained from the SVM probability estimates (SVM-borders).

Table 3: Collation of results for numerical trials of the four different statistical classification methods over six different datasets. For each dataset the quantities listed are training time (s), classification time (s), accuracy and uncertainty coefficient (U.C.) for each of KNN, AGF, SVM and the accelerated SVM.
Table 4: Collation of results for the numerical trials, continued for the remaining datasets, with the same quantities and methods as Table 3.
The parameter, k, for AGF is the number of nearest neighbours used when computing the probabilities: while the order of the method, O(n), remains the same, nonetheless it can produce a significant speed improvement for large problems. The parameter, n_b, is the number of class borders used in both AGF- and SVM-borders. For SVM, a Gaussian kernel is used as in (24) where γ = 1/(2σ²) is a tunable parameter. C is a cost parameter added to reduce over-fitting: see Equations (17) and (20).

In order to get a confidence interval on the results, ten trials were performed for most of the datasets. In some cases, only a single trial was performed, either because the operation took too long or because of a pre-existing separation between test and training data which was taken "as is". Single trials are indicated through the absence of error bars, which are calculated from the standard deviation. f is the fraction of test data relative to the total number of samples.

The results are summarized in Tables 3 and 4, including training and test time for each method as well as skill scores. There are two skill scores, the first being simple accuracy or fraction of correct guesses while the second, called the uncertainty coefficient, is based on information entropy and is described below. The best values for each dataset are highlighted in bold. Note that KNN does not have a training phase but sometimes its classification phase is shorter than all the others' training phase, in which case none of the numbers are highlighted but rather the hyphen that's put in place of the KNN training time. The SVM-borders method is not a stand-alone method, thus its training time is never highlighted.

Interestingly, the heart dataset classifiers are so fast that they were not detected by the system clock. Even more interestingly, skill scores for SVM-borders are higher than those for SVM. This does not break any laws of probability, however, since the two scores are well within each others' error bars and the interval for borders is larger than that for SVM.

Table 5: Total number of support vectors versus total number of border samples.

Dataset    D    n_c   n       Total support   Total borders   Time (s) SVM   Time (s) accel.
phishing   68   2     6633    1440 ± 32       500             2.71           0.259
humidity   7    8     51840   37767           5600            194            5.13

It is important to evaluate a result based on skill scores that reliably reflect how well a given classifier is doing. Thus we will define the two scores used in this validation exercise since one in particular is not commonly seen in the literature even though it has several attractive features.

Let {η_ij} be the confusion matrix, that is, the number of test values for which the first classifier (the "truth") returns the ith class while the second classifier (the estimate) returns the jth class. Let n_test = Σ_i Σ_j η_ij be the total number of test points.

The accuracy is given by:

a = Σ_i η_ii / n_test    (34)

or simply the fraction of correct guesses.

The uncertainty coefficient is a more sophisticated measure based on the channel capacity (Shannon and Weaver, 1963). It has the advantage over simple accuracy in that it is not affected by the relative size of each class distribution. It is also not affected by consistent rearrangement of the class labels.

The entropy of the prior distribution is given by:

H_i = − Σ_i ( Σ_j η_ij / n_test ) log ( Σ_j η_ij / n_test )    (35)
    = − (1/n_test) Σ_i ( Σ_j η_ij ) [ log ( Σ_j η_ij ) − log n_test ]    (36)

while the entropy of the posterior distribution is given by:

H(i|j) = − Σ_i Σ_j ( η_ij / n_test ) log [ η_ij / ( Σ_i η_ij ) ]    (37)
       = − (1/n_test) [ Σ_i Σ_j η_ij log η_ij − Σ_j ( Σ_i η_ij ) log ( Σ_i η_ij ) ]    (38)

The uncertainty coefficient is defined in terms of the prior entropy, H_i, and the posterior entropy, H(i|j), as follows:

U(i|j) = [ H_i − H(i|j) ] / H_i    (39)

and tells us: for a given test classification, how many bits of information on average does the estimate supply of the true class value? (Press et al., 1992; Mills, 2011).
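The two scores of equations (34)-(39) can be computed from a confusion matrix with the following minimal sketch (Python/NumPy; names are illustrative). Rows of eta index the true class and columns the estimated class; empty cells are skipped in the entropy sums.

import numpy as np

def skill_scores(eta):
    """Accuracy (Eq. 34) and uncertainty coefficient (Eq. 39) from a confusion matrix."""
    n_test = eta.sum()
    acc = np.trace(eta) / n_test
    p_true = eta.sum(axis=1) / n_test                    # prior over true classes
    h_prior = -np.sum(p_true[p_true > 0] * np.log(p_true[p_true > 0]))
    p_joint = eta / n_test
    p_est = eta.sum(axis=0) / n_test                     # marginal over estimated classes
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(eta > 0, p_joint * np.log(p_joint / p_est[np.newaxis, :]), 0.0)
    h_post = -np.sum(terms)                              # posterior entropy H(i|j)
    return acc, (h_prior - h_post) / h_prior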
Thirteen of the classification problems show a significant speed increase with the application of the borders technique, with heart, segment, dna and pendigits being the exceptions. Table 5 is an attempt to get a handle on the relative time complexity of the two methods and lists all the relevant variables: the number of features, the number of classes, the total number of training samples, and the total support for SVM and total number of border samples for the borders method, compared with the resulting classification time for the two methods.

Table 6: Results from SVM trials after sub-sampling to match SVM-borders speed if possible, otherwise skill is matched. Columns: dataset, samples, S.V. (support vectors), train (s), test (s), accuracy, U.C.
The two most relevant variables here are the number of support vectors versus the number of border vectors. In order to get a successful speed increase, the former should be larger than the latter, but as is apparent from some problems such as sat, usps, and mnist, even having slightly more border samples can sometimes produce a significant, although modest, improvement.

All increases in speed, however, come at the cost of accuracy. The question is, is the speed increase worth the decrease in skill? To test this, we sub-sample the datasets and then re-apply the SVM training until the speed of the two methods, SVM and SVM-borders, matches. In some cases, SVM could not be made fast enough by sub-sampling, in which case skill was matched instead.

It might seem more expedient to directly sub-sample the support vectors themselves rather than the training data. This, however, was found not to work and generated a precipitous drop in accuracy. Since the sparse coefficient set, ~w, is found through simultaneous optimization, the support vectors turn out to be interdependent.

Depending on how much the dataset is reduced, sub-sampling should be done with at least some care. On one hand, a more sophisticated sub-sampling technique might be considered a method on its own, comparable with the borders technique, but also likely requiring multiple training phases using the original technique, thus making it significantly slower. On the other hand, at minimum we should consider the relative size of each class distribution. If there are roughly the same number of samples in each class, then for small sub-samples the relative numbers should be kept constant. The shuttle dataset, however, has very uneven class numbers so it was sub-sampled differently in order to ensure that the smallest classes retain some membership. Let n_i be the number of samples of the ith class. Then the sub-sampled numbers are given by:

n'_i = α(n_i) n_i    (40)

The form of α used for the shuttle dataset was:

α(n) = C n^{−ζ}    (41)

where C = n_1^ζ, 0 < ζ < 1, and n_1 is the number of samples in the smallest class. To understand how this functional form was chosen, please see Appendix A.

The results of the sub-sampling exercise are shown in Table 6. This gives us a clearer understanding of whether or not and when SVM acceleration through borders sampling is effective. In some trials the speed increase is enough that even the AGF-borders method will provide an improvement over a sub-sampled SVM model, the results for AGF being wholly disappointing. And in a few trials, the speed increase is so great that SVM cannot match the borders method even through sub-sampling.

AGF-borders never beats SVM in skill and rarely even equals KNN, even though it's essentially the same method but using a more sophisticated kernel and with the borders training applied. Nonetheless, there is good reason to develop the method further: training time varies with the number of training samples (O(n)) rather than with its square (O(n²)). This is apparent for the largest datasets with more than a few thousand training samples, at which point the AGF-borders method starts to train faster than SVM.

There are at least three major sources of error for the AGF-borders technique. First, the kernel method upon which it is based is only first-order accurate. In particular, this will affect gradient estimates, which are semi-analytic: see Equation (28).
Second, the borders method provides only limited sampling of the discrimination border and this sampling is not strongly optimized. The sampling method, using pairs of training points of opposite class, will tend to favour regions of high density; however, directly optimizing for classification skill would be the ideal solution. Finally, the probability estimates extrapolate from only a single point. All these errors will tend to compound, especially after converting to multiple classes. Two of these error sources also affect SVM-borders but don't seem to have a large effect on the final results.

One potential improvement is to recalibrate the probability estimates as done with the LIBSVM decision in equation (29) (Platt, 1999; Lin et al., 2007). There are many other methods of recalibrating classification probability estimates: see for instance Niculescu-Mizil and Caruana (2005); Zadrozny and Elkan (2001). Initial trials have shown some success. Recalibrating results for the splice dataset by a simple shift of the threshold value for the decision function, for instance, increases the uncertainty coefficient to 0.43 (accuracy=0.87) for AGF-borders and 0.48 (accuracy=0.88) for SVM-borders. This simplest method of recalibration is built in to the libAGF software and was used to good effect in Mills (2009) and Mills (2011). SVM results for the same problem were already well enough calibrated that no significant improvement could be made by the same technique. Other problems were better calibrated, even for the borders classifiers.

The primary goal of this work was to improve the classification time of a SVM using a simple, piecewise linear classifier which we call the borders classifier. The outcome for each of the 17 datasets is summarized in Table 7. When trained from the SVM, the method succeeded for eight of the datasets and, by the same criteria, when trained from the simpler pointwise estimator ("AGF"), as compared with SVM, it succeeded for six of the datasets if we include the calibrated splice results. Not a perfect score but certainly worthwhile to try for operational retrievals where time performance is critical, for instance classifying large amounts of satellite data in real time. This is especially so in light of the high performance ratios for some of the problems: the humidity dataset is sped up by almost 20 times, for instance, with even higher factors for some of the binary datasets.

It's worthwhile to note where the algorithm is most likely to succeed and conversely where it might fail. One of the most successful trials was for the humidity dataset which produced one of the largest time improvements combined with relatively little loss of accuracy. This makes sense since the method was devised specifically for this problem and the humidity dataset epitomizes the characteristics for which the technique is most effective.

Since it assumes that the difference in conditional probabilities is a smooth and continuous function, the borders method tends to work poorly with integer or categorical data as well as problems with sharply defined, non-overlapping classes. Indeed, two of the problems where it took the biggest hit in accuracy, dna and splice, use binary and categorical data, respectively.

Table 7: Summary of results for all 17 datasets including a verdict on the success or failure of borders classification to speed up SVM.

heart: breaks even
shuttle: fails
sat: fails
segment: fails
dna: fails
splice: succeeds
splice*: succeeds*
codrna: succeeds
letter: fails
pendigits: fails
usps: ambiguous; must be re-done
mnist: fails
ijcnn1: succeeds
madelon: succeeds
seismic: succeeds
mushrooms: succeeds
phishing: succeeds
humidity: succeeds

* SVM-borders classifier has been calibrated.
Also, since there is no redundancy in calculations for multiple classes, whereas in SVM there is considerable redundancy, problems with a large number of classes should also be avoided. This can be mitigated by using a multi-class classification method requiring fewer binary classifiers, such as one-versus-the-rest with O(n_c) performance or a decision tree with O(log n_c) performance, rather than one-versus-one with its O(n_c²) time complexity.

The most important characteristic for success with the borders classification method is a large number of training samples used to train a SVM for maximum accuracy. This also implies a large number of support vectors, making the SVM slow. Choosing an appropriate number of border samples allows one to trade off accuracy for speed, with diminishing returns for larger numbers of border samples. The borders method, unlike SVM, also has a straightforward interpretation: the locations of the samples represent a hyper-surface that divides the two classes and their gradients are the normals to this surface. In this regard it is somewhat similar to rule-based classifiers such as decision trees.

There are many directions for future work. An obvious refinement would be to distribute the border samples less randomly and cluster them where they are most needed. As it is, the method of choosing by selecting random pairs of opposite classes will tend to distribute them in areas of high density. The current, random method was found to work well enough. Another potential improvement would be to position the border samples so as to directly minimize classification error. This need not be done all at once as in some of the methods mentioned in the Introduction, but rather point-by-point to keep the training relatively fast. A first guess could be found through a kernel method and then each point shifted along the normal.

Piecewise linear statistical classification methods are simple, powerful and fast and we think they should receive more attention. For certain types of datasets, particularly those with continuum features data, smooth probability functions (typically overlapping classes) and a large number of samples, the borders classification algorithm is an effective method of improving the classification time of kernel methods. Because it is not a stand-alone method, but requires probability estimates, it can achieve a fast training time since it is not solving a global optimization problem, yet still maintain reasonable accuracy. While it may not be the first choice for cracking "hard" problems, it is ideal for workaday problems, such as operational retrievals, for which speed is critical.

A Sub-sampling
Let n_i be the number of samples of the ith class, such that n_i ≥ n_{i−1}. Let 0 ≤ α(n) ≤ 1 and:

n'_i = α(n_i) n_i

We wish to retain the rank ordering of the class sizes:

α(n_i) n_i ≥ α(n_{i−1}) n_{i−1}

while ensuring that the smallest classes have some minimum representation:

α(n_i) ≤ α(n_{i−1})    (42)

Thus:

d/dn [n α(n)] = α(n) + n dα/dn ≥ 0
dα/dn ≥ − α(n)/n    (43)

The simplest means of ensuring that both (42) and (43) are fulfilled is to multiply the right side of (43) by a constant, 0 ≤ ζ ≤ 1, and equate it with the left side:

dα/dn = − ζ α(n)/n

Integrating:

α(n) = C n^{−ζ}

The parameter, ζ, is set such that:

Σ_i α(n_i) n_i = f Σ_i n_i

where 0 < f < 1 is the overall fraction of data to be retained. With C = n_1^ζ, where n_1 is the number of samples in the smallest class, this becomes:

Σ_i n_1^ζ n_i^{1−ζ} − f Σ_i n_i = 0

which can be solved for ζ with a one-dimensional root-finding algorithm.
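The following is a minimal sketch (Python/NumPy/SciPy; names are illustrative) of this sub-sampling scheme: C = n_1^ζ keeps the smallest class intact, and ζ is found by one-dimensional root finding so that the total retained fraction equals f. The bracketing interval assumes f is large enough that a root exists in [0, 1].

import numpy as np
from scipy.optimize import brentq

def subsample_fractions(class_sizes, f):
    """Per-class retention fractions alpha(n_i) = C n_i^(-zeta) with C = n_min^zeta."""
    n = np.asarray(class_sizes, dtype=float)
    n_min = n.min()
    total = n.sum()
    def excess(zeta):
        return np.sum(n_min ** zeta * n ** (1.0 - zeta)) - f * total
    zeta = brentq(excess, 0.0, 1.0)          # assumes a sign change in [0, 1]
    return (n_min ** zeta) * n ** (-zeta)    # alpha(n_i) for each class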
References

Alimoglu, F. (1996). Combining multiple classifiers for pen-based handwritten digit recognition. Master's thesis, Bogazici University.

Bagirov, A. M. (1999). Derivative-free methods for unconstrained nonsmooth optimization and its numerical analysis.
Investigacao Operacional, 19:75–93.

Bagirov, A. M. (2005). Max-min separability.
Optimization Methods andSoftware , 20(2-3):277–296.Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vec-tor machines.
ACM Transactions on Intelligent Systems and Technology ,2(3):27:1–27:27.Chen, F., Yu, H., Yao, J., and Hu, R. (2015). Robust sparse kernel densityestimation by inducing randomness.
Pattern Analysis and Applications, 18:367.

Duarte, M. F. and Hu, Y. H. (2004). Vehicle classification in distributed sensor networks.
Journal of Parallel and Distributed Computing, 64:826–838.

Feldkamp, L. and Puskorius, G. V. (1998). A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification.
Proceedings of the IEEE , 86(11):2259–2277.Frey, P. and Slate, D. (1991). Letter recognition using holland-style adaptiveclassifiers.
Machine Learning , 6(2):161–182.Gai, K. and Zhang, C. (2010). Learning Discriminative Piecewise LinearModels with Boundary Points. In
Proceedings of the Twenty-Fourth AAAIConference on Artificial Intelligence , pages 444–450. Association for theAdvancement of Artificial Intelligence.Guyon, I., Gunn, S., Hur, A. B., and Dror, G. (2004). Results analysis ofthe NIPS 2003 feature selection challenge. In
Proceedings of the 17th In-ternational Conference on Neural Information Processing Systems , pages545–552, Vancouver. MIT Press.Herman, G. T. and Yeung, K. T. D. (1992). On piecewise-linear classifica-tion.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):782–786.

Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks ,13(2):415–425.Huang, X., Mehrkanoon, S., and Suykens, J. A. K. (2013). Support vec-tor machines with piecewise linear feature mapping.
Neurocomputing ,117(6):118–127.Hull, J. J. (1994). A database for handwritten text recognition re-search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554.

Iba, W., Wogulis, J., and Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. In
Proceedings of the Fifth International Conference on Machine Learning, pages 73–79.

King, R. D., Feng, C., and Sutherland, A. (1995). Statlog: Comparison of Classification Algorithms on Large Real-World Problems.
Applied ArtificialIntelligence , 9(3):289–333.Kohonen, T. (2000).
Self-Organizing Maps . Springer-Verlag, 3rd edition.Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., and Torkkola, K.(1995).
LVQ PAK: The Learning Vector Quantization Package, Version3.1 .Kostin, A. (2006). A simple and fast multi-class piecewise linear patternclassifier.
Pattern Recognition , 39:1949–1962.LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE ,86(11):2278–2324.Lee, T. and Richards, J. A. (1984). Piecewise linear classification usingseniority logic committee methods with application to remote sensing.
Pattern Recognition , 17(4):453–464.Lee, T. and Richards, J. A. (1985). A low cost classifier for multitemporalapplications.
International Journal of Remote Sensing , 6(8):1405–1417.Lichman, M. (2013). UCI machine learning repository.Lin, H.-T., Lin, C.-J., and Weng, R. C. (2007). A note on Platt’s probabilis-tic outputs for support vector machines.
Machine Learning, 68(3):267–276.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C., editors (1994).
MachineLearning, Neural and Statistical Classification . Ellis Horwood Series inArtificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Availableonline at: .Mills, P. (2009). Isoline retrieval: An optimal method for validation ofadvected contours.
Computers & Geosciences , 35(11):2020–2031.Mills, P. (2011). Efficient statistical classification of satellite measurements.
International Journal of Remote Sensing, 32(21):6109–6132.

Mohammad, R., Thabtah, F. A., and McCluskey, T. (2014). Predicting phishing websites based on self-structuring neural network.
Neural Computing and Applications, 25(2):443–458.

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms.
IEEE Transactionson Neural Networks , 12(2):181–201.Niculescu-Mizil, A. and Caruana, R. A. (2005). Obtaining calibrated prob-abilities from boosting. In
Proceedings of the Twenty-First Conference onUncertainty in Artificial Intelligence , pages 413–420.Osborne, M. (1977). Seniority Logic: A Logic of a Committee Machine.
IEEE Transactions on Computers , 26(12):1302–1306.Ott, E. (1993).
Chaos in Dynamical Systems . Cambridge University Press.Pavlidis, N. G., Hofmeyr, D. P., and Tasoulis, S. K. (2016). MinimumDensity Hyperplanes.
Journal of Machine Learning Research , 17(156):1–33.Platt, J. (1999). Probabilistic outputs for support vector machines and com-parison to regularized likelihood methods. In
Advances in Large MarginClassifiers . MIT Press.Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992).
Numerical Recipes in C . Cambridge University Press, 2nd edition.Shannon, C. E. and Weaver, W. (1963).
The Mathematical Theory of Com-munication . University of Illinois Press.Sklansky, J. and Michelotti, L. (1980). Locally trained piecewise linear clas-sifiers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(2):101–111.

Tenmoto, H., Kudo, M., and Shimbo, M. (1998). Piecewise linear classifiers with an appropriate number of hyperplanes.
Pattern Recognition ,31(11):1627–1634.Terrell, D. G. and Scott, D. W. (1992). Variable kernel density estimation.
Annals of Statistics , 20:1236–1265.Uzilov, A. V., Keegan, J. M., and Mathews, D. H. (2006). Detection ofnon-coding rnas on the basis of predicted secondary structure formationfree energy change.
BMC Bioinformatics , 7:173.Wang, J. and Saligrama, V. (2013). Locally-Linear Learning Machines(L3M). In
Proceedings of Machine Learning Research , volume 29, pages451–466.Webb, D. (2012).
Efficient Piecewise Linear Classifiers and Applications .PhD thesis, University of Ballarat, Victoria, Australia.Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probability Estimatesfor Multi-class Classification by Pairwise Coupling.
Journal of MachineLearning Research , 5:975–1005.Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability esti-mates from decision trees and naive bayesian classifiers. In