Accelerating Kernel Classifiers Through Borders Mapping
Peter Mills [email protected]
July 27, 2018
Abstract
Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 9 of them by factors as high as 88 times or more. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.
Keywords: class borders, multi-dimensional root-finding, adaptive Gaussian filtering, binary classifier, multi-class classifier, non-parametric statistics, variable kernel density estimation

List of symbols

symbol - definition (first used)
g - decision function in linear classifier (1)
~v - hyperplane normal of decision boundary (1)
b - constant value defining location of decision boundary (1)
~x - test point (1)
p̃ - kernel density estimator (2)
K - kernel function taking two vector arguments (2)
{~x_i} - set of training samples (2)
~κ - parameters in vector kernel (2)
N - kernel normalization coefficient (2)
n - number of training samples (2)
~x = {x_i} - point in the feature space (3)
c - class label (4)
y_i - class label of the ith training sample (4)
p(c|~x) - conditional probability (Section 2)
p(c, ~x) - joint probability (Section 2)
K - kernel function taking a single scalar argument (5)
σ - "bandwidth": sets the size of a scalar kernel (5)
P(~x) - true probability density (6)
D - number of dimensions in feature space (6)
W - sum of the kernels at each training sample (7)
~w = {w_i} - coefficients in weighted kernel estimator (8)
~φ - theoretical expanded feature space in kernel-based SVM (9)
g_svm - raw decision function in binary SVM (11)
C - cost parameter in SVM for reducing over-fitting (20)
r - difference in conditional probabilities (21)
r̃ - decision function which is an estimator of r (21)
r̃_kern - kernel estimator for r (22)
r̃_vb - variable kernel estimator for r (23)
{~b_i} - set of vectors defining the class border (27)
{~v_i} - set of normals to the class border (27)
n_b - number of border vectors (27)
g_border - raw decision function for border classification (27)
d - distance in feature space (27)
K̇ - derivative of a scalar kernel, K (27)
r̃_svm - LIBSVM estimator for r (29)
A - coefficient used in r̃_svm (29)
B - coefficient used in r̃_svm (29)
r̃_border - borders estimator for r (30)
p_i = p(i|~x) - conditional probability of the ith class (31)
n_c - number of classes (31)
λ - Lagrange multiplier in multi-class problem (32)
V - volume of the feature space occupied by training data (33)
n - number of training samples needed for good accuracy (33)
n_b - number of border samples needed for good accuracy (33)
γ = 1/(2σ²) - parameter used for Gaussian kernels in SVM (Section 5)
f - fraction of data used for testing (Section 5)
{η_ij} - confusion matrix (34)
n_test - number of test points (34)
H_i - entropy of the prior distribution (36)
H(i|j) - entropy of the posterior distribution (37)
U(i|j) - uncertainty coefficient (normalized channel capacity) (39)
n_i - size of the ith class (40)
α - subsampling fraction as a function of class size (40)
ζ - exponent in subsampling function α (41)
C - coefficient in subsampling function α (41)

1 Introduction

Linear classifiers are well studied in the literature. Methods such as the perceptron, Fisher discriminant, logistic regression and now linear support vector machines (SVM) (Michie et al., 1994) are often appropriate for relatively simple, binary classification problems in which both classes are closely clustered or are well separated. An obvious extension for more complex problems is a piecewise linear classifier in which the decision boundary is built up from a series of linear classifiers.
Piecewise linear classifiers enjoyed some popularity during the early development of the field of machine learning (Osborne, 1977; Sklansky and Michelotti, 1980; Lee and Richards, 1984, 1985) and because of their versatility, generality and simplicity there has been recent renewed interest (Bagirov, 2005; Kostin, 2006; Gai and Zhang, 2010; Webb, 2012; Wang and Saligrama, 2013; Pavlidis et al., 2016).

A linear classifier takes the form:

g(~x) = ~v · ~x + b    (1)

where ~x is a test point in the feature space, ~v is a normal to the decision hyper-surface, b determines the location of the decision boundary along the normal and g is the decision function which we use to estimate the class of the test point through its sign.

A piecewise linear classifier collects a set of such linear classifiers: {~v_i} = {~v_1, ~v_2, ~v_3, ...}; {b_i} = {b_1, b_2, b_3, ...}. The two challenges here are, first, how to efficiently train each of the decision boundaries and, second, the related problem of how to partition the feature space to determine which linear decision boundary is used for a given test point.

In Bagirov (2005), for instance, the decision function is defined by partitioning the set of linear classifiers and maximizing the minimum linear decision value in each partition. To train the classifier, a cost function is defined in terms of this decision function and directly minimized using an analog to the derivative for non-smooth functions (Bagirov, 1999). Naturally, such an approach will be quite computationally costly.

Partitioning of the feature space can be separate from the discrimination borders (Huang et al., 2013) but more normally the discrimination borders are themselves sufficient to partition the feature space (Osborne, 1977; Lee and Richards, 1984; Bagirov, 2005; Kostin, 2006). This means that all or a significant fraction of the component linear classifiers must be evaluated. In Kostin (2006), for instance, the linear classifiers form a decision tree.

In the method described in this paper, the constant term, b_i, is changed to a vector and the partitioning is accomplished through a nearest neighbour search over these vectors. Thus the zone of influence for each hyperplane will be described by the Voronoi tessellation (Kohonen, 2000). If the class domains are simply connected and don't curve back on themselves, then the partitions will also be shaped as hyper-pyramids, with the axes of the pyramids roughly perpendicular to the decision border. A dot product with each of the vectors must be calculated, similar to a linear classifier, but afterwards only a single linear decision function is evaluated.

There seems to be some tension in the literature between training the decision boundary through simultaneous optimization (Bagirov, 2005; Wang and Saligrama, 2013) or through methods that are more piecemeal (Gai and Zhang, 2010; Herman and Yeung, 1992; Kostin, 2006). Obviously, simultaneous optimization will be more accurate but also much more computationally expensive. In addition, finding global minima for cost functions involving more than a handful of hyper-surfaces will be all but impossible. There is also the issue of separability. Many of the current crop of methods seem to be designed with disjoint classes in mind (Herman and Yeung, 1992), for instance Gai and Zhang (2010), who stick the hyper-plane borders between neighbouring pairs of opposite classes.
Yet there is no reason why a piecewise linear classifier cannot be just as effective for overlapping classes.

The technique under discussion in this paper mitigates all of these issues because it is not a stand-alone method but requires estimates of the conditional probabilities. It is used to improve the time performance of kernel methods, or for that matter, any binary classifier that returns a continuous decision function that can approximate a conditional probability. This is done while maintaining, in all but a few cases, most of the accuracy.

Several of the piecewise linear techniques found in the literature work by positioning each hyperplane between pairs of clusters or pairs of training samples of opposite class (Sklansky and Michelotti, 1980; Tenmoto et al., 1998; Kostin, 2006; Gai and Zhang, 2010). Other unsupervised or semi-supervised classifiers work by placing the hyperplanes in regions of minimum density (Pavlidis et al., 2016). The method described in this paper in some senses combines these two techniques by finding the root of the difference in conditional probabilities along a line between two points of opposite class. It will be tested on two kernel-based classifiers, a support vector machine (SVM) (Michie et al., 1994; Müller et al., 2001) and a simpler, "pointwise estimator" (Terrell and Scott, 1992; Mills, 2011), and evaluated based on how well it improves classification speed and at what cost to accuracy.

Section 2 describes the theory of support vector machines and the pointwise estimator ("adaptive Gaussian filtering") as well as the piecewise linear classifier or "borders" classifier that will be trained on the two kernel estimators. Section 3 describes the software and test datasets, then in Section 4 we analyze the different classification algorithms on a simple, synthetic dataset. Section 5 outlines the results for 17 case studies while in Section 6 we discuss the results. Section 7 concludes the paper.

2 Theory

A kernel is a scalar function of two vectors that can be used for non-parametric density estimation. A typical "kernel-density estimator" looks like this:

p̃(~x) = (1/(nN)) Σ_{i=1}^{n} K(~x, ~x_i, ~κ)    (2)

where p̃ is an estimator for the density, P, K is the kernel function, {~x_i | i ∈ [1, n]} are a set of training samples, ~x is the test point, and ~κ is a set of parameters. The normalization coefficient, N, normalizes K:

N = ∫_V K(~x, ~x', ~κ) d~x'    (3)

The method can be used for statistical classification by comparing results from the different classes:

c = arg max_i Σ_{j | y_j = i} K(~x, ~x_j, ~κ)    (4)

where y_j is the class of the jth sample. Similarly, the method can also return estimates of the joint (p(c, ~x)) and conditional probabilities (p(c|~x)) by dividing the sum in (4) by nN or by the sum of all the kernels, respectively.

If the same kernel is used for every sample and every test point, the estimator may be sub-optimal, particularly in regions of very high or very low density. There are at least two ways to address this problem. In a "variable-bandwidth" estimator, the coefficients, ~κ, depend in some way on the density itself. Since the actual density is normally unavailable, the estimated density can be used as a proxy (Terrell and Scott, 1992; Mills, 2011).

Let the kernel function take the following form:

K(~x, ~x', σ) = K(|~x − ~x'| / σ)    (5)

where σ is the "bandwidth". In Mills (2011), σ is made proportional to the density:

σ ∝ P^{−1/D} ≈ p̃^{−1/D}    (6)

where D is the dimension of the feature space. Since the normalization coefficient, N, must include the factor σ^D, some rearrangement shows that:

Σ_i K(|~x − ~x_i| / σ) = W = const.    (7)

This is a generalization of the k-nearest-neighbours scheme in which the free parameter, W, takes the place of k (Mills, 2009, 2011). The bandwidth, σ, can be solved for using any numerical, one-dimensional root-finding algorithm. The bandwidth is determined uniquely for a given test point but is held constant for that one, which makes this a "balloon" estimator. Contrast a "point-wise" estimator in which bandwidths are different for each training point but need only be determined once (Terrell and Scott, 1992).
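As an illustration of equations (4) and (7), the following is a minimal sketch (Python with NumPy and SciPy; it is not the libAGF implementation) of the "balloon" estimator: for each test point the bandwidth σ is found by one-dimensional root finding so that the summed Gaussian kernels equal the free parameter W, and the per-class kernel sums then give the class estimate and conditional probabilities. The function names and the bracketing interval passed to the root finder are illustrative assumptions.

import numpy as np
from scipy.optimize import brentq

def balloon_bandwidth(x, samples, W, sig_lo=1e-3, sig_hi=1e3):
    """Solve sum_i K(|x - x_i|/sigma) = W for sigma (Eq. 7) with a Gaussian kernel."""
    d = np.linalg.norm(samples - x, axis=1)            # distances to all training points
    f = lambda sigma: np.sum(np.exp(-0.5 * (d / sigma) ** 2)) - W
    return brentq(f, sig_lo, sig_hi)                   # assumes the root is bracketed by [sig_lo, sig_hi]

def kernel_classify(x, samples, labels, W):
    """Class estimate and conditional probabilities from per-class kernel sums (Eq. 4)."""
    sigma = balloon_bandwidth(x, samples, W)
    k = np.exp(-0.5 * (np.linalg.norm(samples - x, axis=1) / sigma) ** 2)
    classes = np.unique(labels)
    scores = np.array([k[labels == c].sum() for c in classes])
    return classes[np.argmax(scores)], scores / scores.sum()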
Another method of improving the performance of a kernel-density estimator is to multiply each kernel by a coefficient:

p̃(~x) = Σ_i w_i K(~x, ~x_i, ~κ)    (8)

The coefficients, {w_i}, are found through an optimization procedure designed to minimize the error (Chen et al., 2015). In the most popular form of this kernel method, support vector machines (SVM), the coefficients are the result of a complex, dual optimization procedure which minimizes the classification error. We will briefly outline this procedure.

2.2 Support Vector Machines

The basic "trick" of kernel-based SVM methods is to replace a dot product with the kernel function in the assumption that it can be rewritten as a dot product of a transformed and expanded feature space:

K(~x, ~x') = ~φ(~x) · ~φ(~x')    (9)

For simplicity we have omitted the kernel parameters. ~φ is a vector function of the feature space. The simplest example of a kernel function that has a closed, analytical and finite-dimensional ~φ is the square of the dot product:

K(~x, ~x') = (~x · ~x')²    (10)
          = (x_1², x_2², ..., √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_{D−1} x_D) · (x_1'², x_2'², ..., √2 x_1' x_2', √2 x_1' x_3', ..., √2 x_{D−1}' x_D')

but it should be noted that in more complex cases, there is no need to actually construct ~φ since it is replaced by the kernel function, K, in the final analysis.

In a binary SVM classifier, the classes are separated by a single hyperplane defined by ~v and b. In a kernel-based SVM, this hyperplane bisects not the regular feature space, but the theoretical, transformed space defined by the function ~φ(~x). The decision value is calculated via a dot product:

g_svm(~x) = ~v · ~φ(~x) + b    (11)

and the class determined, as before, by the sign of the decision value:

c(~x) = g_svm(~x) / |g_svm(~x)|    (12)

where for convenience, the class labels are given by c ∈ {−1, +1}.

In the first step of the minimization procedure, the magnitude of the border normal, ~v, is minimized subject to the constraint that there are no classification errors:

min_{~v, b} |~v|²   subject to   g_svm(~x_i) y_i ≥ 1

Introducing a set of coefficients, {w_i}, as Lagrange multipliers on the constraints:

min_{~v, b} { (1/2)|~v|² − Σ_i w_i [g_svm(~x_i) y_i − 1] }    (13)

generates the following pair of analytic expressions:

Σ_i w_i y_i = 0    (14)
~v = Σ_i w_i y_i ~φ(~x_i)    (15)

through setting the derivatives w.r.t. the minimizers to zero. Substituting the second equation, (15), into the decision function in (11) produces the following:

g_svm(~x) = Σ_i w_i y_i K(~x, ~x_i) + b    (16)

Thus, the final, dual, quadratic optimization problem looks like this:

max_{w_i} { Σ_i w_i − (1/2) Σ_{i,j} w_i w_j y_i y_j K(~x_i, ~x_j) }    (17)
w_i ≥ 0    (18)
Σ_i w_i y_i = 0    (19)

There are a number of refinements that can be applied to the optimization problem in (17)-(19), chiefly to reduce over-fitting and to add some "margin" to the decision border to allow for the possibility of classification errors. For instance, substitute the following for (18):

0 ≤ w_i ≤ C    (20)

where C is the cost (Müller et al., 2001). Mainly we are concerned here with the decision function in (16) since the initial fitting will be done with an external software package, namely LIBSVM (Chang and Lin, 2011).

Two things should be noted. First, the function ~φ appears in neither the final decision function, (16), nor in the optimization problem, (17). Second, while the use of ~φ implies that the time complexity of the decision function could be O(1) as in a parametric statistical model, in actual fact it is dependent on the number of non-zero values in {w_i}. While the coefficient set, {w_i}, does tend to be sparse, nonetheless in most real problems the number of non-zero coefficients is proportional to the number of samples, n, producing a time complexity of O(n). Thus for large problems, calculating the decision value will be slow, just as in other kernel estimation problems.

The advantage of SVM lies chiefly in its accuracy since it is minimizing the classification error whereas a more basic kernel method is more ad hoc and does little more than sum the number of samples of a given class, weighted by distance.
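As a concrete illustration of equation (16), the following minimal sketch (Python/NumPy; the names are illustrative and this is not LIBSVM's API) evaluates the raw decision value given a set of support vectors, their dual coefficients w_i y_i, the bias b, and a Gaussian kernel written as exp(−γ |~x − ~x_i|²):

import numpy as np

def g_svm(x, support_vectors, dual_coef, b, gamma):
    """Raw kernel-SVM decision value: sum_i w_i y_i K(x, x_i) + b (Eq. 16), Gaussian kernel."""
    d2 = np.sum((support_vectors - x) ** 2, axis=1)    # squared distances to support vectors
    return np.dot(dual_coef, np.exp(-gamma * d2)) + b

The cost of one evaluation is clearly proportional to the number of support vectors, which is the point made above about O(n) time complexity.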
2.3 Borders classification

In kernel SVM, the decision border exists only implicitly in a hypothetical, abstract space. Even in linear SVM, if the software is generalized to recognize the simple dot product as only one among many possible kernels, then the decision function may be built up, as in (16), through a sum of weighted kernels. This is the case for LIBSVM. The advantage of an explicit decision border as in (1) or (11) is that it is fast. The problem with a linear border is that, except for a small class of problems, it is not very accurate.

In the binary classification method described in Mills (2011), a non-linear decision border is built up piece-wise from a collection of linear borders. It is essentially a root-finding procedure for a decision function, such as g_svm in (16). Let r̃ be a decision function that approximates the difference in conditional probabilities:

r̃(~x) ≈ r(~x) = p(1|~x) − p(−1|~x)    (21)

where p(c|~x) represents the conditional probabilities of a binary classifier having labels c ∈ {−1, +1}. For a simple kernel estimator, for instance, r is estimated as follows:

r̃_kern(~x) = Σ_i y_i K(~x, ~x_i) / Σ_i K(~x, ~x_i)    (22)

where y_i ∈ {−1, +1}. For the variable bandwidth kernel estimator defined by (7), this works out to:

r̃_vb(~x) = (1/W) Σ_i y_i K(|~x − ~x_i| / σ(~x))    (23)

A variable bandwidth kernel-density estimator with a Gaussian kernel,

K(~x, ~x', σ) = exp(−|~x − ~x'|² / (2σ²))    (24)

we will refer to as an "Adaptive Gaussian Filter" or AGF for short. This kernel will also be used for SVM where it's often called a "radial basis function" or RBF for short.

The procedure is as follows: pick a pair of points on either side of the decision boundary (where the decision function has opposite signs). Good candidates are one random training sample from each class. Then, zero the decision function along the line between the points. This can be done as many times as needed to build up a good representation of the decision boundary. We now have a set of points, {~b_i}, such that r̃(~b_i) = 0 for every i ∈ [1..n_b], where n_b is the number of border samples.

Along with the border samples, {~b_i}, we also collect a series of normal vectors, {~v_i}, such that:

~v_i = ∇_~x r̃ |_{~x = ~b_i}    (25)
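The following is a minimal sketch of the border-sampling step just described (Python/NumPy/SciPy; not the libAGF implementation). `decision` stands for any estimator of r, such as (22), (23) or the SVM-based estimator of the next subsection; for brevity the normal is taken by a numerical central difference, whereas the paper uses (semi-)analytic gradients. The pair-selection strategy and parameter names are illustrative assumptions, and the routine returns at most n_b border samples since pairs without a sign change are skipped.

import numpy as np
from scipy.optimize import brentq

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        g[j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

def sample_border(decision, x_minus, x_plus, n_b, rng):
    """Collect border points b_i with decision(b_i) ~ 0 and their normals v_i (Eq. 25)."""
    borders, normals = [], []
    for _ in range(n_b):
        a = x_minus[rng.integers(len(x_minus))]        # random training sample of class -1
        c = x_plus[rng.integers(len(x_plus))]          # random training sample of class +1
        f = lambda t: decision(a + t * (c - a))        # decision value along the segment a -> c
        if f(0.0) * f(1.0) >= 0:                       # no sign change: skip this pair
            continue
        t0 = brentq(f, 0.0, 1.0)                       # 1-D root along the line between the points
        b = a + t0 * (c - a)
        borders.append(b)
        normals.append(numerical_gradient(decision, b))
    return np.array(borders), np.array(normals)

Here rng would be, for example, numpy.random.default_rng().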
With this system, determining the class is a two-step process. First, the nearest border sample to the test point is found. Second, we define a new decision function, g_border, equivalent to (1), through a dot product with the normal:

i = arg min_j |~x − ~b_j|    (26)
g_border(~x) = ~v_i · (~x − ~b_i)    (27)

The class is determined by the sign of the decision function as in (12). The time complexity is completely independent of the number of training samples; rather, it is linearly proportional to the number of border vectors, n_b, a tunable parameter. The number required for accurate classification is dependent on the complexity of the decision border.

The gradient of the variable-bandwidth kernel estimator in (23) is:

∂r̃_vb/∂x_j = (1/(σW)) Σ_i y_i K̇(d_i/σ) [ (x_j − x_ij)/d_i − d_i ( Σ_k K̇(d_k/σ) (x_j − x_kj)/d_k ) / ( Σ_k d_k K̇(d_k/σ) ) ]

where d_i = |~x − ~x_i| is the distance between the test point and the ith sample and K̇ is the derivative of K. For AGF, this works out to:

∂r̃_vb/∂x_j = (1/(σ²W)) Σ_i y_i K(d_i/σ) [ (x_ij − x_j) − d_i² ( Σ_k K(d_k/σ) (x_kj − x_j) ) / ( Σ_k K(d_k/σ) d_k² ) ]    (28)

where K(x) = exp(−x²/2) (Mills, 2011).
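The classification step of equations (26) and (27) is simple enough to show as a short sketch (Python/NumPy; names are illustrative): find the nearest border sample and project the offset from it onto the stored normal.

import numpy as np

def classify_border(x, borders, normals):
    """Nearest border sample (Eq. 26) and linear decision value (Eq. 27)."""
    i = np.argmin(np.linalg.norm(borders - x, axis=1))   # nearest border sample
    g = np.dot(normals[i], x - borders[i])                # raw decision value
    return np.sign(g), g   # class in {-1, +1}; g can be mapped through tanh (Eq. 30 below) for probabilities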
In LIBSVM, conditional probabilities are estimated by applying logistic regression (Michie et al., 1994) to the raw SVM decision function, g_svm, in (16):

r̃_svm(~x) = tanh[ −(A g_svm(~x) + B) / 2 ]    (29)

(Chang and Lin, 2011), where A and B are coefficients derived from the training data via a nonlinear fitting technique (Platt, 1999; Lin et al., 2007). The gradient of the revised SVM decision function, above, is:

∇_~x r̃_svm = −(A/2) [ 1 − r̃_svm(~x)² ] Σ_i w_i y_i ∇_~x K(~x, ~x_i)

Gradients of the initial decision function are useful not just to derive normals to the decision boundary, but also as an aid to root finding when searching for border samples. If the decision function used to compute the border samples represents an estimator for the difference in conditional probabilities, then the raw decision value, g_border, derived from the border sampling technique in (27) can also return estimates of the conditional probabilities with little extra effort and little loss of accuracy, also using a sigmoid function:

r̃_border(~x) = tanh[ g_border(~x) ]    (30)

This assumes that the class posterior probabilities, p(~x|c), are approximately Gaussian near the border (Mills, 2011).

The border classification algorithm returns an estimator, r̃_border, for the difference in conditional probabilities of a binary classifier using equations (27) and (30). It can be trained with the functions r̃_kern in (22), r̃_vb in (23), r̃_svm in (29), or any other continuous, differentiable, non-parametric estimator for the difference in conditional probabilities, r. At the cost of a small reduction in accuracy, it has the potential to drastically reduce classification time for kernel estimators and other non-parametric statistical classifiers, especially for large training datasets, since it has O(n_b) time complexity instead of O(n) complexity, where n_b, the number of border samples, is a free parameter. The actual number chosen can trade off between speed and accuracy with rapidly diminishing returns beyond a certain point. One hundred border samples (n_b = 100) is usually sufficient. The computation of r̃_border also involves very simple operations (floating point addition, multiplication and numerical comparison, with no transcendental functions except for the very last step, which can be omitted), so the coefficient for the time complexity will be small.

A border classifier trained with AGF will be referred to as an "AGF-borders" classifier while a border classifier trained with SVM estimates will be referred to as an "SVM-borders" classifier or an "accelerated" SVM classifier.

2.4 Multi-class classification

The border classification algorithm, like SVM, only works for binary classification problems. It is quite easy to generalize a binary classifier to perform multi-class classifications by using several of them, and the number of ways of doing so grows exponentially with the number of classes. Since LIBSVM uses the "one-versus-one" method (Hsu and Lin, 2002) of multi-class classification, this is the one we will adopt.

A major advantage of the borders classifier is that it returns probability estimates. These estimates have many uses, including measuring the confidence of, as well as recalibrating, the class estimates (Mills, 2009, 2011).
Thus the multi-class method should also solve for the conditional probabilities in addition to returning the class label.

In a one-vs.-one scheme, the multi-class conditional probabilities can be related to those of the binary classifiers as follows:

r_ij(~x) = [ p(j|~x) − p(i|~x) ] / [ p(i|~x) + p(j|~x) ]    (31)

where i ∈ [1..n_c − 1], j ∈ [1..n_c], n_c is the number of classes, j > i, and r_ij is the difference in conditional probabilities of the binary classifier that discriminates between the ith and jth classes. Wu et al. (2004) transform this problem into the following linear system:

p_i Σ_{k ≠ i} (r_ki + 1) + Σ_{j ≠ i} p_j (1 − r_ij) + λ = 0    (32)
Σ_j p_j = 1

where p_i = p(i|~x) is the ith multi-class conditional probability and λ is a Lagrange multiplier. They also show that the constraints not included in the problem, that the probabilities are all positive, are always satisfied, and describe an algorithm for solving it iteratively, although a simple matrix solver is sufficient.
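The following is a minimal sketch (Python/NumPy; not the libAGF implementation) of recovering the multi-class probabilities from the pairwise estimates of equation (31). Rather than transcribing equation (32) directly, it solves the underlying pairwise relations p_i (1 + r_ij) = p_j (1 − r_ij), together with the constraint that the probabilities sum to one, in the least-squares sense; this is a simple stand-in for the iterative scheme of Wu et al. (2004). R is assumed to be an antisymmetric matrix with R[i, j] = r_ij.

import numpy as np

def couple_probabilities(R):
    """Multi-class conditional probabilities from pairwise differences r_ij (Eq. 31)."""
    n_c = R.shape[0]
    rows, rhs = [], []
    for i in range(n_c):
        for j in range(i + 1, n_c):
            row = np.zeros(n_c)
            row[i] = 1.0 + R[i, j]       # from r_ij = (p_j - p_i) / (p_i + p_j)
            row[j] = -(1.0 - R[i, j])
            rows.append(row)
            rhs.append(0.0)
    rows.append(np.ones(n_c))            # probabilities sum to one
    rhs.append(1.0)
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    p = np.clip(p, 0.0, None)            # guard against small negative values
    return p / p.sum()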
LIBSVM is a machine learning software library for support vector machines developed by Chih-Chung Chang and Chih-Jen Lin of the National Taiwan University, Taipei, Taiwan (Chang and Lin, 2011). It includes statistical classification using two regularization methods for minimizing over-fitting: C-SVM and ν-SVM. It also includes code for nonlinear regression and density estimation or "one-class SVM". SVM models were trained using the svm-train command while classifications were done with svm-predict. LIBSVM can be found at:

Similar to LIBSVM, libAGF is a machine learning library but for variable kernel estimation (Mills, 2011; Terrell and Scott, 1992) rather than SVM. Like LIBSVM, it supports statistical classification, nonlinear regression and density estimation. It supports both Gaussian kernels and k-nearest neighbours. It was written by Peter Mills and can be found at https://github.com/peteysoft/libmsci.

Except for training and classifying the SVM models, all calculations in this paper were done with the libAGF library. To convert a LIBSVM model to a borders model, the single command, svm_accelerate, can be used. Classifications are then performed with classify_m.

The borders classification algorithm was tested on a total of 17 different datasets. These will be briefly described in this section. The collection covers a fairly broad range of sizes and types of problems, numbers of classes and numbers and types of attributes, but with the focus on larger datasets where the borders technique is actually useful. Four of the datasets are from the "Statlog" project (Michie et al., 1994; King et al., 1995) and are nicknamed "heart", "shuttle", "sat" and "segment". The heart disease ("heart") dataset contains thirteen attributes of 270 patients along with one of two class labels denoting either the presence or absence of heart disease. The dataset comes originally from the Cleveland Clinic Foundation and two versions are stored on the machine learning database of U. C. Irvine (Lichman, 2013).

The shuttle dataset is interesting because the classes have a very uneven distribution, meaning that multi-class classifiers with a symmetric breakdown of the classes, such as one-vs.-one, tend to perform poorly. The shuttle dataset comes originally from NASA and was taken from an actual space shuttle flight. The classes describe actions to be taken at different flight configurations.

Table 1: Summary of datasets used in the numerical trials.

Dataset     D    Type     n_c   Train   Test    Reference
heart       13   real     2     270     -       (Lichman, 2013)
shuttle     9    real     7     43500   14500   (King et al., 1995)
sat         36   real     6     4435    2000    (King et al., 1995)
segment     19   real     7     2310    -       (King et al., 1995)
dna         180  binary   3     2000    1186    (Michie et al., 1994)
splice      60   cat      2     1000    2175    (Michie et al., 1994)
codrna      8    mixed    2     59535   271617  (Uzilov et al., 2006)
letter      16   integer  26    20000   -       (Frey and Slate, 1991)
pendigits   16   integer  10    7494    3498    (Alimoglu, 1996)
usps        256  real     10    7291    2001    (Hull, 1994)
mnist       665  integer  10    60000   10000   (LeCun et al., 1998)
ijcnn1      22   real     2     49990   91701   (Feldkamp and Puskorius, 1998)
madelon     500  integer  2     2000    600     (Guyon et al., 2004)
seismic     50   real     2     78823   19705   (Duarte and Hu, 2004)
mushrooms   112  binary   2     8124    -       (Iba et al., 1988)
phishing    68   binary   2     11055   -       (Mohammad et al., 2014)
humidity    7    real     8     86400   -       (Mills, 2009)
The satellite ("sat") dataset is a satellite remote-sensing land classification problem. The attributes represent 3-by-3 segments of pixels in a Landsat image with the class corresponding to the type of land cover in the central pixel. The segmentation ("segment") dataset is also an image classification dataset consisting of 3-by-3 pixel sets from outdoor images.

The DNA dataset is concerned with classifying a 60 base-pair sequence of DNA into one of three values: an intron-exon boundary, an exon-intron boundary or neither of those two. That is, during protein creation, part of the sequence is spliced out, with the sections kept being the exons and those spliced out being the introns. There are two versions of it: one called "splice" with the original sequence of 4 nucleotide bases but only two classes and one called "dna" in which the features data has been reprocessed so that the 60 base values are transformed to 180 binary attributes but keeping the original three classes (Michie et al., 1994). Another dataset from the field of microbiology is the "codrna" dataset which deals with detection of non-coding RNA sequences (Uzilov et al., 2006).

There are four text-classification datasets: "letter", "pendigits", "usps" and "mnist". The "letter" dataset is a text-recognition problem concerned with classifying a character into one of the 26 letters of the alphabet based on processed attributes of the isolated character (Frey and Slate, 1991). The pendigits dataset is similar to the letter dataset except for classifying numbers instead of letters (Alimoglu, 1996). The "usps" dataset deals with classifying text for the purpose of mailing letters (Hull, 1994). The "mnist" dataset uses 28 by 28 pixel images to classify text into one of ten different characters (LeCun et al., 1998). Pixels that always take on the same value were removed.

Two of the datasets are machine-learning competition challenges. The "ijcnn1" dataset is from the International Joint Conference on Neural Networks neural network competition (Feldkamp and Puskorius, 1998) while the "madelon" dataset comes from the International Conference on Neural Information Processing Systems Feature Selection Challenge (Guyon et al., 2004).

The "seismic" dataset deals with vehicle classification from seismic data (Duarte and Hu, 2004). The "mushrooms" dataset classifies wild mushrooms into poisonous and non-poisonous types based on their physical characteristics (Iba et al., 1988). The "phishing" dataset uses characteristics of a web address to predict whether or not a website is being used for nefarious purposes (Mohammad et al., 2014).

The final dataset, the "humidity" dataset, comprises simulated satellite radiometer radiances across 7 different frequencies in the microwave range. Corresponding to each instance is a value for relative humidity at a single vertical level. These humidity values have been discretized into 8 ranges to convert it into a statistical classification problem. A full description of the genesis of this dataset as well as a rationale for treatment using statistical classification is contained in Mills (2009). The statistical classification methods discussed in this paper were originally devised specifically for this problem.

Most of the datasets have been supplied already divided into a "test" set and a "training" set. If this is the case, then it is noted in the summary in Table 1 and the data has been used as given, with the training set used for training and the test set used for testing.
If the data is provided all in one lump, then it was randomly divided into test and training sets with the division different for each of the ten numerical trials.

To provide the best idea of when the technique is effective and when it is not, results from all 17 datasets will be shown. All datasets were pre-processed in the same way: by taking the averages and standard deviations of each feature from the training data and subtracting the averages from both the test and training data and dividing by the standard deviations. Features that took on the same value in the training data were removed.

Figure 1: Support vectors for a pair of synthetic test classes.
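The pre-processing just described amounts to the following minimal sketch (Python/NumPy; function name is illustrative): statistics are computed from the training data only and then applied to both sets, and constant features are dropped.

import numpy as np

def standardize(train, test):
    """Subtract training means, divide by training standard deviations, drop constant features."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    keep = std > 0                                   # features that vary in the training data
    return ((train[:, keep] - mean[keep]) / std[keep],
            (test[:, keep] - mean[keep]) / std[keep])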
We use the pair of synthetic test classes defined in Mills (2011) to illustrate the difference between support vectors and border vectors and between border vectors derived from AGF and from a LIBSVM model. Figure 1 shows a realization of the two sample classes in red and blue, comprising 300 samples total, along with the support vectors derived from a LIBSVM model. The support vectors are a subset of the training samples and while they tend to cluster around the border, they do not define it. For reference, the border between the two classes is also shown. This has been derived from the border-classification method described in Section 2.3 using the mathematical definition of the classes, hence it represents the "true" border to within a very small numerical error.

The true border is also compared with those derived from AGF and LIBSVM probability estimates in Figure 2. The classes are again shown for reference.

Figure 2: Borders mapped by the border-classification method starting with probabilities from the class definitions, adaptive Gaussian filtering (AGF), and a support vector machine (SVM).

Figure 3: Classification accuracy and uncertainty coefficient for border-classification starting with probabilities from the class definitions, adaptive Gaussian filtering (AGF), and a support vector machine (SVM). Average of 20 trials.

Figure 4: Number of support vectors against number of training vectors for a pair of synthetic test classes. The fitted curve returns an exponent of 0.94 and a multiplication coefficient of 0.38.

While these borders contain several hundred samples for a clear view of where they are located using each method, in fact the method works well with surprisingly few samples. Figure 3 shows a plot of the skill versus the number of border samples, where
U.C. stands for uncertainty coefficient. Note that the scores saturate at only about 20 samples, meaning that for this problem at least, very fast classifications are possible.

Unlike support vectors, the number of border samples required is approximately independent of the number of training samples. In addition to skill as a function of border samples for both AGF- and SVM-trained border-classifiers, Figure 3 also shows results for a border classifier trained from the mathematical definition of the classes themselves. The skill scores of this latter curve do not level off significantly faster than the other two. So long as the complexity of the problem does not increase, adding new training samples does not increase the number of border samples required for maximum accuracy.

Figure 4 shows the number of support vectors versus the number of training samples. The fitted curve is approximately linear, with an exponent of 0.94 and a multiplication coefficient of 0.38. In other words, for this problem there will be approximately 38% as many support vectors as there are training vectors.

Figure 5: Classification accuracy and uncertainty coefficient for a support vector machine (SVM) trained with different numbers of samples. Error bars represent the standard deviation of 20 trials.

Of course it's possible to speed up an SVM by sub-sampling the training data or the resulting support vectors. In such a case, the sampling must be done carefully so as not to reduce the accuracy of the result. Figure 5 shows the effect on classification skill for the synthetic test classes when the number of training samples is reduced. Skill scores start to saturate at between 200 and 300 samples. By contrast, Figure 3 implies that you need only 20 border samples for good accuracy, so even with only 200 training samples you will still have improved efficiency by using the borders technique.

This suggests a simple scaling law. The number of training samples required for good accuracy, and hence the number of support vectors, should be proportional to the approximate volume occupied by the training data in the feature space: n ∝ V, where n is the minimum number of training vectors and V is the volume. Then the number of border vectors should be proportional to the volume raised to the power of one less than the dimension of the feature space over the dimension: n_b ∝ V^{(D−1)/D}. Putting it together, we can relate the two as follows:

n_b ∝ n^{(D−1)/D}    (33)

where n_b is the minimum number of border vectors required for good accuracy.

In other words, provided the class borders are not fractal (Ott, 1993), mapping only the border between classes should always be faster than techniques that map the entirety of the class locations. This includes kernel density methods such as SVM as well as similar methods such as learning vector quantization (LVQ) (Kohonen, 2000; Kohonen et al., 1995) that attempt to create an idealized representation of the classes through a set of "codebook" vectors.

To make this more concrete, Figure 6 plots the classification time versus the number of support vectors for a SVM while Figure 7 plots the classification time versus the number of border samples for a border classifier. Classification times are for a single test point.

Figure 6: Classification time for a SVM for a single test point versus number of support vectors.

Figure 7: Classification time for a border classifier for a single test point versus number of border samples.

Figure 8: Number of border samples versus number of support vectors for equal classification times.
Fitted straight lines are overlaid for each and the slope and intercept printed in the subtitle.

Figure 8 plots the number of border vectors versus the number of support vectors at the "break even" point: that is, where the classification time is the same for each method. This graph was simply derived from the fitted coefficients of the previous two graphs. It is somewhat optimistic since LIBSVM has a larger overhead than the border classifiers. This overhead would be less significant for larger problems, with the "rule of thumb" suggested by the slope that the number of border vectors should be less than three times the number of support vectors for a reasonable gain in efficiency.

Unfortunately the graph is not general: while the borders method scales linearly with the number of classes, in LIBSVM there is some shared calculation for multi-class problems. That is, some of the support vectors are shared between classes; moreover the number will be different for each problem. Model size comparisons between the two methods should ideally be between the total number of support vectors versus the total number of border samples.

Table 2: Summary of the parameters used in the numerical trials for each of the four methods: KNN (k-nearest-neighbours), AGF (adaptive Gaussian filtering), SVM (support vector machine) and ACC ("accelerated" SVM). If the number of trials has been starred, some operations received only a single trial either because processing times were too long or because the dataset came with test and training sets already separated. See text.

Dataset   trials   f     KNN: k   AGF: W   AGF: k   AGF: n_b   ACC: n_b   SVM: γ   SVM: C
heart     10       0.4   11       10       -        100        100        0.01     0.5

Four classification models were tested on each of the 17 datasets described in Section 3.3: k-nearest-neighbours (KNN), a borders model derived from adaptive Gaussian filters (AGF), a support vector machine (SVM) and an "accelerated" SVM trained from the SVM probability estimates (SVM-borders).

Table 3: Collation of results for numerical trials of the four different statistical classification methods over six different datasets. For each dataset the quantities listed are training time (s), classification time (s), accuracy and uncertainty coefficient (U.C.) for each of KNN, AGF, SVM and the accelerated SVM.
Table 4: Collation of results for the numerical trials, continued for the remaining datasets, with the same quantities and methods as Table 3.
The parameter, k, for AGF is the number of nearest neighbours used when computing the probabilities: while the order of the method, O(n), remains the same, nonetheless it can produce a significant speed improvement for large problems. The parameter, n_b, is the number of class borders used in both AGF- and SVM-borders. For SVM, a Gaussian kernel is used as in (24) where γ = 1/(2σ²) is a tunable parameter. C is a cost parameter added to reduce over-fitting: see Equations (17) and (20).

In order to get a confidence interval on the results, ten trials were performed for most of the datasets. In some cases, only a single trial was performed, either because the operation took too long or because of a pre-existing separation between test and training data which was taken "as is". Single trials are indicated through the absence of error bars, which are calculated from the standard deviation. f is the fraction of test data relative to the total number of samples.

The results are summarized in Tables 3 and 4, including training and test time for each method as well as skill scores. There are two skill scores, the first being simple accuracy or fraction of correct guesses while the second, called the uncertainty coefficient, is based on information entropy and is described below. The best values for each dataset are highlighted in bold. Note that KNN does not have a training phase but sometimes its classification phase is shorter than all the others' training phase, in which case none of the numbers are highlighted but rather the hyphen that's put in place of the KNN training time. The SVM-borders method is not a stand-alone method, thus its training time is never highlighted.

Interestingly, the heart dataset classifiers are so fast that they were not detected by the system clock. Even more interestingly, skill scores for SVM-borders are higher than those for SVM. This does not break any laws of probability, however, since the two scores are well within each others' error bars and the interval for borders is larger than that for SVM.

Table 5: Total number of support vectors versus total number of border samples.

Dataset    D    n_c   n       Total support   Total borders   Time (s) SVM   Time (s) accel.
phishing   68   2     6633    1440 ± 32       500             2.71           0.259
humidity   7    8     51840   37767           5600            194            5.13

It is important to evaluate a result based on skill scores that reliably reflect how well a given classifier is doing. Thus we will define the two scores used in this validation exercise since one in particular is not commonly seen in the literature even though it has several attractive features.

Let {η_ij} be the confusion matrix, that is, the number of test values for which the first classifier (the "truth") returns the ith class while the second classifier (the estimate) returns the jth class. Let n_test = Σ_i Σ_j η_ij be the total number of test points.

The accuracy is given by:

a = Σ_i η_ii / n_test    (34)

or simply the fraction of correct guesses.

The uncertainty coefficient is a more sophisticated measure based on the channel capacity (Shannon and Weaver, 1963). It has the advantage over simple accuracy in that it is not affected by the relative size of each class distribution. It is also not affected by consistent rearrangement of the class labels.

The entropy of the prior distribution is given by:

H_i = − Σ_i ( Σ_j η_ij / n_test ) log ( Σ_j η_ij / n_test )    (35)
    = − (1/n_test) Σ_i ( Σ_j η_ij ) [ log ( Σ_j η_ij ) − log n_test ]    (36)

while the entropy of the posterior distribution is given by:

H(i|j) = − Σ_i Σ_j ( η_ij / n_test ) log [ η_ij / ( Σ_i η_ij ) ]    (37)
       = − (1/n_test) [ Σ_i Σ_j η_ij log η_ij − Σ_j ( Σ_i η_ij ) log ( Σ_i η_ij ) ]    (38)

The uncertainty coefficient is defined in terms of the prior entropy, H_i, and the posterior entropy, H(i|j), as follows:

U(i|j) = [ H_i − H(i|j) ] / H_i    (39)

and tells us: for a given test classification, how many bits of information on average does the estimate supply of the true class value? (Press et al., 1992; Mills, 2011).
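The two scores of equations (34)-(39) can be computed from a confusion matrix with the following minimal sketch (Python/NumPy; names are illustrative). Rows of eta index the true class and columns the estimated class; empty cells are skipped in the entropy sums.

import numpy as np

def skill_scores(eta):
    """Accuracy (Eq. 34) and uncertainty coefficient (Eq. 39) from a confusion matrix."""
    n_test = eta.sum()
    acc = np.trace(eta) / n_test
    p_true = eta.sum(axis=1) / n_test                    # prior over true classes
    h_prior = -np.sum(p_true[p_true > 0] * np.log(p_true[p_true > 0]))
    p_joint = eta / n_test
    p_est = eta.sum(axis=0) / n_test                     # marginal over estimated classes
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(eta > 0, p_joint * np.log(p_joint / p_est[np.newaxis, :]), 0.0)
    h_post = -np.sum(terms)                              # posterior entropy H(i|j)
    return acc, (h_prior - h_post) / h_prior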
Thirteen of the classification problems show a significant speed increase with the application of the borders technique, with heart, segment, dna and pendigits being the exceptions. Table 5 is an attempt to get a handle on the relative time complexity of the two methods and lists all the relevant variables: the number of features, the number of classes, the total number of training samples, and the total support for SVM and total number of border samples for the borders method, compared with the resulting classification time for the two methods.

Table 6: Results from SVM trials after sub-sampling to match SVM-borders speed if possible, otherwise skill is matched. Columns: dataset, samples, S.V. (support vectors), train (s), test (s), accuracy, U.C.
The two most relevant variables here are the number of support vectors versus the number of border vectors. In order to get a successful speed increase, the former should be larger than the latter, but as is apparent from some problems such as sat, usps, and mnist, even having slightly more border samples can sometimes produce a significant, although modest, improvement.

All increases in speed, however, come at the cost of accuracy. The question is, is the speed increase worth the decrease in skill? To test this, we sub-sample the datasets and then re-apply the SVM training until the speed of the two methods, SVM and SVM-borders, matches. In some cases, SVM could not be made fast enough by sub-sampling, in which case skill was matched instead.

It might seem more expedient to directly sub-sample the support vectors themselves rather than the training data. This, however, was found not to work and generated a precipitous drop in accuracy. Since the sparse coefficient set, ~w, is found through simultaneous optimization, the support vectors turn out to be interdependent.

Depending on how much the dataset is reduced, sub-sampling should be done with at least some care. On one hand, a more sophisticated sub-sampling technique might be considered a method on its own, comparable with the borders technique, but also likely requiring multiple training phases using the original technique, thus making it significantly slower. On the other hand, at minimum we should consider the relative size of each class distribution. If there are roughly the same number of samples in each class, then for small sub-samples the relative numbers should be kept constant. The shuttle dataset, however, has very uneven class numbers so it was sub-sampled differently in order to ensure that the smallest classes retain some membership. Let n_i be the number of samples of the ith class. Then the sub-sampled numbers are given by:

n'_i = α(n_i) n_i    (40)

The form of α used for the shuttle dataset was:

α(n) = C n^{−ζ}    (41)

where C = n_1^ζ, 0 < ζ < 1, and n_1 is the number of samples in the smallest class. To understand how this functional form was chosen, please see Appendix A.

The results of the sub-sampling exercise are shown in Table 6. This gives us a clearer understanding of whether or not and when SVM acceleration through borders sampling is effective. In some trials the speed increase is enough that even the AGF-borders method will provide an improvement over a sub-sampled SVM model, the results for AGF being wholly disappointing. And in a few trials, the speed increase is so great that SVM cannot match the borders method even through sub-sampling.

AGF-borders never beats SVM in skill and rarely even equals KNN, even though it's essentially the same method but using a more sophisticated kernel and with the borders training applied. Nonetheless, there is good reason to develop the method further: training time varies with the number of training samples (O(n)) rather than with its square (O(n²)). This is apparent for the largest datasets with more than a few thousand training samples, at which point the AGF-borders method starts to train faster than SVM.

There are at least three major sources of error for the AGF-borders technique. First, the kernel method upon which it is based is only first-order accurate. In particular, this will affect gradient estimates, which are semi-analytic: see Equation (28).
Second, the borders method provides only limited sampling of the discrimination border and this sampling is not strongly optimized. The sampling method, using pairs of training points of opposite class, will tend to favour regions of high density; however, directly optimizing for classification skill would be the ideal solution. Finally, the probability estimates extrapolate from only a single point. All these errors will tend to compound, especially after converting to multiple classes. Two of these error sources also affect SVM-borders but don't seem to have a large effect on the final results.

One potential improvement is to recalibrate the probability estimates as done with the LIBSVM decision in equation (29) (Platt, 1999; Lin et al., 2007). There are many other methods of recalibrating classification probability estimates: see for instance Niculescu-Mizil and Caruana (2005); Zadrozny and Elkan (2001). Initial trials have shown some success. Recalibrating results for the splice dataset by a simple shift of the threshold value for the decision function, for instance, increases the uncertainty coefficient to 0.43 (accuracy=0.87) for AGF-borders and 0.48 (accuracy=0.88) for SVM-borders. This simplest method of recalibration is built in to the libAGF software and was used to good effect in Mills (2009) and Mills (2011). SVM results for the same problem were already well enough calibrated that no significant improvement could be made by the same technique. Other problems were better calibrated, even for the borders classifiers.

The primary goal of this work was to improve the classification time of a SVM using a simple, piecewise linear classifier which we call the borders classifier. The outcome for each of the 17 datasets is summarized in Table 7. When trained from the SVM, the method succeeded for eight of the datasets and, by the same criteria, when trained from the simpler pointwise estimator ("AGF"), as compared with SVM, it succeeded for six of the datasets if we include the calibrated splice results. Not a perfect score but certainly worthwhile to try for operational retrievals where time performance is critical, for instance classifying large amounts of satellite data in real time. This is especially so in light of the high performance ratios for some of the problems: the humidity dataset is sped up by almost 20 times, for instance, with even higher factors for some of the binary datasets.

It's worthwhile to note where the algorithm is most likely to succeed and conversely where it might fail. One of the most successful trials was for the humidity dataset which produced one of the largest time improvements combined with relatively little loss of accuracy. This makes sense since the method was devised specifically for this problem and the humidity dataset epitomizes the characteristics for which the technique is most effective.

Since it assumes that the difference in conditional probabilities is a smooth and continuous function, the borders method tends to work poorly with integer or categorical data as well as problems with sharply defined, non-overlapping classes. Indeed, two of the problems where it took the biggest hit in accuracy, dna and splice, use binary and categorical data, respectively.

Table 7: Summary of results for all 17 datasets including a verdict on the success or failure of borders classification to speed up SVM.

heart: breaks even
shuttle: fails
sat: fails
segment: fails
dna: fails
splice: succeeds
splice*: succeeds*
codrna: succeeds
letter: fails
pendigits: fails
usps: ambiguous; must be re-done
mnist: fails
ijcnn1: succeeds
madelon: succeeds
seismic: succeeds
mushrooms: succeeds
phishing: succeeds
humidity: succeeds

* SVM-borders classifier has been calibrated.
Also, since there is no redundancy in calculations for multiple classes, whereas in SVM there is considerable redundancy, problems with a large number of classes should also be avoided. This can be mitigated by using a multi-class classification method requiring fewer binary classifiers, such as one-versus-the-rest with O(n_c) performance or a decision tree with O(log n_c) performance, rather than one-versus-one with its O(n_c²) time complexity.

The most important characteristic for success with the borders classification method is a large number of training samples used to train a SVM for maximum accuracy. This also implies a large number of support vectors, making the SVM slow. Choosing an appropriate number of border samples allows one to trade off accuracy for speed, with diminishing returns for larger numbers of border samples. The borders method, unlike SVM, also has a straightforward interpretation: the locations of the samples represent a hyper-surface that divides the two classes and their gradients are the normals to this surface. In this regard it is somewhat similar to rule-based classifiers such as decision trees.

There are many directions for future work. An obvious refinement would be to distribute the border samples less randomly and cluster them where they are most needed. As it is, the method of choosing by selecting random pairs of opposite classes will tend to distribute them in areas of high density. The current, random method was found to work well enough. Another potential improvement would be to position the border samples so as to directly minimize classification error. This need not be done all at once as in some of the methods mentioned in the Introduction, but rather point-by-point to keep the training relatively fast. A first guess could be found through a kernel method and then each point shifted along the normal.

Piecewise linear statistical classification methods are simple, powerful and fast and we think they should receive more attention. For certain types of datasets, particularly those with continuum features data, smooth probability functions (typically overlapping classes) and a large number of samples, the borders classification algorithm is an effective method of improving the classification time of kernel methods. Because it is not a stand-alone method, but requires probability estimates, it can achieve a fast training time since it is not solving a global optimization problem, yet still maintain reasonable accuracy. While it may not be the first choice for cracking "hard" problems, it is ideal for workaday problems, such as operational retrievals, for which speed is critical.

A Sub-sampling
Let n_i be the number of samples of the ith class, such that n_i ≥ n_{i−1}. Let 0 ≤ α(n) ≤ 1 and:

n'_i = α(n_i) n_i

We wish to retain the rank ordering of the class sizes:

α(n_i) n_i ≥ α(n_{i−1}) n_{i−1}

while ensuring that the smallest classes have some minimum representation:

α(n_i) ≤ α(n_{i−1})    (42)

Thus:

d/dn [n α(n)] = α(n) + n dα/dn ≥ 0
dα/dn ≥ − α(n)/n    (43)

The simplest means of ensuring that both (42) and (43) are fulfilled is to multiply the right side of (43) by a constant, 0 ≤ ζ ≤ 1, and equate it with the left side:

dα/dn = − ζ α(n)/n

Integrating:

α(n) = C n^{−ζ}

The parameter, ζ, is set such that:

Σ_i α(n_i) n_i = f Σ_i n_i

where 0 < f < 1 is the overall fraction of data to be retained. With C = n_1^ζ, where n_1 is the number of samples in the smallest class, this becomes:

Σ_i n_1^ζ n_i^{1−ζ} − f Σ_i n_i = 0

which can be solved for ζ with a one-dimensional root-finding algorithm.
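The following is a minimal sketch (Python/NumPy/SciPy; names are illustrative) of this sub-sampling scheme: C = n_1^ζ keeps the smallest class intact, and ζ is found by one-dimensional root finding so that the total retained fraction equals f. The bracketing interval assumes f is large enough that a root exists in [0, 1].

import numpy as np
from scipy.optimize import brentq

def subsample_fractions(class_sizes, f):
    """Per-class retention fractions alpha(n_i) = C n_i^(-zeta) with C = n_min^zeta."""
    n = np.asarray(class_sizes, dtype=float)
    n_min = n.min()
    total = n.sum()
    def excess(zeta):
        return np.sum(n_min ** zeta * n ** (1.0 - zeta)) - f * total
    zeta = brentq(excess, 0.0, 1.0)          # assumes a sign change in [0, 1]
    return (n_min ** zeta) * n ** (-zeta)    # alpha(n_i) for each class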
References

Alimoglu, F. (1996). Combining multiple classifiers for pen-based handwritten digit recognition. Master's thesis, Bogazici University.

Bagirov, A. M. (1999). Derivative-free methods for unconstrained nonsmooth optimization and its numerical analysis.
Investigacao Operacional, 19:75–93.

Bagirov, A. M. (2005). Max-min separability.
Optimization Methods andSoftware , 20(2-3):277–296.Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vec-tor machines.
ACM Transactions on Intelligent Systems and Technology ,2(3):27:1–27:27.Chen, F., Yu, H., Yao, J., and Hu, R. (2015). Robust sparse kernel densityestimation by inducing randomness.
Pattern Analysis and Applications, 18:367.

Duarte, M. F. and Hu, Y. H. (2004). Vehicle classification in distributed sensor networks.
Journal of Parallel and Distributed Computing, 64:826–838.

Feldkamp, L. and Puskorius, G. V. (1998). A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification.
Proceedings of the IEEE , 86(11):2259–2277.Frey, P. and Slate, D. (1991). Letter recognition using holland-style adaptiveclassifiers.
Machine Learning , 6(2):161–182.Gai, K. and Zhang, C. (2010). Learning Discriminative Piecewise LinearModels with Boundary Points. In
Proceedings of the Twenty-Fourth AAAIConference on Artificial Intelligence , pages 444–450. Association for theAdvancement of Artificial Intelligence.Guyon, I., Gunn, S., Hur, A. B., and Dror, G. (2004). Results analysis ofthe NIPS 2003 feature selection challenge. In
Proceedings of the 17th In-ternational Conference on Neural Information Processing Systems , pages545–552, Vancouver. MIT Press.Herman, G. T. and Yeung, K. T. D. (1992). On piecewise-linear classifica-tion.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):782–786.

Hsu, C.-W. and Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks ,13(2):415–425.Huang, X., Mehrkanoon, S., and Suykens, J. A. K. (2013). Support vec-tor machines with piecewise linear feature mapping.
Neurocomputing ,117(6):118–127.Hull, J. J. (1994). A database for handwritten text recognition re-search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554.

Iba, W., Wogulis, J., and Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. In
Proceedings of the Fifth International Conference on Machine Learning, pages 73–79.

King, R. D., Feng, C., and Sutherland, A. (1995). Statlog: Comparison of Classification Algorithms on Large Real-World Problems.
Applied ArtificialIntelligence , 9(3):289–333.Kohonen, T. (2000).
Self-Organizing Maps . Springer-Verlag, 3rd edition.Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., and Torkkola, K.(1995).
LVQ PAK: The Learning Vector Quantization Package, Version3.1 .Kostin, A. (2006). A simple and fast multi-class piecewise linear patternclassifier.
Pattern Recognition , 39:1949–1962.LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE ,86(11):2278–2324.Lee, T. and Richards, J. A. (1984). Piecewise linear classification usingseniority logic committee methods with application to remote sensing.
Pattern Recognition , 17(4):453–464.Lee, T. and Richards, J. A. (1985). A low cost classifier for multitemporalapplications.
International Journal of Remote Sensing , 6(8):1405–1417.Lichman, M. (2013). UCI machine learning repository.Lin, H.-T., Lin, C.-J., and Weng, R. C. (2007). A note on Platt’s probabilis-tic outputs for support vector machines.
Machine Learning, 68(3):267–276.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C., editors (1994).
MachineLearning, Neural and Statistical Classification . Ellis Horwood Series inArtificial Intelligence. Prentice Hall, Upper Saddle River, NJ. Availableonline at: .Mills, P. (2009). Isoline retrieval: An optimal method for validation ofadvected contours.
Computers & Geosciences , 35(11):2020–2031.Mills, P. (2011). Efficient statistical classification of satellite measurements.
International Journal of Remote Sensing, 32(21):6109–6132.

Mohammad, R., Thabtah, F. A., and McCluskey, T. (2014). Predicting phishing websites based on self-structuring neural network.
Neural Computing and Applications, 25(2):443–458.

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms.
IEEE Transactionson Neural Networks , 12(2):181–201.Niculescu-Mizil, A. and Caruana, R. A. (2005). Obtaining calibrated prob-abilities from boosting. In
Proceedings of the Twenty-First Conference onUncertainty in Artificial Intelligence , pages 413–420.Osborne, M. (1977). Seniority Logic: A Logic of a Committee Machine.
IEEE Transactions on Computers , 26(12):1302–1306.Ott, E. (1993).
Chaos in Dynamical Systems . Cambridge University Press.Pavlidis, N. G., Hofmeyr, D. P., and Tasoulis, S. K. (2016). MinimumDensity Hyperplanes.
Journal of Machine Learning Research , 17(156):1–33.Platt, J. (1999). Probabilistic outputs for support vector machines and com-parison to regularized likelihood methods. In
Advances in Large MarginClassifiers . MIT Press.Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992).
Numerical Recipes in C . Cambridge University Press, 2nd edition.Shannon, C. E. and Weaver, W. (1963).
The Mathematical Theory of Com-munication . University of Illinois Press.Sklansky, J. and Michelotti, L. (1980). Locally trained piecewise linear clas-sifiers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(2):101–111.

Tenmoto, H., Kudo, M., and Shimbo, M. (1998). Piecewise linear classifiers with an appropriate number of hyperplanes.
Pattern Recognition ,31(11):1627–1634.Terrell, D. G. and Scott, D. W. (1992). Variable kernel density estimation.
Annals of Statistics , 20:1236–1265.Uzilov, A. V., Keegan, J. M., and Mathews, D. H. (2006). Detection ofnon-coding rnas on the basis of predicted secondary structure formationfree energy change.
BMC Bioinformatics , 7:173.Wang, J. and Saligrama, V. (2013). Locally-Linear Learning Machines(L3M). In
Proceedings of Machine Learning Research , volume 29, pages451–466.Webb, D. (2012).
Efficient Piecewise Linear Classifiers and Applications .PhD thesis, University of Ballarat, Victoria, Australia.Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probability Estimatesfor Multi-class Classification by Pairwise Coupling.
Journal of MachineLearning Research , 5:975–1005.Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability esti-mates from decision trees and naive bayesian classifiers. In