The Perceptron with Dynamic Margin
Constantinos Panagiotakopoulos and Petroula Tsampouka
Physics Division, School of Technology, Aristotle University of Thessaloniki, Greece
[email protected], [email protected]
Abstract.
The classical perceptron rule provides a varying upper bound on the maximum margin, namely the length of the current weight vector divided by the total number of updates up to that time. Requiring that the perceptron updates its internal state whenever the normalized margin of a pattern is found not to exceed a certain fraction of this dynamic upper bound, we construct a new approximate maximum margin classifier called the perceptron with dynamic margin (PDM). We demonstrate that PDM converges in a finite number of steps and derive an upper bound on them. We also compare experimentally PDM with other perceptron-like algorithms and support vector machines on hard margin tasks involving linear kernels which are equivalent to 2-norm soft margin.
Keywords:
Online learning, classification, maximum margin.
1 Introduction

It is a common belief that learning machines able to produce solution hyperplanes with large margins exhibit greater generalization ability [21], and this justifies the enormous interest in Support Vector Machines (SVMs) [21, 2]. Typically, SVMs obtain large margin solutions by solving a constrained quadratic optimization problem using dual variables. In their native form, however, efficient implementation is hindered by the quadratic dependence of their memory requirements on the number of training examples, a fact which renders prohibitive the processing of large datasets. To overcome this problem, decomposition methods [15, 6] were developed that apply optimization only to a subset of the training set. Although such methods led to improved convergence rates, in practice their superlinear dependence on the number of examples, which can be even cubic, can still lead to excessive runtimes when large datasets are processed. Recently, the so-called linear SVMs [7, 8, 13] made their appearance. They take advantage of linear kernels in order to allow parts of them to be written in primal notation and were shown to outperform decomposition SVMs when dealing with massive datasets.

The above considerations motivated research in alternative large margin classifiers naturally formulated in primal space long before the advent of linear SVMs. Such algorithms are mostly based on the perceptron [16, 12], the simplest online learning algorithm for binary linear classification. (The conversion of online algorithms to the batch setting is done by cycling repeatedly through the dataset and using the last hypothesis for prediction.) Like the perceptron, they focus on the primal problem by updating a weight vector, which represents at each step the current state of the algorithm, whenever a data point presented to them satisfies a specific condition. It is the ability of such algorithms to process one example at a time that allows them to spare time and memory resources and consequently makes them able to handle large datasets. The first algorithm of that kind is the perceptron with margin [3], which is much older than SVMs. It is an immediate extension of the perceptron which provably achieves solutions with margin only up to 1/2 of the maximum margin.

The varying upper bound $\|a_t\|/t$ on the maximum margin, with $\|a_t\|$ being the length of the weight vector and $t$ the number of updates, that comes as an immediate consequence of the perceptron update rule, is very accurate and tends to improve as the algorithm achieves larger margins. In the present work we replace the fixed target margin value with a fraction $1-\epsilon$ of this varying upper bound on the maximum margin. The hope is that as the algorithm keeps updating its state the upper bound will keep approaching the maximum margin and convergence to a solution with the desired accuracy $\epsilon$ will eventually occur. Thus, the resulting algorithm may be regarded as a realizable implementation of the perceptron with fixed margin condition.

The rest of this paper is organized as follows. Section 2 contains some preliminaries and a motivation of the algorithm based on a qualitative analysis. In Sect. 3 we give a formal theoretical analysis. Section 4 is devoted to implementational issues. Section 5 contains our experimental results while Sect. 6 our conclusions.

2 Preliminaries and Motivation of the Algorithm

Let us consider a linearly separable training set $\{(x_k, l_k)\}_{k=1}^m$, with vectors $x_k \in \mathbb{R}^d$ and labels $l_k \in \{+1, -1\}$. This training set may either be the original dataset or the result of a mapping into a feature space of higher dimensionality [21, 2]. Actually, there is a very well-known construction [4] making linear separability always possible, which amounts to the adoption of the 2-norm soft margin.
By placing $x_k$ in the same position at a distance $\rho$ in an additional dimension, i.e. by extending $x_k$ to $[x_k, \rho]$, we construct an embedding of our data into the so-called augmented space [3]. This way, we construct hyperplanes possessing bias in the non-augmented feature space. Following the augmentation, a reflection with respect to the origin of the negatively labeled patterns is performed by multiplying every pattern with its label. This allows for a uniform treatment of both categories of patterns. Also, $R \equiv \max_k \|y_k\|$ with $y_k \equiv [l_k x_k, l_k \rho]$ the $k$th augmented and reflected pattern. Obviously, $R \ge \rho$.

The relation characterizing optimally correct classification of the training patterns $y_k$ by a weight vector $u$ of unit norm in the augmented space is
$$u \cdot y_k \ge \gamma_d \equiv \max_{u':\, \|u'\| = 1}\; \min_i \{u' \cdot y_i\} \quad \forall k. \qquad (1)$$
We shall refer to $\gamma_d$ as the maximum directional margin. It coincides with the maximum margin in the augmented space with respect to hyperplanes passing through the origin. For the maximum directional margin $\gamma_d$ and the maximum geometric margin $\gamma$ in the non-augmented feature space, it holds that $1 \le \gamma/\gamma_d \le R/\rho$. As $\rho \to \infty$, $R/\rho \to 1$ and $\gamma_d \to \gamma$ [17, 18].

We consider algorithms in which the augmented weight vector $a_t$ is initially set to zero, i.e. $a_0 = 0$, and is updated according to the classical perceptron rule
$$a_{t+1} = a_t + y_k \qquad (2)$$
each time an appropriate misclassification condition is satisfied by a training pattern $y_k$. Taking the inner product of (2) with the optimal direction $u$ and using (1) we get $u \cdot a_{t+1} - u \cdot a_t = u \cdot y_k \ge \gamma_d$, a repeated application of which gives [12]
$$\|a_t\| \ge u \cdot a_t \ge \gamma_d t. \qquad (3)$$
From (3) we readily obtain
$$\gamma_d \le \frac{\|a_t\|}{t} \qquad (4)$$
provided $t > 0$. Notice that the above upper bound on the maximum directional margin $\gamma_d$ is an immediate consequence of the classical perceptron rule and holds independently of the misclassification condition.

It would be very desirable that $\|a_t\|/t$ approach $\gamma_d$ as $t$ increases, since this would provide an after-run estimate of the accuracy achieved by an algorithm employing the classical perceptron update. More specifically, with $\gamma'_d$ being the directional margin achieved upon convergence of the algorithm in $t_c$ updates, it holds that
$$\frac{\gamma_d - \gamma'_d}{\gamma_d} \le 1 - \gamma'_d\,\frac{t_c}{\|a_{t_c}\|}. \qquad (5)$$

In order to understand the mechanism by which $\|a_t\|/t$ evolves, we consider the difference between two consecutive values of $\|a_t\|^2/t^2$, which may be shown to be given by the relation
$$\frac{\|a_t\|^2}{t^2} - \frac{\|a_{t+1}\|^2}{(t+1)^2} = \frac{1}{t(t+1)}\left\{\left(\frac{\|a_t\|^2}{t} - a_t \cdot y_k\right) + \left(\frac{\|a_{t+1}\|^2}{t+1} - a_{t+1} \cdot y_k\right)\right\}. \qquad (6)$$
Let us assume that satisfaction of the misclassification condition by a pattern $y_k$ has as a consequence that $\|a_t\|^2/t > a_t \cdot y_k$ (i.e., the normalized margin $u_t \cdot y_k$ of $y_k$, with $u_t \equiv a_t/\|a_t\|$, is smaller than the upper bound (4) on $\gamma_d$). Let us further assume that after the update has taken place $y_k$ still satisfies the misclassification condition and therefore $\|a_{t+1}\|^2/(t+1) > a_{t+1} \cdot y_k$. Then, the r.h.s. of (6) is positive and $\|a_t\|/t$ decreases as a result of the update. In the event, instead, that the update leads to violation of the misclassification condition, $\|a_{t+1}\|^2/(t+1)$ is not necessarily larger than $a_{t+1} \cdot y_k$ and $\|a_t\|/t$ may not decrease as a result of the update. We expect that statistically, at least in the early stages of the algorithm, most updates do not lead to correctly classified patterns (i.e., patterns which violate the misclassification condition) and as a consequence $\|a_t\|/t$ will have the tendency to decrease. Obviously, the rate at which this will take place depends on the size of the difference $\|a_t\|^2/t - a_t \cdot y_k$ which, in turn, depends on the misclassification condition.
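The text only quotes (6); the short verification below is ours. Multiply both sides by $t(t+1)$ and use the update rule (2), which gives $a_{t+1} \cdot y_k = a_t \cdot y_k + \|y_k\|^2$ and hence $\|a_{t+1}\|^2 - \|a_t\|^2 = 2a_t \cdot y_k + \|y_k\|^2 = a_t \cdot y_k + a_{t+1} \cdot y_k$:

```latex
\begin{align*}
t(t+1)\left[\frac{\|a_t\|^2}{t^2}-\frac{\|a_{t+1}\|^2}{(t+1)^2}\right]
  &= \frac{t+1}{t}\,\|a_t\|^2-\frac{t}{t+1}\,\|a_{t+1}\|^2\\
  &= \frac{\|a_t\|^2}{t}+\frac{\|a_{t+1}\|^2}{t+1}
     -\left(\|a_{t+1}\|^2-\|a_t\|^2\right)\\
  &= \left(\frac{\|a_t\|^2}{t}-a_t\cdot y_k\right)
     +\left(\frac{\|a_{t+1}\|^2}{t+1}-a_{t+1}\cdot y_k\right).
\end{align*}
```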
If we are interested in obtaining solutions possessing margin, the most natural choice of misclassification condition is the fixed (normalized) margin condition
$$a_t \cdot y_k \le (1-\epsilon)\,\gamma_d\,\|a_t\| \qquad (7)$$
with the accuracy parameter $\epsilon$ satisfying $0 < \epsilon \le 1$. This is an example of a misclassification condition which, if it is satisfied, ensures that $\|a_t\|^2/t > a_t \cdot y_k$. Moreover, by making use of (4) and (7) it may easily be shown that $\|a_{t+1}\|^2/(t+1) \ge a_{t+1} \cdot y_k$ for $t \ge \epsilon^{-1}R^2/\gamma_d^2$. Thus, after at most $\epsilon^{-1}R^2/\gamma_d^2$ updates $\|a_t\|/t$ decreases monotonically. The perceptron algorithm with fixed margin condition (PFM) is known to converge in a finite number of updates to an $\epsilon$-accurate approximation of the maximum directional margin hyperplane [17, 18, 1]. Although it appears that PFM demands exact knowledge of the value of $\gamma_d$, we notice that only the value of $\beta \equiv (1-\epsilon)\gamma_d$, which is the quantity entering (7), needs to be set and not the values of $\epsilon$ and $\gamma_d$ separately. That is why the after-run estimate (5) is useful in connection with the algorithm in question. Nevertheless, in order to make sure that $\beta < \gamma_d$, a priori knowledge of a fairly good lower bound on $\gamma_d$ is required and this is an obvious defect of PFM.
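The after-run estimate (5) involves only quantities available once the algorithm has stopped; a one-line helper (ours, with hypothetical argument names):

```python
def after_run_accuracy_bound(norm_a, t_c, gamma_prime):
    """Upper bound (5) on (gamma_d - gamma'_d) / gamma_d, computed from the
    final weight vector length ||a_{t_c}||, the update count t_c and the
    directional margin gamma'_d achieved upon convergence."""
    return 1.0 - gamma_prime * t_c / norm_a
```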
The above difficulty associated with the fixed margin condition may be remedied if the unknown $\gamma_d$ is replaced for $t > 0$ by its varying upper bound $\|a_t\|/t$, leading to the condition
$$a_t \cdot y_k \le (1-\epsilon)\,\frac{\|a_t\|^2}{t}. \qquad (8)$$
Condition (8) ensures that $\|a_t\|^2/t - a_t \cdot y_k \ge \epsilon\|a_t\|^2/t > 0$. Moreover, as in the case of the fixed margin condition, $\|a_{t+1}\|^2/(t+1) - a_{t+1} \cdot y_k \ge 0$ for $t \ge \epsilon^{-1}R^2/\gamma_d^2$. As a result, after at most $\epsilon^{-1}R^2/\gamma_d^2$ updates the r.h.s. of (6) is bounded from below by $\epsilon\|a_t\|^2/(t^2(t+1)) \ge \epsilon\gamma_d^2/(t+1)$ and $\|a_t\|/t$ decreases monotonically and sufficiently fast. Thus, we expect that $\|a_t\|/t$ will eventually approach $\gamma_d$ closely enough, thereby allowing for convergence of the algorithm to an $\epsilon$-accurate approximation of the maximum directional margin hyperplane. It is also apparent that the decrease of $\|a_t\|/t$ will be faster for larger values of $\epsilon$.
The Perceptron with Dynamic Margin

Input: a linearly separable augmented dataset $S = (y_1, \ldots, y_k, \ldots, y_m)$ with reflection assumed
Fix: $\epsilon$
Define: $q_k = \|y_k\|^2$, $\bar\epsilon = 1 - \epsilon$
Initialize: $t = 0$, $a_0 = 0$, $\ell_0 = 0$, $\theta_0 = 0$
repeat
  for $k = 1$ to $m$ do
    $p_{tk} = a_t \cdot y_k$
    if $p_{tk} \le \theta_t$ then
      $a_{t+1} = a_t + y_k$
      $\ell_{t+1} = \ell_t + 2p_{tk} + q_k$
      $t \leftarrow t + 1$
      $\theta_t = \bar\epsilon\,\ell_t/t$
until no update made within the for loop

The perceptron algorithm employing the misclassification condition (8) (with its threshold set to 0 for $t = 0$), which may be regarded as originating from (7) with $\gamma_d$ replaced for $t > 0$ by $\|a_t\|/t$, will be named the perceptron with dynamic margin (PDM). Notice that the auxiliary variable $\ell_t$ maintains $\|a_t\|^2$ incrementally, so that the threshold $\theta_t = (1-\epsilon)\|a_t\|^2/t$ of condition (8) is available at the cost of a single inner product per pattern.
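The pseudocode above translates directly into the following minimal Python sketch (the sketch, the function name and the `max_epochs` safety cap are ours and not part of the paper, whose actual implementation is the C++ code with active sets and multiple updates described in Sect. 4):

```python
import numpy as np

def pdm(Y, eps, max_epochs=1000):
    """Perceptron with Dynamic Margin (PDM).

    Y   : (m, d) array of augmented, reflected patterns y_k = l_k * [x_k, rho].
    eps : accuracy parameter in (0, 1].
    Returns the augmented weight vector a and the number of updates t.
    """
    m, d = Y.shape
    q = (Y ** 2).sum(axis=1)        # q_k = ||y_k||^2, precomputed
    a = np.zeros(d)                 # a_0 = 0
    ell = 0.0                       # ell_t = ||a_t||^2, maintained incrementally
    t = 0
    theta = 0.0                     # threshold theta_t, equal to 0 while t = 0
    for _ in range(max_epochs):
        updated = False
        for k in range(m):
            p = a @ Y[k]                       # p_tk = a_t . y_k
            if p <= theta:                     # misclassification condition (8)
                a += Y[k]                      # classical perceptron update (2)
                ell += 2.0 * p + q[k]          # ||a_{t+1}||^2 = ||a_t||^2 + 2p + q_k
                t += 1
                theta = (1.0 - eps) * ell / t  # dynamic threshold (1-eps)||a_t||^2/t
                updated = True
        if not updated:             # a full pass without updates: convergence
            return a, t
    return a, t                     # safety cap reached (not in the original)
```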
3 Theoretical Analysis

From the discussion that led to the formulation of PDM it is apparent that if the algorithm converges it will achieve by construction a solution possessing directional margin at least as large as $(1-\epsilon)\gamma_d$. (We remind the reader that convergence assumes violation of the misclassification condition (8) by all patterns. In addition, (4) holds.) The same obviously applies to PFM. Thus, for both algorithms it only remains to be demonstrated that they converge in a finite number of steps. This has already been shown for PFM [17, 18, 1], but no general $\epsilon$-dependent bound in closed form has been derived. Our purpose in this section is to demonstrate convergence of PDM and provide explicit bounds for both algorithms.

Before we proceed with our analysis we will need the following result.

Lemma 1. Let the variable $t \ge e^{-C}$ satisfy the inequality
$$t < \delta(1 + C + \ln t), \qquad (9)$$
where $\delta$, $C$ are constants and $\delta > e^{-C}$. Then
$$t \le t_0 \equiv (1 + e^{-1})\,\delta\,\left(C + \ln((1+e)\delta)\right). \qquad (10)$$

Proof. If $t \ge e^{-C}$ then $1 + C + \ln t \ge 1 > 0$ and (9) can be written as $f(t) \equiv t/(1 + C + \ln t) - \delta < 0$. For the function $f(t)$ defined in the interval $[e^{-C}, +\infty)$ it holds that $f(e^{-C}) = e^{-C} - \delta < 0$ and $df/dt = (C + \ln t)/(1 + C + \ln t)^2 > 0$ for $t > e^{-C}$. Stated differently, $f(t)$ starts from negative values at $t = e^{-C}$ and increases monotonically. Therefore, if $f(t_0) \ge 0$ then $t_0$ is an upper bound of all $t$ for which $f(t) < 0$. Indeed, it is not difficult to verify that $t_0 > \delta > e^{-C}$ and that, with $A \equiv e^C(1+e)\delta > 1$, the requirement $f(t_0) \ge 0$ amounts to $e^{-1}\ln A \ge \ln\ln A$, which holds by virtue of the inequality $\ln x/x \le e^{-1}$. ⊓⊔
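A quick numerical check (ours) that the closed form (10) indeed dominates every $t$ satisfying (9):

```python
import math

def lemma1_t0(delta, C):
    """t0 of (10): (1 + 1/e) * delta * (C + ln((1 + e) * delta))."""
    return (1.0 + math.exp(-1.0)) * delta * (C + math.log((1.0 + math.e) * delta))

# f(t0) >= 0 is equivalent to t0 >= delta * (1 + C + ln t0); check a few cases
# with delta > exp(-C), as the lemma requires.
for delta, C in [(2.0, 1.0), (10.0, 0.5), (100.0, 3.0)]:
    t0 = lemma1_t0(delta, C)
    assert t0 >= delta * (1.0 + C + math.log(t0))
```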
Now we are ready to derive an upper bound on the number of updates of PFM.

Theorem 1. The number $t$ of updates of the perceptron algorithm with fixed margin condition satisfies the bound
$$t \le \frac{1 + e^{-1}}{2\epsilon}\frac{R^2}{\gamma_d^2}\left\{\frac{4\gamma_d}{R}\left(1 - \frac{\gamma_d}{R}(1-\epsilon)\right) + \ln\left(\frac{1+e}{\epsilon}\frac{R}{\gamma_d}\left(1 - \frac{\gamma_d}{R}(1-\epsilon)\right)\right)\right\}.$$

Proof.
From (2) and (7) we get
$$\|a_{t+1}\|^2 = \|a_t\|^2 + \|y_k\|^2 + 2a_t \cdot y_k \le \|a_t\|^2\left(1 + \frac{R^2}{\|a_t\|^2} + \frac{2(1-\epsilon)\gamma_d}{\|a_t\|}\right).$$
Then, taking the square root and using the inequality $\sqrt{1+x} \le 1 + x/2$ we get
$$\|a_{t+1}\| \le \|a_t\|\left(1 + \frac{R^2}{\|a_t\|^2} + \frac{2(1-\epsilon)\gamma_d}{\|a_t\|}\right)^{1/2} \le \|a_t\|\left(1 + \frac{R^2}{2\|a_t\|^2} + \frac{(1-\epsilon)\gamma_d}{\|a_t\|}\right).$$
Now, by making use of $\|a_t\| \ge \gamma_d t$, we observe that
$$\|a_{t+1}\| - \|a_t\| \le \frac{R^2}{2\|a_t\|} + (1-\epsilon)\gamma_d \le \frac{R^2}{2\gamma_d}\,t^{-1} + (1-\epsilon)\gamma_d.$$
A repeated application of the above inequality $t - N$ times ($t > N \ge 1$) gives
$$\|a_t\| - \|a_N\| \le \frac{R^2}{2\gamma_d}\sum_{k=N}^{t-1}k^{-1} + (1-\epsilon)\gamma_d(t - N) < \frac{R^2}{2\gamma_d}\left(\frac{1}{N} + \int_N^t k^{-1}\,dk\right) + (1-\epsilon)\gamma_d(t - N),$$
from where, using the obvious bound $\|a_N\| \le RN$, we get an upper bound on $\|a_t\|$:
$$\|a_t\| < \frac{R^2}{2\gamma_d}\left(\frac{1}{N} + \ln\frac{t}{N}\right) + (1-\epsilon)\gamma_d(t - N) + RN.$$
Combining the above upper bound on $\|a_t\|$, which holds not only for $t > N$ but also for $t = N$, with the lower bound from (3) we obtain
$$t < \frac{1}{2\epsilon}\frac{R^2}{\gamma_d^2}\left\{\frac{1}{N} - \ln N + \frac{2\gamma_d}{R}\left(1 - \frac{\gamma_d}{R}(1-\epsilon)\right)N + \ln t\right\}.$$
Setting
$$\delta = \frac{1}{2\epsilon}\frac{R^2}{\gamma_d^2}, \qquad \alpha = \frac{2\gamma_d}{R}\left(1 - \frac{\gamma_d}{R}(1-\epsilon)\right)$$
and choosing $N = 1 + [\alpha^{-1}]$, with $[x]$ being the integer part of $x \ge 0$, we finally get
$$t < \delta(1 + 2\alpha + \ln\alpha + \ln t). \qquad (11)$$
Notice that in deriving (11) we made use of the fact that $\alpha N + N^{-1} - \ln N < 1 + 2\alpha + \ln\alpha$. Inequality (11) has the form (9) with $C = 2\alpha + \ln\alpha$. Obviously, $e^{-C} < \alpha^{-1} < N \le t$ and $e^{-C} < \alpha^{-1} \le \delta$. Thus, the conditions of Lemma 1 are satisfied and the required bound, which is of the form (10), follows from (11). ⊓⊔

Finally, we arrive at our main result, which is the proof of convergence of PDM in a finite number of steps and the derivation of the relevant upper bound.

Theorem 2.
The number $t$ of updates of the perceptron algorithm with dynamic margin satisfies the bound
$$t \le t_0\left(1 - \frac{1}{1-2\epsilon}\frac{R^2}{\gamma_d^2}\,t_0^{-1}\right)^{\frac{1}{2\epsilon}}, \qquad t_0 \equiv [\epsilon^{-1}]\left(\frac{R}{\gamma_d}\right)^{\frac{1}{\epsilon}}\left(1 + \frac{[\epsilon^{-1}]^{-1}}{1-2\epsilon}\right)^{\frac{1}{2\epsilon}} \qquad \text{if } \epsilon < \frac{1}{2},$$
$$t \le (1 + e^{-1})\frac{R^2}{\gamma_d^2}\ln\left((1+e)\frac{R^2}{\gamma_d^2}\right) \qquad \text{if } \epsilon = \frac{1}{2},$$
$$t \le t_0\left(1 - 2(1-\epsilon)\,t_0^{1-2\epsilon}\right), \qquad t_0 \equiv \frac{\epsilon(3-2\epsilon)}{2\epsilon-1}\frac{R^2}{\gamma_d^2} \qquad \text{if } \epsilon > \frac{1}{2}.$$

Proof.
From (2) and (8) we get
$$\|a_{t+1}\|^2 = \|a_t\|^2 + 2a_t \cdot y_k + \|y_k\|^2 \le \|a_t\|^2\left(1 + \frac{2(1-\epsilon)}{t}\right) + R^2. \qquad (12)$$
Let us assume that $\epsilon < 1/2$. Then, using the inequality $(1+x)^\zeta \ge 1 + \zeta x$ for $x \ge 0$, $\zeta = 2(1-\epsilon) \ge 1$, we obtain
$$\|a_{t+1}\|^2 \le \|a_t\|^2\left(1 + \frac{1}{t}\right)^{2(1-\epsilon)} + R^2,$$
from where, by dividing both sides with $(t+1)^{2(1-\epsilon)}$, we arrive at
$$\frac{\|a_{t+1}\|^2}{(t+1)^{2(1-\epsilon)}} - \frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \frac{R^2}{(t+1)^{2(1-\epsilon)}}.$$
A repeated application of the above inequality $t - N$ times ($t > N \ge 1$) gives
$$\frac{\|a_t\|^2}{t^{2(1-\epsilon)}} - \frac{\|a_N\|^2}{N^{2(1-\epsilon)}} \le R^2\sum_{k=N+1}^{t}k^{-2(1-\epsilon)} \le R^2\int_N^t k^{-2(1-\epsilon)}\,dk = R^2\,\frac{N^{2\epsilon-1}}{2\epsilon-1}\left(\left(\frac{t}{N}\right)^{2\epsilon-1} - 1\right). \qquad (13)$$
Now, let us define
$$\alpha_t \equiv \frac{\|a_t\|}{Rt}$$
and observe that the bounds $\|a_t\| \le Rt$ and $\|a_t\| \ge \gamma_d t$ confine $\alpha_t$ to lie in the range $\gamma_d/R \le \alpha_t \le 1$. Setting $\|a_N\| = \alpha_N RN$ in (13) we get the following upper bound on $\|a_t\|^2$:
$$\|a_t\|^2 \le t^{2(1-\epsilon)}\,\alpha_N^2 R^2 N^{2\epsilon}\left\{1 + \frac{\alpha_N^{-2}N^{-1}}{2\epsilon-1}\left(\left(\frac{t}{N}\right)^{2\epsilon-1} - 1\right)\right\},$$
which combined with the lower bound $\|a_t\| \ge \gamma_d t$ leads to
$$t^{2\epsilon} \le \alpha_N^2\,\frac{R^2}{\gamma_d^2}\,N^{2\epsilon}\left\{1 + \frac{\alpha_N^{-2}N^{-1}}{2\epsilon-1}\left(\left(\frac{t}{N}\right)^{2\epsilon-1} - 1\right)\right\}. \qquad (14)$$
For $\epsilon < 1/2$ the term proportional to $(t/N)^{2\epsilon-1}$ in (14) is negative and may be dropped to a first approximation, leading to the looser upper bound
$$t \le t_0 \equiv N\left(\frac{\alpha_N R}{\gamma_d}\right)^{\frac{1}{\epsilon}}\left(1 + \frac{\alpha_N^{-2}N^{-1}}{1-2\epsilon}\right)^{\frac{1}{2\epsilon}} \qquad (15)$$
on the number $t$ of updates. Then, we may replace $t$ with its upper bound $t_0$ in the r.h.s. of (14) and get the improved bound
$$t \le t_0\left(1 - \frac{1}{1-2\epsilon}\frac{R^2}{\gamma_d^2}\,t_0^{-1}\right)^{\frac{1}{2\epsilon}}.$$
This is allowed given that the term proportional to $(t/N)^{2\epsilon-1}$ in (14) is negative and, moreover, $t$ is raised to a negative power. Choosing $N = [\epsilon^{-1}]$ and $\alpha_N = 1$ (i.e., setting $\alpha_N$ to its upper bound, which is the least favorable assumption) we obtain the bound stated in Theorem 2 for $\epsilon < 1/2$.

Let us now assume that $\epsilon > 1/2$. Then, using the inequality $(1+x)^\zeta + \zeta(1-\zeta)x^2/2 \ge 1 + \zeta x$ for $x \ge 0$, $0 \le \zeta = 2(1-\epsilon) \le 1$, and $\|a_t\| \le Rt$, we obtain
$$\|a_{t+1}\|^2 \le \|a_t\|^2\left(1 + \frac{1}{t}\right)^{2(1-\epsilon)} + (1-\epsilon)(2\epsilon-1)\frac{\|a_t\|^2}{t^2} + R^2 \le \|a_t\|^2\left(1 + \frac{1}{t}\right)^{2(1-\epsilon)} + \epsilon(3-2\epsilon)R^2.$$
By dividing both sides of the above inequality with $(t+1)^{2(1-\epsilon)}$ we arrive at
$$\frac{\|a_{t+1}\|^2}{(t+1)^{2(1-\epsilon)}} - \frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \frac{\epsilon(3-2\epsilon)R^2}{(t+1)^{2(1-\epsilon)}}, \qquad (16)$$
a repeated application of which, using also $\|a_1\|^2 \le R^2 \le \epsilon(3-2\epsilon)R^2$, gives
$$\frac{\|a_t\|^2}{t^{2(1-\epsilon)}} \le \epsilon(3-2\epsilon)R^2\sum_{k=1}^{t}k^{-2(1-\epsilon)} \le \epsilon(3-2\epsilon)R^2\left(1 + \int_1^t k^{-2(1-\epsilon)}\,dk\right) = \epsilon(3-2\epsilon)R^2\left(1 + \frac{t^{2\epsilon-1}-1}{2\epsilon-1}\right).$$
Combining the above bound with the bound $\|a_t\| \ge \gamma_d t$ we obtain
$$t^{2\epsilon} \le \epsilon(3-2\epsilon)\frac{R^2}{\gamma_d^2}\left(1 + \frac{t^{2\epsilon-1}-1}{2\epsilon-1}\right) \qquad (17)$$
or
$$t \le \frac{\epsilon(3-2\epsilon)}{2\epsilon-1}\frac{R^2}{\gamma_d^2}\left(1 - 2(1-\epsilon)\,t^{1-2\epsilon}\right). \qquad (18)$$
For $\epsilon > 1/2$ the term proportional to $t^{1-2\epsilon}$ in (18) is negative and may be dropped to a first approximation, leading to the looser upper bound
$$t \le t_0 \equiv \frac{\epsilon(3-2\epsilon)}{2\epsilon-1}\frac{R^2}{\gamma_d^2}$$
on the number $t$ of updates. Then, we may replace $t$ with its upper bound $t_0$ in the r.h.s. of (18) and get the improved bound stated in Theorem 2 for $\epsilon > 1/2$. This is allowed given that the term proportional to $t^{1-2\epsilon}$ in (18) is negative and, moreover, $t$ is raised to a negative power.

Finally, taking the limit $\epsilon \to 1/2$ either in (14) (with $N = 1$, $\alpha_N = 1$) or in (17) we get
$$t \le \frac{R^2}{\gamma_d^2}(1 + \ln t),$$
which on account of Lemma 1 leads to the bound of Theorem 2 for $\epsilon = 1/2$. ⊓⊔
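To get a feel for the size of these bounds, the helper below (ours) evaluates the three cases of Theorem 2 as a function of $\epsilon$ and $R/\gamma_d$; it is meaningful when the bracketed correction factors remain positive, and the $\epsilon < 1/2$ branch can overflow floating point for very small $\epsilon$:

```python
import math

def theorem2_bound(eps, R_over_gamma):
    """Worst-case bound on the number of PDM updates (Theorem 2)."""
    r2 = R_over_gamma ** 2
    if eps < 0.5:
        N = int(1.0 / eps)                       # [eps^{-1}], integer part
        t0 = N * R_over_gamma ** (1.0 / eps) \
               * (1.0 + (1.0 / N) / (1.0 - 2.0 * eps)) ** (1.0 / (2.0 * eps))
        return t0 * (1.0 - r2 / ((1.0 - 2.0 * eps) * t0)) ** (1.0 / (2.0 * eps))
    if eps == 0.5:
        return (1.0 + math.exp(-1.0)) * r2 * math.log((1.0 + math.e) * r2)
    t0 = eps * (3.0 - 2.0 * eps) / (2.0 * eps - 1.0) * r2
    return t0 * (1.0 - 2.0 * (1.0 - eps) * t0 ** (1.0 - 2.0 * eps))
```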
Remark 1. The bound of Theorem 2 holds for PFM as well, on account of (4).

The worst-case bound of Theorem 2 for $\epsilon \ll 1$ is $\sim \epsilon^{-1}(R/\gamma_d)^{1/\epsilon}$, which suggests an extremely slow convergence if we require margins close to the maximum. From expression (15) for $t_0$, however, it becomes apparent that a more favorable assumption concerning the value of $\alpha_N$ (e.g., $\alpha_N \ll 1$ or even $\alpha_N \sim \gamma_d/R$) after the first $N \gg \alpha_N^{-2}$ updates does lead to tremendous improvement provided, of course, that $N$ is not extremely large. Such a sharp decrease of $\|a_t\|/t$ in the early stages of the algorithm, which may be expected from relation (6) and the discussion that followed, lies behind its experimentally exhibited rather fast convergence.

It would be interesting to find a procedure by which the algorithm will be forced to a guaranteed sharp decrease of the ratio $\|a_t\|/t$. The following two observations will be vital in devising such a procedure. First, we notice that when PDM with accuracy parameter $\epsilon_1$ has converged in $t_{c_1}$ updates the threshold $(1-\epsilon_1)\|a_{t_{c_1}}\|^2/t_{c_1}$ of the misclassification condition must have fallen below $\gamma_d\|a_{t_{c_1}}\|$. Otherwise, the normalized margin $u_{t_{c_1}} \cdot y_k$ of all patterns $y_k$ would be larger than $\gamma_d$. Thus, $\alpha_{t_{c_1}} < (1-\epsilon_1)^{-1}\gamma_d/R$. Second, after convergence of the algorithm with accuracy parameter $\epsilon_1$ in $t_{c_1}$ updates we may lower the accuracy parameter from the value $\epsilon_1$ to the value $\epsilon_2$ and continue the run from the point where convergence with parameter $\epsilon_1$ has taken place, since for all updates that took place during the first run the misclassified patterns would certainly satisfy (at that time) the condition associated with the smaller parameter $\epsilon_2$. This way, the first run is legitimately fully incorporated into the second one and the $t_{c_1}$ updates required for convergence during the first run may be considered the first $t_{c_1}$ updates of the second run under this specific policy of presenting patterns to the algorithm. Combining the above two observations we see that by employing a first run with accuracy parameter $\epsilon_1$ we force the algorithm with accuracy parameter $\epsilon_2 < \epsilon_1$ to have $\alpha_t$ decreased from a value $\sim 1$ to a value $\alpha_{t_{c_1}} < (1-\epsilon_1)^{-1}\gamma_d/R$ in the first $t_{c_1}$ updates.

The above discussion suggests that we consider a decreasing sequence of parameters $\epsilon_n$ such that $\epsilon_{n+1} = \epsilon_n/\eta$ ($\eta > 1$), starting with a relatively large $\epsilon_1$ and ending with the required accuracy $\epsilon$, and perform successive runs of PDM with accuracies $\epsilon_n$ until convergence in $t_{c_n}$ updates is reached. According to our earlier discussion, $t_{c_n}$ includes the updates that led the algorithm to convergence in the current and all previous runs. Moreover, at the end of the run with parameter $\epsilon_n$ we will have ensured that $\alpha_{t_{c_n}} < (1-\epsilon_n)^{-1}\gamma_d/R$. Therefore, $t_{c_{n+1}}$ satisfies $t_{c_{n+1}} \le t_0$ or
$$t_{c_{n+1}} \le t_{c_n}\,(1-\epsilon_n)^{-\eta/\epsilon_n}\left(1 + \frac{(1-\epsilon_n)^2}{1-2\epsilon_n/\eta}\frac{R^2}{\gamma_d^2}\,t_{c_n}^{-1}\right)^{\frac{\eta}{2\epsilon_n}}.$$
This is obtained by substituting in (15) the values $\epsilon = \epsilon_{n+1} = \epsilon_n/\eta$, $N = t_{c_n}$ and $\alpha_N = (1-\epsilon_n)^{-1}\gamma_d/R$, which is the least favorable choice for $\alpha_{t_{c_n}}$. Let us assume that $\epsilon_n \ll 1$ and $t_{c_n} = \xi_n^{-1}R^2/\gamma_d^2$ with $\xi_n \ll 1$. Then, $1/(1-\epsilon_n)^{\eta/\epsilon_n} \simeq e^\eta$ and
$$\left(1 + \frac{(1-\epsilon_n)^2}{1-2\epsilon_n/\eta}\frac{R^2}{\gamma_d^2}\,t_{c_n}^{-1}\right)^{\frac{\eta}{2\epsilon_n}} \simeq (1+\xi_n)^{\frac{\eta}{2\epsilon_n}} \simeq e^{\eta\xi_n/(2\epsilon_n)}.$$
For $\xi_n \simeq \epsilon_n$ the term above becomes approximately $e^{\eta/2}$, while for $\xi_n \ll \epsilon_n$ it approaches 1. We see that, under the assumption that PDM with accuracy parameter $\epsilon_n$ converges in a number of updates $\gg R^2/\gamma_d^2$, the ratio $t_{c_{n+1}}/t_{c_n}$ in the successive run scenario is rather tightly constrained. If, instead, our assumption is not satisfied then convergence of the algorithm is fast anyway. Notice that the value of $t_{c_{n+1}}/t_{c_n}$ inferred from the bound of Theorem 2 is $\sim \eta(R/\gamma_d)^{(\eta-1)/\epsilon_n}$, which is extremely large. We conclude that PDM employing the successive run scenario (PDM-succ) potentially converges in a much smaller number of steps.
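A minimal sketch (ours) of the successive run scenario; the warm start simply carries $a$, $\ell$ and $t$ over from run to run, and the default starting accuracy and step ($\epsilon_1 = 0.64$, $\eta = 8$) are illustrative assumptions mirroring the choices quoted in Sect. 5:

```python
import numpy as np

def pdm_succ(Y, eps, eta=8.0, eps_start=0.64):
    """PDM with successive runs (PDM-succ): run PDM to convergence with a
    decreasing sequence of accuracies eps_n (eps_{n+1} = eps_n / eta), each
    run continuing from the state reached by the previous one."""
    m, d = Y.shape
    q = (Y ** 2).sum(axis=1)
    a, ell, t = np.zeros(d), 0.0, 0
    eps_n = eps_start
    while True:
        converged = False
        while not converged:                    # one PDM run at accuracy eps_n
            converged = True
            for k in range(m):
                p = a @ Y[k]
                theta = 0.0 if t == 0 else (1.0 - eps_n) * ell / t
                if p <= theta:                  # condition (8) at accuracy eps_n
                    a += Y[k]
                    ell += 2.0 * p + q[k]
                    t += 1
                    converged = False
        if eps_n <= eps:                        # required accuracy reached
            return a, t
        eps_n = max(eps_n / eta, eps)           # lower the accuracy parameter
```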
4 Implementation

To reduce the computational cost involved in running PDM, we extend the procedure of [14, 13] and construct a three-member nested sequence of reduced "active sets" of data points. As we cycle once through the full dataset, the (largest) first-level active set is formed from the points of the full dataset satisfying $a_t \cdot y_k \le c_1(1-\epsilon)\|a_t\|^2/t$ with $c_1 = 2.2$. Analogously, the second-level active set is formed, as we cycle once through the first-level active set, from the points which satisfy $a_t \cdot y_k \le c_2(1-\epsilon)\|a_t\|^2/t$ with $c_2 = 1.1$. The third-level active set comprises the points that satisfy $a_t \cdot y_k \le (1-\epsilon)\|a_t\|^2/t$ as we cycle once through the second-level active set. The third-level active set is presented repetitively to the algorithm for $N_{ep_3}$ mini-epochs. Then, the second-level active set is presented $N_{ep_2}$ times. During each round involving the second-level set, a new third-level set is constructed and a new cycle of $N_{ep_3}$ passes begins. When the number $N_{ep_2}$ of cycles involving the second-level set is reached, the first-level set becomes active again, leading to the population of a new second-level active set. By invoking the first-level set for the $(N_{ep_1}+1)$th time, we trigger the loading of the full dataset and the procedure starts all over again until no point is found misclassified among the ones comprising the full dataset. Of course, the $N_{ep_1}$, $N_{ep_2}$ and $N_{ep_3}$ rounds are not exhausted if no update takes place during a round. In all experiments we choose $N_{ep_1} = 9$ and $N_{ep_2} = N_{ep_3} = 12$. In addition, every time we make use of the full dataset we actually employ a permuted instance of it. Evidently, the whole procedure amounts to a different way of sequentially presenting the patterns to the algorithm and does not affect the applicability of our theoretical analysis. A completely analogous procedure is followed for PFM.

An additional mechanism providing a substantial improvement of the computational efficiency is that of performing multiple updates [14, 13] once a data point is presented to the algorithm. It is understood, of course, that in order for a multiple update to be compatible with our theoretical analysis it should be equivalent to a certain number of updates occurring as a result of repeatedly presenting to the algorithm the data point in question. For PDM, when a pattern $y_k$ is found to satisfy the misclassification condition (8) we perform $\lambda = [\mu_+] + 1$ updates at once. Here, $\mu_+$ is the smallest non-negative root of the quadratic equation in the variable $\mu$ derivable from the relation $(t+\mu)\,a_{t+\mu} \cdot y_k - (1-\epsilon)\|a_{t+\mu}\|^2 = 0$, in which $a_{t+\mu} \cdot y_k = a_t \cdot y_k + \mu\|y_k\|^2$ and $\|a_{t+\mu}\|^2 = \|a_t\|^2 + 2\mu\, a_t \cdot y_k + \mu^2\|y_k\|^2$. Thus, we require that as a result of the multiple update the pattern violates the misclassification condition (see the sketch below). Similarly, we perform multiple updates for PFM.

Finally, in the case of PDM (no successive runs), when we perform multiple updates we start doing so after the first full epoch. This way, we avoid the excessive growth of the length of the weight vector, due to the contribution to the solution of many aligned patterns in the early stages of the algorithm, which hinders the fast decrease of $\|a_t\|/t$. Moreover, in this scenario, when we select the first-level active set as we go through the full dataset for the first time (first full epoch), we found it useful to set $c_1 = c_2 = 1.1$ rather than $c_1 = 2.2$.
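Expanding the relation above yields a quadratic in $\mu$, namely $\epsilon q\,\mu^2 + (tq - (1-2\epsilon)p)\,\mu + (tp - (1-\epsilon)\ell) = 0$, where $p = a_t \cdot y_k$, $q = \|y_k\|^2$ and $\ell = \|a_t\|^2$ (the expansion and the helper below are ours):

```python
import math

def multiple_update_count(p, q, ell, t, eps):
    """lambda = [mu_+] + 1 updates to perform at once, mu_+ being the smallest
    non-negative root of
        eps*q*mu^2 + (t*q - (1-2*eps)*p)*mu + (t*p - (1-eps)*ell) = 0.

    For a pattern satisfying condition (8), t*p - (1-eps)*ell <= 0, so the
    product of the roots is non-positive and a non-negative root exists."""
    A = eps * q
    B = t * q - (1.0 - 2.0 * eps) * p
    C = t * p - (1.0 - eps) * ell
    disc = B * B - 4.0 * A * C              # >= B*B, since C <= 0
    mu_plus = (-B + math.sqrt(disc)) / (2.0 * A)
    return int(math.floor(mu_plus)) + 1
```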
5 Experimental Results

We compare PDM with several other large margin classifiers on the basis of their ability to achieve fast convergence to a certain approximation of the "optimal" hyperplane in the feature space where the patterns are linearly separable. For linearly separable data the feature space is the initial instance space, whereas for inseparable data (which is the case here) a space extended by as many dimensions as the instances is considered, where each instance is placed at a distance $\Delta$ from the origin in the corresponding dimension [4]. This extension generates a margin of at least $\Delta/\sqrt{m}$. Moreover, its employment relies on the well-known equivalence between the hard margin optimization in the extended space and the soft margin optimization in the initial instance space with objective function $\|w\|^2 + \Delta^{-2}\sum_i \bar\xi_i^2$, involving the weight vector $w$ and the 2-norm of the slacks $\bar\xi_i$ [2]. (In the extended space $y_k = [\bar y_k, l_k\Delta\delta_{1k}, \ldots, l_k\Delta\delta_{mk}]$, where $\delta_{ij}$ is Kronecker's $\delta$ and $\bar y_k$ is the projection of the $k$th extended instance $y_k$, multiplied by its label $l_k$, onto the initial instance space. The feature space mapping defined by the extension commutes with a possible augmentation (with parameter $\rho$), in which case $\bar y_k = [l_k \bar x_k, l_k \rho]$. Here $\bar x_k$ represents the $k$th data point.) Of course, all algorithms are required to solve identical hard margin problems.
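A sketch (ours) of the extension combined with the augmentation; the dense identity block is kept for clarity, although for large $m$ a sparse representation is mandatory:

```python
import numpy as np

def extend_and_augment(X, labels, Delta, rho=1.0):
    """Return the (m, d+1+m) array of reflected patterns
    y_k = l_k * [x_k, rho, Delta * e_k], where e_k is the k-th standard basis
    vector of R^m. Each instance owns one extra dimension at distance Delta,
    which guarantees linear separability with margin at least Delta / sqrt(m)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    m = X.shape[0]
    Y = np.hstack([X, np.full((m, 1), rho), Delta * np.eye(m)])
    return labels[:, None] * Y
```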
The datasets we used for training are: the Adult ($m = 32561$ instances, $n = 123$ attributes) and Web ($m = 49749$, $n = 300$) UCI datasets as compiled by Platt [15], the training set of the KDD04 Physics dataset ($m = 50000$, $n = 70$ after removing the 8 columns containing missing features) obtainable from http://kodiak.cs.cornell.edu/kddcup/datasets.html, the Real-sim ($m = 72309$, $n = 20958$), News20 ($m = 19996$, $n = 1355191$) and Webspam (unigram treatment with $m = 350000$, $n = 254$) datasets, all available at the LIBSVM datasets collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), the multiclass Covertype UCI dataset ($m = 581012$, $n = 54$) and the full Reuters RCV1 dataset ($m = 804414$, $n = 47236$), obtainable from the same collection. For the Covertype dataset we study the binary classification problem of the first class versus rest, while for RCV1 we consider both the binary text classification tasks of the C11 and CCAT classes versus rest. The Physics and Covertype datasets were rescaled by a multiplicative factor 0.001. The experiments were conducted on a 2.5 GHz Intel Core 2 Duo processor with 3 GB RAM running Windows Vista. Our codes, written in C++, were compiled using the g++ compiler under Cygwin.

The parameter $\Delta$ of the extended space is chosen in such a way that it corresponds approximately to $R/10$, provided the ratio $\gamma_d/R$ does not become too small (given that the extension generates a margin of at least $\Delta/\sqrt{m}$). More specifically, we have chosen $\Delta = 3$ for Covertype, $\Delta = 1$ for Adult, Web and Physics, and smaller values of $\Delta$ for the remaining datasets. Larger values of $\Delta$ do not lead to a significant decrease of the training error. For all datasets and for algorithms that introduce bias through augmentation the associated parameter $\rho$ was set to the value $\rho = 1$.

We begin our experimental evaluation by comparing PDM with PFM. We run PDM with accuracy parameter $\epsilon = 0.01$
and subsequently PFM with the fixed margin $\beta = (1-\epsilon)\gamma_d$ set to the value $\gamma'_d$ of the directional margin achieved by PDM. This procedure is repeated using PDM-succ with step $\eta = 8$ (i.e., $\epsilon_1 = 0.64$, $\epsilon_2 = 0.08$, $\epsilon_3 = \epsilon = 0.01$). The results (the margin $\gamma'_d$ achieved, the number of required updates (upd) for convergence and the CPU time for training in seconds (s)) are presented in Table 1. We see that PDM is considerably faster than PFM as far as training time is concerned, in spite of the fact that PFM needs far fewer updates for convergence. The successive run scenario succeeds, in accordance with our expectations, in reducing the number of updates to the level of the updates needed by PFM in order to achieve the same value of $\gamma'_d$, at the expense of an increased runtime. We believe that it is fair to say that PDM-succ with $\eta = 8$ has the overall performance of PFM without the defect of the need for a priori knowledge of the value of $\gamma_d$. We also notice that although the accuracy $\epsilon$ is set to the same value for both scenarios, the two scenarios achieve slightly different approximations of $\gamma_d$.

Table 1. Results of an experimental evaluation comparing the algorithms PDM and PDM-succ with PFM. For each of the datasets Adult, Web, Physics, Real-sim, News20, Webspam, Covertype, C11 and CCAT, the margin $\gamma'_d$ achieved, the number of required updates (upd) and the training time in seconds (s) are listed for PDM with $\epsilon = 0.01$, together with upd and s for the PFM run targeting the same margin; the same quantities are given for PDM-succ with $\epsilon = 0.01$ and its corresponding PFM run.
We also considered other large margin classifiers representing classes of algorithms such as perceptron-like algorithms, decomposition SVMs and linear SVMs, with the additional requirement that the chosen algorithms need only specification of an accuracy parameter. From the class of perceptron-like algorithms we have chosen (aggressive) ROMMA, which is much faster than ALMA in the light of the results presented in [9, 14]. Decomposition SVMs are represented by SVM^light [7] which, apart from being one of the fastest algorithms of this class, has the additional advantage of making very efficient use of memory, thereby making possible the training on very large datasets. Finally, from the more recent class of linear SVMs we have included in our study the dual coordinate descent (DCD) algorithm [8] and the margin perceptron with unlearning (MPU) [13]. (MPU uses dual variables but is not formulated as an optimization. It is a perceptron incorporating a mechanism of reduction of possible contributions from "very-well classified" patterns to the weight vector, which is an essential ingredient of SVMs.) We considered the DCD versions with 1-norm (DCD-L1) and 2-norm (DCD-L2) soft margin which, for the same value of the accuracy parameter, produce identical solutions if the penalty parameter is $C = \infty$ for DCD-L1 and $C = 1/(2\Delta^2)$ for DCD-L2. The source for SVM^light (version 6.02) is available at http://svmlight.joachims.org and for DCD at the LIBLINEAR site (http://www.csie.ntu.edu.tw/~cjlin/liblinear). The absence of publicly available implementations for ROMMA necessitated the writing of our own code in C++, employing the mechanism of active sets proposed in [9] and incorporating a mechanism of permutations performed at the beginning of a full epoch. For MPU the implementation followed closely [13], with the active set parameter $\bar c$ set slightly above 1, $N_{ep_1} = N_{ep_2} = 5$, gap parameter $\delta b = 3R$ and early stopping.

The experimental results (margin values achieved and training runtimes) involving the above algorithms, with the accuracy parameter set to 0.01 for all of them, are presented in Table 2.
Table 2. Results of experiments with the ROMMA, SVM^light, DCD-L1, DCD-L2 and MPU algorithms on the Adult, Web, Physics, Real-sim, News20, Webspam, Covertype, C11 and CCAT datasets. The accuracy parameter for all algorithms is set to 0.01. For each algorithm the margin achieved ($\gamma'$ for SVM^light, $\gamma'_d$ for the rest; omitted for DCD-L2, whose margin coincides with that of DCD-L1) and the training time in seconds (s) are listed.
For SVM^light we give the geometric margin $\gamma'$ instead of the directional one $\gamma'_d$ because SVM^light does not introduce bias through augmentation. For the rest of the algorithms considered, including PDM and PFM, the geometric margin $\gamma'$ achieved is not listed in the tables since it is very close to the directional margin $\gamma'_d$ if the augmentation parameter $\rho$ is set to the value $\rho = 1$. Moreover, for DCD-L1 and DCD-L2 the margin values coincide, as we pointed out earlier. From Table 2 it is apparent that ROMMA and SVM^light are orders of magnitude slower than DCD and MPU. Comparing the results of Table 1 with those of Table 2 we see that PDM is orders of magnitude faster than ROMMA, which is its natural competitor since they both belong to the class of perceptron-like algorithms. PDM is also much faster than SVM^light but statistically a few times slower than DCD, especially for the larger datasets. Moreover, PDM is a few times slower than MPU for all datasets. Finally, we observe that the accuracy achieved by PDM is, in general, closer to the before-run accuracy 0.01 since in most cases PDM obtains lower margin values. This indicates that PDM succeeds in obtaining a better estimate of the maximum margin than the remaining algorithms, with the possible exception of MPU.

Before we conclude our comparative study it is fair to point out that PDM is not the fastest perceptron-like large margin classifier. From the results of [14] the fastest algorithm of this class is the margitron, which has strong before-run guarantees and a very good after-run estimate of the achieved accuracy through (5). However, its drawback is that an approximate knowledge of the value of $\gamma_d$ (preferably an upper bound) is required in order to fix the parameter controlling the margin threshold. Although there is a procedure to obtain this information, taking all the facts into account the employment of PDM seems preferable.

6 Conclusions
We introduced the perceptron with dynamic margin (PDM), a new approximate maximum margin classifier employing the classical perceptron update, demonstrated its convergence in a finite number of steps and derived an upper bound on them. PDM uses the required accuracy as the only input parameter. Moreover, it is a strictly online algorithm in the sense that it decides whether to perform an update taking into account only its current state and irrespective of whether the pattern presented to it has been encountered before in the process of cycling repeatedly through the dataset. This certainly does not hold for linear SVMs. Our experimental results indicate that PDM is the fastest large margin classifier enjoying the above two very desirable properties.
References

4. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Machine Learning 37 (1999) 277–296
5. Gentile, C.: A new approximate maximal margin classification algorithm. Journal of Machine Learning Research 2 (2001) 213–242
6. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods – Support Vector Learning. MIT Press (1999)
7. Joachims, T.: Training linear SVMs in linear time. KDD (2006) 217–226
8. Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. ICML (2008) 408–415
9. Ishibashi, K., Hatano, K., Takeda, M.: Online learning of maximum p-norm margin classifiers. COLT (2008) 69–80
10. Krauth, W., Mézard, M.: Learning algorithms with optimal stability in neural networks. Journal of Physics A 20 (1987) L745–L752
11. Li, Y., Long, P.: The relaxed online maximum margin algorithm. Machine Learning 46 (2002) 361–387
12. Novikoff, A.B.J.: On convergence proofs on perceptrons. In: Proc. Symp. Math. Theory of Automata, Vol. 12 (1962) 615–622
13. Panagiotakopoulos, C., Tsampouka, P.: The margin perceptron with unlearning. ICML (2010) 855–862
14. Panagiotakopoulos, C., Tsampouka, P.: The margitron: A generalized perceptron with margin. IEEE Transactions on Neural Networks 22 (2011) 395–407
15. Platt, J.C.: Sequential minimal optimization: A fast algorithm for training support vector machines. Microsoft Research, Redmond, WA, Tech. Rep. MSR-TR-98-14 (1998)
16. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6) (1958) 386–408