Robustness and Regularization of Support Vector Machines
Huan Xu [email protected]
Department of Electrical and Computer Engineering, McGill University, Canada
Constantine Caramanis [email protected]
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Shie Mannor [email protected]
Department of Electrical and Computer Engineering, McGill University, Canada
Editor:
Alexander Smola
Abstract
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
Keywords:
Robustness, Regularization, Generalization, Kernel, Support Vector Machine
1. Introduction
Support Vector Machines (SVMs for short) originated in Boser et al. (1992) and can be traced back to as early as Vapnik and Lerner (1963) and Vapnik and Chervonenkis (1974). They continue to be one of the most successful algorithms for classification. SVMs address the classification problem by finding the hyperplane in the feature space that achieves maximum sample margin when the training samples are separable, which leads to minimizing the norm of the classifier. When the samples are not separable, a penalty term that approximates the total training error is considered (Bennett and Mangasarian, 1992; Cortes and Vapnik, 1995). It is well known that minimizing the training error itself can lead to poor classification performance for new unlabeled data; that is, such an approach may have poor generalization error because of, essentially, overfitting (Vapnik and Chervonenkis, 1991). A variety of modifications have been proposed to combat this problem, one of the most popular methods being that of minimizing a combination of the training error and a regularization term. The latter is typically chosen as a norm of the classifier. The resulting regularized classifier performs better on new data. This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation of the testing error from the training error is controlled (see Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005, and references therein).

In this paper we consider a different setup, assuming that the training data are generated by the true underlying distribution, but some non-i.i.d. (potentially adversarial) disturbance is then added to the samples we observe. We follow a robust optimization (see El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bertsimas and Sim, 2004, and references therein) approach, i.e., minimizing the worst possible empirical error under such disturbances. The use of robust optimization in classification is not new (e.g., Shivaswamy et al., 2006; Bhattacharyya et al., 2004b; Lanckriet et al., 2002). Robust classification models studied in the past have considered only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner by a correlated disturbance. This has made it difficult to obtain non-conservative generalization bounds. Moreover, there has not been an explicit connection to the regularized classifier, although at a high level it is known that regularization and robust optimization are related (e.g., El Ghaoui and Lebret, 1997; Anthony and Bartlett, 1999). The main contribution of this paper is solving the robust classification problem for a class of non-box-type uncertainty sets, and providing a link between robust classification and the standard regularization scheme of SVMs. In particular, our contributions include the following:

• We solve the robust SVM formulation for a class of non-box-type uncertainty sets. This permits finer control of the adversarial disturbance, restricting it to satisfy aggregate constraints across data points, therefore reducing the possibility of highly correlated disturbance.

• We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization.
This provides an alternative explanation for the success of regularization, and also suggests new physically motivated ways to construct regularization terms.

• We relate our robust formulation to several probabilistic formulations. We consider a chance-constrained classifier (i.e., a classifier with probabilistic constraints on misclassification) and show that our robust formulation can approximate it far less conservatively than previous robust formulations could possibly do. We also consider a Bayesian setup, and show that this can be used to provide a principled means of selecting the regularization coefficient without cross-validation.

• We show that the robustness perspective, stemming from a non-i.i.d. analysis, can be useful in the standard learning (i.i.d.) setup, by using it to prove consistency for standard SVM classification, without using VC-dimension or stability arguments. This result implies that generalization ability is a direct result of robustness to local disturbances; it therefore suggests a new justification for good performance, and consequently allows us to construct learning algorithms that generalize well by robustifying non-consistent algorithms.
We comment here on the explicit equivalence of robustness and regularization. We briefly explain how this observation differs from previous work and why it is interesting. Certain equivalence relationships between robustness and regularization have been established for problems other than classification (El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bishop, 1995), but their results do not directly apply to the classification problem. Indeed, research on classifier regularization mainly discusses its effect on bounding the complexity of the function class (e.g., Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005). Meanwhile, research on robust classification has not attempted to relate robustness and regularization (e.g., Lanckriet et al., 2002; Bhattacharyya et al., 2004a,b; Shivaswamy et al., 2006; Trafalis and Gilbert, 2007; Globerson and Roweis, 2006), in part due to the robustness formulations used in those papers. In fact, they all consider robustified versions of regularized classifications. Bhattacharyya (2004) considers a robust formulation for box-type uncertainty, and relates this robust formulation with the regularized SVM. However, this formulation involves a non-standard loss function that does not bound the 0-1 loss of corrupted samples, and is hence different from the known PAC bounds. Robustness to such disturbances is helpful when the training samples and the testing samples are drawn from different distributions, or when some adversary manipulates the samples to prevent them from being correctly labeled (e.g., spam senders change their patterns from time to time to avoid being labeled and filtered). Finally, this connection of robustification and regularization also provides us with new proof techniques (see Section 5).

1. Lanckriet et al. (2002) is perhaps the only exception, where a regularization term is added to the covariance estimation rather than to the objective function.
We need to point out that there are several different definitions of robustness in the literature. In this paper, as well as in the aforementioned robust classification papers, robustness is mainly understood from a Robust Optimization perspective, where a min-max optimization is performed over all possible disturbances. An alternative interpretation of robustness stems from the rich literature on Robust Statistics (e.g., Huber, 1981; Hampel et al., 1986; Rousseeuw and Leroy, 1987; Maronna et al., 2006), which studies how an estimator or algorithm behaves under a small perturbation of the statistical model. For example, the Influence Function approach, proposed in Hampel (1974) and Hampel et al. (1986), measures the impact of an infinitesimal amount of contamination of the original distribution on the quantity of interest. Based on this notion of robustness, Christmann and Steinwart (2004) showed that many kernel classification algorithms, including SVM, are robust in the sense of having a finite Influence Function. A similar result for regression algorithms is shown in Christmann and Steinwart (2007) for smooth loss functions, and in Christmann and Van Messem (2008) for non-smooth loss functions, where a relaxed version of the Influence Function is applied. In the machine learning literature, another widely used notion closely related to robustness is stability, where an algorithm is required to be robust (in the sense that the output function does not change significantly) under a specific perturbation: deleting one sample from the training set. It is now well known that a stable algorithm such as the SVM has desirable generalization properties, and is statistically consistent under mild technical conditions; see for example Bousquet and Elisseeff (2002); Kutin and Niyogi (2002); Poggio et al. (2004); Mukherjee et al. (2006) for details. One main difference between Robust Optimization and these other robustness notions is that the former is constructive rather than analytical. That is, in contrast to robust statistics or the stability approach, which measure the robustness of a given algorithm, Robust Optimization can robustify an algorithm: it converts a given algorithm to a robust one. For example, as we show in this paper, the RO version of naive empirical-error minimization is the well-known SVM. As a constructive process, the RO approach also leads to additional flexibility in algorithm design, especially when the nature of the perturbation is known or can be well estimated.

Structure of the Paper:
This paper is organized as follows. In Section 2 we investigate the correlated disturbance case, and show the equivalence between the robust classification and the regularization process. We develop the connections to probabilistic formulations in Section 3, and prove a consistency result based on robustness analysis in Section 5. The kernelized version is investigated in Section 4. Some concluding remarks are given in Section 6.
Notation:
Capital letters are used to denote matrices, and boldface letters are used to denote column vectors. For a given norm $\|\cdot\|$, we use $\|\cdot\|^*$ to denote its dual norm, i.e., $\|\mathbf{z}\|^* \triangleq \sup\{\mathbf{z}^\top\mathbf{x}\,|\,\|\mathbf{x}\|\le 1\}$. For a vector $\mathbf{x}$ and a positive semi-definite matrix $C$ of the same dimension, $\|\mathbf{x}\|_C$ denotes $\sqrt{\mathbf{x}^\top C\mathbf{x}}$. We use $\boldsymbol{\delta}$ to denote a disturbance affecting the samples. We use superscript $r$ to denote the true value of an uncertain variable, so that $\boldsymbol{\delta}^r_i$ is the true (but unknown) noise of the $i$th sample. The set of non-negative scalars is denoted by $\mathbb{R}^+$. The set of integers from 1 to $n$ is denoted by $[1:n]$.
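As a short worked illustration of the dual-norm notation (added here for convenience; these are standard facts rather than results of this paper), the Euclidean norm is self-dual, the $\ell_1$ and $\ell_\infty$ norms are dual to each other, and the elliptic norm $\|\cdot\|_C$ with $C$ positive definite has dual norm $\|\cdot\|_{C^{-1}}$:

$$\|\mathbf{z}\|_2^* = \sup_{\|\mathbf{x}\|_2\le 1}\mathbf{z}^\top\mathbf{x} = \|\mathbf{z}\|_2, \qquad \|\mathbf{z}\|_1^* = \|\mathbf{z}\|_\infty, \qquad \|\mathbf{z}\|_\infty^* = \|\mathbf{z}\|_1, \qquad \|\mathbf{z}\|_C^* = \sqrt{\mathbf{z}^\top C^{-1}\mathbf{z}}.$$

The self-duality of the Euclidean norm is what makes the familiar $\ell_2$-regularized SVM appear as a special case of the robust formulation in Section 2.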
2. Robust Classification and Regularization
We consider the standard binary classification problem, where we are given a finite number of training samples $\{\mathbf{x}_i,y_i\}_{i=1}^m \subseteq \mathbb{R}^n\times\{-1,+1\}$, and must find a linear classifier, specified by the function $h^{\mathbf{w},b}(\mathbf{x}) = \mathrm{sgn}(\langle\mathbf{w},\mathbf{x}\rangle+b)$. For the standard regularized classifier, the parameters $(\mathbf{w},b)$ are obtained by solving the following convex optimization problem:

$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\xi}}:\ & r(\mathbf{w},b) + \sum_{i=1}^m \xi_i\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\\
& \xi_i \ge 0,
\end{aligned}$$

where $r(\mathbf{w},b)$ is a regularization term. This is equivalent to

$$\min_{\mathbf{w},b}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr]\Bigr\}.$$

Previous robust classification work (Shivaswamy et al., 2006; Bhattacharyya et al., 2004a,b; Bhattacharyya, 2004; Trafalis and Gilbert, 2007) considers the classification problem where the inputs are subject to (unknown) disturbances $\vec{\boldsymbol{\delta}} = (\boldsymbol{\delta}_1,\dots,\boldsymbol{\delta}_m)$ and essentially solves the following min-max problem:

$$\min_{\mathbf{w},b}\max_{\vec{\boldsymbol{\delta}}\in\mathcal{N}_{box}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\}, \tag{1}$$

for a box-type uncertainty set $\mathcal{N}_{box}$. That is, let $\mathcal{N}_i$ denote the projection of $\mathcal{N}_{box}$ onto the $\boldsymbol{\delta}_i$ component; then $\mathcal{N}_{box} = \mathcal{N}_1\times\cdots\times\mathcal{N}_m$. Effectively, this allows simultaneous worst-case disturbances across many samples, and leads to overly conservative solutions. The goal of this paper is to obtain a robust formulation where the disturbances $\{\boldsymbol{\delta}_i\}$ may be meaningfully taken to be correlated, i.e., to solve for a non-box-type $\mathcal{N}$:

$$\min_{\mathbf{w},b}\max_{\vec{\boldsymbol{\delta}}\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\}. \tag{2}$$

We briefly explain here the four reasons that motivate this "robust to perturbation" setup, and in particular the min-max form of (1) and (2). First, it can explicitly incorporate prior problem knowledge of local invariance (e.g., Teo et al., 2008). For example, in vision tasks, a desirable classifier should provide a consistent answer if an input image changes slightly. Second, there are situations where some adversarial opponents (e.g., spam senders) will manipulate the testing samples to avoid being correctly classified, and robustness toward such manipulation should be taken into consideration in the training process (e.g., Globerson and Roweis, 2006). Third, the training samples and the testing samples can be obtained from different processes, so that the standard i.i.d. assumption is violated (e.g., Bi and Zhang, 2004); for example, in real-time applications the newly generated samples are often less accurate due to time constraints. Finally, formulations based on chance-constraints (e.g., Bhattacharyya et al., 2004b; Shivaswamy et al., 2006) are mathematically equivalent to such a min-max formulation.

We define explicitly the correlated disturbance (or uncertainty) which we study below.
Definition 1  A set $\mathcal{N}_0 \subseteq \mathbb{R}^n$ is called an Atomic Uncertainty Set if

(I) $\mathbf{0}\in\mathcal{N}_0$;

(II) for any $\mathbf{w}\in\mathbb{R}^n$: $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}[\mathbf{w}^\top\boldsymbol{\delta}] = \sup_{\boldsymbol{\delta}'\in\mathcal{N}_0}[-\mathbf{w}^\top\boldsymbol{\delta}'] < +\infty$.

We use "sup" here because the maximal value is not necessarily attained, since $\mathcal{N}_0$ may not be a closed set. The second condition of the Atomic Uncertainty Set basically says that the uncertainty set is bounded and symmetric. In particular, all norm balls and ellipsoids centered at the origin are atomic uncertainty sets, while an arbitrary polytope might not be an atomic uncertainty set.
Definition 2  Let $\mathcal{N}_0$ be an atomic uncertainty set. A set $\mathcal{N}\subseteq\mathbb{R}^{n\times m}$ is called a Sublinear Aggregated Uncertainty Set of $\mathcal{N}_0$ if

$$\mathcal{N}^- \subseteq \mathcal{N} \subseteq \mathcal{N}^+,$$

where

$$\begin{aligned}
\mathcal{N}^- &\triangleq \bigcup_{t=1}^m \mathcal{N}^-_t; \qquad \mathcal{N}^-_t \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\boldsymbol{\delta}_t\in\mathcal{N}_0;\ \boldsymbol{\delta}_i = \mathbf{0},\ \forall i\neq t\bigr\};\\
\mathcal{N}^+ &\triangleq \Bigl\{(\alpha_1\boldsymbol{\delta}_1,\cdots,\alpha_m\boldsymbol{\delta}_m)\,\Big|\,\sum_{i=1}^m \alpha_i = 1;\ \alpha_i\ge 0,\ \boldsymbol{\delta}_i\in\mathcal{N}_0,\ i=1,\cdots,m\Bigr\}.
\end{aligned}$$

The Sublinear Aggregated Uncertainty definition models the case where the disturbances on each sample are treated identically, but their aggregate behavior across multiple samples is controlled. Some interesting examples include

(1) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\| \le c\bigr\}$;

(2) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\exists t\in[1:m];\ \|\boldsymbol{\delta}_t\|\le c;\ \boldsymbol{\delta}_i = \mathbf{0},\ \forall i\neq t\bigr\}$;

(3) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \sqrt{c\|\boldsymbol{\delta}_i\|} \le c\bigr\}$.

All these examples have the same atomic uncertainty set $\mathcal{N}_0 = \bigl\{\boldsymbol{\delta}\,\big|\,\|\boldsymbol{\delta}\|\le c\bigr\}$. Figure 1 provides an illustration of a sublinear aggregated uncertainty set for $n=1$ and $m=2$, i.e., the training set consists of two univariate samples.
Theorem 3  Assume $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are non-separable, $r(\cdot):\mathbb{R}^{n+1}\to\mathbb{R}$ is an arbitrary function, and $\mathcal{N}$ is a Sublinear Aggregated Uncertainty Set with corresponding atomic uncertainty set $\mathcal{N}_0$. Then the following min-max problem

$$\min_{\mathbf{w},b}\ \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\} \tag{3}$$

is equivalent to the following optimization problem on $\mathbf{w}$, $b$, $\boldsymbol{\xi}$:

$$\begin{aligned}
\min:\ & r(\mathbf{w},b) + \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\mathbf{w}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \xi_i,\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b), \quad i=1,\dots,m;\\
& \xi_i \ge 0, \quad i=1,\dots,m.
\end{aligned} \tag{4}$$

Furthermore, the minimization of Problem (4) is attainable when $r(\cdot,\cdot)$ is lower semi-continuous.
[Figure 1: Illustration of a Sublinear Aggregated Uncertainty Set $\mathcal{N}$. Panels: (a) $\mathcal{N}^-$; (b) $\mathcal{N}^+$; (c) $\mathcal{N}$; (d) box uncertainty.]

Proof
Define:

$$v(\mathbf{w},b) \triangleq \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\mathbf{w}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Recall that $\mathcal{N}^-\subseteq\mathcal{N}\subseteq\mathcal{N}^+$ by definition. Hence, fixing any $(\hat{\mathbf{w}},\hat{b})\in\mathbb{R}^{n+1}$, the following inequalities hold:

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]
\le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]
\le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr].$$

To prove the theorem, we first show that $v(\hat{\mathbf{w}},\hat{b})$ is no larger than the leftmost expression, and then show that $v(\hat{\mathbf{w}},\hat{b})$ is no smaller than the rightmost expression.

Step 1: We prove that

$$v(\hat{\mathbf{w}},\hat{b}) \le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]. \tag{5}$$

Since the samples $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are not separable, there exists $t\in[1:m]$ such that

$$y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}) < 0. \tag{6}$$

Hence,

$$\begin{aligned}
&\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-_t}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}\max\bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t-\boldsymbol{\delta}_t\rangle+\hat{b}),\,0\bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \max\Bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}) + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t),\,0\Bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \max\bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}),\,0\bigr] + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t)\\
=\ &\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] = v(\hat{\mathbf{w}},\hat{b}).
\end{aligned}$$

The third equality holds because of Inequality (6) and the fact that $\sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t)$ is non-negative (recall $\mathbf{0}\in\mathcal{N}_0$). Since $\mathcal{N}^-_t\subseteq\mathcal{N}^-$, Inequality (5) follows.

Step 2: Next we prove that

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr] \le v(\hat{\mathbf{w}},\hat{b}). \tag{7}$$

Notice that by the definition of $\mathcal{N}^+$ we have

$$\begin{aligned}
&\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0;\ \hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0}\sum_{i=1}^m \max\Bigl[\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\bigl(1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b})\bigr),\,0\Bigr]. 
\end{aligned} \tag{8}$$

Now, for any $i\in[1:m]$, the following holds:

$$\max\Bigl[\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\bigl(1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b})\bigr),\,0\Bigr]
= \max\Bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}) + \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i),\,0\Bigr]
\le \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i).$$

Therefore, Equation (8) is upper bounded by

$$\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0}\sum_{i=1}^m \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i)
= \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] = v(\hat{\mathbf{w}},\hat{b}),$$

hence Inequality (7) holds.

Step 3: Combining the two steps and adding $r(\mathbf{w},b)$ on both sides leads to: $\forall(\mathbf{w},b)\in\mathbb{R}^{n+1}$,

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr] + r(\mathbf{w},b) = v(\mathbf{w},b) + r(\mathbf{w},b).$$

Taking the infimum on both sides establishes the equivalence of Problem (3) and Problem (4). Observe that $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\mathbf{w}^\top\boldsymbol{\delta}$ is a supremum over a class of affine functions, and hence is lower semi-continuous. Therefore $v(\cdot,\cdot)$ is also lower semi-continuous. Thus the minimum can be achieved for Problem (4), and for Problem (3) by equivalence, when $r(\cdot)$ is lower semi-continuous.

This theorem reveals the main difference between Formulation (1) and our formulation in (2). Consider the Sublinear Aggregated Uncertainty Set $\mathcal{N} = \{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,|\,\sum_{i=1}^m\|\boldsymbol{\delta}_i\|\le c\}$. The smallest box-type uncertainty set containing $\mathcal{N}$ includes disturbances with norm sum up to $mc$. Therefore, it leads to a regularization coefficient as large as $mc$, which is linked to the number of training samples and is therefore overly conservative.

An immediate corollary is that a special case of our robust formulation is equivalent to the norm-regularized SVM setup:
Corollary 4  Let $\mathcal{T} \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\|^* \le c\bigr\}$. If the training samples $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(\mathbf{w},b)$ are equivalent:

$$\min:\ \max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{T}}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr], \tag{9}$$

$$\min:\ c\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr]. \tag{10}$$

Proof
Let $\mathcal{N}_0$ be the dual-norm ball $\{\boldsymbol{\delta}\,|\,\|\boldsymbol{\delta}\|^*\le c\}$ and $r(\mathbf{w},b)\equiv 0$. Then $\sup_{\|\boldsymbol{\delta}\|^*\le c}(\mathbf{w}^\top\boldsymbol{\delta}) = c\|\mathbf{w}\|$. The corollary then follows from Theorem 3. Notice that the equivalence indeed holds for any $\mathbf{w}$ and $b$.

This corollary explains the widely known fact that the regularized classifier tends to be more robust. Specifically, it explains the observation that when the disturbance is noise-like and neutral rather than adversarial, a norm-regularized classifier (without any robustness requirement) often performs better than a box-type robust classifier (see Trafalis and Gilbert, 2007). On the other hand, this observation also suggests that the appropriate way to regularize should come from a disturbance-robustness perspective. The above equivalence implies that standard regularization essentially assumes that the disturbance is spherical; if this is not true, robustness may yield a better regularization-like algorithm. To find a more effective regularization term, a closer investigation of the data variation is desirable, e.g., by examining the variation of the data and solving the corresponding robust classification problem.
For example, one way to regularize is by splitting the given training samples into two subsets with an equal number of elements, and treating one as a disturbed copy of the other. By analyzing the direction of the disturbance and the magnitude of the total variation, one can choose the proper norm to use, and a suitable tradeoff parameter.

2. The optimization equivalence for the linear case was observed independently by Bertsimas and Fertis (2008).
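Before moving to the probabilistic interpretations, here is a small numerical sketch of Corollary 4 (our own illustration, not part of the paper; the data, the choice of $(\mathbf{w},b)$, and the budget $c$ are arbitrary assumptions). It uses the Euclidean norm, which is self-dual: for a fixed non-separating classifier, spending the whole disturbance budget on one misclassified sample, in the direction given in the proof of Theorem 3, already attains the worst case, and the resulting value coincides with $c\|\mathbf{w}\|_2$ plus the nominal hinge loss from Problem (10).

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-separable toy data: two overlapping Gaussian blobs, labels in {-1, +1}.
m, n = 200, 2
X = np.vstack([rng.normal(+0.3, 1.0, (m // 2, n)),
               rng.normal(-0.3, 1.0, (m // 2, n))])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

w = rng.normal(size=n)   # an arbitrary classifier; non-separability gives it errors
b = 0.1
c = 0.5                  # total disturbance budget of the uncertainty set T

def total_hinge(X_pts):
    return np.maximum(1.0 - y * (X_pts @ w + b), 0.0).sum()

# Right-hand side of Corollary 4: c * ||w||_2 + nominal hinge loss (Problem (10)).
rhs = c * np.linalg.norm(w) + total_hinge(X)

# Worst-case disturbance from the proof of Theorem 3: put the whole budget on one
# misclassified sample t, replacing x_t by x_t - delta_t with delta_t = y_t * c * w / ||w||_2.
margins = y * (X @ w + b)
t = int(np.argmin(margins))
assert margins[t] < 0, "the chosen (w, b) should misclassify at least one sample"
X_adv = X.copy()
X_adv[t] = X[t] - y[t] * c * w / np.linalg.norm(w)
lhs = total_hinge(X_adv)

print(f"worst-case total hinge loss : {lhs:.6f}")
print(f"c*||w||_2 + nominal hinge   : {rhs:.6f}")   # the two numbers agree
```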
3. Probabilistic Interpretations
Although Problem (3) is formulated without any probabilistic assumptions, in this section we briefly explain two approaches to construct the uncertainty set, and equivalently to tune the regularization parameter $c$, based on probabilistic information.

The first approach is to use Problem (3) to approximate an upper bound for a chance-constrained classifier. Suppose the disturbance $(\boldsymbol{\delta}^r_1,\cdots,\boldsymbol{\delta}^r_m)$ follows a joint probability measure $\mu$. Then the chance-constrained classifier is given by the following minimization problem, given a confidence level $\eta\in[0,1]$:

$$\begin{aligned}
\min_{\mathbf{w},b,l}:\ & l\\
\mathrm{s.t.}:\ & \mu\Bigl\{\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}^r_i\rangle+b),\,0\bigr] \le l\Bigr\} \ge 1-\eta.
\end{aligned} \tag{11}$$

The formulations in Shivaswamy et al. (2006), Lanckriet et al. (2002) and Bhattacharyya et al. (2004a) assume uncorrelated noise and require all constraints to be satisfied with high probability simultaneously. They find a vector $[\xi_1,\cdots,\xi_m]^\top$ where each $\xi_i$ is the $\eta$-quantile of the hinge loss for sample $\mathbf{x}^r_i$. In contrast, our formulation above minimizes the $\eta$-quantile of the average (or equivalently the sum of the) empirical error. When controlling this average quantity is of more interest, the box-type noise formulation will be overly conservative.

Problem (11) is generally intractable. However, we can approximate it as follows. Let

$$c^* \triangleq \inf\Bigl\{\alpha\ \Big|\ \mu\Bigl(\sum_i \|\boldsymbol{\delta}_i\|^* \le \alpha\Bigr) \ge 1-\eta\Bigr\}.$$

Notice that $c^*$ is easily simulated given $\mu$. Then for any $(\mathbf{w},b)$, with probability no less than $1-\eta$, the following holds:

$$\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}^r_i\rangle+b),\,0\bigr] \le \max_{\sum_i\|\boldsymbol{\delta}_i\|^*\le c^*}\ \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr].$$

Thus (11) is upper bounded by (10) with $c = c^*$. This gives an additional probabilistic robustness property of the standard regularized classifier. Notice that following a similar approach but with the constraint-wise robust setup, i.e., the box uncertainty set, would lead to considerably more pessimistic approximations of the chance constraint.

The second approach considers a Bayesian setup. Suppose the total disturbance $c^r \triangleq \sum_{i=1}^m \|\boldsymbol{\delta}^r_i\|^*$ follows a prior distribution $\rho(\cdot)$. This can model, for example, the case where the training sample set is a mixture of several data sets for which the disturbance magnitude of each set is known. Such a setup leads to the following classifier, which minimizes the Bayesian (robust) error:

$$\min_{\mathbf{w},b}:\ \int \Bigl\{\max_{\sum_i\|\boldsymbol{\delta}_i\|^*\le c}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr]\Bigr\}\, d\rho(c). \tag{12}$$

By Corollary 4, the Bayesian classifier (12) is equivalent to

$$\min_{\mathbf{w},b}:\ \int \Bigl\{c\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr]\Bigr\}\, d\rho(c),$$

which can be further simplified as

$$\min_{\mathbf{w},b}:\ \bar{c}\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr],$$

where $\bar{c} \triangleq \int c\, d\rho(c)$. This provides us with a justifiable parameter-tuning method different from cross validation: simply use the expected value of $c^r$. We note that it is the equivalence of Corollary 4 that makes this possible, since it is difficult to imagine a setting where one would have a prior on regularization coefficients.
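As an illustration of how $c^*$ (and the Bayesian $\bar{c}$) might be obtained in practice, here is a minimal Monte Carlo sketch (our own example; the Gaussian disturbance model and all parameter values are assumptions, not taken from the paper): simulate the total dual-norm disturbance $\sum_i\|\boldsymbol{\delta}_i\|^*$ under $\mu$, take its $(1-\eta)$-quantile as the regularization coefficient $c^*$, or take its mean as $\bar{c}$.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 200, 2          # number of samples, input dimension
eta = 0.05             # confidence level for the chance constraint
n_sim = 10_000         # Monte Carlo draws from the disturbance measure mu

# Assumed disturbance model mu: i.i.d. Gaussian noise on each sample (an assumption
# made only for this sketch; any measure that can be sampled would do).
sigma = 0.1
total = np.empty(n_sim)
for s in range(n_sim):
    delta = sigma * rng.normal(size=(m, n))          # (delta_1, ..., delta_m)
    total[s] = np.linalg.norm(delta, axis=1).sum()   # sum_i ||delta_i||_2 (self-dual)

c_star = np.quantile(total, 1.0 - eta)   # c*: (1 - eta)-quantile of the total disturbance
c_bar = total.mean()                     # Bayesian choice: expected total disturbance

print(f"c*   (coefficient approximating the chance constraint): {c_star:.3f}")
print(f"c-bar (Bayesian tuning of the same coefficient)        : {c_bar:.3f}")
```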
4. Kernelization
The previous results can be easily generalized to the kernelized setting, which we discuss in detail in this section. In particular, similar to the linear classification case, we give a new interpretation of the standard kernelized SVM as the min-max empirical hinge-loss solution, where the disturbance is assumed to lie in the feature space. We then relate this to the (more intuitively appealing) setup where the disturbance lies in the sample space. We use this relationship in Section 5 to prove a consistency result for kernelized SVMs.

The kernelized SVM formulation considers a linear classifier in the feature space $\mathcal{H}$, a Hilbert space containing the range of some feature mapping $\Phi(\cdot)$. The standard formulation is as follows:

$$\begin{aligned}
\min_{\mathbf{w},b}:\ & r(\mathbf{w},b) + \sum_{i=1}^m \xi_i\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\\
& \xi_i \ge 0.
\end{aligned}$$

It has been proved in Schölkopf and Smola (2002) that if we take $f(\langle\mathbf{w},\mathbf{w}\rangle)$ for some increasing function $f(\cdot)$ as the regularization term $r(\mathbf{w},b)$, then the optimal solution has a representation $\mathbf{w}^* = \sum_{i=1}^m \alpha_i\Phi(\mathbf{x}_i)$, which can further be solved without knowing the feature mapping explicitly, but by evaluating a kernel function $k(\mathbf{x},\mathbf{x}') \triangleq \langle\Phi(\mathbf{x}),\Phi(\mathbf{x}')\rangle$ only. This is the well-known "kernel trick".

The definitions of the Atomic Uncertainty Set and the Sublinear Aggregated Uncertainty Set in the feature space are identical to Definitions 1 and 2, with $\mathbb{R}^n$ replaced by $\mathcal{H}$. The following theorem is the feature-space counterpart of Theorem 3. The proof follows from an argument similar to that of Theorem 3, i.e., for any fixed $(\mathbf{w},b)$ the worst-case empirical error equals the empirical error plus a penalty term $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\bigl(\langle\mathbf{w},\boldsymbol{\delta}\rangle\bigr)$; the details are hence omitted.
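As a side note before the feature-space results, the kernel trick just described is easy to state in code (a minimal sketch of our own; the Gaussian kernel, the data, and the coefficient vector are arbitrary assumptions): for $\mathbf{w}=\sum_i\alpha_i\Phi(\mathbf{x}_i)$, both the decision value $\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b$ and the RKHS norm $\|\mathbf{w}\|_\mathcal{H}=\sqrt{\boldsymbol{\alpha}^\top K\boldsymbol{\alpha}}$ are computed from kernel evaluations alone.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # training inputs
alpha = rng.normal(size=50)           # representer coefficients of w = sum_i alpha_i Phi(x_i)
b = 0.2

K = rbf_kernel(X, X)                  # Gram matrix K_ij = <Phi(x_i), Phi(x_j)>
w_norm_H = np.sqrt(alpha @ K @ alpha) # ||w||_H^2 = alpha^T K alpha

x_new = rng.normal(size=(1, 3))
decision = alpha @ rbf_kernel(X, x_new)[:, 0] + b   # <w, Phi(x_new)> + b, via kernels only

print(f"||w||_H = {w_norm_H:.4f},  decision value at x_new = {decision:.4f}")
```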
Theorem 5  Assume $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are not linearly separable, $r(\cdot):\mathcal{H}\times\mathbb{R}\to\mathbb{R}$ is an arbitrary function, and $\mathcal{N}\subseteq\mathcal{H}^m$ is a Sublinear Aggregated Uncertainty Set with corresponding atomic uncertainty set $\mathcal{N}_0\subseteq\mathcal{H}$. Then the following min-max problem

$$\min_{\mathbf{w},b}\ \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\} \tag{13}$$

is equivalent to

$$\begin{aligned}
\min:\ & r(\mathbf{w},b) + \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\langle\mathbf{w},\boldsymbol{\delta}\rangle) + \sum_{i=1}^m \xi_i,\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b\bigr), \quad i=1,\cdots,m;\\
& \xi_i \ge 0, \quad i=1,\cdots,m.
\end{aligned} \tag{14}$$

Furthermore, the minimization of Problem (14) is attainable when $r(\cdot,\cdot)$ is lower semi-continuous.

For some widely used feature mappings (e.g., the RKHS of a Gaussian kernel), $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are always separable. In this case, the worst-case empirical error may not equal the empirical error plus the penalty term $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\bigl(\langle\mathbf{w},\boldsymbol{\delta}\rangle\bigr)$. However, it is easy to show that for any $(\mathbf{w},b)$, the latter is an upper bound of the former.

The next corollary is the feature-space counterpart of Corollary 4, where $\|\cdot\|_\mathcal{H}$ stands for the RKHS norm, i.e., for $\mathbf{z}\in\mathcal{H}$, $\|\mathbf{z}\|_\mathcal{H} = \sqrt{\langle\mathbf{z},\mathbf{z}\rangle}$. Noticing that the RKHS norm is self-dual, the proof is identical to that of Corollary 4, and is hence omitted.
Corollary 6  Let $\mathcal{T}_\mathcal{H} \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\|_\mathcal{H} \le c\bigr\}$. If $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(\mathbf{w},b)$ are equivalent:

$$\min:\ \max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{T}_\mathcal{H}}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr], \tag{15}$$

$$\min:\ c\|\mathbf{w}\|_\mathcal{H} + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b\bigr),\,0\bigr]. \tag{16}$$

Equation (16) is a variant of the standard SVM, which has a squared RKHS norm regularization term, and it can be shown that the two formulations are equivalent up to a change of the tradeoff parameter $c$, since both the empirical hinge loss and the RKHS norm are convex. Therefore, Corollary 6 essentially means that the standard kernelized SVM is implicitly a robust classifier (without regularization) with disturbance in the feature space, where the sum of the magnitudes of the disturbances is bounded.

Disturbance in the feature space is less intuitive than disturbance in the sample space, and the next lemma relates these two different notions.
Lemma 7  Suppose there exist $\mathcal{X}\subseteq\mathbb{R}^n$, $\rho > 0$, and a continuous non-decreasing function $f:\mathbb{R}^+\to\mathbb{R}^+$ satisfying $f(0)=0$, such that

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x}-\mathbf{x}'\|), \qquad \forall\,\mathbf{x},\mathbf{x}'\in\mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|\le\rho.$$

Then

$$\|\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}) - \Phi(\hat{\mathbf{x}})\|_\mathcal{H} \le \sqrt{f(\|\boldsymbol{\delta}\|)}, \qquad \forall\,\|\boldsymbol{\delta}\|\le\rho,\ \hat{\mathbf{x}},\hat{\mathbf{x}}+\boldsymbol{\delta}\in\mathcal{X}.$$

In the appendix, we prove a result that provides a tighter relationship between disturbance in the feature space and disturbance in the sample space for RBF kernels.
Proof
Expanding the RKHS norm yields

$$\begin{aligned}
\|\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}) - \Phi(\hat{\mathbf{x}})\|_\mathcal{H}
&= \sqrt{\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})-\Phi(\hat{\mathbf{x}}),\,\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})-\Phi(\hat{\mathbf{x}})\rangle}\\
&= \sqrt{\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}),\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})\rangle + \langle\Phi(\hat{\mathbf{x}}),\Phi(\hat{\mathbf{x}})\rangle - 2\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}),\Phi(\hat{\mathbf{x}})\rangle}\\
&= \sqrt{k(\hat{\mathbf{x}}+\boldsymbol{\delta},\hat{\mathbf{x}}+\boldsymbol{\delta}) + k(\hat{\mathbf{x}},\hat{\mathbf{x}}) - 2k(\hat{\mathbf{x}}+\boldsymbol{\delta},\hat{\mathbf{x}})}\\
&\le \sqrt{f(\|\hat{\mathbf{x}}+\boldsymbol{\delta}-\hat{\mathbf{x}}\|)} = \sqrt{f(\|\boldsymbol{\delta}\|)},
\end{aligned}$$

where the inequality follows from the assumption.

Lemma 7 essentially says that under certain conditions, robustness in the feature space is a stronger requirement than robustness in the sample space. Therefore, a classifier that achieves robustness in the feature space (the SVM, for example) also achieves robustness in the sample space. Notice that the condition of Lemma 7 is rather weak. In particular, it holds for any continuous $k(\cdot,\cdot)$ and bounded $\mathcal{X}$.

In the next section we consider a more foundational property of robustness in the sample space: we show that a classifier that is robust in the sample space is asymptotically consistent. As a consequence of this result for linear classifiers, the above results imply the consistency of a broad class of kernelized SVMs.
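As a quick numerical check of Lemma 7 (our own sketch; the Gaussian kernel, its bandwidth, and the data are assumptions not taken from the paper), for the RBF kernel $k(\mathbf{x},\mathbf{x}')=\exp(-\|\mathbf{x}-\mathbf{x}'\|^2/2\sigma^2)$ one may take $f(t) = 2 - 2\exp(-t^2/2\sigma^2)$, which is continuous, non-decreasing and vanishes at 0; the feature-space distance can then be evaluated from the kernel alone and compared with $\sqrt{f(\|\boldsymbol{\delta}\|)}$ (for this kernel the two coincide).

```python
import numpy as np

sigma = 1.0
k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))   # Gaussian RBF kernel
f = lambda t: 2.0 - 2.0 * np.exp(-t ** 2 / (2.0 * sigma ** 2))        # a valid f for Lemma 7

rng = np.random.default_rng(3)
x = rng.normal(size=5)
delta = 0.3 * rng.normal(size=5)

# Feature-space distance computed purely from kernel evaluations:
# ||Phi(x + delta) - Phi(x)||_H^2 = k(x+d, x+d) + k(x, x) - 2 k(x+d, x).
feat_dist = np.sqrt(k(x + delta, x + delta) + k(x, x) - 2.0 * k(x + delta, x))
bound = np.sqrt(f(np.linalg.norm(delta)))

print(f"||Phi(x+delta) - Phi(x)||_H = {feat_dist:.6f}")
print(f"sqrt(f(||delta||))          = {bound:.6f}")   # equal for the RBF kernel
```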
5. Consistency of Regularization
In this section we explore a fundamental connection between learning and robustness, by using robustness properties to re-prove the statistical consistency of the linear classifier, and then of the kernelized SVM. Indeed, our proof mirrors the consistency proof found in Steinwart (2005), with the key difference that we replace the metric entropy, VC-dimension, and stability conditions used there with a robustness condition.

Thus far we have considered the setup where the training samples are corrupted by certain set-inclusive disturbances. We now turn to the standard statistical learning setup, by assuming that all training samples and testing samples are generated i.i.d. according to an (unknown) probability $\mathbb{P}$, i.e., there is no explicit disturbance.

Let $\mathcal{X}\subseteq\mathbb{R}^n$ be bounded, and suppose the training samples $(\mathbf{x}_i,y_i)_{i=1}^\infty$ are generated i.i.d. according to an unknown distribution $\mathbb{P}$ supported on $\mathcal{X}\times\{-1,+1\}$. The next theorem shows that our robust classifier setup, and equivalently the regularized SVM, asymptotically minimizes an upper bound of the expected classification error and hinge loss.
Theorem 8  Denote $K \triangleq \max_{\mathbf{x}\in\mathcal{X}}\|\mathbf{x}\|$. Then there exists a random sequence $\{\gamma_{m,c}\}$ such that:

1. $\forall c > 0$, $\lim_{m\to\infty}\gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w},b)$:

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\mathbf{1}_{\{y\neq\mathrm{sgn}(\langle\mathbf{w},\mathbf{x}\rangle+b)\}}\bigr) \le \gamma_{m,c} + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr];$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\max(1-y(\langle\mathbf{w},\mathbf{x}\rangle+b),\,0)\bigr) \le \gamma_{m,c}(1 + K\|\mathbf{w}\| + |b|) + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Proof
We briefly explain the basic idea of the proof before going into the technical details. We consider the testing sample set as a perturbed copy of the training sample set, and measure the magnitude of the perturbation. For testing samples that have "small" perturbations, $c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr]$ upper-bounds their total loss by Corollary 4. Therefore, to prove the theorem we only need to show that the ratio of testing samples having "large" perturbations diminishes.

Now we present the detailed proof. Given a $c > 0$, we call a testing sample $(\mathbf{x}',y')$ and a training sample $(\mathbf{x},y)$ a sample pair if $y = y'$ and $\|\mathbf{x}-\mathbf{x}'\|\le c$. We say a set of training samples and a set of testing samples form $l$ pairings if there exist $l$ sample pairs with no data reused. Given $m$ training samples and $m$ testing samples, we use $M_{m,c}$ to denote the largest number of pairings. To prove this theorem, we need to establish the following lemma.
Lemma 9  Given a $c > 0$, $M_{m,c}/m \to 1$ almost surely as $m\to+\infty$, uniformly w.r.t. $\mathbb{P}$.

Proof
We make a partition of $\mathcal{X}\times\{-1,+1\} = \bigcup_{t=1}^{T_c}\mathcal{X}_t$ such that $\mathcal{X}_t$ either has the form $[\alpha_1,\alpha_1+c/\sqrt{n})\times[\alpha_2,\alpha_2+c/\sqrt{n})\times\cdots\times[\alpha_n,\alpha_n+c/\sqrt{n})\times\{+1\}$ or $[\alpha_1,\alpha_1+c/\sqrt{n})\times[\alpha_2,\alpha_2+c/\sqrt{n})\times\cdots\times[\alpha_n,\alpha_n+c/\sqrt{n})\times\{-1\}$ (recall that $n$ is the dimension of $\mathcal{X}$). That is, each partition set is the Cartesian product of a rectangular cell in $\mathcal{X}$ and a singleton in $\{-1,+1\}$. Notice that if a training sample and a testing sample fall into the same $\mathcal{X}_t$, they can form a pairing.

Let $N^{tr}_t$ and $N^{te}_t$ be the number of training samples and testing samples falling in the $t$th set, respectively. Thus, $(N^{tr}_1,\cdots,N^{tr}_{T_c})$ and $(N^{te}_1,\cdots,N^{te}_{T_c})$ are multinomially distributed random vectors following the same distribution. Notice that for a multinomially distributed random vector $(N_1,\cdots,N_k)$ with parameters $m$ and $(p_1,\cdots,p_k)$, the following holds (Bretagnolle-Huber-Carol inequality; see for example Proposition A6.6 of van der Vaart and Wellner, 2000). For any $\lambda > 0$,

$$\mathbb{P}\Bigl(\sum_{i=1}^k \bigl|N_i - mp_i\bigr| \ge \sqrt{m}\,\lambda\Bigr) \le 2^k \exp(-\lambda^2/2).$$

Hence we have

$$\begin{aligned}
&\mathbb{P}\Bigl(\sum_{t=1}^{T_c}\bigl|N^{tr}_t - N^{te}_t\bigr| \ge \sqrt{m}\,\lambda\Bigr) \le 2^{T_c+1}\exp(-\lambda^2/2),\\
\Longrightarrow\quad &\mathbb{P}\Bigl(\frac{1}{m}\sum_{t=1}^{T_c}\bigl|N^{tr}_t - N^{te}_t\bigr| \ge \lambda\Bigr) \le 2^{T_c+1}\exp(-m\lambda^2/2),\\
\Longrightarrow\quad &\mathbb{P}\bigl(M_{m,c}/m \le 1-\lambda\bigr) \le 2^{T_c+1}\exp(-m\lambda^2/2).
\end{aligned} \tag{17}$$

Observe that $\sum_{m=1}^\infty 2^{T_c+1}\exp(-m\lambda^2/2) < +\infty$, hence by the Borel-Cantelli Lemma (see for example Durrett, 2004), with probability one the event $\{M_{m,c}/m \le 1-\lambda\}$ only occurs finitely often as $m\to\infty$. That is, $\liminf_m M_{m,c}/m \ge 1-\lambda$ almost surely. Since $\lambda$ can be arbitrarily close to zero, $M_{m,c}/m \to 1$ almost surely. The convergence is uniform in $\mathbb{P}$, since $T_c$ only depends on $\mathcal{X}$.

Now we proceed to prove the theorem. Given $m$ training samples and $m$ testing samples with $M_{m,c}$ sample pairs, we notice that for these paired samples, both the total testing error and the total testing hinge loss are upper bounded by

$$\max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}_0\times\cdots\times\mathcal{N}_0}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr]
\le cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr],$$

where $\mathcal{N}_0 = \{\boldsymbol{\delta}\,|\,\|\boldsymbol{\delta}\|\le c\}$. Hence the total classification error of the $m$ testing samples can be upper bounded by

$$(m - M_{m,c}) + cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr],$$

and since

$$\max_{\mathbf{x}\in\mathcal{X}}\bigl(1-y(\langle\mathbf{w},\mathbf{x}\rangle+b)\bigr) \le \max_{\mathbf{x}\in\mathcal{X}}\Bigl\{1 + |b| + \sqrt{\langle\mathbf{x},\mathbf{x}\rangle\cdot\langle\mathbf{w},\mathbf{w}\rangle}\Bigr\} = 1 + |b| + K\|\mathbf{w}\|,$$

the accumulated hinge loss of the total $m$ testing samples is upper bounded by

$$(m - M_{m,c})(1 + K\|\mathbf{w}\| + |b|) + cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Therefore, the average testing error is upper bounded by

$$1 - M_{m,c}/m + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr], \tag{18}$$

and the average hinge loss is upper bounded by

$$(1 - M_{m,c}/m)(1 + K\|\mathbf{w}\| + |b|) + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Let $\gamma_{m,c} = 1 - M_{m,c}/m$. The proof then follows since $M_{m,c}/m \to 1$ almost surely. Moreover, for any $c > 0$,

$$\mathbb{P}\bigl(\gamma_{m,c} \ge \lambda\bigr) \le \exp\bigl(-m\lambda^2/2 + (T_c+1)\log 2\bigr), \tag{19}$$

i.e., the convergence is uniform in $\mathbb{P}$.

We have shown that the average testing error is upper bounded. The final step is to show that this implies that in fact the random variable given by the conditional expectation (conditioned on the training samples) of the error is bounded almost surely, as in the statement of the theorem.
To make things precise, consider a fixed $m$, and let $\omega_1\in\Omega_1$ and $\omega_2\in\Omega_2$ generate the $m$ training samples and $m$ testing samples, respectively; for shorthand, let $T_m$ denote the random variable of the first $m$ training samples. Let us denote the probability measure for the training samples by $\rho_1$ and that for the testing samples by $\rho_2$. By independence, the joint measure is given by the product of these two. We rely on this property in what follows. Now fix a $\lambda$ and a $c > 0$. In our new notation, Inequality (19) now reads:

$$\int_{\Omega_1}\int_{\Omega_2}\mathbf{1}\bigl\{\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda\bigr\}\,d\rho_2(\omega_2)\,d\rho_1(\omega_1) = \mathbb{P}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda\bigr) \le \exp\bigl(-m\lambda^2/2 + (T_c+1)\log 2\bigr).$$

We now bound $\mathbb{P}_{\omega_1}\bigl(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2)\,|\,T_m] > \lambda\bigr)$, and then use Borel-Cantelli to show that this event can happen only finitely often. We have:

$$\begin{aligned}
&\mathbb{P}_{\omega_1}\bigl(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2)\,|\,T_m] > \lambda\bigr)\\
=\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,d\rho_2(\omega_2) > \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\le\lambda/2\bigr)\,d\rho_2(\omega_2) + \int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\int_{\Omega_2}(\lambda/2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\le\lambda/2\bigr)\,d\rho_2(\omega_2) + \int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\lambda/2 + \int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
=\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2) \ge \lambda/2\Bigr\}\,d\rho_1(\omega_1).
\end{aligned}$$

Here, the first equality holds because training and testing samples are independent, and hence the joint measure is the product of $\rho_1$ and $\rho_2$. The second inequality holds because $\gamma_{m,c}(\omega_1,\omega_2)\le 1$. Furthermore, by Markov's inequality,

$$\int_{\Omega_1}\int_{\Omega_2}\mathbf{1}\bigl\{\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr\}\,d\rho_2(\omega_2)\,d\rho_1(\omega_1) \ge \int_{\Omega_1}\frac{\lambda}{2}\,\mathbf{1}\Bigl\{\int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr)\,d\rho_2(\omega_2) \ge \lambda/2\Bigr\}\,d\rho_1(\omega_1).$$

Thus we have

$$\mathbb{P}\bigl(\mathbb{E}_{\omega_2}(\gamma_{m,c}(\omega_1,\omega_2)) > \lambda\bigr) \le \mathbb{P}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr)\big/(\lambda/2) \le 2\exp\bigl(-m\lambda^2/8 + (T_c+1)\log 2\bigr)\big/\lambda.$$

For any $\lambda$ and $c$, summing the right-hand side over $m = 1$ to $\infty$ is finite, hence the theorem follows from the Borel-Cantelli lemma.
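The pairing ratio $M_{m,c}/m$ of Lemma 9 is easy to simulate (a minimal sketch of our own; the uniform sampling distribution, the dimension, and the constants are assumptions). The sketch partitions a bounded sample space into cells of side $c/\sqrt{n}$, matches training and testing samples that share a cell and a label, and reports the resulting lower bound on $M_{m,c}/m$, which approaches 1 as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def pairing_ratio(m, c, n=2):
    """Lower bound on M_{m,c}/m via the cell partition used in Lemma 9."""
    side = c / np.sqrt(n)   # two points in one cell are at distance at most c
    # Training and testing samples drawn i.i.d. from the same (here: uniform) distribution.
    X_tr, X_te = rng.uniform(0, 1, (m, n)), rng.uniform(0, 1, (m, n))
    y_tr, y_te = rng.choice([-1, 1], m), rng.choice([-1, 1], m)

    def cell_counts(X, y):
        counts = {}
        for xi, yi in zip(X, y):
            key = (tuple(np.floor(xi / side).astype(int)), int(yi))
            counts[key] = counts.get(key, 0) + 1
        return counts

    tr, te = cell_counts(X_tr, y_tr), cell_counts(X_te, y_te)
    # Samples sharing a cell and a label form sample pairs.
    M = sum(min(tr[k], te.get(k, 0)) for k in tr)
    return M / m

for m in [100, 1000, 10000, 100000]:
    print(f"m = {m:6d}   M_m,c / m >= {pairing_ratio(m, c=0.2):.3f}")
```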
Remark 10  We notice that $M_{m,c}/m$ converges to 1 almost surely even when $\mathcal{X}$ is not bounded. Indeed, to see this, fix $\epsilon > 0$, and let $\mathcal{X}'\subseteq\mathcal{X}$ be a bounded set such that $\mathbb{P}(\mathcal{X}') > 1-\epsilon$. Denote by $M_m(\mathcal{X}')$ the number of pairings formed by the samples falling in $\mathcal{X}'$. Then, with probability one, $M_m(\mathcal{X}')/m \to \mathbb{P}(\mathcal{X}') > 1-\epsilon$, by the argument of Lemma 9 (the samples falling outside $\mathcal{X}'$ can be collected into one additional set of the partition). Since $M_m \ge M_m(\mathcal{X}')$, we have $\liminf_{m\to\infty} M_m/m \ge 1-\epsilon$ almost surely. Since $\epsilon$ is arbitrary, $M_m/m \to 1$ almost surely.

Turning to the kernelized setting, let $\mathcal{X}\subseteq\mathbb{R}^n$ be bounded, and suppose the training samples $(\mathbf{x}_i,y_i)_{i=1}^\infty$ are generated i.i.d. according to an unknown distribution $\mathbb{P}$ supported on $\mathcal{X}\times\{-1,+1\}$.
Theorem 11  Denote $K \triangleq \max_{\mathbf{x}\in\mathcal{X}}\sqrt{k(\mathbf{x},\mathbf{x})}$. Suppose there exist $\rho > 0$ and a continuous non-decreasing function $f:\mathbb{R}^+\to\mathbb{R}^+$ satisfying $f(0)=0$, such that:

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x}-\mathbf{x}'\|), \qquad \forall\,\mathbf{x},\mathbf{x}'\in\mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|\le\rho.$$

Then there exists a random sequence $\{\gamma_{m,c}\}$ such that:

1. $\forall c > 0$, $\lim_{m\to\infty}\gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w},b)\in\mathcal{H}\times\mathbb{R}$:

$$\mathbb{E}_\mathbb{P}\bigl(\mathbf{1}_{\{y\neq\mathrm{sgn}(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\}}\bigr) \le \gamma_{m,c} + c\|\mathbf{w}\|_\mathcal{H} + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr],$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\max(1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b),\,0)\bigr) \le \gamma_{m,c}(1 + K\|\mathbf{w}\|_\mathcal{H} + |b|) + c\|\mathbf{w}\|_\mathcal{H} + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr].$$

Proof
As in the proof of Theorem 8, we generate a set of $m$ testing samples and $m$ training samples, and then lower-bound the number of samples that can form a sample pair in the feature space; that is, a pair consisting of a training sample $(\mathbf{x},y)$ and a testing sample $(\mathbf{x}',y')$ such that $y=y'$ and $\|\Phi(\mathbf{x})-\Phi(\mathbf{x}')\|_\mathcal{H}\le c$. In contrast to the finite-dimensional sample space, the feature space may be infinite dimensional, and thus our decomposition may have an infinite number of "bricks." In this case, the multinomial random variable argument used in the proof of Lemma 9 breaks down. Nevertheless, we are able to lower bound the number of sample pairs in the feature space by the number of sample pairs in the sample space.

Define $f^{-1}(\alpha) \triangleq \max\{\beta\ge 0\,|\,f(\beta)\le\alpha\}$. Since $f(\cdot)$ is continuous, $f^{-1}(\alpha) > 0$ for any $\alpha > 0$. Now notice that, by Lemma 7, if a testing sample $\mathbf{x}$ and a training sample $\mathbf{x}'$ belong to a "brick" with sides of length $\min(\rho/\sqrt{n},\,f^{-1}(c^2)/\sqrt{n})$ in the sample space (see the proof of Lemma 9), then $\|\Phi(\mathbf{x})-\Phi(\mathbf{x}')\|_\mathcal{H}\le c$. Hence the number of sample pairs in the feature space is lower bounded by the number of pairs of samples that fall in the same brick in the sample space. We can cover $\mathcal{X}$ with finitely many (denoted $T_c$) such bricks since $f^{-1}(c^2) > 0$. Then, an argument similar to that of Lemma 9 shows that the ratio of samples that form pairs in a brick converges to 1 as $m$ increases. Further notice that for $M$ paired samples, the total testing error and hinge loss are both upper bounded by

$$cM\|\mathbf{w}\|_\mathcal{H} + \sum_{i=1}^M \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr].$$

The rest of the proof is identical to that of Theorem 8. In particular, Inequality (19) still holds.

Notice that the condition in Theorem 11 is satisfied by most widely used kernels, e.g., homogeneous polynomial kernels and the Gaussian RBF. This condition requires that the feature mapping is "smooth" and hence preserves "locality" of the disturbance, i.e., a small disturbance in the sample space guarantees that the corresponding disturbance in the feature space is also small. It is easy to construct non-smooth kernel functions which do not generalize well. For example, consider the following kernel:

$$k(\mathbf{x},\mathbf{x}') = \begin{cases} 1 & \mathbf{x} = \mathbf{x}';\\ 0 & \mathbf{x} \neq \mathbf{x}'.\end{cases}$$

A standard RKHS-regularized SVM using this kernel leads to a decision function

$$\mathrm{sign}\Bigl(\sum_{i=1}^m \alpha_i k(\mathbf{x},\mathbf{x}_i) + b\Bigr),$$

which equals $\mathrm{sign}(b)$ and provides no meaningful prediction if the testing sample $\mathbf{x}$ is not one of the training samples. Hence as $m$ increases, the testing error remains as large as 50% regardless of the tradeoff parameter used in the algorithm, while the training error can be made arbitrarily small by fine-tuning the parameter.

Convergence to Bayes Risk
Next we relate the results of Theorem 8 and Theorem 11 to the standard consistency notion, i.e., convergence to the Bayes Risk (Steinwart, 2005). The key point of interest in our proof is the use of a robustness condition in place of the VC-dimension or stability conditions used in Steinwart (2005). The proof in Steinwart (2005) has four main steps. They show: (i) there always exists a minimizer of the expected regularized (kernel) hinge loss; (ii) the expected regularized hinge loss of the minimizer converges to the expected hinge loss as the regularizer goes to zero; (iii) if a sequence of functions asymptotically has optimal expected hinge loss, then it also has optimal expected loss; and (iv) the expected hinge loss of the minimizer of the regularized training hinge loss concentrates around the empirical regularized hinge loss. In Steinwart (2005), this final step, (iv), is accomplished using concentration inequalities derived from VC-dimension considerations, and from stability considerations.

Instead, we use our robustness-based results, Theorem 8 and Theorem 11, to replace these approaches (Lemmas 3.21 and 3.22 in Steinwart, 2005) in proving step (iv), and thus to establish the main result.

Recall that a classifier is a rule that assigns to every training set $T = \{\mathbf{x}_i,y_i\}_{i=1}^m$ a measurable function $f_T$. The risk of a measurable function $f:\mathcal{X}\to\mathbb{R}$ is defined as

$$\mathcal{R}_\mathbb{P}(f) \triangleq \mathbb{P}\bigl(\{\mathbf{x},y:\ \mathrm{sign}\,f(\mathbf{x}) \neq y\}\bigr).$$

The smallest achievable risk, $\mathcal{R}_\mathbb{P} \triangleq \inf\{\mathcal{R}_\mathbb{P}(f)\,|\,f\ \text{measurable}\}$, is called the Bayes Risk of $\mathbb{P}$. A classifier is said to be strongly uniformly consistent if, for all distributions $\mathbb{P}$ on $\mathcal{X}\times[-1,+1]$, the following holds almost surely:

$$\lim_{m\to\infty}\mathcal{R}_\mathbb{P}(f_T) = \mathcal{R}_\mathbb{P}.$$

Without loss of generality, we only consider the kernel version. Recall a definition from Steinwart (2005).
Definition 12
Let $C(\mathcal{X})$ be the set of all continuous functions defined on $\mathcal{X}$. Consider the mapping $I:\mathcal{H}\to C(\mathcal{X})$ defined by $I\mathbf{w} \triangleq \langle\mathbf{w},\Phi(\cdot)\rangle$. If $I$ has a dense image, we call the kernel universal.

Roughly speaking, if a kernel is universal, it is rich enough to satisfy the condition of step (ii) above.
Theorem 13
If a kernel satisfies the condition of Theorem 11 and is universal, then the kernel SVM with $c\downarrow 0$ sufficiently slowly is strongly uniformly consistent.

Proof
We first introduce some notation, largely following Steinwart (2005). For a probability measure $\mu$ and $(\mathbf{w},b)\in\mathcal{H}\times\mathbb{R}$,

$$\mathcal{R}_{L,\mu}((\mathbf{w},b)) \triangleq \mathbb{E}_{(\mathbf{x},y)\sim\mu}\bigl\{\max\bigl(0,\,1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\bigr)\bigr\}$$

is the expected hinge loss under the probability $\mu$, and

$$\mathcal{R}^c_{L,\mu}((\mathbf{w},b)) \triangleq c\|\mathbf{w}\|_\mathcal{H} + \mathbb{E}_{(\mathbf{x},y)\sim\mu}\bigl\{\max\bigl(0,\,1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\bigr)\bigr\}$$

is the regularized expected hinge loss. Hence $\mathcal{R}_{L,\mathbb{P}}(\cdot)$ and $\mathcal{R}^c_{L,\mathbb{P}}(\cdot)$ are the expected hinge loss and regularized expected hinge loss under the generating probability $\mathbb{P}$. If $\mu$ is the empirical distribution of $m$ samples, we write $\mathcal{R}_{L,m}(\cdot)$ and $\mathcal{R}^c_{L,m}(\cdot)$ respectively. Notice $\mathcal{R}^c_{L,m}(\cdot)$ is the objective function of the SVM. Denote its solution by $f_{m,c}$, i.e., the classifier we get by running the SVM with $m$ samples and parameter $c$. Further denote by $f_{\mathbb{P},c}\in\mathcal{H}\times\mathbb{R}$ the minimizer of $\mathcal{R}^c_{L,\mathbb{P}}(\cdot)$. The existence of such a minimizer is proved in Lemma 3.1 of Steinwart (2005) (step (i)). Let

$$\mathcal{R}_{L,\mathbb{P}} \triangleq \min_{f\ \text{measurable}}\ \mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl\{\max\bigl(1-yf(\mathbf{x}),\,0\bigr)\bigr\},$$

i.e., the smallest achievable hinge loss over all measurable functions.

The main content of our proof is to use Theorems 8 and 11 to prove step (iv) in Steinwart (2005). In particular, we show: if $c\downarrow 0$ sufficiently slowly, then

$$\lim_{m\to\infty}\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) = \mathcal{R}_{L,\mathbb{P}}. \tag{20}$$

To prove Equation (20), denote by $\mathbf{w}(f)$ and $b(f)$ the weight part and offset part of any classifier $f$. Next, we bound the magnitude of $f_{m,c}$ by using $\mathcal{R}^c_{L,m}(f_{m,c}) \le \mathcal{R}^c_{L,m}(\mathbf{0},0) = 1$, which gives

$$0 \le \|\mathbf{w}(f_{m,c})\|_\mathcal{H} \le 1/c \qquad\text{and}\qquad |b(f_{m,c})| \le 1 + K\|\mathbf{w}(f_{m,c})\|_\mathcal{H} \le 1 + K/c.$$

From Theorem 11 (note that the bound holds uniformly for all $(\mathbf{w},b)$), we have

$$\begin{aligned}
\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) &\le \gamma_{m,c}\bigl[1 + K\|\mathbf{w}(f_{m,c})\|_\mathcal{H} + |b(f_{m,c})|\bigr] + \mathcal{R}^c_{L,m}(f_{m,c})\\
&\le \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \mathcal{R}^c_{L,m}(f_{m,c})\\
&\le \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \mathcal{R}^c_{L,m}(f_{\mathbb{P},c})\\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \bigl\{\mathcal{R}^c_{L,m}(f_{\mathbb{P},c}) - \mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c})\bigr\} + \bigl\{\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\bigr\}\\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \bigl\{\mathcal{R}_{L,m}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}(f_{\mathbb{P},c})\bigr\} + \bigl\{\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\bigr\}.
\end{aligned}$$

The last inequality holds because $f_{m,c}$ minimizes $\mathcal{R}^c_{L,m}$.

It is known (Steinwart, 2005, Proposition 3.2) (step (ii)) that if the kernel used is rich enough, i.e., universal, then $\lim_{c\to 0}\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) = \mathcal{R}_{L,\mathbb{P}}$. For fixed $c > 0$, we have

$$\lim_{m\to\infty}\mathcal{R}_{L,m}(f_{\mathbb{P},c}) = \mathcal{R}_{L,\mathbb{P}}(f_{\mathbb{P},c})$$

almost surely, by the strong law of large numbers (notice that $f_{\mathbb{P},c}$ is a fixed classifier), and $\gamma_{m,c}[3+2K/c]\to 0$ almost surely. Therefore, if $c\downarrow 0$ sufficiently slowly, we have almost surely

$$\lim_{m\to\infty}\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) \le \mathcal{R}_{L,\mathbb{P}}.$$
3. For example, we can take $\{c(m)\}$ to be the smallest numbers satisfying $c(m)\ge m^{-1/8}$ and $T_{c(m)}\le m^{1/8}/\log 2 - 1$. Inequality (19) then leads to $\sum_{m=1}^\infty \mathbb{P}\bigl(\gamma_{m,c(m)}/c(m)\ge m^{-1/8}\bigr) < +\infty$, which implies uniform convergence of $\gamma_{m,c(m)}/c(m)$ to zero.

Now, for any $m$ and $c$, we have $\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) \ge \mathcal{R}_{L,\mathbb{P}}$ by definition. This implies that Equation (20) holds almost surely, thus giving us step (iv).

Finally, Proposition 3.3 of Steinwart (2005) shows step (iii), namely, that approximating the hinge loss is sufficient to guarantee approximation of the Bayes loss. Thus Equation (20) implies that the risk of the function $f_{m,c}$ converges to the Bayes risk.
6. Concluding Remarks
This work considers the relationship between robust and regularized SVM classification. In particular, we prove that the standard norm-regularized SVM classifier is in fact the solution to a robust classification setup, and thus known results about regularized classifiers extend to robust classifiers. To the best of our knowledge, this is the first explicit such link between regularization and robustness in pattern classification. This link suggests that norm-based regularization essentially builds in a robustness to sample noise whose probability level sets are symmetric, and moreover have the structure of the unit ball with respect to the dual of the regularizing norm. It would be interesting to understand the performance gains possible when the noise does not have such characteristics, and the robust setup is used in place of regularization with an appropriately defined uncertainty set.

Based on the robustness interpretation of the regularization term, we re-proved the consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or stability. Our proof suggests that the ability to handle disturbance is crucial for an algorithm to achieve good generalization. In particular, for "smooth" feature mappings, robustness to disturbance in the observation space is guaranteed, and hence SVMs achieve consistency. On the other hand, certain "non-smooth" feature mappings fail to be consistent simply because for such kernels the robustness in the feature space (guaranteed by the regularization process) does not imply robustness in the observation space.
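For reference, the convex duality behind the last statement is one line (standard convex analysis, stated here in our notation rather than quoted from the text): if the disturbance $\boldsymbol{\delta}$ ranges over a ball of radius $c$ in a norm $\|\cdot\|_*$ dual to the regularizing norm $\|\cdot\|$, then
\[
\sup_{\|\boldsymbol{\delta}\|_* \le c} \langle \mathbf{w}, \boldsymbol{\delta}\rangle = c\,\|\mathbf{w}\|,
\]
so the worst-case effect of such a disturbance on a linear classifier is exactly a norm penalty of weight $c$ on $\mathbf{w}$.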
Acknowledgments
We thank the editor and three anonymous reviewers for significantly improving the accessibility of this manuscript. We also benefited from comments from participants in ITA 2008.
Appendix A.
In this appendix we show that, for RBF kernels, it is possible to relate robustness in the feature space and robustness in the sample space more directly.
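The computation behind the appendix is the following standard RKHS identity, spelled out here for convenience under the assumption of Theorem 14 below that $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x} - \mathbf{x}'\|)$:
\[
\|\Phi(\mathbf{x}) - \Phi(\mathbf{x}')\|_{\mathcal{H}}^2 = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{x}') + k(\mathbf{x}',\mathbf{x}') = 2f(0) - 2f(\|\mathbf{x} - \mathbf{x}'\|).
\]
In words, a disturbance of Euclidean size at most $c$ in the sample space moves the feature vector by at most $\sqrt{2f(0) - 2f(c)}$ in $\mathcal{H}$, which is the radius appearing in the theorem.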
Theorem 14
Suppose the kernel function has the form $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x} - \mathbf{x}'\|)$, with $f: \mathbb{R}^+ \to \mathbb{R}$ a decreasing function. Denote by $\mathcal{H}$ the RKHS of $k(\cdot,\cdot)$ and by $\Phi(\cdot)$ the corresponding feature mapping. Then for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{w} \in \mathcal{H}$ and $c > 0$,
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle = \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) + \delta_\phi\rangle.
\]

Proof
We show that the left-hand side is not larger than the right-hand side, and vice versa. Since the ball $\{\delta_\phi : \|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}\}$ is symmetric, we may equivalently write $\Phi(\mathbf{x}) - \delta_\phi$ in place of $\Phi(\mathbf{x}) + \delta_\phi$.

First we show
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \;\le\; \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{21}
\]
We notice that for any $\|\boldsymbol{\delta}\| \le c$ we have $\|\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\|_{\mathcal{H}}^2 = 2f(0) - 2f(\|\boldsymbol{\delta}\|) \le 2f(0) - 2f(c)$, because $f$ is decreasing, and hence
\[
\begin{aligned}
\langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle &= \big\langle \mathbf{w}, \Phi(\mathbf{x}) + \big(\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\big)\big\rangle \\
&= \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\rangle \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \cdot \|\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\|_{\mathcal{H}} \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \sqrt{2f(0) - 2f(c)} \\
&= \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle.
\end{aligned}
\]
Taking the supremum over $\boldsymbol{\delta}$ establishes Inequality (21).

Next, we show the opposite inequality,
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \;\ge\; \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{22}
\]
If $f(c) = f(0)$, then Inequality (22) holds trivially, hence we only consider the case $f(c) < f(0)$. Notice that the inner product is a continuous function on $\mathcal{H}$; hence for any $\epsilon > 0$, there exists a $\delta'_\phi$ such that
\[
\langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon; \qquad \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.
\]
Recall that the RKHS is the completion of the feature mapping; thus there exists a sequence $\{\mathbf{x}'_i\} \subset \mathbb{R}^n$ such that
\[
\Phi(\mathbf{x}'_i) \to \Phi(\mathbf{x}) - \delta'_\phi, \tag{23}
\]
which is equivalent to $\big(\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\big) \to -\delta'_\phi$. This leads to
\[
\lim_{i\to\infty} \sqrt{2f(0) - 2f(\|\mathbf{x}'_i - \mathbf{x}\|)} = \lim_{i\to\infty} \|\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\|_{\mathcal{H}} = \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.
\]
Since $f$ is decreasing, we conclude that $\|\mathbf{x}'_i - \mathbf{x}\| \le c$ holds for all but finitely many $i$. By (23) we have
\[
\langle \mathbf{w}, \Phi(\mathbf{x}'_i)\rangle \to \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon,
\]
which means
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \ge \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon.
\]
Since $\epsilon$ is arbitrary, we establish Inequality (22).

Combining Inequality (21) and Inequality (22) proves the theorem.

References
M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, November 2002.
P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1–13, August 1999.
K. Bennett and O. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1(1):23–34, 1992.
D. Bertsimas and A. Fertis. Personal correspondence, March 2008.
D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, January 2004.
C. Bhattacharyya. Robust classification of noisy data using second order cone programming approach. In Proceedings of the International Conference on Intelligent Sensing and Information Processing, pages 433–438, Chennai, India, 2004.
C. Bhattacharyya, L. Grate, M. Jordan, L. El Ghaoui, and I. Mian. Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073–1089, 2004a.
C. Bhattacharyya, K. Pannagadatta, and A. Smola. A second order cone programming formulation for classifying missing data. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS17), Cambridge, MA, 2004b. MIT Press.
J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS17), Cambridge, MA, 2004. MIT Press.
C. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995. doi: 10.1162/neco.1995.7.1.108.
B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, New York, NY, 1992.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
A. Christmann and I. Steinwart. On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5:1007–1034, 2004.
A. Christmann and I. Steinwart. Consistency and robustness of kernel based regression. Bernoulli, 13(3):799–819, 2007.
A. Christmann and A. Van Messem. Bouligand derivatives and robustness of support vector machines. Journal of Machine Learning Research, 9:915–936, 2008.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
R. Durrett. Probability: Theory and Examples. Duxbury Press, 2004.
L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035–1064, 1997.
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171–203, Cambridge, MA, 2000. MIT Press.
A. Globerson and S. Roweis. Nightmare at test time: Robust learning by feature deletion. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 353–360, New York, NY, USA, 2006. ACM Press.
F. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, New York, 1986.
P. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
M. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7–50, 1997.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In UAI-2002: Uncertainty in Artificial Intelligence, pages 275–282, 2002.
G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, December 2002.
R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. John Wiley & Sons, New York, 2006.
S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193, 2006.
T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.
A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637–649, 1998.
I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
C. H. Teo, A. Globerson, S. Roweis, and A. Smola. Convex learning with invariances. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1489–1496, Cambridge, MA, 2008. MIT Press.
T. Trafalis and R. Gilbert. Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1):187–198, February 2007.
A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 2000.
V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260–284, 1991.
V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:744–780, 1963.