Robustness and Regularization of Support Vector Machines
Huan Xu [email protected]
Department of Electrical and Computer Engineering, McGill University, Canada
Constantine Caramanis [email protected]
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Shie Mannor [email protected]
Department of Electrical and Computer Engineering, McGill University, Canada
Editor:
Alexander Smola
Abstract
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
Keywords:
Robustness, Regularization, Generalization, Kernel, Support Vector Machine
1. Introduction
Support Vector Machines (SVMs for short) originated in Boser et al. (1992) and can be traced back to as early as Vapnik and Lerner (1963) and Vapnik and Chervonenkis (1974). They continue to be one of the most successful algorithms for classification. SVMs address the classification problem by finding the hyperplane in the feature space that achieves maximum sample margin when the training samples are separable, which leads to minimizing the norm of the classifier. When the samples are not separable, a penalty term that approximates the total training error is considered (Bennett and Mangasarian, 1992; Cortes and Vapnik, 1995). It is well known that minimizing the training error itself can lead to poor classification performance for new unlabeled data; that is, such an approach may have poor generalization error because of, essentially, overfitting (Vapnik and Chervonenkis, 1991). A variety of modifications have been proposed to combat this problem, one of the most popular methods being that of minimizing a combination of the training error and a regularization term. The latter is typically chosen as a norm of the classifier. The resulting regularized classifier performs better on new data. This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation of the testing error from the training error is controlled (see Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005, and references therein).

In this paper we consider a different setup, assuming that the training data are generated by the true underlying distribution, but some non-i.i.d. (potentially adversarial) disturbance is then added to the samples we observe. We follow a robust optimization (see El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bertsimas and Sim, 2004, and references therein) approach, i.e., minimizing the worst possible empirical error under such disturbances. The use of robust optimization in classification is not new (e.g., Shivaswamy et al., 2006; Bhattacharyya et al., 2004b; Lanckriet et al., 2002). Robust classification models studied in the past have considered only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner by a correlated disturbance. This has made it difficult to obtain non-conservative generalization bounds. Moreover, there has not been an explicit connection to the regularized classifier, although at a high level it is known that regularization and robust optimization are related (e.g., El Ghaoui and Lebret, 1997; Anthony and Bartlett, 1999). The main contribution of this paper is solving the robust classification problem for a class of non-box-type uncertainty sets, and providing a link between robust classification and the standard regularization scheme of SVMs. In particular, our contributions include the following:

• We solve the robust SVM formulation for a class of non-box-type uncertainty sets. This permits finer control of the adversarial disturbance, restricting it to satisfy aggregate constraints across data points, therefore reducing the possibility of highly correlated disturbance.

• We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization.
This provides an alternative explanation for the success of regularization, and also suggests new physically motivated ways to construct regularization terms.

• We relate our robust formulation to several probabilistic formulations. We consider a chance-constrained classifier (i.e., a classifier with probabilistic constraints on misclassification) and show that our robust formulation can approximate it far less conservatively than previous robust formulations could possibly do. We also consider a Bayesian setup, and show that this can be used to provide a principled means of selecting the regularization coefficient without cross-validation.

• We show that the robustness perspective, stemming from a non-i.i.d. analysis, can be useful in the standard learning (i.i.d.) setup, by using it to prove consistency for standard SVM classification, without using VC-dimension or stability arguments. This result implies that generalization ability is a direct result of robustness to local disturbances; it therefore suggests a new justification for good performance, and consequently allows us to construct learning algorithms that generalize well by robustifying non-consistent algorithms.
We comment here on the explicit equivalence of robustness and regularization. We briefly explain how this observation differs from previous work and why it is interesting. Certain equivalence relationships between robustness and regularization have been established for problems other than classification (El Ghaoui and Lebret, 1997; Ben-Tal and Nemirovski, 1999; Bishop, 1995), but their results do not directly apply to the classification problem. Indeed, research on classifier regularization mainly discusses its effect on bounding the complexity of the function class (e.g., Smola et al., 1998; Evgeniou et al., 2000; Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002; Bartlett et al., 2005). Meanwhile, research on robust classification has not attempted to relate robustness and regularization (e.g., Lanckriet et al., 2002; Bhattacharyya et al., 2004a,b; Shivaswamy et al., 2006; Trafalis and Gilbert, 2007; Globerson and Roweis, 2006), in part due to the robustness formulations used in those papers. In fact, they all consider robustified versions of regularized classifications. Bhattacharyya (2004) considers a robust formulation for box-type uncertainty, and relates this robust formulation with the regularized SVM. However, this formulation involves a non-standard loss function that does not bound the 0-1 loss of corrupted samples, and is hence different from the known PAC bounds. Robustness to such disturbances is helpful when the training samples and the testing samples are drawn from different distributions, or when some adversary manipulates the samples to prevent them from being correctly labeled (e.g., spam senders change their patterns from time to time to avoid being labeled and filtered). Finally, this connection of robustification and regularization also provides us with new proof techniques (see Section 5).

1. Lanckriet et al. (2002) is perhaps the only exception, where a regularization term is added to the covariance estimation rather than to the objective function.
We need to point out that there are several different definitions of robustness in the literature. In this paper, as well as in the aforementioned robust classification papers, robustness is mainly understood from a Robust Optimization perspective, where a min-max optimization is performed over all possible disturbances. An alternative interpretation of robustness stems from the rich literature on Robust Statistics (e.g., Huber, 1981; Hampel et al., 1986; Rousseeuw and Leroy, 1987; Maronna et al., 2006), which studies how an estimator or algorithm behaves under a small perturbation of the statistical model. For example, the Influence Function approach, proposed in Hampel (1974) and Hampel et al. (1986), measures the impact of an infinitesimal amount of contamination of the original distribution on the quantity of interest. Based on this notion of robustness, Christmann and Steinwart (2004) showed that many kernel classification algorithms, including SVM, are robust in the sense of having a finite Influence Function. A similar result for regression algorithms is shown in Christmann and Steinwart (2007) for smooth loss functions, and in Christmann and Van Messem (2008) for non-smooth loss functions, where a relaxed version of the Influence Function is applied. In the machine learning literature, another widely used notion closely related to robustness is stability, where an algorithm is required to be robust (in the sense that the output function does not change significantly) under a specific perturbation: deleting one sample from the training set. It is now well known that a stable algorithm such as the SVM has desirable generalization properties, and is statistically consistent under mild technical conditions; see for example Bousquet and Elisseeff (2002); Kutin and Niyogi (2002); Poggio et al. (2004); Mukherjee et al. (2006) for details. One main difference between Robust Optimization and these other robustness notions is that the former is constructive rather than analytical. That is, in contrast to robust statistics or the stability approach, which measure the robustness of a given algorithm, Robust Optimization can robustify an algorithm: it converts a given algorithm to a robust one. For example, as we show in this paper, the RO version of naive empirical-error minimization is the well-known SVM. As a constructive process, the RO approach also leads to additional flexibility in algorithm design, especially when the nature of the perturbation is known or can be well estimated.

Structure of the Paper:
This paper is organized as follows. In Section 2 we investigate the correlated disturbance case, and show the equivalence between the robust classification and the regularization process. We develop the connections to probabilistic formulations in Section 3, and prove a consistency result based on robustness analysis in Section 5. The kernelized version is investigated in Section 4. Some concluding remarks are given in Section 6.
Notation:
Capital letters are used to denote matrices, and boldface letters are used to denote column vectors. For a given norm $\|\cdot\|$, we use $\|\cdot\|^*$ to denote its dual norm, i.e., $\|\mathbf{z}\|^* \triangleq \sup\{\mathbf{z}^\top\mathbf{x}\,|\,\|\mathbf{x}\|\le 1\}$. For a vector $\mathbf{x}$ and a positive semi-definite matrix $C$ of the same dimension, $\|\mathbf{x}\|_C$ denotes $\sqrt{\mathbf{x}^\top C\mathbf{x}}$. We use $\boldsymbol{\delta}$ to denote a disturbance affecting the samples. We use superscript $r$ to denote the true value of an uncertain variable, so that $\boldsymbol{\delta}^r_i$ is the true (but unknown) noise of the $i$th sample. The set of non-negative scalars is denoted by $\mathbb{R}^+$. The set of integers from 1 to $n$ is denoted by $[1:n]$.
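As a short worked illustration of the dual-norm notation (added here for convenience; these are standard facts rather than results of this paper), the Euclidean norm is self-dual, the $\ell_1$ and $\ell_\infty$ norms are dual to each other, and the elliptic norm $\|\cdot\|_C$ with $C$ positive definite has dual norm $\|\cdot\|_{C^{-1}}$:

$$\|\mathbf{z}\|_2^* = \sup_{\|\mathbf{x}\|_2\le 1}\mathbf{z}^\top\mathbf{x} = \|\mathbf{z}\|_2, \qquad \|\mathbf{z}\|_1^* = \|\mathbf{z}\|_\infty, \qquad \|\mathbf{z}\|_\infty^* = \|\mathbf{z}\|_1, \qquad \|\mathbf{z}\|_C^* = \sqrt{\mathbf{z}^\top C^{-1}\mathbf{z}}.$$

The self-duality of the Euclidean norm is what makes the familiar $\ell_2$-regularized SVM appear as a special case of the robust formulation in Section 2.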
2. Robust Classification and Regularization
We consider the standard binary classification problem, where we are given a finite number of training samples $\{\mathbf{x}_i,y_i\}_{i=1}^m \subseteq \mathbb{R}^n\times\{-1,+1\}$, and must find a linear classifier, specified by the function $h^{\mathbf{w},b}(\mathbf{x}) = \mathrm{sgn}(\langle\mathbf{w},\mathbf{x}\rangle+b)$. For the standard regularized classifier, the parameters $(\mathbf{w},b)$ are obtained by solving the following convex optimization problem:

$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\xi}}:\ & r(\mathbf{w},b) + \sum_{i=1}^m \xi_i\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\\
& \xi_i \ge 0,
\end{aligned}$$

where $r(\mathbf{w},b)$ is a regularization term. This is equivalent to

$$\min_{\mathbf{w},b}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr]\Bigr\}.$$

Previous robust classification work (Shivaswamy et al., 2006; Bhattacharyya et al., 2004a,b; Bhattacharyya, 2004; Trafalis and Gilbert, 2007) considers the classification problem where the inputs are subject to (unknown) disturbances $\vec{\boldsymbol{\delta}} = (\boldsymbol{\delta}_1,\dots,\boldsymbol{\delta}_m)$ and essentially solves the following min-max problem:

$$\min_{\mathbf{w},b}\max_{\vec{\boldsymbol{\delta}}\in\mathcal{N}_{box}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\}, \tag{1}$$

for a box-type uncertainty set $\mathcal{N}_{box}$. That is, let $\mathcal{N}_i$ denote the projection of $\mathcal{N}_{box}$ onto the $\boldsymbol{\delta}_i$ component; then $\mathcal{N}_{box} = \mathcal{N}_1\times\cdots\times\mathcal{N}_m$. Effectively, this allows simultaneous worst-case disturbances across many samples, and leads to overly conservative solutions. The goal of this paper is to obtain a robust formulation where the disturbances $\{\boldsymbol{\delta}_i\}$ may be meaningfully taken to be correlated, i.e., to solve for a non-box-type $\mathcal{N}$:

$$\min_{\mathbf{w},b}\max_{\vec{\boldsymbol{\delta}}\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\}. \tag{2}$$

We briefly explain here the four reasons that motivate this "robust to perturbation" setup, and in particular the min-max form of (1) and (2). First, it can explicitly incorporate prior problem knowledge of local invariance (e.g., Teo et al., 2008). For example, in vision tasks, a desirable classifier should provide a consistent answer if an input image changes slightly. Second, there are situations where some adversarial opponents (e.g., spam senders) will manipulate the testing samples to avoid being correctly classified, and robustness toward such manipulation should be taken into consideration in the training process (e.g., Globerson and Roweis, 2006). Third, the training samples and the testing samples can be obtained from different processes, so that the standard i.i.d. assumption is violated (e.g., Bi and Zhang, 2004); for example, in real-time applications the newly generated samples are often less accurate due to time constraints. Finally, formulations based on chance-constraints (e.g., Bhattacharyya et al., 2004b; Shivaswamy et al., 2006) are mathematically equivalent to such a min-max formulation.

We define explicitly the correlated disturbance (or uncertainty) which we study below.
Definition 1  A set $\mathcal{N}_0 \subseteq \mathbb{R}^n$ is called an Atomic Uncertainty Set if

(I) $\mathbf{0}\in\mathcal{N}_0$;

(II) for any $\mathbf{w}\in\mathbb{R}^n$: $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}[\mathbf{w}^\top\boldsymbol{\delta}] = \sup_{\boldsymbol{\delta}'\in\mathcal{N}_0}[-\mathbf{w}^\top\boldsymbol{\delta}'] < +\infty$.

We use "sup" here because the maximal value is not necessarily attained, since $\mathcal{N}_0$ may not be a closed set. The second condition of the Atomic Uncertainty Set basically says that the uncertainty set is bounded and symmetric. In particular, all norm balls and ellipsoids centered at the origin are atomic uncertainty sets, while an arbitrary polytope might not be an atomic uncertainty set.
Definition 2  Let $\mathcal{N}_0$ be an atomic uncertainty set. A set $\mathcal{N}\subseteq\mathbb{R}^{n\times m}$ is called a Sublinear Aggregated Uncertainty Set of $\mathcal{N}_0$ if

$$\mathcal{N}^- \subseteq \mathcal{N} \subseteq \mathcal{N}^+,$$

where

$$\begin{aligned}
\mathcal{N}^- &\triangleq \bigcup_{t=1}^m \mathcal{N}^-_t; \qquad \mathcal{N}^-_t \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\boldsymbol{\delta}_t\in\mathcal{N}_0;\ \boldsymbol{\delta}_i = \mathbf{0},\ \forall i\neq t\bigr\};\\
\mathcal{N}^+ &\triangleq \Bigl\{(\alpha_1\boldsymbol{\delta}_1,\cdots,\alpha_m\boldsymbol{\delta}_m)\,\Big|\,\sum_{i=1}^m \alpha_i = 1;\ \alpha_i\ge 0,\ \boldsymbol{\delta}_i\in\mathcal{N}_0,\ i=1,\cdots,m\Bigr\}.
\end{aligned}$$

The Sublinear Aggregated Uncertainty definition models the case where the disturbances on each sample are treated identically, but their aggregate behavior across multiple samples is controlled. Some interesting examples include

(1) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\| \le c\bigr\}$;

(2) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\exists t\in[1:m];\ \|\boldsymbol{\delta}_t\|\le c;\ \boldsymbol{\delta}_i = \mathbf{0},\ \forall i\neq t\bigr\}$;

(3) $\bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \sqrt{c\|\boldsymbol{\delta}_i\|} \le c\bigr\}$.

All these examples have the same atomic uncertainty set $\mathcal{N}_0 = \bigl\{\boldsymbol{\delta}\,\big|\,\|\boldsymbol{\delta}\|\le c\bigr\}$. Figure 1 provides an illustration of a sublinear aggregated uncertainty set for $n=1$ and $m=2$, i.e., the training set consists of two univariate samples.
Theorem 3  Assume $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are non-separable, $r(\cdot):\mathbb{R}^{n+1}\to\mathbb{R}$ is an arbitrary function, and $\mathcal{N}$ is a Sublinear Aggregated Uncertainty Set with corresponding atomic uncertainty set $\mathcal{N}_0$. Then the following min-max problem

$$\min_{\mathbf{w},b}\ \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\} \tag{3}$$

is equivalent to the following optimization problem on $\mathbf{w}$, $b$, $\boldsymbol{\xi}$:

$$\begin{aligned}
\min:\ & r(\mathbf{w},b) + \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\mathbf{w}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \xi_i,\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b), \quad i=1,\dots,m;\\
& \xi_i \ge 0, \quad i=1,\dots,m.
\end{aligned} \tag{4}$$

Furthermore, the minimization of Problem (4) is attainable when $r(\cdot,\cdot)$ is lower semi-continuous.
[Figure 1: Illustration of a Sublinear Aggregated Uncertainty Set $\mathcal{N}$. Panels: (a) $\mathcal{N}^-$; (b) $\mathcal{N}^+$; (c) $\mathcal{N}$; (d) box uncertainty.]

Proof
Define:

$$v(\mathbf{w},b) \triangleq \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\mathbf{w}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Recall that $\mathcal{N}^-\subseteq\mathcal{N}\subseteq\mathcal{N}^+$ by definition. Hence, fixing any $(\hat{\mathbf{w}},\hat{b})\in\mathbb{R}^{n+1}$, the following inequalities hold:

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]
\le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]
\le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr].$$

To prove the theorem, we first show that $v(\hat{\mathbf{w}},\hat{b})$ is no larger than the leftmost expression, and then show that $v(\hat{\mathbf{w}},\hat{b})$ is no smaller than the rightmost expression.

Step 1: We prove that

$$v(\hat{\mathbf{w}},\hat{b}) \le \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]. \tag{5}$$

Since the samples $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are not separable, there exists $t\in[1:m]$ such that

$$y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}) < 0. \tag{6}$$

Hence,

$$\begin{aligned}
&\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^-_t}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}\max\bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t-\boldsymbol{\delta}_t\rangle+\hat{b}),\,0\bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \max\Bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}) + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t),\,0\Bigr]\\
=\ &\sum_{i\neq t}\max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \max\bigl[1-y_t(\langle\hat{\mathbf{w}},\mathbf{x}_t\rangle+\hat{b}),\,0\bigr] + \sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t)\\
=\ &\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] = v(\hat{\mathbf{w}},\hat{b}).
\end{aligned}$$

The third equality holds because of Inequality (6) and the fact that $\sup_{\boldsymbol{\delta}_t\in\mathcal{N}_0}(y_t\hat{\mathbf{w}}^\top\boldsymbol{\delta}_t)$ is non-negative (recall $\mathbf{0}\in\mathcal{N}_0$). Since $\mathcal{N}^-_t\subseteq\mathcal{N}^-$, Inequality (5) follows.

Step 2: Next we prove that

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr] \le v(\hat{\mathbf{w}},\hat{b}). \tag{7}$$

Notice that by the definition of $\mathcal{N}^+$ we have

$$\begin{aligned}
&\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}^+}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0;\ \hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b}),\,0\bigr]\\
=\ &\sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0}\sum_{i=1}^m \max\Bigl[\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\bigl(1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b})\bigr),\,0\Bigr]. 
\end{aligned} \tag{8}$$

Now, for any $i\in[1:m]$, the following holds:

$$\max\Bigl[\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}\bigl(1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i-\alpha_i\hat{\boldsymbol{\delta}}_i\rangle+\hat{b})\bigr),\,0\Bigr]
= \max\Bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}) + \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i),\,0\Bigr]
\le \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i).$$

Therefore, Equation (8) is upper bounded by

$$\sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] + \sup_{\sum_{i=1}^m\alpha_i=1;\ \alpha_i\ge 0}\sum_{i=1}^m \alpha_i\sup_{\hat{\boldsymbol{\delta}}_i\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\hat{\boldsymbol{\delta}}_i)
= \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\hat{\mathbf{w}}^\top\boldsymbol{\delta}) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\hat{\mathbf{w}},\mathbf{x}_i\rangle+\hat{b}),\,0\bigr] = v(\hat{\mathbf{w}},\hat{b}),$$

hence Inequality (7) holds.

Step 3: Combining the two steps and adding $r(\mathbf{w},b)$ on both sides leads to: $\forall(\mathbf{w},b)\in\mathbb{R}^{n+1}$,

$$\sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr] + r(\mathbf{w},b) = v(\mathbf{w},b) + r(\mathbf{w},b).$$

Taking the infimum on both sides establishes the equivalence of Problem (3) and Problem (4). Observe that $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\mathbf{w}^\top\boldsymbol{\delta}$ is a supremum over a class of affine functions, and hence is lower semi-continuous. Therefore $v(\cdot,\cdot)$ is also lower semi-continuous. Thus the minimum can be achieved for Problem (4), and for Problem (3) by equivalence, when $r(\cdot)$ is lower semi-continuous.

This theorem reveals the main difference between Formulation (1) and our formulation in (2). Consider the Sublinear Aggregated Uncertainty Set $\mathcal{N} = \{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,|\,\sum_{i=1}^m\|\boldsymbol{\delta}_i\|\le c\}$. The smallest box-type uncertainty set containing $\mathcal{N}$ includes disturbances with norm sum up to $mc$. Therefore, it leads to a regularization coefficient as large as $mc$, which is linked to the number of training samples and is therefore overly conservative.

An immediate corollary is that a special case of our robust formulation is equivalent to the norm-regularized SVM setup:
Corollary 4  Let $\mathcal{T} \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\|^* \le c\bigr\}$. If the training samples $\{\mathbf{x}_i,y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(\mathbf{w},b)$ are equivalent:

$$\min:\ \max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{T}}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr], \tag{9}$$

$$\min:\ c\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr]. \tag{10}$$

Proof
Let $\mathcal{N}_0$ be the dual-norm ball $\{\boldsymbol{\delta}\,|\,\|\boldsymbol{\delta}\|^*\le c\}$ and $r(\mathbf{w},b)\equiv 0$. Then $\sup_{\|\boldsymbol{\delta}\|^*\le c}(\mathbf{w}^\top\boldsymbol{\delta}) = c\|\mathbf{w}\|$. The corollary then follows from Theorem 3. Notice that the equivalence indeed holds for any $\mathbf{w}$ and $b$.

This corollary explains the widely known fact that the regularized classifier tends to be more robust. Specifically, it explains the observation that when the disturbance is noise-like and neutral rather than adversarial, a norm-regularized classifier (without any robustness requirement) often performs better than a box-type robust classifier (see Trafalis and Gilbert, 2007). On the other hand, this observation also suggests that the appropriate way to regularize should come from a disturbance-robustness perspective. The above equivalence implies that standard regularization essentially assumes that the disturbance is spherical; if this is not true, robustness may yield a better regularization-like algorithm. To find a more effective regularization term, a closer investigation of the data variation is desirable, e.g., by examining the variation of the data and solving the corresponding robust classification problem.
For example, one way to regularize is by splitting the given training samples into two subsets with an equal number of elements, and treating one as a disturbed copy of the other. By analyzing the direction of the disturbance and the magnitude of the total variation, one can choose the proper norm to use, and a suitable tradeoff parameter.

2. The optimization equivalence for the linear case was observed independently by Bertsimas and Fertis (2008).
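Before moving to the probabilistic interpretations, here is a small numerical sketch of Corollary 4 (our own illustration, not part of the paper; the data, the choice of $(\mathbf{w},b)$, and the budget $c$ are arbitrary assumptions). It uses the Euclidean norm, which is self-dual: for a fixed non-separating classifier, spending the whole disturbance budget on one misclassified sample, in the direction given in the proof of Theorem 3, already attains the worst case, and the resulting value coincides with $c\|\mathbf{w}\|_2$ plus the nominal hinge loss from Problem (10).

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-separable toy data: two overlapping Gaussian blobs, labels in {-1, +1}.
m, n = 200, 2
X = np.vstack([rng.normal(+0.3, 1.0, (m // 2, n)),
               rng.normal(-0.3, 1.0, (m // 2, n))])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

w = rng.normal(size=n)   # an arbitrary classifier; non-separability gives it errors
b = 0.1
c = 0.5                  # total disturbance budget of the uncertainty set T

def total_hinge(X_pts):
    return np.maximum(1.0 - y * (X_pts @ w + b), 0.0).sum()

# Right-hand side of Corollary 4: c * ||w||_2 + nominal hinge loss (Problem (10)).
rhs = c * np.linalg.norm(w) + total_hinge(X)

# Worst-case disturbance from the proof of Theorem 3: put the whole budget on one
# misclassified sample t, replacing x_t by x_t - delta_t with delta_t = y_t * c * w / ||w||_2.
margins = y * (X @ w + b)
t = int(np.argmin(margins))
assert margins[t] < 0, "the chosen (w, b) should misclassify at least one sample"
X_adv = X.copy()
X_adv[t] = X[t] - y[t] * c * w / np.linalg.norm(w)
lhs = total_hinge(X_adv)

print(f"worst-case total hinge loss : {lhs:.6f}")
print(f"c*||w||_2 + nominal hinge   : {rhs:.6f}")   # the two numbers agree
```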
3. Probabilistic Interpretations
Although Problem (3) is formulated without any probabilistic assumptions, in this section we briefly explain two approaches to construct the uncertainty set, and equivalently to tune the regularization parameter $c$, based on probabilistic information.

The first approach is to use Problem (3) to approximate an upper bound for a chance-constrained classifier. Suppose the disturbance $(\boldsymbol{\delta}^r_1,\cdots,\boldsymbol{\delta}^r_m)$ follows a joint probability measure $\mu$. Then the chance-constrained classifier is given by the following minimization problem, given a confidence level $\eta\in[0,1]$:

$$\begin{aligned}
\min_{\mathbf{w},b,l}:\ & l\\
\mathrm{s.t.}:\ & \mu\Bigl\{\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}^r_i\rangle+b),\,0\bigr] \le l\Bigr\} \ge 1-\eta.
\end{aligned} \tag{11}$$

The formulations in Shivaswamy et al. (2006), Lanckriet et al. (2002) and Bhattacharyya et al. (2004a) assume uncorrelated noise and require all constraints to be satisfied with high probability simultaneously. They find a vector $[\xi_1,\cdots,\xi_m]^\top$ where each $\xi_i$ is the $\eta$-quantile of the hinge loss for sample $\mathbf{x}^r_i$. In contrast, our formulation above minimizes the $\eta$-quantile of the average (or equivalently the sum of the) empirical error. When controlling this average quantity is of more interest, the box-type noise formulation will be overly conservative.

Problem (11) is generally intractable. However, we can approximate it as follows. Let

$$c^* \triangleq \inf\Bigl\{\alpha\ \Big|\ \mu\Bigl(\sum_i \|\boldsymbol{\delta}_i\|^* \le \alpha\Bigr) \ge 1-\eta\Bigr\}.$$

Notice that $c^*$ is easily simulated given $\mu$. Then for any $(\mathbf{w},b)$, with probability no less than $1-\eta$, the following holds:

$$\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}^r_i\rangle+b),\,0\bigr] \le \max_{\sum_i\|\boldsymbol{\delta}_i\|^*\le c^*}\ \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b),\,0\bigr].$$

Thus (11) is upper bounded by (10) with $c = c^*$. This gives an additional probabilistic robustness property of the standard regularized classifier. Notice that following a similar approach but with the constraint-wise robust setup, i.e., the box uncertainty set, would lead to considerably more pessimistic approximations of the chance constraint.

The second approach considers a Bayesian setup. Suppose the total disturbance $c^r \triangleq \sum_{i=1}^m \|\boldsymbol{\delta}^r_i\|^*$ follows a prior distribution $\rho(\cdot)$. This can model, for example, the case where the training sample set is a mixture of several data sets for which the disturbance magnitude of each set is known. Such a setup leads to the following classifier, which minimizes the Bayesian (robust) error:

$$\min_{\mathbf{w},b}:\ \int \Bigl\{\max_{\sum_i\|\boldsymbol{\delta}_i\|^*\le c}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr]\Bigr\}\, d\rho(c). \tag{12}$$

By Corollary 4, the Bayesian classifier (12) is equivalent to

$$\min_{\mathbf{w},b}:\ \int \Bigl\{c\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr]\Bigr\}\, d\rho(c),$$

which can be further simplified as

$$\min_{\mathbf{w},b}:\ \bar{c}\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i\rangle+b\bigr),\,0\bigr],$$

where $\bar{c} \triangleq \int c\, d\rho(c)$. This provides us with a justifiable parameter-tuning method different from cross validation: simply use the expected value of $c^r$. We note that it is the equivalence of Corollary 4 that makes this possible, since it is difficult to imagine a setting where one would have a prior on regularization coefficients.
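As an illustration of how $c^*$ (and the Bayesian $\bar{c}$) might be obtained in practice, here is a minimal Monte Carlo sketch (our own example; the Gaussian disturbance model and all parameter values are assumptions, not taken from the paper): simulate the total dual-norm disturbance $\sum_i\|\boldsymbol{\delta}_i\|^*$ under $\mu$, take its $(1-\eta)$-quantile as the regularization coefficient $c^*$, or take its mean as $\bar{c}$.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 200, 2          # number of samples, input dimension
eta = 0.05             # confidence level for the chance constraint
n_sim = 10_000         # Monte Carlo draws from the disturbance measure mu

# Assumed disturbance model mu: i.i.d. Gaussian noise on each sample (an assumption
# made only for this sketch; any measure that can be sampled would do).
sigma = 0.1
total = np.empty(n_sim)
for s in range(n_sim):
    delta = sigma * rng.normal(size=(m, n))          # (delta_1, ..., delta_m)
    total[s] = np.linalg.norm(delta, axis=1).sum()   # sum_i ||delta_i||_2 (self-dual)

c_star = np.quantile(total, 1.0 - eta)   # c*: (1 - eta)-quantile of the total disturbance
c_bar = total.mean()                     # Bayesian choice: expected total disturbance

print(f"c*   (coefficient approximating the chance constraint): {c_star:.3f}")
print(f"c-bar (Bayesian tuning of the same coefficient)        : {c_bar:.3f}")
```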
4. Kernelization
The previous results can be easily generalized to the kernelized setting, which we discuss in detail in this section. In particular, similar to the linear classification case, we give a new interpretation of the standard kernelized SVM as the min-max empirical hinge-loss solution, where the disturbance is assumed to lie in the feature space. We then relate this to the (more intuitively appealing) setup where the disturbance lies in the sample space. We use this relationship in Section 5 to prove a consistency result for kernelized SVMs.

The kernelized SVM formulation considers a linear classifier in the feature space $\mathcal{H}$, a Hilbert space containing the range of some feature mapping $\Phi(\cdot)$. The standard formulation is as follows:

$$\begin{aligned}
\min_{\mathbf{w},b}:\ & r(\mathbf{w},b) + \sum_{i=1}^m \xi_i\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\\
& \xi_i \ge 0.
\end{aligned}$$

It has been proved in Schölkopf and Smola (2002) that if we take $f(\langle\mathbf{w},\mathbf{w}\rangle)$ for some increasing function $f(\cdot)$ as the regularization term $r(\mathbf{w},b)$, then the optimal solution has a representation $\mathbf{w}^* = \sum_{i=1}^m \alpha_i\Phi(\mathbf{x}_i)$, which can further be solved without knowing the feature mapping explicitly, but by evaluating a kernel function $k(\mathbf{x},\mathbf{x}') \triangleq \langle\Phi(\mathbf{x}),\Phi(\mathbf{x}')\rangle$ only. This is the well-known "kernel trick".

The definitions of the Atomic Uncertainty Set and the Sublinear Aggregated Uncertainty Set in the feature space are identical to Definitions 1 and 2, with $\mathbb{R}^n$ replaced by $\mathcal{H}$. The following theorem is the feature-space counterpart of Theorem 3. The proof follows from an argument similar to that of Theorem 3, i.e., for any fixed $(\mathbf{w},b)$ the worst-case empirical error equals the empirical error plus a penalty term $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\bigl(\langle\mathbf{w},\boldsymbol{\delta}\rangle\bigr)$; the details are hence omitted.
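As a side note before the feature-space results, the kernel trick just described is easy to state in code (a minimal sketch of our own; the Gaussian kernel, the data, and the coefficient vector are arbitrary assumptions): for $\mathbf{w}=\sum_i\alpha_i\Phi(\mathbf{x}_i)$, both the decision value $\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b$ and the RKHS norm $\|\mathbf{w}\|_\mathcal{H}=\sqrt{\boldsymbol{\alpha}^\top K\boldsymbol{\alpha}}$ are computed from kernel evaluations alone.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # training inputs
alpha = rng.normal(size=50)           # representer coefficients of w = sum_i alpha_i Phi(x_i)
b = 0.2

K = rbf_kernel(X, X)                  # Gram matrix K_ij = <Phi(x_i), Phi(x_j)>
w_norm_H = np.sqrt(alpha @ K @ alpha) # ||w||_H^2 = alpha^T K alpha

x_new = rng.normal(size=(1, 3))
decision = alpha @ rbf_kernel(X, x_new)[:, 0] + b   # <w, Phi(x_new)> + b, via kernels only

print(f"||w||_H = {w_norm_H:.4f},  decision value at x_new = {decision:.4f}")
```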
Theorem 5  Assume $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are not linearly separable, $r(\cdot):\mathcal{H}\times\mathbb{R}\to\mathbb{R}$ is an arbitrary function, and $\mathcal{N}\subseteq\mathcal{H}^m$ is a Sublinear Aggregated Uncertainty Set with corresponding atomic uncertainty set $\mathcal{N}_0\subseteq\mathcal{H}$. Then the following min-max problem

$$\min_{\mathbf{w},b}\ \sup_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}}\Bigl\{ r(\mathbf{w},b) + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)-\boldsymbol{\delta}_i\rangle+b),\,0\bigr]\Bigr\} \tag{13}$$

is equivalent to

$$\begin{aligned}
\min:\ & r(\mathbf{w},b) + \sup_{\boldsymbol{\delta}\in\mathcal{N}_0}(\langle\mathbf{w},\boldsymbol{\delta}\rangle) + \sum_{i=1}^m \xi_i,\\
\mathrm{s.t.}:\ & \xi_i \ge 1 - y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b\bigr), \quad i=1,\cdots,m;\\
& \xi_i \ge 0, \quad i=1,\cdots,m.
\end{aligned} \tag{14}$$

Furthermore, the minimization of Problem (14) is attainable when $r(\cdot,\cdot)$ is lower semi-continuous.

For some widely used feature mappings (e.g., the RKHS of a Gaussian kernel), $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are always separable. In this case, the worst-case empirical error may not equal the empirical error plus the penalty term $\sup_{\boldsymbol{\delta}\in\mathcal{N}_0}\bigl(\langle\mathbf{w},\boldsymbol{\delta}\rangle\bigr)$. However, it is easy to show that for any $(\mathbf{w},b)$, the latter is an upper bound of the former.

The next corollary is the feature-space counterpart of Corollary 4, where $\|\cdot\|_\mathcal{H}$ stands for the RKHS norm, i.e., for $\mathbf{z}\in\mathcal{H}$, $\|\mathbf{z}\|_\mathcal{H} = \sqrt{\langle\mathbf{z},\mathbf{z}\rangle}$. Noticing that the RKHS norm is self-dual, the proof is identical to that of Corollary 4, and is hence omitted.
Corollary 6  Let $\mathcal{T}_\mathcal{H} \triangleq \bigl\{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\,\big|\,\sum_{i=1}^m \|\boldsymbol{\delta}_i\|_\mathcal{H} \le c\bigr\}$. If $\{\Phi(\mathbf{x}_i),y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(\mathbf{w},b)$ are equivalent:

$$\min:\ \max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{T}_\mathcal{H}}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr], \tag{15}$$

$$\min:\ c\|\mathbf{w}\|_\mathcal{H} + \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b\bigr),\,0\bigr]. \tag{16}$$

Equation (16) is a variant of the standard SVM, which has a squared RKHS norm regularization term, and it can be shown that the two formulations are equivalent up to a change of the tradeoff parameter $c$, since both the empirical hinge loss and the RKHS norm are convex. Therefore, Corollary 6 essentially means that the standard kernelized SVM is implicitly a robust classifier (without regularization) with disturbance in the feature space, where the sum of the magnitudes of the disturbances is bounded.

Disturbance in the feature space is less intuitive than disturbance in the sample space, and the next lemma relates these two different notions.
Lemma 7  Suppose there exist $\mathcal{X}\subseteq\mathbb{R}^n$, $\rho > 0$, and a continuous non-decreasing function $f:\mathbb{R}^+\to\mathbb{R}^+$ satisfying $f(0)=0$, such that

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x}-\mathbf{x}'\|), \qquad \forall\,\mathbf{x},\mathbf{x}'\in\mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|\le\rho.$$

Then

$$\|\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}) - \Phi(\hat{\mathbf{x}})\|_\mathcal{H} \le \sqrt{f(\|\boldsymbol{\delta}\|)}, \qquad \forall\,\|\boldsymbol{\delta}\|\le\rho,\ \hat{\mathbf{x}},\hat{\mathbf{x}}+\boldsymbol{\delta}\in\mathcal{X}.$$

In the appendix, we prove a result that provides a tighter relationship between disturbance in the feature space and disturbance in the sample space for RBF kernels.
Proof
Expanding the RKHS norm yields

$$\begin{aligned}
\|\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}) - \Phi(\hat{\mathbf{x}})\|_\mathcal{H}
&= \sqrt{\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})-\Phi(\hat{\mathbf{x}}),\,\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})-\Phi(\hat{\mathbf{x}})\rangle}\\
&= \sqrt{\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}),\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta})\rangle + \langle\Phi(\hat{\mathbf{x}}),\Phi(\hat{\mathbf{x}})\rangle - 2\langle\Phi(\hat{\mathbf{x}}+\boldsymbol{\delta}),\Phi(\hat{\mathbf{x}})\rangle}\\
&= \sqrt{k(\hat{\mathbf{x}}+\boldsymbol{\delta},\hat{\mathbf{x}}+\boldsymbol{\delta}) + k(\hat{\mathbf{x}},\hat{\mathbf{x}}) - 2k(\hat{\mathbf{x}}+\boldsymbol{\delta},\hat{\mathbf{x}})}\\
&\le \sqrt{f(\|\hat{\mathbf{x}}+\boldsymbol{\delta}-\hat{\mathbf{x}}\|)} = \sqrt{f(\|\boldsymbol{\delta}\|)},
\end{aligned}$$

where the inequality follows from the assumption.

Lemma 7 essentially says that under certain conditions, robustness in the feature space is a stronger requirement than robustness in the sample space. Therefore, a classifier that achieves robustness in the feature space (the SVM, for example) also achieves robustness in the sample space. Notice that the condition of Lemma 7 is rather weak. In particular, it holds for any continuous $k(\cdot,\cdot)$ and bounded $\mathcal{X}$.

In the next section we consider a more foundational property of robustness in the sample space: we show that a classifier that is robust in the sample space is asymptotically consistent. As a consequence of this result for linear classifiers, the above results imply the consistency of a broad class of kernelized SVMs.
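As a quick numerical check of Lemma 7 (our own sketch; the Gaussian kernel, its bandwidth, and the data are assumptions not taken from the paper), for the RBF kernel $k(\mathbf{x},\mathbf{x}')=\exp(-\|\mathbf{x}-\mathbf{x}'\|^2/2\sigma^2)$ one may take $f(t) = 2 - 2\exp(-t^2/2\sigma^2)$, which is continuous, non-decreasing and vanishes at 0; the feature-space distance can then be evaluated from the kernel alone and compared with $\sqrt{f(\|\boldsymbol{\delta}\|)}$ (for this kernel the two coincide).

```python
import numpy as np

sigma = 1.0
k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))   # Gaussian RBF kernel
f = lambda t: 2.0 - 2.0 * np.exp(-t ** 2 / (2.0 * sigma ** 2))        # a valid f for Lemma 7

rng = np.random.default_rng(3)
x = rng.normal(size=5)
delta = 0.3 * rng.normal(size=5)

# Feature-space distance computed purely from kernel evaluations:
# ||Phi(x + delta) - Phi(x)||_H^2 = k(x+d, x+d) + k(x, x) - 2 k(x+d, x).
feat_dist = np.sqrt(k(x + delta, x + delta) + k(x, x) - 2.0 * k(x + delta, x))
bound = np.sqrt(f(np.linalg.norm(delta)))

print(f"||Phi(x+delta) - Phi(x)||_H = {feat_dist:.6f}")
print(f"sqrt(f(||delta||))          = {bound:.6f}")   # equal for the RBF kernel
```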
5. Consistency of Regularization
In this section we explore a fundamental connection between learning and robustness, by using robustness properties to re-prove the statistical consistency of the linear classifier, and then of the kernelized SVM. Indeed, our proof mirrors the consistency proof found in Steinwart (2005), with the key difference that we replace the metric entropy, VC-dimension, and stability conditions used there with a robustness condition.

Thus far we have considered the setup where the training samples are corrupted by certain set-inclusive disturbances. We now turn to the standard statistical learning setup, by assuming that all training samples and testing samples are generated i.i.d. according to an (unknown) probability $\mathbb{P}$, i.e., there is no explicit disturbance.

Let $\mathcal{X}\subseteq\mathbb{R}^n$ be bounded, and suppose the training samples $(\mathbf{x}_i,y_i)_{i=1}^\infty$ are generated i.i.d. according to an unknown distribution $\mathbb{P}$ supported on $\mathcal{X}\times\{-1,+1\}$. The next theorem shows that our robust classifier setup, and equivalently the regularized SVM, asymptotically minimizes an upper bound of the expected classification error and hinge loss.
Theorem 8  Denote $K \triangleq \max_{\mathbf{x}\in\mathcal{X}}\|\mathbf{x}\|$. Then there exists a random sequence $\{\gamma_{m,c}\}$ such that:

1. $\forall c > 0$, $\lim_{m\to\infty}\gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w},b)$:

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\mathbf{1}_{\{y\neq\mathrm{sgn}(\langle\mathbf{w},\mathbf{x}\rangle+b)\}}\bigr) \le \gamma_{m,c} + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr];$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\max(1-y(\langle\mathbf{w},\mathbf{x}\rangle+b),\,0)\bigr) \le \gamma_{m,c}(1 + K\|\mathbf{w}\| + |b|) + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Proof
We briefly explain the basic idea of the proof before going into the technical details. We consider the testing sample set as a perturbed copy of the training sample set, and measure the magnitude of the perturbation. For testing samples that have "small" perturbations, $c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr]$ upper-bounds their total loss by Corollary 4. Therefore, to prove the theorem we only need to show that the ratio of testing samples having "large" perturbations diminishes.

Now we present the detailed proof. Given a $c > 0$, we call a testing sample $(\mathbf{x}',y')$ and a training sample $(\mathbf{x},y)$ a sample pair if $y = y'$ and $\|\mathbf{x}-\mathbf{x}'\|\le c$. We say a set of training samples and a set of testing samples form $l$ pairings if there exist $l$ sample pairs with no data reused. Given $m$ training samples and $m$ testing samples, we use $M_{m,c}$ to denote the largest number of pairings. To prove this theorem, we need to establish the following lemma.
Lemma 9  Given a $c > 0$, $M_{m,c}/m \to 1$ almost surely as $m\to+\infty$, uniformly w.r.t. $\mathbb{P}$.

Proof
We make a partition of $\mathcal{X}\times\{-1,+1\} = \bigcup_{t=1}^{T_c}\mathcal{X}_t$ such that $\mathcal{X}_t$ either has the form $[\alpha_1,\alpha_1+c/\sqrt{n})\times[\alpha_2,\alpha_2+c/\sqrt{n})\times\cdots\times[\alpha_n,\alpha_n+c/\sqrt{n})\times\{+1\}$ or $[\alpha_1,\alpha_1+c/\sqrt{n})\times[\alpha_2,\alpha_2+c/\sqrt{n})\times\cdots\times[\alpha_n,\alpha_n+c/\sqrt{n})\times\{-1\}$ (recall that $n$ is the dimension of $\mathcal{X}$). That is, each partition set is the Cartesian product of a rectangular cell in $\mathcal{X}$ and a singleton in $\{-1,+1\}$. Notice that if a training sample and a testing sample fall into the same $\mathcal{X}_t$, they can form a pairing.

Let $N^{tr}_t$ and $N^{te}_t$ be the number of training samples and testing samples falling in the $t$th set, respectively. Thus, $(N^{tr}_1,\cdots,N^{tr}_{T_c})$ and $(N^{te}_1,\cdots,N^{te}_{T_c})$ are multinomially distributed random vectors following the same distribution. Notice that for a multinomially distributed random vector $(N_1,\cdots,N_k)$ with parameters $m$ and $(p_1,\cdots,p_k)$, the following holds (Bretagnolle-Huber-Carol inequality; see for example Proposition A6.6 of van der Vaart and Wellner, 2000). For any $\lambda > 0$,

$$\mathbb{P}\Bigl(\sum_{i=1}^k \bigl|N_i - mp_i\bigr| \ge \sqrt{m}\,\lambda\Bigr) \le 2^k \exp(-\lambda^2/2).$$

Hence we have

$$\begin{aligned}
&\mathbb{P}\Bigl(\sum_{t=1}^{T_c}\bigl|N^{tr}_t - N^{te}_t\bigr| \ge \sqrt{m}\,\lambda\Bigr) \le 2^{T_c+1}\exp(-\lambda^2/2),\\
\Longrightarrow\quad &\mathbb{P}\Bigl(\frac{1}{m}\sum_{t=1}^{T_c}\bigl|N^{tr}_t - N^{te}_t\bigr| \ge \lambda\Bigr) \le 2^{T_c+1}\exp(-m\lambda^2/2),\\
\Longrightarrow\quad &\mathbb{P}\bigl(M_{m,c}/m \le 1-\lambda\bigr) \le 2^{T_c+1}\exp(-m\lambda^2/2).
\end{aligned} \tag{17}$$

Observe that $\sum_{m=1}^\infty 2^{T_c+1}\exp(-m\lambda^2/2) < +\infty$, hence by the Borel-Cantelli Lemma (see for example Durrett, 2004), with probability one the event $\{M_{m,c}/m \le 1-\lambda\}$ only occurs finitely often as $m\to\infty$. That is, $\liminf_m M_{m,c}/m \ge 1-\lambda$ almost surely. Since $\lambda$ can be arbitrarily close to zero, $M_{m,c}/m \to 1$ almost surely. The convergence is uniform in $\mathbb{P}$, since $T_c$ only depends on $\mathcal{X}$.

Now we proceed to prove the theorem. Given $m$ training samples and $m$ testing samples with $M_{m,c}$ sample pairs, we notice that for these paired samples, both the total testing error and the total testing hinge loss are upper bounded by

$$\max_{(\boldsymbol{\delta}_1,\cdots,\boldsymbol{\delta}_m)\in\mathcal{N}_0\times\cdots\times\mathcal{N}_0}\ \sum_{i=1}^m \max\bigl[1-y_i\bigl(\langle\mathbf{w},\mathbf{x}_i-\boldsymbol{\delta}_i\rangle+b\bigr),\,0\bigr]
\le cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr],$$

where $\mathcal{N}_0 = \{\boldsymbol{\delta}\,|\,\|\boldsymbol{\delta}\|\le c\}$. Hence the total classification error of the $m$ testing samples can be upper bounded by

$$(m - M_{m,c}) + cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr],$$

and since

$$\max_{\mathbf{x}\in\mathcal{X}}\bigl(1-y(\langle\mathbf{w},\mathbf{x}\rangle+b)\bigr) \le \max_{\mathbf{x}\in\mathcal{X}}\Bigl\{1 + |b| + \sqrt{\langle\mathbf{x},\mathbf{x}\rangle\cdot\langle\mathbf{w},\mathbf{w}\rangle}\Bigr\} = 1 + |b| + K\|\mathbf{w}\|,$$

the accumulated hinge loss of the total $m$ testing samples is upper bounded by

$$(m - M_{m,c})(1 + K\|\mathbf{w}\| + |b|) + cm\|\mathbf{w}\| + \sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Therefore, the average testing error is upper bounded by

$$1 - M_{m,c}/m + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr], \tag{18}$$

and the average hinge loss is upper bounded by

$$(1 - M_{m,c}/m)(1 + K\|\mathbf{w}\| + |b|) + c\|\mathbf{w}\| + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b),\,0\bigr].$$

Let $\gamma_{m,c} = 1 - M_{m,c}/m$. The proof then follows since $M_{m,c}/m \to 1$ almost surely. Moreover, for any $c > 0$,

$$\mathbb{P}\bigl(\gamma_{m,c} \ge \lambda\bigr) \le \exp\bigl(-m\lambda^2/2 + (T_c+1)\log 2\bigr), \tag{19}$$

i.e., the convergence is uniform in $\mathbb{P}$.

We have shown that the average testing error is upper bounded. The final step is to show that this implies that in fact the random variable given by the conditional expectation (conditioned on the training samples) of the error is bounded almost surely, as in the statement of the theorem.
To make things precise, consider a fixed $m$, and let $\omega_1\in\Omega_1$ and $\omega_2\in\Omega_2$ generate the $m$ training samples and $m$ testing samples, respectively; for shorthand, let $T_m$ denote the random variable of the first $m$ training samples. Let us denote the probability measure for the training samples by $\rho_1$ and that for the testing samples by $\rho_2$. By independence, the joint measure is given by the product of these two. We rely on this property in what follows. Now fix a $\lambda$ and a $c > 0$. In our new notation, Inequality (19) now reads:

$$\int_{\Omega_1}\int_{\Omega_2}\mathbf{1}\bigl\{\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda\bigr\}\,d\rho_2(\omega_2)\,d\rho_1(\omega_1) = \mathbb{P}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda\bigr) \le \exp\bigl(-m\lambda^2/2 + (T_c+1)\log 2\bigr).$$

We now bound $\mathbb{P}_{\omega_1}\bigl(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2)\,|\,T_m] > \lambda\bigr)$, and then use Borel-Cantelli to show that this event can happen only finitely often. We have:

$$\begin{aligned}
&\mathbb{P}_{\omega_1}\bigl(\mathbb{E}_{\omega_2}[\gamma_{m,c}(\omega_1,\omega_2)\,|\,T_m] > \lambda\bigr)\\
=\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,d\rho_2(\omega_2) > \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\le\lambda/2\bigr)\,d\rho_2(\omega_2) + \int_{\Omega_2}\gamma_{m,c}(\omega_1,\omega_2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\int_{\Omega_2}(\lambda/2)\,\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\le\lambda/2\bigr)\,d\rho_2(\omega_2) + \int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
\le\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\Bigl[\lambda/2 + \int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2)\Bigr] \ge \lambda\Bigr\}\,d\rho_1(\omega_1)\\
=\ &\int_{\Omega_1}\mathbf{1}\Bigl\{\int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)>\lambda/2\bigr)\,d\rho_2(\omega_2) \ge \lambda/2\Bigr\}\,d\rho_1(\omega_1).
\end{aligned}$$

Here, the first equality holds because training and testing samples are independent, and hence the joint measure is the product of $\rho_1$ and $\rho_2$. The second inequality holds because $\gamma_{m,c}(\omega_1,\omega_2)\le 1$. Furthermore, by Markov's inequality,

$$\int_{\Omega_1}\int_{\Omega_2}\mathbf{1}\bigl\{\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr\}\,d\rho_2(\omega_2)\,d\rho_1(\omega_1) \ge \int_{\Omega_1}\frac{\lambda}{2}\,\mathbf{1}\Bigl\{\int_{\Omega_2}\mathbf{1}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr)\,d\rho_2(\omega_2) \ge \lambda/2\Bigr\}\,d\rho_1(\omega_1).$$

Thus we have

$$\mathbb{P}\bigl(\mathbb{E}_{\omega_2}(\gamma_{m,c}(\omega_1,\omega_2)) > \lambda\bigr) \le \mathbb{P}\bigl(\gamma_{m,c}(\omega_1,\omega_2)\ge\lambda/2\bigr)\big/(\lambda/2) \le 2\exp\bigl(-m\lambda^2/8 + (T_c+1)\log 2\bigr)\big/\lambda.$$

For any $\lambda$ and $c$, summing the right-hand side over $m = 1$ to $\infty$ is finite, hence the theorem follows from the Borel-Cantelli lemma.
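The pairing ratio $M_{m,c}/m$ of Lemma 9 is easy to simulate (a minimal sketch of our own; the uniform sampling distribution, the dimension, and the constants are assumptions). The sketch partitions a bounded sample space into cells of side $c/\sqrt{n}$, matches training and testing samples that share a cell and a label, and reports the resulting lower bound on $M_{m,c}/m$, which approaches 1 as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def pairing_ratio(m, c, n=2):
    """Lower bound on M_{m,c}/m via the cell partition used in Lemma 9."""
    side = c / np.sqrt(n)   # two points in one cell are at distance at most c
    # Training and testing samples drawn i.i.d. from the same (here: uniform) distribution.
    X_tr, X_te = rng.uniform(0, 1, (m, n)), rng.uniform(0, 1, (m, n))
    y_tr, y_te = rng.choice([-1, 1], m), rng.choice([-1, 1], m)

    def cell_counts(X, y):
        counts = {}
        for xi, yi in zip(X, y):
            key = (tuple(np.floor(xi / side).astype(int)), int(yi))
            counts[key] = counts.get(key, 0) + 1
        return counts

    tr, te = cell_counts(X_tr, y_tr), cell_counts(X_te, y_te)
    # Samples sharing a cell and a label form sample pairs.
    M = sum(min(tr[k], te.get(k, 0)) for k in tr)
    return M / m

for m in [100, 1000, 10000, 100000]:
    print(f"m = {m:6d}   M_m,c / m >= {pairing_ratio(m, c=0.2):.3f}")
```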
Remark 10  We notice that $M_{m,c}/m$ converges to 1 almost surely even when $\mathcal{X}$ is not bounded. Indeed, to see this, fix $\epsilon > 0$, and let $\mathcal{X}'\subseteq\mathcal{X}$ be a bounded set such that $\mathbb{P}(\mathcal{X}') > 1-\epsilon$. Denote by $M_m(\mathcal{X}')$ the number of pairings formed by the samples falling in $\mathcal{X}'$. Then, with probability one, $M_m(\mathcal{X}')/m \to \mathbb{P}(\mathcal{X}') > 1-\epsilon$, by the argument of Lemma 9 (the samples falling outside $\mathcal{X}'$ can be collected into one additional set of the partition). Since $M_m \ge M_m(\mathcal{X}')$, we have $\liminf_{m\to\infty} M_m/m \ge 1-\epsilon$ almost surely. Since $\epsilon$ is arbitrary, $M_m/m \to 1$ almost surely.

Turning to the kernelized setting, let $\mathcal{X}\subseteq\mathbb{R}^n$ be bounded, and suppose the training samples $(\mathbf{x}_i,y_i)_{i=1}^\infty$ are generated i.i.d. according to an unknown distribution $\mathbb{P}$ supported on $\mathcal{X}\times\{-1,+1\}$.
Theorem 11  Denote $K \triangleq \max_{\mathbf{x}\in\mathcal{X}}\sqrt{k(\mathbf{x},\mathbf{x})}$. Suppose there exist $\rho > 0$ and a continuous non-decreasing function $f:\mathbb{R}^+\to\mathbb{R}^+$ satisfying $f(0)=0$, such that:

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}',\mathbf{x}') - 2k(\mathbf{x},\mathbf{x}') \le f(\|\mathbf{x}-\mathbf{x}'\|), \qquad \forall\,\mathbf{x},\mathbf{x}'\in\mathcal{X},\ \|\mathbf{x}-\mathbf{x}'\|\le\rho.$$

Then there exists a random sequence $\{\gamma_{m,c}\}$ such that:

1. $\forall c > 0$, $\lim_{m\to\infty}\gamma_{m,c} = 0$ almost surely, and the convergence is uniform in $\mathbb{P}$;

2. the following bounds on the Bayes loss and the hinge loss hold uniformly for all $(\mathbf{w},b)\in\mathcal{H}\times\mathbb{R}$:

$$\mathbb{E}_\mathbb{P}\bigl(\mathbf{1}_{\{y\neq\mathrm{sgn}(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\}}\bigr) \le \gamma_{m,c} + c\|\mathbf{w}\|_\mathcal{H} + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr],$$

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl(\max(1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b),\,0)\bigr) \le \gamma_{m,c}(1 + K\|\mathbf{w}\|_\mathcal{H} + |b|) + c\|\mathbf{w}\|_\mathcal{H} + \frac{1}{m}\sum_{i=1}^m \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr].$$

Proof
As in the proof of Theorem 8, we generate a set of $m$ testing samples and $m$ training samples, and then lower-bound the number of samples that can form a sample pair in the feature space; that is, a pair consisting of a training sample $(\mathbf{x},y)$ and a testing sample $(\mathbf{x}',y')$ such that $y=y'$ and $\|\Phi(\mathbf{x})-\Phi(\mathbf{x}')\|_\mathcal{H}\le c$. In contrast to the finite-dimensional sample space, the feature space may be infinite dimensional, and thus our decomposition may have an infinite number of "bricks." In this case, the multinomial random variable argument used in the proof of Lemma 9 breaks down. Nevertheless, we are able to lower bound the number of sample pairs in the feature space by the number of sample pairs in the sample space.

Define $f^{-1}(\alpha) \triangleq \max\{\beta\ge 0\,|\,f(\beta)\le\alpha\}$. Since $f(\cdot)$ is continuous, $f^{-1}(\alpha) > 0$ for any $\alpha > 0$. Now notice that, by Lemma 7, if a testing sample $\mathbf{x}$ and a training sample $\mathbf{x}'$ belong to a "brick" with sides of length $\min(\rho/\sqrt{n},\,f^{-1}(c^2)/\sqrt{n})$ in the sample space (see the proof of Lemma 9), then $\|\Phi(\mathbf{x})-\Phi(\mathbf{x}')\|_\mathcal{H}\le c$. Hence the number of sample pairs in the feature space is lower bounded by the number of pairs of samples that fall in the same brick in the sample space. We can cover $\mathcal{X}$ with finitely many (denoted $T_c$) such bricks since $f^{-1}(c^2) > 0$. Then, an argument similar to that of Lemma 9 shows that the ratio of samples that form pairs in a brick converges to 1 as $m$ increases. Further notice that for $M$ paired samples, the total testing error and hinge loss are both upper bounded by

$$cM\|\mathbf{w}\|_\mathcal{H} + \sum_{i=1}^M \max\bigl[1-y_i(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b),\,0\bigr].$$

The rest of the proof is identical to that of Theorem 8. In particular, Inequality (19) still holds.

Notice that the condition in Theorem 11 is satisfied by most widely used kernels, e.g., homogeneous polynomial kernels and the Gaussian RBF. This condition requires that the feature mapping is "smooth" and hence preserves "locality" of the disturbance, i.e., a small disturbance in the sample space guarantees that the corresponding disturbance in the feature space is also small. It is easy to construct non-smooth kernel functions which do not generalize well. For example, consider the following kernel:

$$k(\mathbf{x},\mathbf{x}') = \begin{cases} 1 & \mathbf{x} = \mathbf{x}';\\ 0 & \mathbf{x} \neq \mathbf{x}'.\end{cases}$$

A standard RKHS-regularized SVM using this kernel leads to a decision function

$$\mathrm{sign}\Bigl(\sum_{i=1}^m \alpha_i k(\mathbf{x},\mathbf{x}_i) + b\Bigr),$$

which equals $\mathrm{sign}(b)$ and provides no meaningful prediction if the testing sample $\mathbf{x}$ is not one of the training samples. Hence as $m$ increases, the testing error remains as large as 50% regardless of the tradeoff parameter used in the algorithm, while the training error can be made arbitrarily small by fine-tuning the parameter.

Convergence to Bayes Risk
Next we relate the results of Theorem 8 and Theorem 11 to the standard consistency notion, i.e., convergence to the Bayes Risk (Steinwart, 2005). The key point of interest in our proof is the use of a robustness condition in place of the VC-dimension or stability conditions used in Steinwart (2005). The proof in Steinwart (2005) has four main steps. They show: (i) there always exists a minimizer of the expected regularized (kernel) hinge loss; (ii) the expected regularized hinge loss of the minimizer converges to the expected hinge loss as the regularizer goes to zero; (iii) if a sequence of functions asymptotically has optimal expected hinge loss, then it also has optimal expected loss; and (iv) the expected hinge loss of the minimizer of the regularized training hinge loss concentrates around the empirical regularized hinge loss. In Steinwart (2005), this final step, (iv), is accomplished using concentration inequalities derived from VC-dimension considerations, and from stability considerations.

Instead, we use our robustness-based results, Theorem 8 and Theorem 11, to replace these approaches (Lemmas 3.21 and 3.22 in Steinwart, 2005) in proving step (iv), and thus to establish the main result.

Recall that a classifier is a rule that assigns to every training set $T = \{\mathbf{x}_i,y_i\}_{i=1}^m$ a measurable function $f_T$. The risk of a measurable function $f:\mathcal{X}\to\mathbb{R}$ is defined as

$$\mathcal{R}_\mathbb{P}(f) \triangleq \mathbb{P}\bigl(\{\mathbf{x},y:\ \mathrm{sign}\,f(\mathbf{x}) \neq y\}\bigr).$$

The smallest achievable risk, $\mathcal{R}_\mathbb{P} \triangleq \inf\{\mathcal{R}_\mathbb{P}(f)\,|\,f\ \text{measurable}\}$, is called the Bayes Risk of $\mathbb{P}$. A classifier is said to be strongly uniformly consistent if, for all distributions $\mathbb{P}$ on $\mathcal{X}\times[-1,+1]$, the following holds almost surely:

$$\lim_{m\to\infty}\mathcal{R}_\mathbb{P}(f_T) = \mathcal{R}_\mathbb{P}.$$

Without loss of generality, we only consider the kernel version. Recall a definition from Steinwart (2005).
Definition 12
Let $C(\mathcal{X})$ be the set of all continuous functions defined on $\mathcal{X}$. Consider the mapping $I:\mathcal{H}\to C(\mathcal{X})$ defined by $I\mathbf{w} \triangleq \langle\mathbf{w},\Phi(\cdot)\rangle$. If $I$ has a dense image, we call the kernel universal.

Roughly speaking, if a kernel is universal, it is rich enough to satisfy the condition of step (ii) above.
Theorem 13
If a kernel satisfies the condition of Theorem 11 and is universal, then the kernel SVM with $c\downarrow 0$ sufficiently slowly is strongly uniformly consistent.

Proof
We first introduce some notation, largely following Steinwart (2005). For a probability measure $\mu$ and $(\mathbf{w},b)\in\mathcal{H}\times\mathbb{R}$,

$$\mathcal{R}_{L,\mu}((\mathbf{w},b)) \triangleq \mathbb{E}_{(\mathbf{x},y)\sim\mu}\bigl\{\max\bigl(0,\,1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\bigr)\bigr\}$$

is the expected hinge loss under the probability $\mu$, and

$$\mathcal{R}^c_{L,\mu}((\mathbf{w},b)) \triangleq c\|\mathbf{w}\|_\mathcal{H} + \mathbb{E}_{(\mathbf{x},y)\sim\mu}\bigl\{\max\bigl(0,\,1-y(\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b)\bigr)\bigr\}$$

is the regularized expected hinge loss. Hence $\mathcal{R}_{L,\mathbb{P}}(\cdot)$ and $\mathcal{R}^c_{L,\mathbb{P}}(\cdot)$ are the expected hinge loss and regularized expected hinge loss under the generating probability $\mathbb{P}$. If $\mu$ is the empirical distribution of $m$ samples, we write $\mathcal{R}_{L,m}(\cdot)$ and $\mathcal{R}^c_{L,m}(\cdot)$ respectively. Notice $\mathcal{R}^c_{L,m}(\cdot)$ is the objective function of the SVM. Denote its solution by $f_{m,c}$, i.e., the classifier we get by running the SVM with $m$ samples and parameter $c$. Further denote by $f_{\mathbb{P},c}\in\mathcal{H}\times\mathbb{R}$ the minimizer of $\mathcal{R}^c_{L,\mathbb{P}}(\cdot)$. The existence of such a minimizer is proved in Lemma 3.1 of Steinwart (2005) (step (i)). Let

$$\mathcal{R}_{L,\mathbb{P}} \triangleq \min_{f\ \text{measurable}}\ \mathbb{E}_{(\mathbf{x},y)\sim\mathbb{P}}\bigl\{\max\bigl(1-yf(\mathbf{x}),\,0\bigr)\bigr\},$$

i.e., the smallest achievable hinge loss over all measurable functions.

The main content of our proof is to use Theorems 8 and 11 to prove step (iv) in Steinwart (2005). In particular, we show: if $c\downarrow 0$ sufficiently slowly, then

$$\lim_{m\to\infty}\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) = \mathcal{R}_{L,\mathbb{P}}. \tag{20}$$

To prove Equation (20), denote by $\mathbf{w}(f)$ and $b(f)$ the weight part and offset part of any classifier $f$. Next, we bound the magnitude of $f_{m,c}$ by using $\mathcal{R}^c_{L,m}(f_{m,c}) \le \mathcal{R}^c_{L,m}(\mathbf{0},0) = 1$, which gives

$$0 \le \|\mathbf{w}(f_{m,c})\|_\mathcal{H} \le 1/c \qquad\text{and}\qquad |b(f_{m,c})| \le 1 + K\|\mathbf{w}(f_{m,c})\|_\mathcal{H} \le 1 + K/c.$$

From Theorem 11 (note that the bound holds uniformly for all $(\mathbf{w},b)$), we have

$$\begin{aligned}
\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) &\le \gamma_{m,c}\bigl[1 + K\|\mathbf{w}(f_{m,c})\|_\mathcal{H} + |b(f_{m,c})|\bigr] + \mathcal{R}^c_{L,m}(f_{m,c})\\
&\le \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \mathcal{R}^c_{L,m}(f_{m,c})\\
&\le \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \mathcal{R}^c_{L,m}(f_{\mathbb{P},c})\\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \bigl\{\mathcal{R}^c_{L,m}(f_{\mathbb{P},c}) - \mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c})\bigr\} + \bigl\{\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\bigr\}\\
&= \mathcal{R}_{L,\mathbb{P}} + \gamma_{m,c}\bigl[3 + 2K/c\bigr] + \bigl\{\mathcal{R}_{L,m}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}(f_{\mathbb{P},c})\bigr\} + \bigl\{\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) - \mathcal{R}_{L,\mathbb{P}}\bigr\}.
\end{aligned}$$

The last inequality holds because $f_{m,c}$ minimizes $\mathcal{R}^c_{L,m}$.

It is known (Steinwart, 2005, Proposition 3.2) (step (ii)) that if the kernel used is rich enough, i.e., universal, then $\lim_{c\to 0}\mathcal{R}^c_{L,\mathbb{P}}(f_{\mathbb{P},c}) = \mathcal{R}_{L,\mathbb{P}}$. For fixed $c > 0$, we have

$$\lim_{m\to\infty}\mathcal{R}_{L,m}(f_{\mathbb{P},c}) = \mathcal{R}_{L,\mathbb{P}}(f_{\mathbb{P},c})$$

almost surely, by the strong law of large numbers (notice that $f_{\mathbb{P},c}$ is a fixed classifier), and $\gamma_{m,c}[3+2K/c]\to 0$ almost surely. Therefore, if $c\downarrow 0$ sufficiently slowly, we have almost surely

$$\lim_{m\to\infty}\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) \le \mathcal{R}_{L,\mathbb{P}}.$$
3. For example, we can take $\{c(m)\}$ to be the smallest numbers satisfying $c(m)\ge m^{-1/8}$ and $T_{c(m)}\le m^{1/8}/\log 2 - 1$. Inequality (19) then leads to $\sum_{m=1}^\infty \mathbb{P}\bigl(\gamma_{m,c(m)}/c(m)\ge m^{-1/8}\bigr) < +\infty$, which implies uniform convergence of $\gamma_{m,c(m)}/c(m)$ to zero.

Now, for any $m$ and $c$, we have $\mathcal{R}_{L,\mathbb{P}}(f_{m,c}) \ge \mathcal{R}_{L,\mathbb{P}}$ by definition. This implies that Equation (20) holds almost surely, thus giving us step (iv).

Finally, Proposition 3.3 of Steinwart (2005) shows step (iii), namely, that approximating the hinge loss is sufficient to guarantee approximation of the Bayes loss. Thus Equation (20) implies that the risk of the function $f_{m,c}$ converges to the Bayes risk.
6. Concluding Remarks
This work considers the relationship between robust and regularized SVM classification. In particular, we prove that the standard norm-regularized SVM classifier is in fact the solution to a robust classification setup, and thus known results about regularized classifiers extend to robust classifiers. To the best of our knowledge, this is the first explicit such link between regularization and robustness in pattern classification. This link suggests that norm-based regularization essentially builds in a robustness to sample noise whose probability level sets are symmetric, and moreover have the structure of the unit ball with respect to the dual of the regularizing norm. It would be interesting to understand the performance gains possible when the noise does not have such characteristics, and the robust setup is used in place of regularization with an appropriately defined uncertainty set.

Based on the robustness interpretation of the regularization term, we re-proved the consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or stability. Our proof suggests that the ability to handle disturbance is crucial for an algorithm to achieve good generalization. In particular, for "smooth" feature mappings, robustness to disturbance in the observation space is guaranteed, and hence SVMs achieve consistency. On the other hand, certain "non-smooth" feature mappings fail to be consistent simply because for such kernels the robustness in the feature space (guaranteed by the regularization process) does not imply robustness in the observation space.
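For reference, the convex duality behind the last statement is one line (standard convex analysis, stated here in our notation rather than quoted from the text): if the disturbance $\boldsymbol{\delta}$ ranges over a ball of radius $c$ in a norm $\|\cdot\|_*$ dual to the regularizing norm $\|\cdot\|$, then
\[
\sup_{\|\boldsymbol{\delta}\|_* \le c} \langle \mathbf{w}, \boldsymbol{\delta}\rangle = c\,\|\mathbf{w}\|,
\]
so the worst-case effect of such a disturbance on a linear classifier is exactly a norm penalty of weight $c$ on $\mathbf{w}$.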
Acknowledgments
We thank the editor and three anonymous reviewers for significantly improving the accessibility of this manuscript. We also benefited from comments from participants in ITA 2008.
Appendix A.
In this appendix we show that, for RBF kernels, it is possible to relate robustness in the feature space and robustness in the sample space more directly.
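The computation behind the appendix is the following standard RKHS identity, spelled out here for convenience under the assumption of Theorem 14 below that $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x} - \mathbf{x}'\|)$:
\[
\|\Phi(\mathbf{x}) - \Phi(\mathbf{x}')\|_{\mathcal{H}}^2 = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{x}') + k(\mathbf{x}',\mathbf{x}') = 2f(0) - 2f(\|\mathbf{x} - \mathbf{x}'\|).
\]
In words, a disturbance of Euclidean size at most $c$ in the sample space moves the feature vector by at most $\sqrt{2f(0) - 2f(c)}$ in $\mathcal{H}$, which is the radius appearing in the theorem.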
Theorem 14
Suppose the kernel function has the form $k(\mathbf{x}, \mathbf{x}') = f(\|\mathbf{x} - \mathbf{x}'\|)$, with $f: \mathbb{R}^+ \to \mathbb{R}$ a decreasing function. Denote by $\mathcal{H}$ the RKHS of $k(\cdot,\cdot)$ and by $\Phi(\cdot)$ the corresponding feature mapping. Then for any $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{w} \in \mathcal{H}$ and $c > 0$,
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle = \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) + \delta_\phi\rangle.
\]

Proof
We show that the left-hand side is not larger than the right-hand side, and vice versa. Since the ball $\{\delta_\phi : \|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}\}$ is symmetric, we may equivalently write $\Phi(\mathbf{x}) - \delta_\phi$ in place of $\Phi(\mathbf{x}) + \delta_\phi$.

First we show
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \;\le\; \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{21}
\]
We notice that for any $\|\boldsymbol{\delta}\| \le c$ we have $\|\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\|_{\mathcal{H}}^2 = 2f(0) - 2f(\|\boldsymbol{\delta}\|) \le 2f(0) - 2f(c)$, because $f$ is decreasing, and hence
\[
\begin{aligned}
\langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle &= \big\langle \mathbf{w}, \Phi(\mathbf{x}) + \big(\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\big)\big\rangle \\
&= \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\rangle \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \cdot \|\Phi(\mathbf{x} - \boldsymbol{\delta}) - \Phi(\mathbf{x})\|_{\mathcal{H}} \\
&\le \langle \mathbf{w}, \Phi(\mathbf{x})\rangle + \|\mathbf{w}\|_{\mathcal{H}} \sqrt{2f(0) - 2f(c)} \\
&= \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle.
\end{aligned}
\]
Taking the supremum over $\boldsymbol{\delta}$ establishes Inequality (21).

Next, we show the opposite inequality,
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \;\ge\; \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle. \tag{22}
\]
If $f(c) = f(0)$, then Inequality (22) holds trivially, hence we only consider the case $f(c) < f(0)$. Notice that the inner product is a continuous function on $\mathcal{H}$; hence for any $\epsilon > 0$, there exists a $\delta'_\phi$ such that
\[
\langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon; \qquad \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.
\]
Recall that the RKHS is the completion of the feature mapping; thus there exists a sequence $\{\mathbf{x}'_i\} \subset \mathbb{R}^n$ such that
\[
\Phi(\mathbf{x}'_i) \to \Phi(\mathbf{x}) - \delta'_\phi, \tag{23}
\]
which is equivalent to $\big(\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\big) \to -\delta'_\phi$. This leads to
\[
\lim_{i\to\infty} \sqrt{2f(0) - 2f(\|\mathbf{x}'_i - \mathbf{x}\|)} = \lim_{i\to\infty} \|\Phi(\mathbf{x}'_i) - \Phi(\mathbf{x})\|_{\mathcal{H}} = \|\delta'_\phi\|_{\mathcal{H}} < \sqrt{2f(0) - 2f(c)}.
\]
Since $f$ is decreasing, we conclude that $\|\mathbf{x}'_i - \mathbf{x}\| \le c$ holds for all but finitely many $i$. By (23) we have
\[
\langle \mathbf{w}, \Phi(\mathbf{x}'_i)\rangle \to \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta'_\phi\rangle > \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon,
\]
which means
\[
\sup_{\|\boldsymbol{\delta}\| \le c} \langle \mathbf{w}, \Phi(\mathbf{x} - \boldsymbol{\delta})\rangle \ge \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle \mathbf{w}, \Phi(\mathbf{x}) - \delta_\phi\rangle - \epsilon.
\]
Since $\epsilon$ is arbitrary, we establish Inequality (22).

Combining Inequality (21) and Inequality (22) proves the theorem.

References
M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, November 2002.
P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1–13, August 1999.
K. Bennett and O. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1(1):23–34, 1992.
D. Bertsimas and A. Fertis. Personal correspondence, March 2008.
D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, January 2004.
C. Bhattacharyya. Robust classification of noisy data using second order cone programming approach. In Proceedings of the International Conference on Intelligent Sensing and Information Processing, pages 433–438, Chennai, India, 2004.
C. Bhattacharyya, L. Grate, M. Jordan, L. El Ghaoui, and I. Mian. Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073–1089, 2004a.
C. Bhattacharyya, K. Pannagadatta, and A. Smola. A second order cone programming formulation for classifying missing data. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS17), Cambridge, MA, 2004b. MIT Press.
J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS17), Cambridge, MA, 2004. MIT Press.
C. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995. doi: 10.1162/neco.1995.7.1.108.
B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, New York, NY, 1992.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
A. Christmann and I. Steinwart. On robust properties of convex risk minimization methods for pattern recognition. Journal of Machine Learning Research, 5:1007–1034, 2004.
A. Christmann and I. Steinwart. Consistency and robustness of kernel based regression. Bernoulli, 13(3):799–819, 2007.
A. Christmann and A. Van Messem. Bouligand derivatives and robustness of support vector machines. Journal of Machine Learning Research, 9:915–936, 2008.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
R. Durrett. Probability: Theory and Examples. Duxbury Press, 2004.
L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035–1064, 1997.
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171–203, Cambridge, MA, 2000. MIT Press.
A. Globerson and S. Roweis. Nightmare at test time: Robust learning by feature deletion. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 353–360, New York, NY, USA, 2006. ACM Press.
F. Hampel. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, New York, 1986.
P. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
M. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7–50, 1997.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In UAI-2002: Uncertainty in Artificial Intelligence, pages 275–282, 2002.
G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, December 2002.
R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust Statistics: Theory and Methods. John Wiley & Sons, New York, 2006.
S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193, 2006.
T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.
P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.
A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637–649, 1998.
I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
C. H. Teo, A. Globerson, S. Roweis, and A. Smola. Convex learning with invariances. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1489–1496, Cambridge, MA, 2008. MIT Press.
T. Trafalis and R. Gilbert. Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1):187–198, February 2007.
A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 2000.
V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260–284, 1991.
V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:744–780, 1963.