Gaussian Robust Classification
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science
by
Ido Ginodi
Supervised by Dr. Amir Globerson
December 2010
The School of Computer Science and Engineering
The Hebrew University of Jerusalem, Israel

Abstract

Supervised learning is all about the ability to generalize knowledge. Specifically, the goal of the learning is to train a classifier using training data, in such a way that it will be capable of classifying new unseen data correctly. In order to achieve this goal, it is important to carefully design the learner, so it will not overfit the training data. The latter can be done in a couple of ways, of which adding a regularization term is probably the most common one. Statistical learning theory explains the success of the regularization method by claiming that it restricts the complexity of the learned model. This explanation, however, is rather abstract and does not have a geometric intuition.

The generalization error of a classifier may be thought of as correlated with its robustness to perturbations of the data. Namely, if a classifier is capable of coping with disturbance, it is expected to generalize well. Indeed, it was established that the ordinary SVM formulation is equivalent to a robust formulation, in which an adversary may displace the training and testing points within a ball of pre-determined radius (Xu et al. [2009]). In this work we explore a different kind of robustness. We suggest replacing each data point with a Gaussian cloud centered at the original point. The loss is evaluated as the expectation of an underlying loss function on the cloud. This setup fits the fact that in many applications, the data is sampled along with noise. We develop a robust optimization (RO) framework, in which the adversary chooses the covariance of the noise. In our algorithm, named GURU, the tuning parameter is the variance of the noise that contaminates the data, and so it can be estimated using physical or applicative considerations. Our experiments show that this framework generates classifiers that perform as well as SVM, and even slightly better in some cases. Generalizations for Mercer kernels and for the multiclass case are presented as well. We also show that our framework may be further generalized, using the technique of convex perspective functions.

Contents
A  Single-Point Algorithms  57
B  Diagonal Covariance  61
C  Using the Multi-Hinge Loss  63

Chapter 1
Introduction
1.1 Motivation
The ability to understand new unseen data, based on knowledge that was gained using a training sample, is probably the main goal of machine learning. In the supervised learning setup, one is given a training set, consisting of data samples along with labels indicating their 'type' or 'class'. The learning task in this case is to develop a decision rule which allows predicting the correct label of unfamiliar data.

As the main goal is to be able to generalize, it makes sense to design the learning process so that it reflects the conditions under which the classifier is going to be tested and used. In many real-world applications, the data we are given is corrupted by noise. The noise may be either inherent to the process that generates the data or adversarial. Examples of inherent noise include a noisy sensor and natural variability of the data. Adversarial noise is present, for example, in spam emails. Either way, it is vital to learn how to classify when it is present. We suggest doing so by preparing for the worst case. Amongst all noise distributions with bounded power (i.e., bounded covariance), the Gaussian noise is believed to be the most problematic, since it has the maximal entropy.

By designing a classifier that is robust to Gaussian noise, we are able to learn and generalize well, without the need to introduce an explicit regularization term. In that respect, our work aims at shedding more light on the connection between robustness and generalization.
1.2 The supervised learning framework
Formally speaking, the supervised learning setup consists of three major components:
1. Data.
We denote by $\mathcal{X}$ the sample space, in which the data samples live (i.e., the objects one tries to classify; e.g., vector representations of handwritten digits). Alongside the sample space, we are given the label set, denoted $\mathcal{Y}$. This set contains the various classes to which the data points may be assigned (e.g., the digits $0, 1, \ldots, 9$ in the handwritten digits example). A distribution $\mathcal{D}$ is defined over $\mathcal{X} \times \mathcal{Y}$, and dictates the probability of sampling a data point $\mathbf{x} \in \mathcal{X}$ along with a label $y \in \mathcal{Y}$. In our discussion we will restrict ourselves to the Euclidean case, namely $\mathcal{X} = \mathbb{R}^d$. Unless stated otherwise, we assume a binary setting, in which $\mathcal{Y} = \{+1, -1\}$.
2. Hypothesis class. In the learning process, one considers candidate hypotheses taken out of the class $\mathcal{H}$. This class consists of functions from $\mathcal{X}$ to $\mathcal{Y}$. Its contents reflect some kind of prior knowledge about the problem at hand. A well known example is the class of half-spaces, defined as
$$\mathcal{H}_{\text{half-space}} = \left\{\, \phi_{\mathbf{w}}(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^T\mathbf{x}) \;\middle|\; \phi_{\mathbf{w}}: \mathbb{R}^d \to \{+1,-1\},\ \mathbf{w} \in \mathbb{R}^d \,\right\} \tag{1.1}$$
3. Loss measure. The means to measure the performance of a specific instance $h \in \mathcal{H}$ is the loss function, $\ell: \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}_+$. The most intuitive loss function in the binary case is the zero-one loss, defined by
$$\ell_{0\text{-}1}(\mathbf{x}, y; h) = \mathbb{1}\left[h(\mathbf{x}) \neq y\right] \tag{1.2}$$
The learning task is to find the classifier $h^* \in \mathcal{H}$ which is optimal, in the sense that it minimizes the actual risk, defined as
$$\mathrm{err}(h) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\, \ell(\mathbf{x}, y; h) \tag{1.3}$$
Most of the time, however, $\mathcal{D}$ is unknown. Even in the rare cases in which it is known, it is not always possible to optimize the expectation over it. The learner is thus given a training set $S \subseteq \mathcal{X} \times \mathcal{Y}$ of i.i.d. samples. The learning task in that case is to minimize the empirical risk, defined as
$$\widehat{\mathrm{err}}(h) = \frac{1}{M}\sum_{m=1}^{M} \ell(\mathbf{x}_m, y_m; h) \tag{1.4}$$
where $S = \{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$. This technique is called empirical risk minimization (ERM). It is important to keep in mind that although the technical tool is ERM, the objective is always to have the actual risk as low as possible.

Sometimes, however, this is not the case. That is, in spite of the fact that the learned decision rule is capable of classifying the training data, it fails to do so on fresh test data. In this case we say that the generalization error is high, although the training error is low. The reason for such a failure is most often overfitting. In this situation, the learned classifier fits the training data very well, but misses the general rule behind the data. In the PAC model, overfitting is explained by a too rich hypothesis class. If the learner can choose a model that fits the training data perfectly, it will do so, ignoring the fact that the chosen model will possibly not be able to explain new data. Say, for example, that the hypothesis class consists of all the functions from the sample space to the label space. A naïve learner might choose a classifier that handles all the training points well, whereas any unknown sample is classified as $+1$. This selection might obviously have erroneous results. In the spirit of this idea, the PAC theory bounds the difference between the empirical and the actual risk using a combinatorial measure of the hypothesis class complexity, named the VC dimension. For a detailed review see Vapnik [1995].

A common solution for this problem is to add a regularization term to the objective of the minimization problem. Usually, a norm of the classifier is taken as the regularization term. From the statistical learning theory's point of view, the regularization restricts the complexity of the model, and thereby controls the difference between the training and testing error (Smola et al. [1998]; Evgeniou et al. [2000]; Bartlett et al. [2002]). The idea of minimizing the complexity of the model is not unique to the statistical theory, and may be traced back to Occam's razor principle: the simplest hypothesis that explains the phenomenon is likely to be the correct one. Another way to understand the regularization term is as a means to introduce prior knowledge.
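To make the ERM objective concrete, the following minimal sketch (illustrative only; the toy data, weight vector, and helper names are hypothetical and not part of the thesis) evaluates the zero-one loss of Equation 1.2 and the empirical risk of Equation 1.4 for a half-space classifier:

```python
import numpy as np

def zero_one_loss(x, y, w):
    """Zero-one loss of Equation 1.2 for the half-space classifier h(x) = sgn(w^T x)."""
    return float(np.sign(w @ x) != y)

def empirical_risk(X, Y, w):
    """Empirical risk of Equation 1.4: the average loss over the training set S."""
    return float(np.mean([zero_one_loss(x_m, y_m, w) for x_m, y_m in zip(X, Y)]))

# Toy usage: four points in R^2 with labels in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])
print(empirical_risk(X, Y, np.array([1.0, 1.0])))  # 0.0: this half-space fits the toy sample
```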
1.3 Support Vector Machines
In support vector machines (SVM), the loss measure at hand is the hinge loss
$$\ell_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}) = \left[1 - y_m \mathbf{w}^T\mathbf{x}_m\right]_+$$
$\ell_{\text{hinge}}$ is a surrogate loss function, in the sense that it upper-bounds the zero-one loss. Furthermore, $\ell_{\text{hinge}}$ is convex, which makes it a far more convenient objective for numerical optimization than the zero-one loss. Note that the hinge loss introduces a penalty even when the classifier correctly predicts the label of a sample but does so with too little margin, i.e. $0 \le y_m\mathbf{w}^T\mathbf{x}_m < 1$. The penalty on a wrong classification is linear in the distance of the sample from the hyperplane.

As discussed, optimizing the sum of the losses alone may result in poor generalization performance. The SVM solution is to add an $L_2$ regularization term. The geometrical intuition behind this term is the following: the distance between the point $\mathbf{x}_m$ and the hyperplane $\mathbf{w}^T\mathbf{x} = b$ is given by
$$\frac{|\mathbf{w}^T\mathbf{x}_m - b|}{\|\mathbf{w}\|} \tag{1.5}$$
One may scale $\mathbf{w}$ and $b$ in such a way that the point with the smallest margin (that is, the one closest to the hyperplane) satisfies $|\mathbf{w}^T\mathbf{x}_m - b| = 1$. In that case, the bilateral margin is $2/\|\mathbf{w}\|$ (see Figure 1.2). This geometrical intuition, along with the fact that the hinge loss punishes too small a margin, motivates the name Maximum Margin Classification that was granted to SVM. Hence, the SVM optimization task is
$$\min_{\mathbf{w}, b}\ \lambda\|\mathbf{w}\|^2 + \sum_{m=1}^{M}\left[1 - y_m(\mathbf{w}^T\mathbf{x}_m - b)\right]_+ \tag{1.6}$$
The parameter $\lambda$ controls the tradeoff between the training error and the margin of the classifier (cf. Section 1.5).

Figure 1.1: The hinge loss is a convex surrogate to the zero-one loss.
Figure 1.2: The bilateral margin is $2/\|\mathbf{w}\|$. Thus, minimizing $\|\mathbf{w}\|$ results in maximizing the margin.
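As an illustration of the objective in Equation 1.6, the following sketch (hypothetical helper names, not from the thesis) evaluates the regularized hinge objective for a given classifier; an off-the-shelf solver or a subgradient method could then be used to minimize it:

```python
import numpy as np

def hinge_loss(x, y, w, b=0.0):
    """Hinge loss [1 - y (w^T x - b)]_+ as used in Equation 1.6."""
    return max(0.0, 1.0 - y * (w @ x - b))

def svm_objective(X, Y, w, b, lam):
    """SVM objective of Equation 1.6: lam * ||w||^2 plus the sum of hinge losses."""
    return lam * float(w @ w) + sum(hinge_loss(x_m, y_m, w, b) for x_m, y_m in zip(X, Y))
```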
1.4 Robustness
The objective of learning is to be able to classify new data. Thus, being robust to perturbations of the data is usually a desirable property for a classifier. In some cases, the training data and the testing data are sampled from different processes, which are similar to some extent but are not identical (Bi and Zhang [2004]). This situation can also happen due to application-specific issues, when new samples are sampled with reduced accuracy (for example, the training data may be collected with an expensive sensor, whereas cheaper sensors are deployed for actual use).

An even harder scenario is that of learning in the presence of an adversary that may corrupt the training data, the testing data, or both. The key step in formulating the robust learning task is to model the action of the adversary, i.e., to define the family of perturbations that it may apply to the data points. In the Robust-SVM model, the adversary may apply a bounded additive disturbance, by displacing a sample point within a ball around it (Shivaswamy et al. [2006]). This case is referred to as box-uncertainty. Globerson and Roweis [2006] assumed a different type of adversary. In their model, named FDROP, the adversary is allowed to delete a bounded number of features. This model results in more balanced classifiers, which are less likely to base their prediction only on a small subset of informative features.

Two issues usually recur in robust learning formulations. The first one is the problem of the adversarial choice. Most of the time, the first step in the analysis of the model is characterizing the exact action of the adversary on a specific data sample, given specific model parameters. The Robust-SVM adversary will choose to displace the point perpendicularly to the separating hyperplane. FDROP's adversary will delete the most informative features, i.e. those that have the maximal contribution to the dot product between the weights vector and the data point. The second issue is the restriction on the adversary's action. Regardless of the actual type of perturbation that the adversary uses, one needs to bound the extent to which it is applied. If no constraint is specified, the adversary will choose its action in such a way that the signal to noise ratio (SNR) will vanish, and the data will no longer carry any information. In the Robust-SVM formulation, the adversary is constrained to perform a displacement within a bounded ball. In the case of FDROP, no more than a pre-defined number of features may be deleted.

Note that robust formulations are closely related to the notion of consistency. A classifier is said to be consistent if close enough data points are predicted to have the same label. Different adversarial models befit different notions of distance. For example, the box-uncertainty model is related to the Euclidean metric and feature deletion suits the Hamming distance.

It should be mentioned that robustness has quite a few meanings in the literature of statistics and machine learning. In this work, we use robustness in the sense of robust optimization (RO), i.e. minimizing the worst-case loss under given circumstances.
1.5 Between robustness and regularization
The fact that robustness is related to regularization and generalization is not too surprising. Indeed, first equivalence results were established for learning problems other than classification more than a decade ago (Ghaoui and Lebret [1997]; Xu et al. [2008]; Bishop [1994]). Recently, Xu et al. [2009] have proven that the regularization employed by SVM is equivalent to a robust formulation. Specifically, they have shown that the following two formulations are equivalent:
$$\min_{\mathbf{w}, b}\ \lambda\|\mathbf{w}\| + \sum_{m=1}^{M}\left[1 - y_m(\mathbf{w}^T\mathbf{x}_m - b)\right]_+$$
$$\min_{\mathbf{w}, b}\ \max_{\sum_m \|\boldsymbol{\delta}_m\|^* \le \lambda}\ \sum_{m=1}^{M}\left[1 - y_m(\mathbf{w}^T(\mathbf{x}_m - \boldsymbol{\delta}_m) - b)\right]_+$$
where $\|\cdot\|^*$ is the dual norm. This equivalence has a strong geometric interpretation, and sheds a new light on the function of the tuning parameter of SVM. Using the notion of robustness, a consistency result for SVM was given, without the use of VC or stability arguments. The novelty of that work stems from the fact that most previous works on robust classification were not aimed at relating robustness to regularization. Rather, the models were based on an already regularized SVM formulation, in which the loss measure was effectively modified.
1.6 Our contribution
In this work we adopt the idea of using robustness as a means to achieve generalization. We present a new robust-learning setup in which each data point is altered by a stochastic cloud centered on it. The loss is then evaluated as the expectation of an underlying loss on the cloud. The parameters of this cloud's distribution are chosen in an adversarial fashion. We analyze the case in which the adversary is restricted to choose a Gaussian cloud with a trace-bounded covariance matrix. Then we show that this formulation culminates in a smooth upper-approximation of the hinge loss, which gets tighter as the cloud around each data sample shrinks. This loss function can be shown to have a convex perspective structure. By deriving the dual problem, we are able to demonstrate a method of generating new smooth loss functions. Our algorithmic approach is to directly solve the primal problem. We show that this yields a learning algorithm which generalizes as well as SVM on synthetic as well as real data. Generalizations to the non-linear and multiclass cases are given.
1.7 Related work
Other works have incorporated a noise model into the learning setup. For example, baptiste Pothin and Richard [2008] have warped the data points with ellipsoids. Pozdnoukhov et al. [2005] have shown how to train classifiers for distributions. Similarly to what we do in this work, they use tails of distributions in their derivation. Their work, however, treated each data class as a distribution, whereas in this work we attach a noise distribution to each data point separately. Bhattacharyya et al. [2004b]; Shivaswamy et al. [2006] have employed second order cone programming (SOCP) methods in order to handle the uncertainty in the data. Bhattacharyya et al. [2004a] have assumed stochastic clouds instead of discrete points, as we do, but they did not try to minimize the expectation of the loss function over the cloud. Instead, their idea was to incorporate the clouds into the soft margin framework. Bi and Zhang [2004] have tried to learn a better classifier by presenting the learning algorithm 'more reasonable' samples. We elaborate on this model in Appendix A.

Smooth loss functions were studied by Zhang et al. [2003]; Chapelle [2007]. Analysis of methods for solving SVM and SVM-like problems using the primal formulation was done by Shalev-Shwartz et al. [2007a]; Chapelle [2007].

The rest of this document is organized as follows: in Chapter 2 we present our framework formally, derive the explicit form of the smooth loss function and devise an algorithm that finds the optimal classifier. In Chapter 3 we derive a dual formulation for the problem, and point out that our model may be generalized to other loss functions. In Chapter 4 we apply the kernel trick and devise a method for training non-linear classifiers at the same cost as for the linear kernel. Chapter 5 contains a generalization of the binary algorithm to the multiclass case. At last, in Chapter 6 we discuss the contributions presented in this work and mention possible directions for future work. In Appendix A we discuss a far more basic version of resistance to noise. The results of the first section therein are not original and are presented here only for the sake of logical order. The next section contains a simple generalization to the multiclass case. Appendix B gives the solution to the adversarial choice problem for an adversary that is restricted to spread the noise along the primary axes. At last, in Appendix C we explain why we find the usual multiclass hinge loss inapplicable in our framework.

Chapter 2
Gaussian Robust Classification
In this work we take the approach of robust optimization (RO). Accordingly, we present a min-max learning framework, in which the learner strives to minimize the loss, whereas the adversary tries to maximize it. The model that we introduce in this chapter has two layers of 'robustness'. Firstly, we use the min-max robustness, which lies at the foundations of RO. Secondly, we effectively enhance the training dataset by taking into consideration all the possible outputs of the adversarial perturbation. More concretely, we alter each training sample with a stochastic cloud. The shape of this cloud is chosen by the adversary from a pre-determined family of distributions. The spreading of the samples should be understood as adding noise, where different disturbances take place with different probabilities. The loss on each sample is finally computed as the expectation of an underlying loss on the respective cloud.
2.1 Problem Formulation
In this section we formally describe the model we investigate in this work. We take the hinge loss as the underlying loss function, and build the learning framework on top of it. We then show that the new framework we introduce is equivalent to an unconstrained minimization of an effective loss function.

Recall that the hinge loss is defined as
$$\ell_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}) = \left[1 - y_m \mathbf{w}^T\mathbf{x}_m\right]_+ \tag{2.1}$$
We introduce the expected hinge loss
$$\ell^{E}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{D}) = \mathbb{E}_{\mathbf{n}\sim\mathcal{D}}\left[1 - y_m \mathbf{w}^T(\mathbf{x}_m + \mathbf{n})\right]_+ \tag{2.2}$$
where $\mathcal{D}$ is a predefined noise distribution over the sample space. The optimization problem for learning a classifier w.r.t. the expected hinge loss is thus
$$\min_{\mathbf{w}}\ \sum_{m=1}^{M} \ell^{E}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{D}) \tag{2.3}$$
Granting an adversary the ability to choose the noise distribution, we end up with the following formulation
$$\min_{\mathbf{w}}\ \max_{\mathcal{D}_1\times\ldots\times\mathcal{D}_M \in\ \mathcal{C}_1\times\ldots\times\mathcal{C}_M}\ \sum_{m=1}^{M} \ell^{E}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{D}_m) \tag{2.4}$$
where $\mathcal{C}_m$ is the set of allowed noise distributions for the $m$th sample. In order for the adversarial optimization to be meaningful, all $\mathcal{C}_m$'s should have a 'bounded' nature. We now alter the order of maximization and summation, and write
$$\min_{\mathbf{w}}\ \sum_{m=1}^{M}\ \max_{\mathcal{D}_m \in \mathcal{C}_m}\ \ell^{E}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{D}_m) \tag{2.5}$$
At last, we observe that the optimization task at hand is nothing else than optimizing the effective loss function
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{C}) = \max_{\mathcal{D}\in\mathcal{C}}\ \mathbb{E}_{\mathbf{n}\sim\mathcal{D}}\left[1 - y_m \mathbf{w}^T(\mathbf{x}_m + \mathbf{n})\right]_+ \tag{2.6}$$
We refer to $\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \mathcal{C})$ as the expected robust hinge loss.
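For a fixed noise distribution, the expected hinge loss of Equation 2.2 can be approximated by straightforward Monte Carlo sampling of the cloud. The following sketch (an illustration on assumed toy values, not part of the thesis) does this, and also shows that the expectation reduces to the plain hinge loss as the cloud shrinks:

```python
import numpy as np

def expected_hinge_mc(x, y, w, Sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of Equation 2.2: E_n [1 - y w^T (x + n)]_+ with n ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=n_samples)
    margins = 1.0 - y * (noise + x) @ w
    return float(np.mean(np.maximum(margins, 0.0)))

# Toy check: as Sigma -> 0 the estimate approaches the plain hinge loss.
x, y, w = np.array([1.0, 0.5]), 1, np.array([0.8, -0.2])
print(expected_hinge_mc(x, y, w, 1e-8 * np.eye(2)))   # ~ [1 - y w^T x]_+ = 0.3
print(expected_hinge_mc(x, y, w, 0.5 * np.eye(2)))    # larger: the cloud increases the loss
```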
2.2 The Adversarial Choice
Equation 2.5 presents the general noise-robust formulation. In the following, we will derive an explicit loss function for a specific collection of noise distributions. We focus on the case in which the adversary is constrained to spread a Gaussian noise having a trace-bounded covariance matrix. The motivation behind this constraint is physical. When noise is modeled with a distribution, its covariance is considered as its power. Thus, by constraining the sum of the eigenvalues of the covariance matrix we bound the power that the adversary can spread. The Gaussian noise is the worst-case noise, in the sense that amongst all distributions with a certain power bound it has the maximal entropy.

Using the notations of the previous section, we specify the restriction on the adversary as
$$\mathcal{C} = \mathcal{C}_\beta = \left\{\mathcal{D} \sim \mathcal{N}(\mathbf{0}, \Sigma) \mid \Sigma \in \Lambda_\beta\right\}$$
where $\Lambda_\beta = \{\Sigma \in \mathrm{PSD} \mid \mathrm{Tr}(\Sigma) \le \beta\}$, i.e. Gaussian distributions having the zero vector as mean and a covariance matrix with a bounded sum of eigenvalues.

In the next couple of sections we will characterize the adversarial choice of the covariance matrix and derive an explicit loss function.

The following paragraphs are rather technical. For later use, we explicitly perform the integration of the robust hinge loss function. We then prove a monotonicity property of the integrated loss. This property will help us analyze the nature of the adversarial choice in our case. The key observation throughout the derivation is that the multivariate expectation can be transformed into a univariate problem.

We plug the notations that were introduced above into Equation 2.6 and get:
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \Sigma) = \max_{\Sigma\in\Lambda_\beta}\ c\,|\Sigma|^{-\frac12}\int e^{-\frac12 \mathbf{n}^T\Sigma^{-1}\mathbf{n}}\left[1 - y_m\mathbf{w}^T(\mathbf{x}_m + \mathbf{n})\right]_+ d\mathbf{n} \tag{2.7}$$
where $c = (2\pi)^{-d/2}$ is the normalization constant. This is equivalent to:
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \Sigma) = \max_{\Sigma\in\Lambda_\beta}\ c\,|\Sigma|^{-\frac12}\int e^{-\frac12 \mathbf{n}^T\Sigma^{-1}\mathbf{n}}\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - y_m\mathbf{w}^T\mathbf{n}\right]_+ d\mathbf{n} \tag{2.8}$$
As a first step in the analysis of the expected robust hinge loss, we shall handle the quantity
$$Q \stackrel{\text{def}}{=} c\,|\Sigma|^{-\frac12}\int e^{-\frac12 \mathbf{n}^T\Sigma^{-1}\mathbf{n}}\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - y_m\mathbf{w}^T\mathbf{n}\right]_+ d\mathbf{n} \tag{2.9}$$
Note that the above depends on $\mathbf{n}$ only via products of the form $\mathbf{w}^T\mathbf{n}$. Therefore, we define a new scalar variable $u = y_m\mathbf{w}^T\mathbf{n}$. Equation 2.9 can now be viewed as the expected value of $g(u) = [1 - y_m\mathbf{w}^T\mathbf{x}_m - u]_+$. The moments of $u$ are
$$\mathbb{E}[u] = \mathbb{E}\left[y_m\mathbf{w}^T\mathbf{n}\right] = y_m\mathbf{w}^T\,\mathbb{E}[\mathbf{n}] = 0$$
and
$$\mathrm{Var}[u] = \mathrm{Var}\left[y_m\mathbf{w}^T\mathbf{n}\right] = y_m\mathbf{w}^T\,\mathrm{Var}[\mathbf{n}]\,y_m\mathbf{w} = (y_m)^2\,\mathbf{w}^T\Sigma\mathbf{w} = \mathbf{w}^T\Sigma\mathbf{w}$$
Thus we get
$$Q = \frac{1}{\sqrt{2\pi\,\mathbf{w}^T\Sigma\mathbf{w}}}\int e^{-\frac{u^2}{2\,\mathbf{w}^T\Sigma\mathbf{w}}}\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - u\right]_+ du \tag{2.10}$$
Define $\mathrm{erf}(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}\exp(-z^2/2)\,dz$. In addition, denote $\sigma(\mathbf{w}, \Sigma) = \mathbf{w}^T\Sigma\mathbf{w}$. The following proposition holds.
Proposition 1
$$Q = \left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sqrt{\sigma(\mathbf{w},\Sigma)}}\right) + \sqrt{\frac{\sigma(\mathbf{w},\Sigma)}{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\,\sigma(\mathbf{w},\Sigma)}\right)$$

Proof: We conduct a direct computation:
$$\begin{aligned}
Q &= \frac{1}{\sqrt{2\pi\,\sigma(\mathbf{w},\Sigma)}}\int e^{-\frac{u^2}{2\sigma(\mathbf{w},\Sigma)}}\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - u\right]_+ du \\
&= \frac{1}{\sqrt{2\pi\,\sigma(\mathbf{w},\Sigma)}}\int_{-\infty}^{1 - y_m\mathbf{w}^T\mathbf{x}_m} e^{-\frac{u^2}{2\sigma(\mathbf{w},\Sigma)}}\left(1 - y_m\mathbf{w}^T\mathbf{x}_m - u\right) du \\
&= \frac{1}{\sqrt{2\pi\,\sigma(\mathbf{w},\Sigma)}}\left(\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\int_{-\infty}^{1 - y_m\mathbf{w}^T\mathbf{x}_m} e^{-\frac{u^2}{2\sigma(\mathbf{w},\Sigma)}}\, du - \int_{-\infty}^{1 - y_m\mathbf{w}^T\mathbf{x}_m} u\, e^{-\frac{u^2}{2\sigma(\mathbf{w},\Sigma)}}\, du\right) \\
&= \left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sqrt{\sigma(\mathbf{w},\Sigma)}}\right) - \frac{1}{\sqrt{2\pi\,\sigma(\mathbf{w},\Sigma)}}\int_{-\infty}^{1 - y_m\mathbf{w}^T\mathbf{x}_m} u\, e^{-\frac{u^2}{2\sigma(\mathbf{w},\Sigma)}}\, du
\end{aligned}$$
By using the variable substitution theorem and observing that the remaining integrand is an odd function (so the identity $\int_{-\infty}^{t}\text{odd} = \int_{-\infty}^{-t}\text{odd}$ holds), we conclude that
$$Q = \left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sqrt{\sigma(\mathbf{w},\Sigma)}}\right) + \sqrt{\frac{\sigma(\mathbf{w},\Sigma)}{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\,\sigma(\mathbf{w},\Sigma)}\right) \tag{2.11}$$
Let us establish the following simple property of $Q$.

Lemma 2 $Q$ is monotone-increasing in $\sigma$.

Proof:
The fundamental theorem of calculus yields that
$$\frac{d}{dt}\mathrm{erf}(t) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{t^2}{2}\right) \tag{2.12}$$
Using the chain rule we compute
$$\frac{dQ}{d\sigma} = \frac{1}{2\sqrt{2\pi}\sqrt{\sigma}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\,\sigma(\mathbf{w},\Sigma)}\right) \tag{2.13}$$
It is evident that for all $\sigma \ge 0$
$$\frac{dQ}{d\sigma} \ge 0 \tag{2.14}$$
i.e. $Q$ is monotone-increasing in $\sigma$.

The optimal covariance matrix subject to a trace constraint

We will now focus on finding the optimal adversary, i.e., performing the maximization of Equation 2.8 over the range of allowed covariance matrices. The next theorem specifies which covariance matrix attains the worst-case loss. In our terminology, we refer to this result as the adversarial choice.

Theorem 2.1:
The optimal $\Sigma$ in Equation 2.8 is given by $\Sigma^* = \beta\,\frac{\mathbf{w}\mathbf{w}^T}{\|\mathbf{w}\|^2}$, where the optimization is done over $\Sigma \in \Lambda_\beta$.

Before actually proving the theorem, we will give some geometric intuition. The idea behind the expected loss is to replace the original sample point with a Gaussian cloud centered at the original point (Figure 2.1a). Consider an arbitrary displacement $\hat{\mathbf{x}}_m = \mathbf{x}_m + \mathbf{n}$. For a fixed $\mathbf{w}$, $\mathbf{n}$ can be written as $\mathbf{n} = \mathbf{n}_\parallel + \mathbf{n}_\perp$. The relevant quantity is $\mathbf{w}^T\hat{\mathbf{x}}_m = \mathbf{w}^T\mathbf{x}_m + \mathbf{w}^T\mathbf{n}_\parallel$, that is, the orthogonal component does not have any effect. Accordingly, it makes sense that the optimal noise direction is orthogonal to the separating hyperplane, i.e. parallel to the vector $\mathbf{w}$ (see Figure 2.1b).

Figure 2.1: (a) Replacing the sample point with a Gaussian cloud. (b) The optimal noise direction is orthogonal to the separating hyperplane. (c) The expected robust hinge loss only considers the tail of the distribution, i.e. the points that suffer a margin error.
The proof of Theorem 2.1 applies simple algebraic arguments to establish this result rigorously.
Proof:
Plugging Proposition 1 into Equation 2.8 we get
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \Sigma) = \max_{\Sigma\in\Lambda_\beta}\left[\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sqrt{\sigma(\mathbf{w},\Sigma)}}\right) + \sqrt{\frac{\sigma(\mathbf{w},\Sigma)}{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\,\sigma(\mathbf{w},\Sigma)}\right)\right]$$
The above depends on $\Sigma$ only via $\sigma(\mathbf{w},\Sigma)$. According to Lemma 2, the objective is monotone increasing in $\sigma$. Therefore, the adversary would like to choose $\Sigma$ so that $\sigma(\mathbf{w},\Sigma)$ is maximized. By applying the Cauchy-Schwarz inequality, we conclude that the maximum value of $\sigma(\mathbf{w},\Sigma)$ is $\lambda_{\max}(\Sigma)\,\|\mathbf{w}\|^2$. For all $\Sigma \in \Lambda_\beta$ it holds that $\mathrm{Tr}(\Sigma) \le \beta$. Since all of the eigenvalues are nonnegative, it holds that $\lambda_{\max} \le \beta$ as well. Consider the candidate solution $\Sigma_0 = \beta\,\frac{\mathbf{w}\mathbf{w}^T}{\|\mathbf{w}\|^2}$. Since $\sigma(\mathbf{w},\Sigma_0) = \beta\|\mathbf{w}\|^2$, this selection attains the maximum. Note that this covariance matrix reflects the fact that the adversarial choice is to spread the noise along $\mathbf{w}$, i.e. perpendicularly to the separating hyperplane.
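Both the closed form of Proposition 1 and the adversarial choice of Theorem 2.1 are easy to check numerically. The sketch below (illustrative, assuming NumPy/SciPy; the thesis's erf is the standard normal CDF, written Phi here) verifies that the candidate covariance is trace-feasible, attains σ(w, Σ*) = β‖w‖², and yields a loss at least as large as that of randomly drawn trace-feasible covariance matrices:

```python
import numpy as np
from scipy.special import erf as _erf

def Phi(t):
    """Standard normal CDF; this is what the thesis denotes erf(t)."""
    return 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))

def Q_closed_form(x, y, w, Sigma):
    """Proposition 1: closed form of the expected hinge loss for a fixed covariance."""
    a = 1.0 - y * (w @ x)
    s2 = w @ Sigma @ w                                   # sigma(w, Sigma) = w^T Sigma w
    return a * Phi(a / np.sqrt(s2)) + np.sqrt(s2 / (2 * np.pi)) * np.exp(-a**2 / (2 * s2))

rng = np.random.default_rng(1)
d, beta, y = 4, 0.3, 1
x, w = rng.normal(size=d), rng.normal(size=d)

# Adversarial covariance of Theorem 2.1: all the power along the direction of w.
Sigma_star = beta * np.outer(w, w) / (w @ w)
assert np.isclose(np.trace(Sigma_star), beta)            # feasible: Tr(Sigma*) = beta
assert np.isclose(w @ Sigma_star @ w, beta * (w @ w))    # attains sigma(w, Sigma*) = beta ||w||^2

# Any other trace-feasible PSD matrix yields a smaller variance, hence a smaller Q (Lemma 2).
for _ in range(100):
    A = rng.normal(size=(d, d))
    S = A @ A.T
    S *= beta / np.trace(S)                              # rescale so that Tr(S) = beta
    assert w @ S @ w <= w @ Sigma_star @ w + 1e-9
    assert Q_closed_form(x, y, w, S) <= Q_closed_form(x, y, w, Sigma_star) + 1e-9
```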
2.3 A smooth loss function
In the previous sections we have done the technical computations needed in order to derive the robust hinge loss explicitly, and found the optimal covariance matrix. In the following we put these results together, and present an explicit formulation of the loss function resulting from our model. In addition, it is shown that our robust loss can be represented as a perspective of a scalar smooth approximation of the hinge loss. By analyzing this function we are able to gain a better understanding of $\ell^{rob}_{\text{hinge}}$. We conclude this section by showing that our loss function is a smooth convex upper-approximation of the hinge loss. When the 'diameter' of the noise cloud shrinks, $\ell^{rob}_{\text{hinge}}$ coincides with the hinge loss.

We devote a notation for the result of Proposition 1:
$$L(\mathbf{x}_m, y_m; \mathbf{w}, \sigma) = \left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma}\right) + \frac{\sigma}{\sqrt{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\sigma^2}\right)$$
By combining the above equation with the result of Theorem 2.1 we conclude
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \beta) = \left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sqrt{\beta}\,\|\mathbf{w}\|}\right) + \frac{\sqrt{\beta}\,\|\mathbf{w}\|}{\sqrt{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\beta\|\mathbf{w}\|^2}\right)$$
$\beta$ has the meaning of a statistical variance, and therefore in the following we will replace it with $\sigma^2$ (not to be confused with $\sigma(\mathbf{w},\Sigma)$). In order to understand the nature of the loss function we have defined, it is suggestive to define
$$f(z) = z\,\mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}}e^{-z^2/2} \tag{2.15}$$
Using $f$, the robust expected hinge loss can be written as
$$\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \sigma) = \sigma\|\mathbf{w}\|\, f\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) \tag{2.16}$$
A direct computation shows that
$$\frac{df}{dz} = \mathrm{erf}(z) + \frac{z}{\sqrt{2\pi}}e^{-z^2/2} - \frac{z}{\sqrt{2\pi}}e^{-z^2/2} = \mathrm{erf}(z) \tag{2.17}$$
We are now ready to prove a simple yet fundamental property of $f$.

Theorem 3.1: $f$ is a smooth strictly-convex upper-approximation of the hinge loss.

Figure 2.2: The function $f$ is a smooth approximation of the hinge loss.

Proof:
Denote the hinge loss $h(z) = [z]_+$. We must show that
1. $f$ is strictly-convex
2. $f(z) \ge h(z)$
3. $\lim_{z\to-\infty} f(z) - h(z) = 0$
4. $\lim_{z\to\infty} f(z) - h(z) = 0$

Differentiating Equation 2.17 once again, we get
$$\frac{d^2 f}{dz^2} = \frac{1}{\sqrt{2\pi}}e^{-z^2/2} \tag{2.18}$$
which is clearly positive for all $z$. Thus, $f$ is strictly-convex.

For the upper bound property, notice that $f(z) \ge 0$ for all $z \in \mathbb{R}$. Hence, for $z < 0$ we simply have $f(z) > h(z)$. For the complementary case, denote the difference function over the positives $\delta(z) = f(z) - h(z) = z\,\mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}}e^{-z^2/2} - z$. Using Equation 2.17 we obtain
$$\frac{d\delta}{dz} = \mathrm{erf}(z) - 1 \tag{2.19}$$
It can be easily seen that $\frac{d\delta}{dz} < 0$, i.e. $\delta$ is monotone decreasing. Observe that $\delta(0) = \frac{1}{\sqrt{2\pi}}$ and $\lim_{z\to\infty}\delta(z) = 0$. Since all the functions involved are continuous, we conclude that for $z \ge 0$ it holds that $h(z) \le f(z) \le h(z) + \frac{1}{\sqrt{2\pi}}$. Altogether, we have established the upper bound property.

For the asymptote at $z\to-\infty$, observe that by l'Hopital's rule $\lim_{z\to-\infty} z\,\mathrm{erf}(z) = 0$. Since the exponent in the right summand of $f$ decays as well, we have that as $z\to-\infty$ both $f(z)$ and $h(z)$ tend to $0$.

For the asymptote at $z\to\infty$, we must show that $f$ asymptotically coincides with the linear function $z$. To this end, let us write $f(z) - z = z\,(\mathrm{erf}(z) - 1) + \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$. Applying l'Hopital's rule along with the asymptotic behavior of the exponent, we deduce that $\lim_{z\to\infty} f(z) - z = 0$, as desired.

Next, we will analyze the relation between $f$ and $\ell^{rob}_{\text{hinge}}$.

Definition 3
Perspective of a function (from Boyd and Vandenberghe [2004]). If $f: \mathbb{R}^n \to \mathbb{R}$, then the perspective of $f$ is the function $g: \mathbb{R}^{n+1} \to \mathbb{R}$ defined by
$$g(t, \mathbf{x}) = t\,f\!\left(\frac{\mathbf{x}}{t}\right)$$
with domain $\mathrm{dom}(g) = \left\{(\mathbf{x}, t) \,\middle|\, \frac{\mathbf{x}}{t} \in \mathrm{dom}(f),\ t > 0\right\}$.

The following lemma is useful. For a proof see, e.g., Boyd and Vandenberghe [2004].

Lemma 4 If $f$ is convex (concave), then its perspective is convex (concave) as well.

Define the function
$$g(a, b) = a\,f\!\left(\frac{b}{a}\right) \tag{2.20}$$
Lemma 4 implies that $g\left(\sigma\|\mathbf{w}\|,\ 1 - y_m\mathbf{w}^T\mathbf{x}_m\right)$ is jointly convex in both its arguments. In order to establish the strict-convexity of $\ell^{rob}_{\text{hinge}}$ in $\mathbf{w}$, we need a more powerful tool. Consider the following lemma (Boyd and Vandenberghe [2004]).

Lemma 5 Let $h: \mathbb{R}^k \to \mathbb{R}$ and $g_i: \mathbb{R}^n \to \mathbb{R}$. Consider the function $f(\mathbf{x}) = h(\mathbf{g}(\mathbf{x})) = h(g_1(\mathbf{x}), g_2(\mathbf{x}), \ldots, g_k(\mathbf{x}))$. Then $f$ is convex if $h$ is convex, $h$ is nondecreasing in each argument, and the $g_i$ are convex.

This lemma can be easily generalized to the case of strictly-convex functions. The proof is identical to that of the original version, and thus will be skipped. We are now ready to prove the following theorem.

Theorem 3.2: $\ell^{rob}_{\text{hinge}}$ is strictly-convex in $\mathbf{w}$.

Proof:
From Lemma 4, $g$ is convex. In addition, $g$ is nondecreasing in each of its arguments. To see that, observe that
$$\frac{\partial g}{\partial a} = f\!\left(\frac{b}{a}\right) - \frac{b}{a}f'\!\left(\frac{b}{a}\right) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{b^2}{2a^2}\right), \qquad \frac{\partial g}{\partial b} = f'\!\left(\frac{b}{a}\right) = \mathrm{erf}\!\left(\frac{b}{a}\right)$$
which are both strictly positive. $\sigma\|\mathbf{w}\|$ and $1 - y_m\mathbf{w}^T\mathbf{x}_m$ are both convex in $\mathbf{w}$; thus we conclude by applying Lemma 5.

The next theorem explores some of the other properties of the loss function we have defined.

Theorem 3.3: $\ell^{rob}_{\text{hinge}}$ is an upper-approximation to the hinge loss. Furthermore, when $\sigma \to 0$, the loss function $\ell^{rob}_{\text{hinge}}$ coincides with the hinge loss.

Proof:
For the upper bound property, we apply Theorem 3.1:
$$\sigma\|\mathbf{w}\|\,f\!\left(\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{\sigma\|\mathbf{w}\|}\right) \ge \sigma\|\mathbf{w}\|\,h\!\left(\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{\sigma\|\mathbf{w}\|}\right) = \sigma\|\mathbf{w}\|\left[\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{\sigma\|\mathbf{w}\|}\right]_+ = \left[1 - y_i\mathbf{w}^T\mathbf{x}_i\right]_+$$
For the second part of the theorem, let us first observe that
$$\frac{\sigma\|\mathbf{w}\|}{\sqrt{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\sigma^2\|\mathbf{w}\|^2}\right) \to 0 \tag{2.21}$$
as a multiplication of two vanishing factors when $\sigma \to 0$. We consider two cases:
1. $1 - y_m\mathbf{w}^T\mathbf{x}_m \ge 0$. Observe that
$$\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) \to \mathrm{erf}(\infty) = 1$$
Thus, $\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \sigma) \to 1 - y_m\mathbf{w}^T\mathbf{x}_m$.
2. $1 - y_m\mathbf{w}^T\mathbf{x}_m < 0$. In this case
$$\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) \to \mathrm{erf}(-\infty) = 0$$
Thus, $\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \sigma) \to 0$.
Altogether, we have shown that when $\sigma \to 0$, $\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \sigma) \to \left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right]_+$.

Observe that at $\mathbf{w} = \mathbf{0}$ the loss function is not continuous. The discontinuity is removable, however, so this issue does not pose any problem.

Figure 2.3: $\ell^{rob}_{\text{hinge}}$ is a convex upper-approximation to the hinge loss. As $\sigma$ tends to $0$, $\ell^{rob}_{\text{hinge}}$ tends to the hinge. In all of the graphs, the norm $\|\mathbf{w}\|$ was set to 1.

The norm of the classifier $\|\mathbf{w}\|$ always appears in a multiplication with $\sigma$. Thus, we observe that it has a similar function. Namely, it controls the tightness of the approximation of the smooth loss function to the hinge. Since $\sigma$ is pre-determined, the optimal norm should reflect some kind of compensation. We thus conjecture that there exists an inverse ratio between $\sigma$ and the optimal norm (cf. Chapter 4).

At last, it should be noted that this smooth loss function can be viewed as a multiplicatively regularized loss.
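The following sketch (illustrative, assuming NumPy/SciPy) implements $f$ from Equation 2.15 and the perspective form of Equation 2.16, and checks the two properties just proven: $f$ upper-bounds the hinge by at most $1/\sqrt{2\pi}$, and $\ell^{rob}_{\text{hinge}}$ tends to the plain hinge loss as $\sigma \to 0$:

```python
import numpy as np
from scipy.special import erf as _erf

def Phi(t):
    """Standard normal CDF (the thesis's erf)."""
    return 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))

def f(z):
    """The smooth surrogate of Equation 2.15: f(z) = z Phi(z) + exp(-z^2/2)/sqrt(2 pi)."""
    return z * Phi(z) + np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def robust_hinge(x, y, w, sigma):
    """Equation 2.16: the expected robust hinge loss as a perspective of f."""
    s = sigma * np.linalg.norm(w)
    return s * f((1.0 - y * (w @ x)) / s)

z = np.linspace(-5, 5, 1001)
hinge = np.maximum(z, 0.0)
assert np.all(f(z) >= hinge)                              # f upper-bounds the hinge (Theorem 3.1)
assert np.all(f(z) - hinge <= 1 / np.sqrt(2 * np.pi) + 1e-12)

x, y, w = np.array([0.3, -1.0]), -1, np.array([0.5, 0.4])
plain_hinge = max(0.0, 1.0 - y * (w @ x))
for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, robust_hinge(x, y, w, sigma))             # decreases towards the plain hinge
print(plain_hinge)                                         # limit as sigma -> 0 (Theorem 3.3)
```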
2.4 GURU: a primal algorithm
We are now ready to devise an algorithm that solves our learning problem. In this section we describe a stochastic gradient descent (SGD) method that minimizes the strictly-convex loss function at hand. A convergence result for the algorithm stems from general properties of SGD that were studied extensively (see Shalev-Shwartz et al. [2007b]; Kivinen et al. [2003]; Zhang et al. [2003]; Nedic and Bertsekas [2000]; Bottou and Bousquet [2008], e.g.).

Plugging the robust hinge loss function we have derived (Equation 2.15) into the original optimization task (Equation 2.5), we get
$$\min_{\mathbf{w}}\ \sum_{m=1}^{M}\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)\mathrm{erf}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) + \frac{\sigma\|\mathbf{w}\|}{\sqrt{2\pi}}\exp\!\left(-\frac{\left(1 - y_m\mathbf{w}^T\mathbf{x}_m\right)^2}{2\sigma^2\|\mathbf{w}\|^2}\right) \tag{2.22}$$
This formulation is a convex unconstrained minimization task. One very natural approach for solving this kind of task is using the family of gradient descent methods. Denote the objective of the optimization as
$$G(\mathbf{w}) = \sum_i g_i(\mathbf{w}) \tag{2.23}$$
In batch gradient descent, in each step the algorithm updates
$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla G(\mathbf{w}) \tag{2.24}$$
In stochastic gradient methods the gradient is approximated by the gradient of one of the summands. Thus, the algorithm first randomizes an index $i$, then updates
$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla g_i(\mathbf{w}) \tag{2.25}$$
where $\eta$ is the learning rate. The stochastic version suits settings of online learning, in which the learner is presented one training sample at a time. It has been suggested that using the stochastic version yields better generalization performance in learning tasks (Amari [1998]; Bottou and LeCun [2003]).

Our algorithm, named GURU (GaUssian RobUst), optimizes Equation 2.22 using an SGD procedure. (For a full treatment see, e.g., Boyd (ref).)

In order to derive the update formula, one should first calculate the gradient of the loss function. A straightforward computation yields
$$\nabla_{\mathbf{w}}\,\ell^{rob}_{\text{hinge}}(\mathbf{x}_i, y_i; \mathbf{w}, \sigma) = -y_i\mathbf{x}_i\,\mathrm{erf}\!\left(\frac{1 - y_i\mathbf{w}^T\mathbf{x}_i}{\sigma\|\mathbf{w}\|}\right) + \frac{\sigma\,\mathbf{w}}{\sqrt{2\pi}\,\|\mathbf{w}\|}\exp\!\left(-\frac{\left(1 - y_i\mathbf{w}^T\mathbf{x}_i\right)^2}{2\sigma^2\|\mathbf{w}\|^2}\right) \tag{2.26}$$

Table 2.1: Description of the databases used in the binary case.
We therefore suggest the following SGD procedure
Algorithm 1: GURU($S$, $\eta$, $\epsilon$)
Data: Training set $S$, learning rate $\eta$, accuracy $\epsilon$
Result: $\mathbf{w}$
  $\mathbf{w} \leftarrow \mathbf{0}$; $t \leftarrow 1$
  while $\Delta L \ge \epsilon$ do
    $m \leftarrow \mathrm{rand}(M)$
    $\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{\sqrt{t}}\,\nabla_{\mathbf{w}}\,\ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}, \sigma)$
    $t \leftarrow t + 1$
  end
  return $\mathbf{w}$

For convergence results see Nedic and Bertsekas [2000]. For a full treatment, see Bertsekas et al. [2003].
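A minimal Python sketch of Algorithm 1 is given below (illustrative, assuming NumPy/SciPy; for simplicity it runs for a fixed number of iterations instead of the ΔL stopping rule, and starts w slightly away from 0, where the loss has its removable discontinuity). The gradient function implements Equation 2.26 directly:

```python
import numpy as np
from scipy.special import erf as _erf

def Phi(t):
    """Standard normal CDF (the thesis's erf)."""
    return 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))

def guru_gradient(x, y, w, sigma):
    """Gradient of the robust hinge loss at one sample (Equation 2.26)."""
    norm_w = np.linalg.norm(w)
    z = (1.0 - y * (w @ x)) / (sigma * norm_w)
    return -y * x * Phi(z) + (sigma * w / (np.sqrt(2 * np.pi) * norm_w)) * np.exp(-z**2 / 2)

def guru(X, Y, sigma, eta=0.1, n_iters=5000, seed=0):
    """SGD loop in the spirit of Algorithm 1 (fixed iteration budget)."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    w = 1e-3 * rng.normal(size=d)
    for t in range(1, n_iters + 1):
        m = rng.integers(M)                               # pick a random training sample
        w -= (eta / np.sqrt(t)) * guru_gradient(X[m], Y[m], w, sigma)
    return w

# Toy usage on two roughly separated Gaussian blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.5, 1.0, size=(100, 2)), rng.normal(-1.5, 1.0, size=(100, 2))])
Y = np.hstack([np.ones(100), -np.ones(100)])
w = guru(X, Y, sigma=0.5)
print(np.mean(np.sign(X @ w) == Y))                       # training accuracy of the learned half-space
```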
2.5 Experiments
In this section we present experimental results demonstrating that GURU generalizes as well as SVM. Experiments were carried out on two toy problems (see Figure 2.4 for a visualization), USPS handwritten digits classification (3 vs. 5, 5 vs. 8 and 7 vs. 9) and a couple of UCI databases (Frank and Asuncion [2010]). The sizes of the data sets are detailed in Table 2.1.

Name               GURU (%)   SVM (%)
Toy (a)            92.5       92.5
Toy (b)            92         92
Ionosphere         82.24      79.61
Diabetes           67.52      67.31
Splice (1 vs. 2)   93.39      92.44
USPS 3 vs. 5       95.57      95.86
USPS 5 vs. 8       97.71      98
USPS 7 vs. 9       97.57      97.43

Table 2.2: Summary of the results: GURU and SVM.
We have trained GURU for a range of σ values spanning several orders of magnitude, with exponential jumps. The learning rate was tuned empirically over a range of values. SVM was trained and tested using the SVM-light package, over a range of λ values. Note that in the SVM-light formulation, λ multiplies the loss and not the regularization term. Thus, the qualitative relation between λ and σ is roughly σ ∼ 1/λ. Parameter selection was done based on the cross-validation set, and performance was evaluated for the optimal parameter on a testing set. The results are summarized in Table 2.2.

On the toy databases (a)-(b), the performance of GURU is identical to that of SVM. We have tested the learned classifiers' resistance to noise by adding uniformly distributed random noise to both the cross-validation and test sets. The results are presented in Figure 2.5. Observe that the resistance of GURU slightly outperforms that of SVM. Nonetheless, this result gives experimental support to the theoretical result in Xu et al. [2009], where it was shown that the ordinary SVM formulation is equivalent to a robust formulation, in which the adversary is capable of displacing the data samples.

On the Ionosphere database, GURU significantly outperforms SVM. The samples of this database consist of radar readings. Thus, GURU's performance may be understood by the noisy nature of the samples. This finding supports the intuition that GURU performs well in noisy setups.

On USPS, the performance of GURU is quite similar to that of SVM. Since the samples can be easily visualized as images, it is convenient to examine the adversarial action in this case. Consider Figure 2.6. The GURU adversary is symmetric, in the sense that it may move the samples either closer to or further from the separating hyperplane. Hence, some digits look even more clear than the original ones, whereas others look like the opposing digit.

Figure 2.4: (a) Gaussian data. (b) Narrow Gaussian with outliers.
Figure 2.5: Classifiers' performance on noised cross-validation and testing sets. The x-axis indicates the magnitude of the noise (the noise is distributed as U(−x, x)). The experiment was repeated 50 times. (a)-(b) represent the respective toy problems.
Figure 2.6: The GURU adversary adds noise perpendicularly to the separating hyperplane. Note that some samples are even more clear than the original, whereas others look like the opposing digit. A bunch of samples are a superposition of both. (a)-(c) are the original digits. (d)-(f) are noisy samples.
Figure 2.7: Classifiers' performance on noised cross-validation and testing sets. The x-axis indicates the magnitude of the noise (the noise is distributed as U(−x, x)). The experiment was repeated 10 times. (a) 3 vs 5, (b) 5 vs 8, (c) 7 vs 9.

In addition, on the USPS dataset, GURU has demonstrated a significantly better resistance to noise than SVM (see Figure 2.7).

Chapter 3
Dual Formulation
In this chapter we derive a dual problem for the learning task at hand. We do not use the dual as a means to solve the primal problem, since the primal optimization works well. Rather, we use it to gain a better understanding of the problem. In the course of the derivation we use the notion of conjugate functions. We will show that the dual problem itself specifies the classifier up to a scaling factor. Thus, we devise a method to extract the norm using the available information. It is interesting to observe that throughout the derivation of the dual, the smooth function $f$ plays a specific and distinguished role. Thus, the entire procedure may be applied as is to other smooth convex functions, by only calculating their conjugate dual. We demonstrate this principle in Section 3.2, where we also discuss the relation between the primal loss and the dual formulation.
3.1 Mathematical Derivation
This section is rather technical, and goes through the derivation of the dual. We start with the perspective representation of $\ell^{rob}_{\text{hinge}}$, and introduce a couple of auxiliary variables. Using these variables, the Lagrangian takes a form that we are able to analyze. Theorem 1.1 encapsulates the effect of $f$, in such a manner that other loss functions can be plugged into the derivation rather easily.

The main result of this section is that the dual form of Equation 2.22 is
$$\begin{array}{ll}
\max & \sum_m \alpha_m \\
\text{s.t.} & \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \le \sigma \sum_m \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha_m)^2}{2}\right) \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.1}$$
In the following paragraphs we will go through the details.

The optimization task of Equation 2.22 may be written as
$$\min_{\mathbf{w}}\ \sigma\|\mathbf{w}\|\sum_m f\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) \tag{3.2}$$
We introduce the auxiliary variables $z_m$, and constrain them with $1 - y_m\mathbf{w}^T\mathbf{x}_m \le z_m$. Note that $f(z)$ is monotone increasing in $z$. Thus, at optimality $z_m = 1 - y_m\mathbf{w}^T\mathbf{x}_m$. In addition, we introduce the variable $r$ and constrain it with $\sigma\|\mathbf{w}\| \le r$. At the optimum $r = \sigma\|\mathbf{w}\|$, since $r f(z/r)$ is monotone increasing in $r$. Altogether we get the following optimization task
$$\begin{array}{ll}
\min & r\sum_m f\!\left(\frac{z_m}{r}\right) \\
\text{s.t.} & \sigma\|\mathbf{w}\| \le r \\
& 1 - y_m\mathbf{w}^T\mathbf{x}_m \le z_m \\
& r \ge 0
\end{array} \tag{3.3}$$
where the optimization variables are $\mathbf{w}, z_1, \ldots, z_M, r$. The objective is convex according to Lemma 4, and the constraint on $\mathbf{w}$ is a second order cone.

To find the dual, write the Lagrangian:
$$L(\mathbf{w}, r, \mathbf{z}, \boldsymbol{\alpha}, \lambda) = r\sum_m f\!\left(\frac{z_m}{r}\right) + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - z_m\right] - \mu r$$
where $\lambda, \alpha_m, \mu \ge 0$ are the Lagrange multipliers. For later convenience we add a set of variables $r_m$ and force them all to equal $r$. So the new Lagrangian is:
$$L(\mathbf{w}, r, \mathbf{z}, \boldsymbol{\alpha}, \lambda, \boldsymbol{\delta}) = \sum_m r_m f\!\left(\frac{z_m}{r_m}\right) + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m - z_m\right] - \mu r - \sum_m \delta_m\left[r_m - r\right]$$
where $\delta_m \ge 0$.

Recall that we have defined $g(a, b) = a f\!\left(\frac{b}{a}\right)$. Using this notion we get the following task
$$\begin{aligned}
\min_{\mathbf{w}, r, \mathbf{z}} L(\mathbf{w}, r, \mathbf{z}, \boldsymbol{\alpha}, \lambda, \boldsymbol{\delta}) &= \min_{\mathbf{w}, r}\ \sum_m \min_{z_m, r_m}\left[g(r_m, z_m) - \alpha_m z_m - \delta_m r_m\right] + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right] - \mu r + r\sum_m \delta_m \\
&= \min_{\mathbf{w}, r}\ \sum_m g^*(\alpha_m; \delta_m) + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right] - \mu r + r\sum_m \delta_m
\end{aligned}$$
where $g^*$ is by definition the conjugate function of $g$ (for details see Boyd and Vandenberghe [2004], e.g.). Differentiating the Lagrangian w.r.t. $\mathbf{w}$ gives:
$$\sigma\lambda\,\frac{\mathbf{w}}{\|\mathbf{w}\|} = \sum_m \alpha_m y_m \mathbf{x}_m \tag{3.4}$$
Taking the norm of both sides of the equation yields
$$\sigma\lambda(\boldsymbol{\alpha}) = \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \tag{3.5}$$
Substituting this back into the objective, the terms with $\mathbf{w}$ cancel out and we have:
$$\min_r\ \sum_m g^*(\alpha_m; \delta_m) + \sum_m \alpha_m - r\lambda(\boldsymbol{\alpha}) - \mu r + r\sum_m \delta_m \tag{3.6}$$
This is linear in $r$; thus differentiating w.r.t. $r$ yields the constraint $\sum_m \delta_m = \lambda(\boldsymbol{\alpha}) + \mu$. Since $\mu \ge 0$, the equality constraint may be relaxed to $\sum_m \delta_m \ge \lambda(\boldsymbol{\alpha})$, and we end up with the following formulation
$$\begin{array}{ll}
\max & \sum_m \alpha_m + \sum_m g^*(\alpha_m; \delta_m) \\
\text{s.t.} & \sum_m \delta_m \ge \lambda(\boldsymbol{\alpha}) \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.7}$$
Or:
$$\begin{array}{ll}
\max & \sum_m \alpha_m + \sum_m g^*(\alpha_m; \delta_m) \\
\text{s.t.} & \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \le \sigma\sum_m \delta_m \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.8}$$
The overall problem has a concave objective (since it is a conjugate dual of a convex function) and second order cone constraints. In what follows we work out the form of the conjugate dual $g^*$. Denote by $f^*(\alpha)$ the conjugate function of $f$ (it is concave). The next theorem specifies the conjugate $g^*$ in terms of $f^*$:

Theorem 1.1:
The conjugate dual of $g(a, b)$ is
$$g^*(\alpha, \delta) = \begin{cases} 0 & f^*(\alpha) \ge \delta \\ -\infty & \text{otherwise} \end{cases} \tag{3.9}$$

Proof:
We must calculate
$$g^*(\alpha; \delta) = \min_{x, t}\left(t f\!\left(\frac{x}{t}\right) - \alpha x - \delta t\right) \tag{3.10}$$
To prove the claim, we change from the variables $x, t$ to the variables $z = x/t,\ t$:
$$\min_{t \ge 0,\, z}\ t f(z) - \alpha z t - \delta t = \min_{t \ge 0,\, z}\ t\left(f(z) - \alpha z - \delta\right) \tag{3.11}$$
For the first case, assume that $f^*(\alpha) \ge \delta$, which implies that for all $z$:
$$f(z) - \alpha z \ge \delta \tag{3.12}$$
Then in Equation 3.11 the minimization is always of the product of $t \ge 0$ and some non-negative number. Hence it is always greater than or equal to zero, and zero can be attained in the limit $t \to 0$.

On the other hand, if $f^*(\alpha) < \delta$, we will show that there exists a pair $t, z$ that achieves a value of $-\infty$: since $f^*(\alpha) < \delta$, there exists a $z_0$ for which
$$f(z_0) - \alpha z_0 - \delta < 0 \tag{3.13}$$
Taking $t \to \infty$ with this $z_0$ gives a value of $-\infty$.

In order to complete the derivation of the dual formulation, we should compute the conjugate dual $f^*$. The following lemma gives the desired result.

Lemma 6
The conjugate dual of $f$ is
$$f^*(\alpha) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha)^2}{2}\right)$$

Proof:
Recall that
$$f(z) = z\,\mathrm{erf}(z) + \frac{1}{\sqrt{2\pi}}e^{-z^2/2} \tag{3.14}$$
and that its first derivative is $\frac{df}{dz} = \mathrm{erf}(z)$ (see Equation 2.17). By Theorem 3.1, $f$ is convex; thus we compute $f$'s conjugate dual:
$$f^*(\alpha) = \min_z\ f(z) - \alpha z \tag{3.15}$$
The minimum satisfies:
$$f'(z) = \alpha \quad\Longrightarrow\quad \mathrm{erf}(z) = \alpha \quad\Longrightarrow\quad z = \mathrm{erfinv}(\alpha)$$
where $\mathrm{erfinv}$ is the inverse function of $\mathrm{erf}$. We plug this equality into the objective and conclude
$$f^*(\alpha) = f(\mathrm{erfinv}(\alpha)) - \alpha\,\mathrm{erfinv}(\alpha) = \mathrm{erfinv}(\alpha)\,\alpha + \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha)^2}{2}\right) - \alpha\,\mathrm{erfinv}(\alpha) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha)^2}{2}\right)$$
It can be easily verified that $f^*$ is concave, as expected from the theory. Note that from the derivation above it follows that $\alpha_m \le 1$.

Taking the dual problem in Equation 3.8 and plugging in the conjugate duals derived above, we get:
$$\begin{array}{ll}
\max & \sum_m \alpha_m \\
\text{s.t.} & \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \le \sigma\sum_m \delta_m \\
& \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha_m)^2}{2}\right) \ge \delta_m \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.16}$$
Consider the following problem:
$$\begin{array}{ll}
\max & \sum_m \alpha_m \\
\text{s.t.} & \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \le \sigma\sum_m \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\mathrm{erfinv}(\alpha_m)^2}{2}\right) \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.17}$$
The following proposition asserts that both of the formulations above are equivalent.

Proposition 7
Equation 3.16 and Equation 3.17 are equivalent.
Proof:
Denote by $C_1$ the feasible region of Equation 3.16, and by $C_2$ the feasible region of Equation 3.17. Let $(\boldsymbol{\alpha}, \boldsymbol{\delta}) \in C_1$. Then, since $\delta_m \le \frac{1}{\sqrt{2\pi}}\exp(-\mathrm{erfinv}(\alpha_m)^2/2)$ for every $m$, we trivially have $\boldsymbol{\alpha} \in C_2$. On the other hand, let $\boldsymbol{\alpha} \in C_2$. Denote $\delta_m = \frac{1}{\sqrt{2\pi}}\exp(-\mathrm{erfinv}(\alpha_m)^2/2)$. It is easy to verify that this selection corresponds to a feasible point of Equation 3.16 (i.e. $(\boldsymbol{\alpha}, \boldsymbol{\delta}) \in C_1$) with the same objective value.

As we have seen, the optimization problem we analyze in this work is a relative of the SVM problem. It is interesting to examine what happens when considering the duals. Consider the SVM formulation
$$\begin{array}{ll}
\min_{\mathbf{w}} & \lambda\|\mathbf{w}\|^2 + \sum_{m=1}^{M}\xi_m \\
\text{s.t.} & \forall m \in \{1, 2, \ldots, M\}:\ \xi_m \ge 1 - y_m\mathbf{w}^T\mathbf{x}_m,\ \ \xi_m \ge 0
\end{array} \tag{3.18}$$
Its dual is
$$\begin{array}{ll}
\max_{\boldsymbol{\alpha}} & \sum_{m=1}^{M}\alpha_m - \frac12\sum_{m,n=1}^{M}\alpha_m\alpha_n y_m y_n\,\mathbf{x}_m^T\mathbf{x}_n \\
\text{s.t.} & \forall m \in \{1, 2, \ldots, M\}:\ 0 \le \alpha_m \le \frac{1}{2\lambda}
\end{array} \tag{3.19}$$
This dual form shares some properties with the dual form of GURU. For example, notice that in both cases one tries to maximize the sum of the dual variables $\alpha_m$. Another issue is that of the norm minimization. The SVM dual explicitly minimizes the norm of the classifier. In our dual, however, the situation is rather implicit: there exists a bound on the norm of the classifier. Without going into the details, we mention that moving a constraint into the objective or vice versa is possible in the context of Lagrangian duality. At last, notice that in spite of the fact that $\sigma$ and $\lambda$ play similar roles, increasing $\lambda$ results in shrinking the feasible region of the SVM dual, whereas in our problem, increasing $\sigma$ expands the feasible region.

The last issue we discuss in this section is the norm of the optimal classifier. Note that by solving the dual formulation, one can only obtain the optimal classifier up to a scaling factor. Of course, it is essential to know the norm exactly in order to be able to use the classifier. This goal can be achieved using the following theorem:

Theorem 1.2: The norm of the optimal classifier is
$$\|\mathbf{w}^*\| = \frac{1}{\sigma\,\mathrm{erfinv}(\alpha^*_m) + y_m(\hat{\mathbf{w}}^*)^T\mathbf{x}_m} \tag{3.20}$$
for every $m$, where $\hat{\mathbf{w}}^*$ is the normalized optimal classifier.

Proof:
The Lagrangian minimization preceding Equation 3.4 may be written as
$$\begin{aligned}
\min_{\mathbf{w}, r, \mathbf{z}} L(\mathbf{w}, r, \mathbf{z}, \boldsymbol{\alpha}, \lambda, \boldsymbol{\delta}) &= \min_{\mathbf{w}, r}\ \sum_m \min_{z_m, r_m}\left[r_m f\!\left(\frac{z_m}{r_m}\right) - \alpha_m z_m - \delta_m r_m\right] + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right] - \mu r + r\sum_m \delta_m \\
&= \min_{\mathbf{w}, r}\ \sum_m \min_{r_m}\left[r_m \min_{z_m}\left[f\!\left(\frac{z_m}{r_m}\right) - \alpha_m\frac{z_m}{r_m}\right] - \delta_m r_m\right] + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right] - \mu r + r\sum_m \delta_m
\end{aligned}$$
since $r_m \ge 0$. We define $q_m = \frac{z_m}{r_m}$. Since the equation above depends on $z_m$ only via $q_m$, we get
$$\min_{\mathbf{w}, r, \mathbf{z}} L(\mathbf{w}, r, \mathbf{z}, \boldsymbol{\alpha}, \lambda, \boldsymbol{\delta}) = \min_{\mathbf{w}, r}\ \sum_m \min_{r_m}\left[r_m \min_{q_m}\left[f(q_m) - \alpha_m q_m\right] - \delta_m r_m\right] + \lambda\left[\sigma\|\mathbf{w}\| - r\right] + \sum_m \alpha_m\left[1 - y_m\mathbf{w}^T\mathbf{x}_m\right] - \mu r + r\sum_m \delta_m$$
If, when substituting the dual optimum into the Lagrangian, there exists a unique primal feasible solution, then it must be primal optimal (see Boyd and Vandenberghe [2004] for details). Thus, at the optimum $q^*_m = \arg\min_{q_m}\left[f(q_m) - \alpha_m q_m\right]$. According to the proof of Lemma 6 it holds that $q^*_m = \mathrm{erfinv}(\alpha_m)$. By exploiting the monotonicity properties of the problem (that were presented in the beginning of the section), we conclude that
$$\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|} = \mathrm{erfinv}(\alpha_m) \tag{3.21}$$
The desired result follows from basic algebraic operations.

Note that the values of the optimal $\alpha$'s are known, as well as the normalized vector $\hat{\mathbf{w}}^* = \frac{\mathbf{w}^*}{\|\mathbf{w}^*\|}$. Thus, we can compute the optimal norm.

It is possible that the norm of the optimal classifier is bounded (as a function of $\sigma$). Although we could not prove this result, we conjecture that such a result might stem from a strong duality argument:
$$\sum_m \alpha^*_m = \sum_m \ell^{rob}_{\text{hinge}}(\mathbf{x}_m, y_m; \mathbf{w}^*, \sigma) \tag{3.22}$$
By plugging Equation 3.21 into the previous equality, we obtain
$$\|\mathbf{w}^*\| = \frac{\sum_m \alpha^*_m}{\sigma\sum_m f(\mathrm{erfinv}(\alpha^*_m))} \tag{3.23}$$
A better understanding of the constraints on $\boldsymbol{\alpha}$ may help bounding the RHS of the equation. We have plotted the norm of the optimal classifiers for the toy problems of Chapter 2 (refer to Section 2.5 for more details). The results are shown in Figure 3.1 and clearly support this conjecture.

Figure 3.1: Norm of the optimal classifiers trained for the toy problems of Section 2.5, for various $\sigma$ values. (a)-(b) represent the respective toy problems.
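The conjugate of Lemma 6, which drives the norm constraint of the dual, can be sanity-checked by direct numerical minimization of $f(z) - \alpha z$ on a grid. The following sketch (illustrative, assuming SciPy; the thesis's erfinv is the inverse standard normal CDF, written Phi_inv here) does so for a few values of $\alpha$:

```python
import numpy as np
from scipy.special import erf as _erf, erfinv as _erfinv

def Phi(t):
    """Standard normal CDF (the thesis's erf)."""
    return 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))

def Phi_inv(p):
    """Inverse of the standard normal CDF (the thesis's erfinv)."""
    return np.sqrt(2.0) * _erfinv(2.0 * p - 1.0)

def f(z):
    return z * Phi(z) + np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def f_star_closed(alpha):
    """Lemma 6: f*(alpha) = exp(-erfinv(alpha)^2 / 2) / sqrt(2 pi)."""
    return np.exp(-Phi_inv(alpha)**2 / 2) / np.sqrt(2 * np.pi)

z = np.linspace(-8, 8, 200_001)
for alpha in [0.1, 0.3, 0.5, 0.7, 0.9]:
    numeric = np.min(f(z) - alpha * z)              # direct minimization on a fine grid
    assert np.isclose(numeric, f_star_closed(alpha), atol=1e-4)
```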
3.2 A general framework
The dual form we have derived sheds some light on the structure of the problem. In this section we discuss the relation between the loss function $f$ and the norm constraint that appears in the dual. We claim that there is a correspondence between approximations of $f$ and relaxations of the dual problem. More specifically, approximations of the loss function culminate in approximations of the feasible region of the dual problem.

The norm constraint in the dual is a core component of the optimization. We denote by
$$s(\alpha) = \exp\!\left(-\frac{\mathrm{erfinv}(\alpha)^2}{2}\right) \tag{3.24}$$
the function under the summation. It is complicated to handle and understand $s(\alpha)$, thus it is appealing to approximate it using elementary functions. Two such approximations are
$$\tilde{s}_1(\alpha) = H(\alpha) = -\alpha\log_2(\alpha) - (1-\alpha)\log_2(1-\alpha), \qquad \tilde{s}_2(\alpha) = 4\alpha(1-\alpha) \tag{3.25}$$
(see Figure 3.2).

Figure 3.2: The dual constraint may be approximated using elementary functions.

Note that in the previous section we only used $f$ as a means to express $g^*$ (Equation 3.9). Thus, if one replaces $f$ with some alternative convex loss function $\tilde{f}$, the derivation of the dual will remain correct. Of course, the dual norm constraint will be affected by this change. In order to understand the nature of the approximations in Equation 3.25, it is necessary to explore the respective dual conjugates.

Lemma 8
Let $\tilde{f}(z) = \log_2(1 + 2^z)$. Then its conjugate dual is
$$\tilde{f}^*(\alpha) = -\alpha\log_2(\alpha) - (1-\alpha)\log_2(1-\alpha) \tag{3.26}$$
We compute $\tilde{f}$'s conjugate dual:
$$\tilde{f}^*(\alpha) = \min_z\ \tilde{f}(z) - \alpha z \tag{3.27}$$
The minimum satisfies:
$$\tilde{f}'(z) = \alpha \quad\Longrightarrow\quad \frac{2^z}{2^z + 1} = \alpha \quad\Longrightarrow\quad 2^z = \frac{\alpha}{1-\alpha} \quad\Longrightarrow\quad z = \log_2\!\left(\frac{\alpha}{1-\alpha}\right)$$
We plug this equality into the objective and conclude
$$\tilde{f}^*(\alpha) = \log_2\!\left(1 + \frac{\alpha}{1-\alpha}\right) - \alpha\log_2\!\left(\frac{\alpha}{1-\alpha}\right) = -\alpha\log_2(\alpha) - (1-\alpha)\log_2(1-\alpha)$$
as claimed. As in the case of our Gaussian robust loss, we have $\alpha \le 1$.

Figure 3.3: The log loss appears naturally in our framework. In addition, we have demonstrated a means to generate some other loss functions, such as the quadratic loss above.
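The following sketch (illustrative, assuming SciPy, and assuming base-2 entropy so that all curves peak at 1) compares $s(\alpha)$ of Equation 3.24 with the two elementary approximations of Equation 3.25; all three vanish at the endpoints of $[0, 1]$ and equal 1 at $\alpha = 1/2$:

```python
import numpy as np
from scipy.special import erfinv as _erfinv

def Phi_inv(p):
    """Inverse of the standard normal CDF (the thesis's erfinv)."""
    return np.sqrt(2.0) * _erfinv(2.0 * p - 1.0)

def s(alpha):
    """Equation 3.24: the term under the sum in the dual norm constraint."""
    return np.exp(-Phi_inv(alpha)**2 / 2)

def s1(alpha):
    """Binary-entropy approximation of Equation 3.25 (log-loss relaxation, Lemma 8)."""
    return -alpha * np.log2(alpha) - (1 - alpha) * np.log2(1 - alpha)

def s2(alpha):
    """Quadratic approximation of Equation 3.25 (Huber-like relaxation, Lemma 9)."""
    return 4 * alpha * (1 - alpha)

alpha = np.linspace(0.01, 0.99, 99)
print(np.max(np.abs(s(alpha) - s1(alpha))), np.max(np.abs(s(alpha) - s2(alpha))))
print(s(np.array([0.5])), s1(np.array([0.5])), s2(np.array([0.5])))   # all equal 1 at alpha = 1/2
```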
Lemma 9
Let
$$\tilde{f}(z) = \begin{cases} 0 & \text{if } z < -4 \\ \frac{(z+4)^2}{16} & \text{if } -4 \le z \le 4 \\ z & \text{if } z > 4 \end{cases} \tag{3.28}$$
Then its conjugate dual is
$$\tilde{f}^*(\alpha) = \begin{cases} 4\alpha(1-\alpha) & \text{if } 0 \le \alpha \le 1 \\ -\infty & \text{if } \alpha > 1 \end{cases} \tag{3.29}$$
It is easy to verify that $\tilde{f}$ is smooth. We thus compute $\tilde{f}$'s conjugate dual in the following way:
$$\tilde{f}^*(\alpha) = \min_z\ \tilde{f}(z) - \alpha z \tag{3.30}$$
Extremum points satisfy $\tilde{f}'(z) - \alpha = 0$, where
$$\tilde{f}'(z) - \alpha = \begin{cases} -\alpha & \text{if } z < -4 \\ \frac{z+4}{8} - \alpha & \text{if } -4 \le z \le 4 \\ 1 - \alpha & \text{if } z > 4 \end{cases}$$
The above vanishes at $z = 8\alpha - 4$. For $0 \le \alpha \le 1$ we have $-4 \le z \le 4$, thus we conclude
$$\tilde{f}^*(\alpha)\Big|_{0\le\alpha\le1} = 4\alpha(1-\alpha)$$
For $\alpha > 1$ we take $z \to \infty$, and $\tilde{f}(z) - \alpha z\big|_{z > 4} = (1-\alpha)z \to -\infty$. Altogether we have established the desired result.

These lemmas shed some light on $\ell^{rob}_{\text{hinge}}$ and on the structure of our problem. It turns out that the well-known log-loss, as well as a quadratic loss that has the same flavour as the Huber loss, appear naturally in our framework (see Figure 3.3 for a visualization). What we have demonstrated is that there exists a close connection between approximations of the primal loss and relaxations of the dual problem. Specifically, we have that the dual of
$$\min_{\mathbf{w}}\ \sum_{m=1}^{M}\sigma\|\mathbf{w}\|\,\tilde{f}\!\left(\frac{1 - y_m\mathbf{w}^T\mathbf{x}_m}{\sigma\|\mathbf{w}\|}\right) \tag{3.31}$$
is
$$\begin{array}{ll}
\max & \sum_m \alpha_m \\
\text{s.t.} & \left\|\sum_m \alpha_m y_m \mathbf{x}_m\right\| \le \sigma\sum_m \tilde{s}(\alpha_m) \\
& \boldsymbol{\alpha} \ge 0
\end{array} \tag{3.32}$$
Note, however, that this connection should be further investigated. It should be observed that not every smooth convex primal loss $\tilde{f}$ yields a perspective that is nondecreasing in its arguments, and hence convex in $\mathbf{w}$. For that to happen, $\tilde{f}$ should satisfy some mathematical properties that are yet to be understood. One example of such a condition is $\tilde{f}(z) \ge z\,\frac{d\tilde{f}}{dz}(z)$. Under this condition we can use the same reasoning as in the proof of Theorem 3.2 and conclude that the primal problem is convex. In this case, we can automatically apply the derivation presented in the previous section and deduce the respective dual problem. Another issue that should be better understood is the connection between approximations of $f$ and the robust setup we have begun with. In particular, it is interesting to understand whether the logarithmic loss may be interpreted as resulting from RO.

Chapter 4
Introducing Kernels

One of the greatest strengths of the theory of support vector machines is the simple generalization to nonlinear cases. This generalization is carried out via the elegant notion of kernels. An examination of our derivation suggests that one may apply the kernel trick and introduce a means to learn nonlinear classifiers in the Gaussian Robust framework.

In this chapter we develop a kernelized version of the GURU algorithm. Most of the derivation is straightforward: we begin by giving a representer result. Plugging the new parametrization of the classifier into the framework, we show that our update formulas are perfectly suitable for maintaining this kind of representation. The tricky part stems from the fact that our updates depend directly on the norm of the weights vector. Naively recomputing the norm from the coefficients costs $O(M^2)$ kernel evaluations, which significantly slows down the algorithm. We thus derive a procedure to update the norm in $O(1)$, based on previous computations.
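As a preview of this bookkeeping, the following sketch (a hypothetical implementation, not taken from the thesis) performs one stochastic update entirely in terms of the coefficients of the representation $\mathbf{w} = \sum_m \alpha_m y_m \mathbf{x}_m$, using the two-step form of the update derived in Section 4.1 below. The rescaling step multiplies all coefficients by $\gamma$ (so $\|\mathbf{w}\|^2$ is multiplied by $\gamma^2$), and the additive step touches a single coefficient, so the squared norm can be corrected with the already-computed value $\mathbf{w}^T\mathbf{x}_i$ and one diagonal kernel entry; this is one simple way to realize the $O(1)$ norm update mentioned above (it assumes the coefficients have already been initialized so that $\mathbf{w} \ne \mathbf{0}$).

```python
import numpy as np
from scipy.special import erf as _erf

def Phi(t):
    """Standard normal CDF (the thesis's erf)."""
    return 0.5 * (1.0 + _erf(t / np.sqrt(2.0)))

def kernel_guru_step(alpha, sq_norm, K, Y, i, t, eta, sigma):
    """One stochastic update on sample i for w = sum_m alpha_m y_m x_m in feature space.

    alpha: current coefficients, sq_norm: current ||w||^2, K: kernel Gram matrix.
    Returns the updated (alpha, sq_norm) without ever forming w explicitly."""
    norm_w = np.sqrt(sq_norm)
    wx_i = np.sum(alpha * Y * K[:, i])                # w^T x_i via one column of kernel values
    z = (1.0 - Y[i] * wx_i) / (sigma * norm_w)
    rate = eta / np.sqrt(t)
    # Step 1: pure rescaling w <- gamma w.
    gamma = 1.0 - rate * sigma * np.exp(-z**2 / 2) / (np.sqrt(2 * np.pi) * norm_w)
    alpha = gamma * alpha
    sq_norm = gamma**2 * sq_norm
    wx_i = gamma * wx_i
    # Step 2: additive update w <- w + mu_i y_i x_i, touching a single coefficient.
    mu_i = rate * Phi(z)
    alpha = alpha.copy()
    alpha[i] += mu_i
    sq_norm += 2.0 * mu_i * Y[i] * wx_i + mu_i**2 * K[i, i]
    return alpha, sq_norm
```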
1. A representer result
The first step towards kernelization of GURU is to change our representation of the classifier from a weights vector (w) to a linear combination of the training samples. The theoretical justification of such an operation is known as a representer result. The fact that an optimal classifier may be represented as a linear combination of the training samples stems from the mathematical theory of Hilbert spaces. In our case, as well as in SVM, however, the same result can be derived using far simpler and more explicit argumentation. In this section we show three ways to establish the representer result for the case of GURU. Even though we could prove the theorem using abstract argumentation, it is necessary to develop the technical proof, as it lays the foundations for the derivation of the kernelized algorithm. We start by stating a version of the representer theorem:

Theorem 1.1:
Let H be a reproducing kernel Hilbert space with a kernel κ : X × X → R, a symmetric positive semi-definite function on the compact domain X. For any function L : R^n → R and any nondecreasing function Ω : R → R, if

J* = min_{f∈H} J(f) = min_{f∈H} { Ω(‖f‖_H) + L( f(x_1), f(x_2), ..., f(x_n) ) }

is well-defined, then there are some α_1, α_2, ..., α_n ∈ R such that

f(·) = Σ_{i=1}^n α_i κ(x_i, ·)                                 (4.1)

achieves J(f) = J*. Furthermore, if Ω is increasing, then each minimizer of J(f) can be expressed in the form of Equation 4.1. For a proof and more details, see for example Schölkopf and Smola [2002].

As mentioned, we will discuss three techniques to establish the required result: first, using the structure of the updates that GURU performs; second, via the derivation of the dual problem presented in Chapter 3; and third, using the general representer theorem.

Theorem 1.2:
There exists a solution of Equation 2.22 that takes the form

w = Σ_{m=1}^M α_m y_m x_m                                      (4.2)

Proof: Via the structure of GURU
Recall that the updates in the GURU algorithm are of the form

w ← w − (η/√t) [ −y_i x_i erf( (1 − y_i w^T x_i) / (σ‖w‖) ) + (σ w / (√π ‖w‖)) exp( −(1 − y_i w^T x_i)² / (σ²‖w‖²) ) ]

It is suggestive to observe that the update formula can be split and written as two successive steps. The first of these is

w ← w − (η/√t) (σ w / (√π ‖w‖)) exp( −(1 − y_i w^T x_i)² / (σ²‖w‖²) )

followed by

w ← w + (η/√t) y_i x_i erf( (1 − y_i w^T x_i) / (σ‖w‖) )       (4.3)

The first step is nothing other than a rescaling of the weights vector,

w ← γ w,   γ = 1 − (η/√t) (σ / (√π ‖w‖)) exp( −(1 − y_i w^T x_i)² / (σ²‖w‖²) )     (4.4)

Recall that GURU initializes the weight vector as w = 0, which clearly can be represented as

0 = Σ_{m=1}^M 0 · y_m x_m                                      (4.5)

We thus assume that the desired representation exists, and proceed by induction. By plugging the representation into the rescaling step, we get

Σ_{m=1}^M α_new_m y_m x_m = γ Σ_{m=1}^M α_m y_m x_m,

i.e., for all m,

α_new_m = γ α_m                                                (4.6)

where α_new_m is the result of the respective update. The second step in the update formula (Equation 4.3) may be written as

Σ_{m=1}^M α_new_m y_m x_m = Σ_{m=1}^M α_m y_m x_m + μ_i y_i x_i,   μ_i = (η/√t) erf( (1 − y_i w^T x_i) / (σ‖w‖) )

i.e.,

α_new_m = { α_m if m ≠ i;   α_i + μ_i if m = i }               (4.7)

Combining both steps, we end up with the following update rule:

α^{t+1}_m = { γ α^t_m if m ≠ i;   γ α^t_i + μ_i if m = i }     (4.8)

Since GURU is guaranteed to converge to the optimum, by taking t → ∞ we establish the desired result.
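To make the induction step concrete, the following short numerical check (a sketch that is not part of the original derivation; the data, the scalars gamma and mu, and the sample index are arbitrary placeholders) confirms that updating the coefficients via Equation 4.8 reproduces the two-step update applied directly to w = Σ_m α_m y_m x_m.

import numpy as np

rng = np.random.default_rng(0)
M, d = 8, 5
X = rng.normal(size=(M, d))          # training samples x_m (rows)
y = rng.choice([-1.0, 1.0], size=M)  # labels y_m
alpha = rng.random(M)                # current coefficients alpha_m
gamma, mu, i = 0.9, 0.3, 2           # arbitrary scalars and sample index

w = (alpha * y) @ X                  # w = sum_m alpha_m y_m x_m

# two-step update applied directly to w (rescaling, then additive step)
w_direct = gamma * w + mu * y[i] * X[i]

# the same update expressed on the coefficients (Equation 4.8)
alpha_new = gamma * alpha.copy()
alpha_new[i] = gamma * alpha[i] + mu
w_from_alpha = (alpha_new * y) @ X

print(np.allclose(w_direct, w_from_alpha))  # True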
Proof: Via the dual formulation

We have already seen (Equation 3.4) that

σλ w / ‖w‖ = Σ_m α_m y_m x_m

By defining ˜α_m = (‖w‖ / (σλ)) α_m and plugging it into the previous equality, we conclude that

w = Σ_m ˜α_m y_m x_m

as required.

Proof: Via the general representer theorem
Set Ω ≡ 0, L( f(x_1), f(x_2), ..., f(x_n) ) = Σ_{i=1}^n f(x_i) with f = ℓ^rob_hinge, and let κ be the linear kernel κ(x, x') = x^T x'. The desired result stems immediately from Theorem 1.1.

2. KEN-GURU: A primal kernelized version of GURU

In the previous section we established a representer result for GURU. The next step in the derivation is to rework the components of the algorithm so that the only dependence on the data samples and on the classifier is via dot products. That being the case, we can apply the kernel trick, namely replace each dot product (x_m)^T x_n with the kernel entry κ(x_m, x_n) (for details see, for example, Aizerman et al. [1964]; Schölkopf and Smola [2002]). We start by expanding the quantities that appear in the update formula in terms of the α_m's. Then, we introduce a method to update the value of the norm variable in a computationally cheap way. We conclude the section by putting the results together and presenting the KEN-GURU (KErNelized GaUssian RobUst) algorithm.

In order to compute γ and μ_i of Equation 4.4 and Equation 4.7, one must know the values of w^T x_i and ‖w‖. Let us expand the first quantity:

w^T x_i = ( Σ_{m=1}^M α_m y_m x_m )^T x_i = Σ_{m=1}^M α_m y_m (x_m)^T x_i = Σ_{m=1}^M α_m y_m K_mi

The norm may be computed as

‖w‖² = w^T w = ( Σ_{m=1}^M α_m y_m x_m )^T ( Σ_{n=1}^M α_n y_n x_n ) = Σ_{m=1}^M Σ_{n=1}^M α_m α_n y_m y_n (x_m)^T x_n = Σ_{m=1}^M Σ_{n=1}^M α_m α_n y_m y_n K_mn

Note that the Gram matrix K may be precomputed and cached (total cost of O(M²)). Thus, w^T x_i can be computed in O(M), and ‖w‖² in O(M²). As both of these values must be computed for each update, the cost of the norm computation is extremely expensive. Instead of computing the norm each time from scratch, it is possible to use its previous value. The updated norm may be computed as

‖w‖²_{t+1} = Σ_{m=1}^M Σ_{n=1}^M α^{t+1}_m α^{t+1}_n y_m y_n K_mn
           = Σ_{m≠i} Σ_{n≠i} α^{t+1}_m α^{t+1}_n y_m y_n K_mn + 2 Σ_{m≠i} α^{t+1}_m α^{t+1}_i y_m y_i K_mi + (α^{t+1}_i)² K_ii

By plugging Equation 4.8 we get

‖w‖²_{t+1} = γ² Σ_{m≠i} Σ_{n≠i} α^t_m α^t_n y_m y_n K_mn + 2γ Σ_{m≠i} α^t_m (γ α^t_i + μ_i) y_m y_i K_mi + (γ α^t_i + μ_i)² K_ii
           = γ² ‖w‖²_t + 2γ μ_i y_i Σ_{m=1}^M α^t_m y_m K_mi + μ_i² K_ii
           = γ² ‖w‖²_t + 2γ μ_i y_i w^T x_i + μ_i² K_ii

where w^T x_i is computed regardless of ‖w‖. Thus, the value of the norm can be maintained in O(1). It may be easily observed that the data samples x_m participate in the computations of the update only via the Gram matrix K.
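A quick numerical sanity check of the incremental norm formula above (a sketch with synthetic data; gamma, mu and the sample index are arbitrary): the squared norm after the coefficient update of Equation 4.8 equals γ²‖w‖² + 2γμ y_i w^T x_i + μ² K_ii.

import numpy as np

rng = np.random.default_rng(1)
M, d = 10, 4
X = rng.normal(size=(M, d))
y = rng.choice([-1.0, 1.0], size=M)
K = X @ X.T                           # Gram matrix (linear kernel here)
alpha = rng.random(M)
gamma, mu, i = 0.95, 0.2, 3

w = (alpha * y) @ X
norm_sq_old = w @ w
wTxi = w @ X[i]

# incremental update of the squared norm, O(1) given wTxi
norm_sq_new = gamma**2 * norm_sq_old + 2 * gamma * mu * y[i] * wTxi + mu**2 * K[i, i]

# recompute from scratch for comparison, O(M^2)
alpha_new = gamma * alpha.copy()
alpha_new[i] += mu
norm_sq_full = sum(alpha_new[m] * alpha_new[n] * y[m] * y[n] * K[m, n]
                   for m in range(M) for n in range(M))

print(np.isclose(norm_sq_new, norm_sq_full))  # True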
Thus, we can apply the kernel trick and use

K_ij = κ(x_i, x_j)                                             (4.9)

for any Mercer kernel κ. Based on the results established in the previous sections, we may translate GURU into a kernelized version, named KEN-GURU. We introduce an auxiliary variable ζ that holds the value of the inner product w^T x_i, evaluated as

ζ^{t+1} = Σ_{m=1}^M α^t_m y_m κ(x_m, x_i)                      (4.10)

According to Equation 4.4, Equation 4.7 and Equation 4.9, we introduce the following update formulas (a code sketch appears after Table 4.1 below):

γ^{t+1} = 1 − (η/√t) (σ / (√π ν_t)) exp( −(1 − y_i ζ^{t+1})² / (σ² ν_t²) )     (4.11)

μ^{t+1} = (η/√t) erf( (1 − y_i ζ^{t+1}) / (σ ν_t) )            (4.12)

ν_{t+1} = sqrt( (γ^{t+1})² ν_t² + 2 γ^{t+1} μ^{t+1} y_i ζ^{t+1} + (μ^{t+1})² K_ii )     (4.13)

Algorithm 2: KEN-GURU(κ, S, η, ε)
Data: Kernel function κ, training set S, learning rate η, accuracy ε
Result: α
// initializations
forall m, n = 1..M do K_mn = κ(x_m, x_n) end
α ← 0; ν ← 0; t ← 1;
while ∆L ≥ ε do
    // randomize a sample
    i ← rand(M);
    // evaluate coefficients
    Compute ζ^{t+1} (Equation 4.10);
    Compute γ^{t+1} (Equation 4.11);
    Compute μ^{t+1} (Equation 4.12);
    // update alphas and the norm variable
    α^{t+1} ← γ^{t+1} α^t;
    α^{t+1}_i ← α^{t+1}_i + μ^{t+1};
    Compute ν_{t+1} (Equation 4.13);
    t ← t + 1;
end
return α;

The correctness of the algorithm stems directly from that of GURU.

Name             GURU(%)   SVM(%)
Ionosphere       83.55     81.58
diabetes         68.59     66.67
splice 1 vs. 2   92.28     92.28
USPS 3 vs. 5     97.86     98
USPS 5 vs. 8     98.29     98.71
USPS 7 vs. 9     98.43     97.86

Table 4.1: Results summary for KEN-GURU.
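The following is a minimal Python sketch of the KEN-GURU loop of Algorithm 2. It follows the reconstructed update formulas (Equations 4.10-4.13) literally; the exact constants inside the erf and exp terms, the initialization of nu, and the fixed iteration budget used as a stopping rule are assumptions here and should be checked against the derivation of GURU in Chapter 2.

import numpy as np
from math import erf, exp, sqrt, pi

def ken_guru(kernel, X, y, sigma, eta, n_iters, rng=None):
    """Sketch of KEN-GURU: kernelized Gaussian-robust training (Algorithm 2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    M = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # precomputed Gram matrix
    alpha = np.zeros(M)
    nu = 1e-8            # holds ||w||; small positive start to avoid division by zero
    for t in range(1, n_iters + 1):
        i = rng.integers(M)
        zeta = float((alpha * y) @ K[:, i])                # w^T x_i  (Equation 4.10)
        margin = 1.0 - y[i] * zeta
        gamma = 1.0 - (eta / sqrt(t)) * (sigma / (sqrt(pi) * nu)) \
                * exp(-margin**2 / (sigma**2 * nu**2))     # Equation 4.11
        mu = (eta / sqrt(t)) * erf(margin / (sigma * nu))  # Equation 4.12
        alpha *= gamma
        alpha[i] += mu
        nu = sqrt(max(gamma**2 * nu**2 + 2 * gamma * mu * y[i] * zeta
                      + mu**2 * K[i, i], 1e-16))           # Equation 4.13
    return alpha

# usage sketch: alpha = ken_guru(lambda a, b: (a @ b + 1) ** 3, X, y, sigma=0.5, eta=0.1, n_iters=1000)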
3. Experiments
In this section we present experimental results regarding the performance of KEN-GURU. We show how σ affects the learned classifier and then compare KEN-GURU to SVM on USPS pairs and on the Ionosphere database (see Table 2.1 for details). For the USPS tasks a polynomial kernel was used, and for Ionosphere an RBF kernel with γ = 1. The results are summarized in Table 4.1.

Consider Figure 4.1, in which KEN-GURU classifiers trained with a polynomial kernel for various values of the parameter σ are presented. The toy problem was synthesized by first generating points uniformly on a square centered at the origin. Points which fall within a ball around the origin were assigned a positive label; points sufficiently far outside the ball were taken as negative examples; points which fell in between were dropped. Observe that increasing σ puts extra emphasis on the number of samples in each class. Specifically, in the problem at hand, there are many more points outside the circle than inside. When σ is rather small, the training is 'local', in the sense that each sample governs what happens in its immediate environment. On the contrary, when σ is relatively big, the emphasis is on global tendencies.

On the Ionosphere database, KEN-GURU performs significantly better than SVM. Recall that the outperformance of GURU over SVM in this case is consistent with the performance in the case of a linear kernel. This behavior is explained by the noisy nature of the Ionosphere database. For the USPS pairs, KEN-GURU's performance is quite similar to that of SVM.

Figure 4.1: KEN-GURU performance on a radial data set. The green and red points indicate data points that were correctly classified (each color stands for one of the classes). Blue points indicate misclassification. The parameter σ determines how far the effect of each data point reaches. Note that for small values of σ, the behavior of the classifier is determined locally by nearby samples. For larger σ, the effect is global, in the sense that the behavior of the classifier is determined by close as well as distant data samples.

Chapter 5
The Multiclass Case

In the previous chapters we developed the binary algorithm GURU and its kernelized version KEN-GURU. In this chapter we analyze another extension of the algorithm, to the multiclass case. The ideas that were presented in Chapter 2 may be generalized to the multi-class setting. To that end, we first generalize the loss function we are working with. This goal is achieved by solving the generalized problem of the adversarial choice. After establishing this result we derive the effective robust loss function, and devise an optimization algorithm for it. We relax the problem twice in order to solve it. First, we work with the sum-of-hinges loss function (Weston and Watkins [1999]). In addition, we use a superset of noise distributions, one that contains all covariance matrices with a bounded maximal eigenvalue. By the end of the chapter we show that for the binary case the maximal-eigenvalue and trace constraints give the same result.

The setting we address in what follows is of data drawn from X = R^d, accompanied by labels drawn from Y = {1, 2, ..., C}. The learning task is to train the weight vectors w_1, w_2, ..., w_C. The target classifier is φ : X → Y, defined by

φ(x; w_1, w_2, ..., w_C) = arg max_{y ∈ Y} [ w_y^T x ]        (5.1)
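As a minimal illustration of the decision rule of Equation 5.1 (a sketch; the weight matrix and the test point are arbitrary), classification amounts to a row-wise inner product followed by an argmax.

import numpy as np

def predict(W, x):
    """Multiclass decision rule (Equation 5.1): W has one row w_y per class."""
    return int(np.argmax(W @ x)) + 1   # classes are numbered 1..C in the text

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # C = 3 classes, d = 2
print(predict(W, np.array([0.2, 0.9])))               # class whose weight vector scores highest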
1. Problem formulation
In this section we formally describe the generalization of the learning task from the binary to the multiclass case. We show that the generalization culminates in a loss function which is the sum of several appropriate binary losses. In Chapter 2 we started our derivation from the hinge loss

ℓ_hinge(x_m, y_m; w) = [1 − y_m w^T x_m]_+                     (5.2)

The most common generalization of the hinge loss to the multi-class case is

ℓ_mult(x_m, y_m; w_1, ..., w_C) = max_y [ w_y^T x_m − w_{y_m}^T x_m + δ_{y,y_m} ]     (5.3)

However, this loss function is not applicable in our framework (see Appendix C). Instead, we suggest minimizing the following surrogate loss function (Weston and Watkins [1999]):

ℓ_sum(x_m, y_m; w_1, w_2, ..., w_C) = Σ_{i≠y_m} [ 1 − (w_{y_m} − w_i)^T x_m ]_+       (5.4)

which is a surrogate to the zero-one loss. Let us write down the formulation of the problem in this case:

min_{w_1, w_2, ..., w_C} Σ_m max_{Σ∈Γ_β} E_{n∼N(0,Σ)} Σ_{y'≠y_m} [ 1 − (w_{y_m} − w_{y'})^T (x_m + n) ]_+     (5.5)

where

Γ_β = { Σ ∈ PSD | ρ(Σ) ≤ β }                                   (5.6)

and ρ is the spectral norm of a matrix, defined by ρ(A) = sqrt( λ_max(A*A) ). Using this set we constrain the maximal power of noise that the adversary may spread in each direction.
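A short sketch of the surrogate loss of Equation 5.4 for a single example (the weights and the sample below are arbitrary illustrations; classes are numbered 1..C as in the text):

import numpy as np

def sum_of_hinges(W, x, y):
    """Equation 5.4: sum over the wrong classes of [1 - (w_y - w_i)^T x]_+."""
    margins = (W[y - 1] - W) @ x                 # (w_y - w_i)^T x for every class i
    losses = np.maximum(0.0, 1.0 - margins)      # hinge on each pairwise margin
    losses[y - 1] = 0.0                          # the i = y term is excluded from the sum
    return float(losses.sum())

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(sum_of_hinges(W, np.array([0.5, 0.5]), y=1))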
2. The adversarial choice
In what follows we derive the adversarial choice for the problem at hand. It turns out that in the current setup, the solution is simpler than the one we had in Chapter 2.
Let us investigate the adversary's optimal way of spreading the noise. The ideas of the development are similar to those of Theorem 2.1. Denote

∆W_{y,y'} = w_y − w_{y'}                                       (5.7)

Using the same procedure we employed in the binary case (see Section 2.1 and Equation 2.15 there), we can write Equation 5.5 as:

min_{w_1, w_2, ..., w_C} Σ_m max_{Σ∈Γ_β} Σ_{y'≠y_m} L( x_m, +1; ∆W_{y_m,y'}, ∆W_{y_m,y'}^T Σ ∆W_{y_m,y'} )     (5.8)

i.e., the task at hand is to optimize the effective loss function

ℓ^rob_sum(x_m, y_m; w_1, w_2, ..., w_C, β) = max_{Σ∈Γ_β} Σ_{y'≠y_m} L( x_m, +1; ∆W_{y_m,y'}, ∆W_{y_m,y'}^T Σ ∆W_{y_m,y'} )     (5.9)

Observe that in every appearance of ℓ^rob_hinge, the label y_m was replaced with +1. The reason for this change is that we are classifying using the weight vector w_{y_m} − w_{y'}. That is, our prediction is

(w_{y_m} − w_{y'})^T x_m = w_{y_m}^T x_m − w_{y'}^T x_m

Our objective is, of course, to have w_{y_m}^T x_m > w_{y'}^T x_m, which corresponds to the label +1. The next theorem specifies the adversarial choice of the covariance matrix Σ, and is the multi-class analog of Theorem 2.1:

Theorem 2.1:
The optimal Σ in Equation 5.9 is given by Σ* = βI.

Proof:
In Lemma 2 we have shown that L is monotone increasing in its last argument. By the Cauchy-Schwarz inequality we have that

∆W_{y_m,y'}^T Σ ∆W_{y_m,y'} ≤ β ‖∆W_{y_m,y'}‖²                 (5.10)

On the other hand, it holds for all y' that

∆W_{y_m,y'}^T (βI) ∆W_{y_m,y'} = β ‖∆W_{y_m,y'}‖²              (5.11)

hence this upper bound is attained for all C − 1 summands concurrently with Σ = βI.

The geometric interpretation of this result is that under the spectral norm constraint, the adversary will choose to spread the noise in an isotropic fashion around the sample point. We thus get the following optimization problem:

min_{w_1, w_2, ..., w_C} Σ_m Σ_{y'≠y_m} L( x_m, +1; ∆W_{y_m,y'}, β ‖∆W_{y_m,y'}‖² )     (5.12)

Applying the same terminology used in the binary case, we have:

min_{w_1, w_2, ..., w_C} Σ_m Σ_{y'≠y_m} ℓ^rob_hinge( x_m, +1; ∆W_{y_m,y'}, β )          (5.13)

and Equation 5.9 equals

ℓ^rob_sum( x_m, y_m; w_1, w_2, ..., w_C, β ) = Σ_{y'≠y_m} ℓ^rob_hinge( x_m, +1; ∆W_{y_m,y'}, β )     (5.14)

2.2 The connection to the trace constraint

It is interesting to examine the reduction of the multiclass loss we have derived to the binary case. Note that since we have used a substantially larger matrix collection, there is no a priori reason to expect that the results will coincide. Taking C = 2 brings us back to the binary case. We use w_{+1}, w_{−1} for the weight vectors of the two classes. By expanding Equation 5.9, we get

ℓ^rob_sum( x_m, y_m; w_{+1}, w_{−1}, β ) = ℓ^rob_hinge( x_m, +1; w_{y_m} − w_{−y_m}, β )     (5.15)

If we take w = w_{+1} − w_{−1}, we end up with

ℓ^rob_sum( x_m, y_m; w_{+1}, w_{−1}, β ) = ℓ^rob_hinge( x_m, y_m; w, β )     (5.16)

It is interesting to observe that the resulting loss functions are identical, even though the constraints we put on the covariance matrices are different. In order to explain this phenomenon, let us go back to the geometric intuition that we gave prior to the proof of Theorem 2.1.

Figure 5.1: Visualization of Λ and Γ in the 2-dimensional case. The axes represent the eigenvalues of Σ. The dark shaded region contains all the matrices having λ_1 + λ_2 ≤ 1, i.e. corresponds to Λ. The light area corresponds to Γ, and consists of all the matrices with max{λ_1, λ_2} ≤ 1.

Consider Figure 5.1, which presents a visualization of Λ_β and Γ_β in the 2-dimensional case. What we have shown in Theorem 2.1 is that the multiclass adversary will choose the point (1, 1). Under the trace constraint, however, the adversary will have to choose either (1, 0), (0, 1), or any other point lying on the line connecting them. Our geometric intuition says that all the power that was not spread perpendicularly to the separating hyperplane is irrelevant. Thus, when the adversary has to choose a directional noise, he will take the perpendicular direction. On the other hand, if we limit his action axis-wise (and not overall), he will surely choose to spread the noise equally over all of the axes.

3. M-GURU: a primal algorithm for the multiclass case

In the following we generalize GURU (presented in Section 4) for the multiclass case. As a direct corollary of the results presented in previous chapters, we have that our loss function in this case is strictly convex. Thus, we turn to devise an SGD procedure. We shall begin by computing the gradient of ℓ^rob_sum( x_m, y_m; w_1, w_2, ..., w_C, β ).
For convenience, we write it in terms of the binary loss function ℓ^rob_hinge:

∇_{w_r} ℓ^rob_sum( x_m, y_m; w_1, w_2, ..., w_C, β ) =
    Σ_{y'≠y_m} ∇_w ℓ^rob_hinge( x_m, +1; w, β ) |_{w = w_{y_m} − w_{y'}}    if r = y_m
    −∇_w ℓ^rob_hinge( x_m, +1; w, β ) |_{w = w_{y_m} − w_r}                 otherwise

Following the considerations introduced in Section 4, we devise an SGD procedure for the minimization task (a code sketch appears after Table 5.1 below):

Algorithm 3: M-GURU(S, η, ε)
Data: Training set S, learning rate η, accuracy ε
Result: w
w ← 0;
while ∆L ≥ ε do
    m ← rand(M);
    for y' ∈ {1, 2, ..., C} do
        w_{y'} ← w_{y'} − (η/√t) ∇_{w_{y'}} ℓ^rob_sum( x_m, y_m; w_1, w_2, ..., w_C, β );
    end
end
return w;

In Algorithm 3, the notion of stochastic gradient was applied once, to the extent that our updates depend on a single sample in each iteration. It may be applied again, however: instead of updating all the weight vectors concurrently, one might randomize which vector to update as well. The resulting algorithm is

Algorithm 4: M-GURU-S(S, η, ε)
Data: Training set S, learning rate η, accuracy ε
Result: w
w ← 0;
while ∆L ≥ ε do
    m ← rand(M);
    y' ← rand(C);
    w_{y'} ← w_{y'} − (η/√t) ∇_{w_{y'}} ℓ^rob_sum( x_m, y_m; w_1, w_2, ..., w_C, β );
end
return w;

Table 5.1: Description of the databases used in the multiclass case.
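Before turning to the experiments, here is a structural sketch of the stochastic update of Algorithms 3 and 4 (referenced above). The helper grad_binary stands for the gradient of ℓ^rob_hinge from Chapter 2 and is an assumed placeholder here; only the multiclass bookkeeping of the cases formula is spelled out.

import numpy as np

def m_guru_step(W, x, y, beta, eta_t, grad_binary, single_class=None):
    """One SGD step of M-GURU (Algorithm 3) or M-GURU-S (Algorithm 4).

    W            : (C, d) array of weight vectors, classes numbered 1..C
    grad_binary  : callable (w, x, beta) -> gradient of the binary robust loss
                   at label +1; assumed to be supplied from Chapter 2.
    single_class : if given, update only that class (the M-GURU-S variant).
    """
    C = W.shape[0]
    classes = range(1, C + 1) if single_class is None else [single_class]
    W_new = W.copy()
    for r in classes:
        if r == y:
            # sum of binary gradients over all wrong classes
            g = sum(grad_binary(W[y - 1] - W[yp - 1], x, beta)
                    for yp in range(1, C + 1) if yp != y)
        else:
            # single binary gradient, with the opposite sign
            g = -grad_binary(W[y - 1] - W[r - 1], x, beta)
        W_new[r - 1] -= eta_t * g
    return W_new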
4. Experiments
M-GURU and M-GURU-S were tested on toy problems, USPS, and a couple of UCI databases (Frank and Asuncion [2010]). The datasets are detailed in Table 5.1. In Toy-3 and Toy-4 each class is a Gaussian distribution. These problems are visualized in Figure 5.3. The results are summarized in Table 5.2.

Observe that the performance of M-GURU is similar to that of SVM. Nonetheless, it should be noted that SVM slightly outperforms M-GURU. This difference is explained by the fact that M-GURU is based on the sum-of-hinges loss function, which is a looser surrogate of the zero-one loss than the SVM multi-hinge loss function. We have tested the relative performance of M-GURU and M-GURU-S on the Toy-3 dataset.

Figure 5.2: A typical run of M-GURU and M-GURU-S on the Toy-3 dataset. The loss is plotted against the number of updates that were performed. The S variant appears to have an advantage in the descent phase. In the convergence phase, however, M-GURU takes the lead. Overall, the performance of both variants is quite similar. (a) Linear scale. (b) Semi-logarithmic scale.

Name         M-GURU(%)  M-GURU-S(%)  SVM(%)
Toy-3        98.67      98           98.67
Toy-4        96         96           96
USPS 3,5,8   94.67      94.57        94.857
USPS 0-9     92.78      92.7         92.85
splice       89.08      89.08        89.5
wine         92.31      91.03        92.31

Table 5.2: Summary of the results.

Figure 5.3: The toy problems used in the testing of M-GURU and M-GURU-S. (a) Toy-3. (b) Toy-4.

We observe that M-GURU outperforms the S variant. Our experiments show that the empirical behavior of the classifiers stabilizes a significant time before the optimization process converges. Thus, M-GURU-S may be used to learn classifiers more quickly.

Chapter 6
Discussion
1. Contribution
In this work we presented a new robust learning framework. In our framework we minimize the expected loss over a spreading of the sample points. Each displacement is assumed to take place with a probability that depends on its distance from the original point. Thus, we effectively replace each point with a fading cloud.

We have analyzed the case of a Gaussian noise distribution, where the underlying loss measure is the hinge loss. In this case, we have shown that the resulting effective loss function is a smooth, strictly convex upper approximation of the hinge loss, denoted ℓ^rob_hinge. One of the main advantages of this loss function is that its parameter σ has a clear meaning: the variance of the noise that contaminates the data. Similarly to SVM, our algorithm, named GURU, depends on a single parameter. A significant difference is the ability to assign a value to this parameter. In the case of SVM, for a long time all that was known about this parameter is that it controls the tradeoff between the training error and the margin of the classifier. Xu et al. [2009] have shown that SVM is equivalent to a robust formulation in which the parameter corresponds to the radius of a rigid ball in which the sample point may be displaced. This result, however, relates the parameter to the entire data set; thus, it is still difficult to tune. In our method, σ is the magnitude of the noise that possibly corrupts each sample point, hence it can be evaluated from physical considerations, such as the process that generates the data. Without extra effort, we are able to point out an alternative explanation for the inability of non-regularized SVM to generalize: we have shown that as σ tends to 0, ℓ^rob_hinge coincides asymptotically with the hinge loss. Thus, non-regularized SVM may be understood as not trying to achieve robustness to perturbations, hence it tends to overfit the data.

We have shown that ℓ^rob_hinge may be written as a perspective of a smooth loss function (denoted f), where the scaling factor is σ‖w‖. This representation suggests that the robust framework we have developed introduces a multiplicative regularization. Using this representation we have derived a dual problem. The dual formulation depends on the actual loss function f only via its conjugate dual. Thus, it is possible to plug into the same formulation other losses that satisfy certain conditions. In particular, as we have demonstrated in Chapter 3, there is a tight connection between approximations of the loss function and relaxations of the dual problem. We believe that applying the same technique to other loss functions will result in new robust learning algorithms. The connection between the primal loss and the resulting dual should be investigated more thoroughly.

The algorithmic approach we have taken in this work is rather simplistic. Due to the fact that our objective is strictly convex, many off-the-shelf convex optimization algorithms may be used. Our method of choice was stochastic gradient descent. Furthermore, if there is a bound on the norm of the optimal classifier (as in SVM; see Shalev-Shwartz et al. [2007a] for details), it is probably possible to use it in order to achieve even faster algorithms. Specifically, subject to such a bound, we may restrict the optimization problem to a ball around the origin.
In this ball, it is possible that our loss function is strongly convex, hence it can be optimized using more aggressive procedures (Shalev-Shwartz and Kakade [2008]). Our generalization to Mercer kernels is done based on the primal formulation. In order to compute the updates quickly (in O(M)), we have shown how to maintain the value of the norm of the classifier in O(1), based on pre-computed values. This technique may be employed in Pegasos, for example, in order to perform the projection step efficiently.
2. Generalizations
The framework we have introduced may be generalized in a couple of interesting directions. Obviously, various families of noise distributions may be plugged into the model. One particularly interesting family is the class of all probability distributions having a specific first and second moment. Vandenberghe et al. [2007] have shown that the probability of a set defined by quadratic inequalities may be computed using semidefinite programming. In addition, they have shown that the optimum is achieved over a discrete probability distribution. We conjecture that a similar technique may be employed in our case, in order to show that the optimum of the loss expectation is attained over a discrete distribution. In addition, the same framework can be used in order to explore more convex perturbations. For example, in the field of computer vision it is possible to assume that the adversary rotates or translates the sample, and that the distribution of these perturbations is chosen adversarially. In order to make this practical, it is crucial to understand in which cases the integration and optimization of the loss are possible.

Regarding the theoretical aspects of this work, it still remains to show how to derive performance bounds for the introduced framework. In particular, it is interesting to understand what kind of guarantees can be derived for the general perspective-optimization framework we have discussed.

Bibliography
M. A. Aizerman, E. A. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, February 1998.

Jean-Baptiste Pothin and Cédric Richard. Incorporating prior information into support vector machines in the form of ellipsoidal knowledge sets. 2008.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, pages 44–58, 2002.

Dimitri P. Bertsekas, Angelia Nedic, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Nashua, USA, 2003.

Chiranjib Bhattacharyya, L. R. Grate, Michael I. Jordan, Laurent El Ghaoui, and I. Saira Mian. Robust sparse hyperplane classifiers: Application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073–1089, 2004a.

Chiranjib Bhattacharyya, Pannagadatta K. Shivaswamy, and Alex J. Smola. A second order cone programming formulation for classifying missing data. In NIPS, 2004b.

Jinbo Bi and Tong Zhang. Support vector classification with input data uncertainty. In NIPS, 2004.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7:108–116, 1994.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168, 2008.

Léon Bottou and Yann LeCun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, NIPS. MIT Press, 2003.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155–1178, 2007.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. In Advances in Computational Mathematics, pages 1–50. MIT Press, 2000.

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data, 1997.

Amir Globerson and Sam T. Roweis. Nightmare at test time: robust learning by feature deletion. In William W. Cohen and Andrew Moore, editors, ICML, volume 148 of ACM International Conference Proceeding Series, pages 353–360. ACM, 2006.

Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels, 2003.

Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors. Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8–11, 2008. MIT Press, 2009.

Angelia Nedic and Dimitri Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263–304. Kluwer, 2000.

Alexei Pozdnoukhov and Samy Bengio. A kernel classifier for distributions, 2005.

Andrew M. Ross. Useful bounds on the expected maximum of correlated normal variables, 2003.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, 2002.

Shai Shalev-Shwartz and Sham M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Koller et al. [2009], pages 1457–1464.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Zoubin Ghahramani, editor, ICML, volume 227 of ACM International Conference Proceeding Series, pages 807–814. ACM, 2007a.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM, 2007b. URL http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf. A fast online algorithm for solving the linear SVM in the primal using sub-gradients.

Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, and Alexander J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.

Alexander Smola and Bernhard Schölkopf. From regularization operators to support vector kernels. In Advances in Neural Information Processing Systems 10, pages 343–349. MIT Press, 1998.

Lieven Vandenberghe, Stephen Boyd, and Katherine Comanor. Generalized Chebyshev bounds via semidefinite programming. SIAM Review, 49, 2007.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks, volume 4, 1999.

Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In Koller et al. [2009], pages 1801–1808.

Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, 2009.

Jian Zhang, Rong Jin, Yiming Yang, and Alexander G. Hauptmann. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. In Tom Fawcett and Nina Mishra, editors, ICML, pages 888–895. AAAI Press, 2003.

Appendix A
Single-Point Algorithms
The objective of this work is to learn classifiers that are robust to noise. As discussed, a possible way to achieve this goal is by applying an adversarial framework. The most important issue in this case is designing an effective adversary. While in the previous chapters of the work we explored more sophisticated adversaries, it is nice to end the journey with a rather simple mathematical formulation. The binary version of the algorithm has been extensively studied; we review the results here for the sake of a complete presentation. A simple generalization for the multiclass case is presented subsequently.
1. Problem presentation
Maybe the simplest action that the adversary can take at test time is displacing a test point in such a way that this point becomes misclassified. If we limit the freedom given to the adversary, it might not be able to corrupt the classification of the point, but rather only reduce the associated confidence. The model that we explore in what follows grants the adversary the ability to displace a sample point within a ball centered at the original point.

In order for the learned classifier to be robust to such displacements, we should modify the objective of the learning task. In the following we present and analyze one way to do it, by optimizing the worst-case scenario:

min_w max_{‖∆x_m‖≤δ: m=1..M}  λ‖w‖² + Σ_{m=1}^M [ 1 − y_m w^T (x_m + ∆x_m) ]_+     (A.1)

This formulation has an additive structure, in which each term ∆x_m appears exactly once. We use this property in order to decouple the optimization problem. The learning task at hand in this case is thus

min_w  λ‖w‖² + Σ_{m=1}^M max_{‖∆x_m‖≤δ} [ 1 − y_m w^T (x_m + ∆x_m) ]_+             (A.2)

Recall that in the general SVM setting, one tries to minimize the hinge loss:

ℓ_hinge(x, y; w) = [1 − y w^T x]_+                             (A.3)

Equation A.2 can be interpreted as optimizing the effective loss function

ℓ^rob_hinge(x, y; w) = max_{‖∆x‖≤δ} [1 − y w^T (x + ∆x)]_+     (A.4)

We say that this loss function is robust, in the sense that it represents the worst-case loss subject to the potential action of the adversary.
2. Computing the optimal displacement
In order to derive a closed form for the loss function ℓ^rob_hinge, we should explore the nature of the adversarial choice in our model. Intuitively, the adversary will try to relocate the point to the wrong side of the separating hyperplane. To this end, it is pointless to move the point along any axis not orthogonal to the separating hyperplane. This idea is visualized in Figure A.1. We now prove this simple theorem:
Theorem 2.1:
The optimum of the maximization in Equation A.4 is achieved at x_opt = x − δ y w / ‖w‖.

Proof:
First we observe that the function f(z) = [1 − z]_+ is a monotone non-increasing function of its argument z. Thus, maximizing f(z) is equivalent to minimizing z. By the Cauchy-Schwarz inequality, we have that |y w^T ∆x| ≤ ‖w‖ · ‖∆x‖, with equality iff ∆x is proportional to w. Therefore, the minimal value of y w^T ∆x is attained at ∆x_opt = −δ y w / ‖w‖. We conclude that x_opt = x − δ y w / ‖w‖, as claimed. Plugging the result of the theorem above into Equation A.4, we end up with

ℓ^rob_hinge(x, y; w) = [ 1 − y w^T x + δ‖w‖ ]_+                (A.5)
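A short numerical check of Equation A.5 (a sketch; the data below are arbitrary): the closed form upper-bounds the hinge loss at any displacement inside the ball, and is attained at the worst-case displacement of Theorem 2.1.

import numpy as np

rng = np.random.default_rng(2)
w, x, y, delta = rng.normal(size=3), rng.normal(size=3), -1.0, 0.5

closed_form = max(0.0, 1.0 - y * (w @ x) + delta * np.linalg.norm(w))

# hinge loss at the worst-case displacement of Theorem 2.1
dx_opt = -delta * y * w / np.linalg.norm(w)
at_optimum = max(0.0, 1.0 - y * (w @ (x + dx_opt)))

# hinge loss at random displacements inside the ball never exceeds the closed form
best_random = 0.0
for _ in range(1000):
    dx = rng.normal(size=3)
    dx *= delta * rng.random() / np.linalg.norm(dx)
    best_random = max(best_random, max(0.0, 1.0 - y * (w @ (x + dx))))

print(np.isclose(closed_form, at_optimum), best_random <= closed_form + 1e-12)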
3. ASVC: Adversarial Support Vector Classification
The fact that Equation A.4 has a simple closed-form solution allows us to employ the algorithmic scheme of alternating optimization for Equation A.1. The structure of the algorithm is quite simple:

1. Alternately:
   (a) optimize for w;
   (b) optimize for ∆x_1, ∆x_2, ..., ∆x_M;
until convergence.

Figure A.1: The adversarial displacement employed by ASVC.
Notice that step 1a is nothing more than an SVM taking the displaced points as input. Furthermore, step 1b has a closed-form solution, as we proved in Theorem 2.1. Thus, to solve for the optimal classifier, any off-the-shelf SVM solver can be used. We end up with Algorithm 5.
Algorithm 5: ASVC(S, δ, λ)
Data: Training set S, radius δ, tradeoff λ
Result: The weight vector w
w ← 0;
repeat
    ∆x_m ← −δ y_m w / ‖w‖;
    ˜S ← { x_m + ∆x_m }_{x_m ∈ S};
    w ← solveSVM(˜S, λ);
until convergence;
return w;
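A compact sketch of Algorithm 5 in Python, using scikit-learn's LinearSVC as the off-the-shelf SVM solver. The mapping between the tradeoff λ of Equation A.1 and LinearSVC's C parameter, the fixed number of rounds in place of a convergence test, and the omission of the intercept are simplifying assumptions of this sketch.

import numpy as np
from sklearn.svm import LinearSVC

def asvc(X, y, delta, C=1.0, n_rounds=10):
    """Adversarial Support Vector Classification (Algorithm 5), alternating sketch."""
    w = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        norm = np.linalg.norm(w)
        if norm > 0:
            # worst-case displacement: each point is pushed toward the separator
            X_shift = X - delta * (y[:, None] * w[None, :]) / norm
        else:
            X_shift = X                      # first round: no displacement yet
        svm = LinearSVC(C=C, fit_intercept=False).fit(X_shift, y)
        w = svm.coef_.ravel()
    return w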
4. The Multiclass Case
Quite similar ideas can be adopted in order to generalize ASVC to the multiclass case. The multi-hinge loss is defined as

ℓ_mult(x_m, y_m; w_1, w_2, ..., w_C) = max_{y=1,2,...,C} [ δ_{y,y_m} − (w_{y_m} − w_y)^T x_m ]     (A.6)

Using the notions of the previous section, we define

ℓ^single_mult(x_m, y_m; w_1, w_2, ..., w_C) = max_{‖∆x‖≤δ} max_{y=1,2,...,C} [ δ_{y,y_m} − (w_{y_m} − w_y)^T (x_m + ∆x) ]     (A.7)

Note that the order of maximization can be exchanged, i.e.

ℓ^single_mult(x_m, y_m; w_1, w_2, ..., w_C) = max_{y=1,2,...,C} max_{‖∆x‖≤δ} [ δ_{y,y_m} − (w_{y_m} − w_y)^T (x_m + ∆x) ]     (A.8)

Applying a slight variation of Theorem 2.1, we conclude with

ℓ^single_mult(x_m, y_m; w_1, w_2, ..., w_C)
    = max_{y=1,2,...,C} [ δ_{y,y_m} − (w_{y_m} − w_y)^T ( x_m − δ (w_{y_m} − w_y) / ‖w_{y_m} − w_y‖ ) ]
    = max_{y=1,2,...,C} [ δ_{y,y_m} − (w_{y_m} − w_y)^T x_m + δ ‖w_{y_m} − w_y‖ ]
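A small sketch of the closed form derived above (the weights and the sample are arbitrary; the margin function playing the role of δ_{y,y_m} is assumed here to be the zero-one loss, which is an interpretation rather than something stated explicitly in the text):

import numpy as np

def single_point_multiclass_loss(W, x, y, delta, margin=None):
    """Worst-case multiclass loss over a delta-ball, per the closed form above."""
    if margin is None:
        # assumed margin term: 0 for the true class, 1 otherwise
        margin = lambda yp, yt: 0.0 if yp == yt else 1.0
    C = W.shape[0]
    values = []
    for yp in range(1, C + 1):
        dW = W[y - 1] - W[yp - 1]
        values.append(margin(yp, y) - dW @ x + delta * np.linalg.norm(dW))
    return max(values)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(single_point_multiclass_loss(W, np.array([0.5, 0.5]), y=1, delta=0.2))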
5. Related work
Our ASVC algorithm is a mirror reflection of TSVC, presented in (Bi & Zhang, NIPS 2004). TSVC performs alternating optimization, each time replacing the set of training samples with { x_i + y_i δ_i w / ‖w‖ }, which are more distant from the separator (thus, easier to classify). The idea there is to address the case in which noisy data distracts the classifier, by using the shifted training sets.

Figure A.2: The displacement employed by TSVC.

Appendix B
Diagonal Covariance

In this appendix we discuss the case in which the adversary is constrained to choose a diagonal covariance matrix. This setting corresponds to the case when the noise is aligned to the primary axes. In this case we are able to give a closed-form analytical result, subject to a bounded trace constraint on the covariance matrix. The adversarial choice problem can be written

max_{Σ = diag(a_1, a_2, ..., a_d), tr(Σ) ≤ β}  L( x_m, y_m; w, w^T Σ w )     (B.1)

Let us expand

w^T Σ w = w^T diag(a_1, a_2, ..., a_d) w = Σ_{i=1}^d a_i w_i² = a^T w^{·2}

where w^{·2} represents the coordinate-wise product of w with itself. Let i* be the index of the maximal entry in w^{·2}. It holds that

w^T Σ w ≤ ( Σ_i a_i ) w_{i*}² ≤ β w_{i*}²                      (B.2)

Using the same argumentation as in Chapter 2, we conclude that the adversary will choose the covariance matrix

Σ* = β e_{i*i*}                                                (B.3)

where e_{ij} is the matrix having zeros in all of its entries besides (i, j), where it takes the value 1. The geometric meaning of this result is that the adversary will choose to spread the noise in a single direction, along the primary axis that creates the biggest angle with the separating hyperplane.

Figure B.1: Under the diagonal covariance restriction, the adversary will choose to spread the noise in a unique direction. This direction is the one that creates the biggest angle with the separating hyperplane.

Appendix C
Using the Multi-Hinge Loss

The most common generalization of the hinge loss for the multiclass case is the following loss function

ℓ_mult(x_m, y_m; w_1, ..., w_C) = max_y [ w_y^T x_m − w_{y_m}^T x_m + δ_{y,y_m} ]     (C.1)

(see Crammer and Singer [2002]). In this appendix we point out some of the issues that made us choose to work with the sum-of-hinges loss function and not with the one above. If we plug the multi-hinge loss into our framework, we get the following learning problem:

min_w Σ_m max_{Σ∈S} ∫ p( x̂ | x_m; Σ ) max_y [ w_y^T x̂ − w_{y_m}^T x̂ + δ_{y,y_m} ] dx̂     (C.2)

Define ∆w_{y,y_m} = w_y − w_{y_m} and write:

min_w Σ_m max_{Σ∈S} ∫ p( x̂ | x_m; Σ ) max_y [ ∆w_{y,y_m}^T x̂ + δ_{y,y_m} ] dx̂           (C.3)

And for Gaussian noise this is:

min_w Σ_m max_{Σ∈S} c |Σ|^{−1/2} ∫ exp( −½ n^T Σ^{−1} n ) max_y [ ∆w_{y,y_m}^T x_m + ∆w_{y,y_m}^T n + δ_{y,y_m} ] dn     (C.4)

The ability to understand the solution of the adversarial choice problem in this case is connected to the ability to understand the expectation of the maximum of a set of normal random variables. This problem probably does not have an analytical solution (see Ross [2003]).

UNIDIRECTIONAL NOISE
In another approach we have studied, we assumed an adversary that spreads the noise in a single direction. The motivation for this kind of adversary is the solution to the adversarial choice problem in the binary case. We formulate the problem by letting the adversary choose a unit-length vector. Thus, in the case of unidirectional noise, the task that the adversary faces is:

max_{v: ‖v‖≤1} ∫_R N_z(0, σ²) max_y [ ∆w_{y,y_m}^T x_m + ∆w_{y,y_m}^T (z v) + δ_{y,y_m} ] dz     (C.5)

The integrand (excluding the pdf) is a piecewise linear function of z. The knees of this function, as well as the slopes of the linear sections, are strongly dependent on v.