Adaptive versus Standard Descent Methods and Robustness Against Adversarial Examples
Marc Khoury*
University of California, Berkeley
February 11, 2020
Abstract
Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. In this paper we study how the choice of optimization algorithm influences the robustness of the resulting classifier to adversarial examples. Specifically we show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both L_2- and L_∞-adversaries than the solution found by non-adaptive algorithms. Then we fully characterize the geometry of the loss landscape of L_2-adversarial training in least-squares linear regression. The geometry of the loss landscape is subtle and has important consequences for optimization algorithms. Finally we provide experimental evidence which suggests that non-adaptive methods consistently produce more robust models than adaptive methods.

keywords: adversarial examples, robustness, optimization, geometry

* [email protected]

Introduction
Adversarial examples are a pervasive phenomenon of machine learning models where perturbations of the input that are imperceptible to humans reliably lead to confident incorrect classifications (Szegedy et al. (2013); Goodfellow et al. (2014)). Since this phenomenon was first observed, researchers have attempted to develop methods which produce models that are robust to adversarial perturbations under specific attack models (Wong and Kolter (2018); Sinha et al. (2018); Raghunathan et al. (2018); Mirman et al. (2018); Madry et al. (2018); Zhang et al. (2019)). As machine learning proliferates into society, including security-critical settings like health care (Esteva et al. (2017)) or autonomous vehicles (Codevilla et al. (2018)), it is crucial to develop methods that allow us to understand the vulnerability of our models and design appropriate counter-measures.

Additionally there is a growing literature on the theory of adversarial examples. Many of these results attempt to understand adversarial examples by constructing examples of learning problems for which it is difficult to construct a classifier that is robust to adversarial perturbations. This difficulty may arise due to sample complexity (Schmidt et al. (2018)), computational constraints (Bubeck et al. (2019); Degwekar et al. (2019)), or the high-dimensional geometry of the initial feature space (Shafahi et al. (2019); Khoury and Hadfield-Menell (2018)). We expand upon these results in Section 2.

Currently less well understood, and to our knowledge not addressed by the theoretical literature on adversarial examples, is how our algorithmic choices affect the robustness of our models. With respect to optimization and generalization, but importantly not robustness, the success of standard (or non-adaptive) gradient descent methods, including stochastic gradient descent (SGD) and SGD with momentum, is starting to be better understood (Du et al. (2019); Allen-Zhu et al. (2019); Gunasekar et al. (2018a;b)). However, as an increasing amount of time has been spent training deep networks, researchers and practitioners have heavily employed adaptive gradient methods, such as Adam (Kingma and Ba (2015)), Adagrad (Duchi et al. (2011)), and RMSprop (Tieleman and Hinton (2012)), due to their rapid training times (Karpathy (2017)). Unfortunately the properties of adaptive optimization algorithms are less well understood than those of their non-adaptive counterparts. Wilson et al. (2017) provide theoretical and empirical evidence which suggests that adaptive algorithms often produce solutions that generalize worse than those found by non-adaptive algorithms.

In this paper, we study the robustness of solutions found by adaptive and non-adaptive algorithms to adversarial examples. Furthermore we study the effect of adversarial training on the geometry of the loss landscape and, consequently, on the solutions found by adaptive and non-adaptive algorithms for the adversarial training objective. Our paper makes the following contributions.

• We show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both L_2- and L_∞-adversaries than the solution found by non-adaptive algorithms. Furthermore the robustness of the adaptive solution decreases rapidly as the dimension of the problem increases, while the robustness of the non-adaptive solution is stable as the dimension increases.
• We fully characterize the geometry of the loss landscape of L_2-adversarial training in least-squares linear regression. The L_2-adversarial training objective L_ε is convex everywhere; moreover, it is strictly convex everywhere except along either 0, 1, or 2 line segments, depending on the value of ε. Furthermore for nearly all choices of ε, these line segments along which L_ε is convex, but not strictly convex, lie outside of the rowspace and the gradient along these line segments is nonzero. It follows that any reasonable optimization algorithm finds the unique global minimum of L_ε.

• We conduct an extensive empirical evaluation to explore the effect of different optimization algorithms on robustness. Our experimental results suggest that non-adaptive methods consistently produce more robust models than adaptive methods.

• For all ε ≠ 1/‖X†y‖_2, the global minimum of L_ε is unique (Corollary 6).

Related Work
There has been a long line of work on the theory of adversarial examples. Schmidt et al. (2018) explore the sample complexity required to produce robust models. They demonstrate a simple setting, a mixture of two Gaussians, in which a linear classifier with near perfect natural accuracy can be learned from a single sample, but any algorithm that produces any binary classifier requires Ω(√d) samples to produce a robust classifier. Followup work by Bubeck et al. (2019) suggests that adversarial examples may arise from computational constraints. They exhibit pairs of distributions that differ only in a k-dimensional subspace, and are otherwise standard Gaussians, and show that while it is information-theoretically possible to distinguish these distributions, it requires exponentially many queries in the statistical query model of computation. We note that both of these constructions produce distributions whose support is the entirety of R^d.

Bubeck et al. (2019) further characterize five mutually exclusive "worlds" of robustness, inspired by similar characterizations in complexity theory (Impagliazzo (1995)). A learning problem must fall into one of the following possibilities:

World 1: No robust classifier exists, regardless of computational considerations or sample efficiency.
World 2: Robust classifiers exist, but they are computationally inefficient to evaluate.
World 3: Computationally efficient robust classifiers exist, but learning them requires more samples.
World 4: Computationally efficient robust classifiers exist and can be learned from few samples, but learning is inefficient.
World 5: Computationally efficient robust classifiers exist and can be learned efficiently from few samples.

While learning problems can be constructed that fall into each possible world, the question for researchers is into which world are problems from practice most likely to fall? Every theoretical construction, such as those by Schmidt et al. (2018) and Bubeck et al. (2019), can be thought of as providing evidence for the prevalence of one of the worlds. In the language of Bubeck et al. (2019), the sample complexity result of Schmidt et al. (2018) provides evidence for World 3, by constructing an example of a problem that falls into that world. The learning problem constructed by Bubeck et al. (2019) provides evidence for World 4. Subsequent work by Degwekar et al. (2019) provides evidence for Worlds 2 and 4. Under standard cryptographic assumptions, Degwekar et al. (2019) construct a learning problem for which a computationally efficient non-robust classifier exists, no efficient robust classifier exists, but an inefficient robust classifier exists. Similarly, assuming the existence of one-way functions, they construct a learning problem for which an efficient robust classifier exists, but it is computationally inefficient to learn a robust classifier. Finally, in an attempt to understand how likely World 4 is in practice, they show that any task where an efficient robust classifier exists but is hard to learn in polynomial time implies one-way functions.

Additionally there is a line of work that attempts to explain the pervasiveness of adversarial examples through the lens of high-dimensional geometry. Gilmer et al. (2018) experimentally evaluated the setting of two concentric under-sampled spheres embedded in a high-dimensional ambient space, and concluded that adversarial examples occur on the data manifold. Shafahi et al. (2019) suggest that adversarial examples may be an unavoidable consequence of the high-dimensional geometry of data. Their result depends upon the use of an isoperimetric inequality. The main drawback of these results, as well as the constructions of Schmidt et al. (2018) and Bubeck et al. (2019), is that they assume that the support of the data distribution has full or nearly full dimension. We do not believe this to be the case in practice. Instead we believe that the data distribution is often supported on a very low-dimensional subset of R^d. This case is addressed in Khoury and Hadfield-Menell (2018), where they consider the problem of constructing decision boundaries robust to adversarial examples when data is drawn from a low-dimensional manifold embedded in R^d. They highlight the role of co-dimension, the difference between the dimension of the embedding space and the dimension of the data manifold, as a key source of the pervasiveness of adversarial vulnerability. Said differently, it is the low-dimensional structure of features embedded in high-dimensional space that contributes, at least in part, to adversarial examples. This idea is also explored in Nar et al. (2019), but with emphasis on the cross-entropy loss.

We believe that problems in practice are most likely to fall into World 5, the best of all worlds. Problems in this class have robust classifiers which are efficient to evaluate and can be learned efficiently from relatively few samples. We simply haven't found the right algorithm for learning such classifiers. The goal of this paper is to explore the effect of our algorithms on robustness.
Specifically we wish to understand the robustness properties of solutions found by common optimization algorithms. To our knowledge no other work has explored the robustness properties of solutions found by different optimization algorithms. Thus at least one community will be happy.

Adaptive Algorithms May Significantly Reduce Robustness
Wilson et al. (2017) explore the effect of different optimization methods on generalization both in a simple theoretical setting and empirically. For their main theoretical result, they construct a learning problem for which the solution found by any adaptive method, denoted w_ada, has worse generalization properties than the solution found by non-adaptive methods, denoted w_SGD. We recall their construction in Section 3.1. In Section 3.2 we describe the adaptive solution w_ada and in Section 3.3 we describe the non-adaptive solution w_SGD.

Generalization and robustness are different properties of a classifier. A classifier can generalize well but have terrible robustness properties, as we often see in practice. On the other hand, a constant classifier generalizes poorly, but has perfect robustness (Zhang et al. (2019)). Wilson et al. (2017) study the generalization properties of w_ada and w_SGD, but not their robustness properties. In Section 3.4 we study the robustness properties of w_ada and w_SGD. Specifically, we show that w_SGD exhibits superior robustness properties to w_ada against both L_2- and L_∞-adversaries.

Let X ∈ R^{n×d} be a design matrix representing a dataset with n sample points and d features and let y ∈ {±1}^n be a vector of labels. Wilson et al. (2017) restrict their attention to binary classification problems of this type, and learn a classifier by minimizing the least-squares loss

    min_w L(X, y; w) = min_w (1/2) ‖Xw − y‖_2^2.    (1)

They construct the following learning problem for which they can solve for both the adaptive and non-adaptive solutions in closed form. Their construction uses an infinite-dimensional feature space for simplicity, but they note that n dimensions suffice. For i ∈ 1 . . . n, sample y_i = 1 with probability p, and y_i = −1 with probability 1 − p, for some p > 1/2. Then set x_i to be the infinite-dimensional vector with coordinates

    x_ij = y_i            if j = 1,
           1              if j = 2, 3,
           1              if j = 4 + 5(i − 1),
           (1 − y_i)/2    if j = 5 + 5(i − 1), . . . , 8 + 5(i − 1),
           0              otherwise.    (2)

In a dataset with three sample points following Equation 2, for example, each sample occupies its own block of five coordinates in addition to the three leading coordinates shared by every sample. The first feature encodes the label, and is alone sufficient for classification. Note that this trick of encoding the label is also commonly used in the robustness literature to construct examples of hard-to-learn-robustly problems (Bubeck et al. (2019); Degwekar et al. (2019)). The second and third features are identically 1 for every sample. Then there is a subset of five dimensions which are identified with x_i and contain a set of features which are unique to x_i. If y_i = 1 then there is a single 1 in this subset of five dimensions and x_i is the only sample with a 1 in this dimension. If y_i = −1 then all five dimensions are set to 1 and again x_i is the only sample with a 1 at these five positions.

While this problem may seem contrived, it contains several properties that are common in machine learning problems and that are particularly important for robustness. It contains a single robust feature that is strongly correlated with the label. However it may not be easy for an optimization algorithm to identify such a feature. Additionally there are many non-robust features which are weakly or not at all correlated with the label, but which may appear useful for generalization because they are uniquely identified with samples from specific classes. Wilson et al. (2017) show that both adaptive and non-adaptive methods find classifiers that place at least some weight on every nonzero feature.
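To make the construction concrete, the following sketch (ours, not Wilson et al.'s code) builds a design matrix with the structure of Equation 2 in NumPy using 0-based indexing; the helper name make_wilson_dataset is purely illustrative.

```python
import numpy as np

def make_wilson_dataset(y):
    """Design matrix of Section 3.1 (Equation 2) for a given vector of +/-1 labels.

    Sample x_i has: coordinate 1 equal to its label y_i, coordinates 2 and 3 equal
    to 1, and a block of five coordinates used only by x_i.  A positive sample puts
    a single 1 in its block; a negative sample sets all five entries of the block to 1.
    """
    n = len(y)
    X = np.zeros((n, 3 + 5 * n))          # n dimensions suffice; 3 + 5n is convenient
    X[:, 0] = y                           # feature 1 encodes the label
    X[:, 1:3] = 1.0                       # features 2 and 3 are identically 1
    for i, yi in enumerate(y):
        b = 3 + 5 * i                     # the block of five features owned by x_i
        X[i, b] = 1.0
        if yi == -1:
            X[i, b + 1:b + 5] = 1.0
    return X

rng = np.random.default_rng(0)
p = 0.75                                  # P(y_i = +1); any p > 1/2 works
y = np.where(rng.random(8) < p, 1.0, -1.0)
X = make_wilson_dataset(y)
```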
The adaptive solution w_ada

Let (X, y) be generated by the generative model in Section 3.1. When initialized at the origin, Wilson et al. (2017) show that any adaptive optimization algorithm (such as RMSprop, Adam, and Adagrad) minimizing Equation 1 for (X, y) converges to w_ada ∝ v where

    v_j = 1                   if j = 1,
          1                   if j = 2, 3,
          y_⌊(j+1)/5⌋         if j > 3 and x_{⌊(j+1)/5⌋, j} = 1,
          0                   otherwise.    (4)

Thus we can write w_ada = τ v for some positive constant τ > 0. On a test example (x_test, y_test) that is distinct from all the training examples, ⟨w_ada, x_test⟩ = τ(y_test + 2) > 0. Thus w_ada labels every unseen example as a positive example.
The non-adaptive solution w_SGD

For (X, y), let P, N denote the sets of positive and negative samples in X respectively. Let n_+ = |P|, n_− = |N|, and note that n = n_+ + n_−. When the weight vector is initialized in the row space of X, Wilson et al. (2017) show that all non-adaptive methods (such as gradient descent, SGD, SGD with momentum, Nesterov's method, and conjugate gradient) converge to w_SGD = X†y, where X† denotes the pseudo-inverse. That is, among the infinitely many solutions of the underdetermined system Xw = y, non-adaptive methods converge to the solution which minimizes ‖w‖_2, and thus maximizes the L_2-margin. Specifically w_SGD = ∑_{i∈P} α_+ x_i + ∑_{j∈N} α_− x_j where

    α_+ = (4n_− + 5) / (15n_+ + 3n_− + 8n_+n_− + 5),    α_− = −(4n_+ + 1) / (15n_+ + 3n_− + 8n_+n_− + 5).

Note that these values for α_+, α_− differ slightly from those presented in Wilson et al. (2017). In Appendix B we discuss in detail two errors in their derivation that lead to this discrepancy. These errors do not qualitatively change their results. Furthermore, in Appendix A.2 we carefully discuss under what conditions ⟨w_SGD, x_test⟩ is positive and negative for y_test = ±1. For now, we simply state that for all n_+, n_− ≥ 1, w_SGD correctly classifies every test example.
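The closed forms above are easy to check numerically. The sketch below, which assumes the make_wilson_dataset helper from the previous listing, computes w_SGD = X†y with the pseudo-inverse, verifies that it matches the combination ∑_{i∈P} α_+ x_i + ∑_{j∈N} α_− x_j, and contrasts its prediction on a fresh negative test example with that of the direction v from Equation 4.

```python
import numpy as np

y = np.array([1., 1., 1., 1., -1., -1.])             # n_+ = 4 positives, n_- = 2 negatives
X = make_wilson_dataset(y)                            # helper from the previous sketch
n_pos, n_neg = int((y == 1).sum()), int((y == -1).sum())

# Non-adaptive solution: the minimum L2-norm interpolant X^† y.
w_sgd = np.linalg.pinv(X) @ y

# The same vector from the closed-form coefficients of Section 3.3.
D = 15 * n_pos + 3 * n_neg + 8 * n_pos * n_neg + 5
a_pos, a_neg = (4 * n_neg + 5) / D, -(4 * n_pos + 1) / D
w_closed = a_pos * X[y == 1].sum(axis=0) + a_neg * X[y == -1].sum(axis=0)
assert np.allclose(w_sgd, w_closed)

# Direction of the adaptive solution (Equation 4): 1 on the three shared features,
# y_i on the features owned by sample i, 0 elsewhere.
v = np.zeros(X.shape[1])
v[:3] = 1.0
for i, yi in enumerate(y):
    v[3 + 5 * i: 3 + 5 * (i + 1)] = yi * X[i, 3 + 5 * i: 3 + 5 * (i + 1)]

# A fresh negative test example; its own block of five features was never seen in training.
x_test = np.zeros(X.shape[1] + 5)
x_test[0], x_test[1:3] = -1.0, 1.0
x_test[-5:] = 1.0                                     # all five block entries are 1 since y_test = -1
w_pad, v_pad = np.pad(w_sgd, (0, 5)), np.pad(v, (0, 5))
print(np.sign(w_pad @ x_test))                        # -1.0: w_SGD classifies it correctly
print(np.sign(v_pad @ x_test))                        #  1.0: w_ada labels every unseen example positive
```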
Robustness of w_ada and w_SGD

In this section we analyze the robustness properties of w_ada and w_SGD against L_2- and L_∞-adversaries. We show that w_SGD exhibits considerably more robustness against both L_2- and L_∞-adversaries than w_ada. A priori this is surprising; one may have expected w_ada, which is a small L_∞-norm solution, to be more robust to L_∞-perturbations, while w_SGD, which is a small L_2-norm solution, to be more robust to L_2-perturbations. However this expectation is wrong. Interestingly the robustness of w_ada against both L_2- and L_∞-adversaries decreases as the dimension increases, whereas the robustness of w_SGD does not. Finally, neither method recovers the "obvious" solution w* = (1, 0, . . . , 0), which generalizes well and is optimally robust against both L_2- and L_∞-perturbations.

Theorems 1 and 3 are our main results of this section; the proofs are deferred to Appendix A. We start by computing the robustness of w_ada against L_2- and L_∞-adversaries.

Theorem 1.
Let (x_test, y_test) be a test sample that is correctly classified by w_ada and let δ ∈ R^d be a perturbation. The adaptive solution w_ada is robust against any L_2-perturbation for which

    ‖δ‖_2 < √(9n_+ + 1125n_− + 27) / (25n_− + n_+ + 3)    (5)

and any L_∞-perturbation for which

    ‖δ‖_∞ < 3 / (3 + n_+ + 5n_−).    (6)
Furthermore these bounds are tight, meaning that an L_2- or L_∞-ball with these radii centered at x_test intersects the decision boundary.

Corollary 2.
Asymptotically, the L_2- and L_∞-robustness of w_ada are, respectively,

    Θ( 1 / √(n_+ + n_−) )    and    Θ( 1 / (n_+ + n_−) ).

In particular both the L_2- and L_∞-robustness go to 0 as the number of samples n_+, n_− → ∞.

The rate at which the L_2- and L_∞-robustness of w_ada decrease reflects a dependence on dimension. The number of dimensions on which w_ada puts nonzero weight increases as we increase the number of samples, which reduces robustness. We also find it interesting that, despite classifying every test point as a positive example, w_ada's predictions on correctly classified test samples are brittle. In summary, w_ada exhibits nearly no robustness against L_2- or L_∞-adversaries.

Next we show that w_SGD exhibits significant robustness against both L_2- and L_∞-adversaries.
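As a quick numerical illustration of Corollary 2, the following sketch of ours evaluates the radii from Equations 5 and 6 for increasingly large balanced samples; both shrink toward zero, roughly like 1/√n and 1/n respectively.

```python
import numpy as np

def wada_l2_radius(n_pos, n_neg):
    # Equation 5
    return np.sqrt(9 * n_pos + 1125 * n_neg + 27) / (25 * n_neg + n_pos + 3)

def wada_linf_radius(n_pos, n_neg):
    # Equation 6
    return 3.0 / (3 + n_pos + 5 * n_neg)

for n in [10, 100, 1000, 10000]:
    print(n, wada_l2_radius(n, n), wada_linf_radius(n, n))
```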
Theorem 3.

Let (x_test, y_test) be a test sample that is correctly classified by w_SGD and let δ ∈ R^d be a perturbation. The SGD solution w_SGD is robust against any L_2-perturbation for which

    ‖δ‖_2 ≤ (15n_+ + 8n_+n_− − n_−) / √(64n_+^2 n_−^2 + 160n_+^2 n_− + 75n_+^2 + 32n_+n_−^2 + 70n_+n_− + 3n_−^2 + 25n_+ + 5n_−)    if y_test = 1,
    ‖δ‖_2 ≤ (−5n_+ + 8n_+n_− + 3n_−) / √(64n_+^2 n_−^2 + 160n_+^2 n_− + 75n_+^2 + 32n_+n_−^2 + 70n_+n_− + 3n_−^2 + 25n_+ + 5n_−)    if y_test = −1,    (7)

and any L_∞-perturbation for which

    ‖δ‖_∞ ≤ (15n_+ + 8n_+n_− − n_−) / (20n_+ + 32n_+n_− + 4n_−)    if y_test = 1,
    ‖δ‖_∞ ≤ (−5n_+ + 8n_+n_− + 3n_−) / (20n_+ + 32n_+n_− + 4n_−)    if y_test = −1.    (8)

Furthermore these bounds are tight, meaning that an L_2- or L_∞-ball with these radii centered at x_test intersects the decision boundary.

Corollary 4.
Asymptotically, the L_2- and L_∞-robustness of w_SGD are both Θ(1).
In particular the L_2-robustness approaches 1 and the L_∞-robustness approaches 1/4 as the number of samples n_+, n_− → ∞.

Unsurprisingly, w_SGD, which maximizes the L_2-margin, exhibits near optimal robustness against L_2-adversaries. As the number of samples increases, the L_2-robustness of w_SGD approaches 1. Perhaps surprisingly, w_SGD also exhibits moderate robustness to L_∞-perturbations. As the number of samples increases, the L_∞-robustness of w_SGD approaches 1/4. Unlike w_ada, the amount of robustness exhibited by w_SGD does not decrease as the dimension increases, instead asymptotically approaching a constant.

However the L_2-robustness of w_SGD is not exactly 1 for any finite sample. One class (y_test = 1) approaches 1 from above, while the other class (y_test = −1) approaches 1 from below. To maximize the margin, w_SGD places a small amount of weight on every other nonzero feature, even though all but the first are useless for classification. This lack of sparsity is also what causes the L_∞-robustness to drop from a possible maximum of 1 to 1/4. In contrast, w* = (1, 0, . . . , 0) generalizes perfectly, has L_2-robustness equal to 1 for both classes, and, as an added benefit, has L_∞-robustness equal to 1 for both classes. Thus we have an example of a problem for which the max L_2-margin solution could reasonably be considered to not be the best classifier against L_2-perturbations.

Furthermore, w* is not in the row space of X. (w_SGD is the projection of w* onto the row space.) Thus non-adaptive methods, when restricted to the row space, are incapable of recovering w*, irrespective of sample complexity (Schmidt et al. (2018)) or computational considerations (Bubeck et al. (2019)). This is simply the wrong algorithm for the desired objective. In the next section we study the effect that adversarial training has on the loss landscape and on the solutions found by various optimization algorithms.

In the previous section we presented a learning problem for which adaptive optimization methods find a solution with significantly worse robustness properties against both L_2- and L_∞-adversaries compared to non-adaptive methods. In this section we consider a different algorithm, adversarial training, for finding robust solutions to Equation 1. We are interested in two questions. First, does adversarial training sufficiently regularize the loss landscape so that adaptive and non-adaptive methods find solutions with identical or qualitatively similar robustness properties? Second, are the solutions to the robust objective qualitatively different than those found by natural training, or does adversarial training simply choose a robust solution from the space of solutions to the natural problem? We address the first question in Section 4.2 for L_2-adversarial training and the second in Section 4.3 for the learning problem defined in Section 3.

Madry et al. (2018) formalize adversarial training by introducing the robust objective

    min_w E_{(x,y)∼D} [ max_{δ∈∆} L(x + δ, y; w) ],    (9)

where D is the data distribution, ∆ is a perturbation set meant to enforce a desired constraint, and L is a loss function. The goal then is to find a setting of the parameters w of the model that minimizes the expected loss against the worst-case perturbation in ∆.

Take L as in Equation 1 and ∆ to be an L_p-ball of radius ε > 0. In the linear case, we can solve the inner maximization problem exactly.
    max_{{δ_i}_{i∈[n]} ∈ ∆^n} L(x + δ, y; w)
        = max_{{δ_i} ∈ ∆^n} (1/2) ∑_{i=1}^n ( ⟨x_i + δ_i, w⟩ − y_i )^2    (10)
        = max_{{δ_i} ∈ ∆^n} (1/2) ∑_{i=1}^n [ (⟨x_i, w⟩ − y_i)^2 + 2⟨δ_i, w⟩(⟨x_i, w⟩ − y_i) + ⟨δ_i, w⟩^2 ]
        = (1/2) ∑_{i=1}^n [ (⟨x_i, w⟩ − y_i)^2 + 2ε‖w‖_* sign(⟨x_i, w⟩ − y_i)(⟨x_i, w⟩ − y_i) + ε^2 ‖w‖_*^2 ]
        = (1/2) ‖Xw − y‖_2^2 + ε‖w‖_* ‖Xw − y‖_1 + (ε^2 n / 2) ‖w‖_*^2.    (11)

The third identity follows from the definition of the dual norm, where ‖·‖_* denotes the norm dual to the L_p norm that defines ∆. As a technical note, it is important that sign(0) = 1 (or −1) and not equal to 0. This choice represents the fact that the solution to the inner maximization problem for each individual squared term (⟨x_i + δ_i, w⟩ − y_i)^2 is nonzero even if x_i^⊤ w − y_i = 0.

At first glance the objective looks similar to ridge regression or Lasso, particularly when we consider L_2- and L_∞-adversarial training, for which the dual norms are L_2 and L_1 respectively. However the solutions to this objective are not, in general, identical to the ridge regression or Lasso solutions. In Section 4.2 we will show how the second term ε‖w‖_* ‖Xw − y‖_1 influences the geometry of the loss landscape when ‖·‖_* = ‖·‖_2.

For the remainder of the paper we will exclusively analyze the case where ∆ is an L_2-ball, leaving the case of L_∞ for future work. We define the loss of interest

    L_ε(X, y; w) = (1/2) ‖Xw − y‖_2^2 + ε‖w‖_2 ‖Xw − y‖_1 + (ε^2 n / 2) ‖w‖_2^2.    (12)

To build intuition, suppose that Xw = y is an underdetermined system. (Our results will not depend on this assumption.) The set of solutions is given by the affine subspace S = {X†y + u : u ∈ nullspace(X)}, where X† is the pseudo-inverse. The first thing to notice about ε‖w‖_2 ‖Xw − y‖_1 is that, on its own, it is non-convex, having local minima both at the origin and in S. Along any path starting at the origin and ending at a point in S, the loss landscape induced by ε‖w‖_2 ‖Xw − y‖_1 is negatively curved.

The second thing to notice about ε‖w‖_2 ‖Xw − y‖_1 is that it is non-smooth. To understand the loss landscape of Equation 12, it is crucial to understand where ‖Xw − y‖_1 is non-smooth. The term ‖Xw − y‖_1 = ∑_i |x_i^⊤ w − y_i| is non-smooth at any point w where some x_i^⊤ w − y_i = 0. Geometrically, x_i^⊤ w − y_i = 0 is the equation of a hyperplane h_i with normal vector x_i and bias y_i. The hyperplane h_i partitions R^d into two halfspaces h_i^+, h_i^− such that every point w ∈ h_i^+ has sign(x_i^⊤ w − y_i) = 1 and every w ∈ h_i^− has sign(x_i^⊤ w − y_i) = −1. The set of hyperplanes {h_i : i ∈ [n]} defines a hyperplane arrangement H, a subdivision of R^d into convex cells. See Figure 1.
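For reference, Equation 12 and its gradient on the interior of a cell (where sign(Xw − y) is constant) translate directly into a few lines of NumPy. The sketch below is ours and uses the convention sign(0) = 1 discussed above.

```python
import numpy as np

def robust_loss(w, X, y, eps):
    """L_eps(X, y; w) from Equation 12 (L2 perturbation set of radius eps)."""
    r = X @ w - y
    return 0.5 * r @ r + eps * np.linalg.norm(w) * np.abs(r).sum() \
        + 0.5 * eps**2 * len(y) * (w @ w)

def robust_loss_subgrad(w, X, y, eps):
    """A subgradient of L_eps at w (equal to the gradient wherever L_eps is differentiable)."""
    r = X @ w - y
    s = np.where(r >= 0, 1.0, -1.0)          # sign(0) = 1 by convention
    nw = np.linalg.norm(w)
    unit_w = w / nw if nw > 0 else np.zeros_like(w)
    return X.T @ r + eps * (unit_w * np.abs(r).sum() + nw * (X.T @ s)) \
        + eps**2 * len(y) * w
```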
Let C ∈ H be a cell of the hyperplane arrangement. (We use "cell" to refer to a d-dimensional face of H; when considering a lower-dimensional face of H we will refer to the dimension explicitly.) Every point w in the interior Int C of C lies on the same side of every hyperplane h_i as every other point in Int C. Thus we can identify each C with a signature s = sign(Xw − y) for any w ∈ Int C.

Theorem 5, our main result of this section, fully characterizes the geometry of Equation 12.

Theorem 5. L_ε is always a convex function, whose optimal solution(s) always lie in rowspace(X). There are four possible cases, three of which depend on the value of ε.

1. If Xw = y is an inconsistent system, then L_ε is a strictly convex function.

2. If Xw = y is a consistent system and ε ∈ (0, 1/‖X†y‖_2) then L_ε is a convex function. Moreover, L_ε is strictly convex everywhere except along two line segments, both of which have one endpoint at the origin and terminate at X†y ± u respectively, for some u ∈ nullspace(X). The gradient at every point on these line segments is nonzero, and so the optimal solution is found in rowspace(X) at a point of strict convexity.

Figure 1: The top leftmost figure shows a hyperplane arrangement with two lines that subdivides R^2 into four convex cells. The top center left figure shows the isocontours of ‖Xw − y‖_1. Within each convex cell, the isocontours behave as the linear function s^⊤(Xw − y), where s is the signature of the cell, and are non-smooth along the two black lines. The top center right figure shows the isocontours of ε‖w‖_2 ‖Xw − y‖_1, which are clearly non-convex. The top rightmost figure shows the isocontours of the function L_ε = (1/2)‖Xw − y‖_2^2 + ε‖w‖_2 ‖Xw − y‖_1 + (ε^2 n/2)‖w‖_2^2. Notice how these isocontours are non-smooth along the two black lines, and the asymmetry of the isocontours caused by the ε‖w‖_2 ‖Xw − y‖_1 term. The bottom row shows the graphs of the functions in the top row.
3. If Xw = y is a consistent system and ε = 1/‖X†y‖_2, then L_ε is a convex function. Moreover, L_ε is strictly convex everywhere except along a single line segment with one endpoint at the origin and the other endpoint at X†y. The optimal solution(s) may or may not lie along this line.

4. If Xw = y is a consistent system and ε > 1/‖X†y‖_2, then L_ε is a strictly convex function.

Furthermore, L_ε is subdifferentiable everywhere.

The proof of Theorem 5 first characterizes the geometry of L_ε restricted to the interior of each convex cell C ∈ H. We denote this function by L_ε|_{Int C}. If the signature s of C is not equal to −y, meaning that C does not contain the origin, then L_ε|_{Int C} is always strongly convex. The cell C with signature s = −y is the only cell in which L_ε might not be strongly convex. The cases in Theorem 5 correspond to the cases that characterize the geometry of L_ε|_{Int C} for s = −y. The transitions between cells are strictly convex and the subdifferential is non-empty at these transitions.

We find it very interesting that L_ε is strictly or strongly convex almost everywhere. For ε ∈ (0, 1/‖X†y‖_2), L_ε is convex, but not strictly convex, only along two line segments which lie outside of the rowspace of X. The gradient along these line segments is nonzero, and so this particular type of convexity does not prevent an optimization algorithm from finding the unique solution in the rowspace of X. For ε = 1/‖X†y‖_2, L_ε is convex, but not strictly convex, only along a single line segment in the rowspace of X. However the condition ε ≠ 1/‖X†y‖_2 can be ensured by an infinitesimal perturbation.
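Consistent with this picture, a plain subgradient method has no difficulty minimizing L_ε in practice. The sketch below is ours and reuses robust_loss_subgrad from the previous listing; the diminishing step size and iteration count are illustrative, not tuned.

```python
import numpy as np

def minimize_robust_loss(X, y, eps, steps=20000, lr=1e-3, seed=0):
    """Plain subgradient descent on L_eps; a sketch, not a tuned solver."""
    rng = np.random.default_rng(seed)
    w = 1e-3 * rng.standard_normal(X.shape[1])
    for t in range(steps):
        w -= lr / np.sqrt(t + 1) * robust_loss_subgrad(w, X, y, eps)
    return w

# Example usage (X, y from any linear regression problem):
#   w_eps = minimize_robust_loss(X, y, eps=0.5)
# By Theorem 5 the objective is convex, so the iterates approach a global minimizer,
# which is unique whenever eps != 1/np.linalg.norm(np.linalg.pinv(X) @ y).
```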
The following remark is immediate.

Corollary 6.

Suppose ε ≠ 1/‖X†y‖_2. Then any optimization algorithm which is guaranteed to converge to the global minimum for a strictly convex subdifferentiable function, and which does not prematurely terminate at a point with nonzero gradient, finds the unique global minimum of L_ε.

(The condition on "premature termination" in Corollary 6 is meant to rule out the following case. One could construct an optimization algorithm that is guaranteed to converge for strictly convex functions, but terminates early upon detecting a point at which there exists a direction in which the function is convex but not strictly convex, even if the gradient is nonzero and the global minimum is at a point of strict convexity. We doubt any commonly used optimization algorithm would have difficulty with the geometry of L_ε.)

Corollary 6 states that, in the linear case, any reasonable optimization algorithm finds the unique global optimum of L_ε, almost always. We conclude that, in the linear case, L_2-adversarial training does indeed sufficiently regularize the loss landscape of L_ε. In Lemma 11 we characterize the subdifferential ∂L_ε(w) at every point. However solving for w where 0 ∈ ∂L_ε(w) is similar to solving a linear program, and so we suspect that no closed-form solution exists. In Section 4.3 we discuss the solution for L_ε in the particular case of the learning problem defined in Section 3 and show that the max-margin solution is often not the solution recovered by adversarial training.

While we know of no technique to characterize the set of solutions for L_ε in general, we can still make some statements about the solution in specific instances, such as the learning problem described in Section 3. First, since the solution(s) of L_ε must lie in the rowspace, the "obvious" solution w* to the learning problem in Section 3 is not recovered by L_2-adversarial training. A priori, one might guess that the minimum L_2-norm solution X^⊤α is the solution to L_ε. However this is only true under specific conditions which depend on the class imbalance.

Theorem 7.
Let (X, y) be the learning problem defined in Section 3. X^⊤α is a solution to L_ε if and only if

    ε ≤ √(64n_+^2 n_−^2 + 160n_+^2 n_− + 75n_+^2 + 32n_+n_−^2 + 70n_+n_− + 3n_−^2 + 25n_+ + 5n_−) / max( 4n_−^2 + 4n_+n_− + 5n_+ + 5n_−, 4n_+^2 + 4n_+n_− + n_+ + n_− ).    (13)

Let c > 0 be a constant such that n_+ = cn_−. If ε ≤ min{ 2c/(1 + c), 2/(1 + c) } then X^⊤α is a solution to L_ε.

While we need the first condition, which is both necessary and sufficient, to draw our forthcoming conclusion, the second, merely sufficient, condition provides greater intuition. The first condition states that X^⊤α is a solution if and only if ε is sufficiently small, as a function of n_+, n_−. For the learning problem in Section 3, we know that ε = 1 is achievable, so it is natural to ask how large an ε is allowable by Equation 13. This relationship depends on the class imbalance, and so we set n_+ = cn_− and derive the condition in the second part of the proof of Theorem 7, which is a lower bound on the right-hand side of Equation 13. The term min{2c/(1 + c), 2/(1 + c)} is at most 1, when c = 1, but can be arbitrarily less than 1 depending on c; see Figure 2. We note also that the gap between the lower bound min{2c/(1 + c), 2/(1 + c)} and the right-hand side of Equation 13 is already small for moderately sized n_− and vanishes as n_− → ∞. Thus we conclude that if ε is close to 1 and the dataset is even moderately imbalanced, X^⊤α, which maximizes the L_2-margin, is not a solution for L_ε.

Figure 2: The bound min{2c/(1 + c), 2/(1 + c)} as a function of the class imbalance c. If the classes are perfectly balanced (c = 1), then we can take ε up to 1 and recover the minimum L_2-norm solution X^⊤α. As the class imbalance increases the maximum allowable ε for which we recover X^⊤α decreases rapidly.

In this section we experimentally explore the effect of different optimization algorithms on robustness for deep networks. We are interested in the following questions. (1) Do different optimization algorithms give qualitatively different robustness results? (2) Does adversarial training reduce or eliminate the influence of the optimization algorithm? (3) Do adaptive or non-adaptive methods consistently outperform the other for both L_2- and L_∞-adversarial attacks, even if the difference is small?
Models

For MNIST our model consists of two convolutional layers with 32 and 64 filters respectively, each followed by max pooling. After the two convolutional layers, there are two fully connected layers each with 1024 hidden units. For CIFAR10 we use a ResNet18 model (He et al. (2016)). We use the same model architectures for both natural and adversarial training. These models were chosen because they are small enough for us to run a large hyperparameter search.
Parameters for Adversarial Training

For adversarial training we use the approach of Madry et al. (2018), and train against a projected gradient descent (PGD) adversary. For MNIST we train against a multi-step PGD adversary with a fixed step size and maximum perturbation size ε for L_∞-adversarial training, and likewise, with a larger perturbation budget, for L_2-adversarial training. For CIFAR10 we similarly train against multi-step PGD adversaries for both L_∞- and L_2-adversarial training, with smaller step sizes and perturbation budgets appropriate to the dataset.
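For reference, the inner maximization used during L_∞-adversarial training can be written as a short PGD loop in PyTorch. The sketch below is a generic implementation in the style of Madry et al. (2018), not the exact code used for our experiments, and the hyperparameter values in the usage comment are illustrative only.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, step_size, n_steps):
    """Projected gradient descent inside an L-infinity ball of radius eps.

    Clipping x + delta to the valid input range is omitted for brevity.
    """
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # ascent step on the loss
            delta.clamp_(-eps, eps)            # project back onto the L-infinity ball
    return (x + delta).detach()

# Inside the training loop the clean batch is replaced by its adversarial version:
#   x_adv = pgd_linf(model, x, y, eps=0.3, step_size=0.01, n_steps=40)  # illustrative values
#   loss = F.cross_entropy(model(x_adv), y)
```

The same loop with a gradient-normalization step and projection onto an L_2-ball gives the L_2 adversary.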
Attacks for Evaluation

On MNIST we apply multi-step PGD with random restarts, using several step sizes for the L_∞ attacks and several step sizes for the L_2 attacks. On CIFAR10 we likewise apply multi-step PGD with random restarts over a range of step sizes for each norm. We also apply the gradient-free Boundary Attack++ (Chen et al. (2019)). We evaluate these attacks per sample, meaning that if any attack successfully constructs an adversarial example for a sample x at a specific ε, it reduces the robust accuracy of the model at that ε.
Metrics

We plot the robust classification accuracy for each attack as a function of ε ∈ [0, ε_max]. We are interested in both natural and adversarial training. Usually when heuristic methods for adversarial training are evaluated, they are compared at the specific ε for which the model was adversarially trained. Such a comparison is arbitrary for naturally trained models and is also unsatisfying for adversarially trained models. To compare the robustness of different optimization algorithms we instead consider the area under the robustness curve. Following Khoury and Hadfield-Menell (2019), we report the normalized area under the curve (NAUC), defined as

    NAUC(acc) = (1/ε_max) ∫_0^{ε_max} acc(ε) dε,    (14)

where acc : [0, ε_max] → [0, 1] measures the classification accuracy. Note that NAUC ∈ [0, 1], with higher values corresponding to more robust models.
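NAUC is computed from the sampled robustness curve with the trapezoidal rule; a small sketch of ours (the grid and accuracies below are placeholders) is:

```python
import numpy as np

def nauc(eps_grid, accuracies):
    """Normalized area under the robust-accuracy curve (Equation 14)."""
    eps_grid, accuracies = np.asarray(eps_grid), np.asarray(accuracies)
    return np.trapz(accuracies, eps_grid) / eps_grid[-1]

# Example: a model whose robust accuracy decays linearly from 0.99 to 0.20.
eps_grid = np.linspace(0.0, 0.5, 11)
acc = np.linspace(0.99, 0.20, 11)
print(nauc(eps_grid, acc))   # ~0.595
```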
Hyperparameter Selection

We perform an extensive hyperparameter search over the learning rate (and, if applicable, momentum) parameters of each optimization algorithm to identify parameter settings which produce the most robust models. For each dataset we set aside a validation set of size 5000 from the training set. We then train models, with the architectures described above, for each of the hyperparameter settings described below for 100 epochs. All optimization algorithms are started from the same initialization. We evaluate the robustness of each model as described above on the validation set. The hyperparameter settings for each optimization algorithm which achieve the largest NAUC are then used to train new models, which are then evaluated on the full test set. These final results are the ones we report in this section. The hyperparameters we explored were influenced by the hyperparameter search of Wilson et al. (2017).

For MNIST, we search over a grid of learning rates for SGD; for SGD with momentum we consider the same set of learning rates for each of several momentum settings; and for Adam, Adagrad, and RMSprop we consider grids of initial learning rates.

For CIFAR10, we similarly search over a grid of learning rates for SGD and for SGD with momentum (with several momentum settings), and for both SGD and SGD with momentum we reduce the learning rate using the ReduceLROnPlateau scheduler in PyTorch.
For Adam, RMSprop, and Adagrad on CIFAR10 we consider grids of initial learning rates.

Unfortunately adversarial training takes an order of magnitude longer than natural training, since in the innermost loop we must perform an iterative PGD attack to construct adversarial examples. Due to our limited resources, we consider only a subset of the hyperparameters above for adversarial training.

Figure 3 (Top) shows the robustness of naturally trained models to L_∞- and L_2-adversarial attacks on MNIST. Against L_∞-adversarial attacks, RMSprop produces the most robust model, followed by Adam, SGD with momentum, and then SGD and Adagrad. Against L_2-adversarial attacks, SGD produces the most robust model, followed by SGD with momentum, Adam, and then Adagrad and RMSprop.

Against L_∞-adversarial attacks, RMSprop produces a model that appears qualitatively more robust than the next best performing model. This difference can be large at specific ε; for example, at larger perturbation sizes the RMSprop model maintains a noticeably higher robust accuracy than the Adam model. This is the only instance across all of our experiments in which we observe a notable qualitative difference between different algorithms.

Figure 3 (Bottom) shows the robustness of adversarially trained models. Training adversarially improves the robustness over naturally trained models regardless of the optimization algorithm, and all optimization algorithms give qualitatively similar results. For L_∞-adversarial training, SGD produces the most robust model with NAUC 0.66, followed by SGD with momentum (0.65), Adam (0.64), RMSprop (0.63), and Adagrad (0.61). For L_2-adversarial training SGD with momentum and RMSprop produce models with NAUC 0.56, followed by SGD (0.54), Adam (0.53), and Adagrad (0.51). For adversarial training on MNIST, either SGD or SGD with momentum were among the top performers, with Adagrad always producing the worst performing model.

Adversarial training does seem to reduce the dependence on the choice of optimization algorithm, though it does not completely remove it. Against L_∞-adversarial attacks at larger ε, the SGD model maintains a higher robust accuracy than the Adagrad model. We consider this difference noteworthy for MNIST. The difference is less pronounced for L_2-adversarial training.

SGD and SGD with momentum consistently outperform other optimization algorithms, yielding the best models in three out of four experiments (or one of the most robust models in the case of ties). While the difference is qualitatively small, we believe that the consistency with which SGD or SGD with momentum produces the most robust model is noteworthy. Furthermore Adagrad seems to consistently under-perform other optimization algorithms. (Though this may depend on the domain (Tifrea et al. (2019)).)

Figure 4 (Top) shows the robustness of naturally trained models to L_∞- and L_2-adversarial attacks on CIFAR10. Against L_∞-adversarial attacks, SGD with momentum produces the most robust model, followed by RMSprop, Adam and Adagrad, and then SGD. Against L_2-adversarial attacks Adagrad produces the most robust model, followed by SGD, Adam, and RMSprop, and then SGD with momentum.

Figure 4 (Bottom) shows the robustness of adversarially trained models.
For L_∞-adversarial training, SGD with momentum produces the most robust model, followed by SGD, Adam, and then Adagrad and RMSprop. For L_2-adversarial training, SGD produces the most robust model, followed by SGD with momentum, Adam and RMSprop, and then Adagrad.

Adversarial training lessens the dependence on the choice of optimization algorithm. Against L_∞-adversarial attacks at larger ε, the SGD with momentum model maintains a higher robust accuracy than the RMSprop model. The difference is less pronounced for L_2-adversarial training.

SGD or SGD with momentum consistently outperform other optimization algorithms, yielding the best models in three out of four experiments. Again, while the difference is qualitatively small, we believe that the consistency with which SGD or SGD with momentum produces the most robust model is noteworthy.
Figure 3: Top: Robust accuracy for naturally trained MNIST models against L_∞- and L_2-adversarial attacks. Bottom: Robust accuracy for adversarially trained MNIST models. Left, L_∞-adversarially trained models evaluated against L_∞-adversarial attacks; right, L_2.

We have presented evidence that adaptive algorithms may produce less robust solutions by giving undue influence to irrelevant features which can be exploited by an adversary. While the geometry of the loss landscape of adversarial training in the linear regression setting is surprisingly complicated, we are still able to demonstrate the uniqueness of the optimum. In the future, it would be valuable to prove similar results for deep networks.
The Geometry of L_∞-Adversarial Training

Many of the statements that we proved about L_2-adversarial training are not true for L_∞-adversarial training. While we believe it is possible to show that L_∞-adversarial training gives a convex optimization problem using an argument nearly identical to ours, the geometry of the L_∞-objective is much less strongly convex than the L_2-objective. It is unlikely that the optimal solution is unique or lies in the rowspace, for similar reasons as in Lasso regression. It would be interesting to see what, if any, statements can be made about the solutions to the L_∞-adversarial training objective in the least-squares linear regression case.

Deep Networks
An obvious future direction is to extend our theoretical analysis to deep networks. Even the case of deep linear networks is interesting and will likely require new techniques. Consider a deep linear function f : x ↦ W_l^⊤ W_{l−1}^⊤ · · · W_1^⊤ x with l trainable layers. In this case, we can solve the inner maximization problem of Equation 9 for adversarial training, which gives the loss function

    (1/2) ‖X W_1 W_2 · · · W_l − y‖_2^2 + ε ‖W_1 W_2 · · · W_l‖_* ‖X W_1 W_2 · · · W_l − y‖_1 + (ε^2 n / 2) ‖W_1 W_2 · · · W_l‖_*^2.

We know of no technique for analyzing the geometry of this loss function. Even when ∆ is an L_2-ball and there are only l = 2 trainable layers, the geometry of ‖W_1 W_2‖_2 is highly non-convex.
Figure 4: Top: Robust accuracy for naturally trained CIFAR10 models against L_∞- and L_2-adversarial attacks. Bottom: Robust accuracy for adversarially trained CIFAR10 models. Left, L_∞-adversarially trained models evaluated against L_∞-adversarial attacks; right, L_2.

We believe that the primary bottleneck for understanding the generalization and robustness properties of solutions found by optimization algorithms for deep networks is an adequate set of tools for reasoning about the geometry of high-dimensional non-convex loss landscapes. Future work should attempt to fully characterize the effect of depth on the curvature of the loss landscape using tools from differential geometry.

Acknowledgements
The author would like to thank Jonathan Shewchuk for providing comments on earlier drafts of this manuscript and Dylan Hadfield-Menell for providing the compute resources necessary for the experiments. This work was partially funded by NSF award CCF-1909235.
References
Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
S. Bubeck, Y. T. Lee, E. Price, and I. P. Razenshteyn. Adversarial examples from computational constraints. In ICML, 2019.
J. Chen, M. I. Jordan, and M. J. Wainwright. HopSkipJumpAttack: Query-efficient decision-based adversarial attack. CoRR, abs/1904.02144, 2019.
F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun. End-to-end driving via conditional imitation learning. In ICRA, 2018.
A. Degwekar, P. Nakkiran, and V. Vaikuntanathan. Computational limitations in robust classification and win-win results. In COLT, 2019.
S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019.
J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017.
J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. J. Goodfellow. Adversarial spheres. CoRR, abs/1801.02774, 2018. URL http://arxiv.org/abs/1801.02774.
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2014.
S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In NeurIPS, 2018a.
S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
R. Impagliazzo. A personal view of average-case complexity. In Proceedings of the Tenth Annual Structure in Complexity Theory Conference, 1995.
A. Karpathy. A peek at trends in machine learning. https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106, 2017.
M. Khoury and D. Hadfield-Menell. On the geometry of adversarial examples. CoRR, abs/1811.00525, 2018.
M. Khoury and D. Hadfield-Menell. Adversarial training with Voronoi constraints. CoRR, abs/1905.01019, 2019.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
M. Mirman, T. Gehr, and M. T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.
K. Nar, O. Ocal, S. S. Sastry, and K. Ramchandran. Cross-entropy loss and low-rank features have responsibility for adversarial examples. CoRR, abs/1901.08360, 2019.
A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In ICLR, 2018.
L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In NIPS, 2018.
A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In ICLR, 2019.
A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.
T. Tieleman and G. Hinton. RMSprop: Divide the gradient by a running average of its recent magnitude, 2012.
A. Tifrea, G. Bécigneul, and O. Ganea. Poincaré GloVe: Hyperbolic word embeddings. In ICLR, 2019.
A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017.
E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018.
H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.
A Proofs
A.1 Proof of Theorem 1
Proof.
Let δ be an adversarial perturbation and let x_test be a test sample. Then

    ⟨w_ada, x_test + δ⟩ = ⟨w_ada, x_test⟩ + ⟨w_ada, δ⟩
                        = τ(y_test + 2) + ⟨w_ada, δ⟩
                        = τ(y_test + 2) + τ( δ_1 + δ_2 + δ_3 + ∑_{i∈P} δ_i − ∑_{j∈N} δ_j )
                        = 3τ + τ( δ_1 + δ_2 + δ_3 + ∑_{i∈P} δ_i − ∑_{j∈N} δ_j ).

The last equality follows from the fact that x_test is correctly classified by w_ada, and so y_test = 1. Notice that to flip the sign of the classifier using the smallest L_∞-perturbation, it is optimal to distribute the magnitude of the perturbation equally to each δ_i, where the signs of each δ_i are −1 for i ∈ {1, 2, 3} ∪ P and +1 for i ∈ N. Writing δ for this common magnitude, ⟨w_ada, x_test + δ⟩ = τ(y_test + 2) − τδ(3 + n_+ + 5n_−), and it follows that to flip the sign of the classifier requires

    δ(3 + n_+ + 5n_−) > 3,    i.e.,    δ > 3 / (3 + n_+ + 5n_−).

To find the smallest L_2-perturbation we must instead solve the following constrained optimization problem

    min_δ ∑_i δ_i^2    s.t.    δ_1 + δ_2 + δ_3 + ∑_{i∈P} δ_i − 5 ∑_{j∈N} δ_j ≤ −3,    (15)

where R^2 = ∑_i δ_i^2 is the squared radius of the smallest L_2-ball that crosses the decision boundary. The Lagrangian for this problem is

    L(δ, λ) = ∑_i δ_i^2 + λ( δ_1 + δ_2 + δ_3 + ∑_{i∈P} δ_i − 5 ∑_{j∈N} δ_j + 3 ).

The partial derivatives are given by

    ∂L/∂δ_i = 2δ_i + λ     for i = 1, 2, 3 or i ∈ P,
    ∂L/∂δ_i = 2δ_i − 5λ    for i ∈ N,
    ∂L/∂λ = δ_1 + δ_2 + δ_3 + ∑_{i∈P} δ_i − 5 ∑_{j∈N} δ_j + 3.

Setting the first set of partial derivatives to 0 gives

    δ_i = −λ/2    for i = 1, 2, 3 or i ∈ P,
    δ_i = 5λ/2    for i ∈ N,    (16)

which can then be used to solve the last equation, yielding λ = 6 / (25n_− + n_+ + 3). Substituting the expression for λ back into Equation 16 gives

    δ_i = −3 / (25n_− + n_+ + 3)    for i = 1, 2, 3 or i ∈ P,
    δ_i = 15 / (25n_− + n_+ + 3)    for i ∈ N.    (17)

Then the minimum L_2-perturbation R is given by

    R^2 = ∑_i δ_i^2 = (3 + n_+) ( 3 / (25n_− + n_+ + 3) )^2 + 5n_− ( 15 / (25n_− + n_+ + 3) )^2 = ( 9(n_+ + 3) + 1125n_− ) / (25n_− + n_+ + 3)^2,

    R = √(9n_+ + 1125n_− + 27) / (25n_− + n_+ + 3).

A.2 Proof of Theorem 3
It is worth taking a moment to understand ⟨w_SGD, x_test⟩ when y_test = 1 and when y_test = −1. In particular, it will be important in our proofs to understand the signs of each term.

First, we have α_+ > 0 and α_− < 0 by definition. When y_test = 1 we have

    ⟨w_SGD, x_test⟩ = (n_+α_+ − n_−α_−) + 2(n_+α_+ + n_−α_−)
                    = (5n_+ + n_− + 8n_+n_−)/(15n_+ + 3n_− + 8n_+n_− + 5) + 2(5n_+ − n_−)/(15n_+ + 3n_− + 8n_+n_− + 5)
                    = (15n_+ + 8n_+n_− − n_−)/(15n_+ + 3n_− + 8n_+n_− + 5).

The denominator is clearly positive, so w_SGD correctly classifies x_test so long as 15n_+ + 8n_+n_− − n_− > 0, which is true for any n_+, n_− ≥ 1.

Now when y_test = −1 we have

    ⟨w_SGD, x_test⟩ = −(n_+α_+ − n_−α_−) + 2(n_+α_+ + n_−α_−)
                    = −(5n_+ + n_− + 8n_+n_−)/(15n_+ + 3n_− + 8n_+n_− + 5) + 2(5n_+ − n_−)/(15n_+ + 3n_− + 8n_+n_− + 5)
                    = (5n_+ − 8n_+n_− − 3n_−)/(15n_+ + 3n_− + 8n_+n_− + 5).

In this case, w_SGD correctly classifies x_test so long as 5n_+ − 8n_+n_− − 3n_− < 0, which is true for any n_+, n_− ≥ 1. Thus w_SGD correctly classifies every test example so long as there is at least one training example from each class.

We will also be interested in the signs of the individual terms in ⟨w_SGD, x_test⟩. Note that 5n_+ + n_− + 8n_+n_− > 0 for any n_+, n_− ≥ 1, and so (n_+α_+ − n_−α_−) is positive. Lastly 5n_+ − n_− > 0 so long as n_+ > n_−/5, and so (n_+α_+ + n_−α_−) > 0 if and only if n_+ > n_−/5. For convenience we will assume that n_+ > n_−/5 from here onward, which will allow us to consider fewer cases.

Proof. Let δ be an adversarial perturbation and let x_test be a test sample. Then ⟨w_SGD, x_test + δ⟩ = ⟨w_SGD, x_test⟩ + ⟨w_SGD, δ⟩, where

    ⟨w_SGD, x_test⟩ = y_test (n_+α_+ − n_−α_−) + 2(n_+α_+ + n_−α_−)

and

    ⟨w_SGD, δ⟩ = (n_+α_+ − n_−α_−) δ_1 + (n_+α_+ + n_−α_−)(δ_2 + δ_3) + α_+ ∑_{i∈P} δ_i + α_− ∑_{j∈N} (δ_{j,1} + . . . + δ_{j,5}).

There are two cases to consider, corresponding to y_test = ±1.

Suppose that y_test = 1. To flip the sign we need ⟨w_SGD, δ⟩ < −⟨w_SGD, x_test⟩. For brevity's sake, we will define
C ≡ ( n + α + − n − α − ) δ + ( n + α + + n − α − )( δ + δ ) + α + (cid:88) i ∈P δ i + α − (cid:88) j ∈N ( δ j, + . . . + δ j, ) + (cid:104) w SGD , x test (cid:105) . The constraint C < is equivalent to (cid:104) w SGD , δ (cid:105) < −(cid:104) w SGD , x test (cid:105) . Clearly we can ensure the sign of (cid:104) w SGD , δ (cid:105) is negative by choosing each δ i opposite in sign to the term by which it is multiplied in C . Note that, by our assump-tions on n + , n − , ( n + α + − n − α − ) , ( n + α + + n − α − ) , α + > and α − < . Thus we choose sign( δ j, ..., ) = 1 for all j ∈ N and sign( δ i ) = − otherwise. Furthermore the optimal solution sets each δ i to the same magnitude,and so to change the sign the perturbation δ must be at least δ > (cid:104) w SGD , x test (cid:105) ( n + α + − n − α − ) + 2( n + α + + n − α − ) + n + α + − n − α − = (cid:104) w SGD , x test (cid:105) n + α + − n − α − )= 3 n + α + + n − α − n + α + − n − α − )= 15 n + + 8 n + n − − n − n + + 32 n + n − + 4 n − Now suppose that y test = − . In this case we need (cid:104) w SGD , δ (cid:105) > −(cid:104) w SGD , x test (cid:105) , (equivalently C > ). Notethat in this case (cid:104) w SGD , x test (cid:105) is negative, and so we choose the signs of each δ i to match the signs of the terms bywhich δ i is multiplied. We choose sign( δ j, ..., ) = − and sign( δ i ) = 1 otherwise. Thus to change the sign theperturbation δ must be at least δ > −(cid:104) w SGD , x test (cid:105) ( n + α + − n − α − ) + 2( n + α + + n − α − ) + n + α + − n − α − = −(cid:104) w SGD , x test (cid:105) n + α + − n − α − )= − n + α + − n − α − n + α + − n − α − )= − n + + 8 n + n − + 3 n − n + + 32 n + n − + 4 n − To find the smallest L -perturbation, in the case where y test = 1 , we must instead solve the following con-strained optimization problem min δ (cid:88) i δ i s.t. C ≤ (18)where R ≡ (cid:80) i δ i is the squared-radius of the smallest L -ball that crosses the decision boundary. The La-grangian for this problem is L ( δ, λ ) = (cid:88) i δ i + λ C ∂ L ∂δ i = δ i + λ ( n + α + − n − α − ) i = 12 δ i + λ ( n + α + + n − α − ) i = 2 , δ i + λα + i ∈ P δ i,j + λα − i ∈ N , j ∈ [5] ∂ L ∂λ = C . Setting the first set of partial derivatives to gives δ i = − λ ( n + α + − n − α − ) i = 1 − λ ( n + α + + n − α − ) i = 2 , − λ α + i ∈ P− λ α − i ∈ N , j ∈ [5] , (19)which can then be used to solve the last equation yielding λ = (cid:104) w SGD , x test (cid:105) ( n + α + − n − α − ) + ( n + α + + n − α − ) + n + α + n − α − . Substituting the expression for λ back into Equation 19 and solving for R gives R = (cid:88) i δ i = λ (cid:0) ( n + α + − n − α − ) + 2( n + α + + n − α − ) + n + α + 5 n − α − (cid:1) = λ (cid:18)
To find the smallest $L_2$-perturbation, in the case where $y_{test} = 1$, we must instead solve the constrained optimization problem
\[
\min_\delta \sum_i \delta_i^2 \quad \text{s.t.} \quad C \leq 0, \tag{18}
\]
where $R^2 \equiv \sum_i \delta_i^2$ is the squared radius of the smallest $L_2$-ball that crosses the decision boundary. The Lagrangian for this problem is $\mathcal{L}(\delta, \lambda) = \sum_i \delta_i^2 + \lambda C$, with partial derivatives
\[
\frac{\partial\mathcal{L}}{\partial\delta_i} = \begin{cases} 2\delta_i + \lambda(n_+\alpha_+ - n_-\alpha_-) & i = 1 \\ 2\delta_i + \lambda(n_+\alpha_+ + n_-\alpha_-) & i = 2, 3 \\ 2\delta_i + \lambda\alpha_+ & i \in P \\ 2\delta_{i,j} + \lambda\alpha_- & i \in N,\ j \in [5] \end{cases}
\qquad\qquad
\frac{\partial\mathcal{L}}{\partial\lambda} = C.
\]
Setting the first set of partial derivatives to $0$ gives
\[
\delta_i = \begin{cases} -\tfrac{\lambda}{2}(n_+\alpha_+ - n_-\alpha_-) & i = 1 \\ -\tfrac{\lambda}{2}(n_+\alpha_+ + n_-\alpha_-) & i = 2, 3 \\ -\tfrac{\lambda}{2}\alpha_+ & i \in P \\ -\tfrac{\lambda}{2}\alpha_- & i \in N,\ j \in [5], \end{cases} \tag{19}
\]
which can then be used to solve the last equation, yielding
\[
\lambda = \frac{2\langle w_{SGD}, x_{test}\rangle}{(n_+\alpha_+ - n_-\alpha_-)^2 + 2(n_+\alpha_+ + n_-\alpha_-)^2 + n_+\alpha_+^2 + 5n_-\alpha_-^2}.
\]
Substituting the expression for $\lambda$ back into Equation 19 and solving for $R^2$ gives
\[
R^2 = \sum_i \delta_i^2 = \frac{\lambda^2}{4}\left((n_+\alpha_+ - n_-\alpha_-)^2 + 2(n_+\alpha_+ + n_-\alpha_-)^2 + n_+\alpha_+^2 + 5n_-\alpha_-^2\right) = \frac{\langle w_{SGD}, x_{test}\rangle^2}{(n_+\alpha_+ - n_-\alpha_-)^2 + 2(n_+\alpha_+ + n_-\alpha_-)^2 + n_+\alpha_+^2 + 5n_-\alpha_-^2},
\]
and so
\[
R = \frac{\langle w_{SGD}, x_{test}\rangle}{\sqrt{(n_+\alpha_+ - n_-\alpha_-)^2 + 2(n_+\alpha_+ + n_-\alpha_-)^2 + n_+\alpha_+^2 + 5n_-\alpha_-^2}} = \frac{15n_+ + 8n_+n_- - n_-}{\sqrt{64n_+^2n_-^2 + 160n_+^2n_- + 32n_+n_-^2 + 75n_+^2 + 70n_+n_- + 25n_+ + 3n_-^2 + 5n_-}}.
\]
The case with $y_{test} = -1$ is similar, but with the constraint $-C \leq 0$, which yields a similar solution for $\lambda$, except that the numerator is $-\langle w_{SGD}, x_{test}\rangle > 0$. Subsequently
\[
R = \frac{8n_+n_- + 3n_- - 5n_+}{\sqrt{64n_+^2n_-^2 + 160n_+^2n_- + 32n_+n_-^2 + 75n_+^2 + 70n_+n_- + 25n_+ + 3n_-^2 + 5n_-}}.
\]
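An analogous check for the $L_2$ radius (ours, same assumed coordinate layout as above): since the classifier is linear, $R$ must coincide with the distance $\langle w_{SGD}, x_{test}\rangle/\|w_{SGD}\|_2$ from $x_{test}$ to the decision boundary, which the sketch below confirms against the closed form for $y_{test} = 1$.

\begin{verbatim}
import numpy as np

n_pos, n_neg = 3, 2
D = 15*n_pos + 3*n_neg + 8*n_pos*n_neg + 5
a_pos, a_neg = (4*n_neg + 5) / D, -(4*n_pos + 1) / D
# Coordinates of w_SGD: label feature, two shared features, then the
# unique features of the positive and negative training points.
w = np.concatenate([[n_pos*a_pos - n_neg*a_neg],
                    [n_pos*a_pos + n_neg*a_neg] * 2,
                    [a_pos] * n_pos,
                    [a_neg] * (5 * n_neg)])
x_test = np.zeros_like(w)
x_test[:3] = [1, 1, 1]                         # y_test = +1
poly = (64*n_pos**2*n_neg**2 + 160*n_pos**2*n_neg + 32*n_pos*n_neg**2
        + 75*n_pos**2 + 70*n_pos*n_neg + 25*n_pos + 3*n_neg**2 + 5*n_neg)
R = (15*n_pos + 8*n_pos*n_neg - n_neg) / np.sqrt(poly)   # closed form above
# The optimal L_2 perturbation moves straight against w; at length R it
# lands exactly on the decision boundary.
delta = -R * w / np.linalg.norm(w)
assert np.isclose(w @ (x_test + delta), 0)
assert np.isclose(R, (w @ x_test) / np.linalg.norm(w))
print("L_2 radius verified:", R)
\end{verbatim}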
A.3 Proof of Theorem 5

The proof of Theorem 5 is the combination of the following lemmas.
Lemma 8.
Let $w \in \mathbb{R}^d$ be any vector and let $w_\parallel$ be the orthogonal projection of $w$ onto $\operatorname{rowspace}(X)$. Then, for the objective function
\[
\mathcal{L}(X, y; w) = \frac{1}{2}\|Xw - y\|_2^2 + \epsilon\|w\|_2\|Xw - y\|_1 + \frac{\epsilon^2 n}{2}\|w\|_2^2,
\]
we have that $\mathcal{L}(X, y; w) \geq \mathcal{L}(X, y; w_\parallel)$, with equality if and only if $w = w_\parallel$. Hence any optimal solution $w^*$ of $\mathcal{L}$ satisfies $w^* \in \operatorname{rowspace}(X)$.

Proof. Let $w = w_\parallel + w_\perp$ be any vector in $\mathbb{R}^d$, where $w_\parallel \in \operatorname{rowspace}(X)$ and $w_\perp \in \operatorname{nullspace}(X)$. Then
\begin{align*}
\mathcal{L}(X, y; w) &= \frac{1}{2}\|Xw - y\|_2^2 + \epsilon\|w\|_2\|Xw - y\|_1 + \frac{\epsilon^2 n}{2}\|w\|_2^2 \\
&= \frac{1}{2}\|X(w_\parallel + w_\perp) - y\|_2^2 + \epsilon\|w_\parallel + w_\perp\|_2\|X(w_\parallel + w_\perp) - y\|_1 + \frac{\epsilon^2 n}{2}\|w_\parallel + w_\perp\|_2^2 \\
&= \frac{1}{2}\|Xw_\parallel - y\|_2^2 + \epsilon\|w_\parallel + w_\perp\|_2\|Xw_\parallel - y\|_1 + \frac{\epsilon^2 n}{2}\|w_\parallel + w_\perp\|_2^2 \\
&= \frac{1}{2}\|Xw_\parallel - y\|_2^2 + \epsilon\sqrt{\|w_\parallel\|_2^2 + \|w_\perp\|_2^2}\,\|Xw_\parallel - y\|_1 + \frac{\epsilon^2 n}{2}\left(\|w_\parallel\|_2^2 + \|w_\perp\|_2^2\right) \\
&\geq \frac{1}{2}\|Xw_\parallel - y\|_2^2 + \epsilon\|w_\parallel\|_2\|Xw_\parallel - y\|_1 + \frac{\epsilon^2 n}{2}\|w_\parallel\|_2^2,
\end{align*}
with equality if and only if $\|w_\perp\| = 0$. The third equality follows from the fact that $w_\perp$ is in $\operatorname{nullspace}(X)$, the fourth from the fact that $w_\parallel \perp w_\perp$. This proves the first statement. The second statement regarding $w^*$ follows immediately.
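Lemma 8 is easy to probe empirically. The sketch below (ours, not part of the proof; the helper name is ours) evaluates the objective at random points and at their orthogonal projections onto $\operatorname{rowspace}(X)$, and checks that projecting never increases the loss.

\begin{verbatim}
import numpy as np

def adv_loss(X, y, w, eps):
    # L_2-adversarial training loss for least squares:
    # 0.5 * sum_i (|<x_i, w> - y_i| + eps * ||w||)^2
    r = X @ w - y
    return 0.5 * np.sum((np.abs(r) + eps * np.linalg.norm(w)) ** 2)

rng = np.random.default_rng(0)
n, d, eps = 5, 8, 0.3
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
P = X.T @ np.linalg.pinv(X @ X.T) @ X          # projector onto rowspace(X)
for _ in range(1000):
    w = rng.normal(size=d)
    w_par = P @ w                               # component in rowspace(X)
    assert adv_loss(X, y, w_par, eps) <= adv_loss(X, y, w, eps) + 1e-12
print("projection never increases the loss")
\end{verbatim}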
Lemma 9.
Let $C \in \mathcal{H}$ be a convex cell with signature $s$. The restriction of $\mathcal{L}$ to the interior of $C$, denoted $\mathcal{L}|_{\operatorname{Int} C}$, is a convex function. Furthermore, if $s \neq -y$ then $\mathcal{L}|_{\operatorname{Int} C}$ is a strongly convex function.

Suppose that $s = -y$, meaning that $C$ contains the origin. There are four possible cases, three of which depend on the value of $\epsilon$.

1. If $Xw = y$ is an inconsistent system, then $\mathcal{L}|_{\operatorname{Int} C}$ is a strongly convex function.
2. If $Xw = y$ is a consistent system and $\epsilon \in (0, 1/\|X^\dagger y\|)$, then $\mathcal{L}|_{\operatorname{Int} C}$ is a convex function. Specifically, $\mathcal{L}|_{\operatorname{Int} C}$ is convex but not strongly convex along two line segments, both of which have one endpoint at the origin and terminate at $X^\dagger y \pm u$ for some $u \in \operatorname{nullspace}(X)$, respectively. The gradient at every point on these line segments is nonzero, and so the optimal solution is found in $\operatorname{rowspace}(X)$ at a point of strong convexity.
3. If $Xw = y$ is a consistent system and $\epsilon = 1/\|X^\dagger y\|$, then $\mathcal{L}|_{\operatorname{Int} C}$ is a convex function. Specifically, $\mathcal{L}|_{\operatorname{Int} C}$ is convex but not strongly convex along a single line segment with one endpoint at the origin and the other endpoint at $X^\dagger y$. The optimal solution may lie along this line.
4. If $Xw = y$ is a consistent system and $\epsilon > 1/\|X^\dagger y\|$, then $\mathcal{L}|_{\operatorname{Int} C}$ is a strongly convex function.

Proof. Let $C$ be any cell in the hyperplane arrangement induced by $\|Xw - y\|_1$ and let $s \in \{\pm 1\}^n$ denote the signature of $C$. We will show that the Hessian matrix within $C$ is positive semi-definite.

The Hessian matrix $H(w)$ at a point $w \in \operatorname{Int} C$ is
\[
H(w) = X^\top X + \frac{\epsilon}{\|w\|}\left(X^\top s w^\top + w s^\top X\right) + \frac{\epsilon^2 n}{\|w\|^2}ww^\top - \frac{\epsilon}{\|w\|^3}s^\top\!\left(Xw - y + \epsilon\|w\|s\right)ww^\top + \frac{\epsilon}{\|w\|}s^\top\!\left(Xw - y + \epsilon\|w\|s\right)I.
\]
This form of the Hessian comes from twice differentiating Equation 10 and is equivalent to twice differentiating Equation 11. Note that it is crucial in the third term that the sign function is always $\pm 1$ and is not defined as $0$ when the input is $0$; this is where the factor of $n$ comes from. At a high level, we examine the curvature induced by $H(w)$ in each unit direction $v \in S^{d-1}$ at $w$ and show that it is everywhere non-negative.
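Before examining each term, the expression for $H(w)$ itself can be checked numerically. The sketch below (ours, not part of the proof) compares it against central finite differences of the gradient at a generic point and confirms positive semi-definiteness there; the helper \texttt{adv\_grad} implements the gradient that appears later as Equation 20, and the helper names are ours.

\begin{verbatim}
import numpy as np

def adv_grad(X, y, w, eps):
    # Gradient of the L_2-adversarial training loss (Equation 20), valid
    # when w != 0 and no residual is exactly zero.
    r = X @ w - y
    s = np.sign(r)
    nw = np.linalg.norm(w)
    return (X.T @ r + eps*nw*(X.T @ s)
            + eps*np.abs(r).sum()*w/nw + eps**2*len(y)*w)

def adv_hess(X, y, w, eps):
    # The Hessian H(w) stated above.
    r = X @ w - y
    s = np.sign(r)
    nw = np.linalg.norm(w)
    n, d = X.shape
    c = s @ (r + eps*nw*s)              # = ||Xw - y||_1 + eps * n * ||w||
    return (X.T @ X
            + (eps/nw)*(np.outer(X.T @ s, w) + np.outer(w, X.T @ s))
            + (eps**2*n/nw**2)*np.outer(w, w)
            - (eps/nw**3)*c*np.outer(w, w)
            + (eps/nw)*c*np.eye(d))

rng = np.random.default_rng(0)
n, d, eps = 6, 4, 0.25
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)
H = adv_hess(X, y, w, eps)
h = 1e-6
fd = np.array([(adv_grad(X, y, w + h*e, eps) - adv_grad(X, y, w - h*e, eps)) / (2*h)
               for e in np.eye(d)])
assert np.allclose(H, fd, atol=1e-4)
assert np.linalg.eigvalsh(H).min() > -1e-8   # positive semi-definite in the cell
print("Hessian verified at a generic point")
\end{verbatim}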
It is worth taking a moment to examine how each term of $H(w)$ affects the curvature of the objective at $w$.

The term $X^\top X$ is a positive semi-definite matrix and induces a quadratic form with positive curvature in each eigen-direction whose corresponding eigenvalue is positive, and zero curvature in every eigen-direction corresponding to a zero eigenvalue.

The term $\frac{\epsilon}{\|w\|}\left(X^\top s w^\top + w s^\top X\right)$ is a sum of outer-product matrices. Note that this matrix is symmetric, since $(X^\top s w^\top)^\top = w s^\top X$. This matrix has a $(d-2)$-dimensional nullspace, corresponding to the directions orthogonal to both $w$ and $X^\top s$. On the $2$-dimensional subspace spanned by $\{\frac{w}{\|w\|}, X^\top s\}$, and with respect to that basis, the outer-product sum has the matrix
\[
\begin{pmatrix} \frac{w}{\|w\|}\cdot X^\top s & \frac{w}{\|w\|}\cdot\frac{w}{\|w\|} \\[2pt] X^\top s\cdot X^\top s & \frac{w}{\|w\|}\cdot X^\top s \end{pmatrix},
\]
whose eigenvalues are
\[
\frac{w}{\|w\|}\cdot(X^\top s) \pm \sqrt{\left(\frac{w}{\|w\|}\cdot\frac{w}{\|w\|}\right)\left((X^\top s)\cdot(X^\top s)\right)} = \frac{w}{\|w\|}\cdot(X^\top s) \pm \|X^\top s\|.
\]
By the Cauchy--Schwarz inequality, one of these eigenvalues is always non-negative while the other is always non-positive. Thus there is one direction of positive curvature and one direction of negative curvature, with eigenvectors proportional to $X^\top s \pm \|X^\top s\|\frac{w}{\|w\|}$.

The term $\frac{\epsilon^2 n}{\|w\|^2}ww^\top$ induces positive curvature in the direction $w$ with eigenvalue $\epsilon^2 n$ and zero curvature in every direction orthogonal to $w$.

The term $-\frac{\epsilon}{\|w\|^3}s^\top(Xw - y + \epsilon\|w\|s)ww^\top$ induces negative curvature in the direction $w$ with eigenvalue $-\frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s)$. However, the negative curvature in the direction $w$ is exactly undone by the positive curvature induced by the term $\frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s)I$, which induces positive curvature in every direction with eigenvalues all equal to $\frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s)$. The sum of these two terms is a quadratic form with zero curvature in the direction $w$ and positive curvature in every direction orthogonal to $w$, with eigenvalue $\frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s)$. Note that this value is positive by definition, since $w$ is in the convex cell with signature $s$, and so
\[
\frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s) = \frac{\epsilon}{\|w\|}\left(\|Xw - y\|_1 + \epsilon\|w\|n\right) > 0.
\]

Let $v \in S^{d-1}$ be a unit vector. The curvature in the direction $v$ is proportional (with positive constant of proportionality) to
\begin{align*}
v^\top H(w)v &= v^\top X^\top X v + \frac{\epsilon}{\|w\|}v^\top\left(X^\top s w^\top + w s^\top X\right)v + \frac{\epsilon^2 n}{\|w\|^2}v^\top ww^\top v \\
&\qquad - \frac{\epsilon}{\|w\|^3}s^\top(Xw - y + \epsilon\|w\|s)\,v^\top ww^\top v + \frac{\epsilon}{\|w\|}s^\top(Xw - y + \epsilon\|w\|s)\,v^\top v \\
&= \|Xv\|^2 + \frac{2\epsilon}{\|w\|}(w^\top v)(s^\top Xv) + \frac{\epsilon^2 n}{\|w\|^2}(w^\top v)^2 + \epsilon\left(\frac{\|Xw - y\|_1}{\|w\|} + \epsilon n\right)\left(1 - \left(\frac{w}{\|w\|}\cdot v\right)^2\right) \\
&= \underbrace{\|Xv\|^2 + \frac{2\epsilon\sqrt{n}}{\|w\|}(w^\top v)\|Xv\|\cos\varphi + \frac{\epsilon^2 n}{\|w\|^2}(w^\top v)^2}_{\text{term 1}} + \underbrace{\epsilon\left(\frac{\|Xw - y\|_1}{\|w\|} + \epsilon n\right)\left(1 - \cos^2\theta\right)}_{\text{term 2}},
\end{align*}
where $\varphi = \angle(s, Xv)$ and $\theta = \angle(w, v)$. It is easy to see that term 2 is always greater than or equal to $0$, since $\cos^2\theta \in [0, 1]$, with equality when $\cos^2\theta = 1$.
By the quadratic formula, term 1 is also always greater than or equal to $0$, with equality when $\cos\varphi = \pm 1$ and $\operatorname{sign}(\cos\varphi) \neq \operatorname{sign}(w^\top v)$; otherwise the zeros given by the quadratic formula have an imaginary component that depends on $\sin\varphi$. Thus, at this point, we see that $H(w)$ is at least positive semi-definite in $\operatorname{Int} C$.

We wish to derive under which conditions this inequality is strict, implying that $H(w)$ is positive-definite in $C$. First we will show that if $w$ is in a cell of the hyperplane arrangement whose signature is $s \neq -y$, then $H(w)$ is positive definite. The conditions which must be true for $v^\top H(w)v = 0$ imply that $s = -y$; that is, $w$ must be in the cell that contains the origin.

For $v^\top H(w)v = 0$ we need both term 1 and term 2 to be equal to $0$. Term 2 is equal to $0$ if and only if $\cos\theta = \pm 1$, which implies that $v \parallel w$. Since $v$ is a unit vector, we have $v = \pm\frac{w}{\|w\|}$. For term 1 to be equal to $0$ we need $\cos\varphi = \pm 1$ and $\operatorname{sign}(\cos\varphi) \neq \operatorname{sign}(w^\top v)$. The first of these two conditions implies that $s \parallel Xv$. Suppose that $v = \frac{w}{\|w\|}$; then $Xv = -\alpha s$ for some $\alpha > 0$. So we have that
\[
\frac{Xv}{\|Xv\|} = -\frac{s}{\|s\|} \;\Longrightarrow\; \frac{Xw}{\|Xw\|} = -\frac{s}{\|s\|} \;\Longrightarrow\; Xw = -\frac{\|Xw\|}{\|s\|}s \;\Longrightarrow\; x_i^\top w = -\frac{\|Xw\|}{\|s\|}s_i.
\]
Since $w \in C$, we have $\operatorname{sign}(x_i^\top w - y_i) = s_i$. If $s_i = 1$, then
\[
x_i^\top w - y_i > 0 \;\Longrightarrow\; x_i^\top w > y_i \;\Longrightarrow\; -\frac{\|Xw\|}{\|s\|}s_i > y_i \;\Longrightarrow\; 0 > -\frac{\|Xw\|}{\|s\|}s_i > y_i,
\]
which implies that $y_i = -1$. The case where $s_i = -1$ is similar, as is the case where $v = -\frac{w}{\|w\|}$. All together, we have that $s = -y$.

Thus the necessary (but not sufficient) conditions for $v^\top H(w)v = 0$ can only be satisfied if $s = -y$. If $s \neq -y$ then $H(w)$ is defined and positive-definite everywhere in $\operatorname{Int} C$.

We now turn our attention toward a further necessary condition which, when combined with the necessary conditions above, gives a set of sufficient conditions for $H(w)$ to be positive semi-definite but not positive-definite. Suppose that $\cos\theta = \pm 1$, $\cos\varphi = \pm 1$ and $\operatorname{sign}(\cos\varphi) \neq \operatorname{sign}(w^\top v)$. By the above discussion $s = -y$. Suppose $v = \frac{w}{\|w\|}$ and, thus, $\cos\varphi = -1$. Under these conditions we have that
\[
v^\top H(w)v = \|Xv\|^2 - \frac{2\epsilon\sqrt{n}}{\|w\|}(w^\top v)\|Xv\| + \frac{\epsilon^2 n}{\|w\|^2}(w^\top v)^2 + \epsilon\left(\frac{\|Xw - y\|_1}{\|w\|} + \epsilon n\right)\left(1 - \cos^2\theta\right) = \left(\|Xv\| - \frac{\epsilon\sqrt{n}}{\|w\|}w^\top v\right)^2 = \left(\|Xv\| - \epsilon\sqrt{n}\right)^2.
\]
Note that when any one of the conditions detailed in the previous paragraph does not hold, the first equality is instead a lower bound on $v^\top H(w)v$. From this we see that the final necessary condition for $v^\top H(w)v = 0$ is that $\|Xv\| = \epsilon\sqrt{n}$. Since $\cos\varphi = -1$, we must have $Xv = -\epsilon s = \epsilon y$, which implies $v = \epsilon X^\dagger y + u$ for some $u \in \operatorname{nullspace}(X)$.
Recall that $v$ is a unit vector, so $\|v\|^2 = \|\epsilon X^\dagger y + u\|^2 = \epsilon^2\|X^\dagger y\|^2 + \|u\|^2 = 1$, from which it follows that $\epsilon = \frac{\sqrt{1 - \|u\|^2}}{\|X^\dagger y\|}$.

The relationship $\epsilon = \frac{\sqrt{1 - \|u\|^2}}{\|X^\dagger y\|}$ gives three intervals for $\epsilon$ in which the curvature of $\mathcal{L}$ behaves qualitatively differently. For $\epsilon \in (0, 1/\|X^\dagger y\|)$ the equation has two solutions $\pm u$ in $\operatorname{nullspace}(X)$ with $\|u\| < 1$. Since $w \parallel v$, these rays of zero curvature lie outside of $\operatorname{rowspace}(X)$, and, by Lemma 8, the gradient cannot vanish along them. For $\epsilon = 1/\|X^\dagger y\|$, the solution is given by $u = 0$ and so there is a single ray in the direction of $X^\dagger y$ in $\operatorname{rowspace}(X)$. This ray is parameterized by $\alpha X^\dagger y$ for $\alpha \in (0, 1]$. The gradient may or may not be zero along this ray. Finally, for $\epsilon > 1/\|X^\dagger y\|$ there is no solution to the relationship and $\mathcal{L}$ is strongly convex within the cell $C$ with signature $s = -y$.

Before concluding we must address the fact that the Hessian $H$ is not defined at $w = 0$. Let $\{O_i\}_i$ be the set of $2^d$ closed orthants of $\mathbb{R}^d$. We further subdivide the cell $C$ with signature $s = -y$ as $C_i = C \cap O_i$. Within the relative interior of each $C_i$, $\mathcal{L}|_{\operatorname{Int} C_i}$ is twice differentiable everywhere with Hessian as described above. Thus $\mathcal{L}|_{\operatorname{Int} C_i}$ is convex for all $i$.

Let $w \in \operatorname{Bd} C_i \cap \operatorname{Int} C$ with $w \neq 0$. Then the subdifferential $\partial\mathcal{L}|_{C_i}(w)$ is nonempty and, in particular, contains the gradient $\nabla\mathcal{L}(w)$, which is defined at $w$ since $w \in \operatorname{Int} C$ and $w \neq 0$. The intersection $\partial\mathcal{L}|_{C_i}(w) \cap \partial\mathcal{L}|_{C_j}(w) = \{\nabla\mathcal{L}(w)\}$ for $w \in \operatorname{Bd} C_i \cap \operatorname{Bd} C_j \cap \operatorname{Int} C$, since $\mathcal{L}$ is actually differentiable at $w$.

Now let $w \in \operatorname{Int} C_i$ and $w' \in \operatorname{Int} C_j$ for $i \neq j$, such that the line segment $ww'$ does not intersect the origin in its relative interior. Further suppose that $C_i$ and $C_j$ are adjacent along the line segment $ww'$, meaning that there is a single point $w''$ at which the line segment $ww'$ leaves $\operatorname{Int} C_i$ and enters $\operatorname{Int} C_j$. Note that $w, w'', w'$ are collinear. Then
\begin{align*}
\langle\nabla\mathcal{L}(w), w' - w\rangle &= \langle\nabla\mathcal{L}(w), w' - w''\rangle + \langle\nabla\mathcal{L}(w), w'' - w\rangle \\
&= \frac{\|w' - w''\|}{\|w'' - w\|}\langle\nabla\mathcal{L}(w), w'' - w\rangle + \langle\nabla\mathcal{L}(w), w'' - w\rangle \\
&\leq \frac{\|w' - w''\|}{\|w'' - w\|}\langle\nabla\mathcal{L}(w''), w'' - w\rangle + \langle\nabla\mathcal{L}(w), w'' - w\rangle \\
&= \langle\nabla\mathcal{L}(w''), w' - w''\rangle + \langle\nabla\mathcal{L}(w), w'' - w\rangle \\
&\leq \mathcal{L}|_{\operatorname{Int} C_j}(w') - \mathcal{L}|_{\operatorname{Int} C_j}(w'') + \mathcal{L}|_{\operatorname{Int} C_i}(w'') - \mathcal{L}|_{\operatorname{Int} C_i}(w) \\
&= \mathcal{L}|_{\operatorname{Int} C_j}(w') - \mathcal{L}|_{\operatorname{Int} C_i}(w) \\
&= \mathcal{L}(w') - \mathcal{L}(w).
\end{align*}
The second equality follows from collinearity, and the first inequality follows from convexity of $\mathcal{L}|_{\operatorname{Int} C_i}$, from which we can derive $\langle\nabla\mathcal{L}(w'') - \nabla\mathcal{L}(w), w'' - w\rangle \geq 0$. The remaining steps are straightforward.
This argument can be extended, in a straightforward manner using induction, to a line segment $ww'$ for $w$ and $w'$ in two cells that intersect only at the origin. Thus it follows that $\mathcal{L}|_{\operatorname{Int} C}$ is convex along $ww'$. All that remains is the case where $ww'$ intersects the origin.

Suppose that $ww'$, parameterized by $\ell(t) = (1-t)w + tw'$ for $t \in [0, 1]$, intersects the origin. Choose any unit vector $v$ such that $v$ is not parallel to $ww'$. Then consider the perturbed line segment $\tilde\ell(t, \gamma) = (1-t)(w + \gamma v) + t(w' + \gamma v) = \ell(t) + \gamma v$ for $\gamma > 0$. Let $t_0$ be such that $\ell(t_0) = 0$. As $\gamma \to 0$, $\tilde\ell(t, \gamma) \to \ell(t)$ and, in particular, $\tilde\ell(t_0, \gamma) \to 0$. Since $v$ is not parallel to $ww'$, $\tilde\ell(t, \gamma)$ does not intersect the origin for $\gamma > 0$, and so $\mathcal{L}|_{\operatorname{Int} C}(\tilde\ell(t, \gamma)) \leq (1-t)\mathcal{L}|_{\operatorname{Int} C}(\tilde\ell(0, \gamma)) + t\,\mathcal{L}|_{\operatorname{Int} C}(\tilde\ell(1, \gamma))$. Taking $\gamma \to 0$, convexity follows from continuity of $\mathcal{L}|_{\operatorname{Int} C}$. Note that this approach only applies when $d \geq 2$; however, the $d = 1$ case for $\mathcal{L}$ is straightforward.

Lemma 10. $\mathcal{L}$ is a convex function. If $\mathcal{L}|_{\operatorname{Int} C}$ for the cell $C$ with signature $s = -y$ is a strongly convex function, then $\mathcal{L}$ is a strictly convex function. Furthermore, transitions between two cells are strictly convex.

Proof. Let $w, w' \in \mathbb{R}^d$ be any two points. The line segment $ww'$ with endpoints $w$ and $w'$ is parameterized by $w_t = (1-t)w + tw'$ for $t \in [0, 1]$. If $ww' \subset \operatorname{Int} C$ for some $C$ then Lemma 9 gives the result. Suppose that $w \in C$ and $w' \in C'$ are in distinct cells of the hyperplane arrangement and that $ww'$ intersects the boundaries of these cells at $t_1, \ldots, t_m$. This partitions the interval $[0, 1]$ into $m + 1$ subintervals $[0, t_1] \cup [t_1, t_2] \cup \ldots \cup [t_m, 1]$, in each of which the function $\mathcal{L}$ is convex along $w_{t_i}w_{t_{i+1}}$ by Lemma 9.

Consider the base case where $m = 1$. The point $w_{t_1} \in C \cap C'$ is where the line segment $ww'$ leaves $C$ and enters $C'$. The facet $f = C \cap C'$ is a $(d-k)$-dimensional facet, where $k$ is the number of hyperplanes that intersect at $w_{t_1}$. Said differently, at $w_{t_1}$ the signs of $k$ hyperplane equations $x_i^\top w - y_i$ flip.

Imagine removing these $k$ hyperplanes; then $w$ and $w'$ lie in the same cell of the induced hyperplane arrangement and, by Lemma 9, the objective function $\mathcal{L}^{(-k)}$ with these $k$ hyperplanes removed is convex. (Simply repeat the argument for $n - k$ samples.)
Thus we have
\begin{align*}
\mathcal{L}(w_{t_1}) &= \mathcal{L}^{(-k)}(w_{t_1}) + \frac{1}{2}\sum_i\left(\langle x_i, w_{t_1}\rangle - y_i + \epsilon\operatorname{sign}(\langle x_i, w_{t_1}\rangle - y_i)\|w_{t_1}\|\right)^2 \\
&= \mathcal{L}^{(-k)}(w_{t_1}) + \frac{1}{2}\sum_i\left(\epsilon\operatorname{sign}(\langle x_i, w_{t_1}\rangle - y_i)\|w_{t_1}\|\right)^2 \\
&= \mathcal{L}^{(-k)}(w_{t_1}) + \frac{1}{2}\sum_i \epsilon^2\|w_{t_1}\|^2 \\
&\leq (1 - t_1)\mathcal{L}^{(-k)}(w) + t_1\mathcal{L}^{(-k)}(w') + (1 - t_1)\frac{1}{2}\sum_i\epsilon^2\|w\|^2 + t_1\frac{1}{2}\sum_i\epsilon^2\|w'\|^2 \\
&< (1 - t_1)\mathcal{L}^{(-k)}(w) + t_1\mathcal{L}^{(-k)}(w') \\
&\quad + (1 - t_1)\frac{1}{2}\sum_i\left((\langle x_i, w\rangle - y_i)^2 + 2\epsilon\|w\|\operatorname{sign}(\langle x_i, w\rangle - y_i)(\langle x_i, w\rangle - y_i) + \epsilon^2\|w\|^2\right) \\
&\quad + t_1\frac{1}{2}\sum_i\left((\langle x_i, w'\rangle - y_i)^2 + 2\epsilon\|w'\|\operatorname{sign}(\langle x_i, w'\rangle - y_i)(\langle x_i, w'\rangle - y_i) + \epsilon^2\|w'\|^2\right) \\
&= (1 - t_1)\mathcal{L}(w) + t_1\mathcal{L}(w'),
\end{align*}
where the sums range over the $k$ samples for which, at $w_{t_1}$, the hyperplane constraint $x_i^\top w - y_i = 0$ is tight. The first inequality follows from the convexity of $\mathcal{L}^{(-k)}$ and of $\|w\|^2$ along the segment. The second inequality follows from adding strictly positive terms. The final equality follows by definition.

With this fact we are ready to show the convexity of $\mathcal{L}$ along the entire segment $ww'$:
\begin{align*}
\mathcal{L}(w_t) &\leq \begin{cases} (1 - \alpha(t))\mathcal{L}(w) + \alpha(t)\mathcal{L}(w_{t_1}) & t \in [0, t_1] \\ (1 - \beta(t))\mathcal{L}(w_{t_1}) + \beta(t)\mathcal{L}(w') & t \in [t_1, 1] \end{cases} \\
&< \begin{cases} (1 - \alpha(t))\mathcal{L}(w) + \alpha(t)\left((1 - t_1)\mathcal{L}(w) + t_1\mathcal{L}(w')\right) & t \in [0, t_1] \\ (1 - \beta(t))\left((1 - t_1)\mathcal{L}(w) + t_1\mathcal{L}(w')\right) + \beta(t)\mathcal{L}(w') & t \in [t_1, 1] \end{cases} \\
&= \begin{cases} (1 - t)\mathcal{L}(w) + t\mathcal{L}(w') & t \in [0, t_1] \\ (1 - t)\mathcal{L}(w) + t\mathcal{L}(w') & t \in [t_1, 1] \end{cases} \\
&= (1 - t)\mathcal{L}(w) + t\mathcal{L}(w').
\end{align*}
The first inequality follows from the fact that $\mathcal{L}$ is convex along each sub-segment. The functions $\alpha: [0, t_1] \to [0, 1]$ and $\beta: [t_1, 1] \to [0, 1]$ are the reparameterization functions defined as $\alpha(t) = \frac{t}{t_1}$ and $\beta(t) = \frac{t - t_1}{1 - t_1}$. The second inequality follows from the statement we proved about $\mathcal{L}(w_{t_1})$. The final equality follows from the definitions of $\alpha, \beta$. Thus $\mathcal{L}$ is convex along the line segment $ww'$ when $m = 1$.

Repeating the argument inductively gives that $\mathcal{L}$ is convex along $ww'$ for any $m$. We have proven that the transitions between cells are strictly convex. When $\mathcal{L}$ restricted to each cell $C$ is strongly convex, the whole function $\mathcal{L}$ is strictly convex; otherwise $\mathcal{L}$ is convex.

Lemma 11. $\mathcal{L}$ is subdifferentiable everywhere. Let $w \in \mathbb{R}^d$. If $w \in \operatorname{Int} C$ for some $C \in \mathcal{H}$ and $w \neq 0$, then $\mathcal{L}$ is differentiable at $w$ with
\[
\nabla\mathcal{L}(w) = X^\top(Xw - y) + \epsilon\|w\|X^\top s + \epsilon\|Xw - y\|_1\frac{w}{\|w\|} + \epsilon^2 n w. \tag{20}
\]
If $w = 0$, then the subdifferential $\partial\mathcal{L}(0)$ is parameterized by replacing $\frac{w}{\|w\|}$ in Equation 20 with any $g$ such that $\|g\| \leq 1$.

Otherwise $w \in \operatorname{Bd} C$, meaning that $w \in f$ for some $(d-k)$-dimensional face $f$ of $C$. Let $\{i_1, \ldots, i_k\} \subset [n]$ be the $k$ indices for which $w \in h_{i_j}$ (that is, $x_{i_j}^\top w - y_{i_j} = 0$). The subdifferential $\partial\mathcal{L}(w)$ is non-empty and is parameterized by every setting of $s_{i_j} \in [-1, 1]$ in Equation 20.

Proof. By Lemma 10, the epigraph of $\mathcal{L}$ is a convex set.
The Separating Hyperplane Theorem implies the existence of a supporting hyperplane at every point $(w, \mathcal{L}(w))$ of the epigraph. If $w \in \operatorname{Int} C$ for some cell $C$ of the hyperplane arrangement, then $\mathcal{L}$ is differentiable at $w$ and there is a single supporting hyperplane at $(w, \mathcal{L}(w))$. Otherwise, $w$ is on the boundary $\operatorname{Bd} C$ of some $C$, and the existence of a supporting hyperplane implies the existence of a subgradient of $\mathcal{L}$ at $w$.

The gradient of $\mathcal{L}$, where defined, is
\[
\nabla\mathcal{L}(w) = X^\top(Xw - y) + \epsilon\|w\|X^\top s + \epsilon\|Xw - y\|_1\frac{w}{\|w\|} + \epsilon^2 n w.
\]
Suppose that $w = 0$. Let $v \in S^{d-1}$ be a unit vector and $\delta > 0$ sufficiently small. Then convexity of $\mathcal{L}|_{\operatorname{Int} C}$ for the cell $C$ with signature $s = -y$ (Lemma 9) and a standard limit argument give
\begin{align*}
\langle -X^\top y + \epsilon\|y\|_1 v, w'\rangle &= \lim_{\delta\to 0^+}\langle\nabla\mathcal{L}(\delta v), w' - \delta v\rangle \\
&\leq \lim_{\delta\to 0^+}\left(\mathcal{L}(w') - \mathcal{L}(\delta v)\right) \\
&= \mathcal{L}(w') - \mathcal{L}(0),
\end{align*}
where the inequality holds for all $\delta > 0$ and the limits follow by continuity. Thus $v$ induces a subgradient at $w = 0$.

Let $g$ be a vector such that $\|g\| \leq 1$. Then $g$ can be written as $g = (1-\alpha)v + \alpha(-v)$ for $\alpha = (1 - \|g\|)/2$ and $v = g/\|g\|$, where $v, -v$ induce subgradients at $0$ as above. Since $\|g\| \leq 1$, $\alpha \in [0, 1]$. So
\begin{align*}
\langle -X^\top y + \epsilon\|y\|_1 g, w'\rangle &= \langle -X^\top y + \epsilon\|y\|_1\left((1-\alpha)v + \alpha(-v)\right), w'\rangle \\
&= (1-\alpha)\langle -X^\top y + \epsilon\|y\|_1 v, w'\rangle + \alpha\langle -X^\top y + \epsilon\|y\|_1(-v), w'\rangle \\
&\leq (1-\alpha)\left(\mathcal{L}(w') - \mathcal{L}(0)\right) + \alpha\left(\mathcal{L}(w') - \mathcal{L}(0)\right) \\
&= \mathcal{L}(w') - \mathcal{L}(0),
\end{align*}
and so $g$ induces a subgradient at $w = 0$ as well.

To find the subdifferential $\partial\mathcal{L}(w)$ for $w \neq 0$, we consider $w \in f$ for some $(d-k)$-dimensional facet $f$ of $C$, and proceed by induction over $k$.

In the base case, $k = 1$. Since $f$ is $(d-1)$-dimensional, there is only one tight hyperplane equation $x_i^\top w - y_i = 0$ at $w$. Let $h = \{w \in \mathbb{R}^d : x_i^\top w - y_i = 0\}$ denote the hyperplane and let $h^+, h^-$ denote the halfspaces in which $\operatorname{sign}(x_i^\top w - y_i) = \pm 1$ respectively. The limit of the gradient as we approach $w$ by a sequence in $h^+$ is $\nabla\mathcal{L}(w)$ with $s_i = 1$; similarly, approaching $w$ by a sequence in $h^-$ gives $\nabla\mathcal{L}(w)$ with $s_i = -1$. These vectors define two supporting hyperplanes of the epigraph at $(w, \mathcal{L}(w))$.

Note that only $\epsilon\|w\|X^\top s$ and $\epsilon\|Xw - y\|_1\frac{w}{\|w\|}$ in $\nabla\mathcal{L}$ depend upon $s$, and when $x_i^\top w - y_i = 0$ the norm $\|Xw - y\|_1$ is identical regardless of the setting of $s_i$, so we need only consider $\epsilon\|w\|X^\top s$. Let $s_i \in [-1, 1]$. Then
\begin{align*}
\epsilon\|w\|\langle X^\top s, w' - w\rangle &= \epsilon\|w\|\Big\langle \sum_{j\neq i} s_j x_j + s_i x_i,\; w' - w\Big\rangle \\
&= \epsilon\|w\|\Big(\Big\langle \sum_{j\neq i} s_j x_j, w' - w\Big\rangle + \langle s_i x_i, w' - w\rangle\Big) \\
&= \epsilon\|w\|\Big(\Big\langle \sum_{j\neq i} s_j x_j, w' - w\Big\rangle + (1-\alpha)\langle -x_i, w' - w\rangle + \alpha\langle x_i, w' - w\rangle\Big),
\end{align*}
where $\alpha = (1 + s_i)/2$.
Then we can express $\nabla\mathcal{L}(w)|_{s_i}$, where $s_i \in [-1, 1]$, as a convex combination of $\nabla\mathcal{L}(w)|_{s_i=-1}$ and $\nabla\mathcal{L}(w)|_{s_i=1}$:
\begin{align*}
\langle\nabla\mathcal{L}(w)|_{s_i}, w' - w\rangle &= \Big\langle X^\top(Xw - y) + \epsilon\|w\|X^\top s + \epsilon\|Xw - y\|_1\frac{w}{\|w\|} + \epsilon^2 n w,\; w' - w\Big\rangle \\
&= \Big\langle X^\top(Xw - y) + \epsilon\|Xw - y\|_1\frac{w}{\|w\|} + \epsilon^2 n w,\; w' - w\Big\rangle \\
&\qquad + \epsilon\|w\|\Big(\Big\langle \sum_{j\neq i}s_j x_j, w' - w\Big\rangle + (1-\alpha)\langle -x_i, w' - w\rangle + \alpha\langle x_i, w' - w\rangle\Big) \\
&= (1-\alpha)\langle\nabla\mathcal{L}(w)|_{s_i=-1}, w' - w\rangle + \alpha\langle\nabla\mathcal{L}(w)|_{s_i=1}, w' - w\rangle \\
&\leq (1-\alpha)\left(\mathcal{L}(w') - \mathcal{L}(w)\right) + \alpha\left(\mathcal{L}(w') - \mathcal{L}(w)\right) \\
&= \mathcal{L}(w') - \mathcal{L}(w),
\end{align*}
where the inequality follows from the fact that $\nabla\mathcal{L}(w)|_{s_i=-1}$ and $\nabla\mathcal{L}(w)|_{s_i=1}$ are subgradients. Thus $\nabla\mathcal{L}(w)|_{s_i}$ is a subgradient at $w$ for any $s_i \in [-1, 1]$.

Now suppose that $w \in f$ for a $(d-k)$-dimensional facet $f$ and that the statement holds for all $1 \leq j < k$. Let $\{i_1, \ldots, i_k\}$ index the hyperplane equations $x_{i_j}^\top w - y_{i_j} = 0$ that are tight at $w$. Consider the subset of hyperplane equations $\{i_1, \ldots, i_{k-1}\}$, along which subgradients exist for any setting of $s_{i_j} \in [-1, 1]$ by the inductive hypothesis. An identical limit argument to the one above implies the existence of two subgradients at $w$ with $s_{i_k} = \pm 1$, and an identical calculation to those above implies that $\nabla\mathcal{L}(w)|_{s_{i_k}}$ is a subgradient for any $s_{i_k} \in [-1, 1]$. Thus at $w \in f$ there exists a subdifferential parameterized by $s_{i_j} \in [-1, 1]$ for every $1 \leq j \leq k$.
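At points where $\mathcal{L}$ is differentiable, Equation 20 can be sanity-checked by finite differences. A minimal sketch (ours, assuming NumPy; the helper names are ours):

\begin{verbatim}
import numpy as np

def adv_loss(X, y, w, eps):
    # 0.5 * sum_i (|<x_i, w> - y_i| + eps * ||w||)^2
    r = X @ w - y
    return 0.5 * np.sum((np.abs(r) + eps*np.linalg.norm(w))**2)

def adv_grad(X, y, w, eps):
    # Equation 20 with s = sign(Xw - y); valid off the hyperplanes and away from 0.
    r = X @ w - y
    s = np.sign(r)
    nw = np.linalg.norm(w)
    return X.T @ r + eps*nw*(X.T @ s) + eps*np.abs(r).sum()*w/nw + eps**2*len(y)*w

rng = np.random.default_rng(1)
n, d, eps = 6, 4, 0.25
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)
h = 1e-6
fd = np.array([(adv_loss(X, y, w + h*e, eps) - adv_loss(X, y, w - h*e, eps)) / (2*h)
               for e in np.eye(d)])
assert np.allclose(adv_grad(X, y, w, eps), fd, atol=1e-5)
print("Equation 20 matches finite differences")
\end{verbatim}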
Proof of Theorem 5. Lemma 10 states that $\mathcal{L}$ is convex and that transitions between cells are strictly convex. The cases in the theorem statement correspond to the cases in Lemma 9, which describe the geometry of the cell containing the origin. Finally, Lemma 8 states that any optimal solution must lie in the rowspace of $X$, and Lemma 11 states that $\mathcal{L}$ is subdifferentiable everywhere.
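The convexity claim of Theorem 5 can likewise be spot-checked numerically. The sketch below (ours, a sanity check only, not a proof) samples random pairs of points and verifies midpoint convexity of the objective.

\begin{verbatim}
import numpy as np

def adv_loss(X, y, w, eps):
    r = X @ w - y
    return 0.5 * np.sum((np.abs(r) + eps*np.linalg.norm(w))**2)

rng = np.random.default_rng(2)
n, d, eps = 6, 4, 0.25
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
for _ in range(10000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    mid = 0.5*(u + v)
    # Midpoint convexity: L(mid) <= (L(u) + L(v)) / 2.
    assert adv_loss(X, y, mid, eps) <= 0.5*(adv_loss(X, y, u, eps)
                                            + adv_loss(X, y, v, eps)) + 1e-9
print("midpoint convexity holds on all sampled pairs")
\end{verbatim}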
A.4 Proof of Theorem 7

Proof.
The gradient at the minimum $L_2$-norm solution $X^\top\alpha$ is
\begin{align*}
\nabla\mathcal{L}(X^\top\alpha) &= X^\top(XX^\top\alpha - y) + \epsilon\|X^\top\alpha\|X^\top s + \epsilon\|XX^\top\alpha - y\|_1\frac{X^\top\alpha}{\|X^\top\alpha\|} + \epsilon^2 n X^\top\alpha \\
&= \epsilon\|X^\top\alpha\|X^\top s + \epsilon^2 n X^\top\alpha,
\end{align*}
since $XX^\top\alpha = y$. Setting $\nabla\mathcal{L} = 0$ gives
\[
-X^\top s = \frac{\epsilon n}{\|X^\top\alpha\|}X^\top\alpha, \tag{21}
\]
that is,
\[
-\sum_{i=1}^n s_i x_i = \frac{\epsilon n}{\|X^\top\alpha\|}\Big(\sum_{i\in P}\alpha_+ x_i + \sum_{j\in N}\alpha_- x_j\Big).
\]
Lemma 11 states that, at $X^\top\alpha$, there exists a subgradient for every setting of $s \in [-1, 1]^n$. To prove the result we must show that there exists some setting of $s$ that satisfies Equation 21, which we will do by showing that, under the conditions on $\epsilon$, the coefficient of each $x_i$ on the right hand side of Equation 21 is in the range $[-1, 1]$. Since $s_i \in [-1, 1]$, the negative sign on the left hand side of Equation 21 is inconsequential. It is sufficient to show that $\frac{\epsilon n\alpha_+}{\|X^\top\alpha\|}$ and $\frac{\epsilon n\alpha_-}{\|X^\top\alpha\|}$ are in the range $[-1, 1]$. (Necessity follows from the fact that the rows of $X$ are linearly independent.) We have
\begin{align*}
\frac{\epsilon n\alpha_+}{\|X^\top\alpha\|} &= \frac{\epsilon(n_+ + n_-)\alpha_+}{\sqrt{(n_+\alpha_+ - n_-\alpha_-)^2 + 2(n_+\alpha_+ + n_-\alpha_-)^2 + n_+\alpha_+^2 + 5n_-\alpha_-^2}} \\
&= \frac{\epsilon\left(4n_-^2 + 4n_+n_- + 5n_+ + 5n_-\right)}{\sqrt{64n_+^2n_-^2 + 160n_+^2n_- + 32n_+n_-^2 + 75n_+^2 + 70n_+n_- + 25n_+ + 3n_-^2 + 5n_-}} \leq 1,
\end{align*}
where the last inequality follows from the condition on $\epsilon$. Note also that $\frac{\epsilon n\alpha_+}{\|X^\top\alpha\|} \geq 0$ by definition. The case for $\alpha_-$ is similar.

Now assume that $n_+ = cn_-$. The right hand side of Equation 13 becomes
\[
\frac{\sqrt{64c^2n_-^4 + 160c^2n_-^3 + 32cn_-^3 + 75c^2n_-^2 + 70cn_-^2 + 25cn_- + 3n_-^2 + 5n_-}}{\max\left(4n_-^2 + 4cn_-^2 + 5cn_- + 5n_-,\; 4c^2n_-^2 + 4cn_-^2 + cn_- + n_-\right)}. \tag{22}
\]
The maximum evaluates as
\[
\max\left(4n_-^2 + 4cn_-^2 + 5cn_- + 5n_-,\; 4c^2n_-^2 + 4cn_-^2 + cn_- + n_-\right) = \begin{cases} 4n_-^2 + 4cn_-^2 + 5cn_- + 5n_- & \text{if } c < \frac{n_- + 1}{n_-} \\ 4c^2n_-^2 + 4cn_-^2 + cn_- + n_- & \text{if } c \geq \frac{n_- + 1}{n_-}. \end{cases}
\]
Within each of these ranges it can be checked, using Mathematica, that the derivative of Equation 22 with respect to $n_-$ is negative. Thus we can consider the limit as $n_- \to \infty$, which gives the lower bound
\[
\frac{\sqrt{64c^2n_-^4 + 160c^2n_-^3 + 32cn_-^3 + 75c^2n_-^2 + 70cn_-^2 + 25cn_- + 3n_-^2 + 5n_-}}{\max\left(4n_-^2 + 4cn_-^2 + 5cn_- + 5n_-,\; 4c^2n_-^2 + 4cn_-^2 + cn_- + n_-\right)} \geq \min\left(\frac{2c}{1+c}, \frac{2}{1+c}\right).
\]
Taking $\epsilon \leq \min\left(\frac{2c}{1+c}, \frac{2}{1+c}\right)$ is therefore a sufficient condition, but not a necessary one, due to the gap in the lower bound.
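The following sketch (ours) spot-checks the final bound. It computes the right hand side of Equation 13, reconstructed here as $\|X^\top\alpha\| / (n\max(\alpha_+, |\alpha_-|))$ in line with the argument above, and confirms it stays above the limit $\min(2c/(1+c), 2/(1+c))$ for several sampled values of $c$ and $n_-$; the helper name is ours.

\begin{verbatim}
import numpy as np

def eq13_rhs(n_pos, n_neg):
    # ||X^T alpha|| / (n * max(alpha_+, |alpha_-|)), using ||X^T alpha||^2 = alpha^T y.
    D = 15*n_pos + 3*n_neg + 8*n_pos*n_neg + 5
    a_pos, a_neg_abs = (4*n_neg + 5) / D, (4*n_pos + 1) / D
    norm_w = np.sqrt((8*n_pos*n_neg + 5*n_pos + n_neg) / D)
    return norm_w / ((n_pos + n_neg) * max(a_pos, a_neg_abs))

# Per the text, the bound decreases in n_- at fixed c = n_+ / n_-, so the
# limit min(2c/(1+c), 2/(1+c)) should lower bound it for every finite n_-.
for c in (1, 2, 5):
    for n_neg in (1, 10, 100, 1000):
        assert eq13_rhs(c*n_neg, n_neg) >= min(2*c/(1+c), 2/(1+c)) - 1e-12
print("limit lower bound verified on sampled sizes")
\end{verbatim}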
B Corrections
Wilson et al. (2017) derive the minimum norm solution using the kernel trick. The optimal solution is $w_{SGD} = X^\top\alpha$ where $\alpha = K^{-1}y$ and $K = XX^\top$. They compute
\[
K_{ij} = \begin{cases} 4 & \text{if } i = j \text{ and } y_i = 1 \\ 8 & \text{if } i = j \text{ and } y_i = -1 \\ 3 & \text{if } i \neq j \text{ and } y_iy_j = 1 \\ 1 & \text{if } i \neq j \text{ and } y_iy_j = -1, \end{cases}
\]
and, positing correctly that $\alpha_i = \alpha_+$ if $y_i = 1$ and $\alpha_i = \alpha_-$ if $y_i = -1$, they derive the system of equations
\begin{align*}
(3n_+ + 1)\alpha_+ + n_-\alpha_- &= 1 \\
n_+\alpha_+ + (3n_- + 3)\alpha_- &= -1
\end{align*}
with solution
\[
\alpha_+ = \frac{4n_- + 3}{9n_+ + 3n_- + 8n_+n_- + 3}, \qquad \alpha_- = -\frac{4n_+ + 1}{9n_+ + 3n_- + 8n_+n_- + 3}.
\]
Wilson et al. (2017) mistakenly dropped the negative sign in $\alpha_-$. Unfortunately there is an additional minor mistake in the linear system. The system is derived by computing
\[
(K\alpha)_i = \begin{cases} 4\alpha_i + 3\sum_{j\in P\setminus i}\alpha_j + \sum_{j\in N}\alpha_j & \text{if } y_i = 1 \\ 8\alpha_i + \sum_{j\in P}\alpha_j + 3\sum_{j\in N\setminus i}\alpha_j & \text{if } y_i = -1 \end{cases}
= \begin{cases} \alpha_i + 3\sum_{j\in P}\alpha_j + \sum_{j\in N}\alpha_j & \text{if } y_i = 1 \\ 5\alpha_i + \sum_{j\in P}\alpha_j + 3\sum_{j\in N}\alpha_j & \text{if } y_i = -1. \end{cases}
\]
Subtracting equations we reach the conclusion that $\alpha_i = \alpha_+$ if $y_i = 1$ and $\alpha_i = \alpha_-$ if $y_i = -1$. Then it is clear that there are really only two equations in this system,
\begin{align*}
(3n_+ + 1)\alpha_+ + n_-\alpha_- &= 1 \\
n_+\alpha_+ + (3n_- + 5)\alpha_- &= -1,
\end{align*}
which gives
\[
\alpha_+ = \frac{4n_- + 5}{15n_+ + 3n_- + 8n_+n_- + 5}, \qquad \alpha_- = -\frac{4n_+ + 1}{15n_+ + 3n_- + 8n_+n_- + 5}.
\]
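The corrected closed form is easy to confirm numerically: build $X$ for the construction (one label feature, two shared features, one unique feature per positive example and five per negative example, as implied by the kernel entries above), solve $K\alpha = y$, and compare with the expressions derived here. A short sketch (ours, assuming NumPy; the helper name is ours):

\begin{verbatim}
import numpy as np

def build_X(n_pos, n_neg):
    # Label feature, two shared features, one unique feature per positive
    # example, five per negative example (layout assumed from K above).
    d = 3 + n_pos + 5*n_neg
    X = np.zeros((n_pos + n_neg, d))
    for i in range(n_pos):
        X[i, 0] = 1; X[i, 1:3] = 1; X[i, 3 + i] = 1
    for j in range(n_neg):
        r = n_pos + j
        X[r, 0] = -1; X[r, 1:3] = 1
        X[r, 3 + n_pos + 5*j: 3 + n_pos + 5*(j+1)] = 1
    return X

for n_pos, n_neg in [(1, 1), (3, 2), (7, 4)]:
    X = build_X(n_pos, n_neg)
    y = np.array([1.0]*n_pos + [-1.0]*n_neg)
    alpha = np.linalg.solve(X @ X.T, y)        # K alpha = y
    D = 15*n_pos + 3*n_neg + 8*n_pos*n_neg + 5
    assert np.allclose(alpha[:n_pos], (4*n_neg + 5) / D)
    assert np.allclose(alpha[n_pos:], -(4*n_pos + 1) / D)
print("corrected closed form matches K^{-1} y")
\end{verbatim}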