Rademacher Complexity for Adversarially Robust Generalization
Dong Yin¹, Kannan Ramchandran¹, and Peter Bartlett¹,²

¹Department of Electrical Engineering and Computer Sciences, UC Berkeley
²Department of Statistics, UC Berkeley
Abstract
Many machine learning models are vulnerable to adversarial attacks; for example, adding adversarial perturbations that are imperceptible to humans can often make machine learning models produce wrong predictions with high confidence. Moreover, although we may obtain robust models on the training dataset via adversarial training, in some problems the learned models cannot generalize well to the test data. In this paper, we focus on $\ell_\infty$ attacks, and study the adversarially robust generalization problem through the lens of Rademacher complexity. For binary linear classifiers, we prove tight bounds for the adversarial Rademacher complexity, and show that the adversarial Rademacher complexity is never smaller than its natural counterpart, and that it has an unavoidable dimension dependence unless the weight vector has bounded $\ell_1$ norm. The results also extend to multi-class linear classifiers. For (nonlinear) neural networks, we show that the dimension dependence in the adversarial Rademacher complexity also exists. We further consider a surrogate adversarial loss for one-hidden-layer ReLU networks and prove margin bounds for this setting. Our results indicate that having $\ell_1$ norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. We demonstrate experimental results that validate our theoretical findings.

1 Introduction

In recent years, many modern machine learning models, in particular deep neural networks, have achieved success in tasks such as image classification [27], speech recognition [25], machine translation [6], and game playing [48]. However, although these models achieve state-of-the-art performance in many standard benchmarks and competitions, it has been observed that adversarially adding some perturbation to the input of the model (images, audio signals) can make the model produce wrong predictions with high confidence. These adversarial inputs are often called adversarial examples.
Typical methods of generating adversarial examples include adding small perturbations that are imperceptible to humans [51], changing surrounding areas of the main objects in images [21], and even simple rotation and translation [17]. This phenomenon was first discovered by Szegedy et al. [51] in image classification problems, and similar phenomena have been observed in other areas [14, 32]. Adversarial examples bring serious challenges in many security-critical applications, such as medical diagnosis and autonomous driving: the existence of these examples shows that many state-of-the-art machine learning models are actually unreliable in the presence of adversarial attacks.

Since the discovery of adversarial examples, there has been a race between designing robust models that can defend against adversarial attacks and designing attack algorithms that can generate adversarial examples and fool the machine learning models [24, 26, 12, 13]. As of now, it seems that the attackers are winning this game. For example, a recent work shows that many of the defense algorithms fail when the attacker uses a carefully designed gradient-based method [4]. Meanwhile, adversarial training [28, 47, 37] seems to be the most effective defense method. Adversarial training takes a robust optimization [10] perspective on the problem, and the basic idea is to minimize some adversarial loss over the training data. We elaborate below.

Suppose that data points $(\mathbf{x}, y)$ are drawn according to some unknown distribution $\mathcal{D}$ over the feature-label space $\mathcal{X} \times \mathcal{Y}$, and
$\mathcal{X} \subseteq \mathbb{R}^d$. Let $\mathcal{F}$ be a hypothesis class (e.g., a class of neural networks with a particular architecture), and let $\ell(f(\mathbf{x}), y)$ be the loss associated with $f \in \mathcal{F}$. Consider the $\ell_\infty$ white-box adversarial attack, where an adversary is allowed to observe the trained model and choose some $\mathbf{x}'$ such that $\|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon$ and $\ell(f(\mathbf{x}'), y)$ is maximized. Therefore, to better defend against adversarial attacks, during training the learner should aim to solve the empirical adversarial risk minimization problem
$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \max_{\|\mathbf{x}'_i - \mathbf{x}_i\|_\infty \le \epsilon} \ell(f(\mathbf{x}'_i), y_i), \qquad (1)$$
where $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are $n$ i.i.d. training examples drawn according to $\mathcal{D}$. This minimax formulation raises many interesting theoretical and practical questions. For example, we need to understand how to efficiently solve the optimization problem in (1), and in addition, we need to characterize the generalization property of the adversarial risk, i.e., the gap between the empirical adversarial risk in (1) and the population adversarial risk $\mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\max_{\|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon} \ell(f(\mathbf{x}'), y)]$. In fact, for deep neural networks, both questions are still wide open. In particular, for the generalization problem, it has been observed that even if we can minimize the adversarial training error, the adversarial test error can still be large. For example, for a ResNet [27] model on CIFAR10, using the PGD adversarial training algorithm by Madry et al. [37], one can achieve about 96% adversarial training accuracy, but the adversarial test accuracy is only 47%.
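The inner maximization in (1) is typically approximated by projected gradient ascent on the input (PGD [37]). Below is a minimal numpy sketch; the helper name `pgd_linf`, the step size, and the logistic-loss example are our own illustrative choices, not from the paper.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, alpha=0.01, steps=40):
    """Projected gradient ascent for the inner maximization in (1):
    repeatedly step in the sign of the loss gradient, then project back
    onto the l_inf ball of radius eps around x. grad_fn(x_adv) returns
    the gradient of the loss with respect to the input."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the ball
    return x_adv

# Toy example: logistic loss of a linear model <w, x>, whose input
# gradient is -y * w * sigmoid(-y <w, x>).
w, y, eps = np.array([1.0, -2.0]), 1.0, 0.1
grad = lambda x: -y * w / (1.0 + np.exp(y * (w @ x)))
x0 = np.array([0.5, 0.3])
x_adv = pgd_linf(x0, grad, eps)
```

For a linear model with a nonincreasing loss the worst case is known in closed form ($\mathbf{x} - \epsilon y\,\mathrm{sign}(\mathbf{w})$), which makes it a convenient correctness check for the iteration.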
This generalization gap is significantly larger than that in the natural setting (without adversarial attacks), and thus it has become increasingly important to better understand the generalization behavior of machine learning models in the adversarial setting.

In this paper, we focus on the adversarially robust generalization property and make a first step towards a deeper understanding of this problem. We focus on $\ell_\infty$ adversarial attacks and analyze generalization through the lens of Rademacher complexity. We study both linear classifiers and nonlinear feedforward neural networks, and both binary and multi-class classification problems. We summarize our contributions as follows.

• For binary linear classifiers, we prove tight upper and lower bounds for the adversarial Rademacher complexity. We show that the adversarial Rademacher complexity is never smaller than its counterpart in the natural setting, which provides theoretical evidence for the empirical observation that adversarially robust generalization can be hard. We also show that under an $\ell_\infty$ adversarial attack, when the weight vector of the linear classifier has bounded $\ell_p$ norm ($p \ge 1$), a polynomial dimension dependence in the adversarial Rademacher complexity is unavoidable, unless $p = 1$. For multi-class linear classifiers, we prove margin bounds in the adversarial setting. Similar to binary classifiers, the margin bound also exhibits polynomial dimension dependence when the weight vector for each class has bounded $\ell_p$ norm ($p > 1$).

• For neural networks, we show that in contrast to the margin bounds derived by Bartlett et al. [9] and Golowich et al. [23], which depend only on the norms of the weight matrices and the data points, the adversarial Rademacher complexity has a lower bound with an explicit dimension dependence, which is also an effect of the $\ell_\infty$ attack.
We further consider a surrogate adversarial loss for one-hidden-layer ReLU networks, based on the SDP relaxation proposed by Raghunathan et al. [44]. We prove margin bounds using the surrogate loss and show that if the weight matrix of the first layer has bounded $\ell_1$ norm, the margin bound does not have explicit dimension dependence. This suggests that in the adversarial setting, controlling the $\ell_1$ norms of the weight matrices may be a way to improve generalization.

• We conduct experiments on linear classifiers and neural networks to validate our theoretical findings; more specifically, our experiments show that $\ell_1$ regularization could reduce the adversarial generalization error, and that the adversarial generalization gap increases as the dimension of the feature space increases.

1.2 Notation

We define the set $[N] := \{1, 2, \ldots, N\}$. For two sets $A$ and $B$, we denote by $B^A$ the set of all functions from $A$ to $B$. We denote the indicator function of an event $A$ by $\mathbb{1}(A)$. Unless otherwise stated, we denote vectors by boldface lowercase letters such as $\mathbf{w}$, and the elements of a vector are denoted by italic letters with subscripts, such as $w_k$. All-one vectors are denoted by $\mathbf{1}$. Matrices are denoted by boldface uppercase letters such as $W$. For a matrix $W \in \mathbb{R}^{d \times m}$ with columns $\mathbf{w}_i$, $i \in [m]$, the $(p, q)$ matrix norm of $W$ is defined as $\|W\|_{p,q} = \|[\|\mathbf{w}_1\|_p, \|\mathbf{w}_2\|_p, \cdots, \|\mathbf{w}_m\|_p]\|_q$, and we may use the shorthand notation $\|\cdot\|_p \equiv \|\cdot\|_{p,p}$. We denote the spectral norm of matrices by $\|\cdot\|_\sigma$ and the Frobenius norm by $\|\cdot\|_F$ (i.e., $\|\cdot\|_F \equiv \|\cdot\|_{2,2}$). We use $\mathbb{B}^\infty_{\mathbf{x}}(\epsilon)$ to denote the $\ell_\infty$ ball centered at $\mathbf{x} \in \mathbb{R}^d$ with radius $\epsilon$, i.e., $\mathbb{B}^\infty_{\mathbf{x}}(\epsilon) = \{\mathbf{x}' \in \mathbb{R}^d : \|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon\}$.
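As a quick check of the $(p, q)$ matrix norm convention above (the helper name is ours, for illustration only):

```python
import numpy as np

def pq_norm(W, p, q):
    """(p, q) matrix norm: l_q norm of the vector of column-wise l_p norms."""
    col_norms = np.linalg.norm(W, ord=p, axis=0)  # ||w_i||_p for each column
    return np.linalg.norm(col_norms, ord=q)

W = np.array([[3.0, 0.0],
              [4.0, 1.0]])
# column l_2 norms are [5, 1], so the (2,2) norm is sqrt(26),
# which coincides with the Frobenius norm as stated in the text
val = pq_norm(W, 2, 2)
```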
The rest of this paper is organized as follows: in Section 2, we discuss related work; in Section 3, we describe the formal problem setup; we present our main results for linear classifiers and neural networks in Sections 4 and 5, respectively. We demonstrate our experimental results in Section 6 and conclude in Section 7.
2 Related Work

During the preparation of the initial draft of this paper, we became aware of an independent and concurrent work by Khim and Loh [29], which studies a similar problem. In this section, we first compare our work with Khim and Loh [29] and then discuss other related work. We make the comparison in the following aspects.

• For binary classification problems, the adversarial Rademacher complexity upper bound by Khim and Loh [29] is similar to ours. However, we provide an adversarial Rademacher complexity lower bound that matches the upper bound. Our lower bound shows that the adversarial Rademacher complexity is never smaller than that in the natural setting, indicating the hardness of adversarially robust generalization. As mentioned, although our lower bound is for Rademacher complexity rather than generalization, Rademacher complexity is a tight bound for the rate of uniform convergence of a loss function class [30] and thus in many settings can be a tight bound for generalization. In addition, we provide a lower bound for the adversarial Rademacher complexity for neural networks. These lower bounds do not appear in the work by Khim and Loh [29].

• We discuss generalization bounds for the multi-class setting, whereas Khim and Loh [29] focus only on binary classification.

• Both our work and that of Khim and Loh [29] prove adversarial generalization bounds using a surrogate adversarial loss (an upper bound for the actual adversarial loss). Khim and Loh [29] use a method called the tree transform, whereas we use the SDP relaxation proposed by [44]. These two approaches are based on different ideas and thus we believe that they are not directly comparable.

We proceed to discuss other related work.
Adversarially robust generalization
As discussed in Section 1, it has been observed by Madry et al. [37] that there might be a significant generalization gap when training deep neural networks in the adversarial setting. This generalization problem has been further studied by Schmidt et al. [46], who show that to correctly classify two separated $d$-dimensional spherical Gaussian distributions, in the natural setting one only needs $O(1)$ training data points, but in the adversarial setting one needs $\Theta(\sqrt{d})$ data points. Obtaining distribution-agnostic generalization bounds (also known as the PAC-learning framework) for the adversarial setting is proposed as an open problem by Schmidt et al. [46]. In a subsequent work, Cullina et al. [15] study PAC-learning guarantees for binary linear classifiers in the adversarial setting via VC-dimension, and show that the VC-dimension does not increase in the adversarial setting. This result does not explain the empirical observation that adversarially robust generalization may be hard. In fact, although VC-dimension and Rademacher complexity can both provide valid generalization bounds, VC-dimension usually depends on the number of parameters in the model, while Rademacher complexity usually depends on the norms of the weight matrices and data points, and can often provide tighter generalization bounds [7]. Suggala et al. [50] discuss a similar notion of adversarial risk but do not prove explicit generalization bounds. Attias et al. [5] prove adversarial generalization bounds in a setting where the number of potential adversarial perturbations is finite, which is a weaker notion than the $\ell_\infty$ attack that we consider. Sinha et al. [49] analyze the convergence and generalization of an adversarial training algorithm under the notion of distributional robustness. Farnia et al. [18] study the generalization problem when the attack algorithm of the adversary is provided to the learner, which is also a weaker notion than our problem.
In earlier work, robust optimization has been studied in Lasso [56] and SVM [57] problems. Xu and Mannor [55] make the connection between algorithmic robustness and generalization in the natural setting, whereas our work focuses on generalization in the adversarial setting.

Provable defense against adversarial attacks
Besides generalization properties, another recent line of work aims to design provable defenses against adversarial attacks. Two examples of provable defenses are SDP relaxation [44, 45] and LP relaxation [31, 54]. The idea of these methods is to construct upper bounds of the adversarial risk that can be efficiently evaluated and optimized. The analyses of these algorithms usually focus on minimizing the training error and do not have generalization guarantees; in contrast, we focus on generalization in this paper.
Other theoretical analysis of adversarial examples
A few other lines of work have conducted theoretical analyses of adversarial examples. Wang et al. [53] analyze the adversarial robustness of the nearest neighbor estimator. Papernot et al. [43] try to demonstrate the unavoidable trade-offs between accuracy in the natural setting and resilience to adversarial attacks, and this trade-off is further studied by Tsipras et al. [52] through some constructive examples of distributions. Fawzi et al. [19] analyze the adversarial robustness of fixed classifiers, in contrast to our generalization analysis. Fawzi et al. [20] construct examples of distributions with large latent variable spaces such that adversarially robust classifiers do not exist; here we argue that these examples may not explain the fact that adversarially perturbed images can usually be recognized by humans. Bubeck et al. [11] try to explain the hardness of learning an adversarially robust model via computational constraints under the statistical query model. Another recent line of work explains the existence of adversarial examples via high-dimensional geometry and concentration of measure [22, 16, 38]. These works provide examples where adversarial examples provably exist as long as the test error of a classifier is non-zero.
Generalization of neural networks
Generalization of neural networks has been an important topic, even in the natural setting where there is no adversary. The key challenge is to understand why deep neural networks can generalize to unseen data despite the high capacity of the model class. The problem has received attention since the early stages of neural network research [7, 2]. Recently, understanding the generalization of deep nets has been raised as an open problem, since traditional techniques such as VC-dimension, Rademacher complexity, and algorithmic stability are observed to produce vacuous generalization bounds [58]. Progress has been made more recently. In particular, it has been shown that when properly normalized by the margin, using Rademacher complexity or PAC-Bayesian analysis, one can obtain generalization bounds that tend to match the experimental results [9, 42, 3, 23]. In addition, in this paper, we show that when the weight vectors or matrices have bounded $\ell_1$ norm, there is no dimension dependence in the adversarial generalization bound. This result is consistent with and related to several previous works [36, 7, 40, 59].

3 Problem Setup

We start with the general statistical learning framework. Let $\mathcal{X}$ and $\mathcal{Y}$ be the feature and label spaces, respectively, and suppose that there is an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. In this paper, we assume that the feature space is a subset of the $d$-dimensional Euclidean space, i.e., $\mathcal{X} \subseteq \mathbb{R}^d$. Let $\mathcal{F} \subseteq \mathcal{V}^{\mathcal{X}}$ be the hypothesis class that we use to make predictions, where $\mathcal{V}$ is another space that might be different from $\mathcal{Y}$. Let $\ell: \mathcal{V} \times \mathcal{Y} \to [0, B]$ be the loss function. Throughout this paper we assume that $\ell$ is bounded, i.e., $B$ is a positive constant. In addition, we introduce the function class $\ell_{\mathcal{F}} \subseteq [0, B]^{\mathcal{X} \times \mathcal{Y}}$ by composing the functions in $\mathcal{F}$ with $\ell(\cdot, y)$, i.e., $\ell_{\mathcal{F}} := \{(\mathbf{x}, y) \mapsto \ell(f(\mathbf{x}), y) : f \in \mathcal{F}\}$.
The goal of the learning problem is to find $f \in \mathcal{F}$ such that the population risk $R(f) := \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}[\ell(f(\mathbf{x}), y)]$ is minimized. We consider the supervised learning setting where one has access to $n$ i.i.d. training examples drawn according to $\mathcal{D}$, denoted by $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$. A learning algorithm maps the $n$ training examples to a hypothesis $f \in \mathcal{F}$. In this paper, we are interested in the gap between the empirical risk $R_n(f) := \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i)$ and the population risk $R(f)$, known as the generalization error.

Rademacher complexity [8] is one of the classic measures of generalization error. Here, we present its formal definition. For any function class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{Z}}$, given a sample $S = \{z_1, z_2, \ldots, z_n\}$ of size $n$, the empirical Rademacher complexity is defined as
$$\mathfrak{R}_S(\mathcal{H}) := \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{h \in \mathcal{H}} \sum_{i=1}^n \sigma_i h(z_i)\right],$$
where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher random variables with $\mathbb{P}\{\sigma_i = 1\} = \mathbb{P}\{\sigma_i = -1\} = \frac{1}{2}$. In our learning problem, denote the training sample by $S$, i.e., $S := \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$. We then have the following theorem, which connects the population and empirical risks via Rademacher complexity.

Theorem 1. [8, 41] Suppose that the range of $\ell(f(\mathbf{x}), y)$ is $[0, B]$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds for all $f \in \mathcal{F}$:
$$R(f) \le R_n(f) + 2B\,\mathfrak{R}_S(\ell_{\mathcal{F}}) + 3B\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

As we can see, Rademacher complexity measures the rate at which the empirical risk converges to the population risk uniformly across $\mathcal{F}$. In fact, according to the anti-symmetrization lower bound by Koltchinskii et al.
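The empirical Rademacher complexity above can be estimated by Monte Carlo over the sign vectors $\boldsymbol{\sigma}$ once the supremum is restricted to a finite grid of hypotheses standing in for $\mathcal{H}$. A small illustrative sketch (the helper name and the toy hypothesis grid are ours):

```python
import numpy as np

def empirical_rademacher(h_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = (1/n) E_sigma sup_h sum_i sigma_i h(z_i).

    h_values: array of shape (num_hypotheses, n); row j holds
    (h_j(z_1), ..., h_j(z_n)) for a finite grid of hypotheses."""
    rng = np.random.default_rng(seed)
    _, n = h_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher signs
        total += np.max(h_values @ sigma)        # sup over the hypothesis grid
    return total / (n_draws * n)

# Two constant hypotheses h = +1 and h = -1 on a sample of size n = 4:
# sup_h sum_i sigma_i h(z_i) = |sum_i sigma_i|, so the estimate
# approximates (1/n) E|sum_i sigma_i| = 1.5/4 = 0.375 here.
est = empirical_rademacher(np.array([[1.0] * 4, [-1.0] * 4]))
```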
[30], one can show that $\mathfrak{R}_S(\ell_{\mathcal{F}}) \lesssim \sup_{f \in \mathcal{F}} R(f) - R_n(f) \lesssim \mathfrak{R}_S(\ell_{\mathcal{F}})$. Therefore, Rademacher complexity is a tight bound for the rate of uniform convergence of a loss function class, and in many settings can be a tight bound for the generalization error.

The above discussion can be extended to the adversarial setting. In this paper, we focus on the $\ell_\infty$ adversarial attack. In this setting, the learning algorithm still has access to $n$ i.i.d. uncorrupted training examples drawn according to $\mathcal{D}$. Once the learning procedure finishes, the output hypothesis $f$ is revealed to an adversary. For any data point $(\mathbf{x}, y)$ drawn according to $\mathcal{D}$, the adversary is allowed to perturb $\mathbf{x}$ within some $\ell_\infty$ ball to maximize the loss. Our goal is to minimize the adversarial population risk, i.e.,
$$\widetilde{R}(f) := \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}}\left[\max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f(\mathbf{x}'), y)\right],$$
and to this end, a natural approach is adversarial training: minimizing the adversarial empirical risk
$$\widetilde{R}_n(f) := \frac{1}{n}\sum_{i=1}^n \max_{\mathbf{x}'_i \in \mathbb{B}^\infty_{\mathbf{x}_i}(\epsilon)} \ell(f(\mathbf{x}'_i), y_i).$$
Let us define the adversarial loss $\widetilde{\ell}(f(\mathbf{x}), y) := \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f(\mathbf{x}'), y)$ and the function class $\widetilde{\ell}_{\mathcal{F}} \subseteq [0, B]^{\mathcal{X} \times \mathcal{Y}}$ as $\widetilde{\ell}_{\mathcal{F}} := \{\widetilde{\ell}(f(\mathbf{x}), y) : f \in \mathcal{F}\}$. Since the range of $\widetilde{\ell}(f(\mathbf{x}), y)$ is still $[0, B]$, we have the following direct corollary of Theorem 1.

Corollary 1.
For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following holds for all $f \in \mathcal{F}$:
$$\widetilde{R}(f) \le \widetilde{R}_n(f) + 2B\,\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}}) + 3B\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

As we can see, the Rademacher complexity of the adversarial loss function class $\widetilde{\ell}_{\mathcal{F}}$, i.e., $\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}})$, is again the key quantity for the generalization ability of the learning problem. A natural problem of interest is to compare the Rademacher complexities in the natural and the adversarial settings, and to obtain generalization bounds for the adversarial loss.

4 Linear Classifiers
We start with binary linear classifiers. In this setting, we define $\mathcal{Y} = \{-1, +1\}$, and let the hypothesis class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ be a set of linear functions of $\mathbf{x} \in \mathcal{X}$. More specifically, we define $f_{\mathbf{w}}(\mathbf{x}) := \langle \mathbf{w}, \mathbf{x} \rangle$, and consider prediction vectors $\mathbf{w}$ with an $\ell_p$ norm constraint ($p \ge 1$), i.e.,
$$\mathcal{F} = \{f_{\mathbf{w}}(\mathbf{x}) : \|\mathbf{w}\|_p \le W\}. \qquad (2)$$
We predict the label with the sign of $f_{\mathbf{w}}(\mathbf{x})$; more specifically, we assume that the loss function $\ell(f_{\mathbf{w}}(\mathbf{x}), y)$ can be written as $\ell(f_{\mathbf{w}}(\mathbf{x}), y) \equiv \phi(y\langle \mathbf{w}, \mathbf{x} \rangle)$, where $\phi: \mathbb{R} \to [0, B]$ is monotonically nonincreasing and $L_\phi$-Lipschitz. In fact, if $\phi(0) \ge 1$, we can obtain a bound on the classification error according to Theorem 1. That is, with probability at least $1 - \delta$, for all $f_{\mathbf{w}} \in \mathcal{F}$,
$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\{\mathrm{sgn}(f_{\mathbf{w}}(\mathbf{x})) \ne y\} \le \frac{1}{n}\sum_{i=1}^n \ell(f_{\mathbf{w}}(\mathbf{x}_i), y_i) + 2B\,\mathfrak{R}_S(\ell_{\mathcal{F}}) + 3B\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$
In addition, recall that according to the Ledoux-Talagrand contraction inequality [35], we have $\mathfrak{R}_S(\ell_{\mathcal{F}}) \le L_\phi \mathfrak{R}_S(\mathcal{F})$. For the adversarial setting, we have
$$\widetilde{\ell}(f_{\mathbf{w}}(\mathbf{x}), y) = \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f_{\mathbf{w}}(\mathbf{x}'), y) = \phi\Big(\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y\langle \mathbf{w}, \mathbf{x}' \rangle\Big).$$
Let us define the following function class $\widetilde{\mathcal{F}} \subseteq \mathbb{R}^{\mathcal{X} \times \{\pm 1\}}$:
$$\widetilde{\mathcal{F}} = \Big\{\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y\langle \mathbf{w}, \mathbf{x}' \rangle : \|\mathbf{w}\|_p \le W\Big\}. \qquad (3)$$
Again, we have $\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}}) \le L_\phi \mathfrak{R}_S(\widetilde{\mathcal{F}})$. The first major contribution of our work is the following theorem, which provides a comparison between $\mathfrak{R}_S(\mathcal{F})$ and $\mathfrak{R}_S(\widetilde{\mathcal{F}})$.

Theorem 2 (Main Result 1). Let $\mathcal{F} := \{f_{\mathbf{w}}(\mathbf{x}) : \|\mathbf{w}\|_p \le W\}$ and $\widetilde{\mathcal{F}} := \{\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y\langle \mathbf{w}, \mathbf{x}' \rangle : \|\mathbf{w}\|_p \le W\}$. Suppose that $\frac{1}{p} + \frac{1}{q} = 1$. Then, there exists a universal constant $c \in (0, 1)$ such that
$$\max\Big\{\mathfrak{R}_S(\mathcal{F}),\; c\epsilon W \frac{d^{1/q}}{\sqrt{n}}\Big\} \le \mathfrak{R}_S(\widetilde{\mathcal{F}}) \le \mathfrak{R}_S(\mathcal{F}) + \epsilon W \frac{d^{1/q}}{\sqrt{n}}.$$

We prove Theorem 2 in Appendix A. We can see that the adversarial Rademacher complexity, i.e., $\mathfrak{R}_S(\widetilde{\mathcal{F}})$, is always at least as large as the Rademacher complexity in the natural setting. This implies that uniform convergence in the adversarial setting is at least as hard as that in the natural setting. In addition, since $\max\{a, b\} \ge \frac{1}{2}(a + b)$, we have
$$\frac{c}{2}\Big(\mathfrak{R}_S(\mathcal{F}) + \epsilon W \frac{d^{1/q}}{\sqrt{n}}\Big) \le \mathfrak{R}_S(\widetilde{\mathcal{F}}) \le \mathfrak{R}_S(\mathcal{F}) + \epsilon W \frac{d^{1/q}}{\sqrt{n}}.$$
Therefore, we have a tight bound for $\mathfrak{R}_S(\widetilde{\mathcal{F}})$ up to a constant factor. Further, if $p > 1$, the adversarial Rademacher complexity has an unavoidable polynomial dimension dependence, i.e., $\mathfrak{R}_S(\widetilde{\mathcal{F}})$ is always at least as large as $\Omega(\epsilon W d^{1/q}/\sqrt{n})$. On the other hand, one can easily show that in the natural setting, $\mathfrak{R}_S(\mathcal{F}) = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}[\|\sum_{i=1}^n \sigma_i \mathbf{x}_i\|_q]$, which implies that $\mathfrak{R}_S(\mathcal{F})$ depends on the distribution of the $\mathbf{x}_i$ and the norm constraint $W$, but does not have an explicit dimension dependence. This means that $\mathfrak{R}_S(\widetilde{\mathcal{F}})$ could be order-wise larger than $\mathfrak{R}_S(\mathcal{F})$, depending on the distribution of the data.
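A key identity behind this analysis is that for a linear function, the inner minimization over the $\ell_\infty$ ball has a closed form: $\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y\langle \mathbf{w}, \mathbf{x}' \rangle = y\langle \mathbf{w}, \mathbf{x} \rangle - \epsilon\|\mathbf{w}\|_1$, which is where the $\ell_1$ norm of $\mathbf{w}$ (and hence, via $\|\mathbf{w}\|_1 \le d^{1/q}\|\mathbf{w}\|_p$, the $d^{1/q}$ factor) enters. A small sketch checking this identity against brute force over the corners of the ball (helper names are ours):

```python
import numpy as np
from itertools import product

def worst_case_margin(w, x, y, eps):
    """Closed form: min over the l_inf ball of y<w, x'> equals
    y<w, x> - eps * ||w||_1."""
    return y * (w @ x) - eps * np.linalg.norm(w, 1)

def worst_case_margin_bruteforce(w, x, y, eps):
    # a linear function on an l_inf ball attains its minimum at a corner
    return min(y * (w @ (x + eps * np.array(s)))
               for s in product([-1.0, 1.0], repeat=len(x)))

w = np.array([0.5, -2.0, 1.0])
x = np.array([1.0, 0.2, -0.3])
closed = worst_case_margin(w, x, +1, 0.1)
brute = worst_case_margin_bruteforce(w, x, +1, 0.1)
```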
An interesting fact is that if we have an $\ell_1$ norm constraint on the prediction vector $\mathbf{w}$, we can avoid the dimension dependence in $\mathfrak{R}_S(\widetilde{\mathcal{F}})$.

We proceed to study multi-class linear classifiers. We start with the standard margin bound framework for multi-class classification. In $K$-class classification problems, we choose $\mathcal{Y} = [K]$, and the functions in the hypothesis class $\mathcal{F}$ map $\mathcal{X}$ to $\mathbb{R}^K$, i.e., $\mathcal{F} \subseteq (\mathbb{R}^K)^{\mathcal{X}}$. Intuitively, the $k$-th coordinate of $f(\mathbf{x})$ is the score that $f$ gives to the $k$-th class, and we make a prediction by choosing the class with the highest score. We define the margin operator $M(\mathbf{z}, y): \mathbb{R}^K \times [K] \to \mathbb{R}$ as $M(\mathbf{z}, y) = z_y - \max_{y' \ne y} z_{y'}$. For a training example $(\mathbf{x}, y)$, a hypothesis $f$ makes a correct prediction if and only if $M(f(\mathbf{x}), y) > 0$. We also define the function class $M_{\mathcal{F}} := \{(\mathbf{x}, y) \mapsto M(f(\mathbf{x}), y) : f \in \mathcal{F}\} \subseteq \mathbb{R}^{\mathcal{X} \times [K]}$. For multi-class classification problems, we consider a particular loss function $\ell(f(\mathbf{x}), y) = \phi_\gamma(M(f(\mathbf{x}), y))$, where $\gamma > 0$ and $\phi_\gamma: \mathbb{R} \to [0, 1]$ is the ramp loss:
$$\phi_\gamma(t) = \begin{cases} 1 & t \le 0 \\ 1 - \frac{t}{\gamma} & 0 < t < \gamma \\ 0 & t \ge \gamma. \end{cases} \qquad (4)$$
One can check that $\ell(f(\mathbf{x}), y)$ satisfies
$$\mathbb{1}\big(y \ne \arg\max_{y' \in [K]} [f(\mathbf{x})]_{y'}\big) \le \ell(f(\mathbf{x}), y) \le \mathbb{1}\big([f(\mathbf{x})]_y \le \gamma + \max_{y' \ne y} [f(\mathbf{x})]_{y'}\big). \qquad (5)$$
Let $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times [K])^n$ be the i.i.d. training examples, and define the function class $\ell_{\mathcal{F}} := \{(\mathbf{x}, y) \mapsto \phi_\gamma(M(f(\mathbf{x}), y)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^{\mathcal{X} \times [K]}$. Since $\phi_\gamma(t) \in [0, 1]$ and $\phi_\gamma(\cdot)$ is $1/\gamma$-Lipschitz, by combining (5) with Theorem 1, we obtain the following direct corollary as the generalization bound in the multi-class setting [41].

Corollary 2.
Consider the above multi-class classification setting. For any fixed $\gamma > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\Big\{y \ne \arg\max_{y' \in [K]} [f(\mathbf{x})]_{y'}\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\big([f(\mathbf{x}_i)]_{y_i} \le \gamma + \max_{y' \ne y_i} [f(\mathbf{x}_i)]_{y'}\big) + 2\,\mathfrak{R}_S(\ell_{\mathcal{F}}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

In the adversarial setting, the adversary tries to maximize the loss $\ell(f(\mathbf{x}), y) = \phi_\gamma(M(f(\mathbf{x}), y))$ over an $\ell_\infty$ ball centered at $\mathbf{x}$. We have the adversarial loss $\widetilde{\ell}(f(\mathbf{x}), y) := \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f(\mathbf{x}'), y)$ and the function class $\widetilde{\ell}_{\mathcal{F}} := \{(\mathbf{x}, y) \mapsto \widetilde{\ell}(f(\mathbf{x}), y) : f \in \mathcal{F}\} \subseteq \mathbb{R}^{\mathcal{X} \times [K]}$. Thus, we have the following generalization bound in the adversarial setting.

Corollary 3.
Consider the above adversarial multi-class classification setting. For any fixed $\gamma > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\Big\{\exists \mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon) \;\mathrm{s.t.}\; y \ne \arg\max_{y' \in [K]} [f(\mathbf{x}')]_{y'}\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\big(\exists \mathbf{x}'_i \in \mathbb{B}^\infty_{\mathbf{x}_i}(\epsilon) \;\mathrm{s.t.}\; [f(\mathbf{x}'_i)]_{y_i} \le \gamma + \max_{y' \ne y_i} [f(\mathbf{x}'_i)]_{y'}\big) + 2\,\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}}) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

We now focus on multi-class linear classifiers. For linear classifiers, each function in the hypothesis class is linearly parametrized by a matrix $W \in \mathbb{R}^{K \times d}$, i.e., $f(\mathbf{x}) \equiv f_W(\mathbf{x}) = W\mathbf{x}$. Let $\mathbf{w}_k \in \mathbb{R}^d$ be the $k$-th column of $W^\top$; then we have $[f_W(\mathbf{x})]_k = \langle \mathbf{w}_k, \mathbf{x} \rangle$. We assume that each $\mathbf{w}_k$ has $\ell_p$ norm ($p \ge 1$) upper bounded by $W$, which implies that $\mathcal{F} = \{f_W(\mathbf{x}) : \|W^\top\|_{p, \infty} \le W\}$. In the natural setting, we have the following margin bound for linear classifiers as a corollary of the multi-class margin bounds by Kuznetsov et al. [33] and Maximov and Reshetova [39].

Theorem 3.
Consider the multi-class linear classifiers in the above setting, and suppose that $\frac{1}{p} + \frac{1}{q} = 1$, $p, q \ge 1$. For any fixed $\gamma > 0$ and $W > 0$, with probability at least $1 - \delta$, for all $W$ such that $\|W^\top\|_{p, \infty} \le W$,
$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\Big\{y \ne \arg\max_{y' \in [K]} \langle \mathbf{w}_{y'}, \mathbf{x} \rangle\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\big(\langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle \le \gamma + \max_{y' \ne y_i} \langle \mathbf{w}_{y'}, \mathbf{x}_i \rangle\big) + \frac{4KW}{\gamma n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\Big\|\sum_{i=1}^n \sigma_i \mathbf{x}_i\Big\|_q\Big] + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$
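The ramp-loss machinery used in Corollary 2 and Theorem 3 is easy to check numerically. A minimal sketch of the margin operator $M$, the ramp loss $\phi_\gamma$, and the sandwich inequality (5) (the helper names and the toy scores are ours; classes are 0-indexed here for convenience):

```python
import numpy as np

def margin(z, y):
    """M(z, y) = z_y - max_{y' != y} z_{y'}; y is a 0-based class index."""
    return z[y] - np.max(np.delete(z, y))

def ramp(t, gamma):
    """Ramp loss phi_gamma: 1 for t <= 0, 1 - t/gamma on (0, gamma), 0 for t >= gamma."""
    return float(np.clip(1.0 - t / gamma, 0.0, 1.0))

z = np.array([2.0, 1.2, -0.5])   # scores for K = 3 classes
y, gamma = 0, 1.0
loss = ramp(margin(z, y), gamma)
# sandwich (5): 1{mistake} <= loss <= 1{z_y <= gamma + max_{y' != y} z_{y'}}
mistake = float(np.argmax(z) != y)
wide = float(z[y] <= gamma + np.max(np.delete(z, y)))
```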
We prove Theorem 3 in Appendix B.1 for completeness. In the adversarial setting, we have the following margin bound.
Theorem 4 (Main Result 2). Consider the multi-class linear classifiers in the adversarial setting, and suppose that $\frac{1}{p} + \frac{1}{q} = 1$, $p, q \ge 1$. For any fixed $\gamma > 0$ and $W > 0$, with probability at least $1 - \delta$, for all $W$ such that $\|W^\top\|_{p, \infty} \le W$,
$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\Big\{\exists \mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon) \;\mathrm{s.t.}\; y \ne \arg\max_{y' \in [K]} \langle \mathbf{w}_{y'}, \mathbf{x}' \rangle\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\big(\langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle \le \gamma + \max_{y' \ne y_i} (\langle \mathbf{w}_{y'}, \mathbf{x}_i \rangle + \epsilon\|\mathbf{w}_{y'} - \mathbf{w}_{y_i}\|_1)\big) + \frac{2WK}{\gamma}\Big[\frac{\epsilon\sqrt{K}\,d^{1/q}}{\sqrt{n}} + \frac{1}{n}\sum_{y=1}^K \mathbb{E}_{\boldsymbol{\sigma}}\Big[\Big\|\sum_{i=1}^n \sigma_i \mathbf{x}_i \mathbb{1}(y_i = y)\Big\|_q\Big]\Big] + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

We prove Theorem 4 in Appendix B.2. As we can see, similar to the binary classification problems, if $p > 1$, the margin bound in the adversarial setting has an explicit polynomial dependence on $d$, whereas in the natural setting, the margin bound does not have dimension dependence. This shows that, at least for the generalization upper bound that we obtain, the dimension dependence in the adversarial setting also exists in multi-class classification problems.

5 Neural Networks

We proceed to consider feedforward neural networks with ReLU activation. Here, each function $f$ in the hypothesis class $\mathcal{F}$ is parametrized by a sequence of matrices $W = (W_1, W_2, \ldots, W_L)$, i.e., $f \equiv f_W$. Assume that $W_h \in \mathbb{R}^{d_h \times d_{h-1}}$, and let $\rho(\cdot)$ be the ReLU function, i.e., $\rho(t) = \max\{t, 0\}$ for $t \in \mathbb{R}$. For vectors, $\rho(\mathbf{x})$ is the vector generated by applying $\rho(\cdot)$ to each coordinate of $\mathbf{x}$, i.e., $[\rho(\mathbf{x})]_i = \rho(x_i)$. We have $f_W(\mathbf{x}) = W_L \rho(W_{L-1} \rho(\cdots \rho(W_1 \mathbf{x}) \cdots))$. For $K$-class classification, we have $d_L = K$, $f_W(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}^K$, and $[f_W(\mathbf{x})]_k$ is the score for the $k$-th class.
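As a concrete reference for the hypothesis class above, here is a minimal numpy sketch of the forward map $f_W(\mathbf{x}) = W_L\rho(\cdots\rho(W_1\mathbf{x})\cdots)$ (the helper names and the toy weights are our own illustrations):

```python
import numpy as np

def relu(t):
    """rho(t) = max{t, 0}, applied coordinate-wise."""
    return np.maximum(t, 0.0)

def forward(weights, x):
    """f_W(x) = W_L rho(W_{L-1} rho(... rho(W_1 x) ...)); no bias terms,
    matching the hypothesis class analyzed in this section."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    return weights[-1] @ z

# toy one-hidden-layer network: d = 3 inputs, 4 hidden units, K = 2 classes
W1 = np.full((4, 3), 0.5)
W2 = np.array([[1.0, -1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, -1.0]])
scores = forward([W1, W2], np.array([1.0, -1.0, 2.0]))
```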
In the special case of binary classification, as discussed in Section 4.1, we can take $\mathcal{Y} = \{+1, -1\}$ and $d_L = 1$, and the loss function can be written as $\ell(f_W(\mathbf{x}), y) = \phi(y f_W(\mathbf{x}))$, where $\phi: \mathbb{R} \to [0, B]$ is monotonically nonincreasing and $L_\phi$-Lipschitz.

We start with a comparison of the Rademacher complexities of neural networks in the natural and adversarial settings. Although naively applying the definition of Rademacher complexity may provide a loose generalization bound [58], when properly normalized by the margin, one can still derive generalization bounds that match experimental observations via Rademacher complexity [9]. Our comparison shows that, when the weight matrices of the neural networks have bounded norms, in the natural setting the Rademacher complexity is upper bounded by a quantity with only logarithmic dependence on the dimension; however, in the adversarial setting, the Rademacher complexity is lower bounded by a quantity with an explicit $\sqrt{d}$ dependence.

We focus on binary classification. For the natural setting, we review the results by Bartlett et al. [9]. Let $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \{-1, +1\})^n$ be the i.i.d. training examples, and define $X := [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and $d_{\max} := \max\{d, d_1, d_2, \ldots, d_L\}$.

Theorem 5. [9] Consider the neural network hypothesis class
$$\mathcal{F} = \{f_W(\mathbf{x}) : W = (W_1, W_2, \ldots, W_L), \|W_h\|_\sigma \le s_h, \|W_h^\top\|_{2,1} \le b_h, h \in [L]\} \subseteq \mathbb{R}^{\mathcal{X}},$$
where $d_0 \equiv d$. Then, we have
$$\mathfrak{R}_S(\mathcal{F}) \le \left(\frac{4}{n^{3/2}} + \frac{26\log(n)\log(2d_{\max})}{n}\right)\|X\|_F\left(\prod_{h=1}^L s_h\right)\left(\sum_{j=1}^L \Big(\frac{b_j}{s_j}\Big)^{2/3}\right)^{3/2}.$$
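The capacity term in Theorem 5 depends on the weights only through the norms $s_h$ and $b_h$. Below is a small sketch that evaluates this term with $s_h$, $b_h$ set to the actual norms of given matrices (the helper name is ours; the $n$-dependent prefactor and $\|X\|_F$ are omitted):

```python
import numpy as np

def spectral_complexity(weights):
    """Capacity term (prod_h s_h) * (sum_j (b_j / s_j)^{2/3})^{3/2} from
    Theorem 5, with s_h = ||W_h||_sigma and b_h = ||W_h^T||_{2,1}."""
    s = [np.linalg.norm(W, ord=2) for W in weights]              # spectral norms
    b = [np.sum(np.linalg.norm(W.T, axis=0)) for W in weights]   # (2,1) norms of W_h^T
    return np.prod(s) * sum((bj / sj) ** (2.0 / 3.0)
                            for sj, bj in zip(s, b)) ** 1.5

W1 = np.eye(3)
W2 = np.array([[1.0, 0.0, 0.0]])
cap = spectral_complexity([W1, W2])
```

Note the invariance this exposes: rescaling every layer by a factor $t$ multiplies the capacity by $t^L$ (here $L = 2$), since the ratios $b_j/s_j$ are scale-free.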
On the other hand, in this work we prove the following result, which shows that when the product of the spectral norms of all the weight matrices is bounded, the Rademacher complexity of the adversarial loss function class is lower bounded by a quantity with an explicit $\sqrt{d}$ factor. More specifically, for binary classification problems, since
$$\widetilde{\ell}(f_W(\mathbf{x}), y) = \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f_W(\mathbf{x}'), y) = \phi\Big(\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y f_W(\mathbf{x}')\Big),$$
and $\phi(\cdot)$ is Lipschitz, we consider the function class
$$\widetilde{\mathcal{F}} = \Big\{(\mathbf{x}, y) \mapsto \min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y f_W(\mathbf{x}') : W = (W_1, W_2, \ldots, W_L),\ \prod_{h=1}^L \|W_h\|_\sigma \le r\Big\} \subseteq \mathbb{R}^{\mathcal{X} \times \{-1, +1\}}. \quad (6)$$
Then we have the following result.

Theorem 6 (Main Result 3). Let $\widetilde{\mathcal{F}}$ be defined as in (6). Then, there exists a universal constant $c > 0$ such that
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) \ge c r \left(\frac{1}{n}\|X\|_F + \epsilon\sqrt{\frac{d}{n}}\right).$$
We prove Theorem 6 in Appendix C.1. This result shows that if we aim to study the Rademacher complexity of the function class defined in (6), a $\sqrt{d}$ dimension dependence may be unavoidable, in contrast to the natural setting, where the dimension dependence is only logarithmic.

For neural networks, even with only one hidden layer, evaluating the adversarial loss $\widetilde{\ell}(f_W(\mathbf{x}), y) = \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \ell(f_W(\mathbf{x}'), y)$ for a particular data point $(\mathbf{x}, y)$ can be hard, since it requires maximizing a non-concave function over a bounded set. A recent line of work tries to find upper bounds for $\widetilde{\ell}(f_W(\mathbf{x}), y)$ that can be computed in polynomial time.
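As a concrete illustration of why the linear case is tractable while general networks are not: for the linear functions used in the lower-bound construction, the inner minimization over the $\ell_\infty$ ball has the closed form $\min_{\mathbf{x}'} y\langle\mathbf{w},\mathbf{x}'\rangle = y\langle\mathbf{w},\mathbf{x}\rangle - \epsilon\|\mathbf{w}\|_1$ (derived in Appendix A). The sketch below, using toy vectors chosen only for illustration, checks the closed form against brute force over the $2^d$ corners of the ball, which suffices because a linear function on a box attains its minimum at a corner:

```python
import itertools

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def adv_linear_min(w, x, y, eps):
    """Closed form: min over the l_inf ball of y<w, x'> = y<w,x> - eps*||w||_1."""
    return y * dot(w, x) - eps * sum(abs(t) for t in w)

def adv_linear_min_bruteforce(w, x, y, eps):
    """A linear function on a box is minimized at a corner, so it suffices
    to enumerate the 2^d corners x + delta, delta in {-eps, +eps}^d."""
    corners = itertools.product((-eps, +eps), repeat=len(x))
    return min(y * dot(w, [xi + di for xi, di in zip(x, delta)])
               for delta in corners)

w, x, eps = [0.5, -2.0, 1.0], [1.0, 0.2, -0.3], 0.1
```

For a ReLU network, by contrast, $y f_W(\mathbf{x}')$ is piecewise linear but non-concave in $\mathbf{x}'$, and no such closed form is available.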
More specifically, we would like to find a surrogate adversarial loss $\widehat{\ell}(f_W(\mathbf{x}), y)$ such that $\widehat{\ell}(f_W(\mathbf{x}), y) \ge \widetilde{\ell}(f_W(\mathbf{x}), y)$ for all $\mathbf{x}$, $y$, $W$. Such surrogate adversarial losses can thus provide certified defenses against adversarial examples, and can be computed efficiently. In addition, the surrogate adversarial loss $\widehat{\ell}(f_W(\mathbf{x}), y)$ should be as tight as possible—close enough to the original adversarial loss $\widetilde{\ell}(f_W(\mathbf{x}), y)$—so that it indeed reflects the robustness of the model against adversarial attacks. Recently, a few approaches to designing surrogate adversarial losses have been developed; SDP relaxation [44, 45] and LP relaxation [31, 54] are two major examples.

In this section, we focus on the SDP relaxation for one-hidden-layer neural networks with ReLU activation proposed by Raghunathan et al. [44]. We prove a generalization bound for the surrogate adversarial loss, and show that this generalization bound has no explicit dimension dependence if the weight matrix of the first layer has bounded $\ell_1$ norm. We consider $K$-class classification problems in this section (i.e., $\mathcal{Y} = [K]$), and start with the definition and properties of the SDP surrogate loss. Since we only have one hidden layer, $f_W(\mathbf{x}) = W_2 \rho(W_1 \mathbf{x})$. Let $\mathbf{w}_{2,k}$ be the $k$-th column of $W_2^\top$. Then, we have the following results according to Raghunathan et al. [44].

Theorem 7 ([44]). For any $(\mathbf{x}, y)$, $W_1$, $W_2$, and $y' \neq y$,
$$\max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \big([f_W(\mathbf{x}')]_{y'} - [f_W(\mathbf{x}')]_y\big) \le [f_W(\mathbf{x})]_{y'} - [f_W(\mathbf{x})]_y + \frac{\epsilon}{4} \max_{P \succeq 0,\ \mathrm{diag}(P) \le 1} \langle Q(\mathbf{w}_{2,y'} - \mathbf{w}_{2,y}, W_1), P \rangle,$$
where $Q(\mathbf{v}, W)$ is defined as
$$Q(\mathbf{v}, W) := \begin{bmatrix} 0 & \mathbf{0}^\top & \mathbf{1}^\top W^\top \mathrm{diag}(\mathbf{v}) \\ \mathbf{0} & 0 & W^\top \mathrm{diag}(\mathbf{v}) \\ \mathrm{diag}(\mathbf{v}) W \mathbf{1} & \mathrm{diag}(\mathbf{v}) W & 0 \end{bmatrix}.$$
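Two properties of $Q(\mathbf{v}, W)$ matter later: it is symmetric, and it is linear in its first argument (the fact used in the proof of Lemma 1 below). A minimal sketch constructing $Q$ from the block layout above and checking both properties—pure-Python nested lists are an illustrative assumption, and solving the inner SDP itself would require an SDP solver, which is not shown:

```python
def build_Q(v, W):
    """Build the symmetric certificate matrix Q(v, W).

    W is m x d (one row per hidden unit), v has length m, and Q is
    (1 + d + m) x (1 + d + m); every entry is linear in v.
    """
    m, d = len(W), len(W[0])
    size = 1 + d + m
    Q = [[0.0] * size for _ in range(size)]
    for j in range(m):
        row_sum = sum(W[j])
        Q[0][1 + d + j] = v[j] * row_sum          # 1^T W^T diag(v)
        Q[1 + d + j][0] = v[j] * row_sum          # diag(v) W 1
        for i in range(d):
            Q[1 + i][1 + d + j] = v[j] * W[j][i]  # W^T diag(v)
            Q[1 + d + j][1 + i] = v[j] * W[j][i]  # diag(v) W
    return Q

W1 = [[1.0, 2.0], [3.0, 4.0]]          # m = 2 hidden units, d = 2
Qa = build_Q([1.0, 2.0], W1)
Qb = build_Q([0.5, -1.0], W1)
Qs = build_Q([1.5, 1.0], W1)           # built from v_a + v_b
```

Linearity in $\mathbf{v}$ means `Qs` equals the entrywise sum of `Qa` and `Qb`.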
(7)
Since we consider multi-class classification problems in this section, we use the ramp loss $\phi_\gamma$ defined in (4) composed with the margin operator as our loss function. Thus, we have $\ell(f_W(\mathbf{x}), y) = \phi_\gamma(M(f_W(\mathbf{x}), y))$ and $\widetilde{\ell}(f_W(\mathbf{x}), y) = \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \phi_\gamma(M(f_W(\mathbf{x}'), y))$. Here, we design a surrogate loss $\widehat{\ell}(f_W(\mathbf{x}), y)$ based on Theorem 7.

Lemma 1.
Define
$$\widehat{\ell}(f_W(\mathbf{x}), y) := \phi_\gamma\Big(M(f_W(\mathbf{x}), y) - \frac{\epsilon}{2} \max_{k \in [K],\, z = \pm 1}\ \max_{P \succeq 0,\ \mathrm{diag}(P) \le 1} \langle z Q(\mathbf{w}_{2,k}, W_1), P \rangle\Big).$$
Then, we have
$$\max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \mathbb{1}\Big(y \neq \arg\max_{y' \in [K]} [f_W(\mathbf{x}')]_{y'}\Big) \le \widehat{\ell}(f_W(\mathbf{x}), y) \le \mathbb{1}\Big(M(f_W(\mathbf{x}), y) - \frac{\epsilon}{2} \max_{k \in [K],\, z = \pm 1}\ \max_{P \succeq 0,\ \mathrm{diag}(P) \le 1} \langle z Q(\mathbf{w}_{2,k}, W_1), P \rangle \le \gamma\Big).$$
We prove Lemma 1 in Appendix C.2. With this surrogate adversarial loss in hand, we can develop the following margin bound for adversarial generalization. In this theorem, we use $X = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and $d_{\max} = \max\{d, d_1, K\}$.

Theorem 8.
Consider the neural network hypothesis class
$$\mathcal{F} = \{f_W(\mathbf{x}) : W = (W_1, W_2),\ \|W_h\|_\sigma \le s_h,\ h = 1, 2,\ \|W_1\|_{1,1} \le b_1,\ \|W_2^\top\|_{2,1} \le b_2\}.$$
Then, for any fixed $\gamma > 0$, with probability at least $1 - \delta$, we have for all $f_W(\cdot) \in \mathcal{F}$,
$$\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\Big\{\exists\, \mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)\ \text{s.t.}\ y \neq \arg\max_{y' \in [K]} [f_W(\mathbf{x}')]_{y'}\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\Big([f_W(\mathbf{x}_i)]_{y_i} \le \gamma + \max_{y' \neq y_i} [f_W(\mathbf{x}_i)]_{y'} + \frac{\epsilon}{2} \max_{k \in [K],\, z = \pm 1}\ \max_{P \succeq 0,\ \mathrm{diag}(P) \le 1} \langle z Q(\mathbf{w}_{2,k}, W_1), P \rangle\Big) + \frac{1}{\gamma}\Bigg[\left(\frac{4}{n^{3/2}} + \frac{60 \log(n) \log(2 d_{\max})}{n}\right) s_1 s_2 \left(\left(\frac{b_1}{s_1}\right)^{2/3} + \left(\frac{b_2}{s_2}\right)^{2/3}\right)^{3/2} \|X\|_F + \frac{2 \epsilon b_1 b_2}{\sqrt{n}}\Bigg] + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$
We prove Theorem 8 in Appendix C.3. Similar to linear classifiers, in the adversarial setting, if we have an $\ell_1$ norm constraint on the matrix $W_1$, then the generalization bound of the surrogate adversarial loss has no explicit dimension dependence.

In this section, we validate our theoretical findings for linear classifiers and neural networks via experiments. Our experiments are implemented with TensorFlow [1] on the MNIST dataset [34]. We validate two theoretical findings for linear classifiers: (i) controlling the $\ell_1$ norm of the model parameters can reduce the adversarial generalization error, and (ii) there is a dimension dependence in adversarial generalization, i.e., adversarially robust generalization is harder when the dimension of the feature space is higher. We train the multi-class linear classifier using the following objective function:
$$\min_W \frac{1}{n} \sum_{i=1}^n \max_{\mathbf{x}'_i \in \mathbb{B}^\infty_{\mathbf{x}_i}(\epsilon)} \ell(f_W(\mathbf{x}'_i), y_i) + \lambda \|W\|_{1,1}, \quad (8)$$
where $\ell(\cdot)$ is the cross-entropy loss and $f_W(\mathbf{x}) \equiv W\mathbf{x}$.
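A hedged sketch of an objective of the form (8): for a binary linear classifier with a hinge-style loss, the inner maximization has the closed form derived in Appendix A (the worst-case $\ell_\infty$ perturbation shifts the margin by $-\epsilon\|\mathbf{w}\|_1$), so adversarial training reduces to subgradient descent on a deterministic objective. The toy data, hinge loss, and step size below are illustrative assumptions; the paper's experiments use multi-class cross-entropy on MNIST.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgn(t):
    return (t > 0) - (t < 0)

def robust_hinge_objective(w, data, eps, lam):
    """(1/n) * sum_i max(0, 1 - (y_i<w,x_i> - eps*||w||_1)) + lam*||w||_1.
    The inner l_inf maximization is already solved in closed form."""
    l1 = sum(abs(t) for t in w)
    loss = sum(max(0.0, 1.0 - (y * dot(w, x) - eps * l1)) for x, y in data)
    return loss / len(data) + lam * l1

def train(data, eps=0.1, lam=0.01, lr=0.1, steps=200):
    """Subgradient descent on the robust objective above."""
    d, n = len(data[0][0]), len(data)
    w = [0.0] * d
    for _ in range(steps):
        l1 = sum(abs(t) for t in w)
        g = [lam * sgn(wj) for wj in w]                  # subgradient of lam*||w||_1
        for x, y in data:
            if 1.0 - (y * dot(w, x) - eps * l1) > 0.0:   # robust hinge active
                for j in range(d):
                    g[j] += (-y * x[j] + eps * sgn(w[j])) / n
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

# Toy linearly separable data (illustrative, not MNIST).
data = [([1.0, 0.0], +1), ([0.9, 0.1], +1), ([-1.0, 0.0], -1), ([-0.8, -0.2], -1)]
w = train(data)
```

Starting from $\mathbf{w} = 0$ (objective value exactly 1), the robust objective decreases as training proceeds.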
Since we focus on the generalization properties, we use a small number of training data so that the generalization gap is more significant. More specifically, in each run of the training algorithm, we randomly sample $n = 1000$ data points. (The implementation of the experiments can be found at https://github.com/dongyin92/adversarially-robust-generalization.) We train $W$ with mini-batch updates, computing adversarial examples on the chosen batch in each iteration. Here, we note that since we consider linear classifiers, the adversarial examples can be computed analytically according to Appendix B.2.

In our first experiment, we vary the values of $\epsilon$ and $\lambda$, and for each $(\epsilon, \lambda)$ pair, we conduct multiple runs of the training algorithm, sampling the training data independently in each run. In Figure 1, we plot the adversarial generalization error as a function of $\epsilon$ and $\lambda$; the error bars show the standard deviation across the runs. As we can see, when $\lambda$ increases, the generalization gap decreases, and thus we conclude that $\ell_1$ regularization is helpful for reducing the adversarial generalization error.

Figure 1:
Linear classifiers: adversarial generalization error vs. $\ell_\infty$ perturbation $\epsilon$ and regularization coefficient $\lambda$.

In our second experiment, we choose $\lambda = 0$ and study the dependence of the adversarial generalization error on the dimension of the feature space. Recall that each data point in the original MNIST dataset is a $28 \times 28$ image, i.e., $d = 784$. We construct two additional image datasets with $d = 196$ (downsampled) and $d = 3136$ (expanded), respectively. To construct the downsampled image, we replace each $2 \times 2$ patch—say, with pixel values $a, b, c, d$—of the original image with a single pixel of value $\sqrt{a^2 + b^2 + c^2 + d^2}$. To construct the expanded image, we replace each pixel—say, with value $a$—of the original image with a $2 \times 2$ patch, with the value of each pixel in the patch being $a/2$. This construction keeps the $\ell_2$ norm of every single image the same across the three datasets, and thus leads to a fair comparison. The adversarial generalization error is plotted in Figure 2; as we can see, when the dimension $d$ increases, the generalization gap also increases.

Figure 2:
Linear classifiers: adversarial generalization error vs. $\ell_\infty$ perturbation $\epsilon$ and dimension of the feature space $d$.

Neural Networks

In this experiment, we validate our theoretical result that $\ell_1$ regularization can reduce the adversarial generalization error on a four-layer ReLU neural network, where the first two layers are convolutional and the last two layers are fully connected. We use adversarial training with the PGD attack [37] to minimize the $\ell_1$-regularized objective (8). We use the whole training set of MNIST, and once the model is obtained, we use the PGD attack to measure the adversarial training and test errors. We present the adversarial generalization errors under the PGD attack in Figure 3. As we can see, the adversarial generalization error decreases as we increase the regularization coefficient $\lambda$; thus, $\ell_1$ regularization indeed reduces the adversarial generalization error under the PGD attack.

Figure 3:
Neural networks: adversarial generalization error vs. regularization coefficient $\lambda$.

We study the adversarially robust generalization properties of linear classifiers and neural networks through the lens of Rademacher complexity. For binary linear classifiers, we prove tight bounds on the adversarial Rademacher complexity, and show that in the adversarial setting, the Rademacher complexity is never smaller than that in the natural setting, and that it has an unavoidable dimension dependence unless the weight vector has bounded $\ell_1$ norm. The results also extend to multi-class linear classifiers. For neural networks, we prove a lower bound on the Rademacher complexity of the adversarial loss function class and show that there is also an unavoidable dimension dependence due to $\ell_\infty$ adversarial attacks. We further consider a surrogate adversarial loss and prove a margin bound for this setting. Our results indicate that having $\ell_1$ norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. Our experimental results validate our theoretical findings.

Acknowledgements
D. Yin is partially supported by the Berkeley DeepDrive Industry Consortium. K. Ramchandran is partially supported by NSF CIF award 1703678. P. Bartlett is partially supported by NSF grant IIS-1619362. The authors would like to thank Justin Gilmer for helpful discussions.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[3] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[4] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
[5] Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. arXiv preprint arXiv:1810.02180, 2018.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[7] Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
[8] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[9] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.
[10] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.
[11] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.
[12] Nicholas Carlini and David Wagner. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311, 2016.
[13] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
[14] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.
[15] Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of evasion adversaries. arXiv preprint arXiv:1806.01471, 2018.
[16] Elvis Dohmatob. Limitations of adversarial robustness: strong no free lunch theorem. arXiv preprint arXiv:1810.04065, 2018.
[17] Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
[18] Farzan Farnia, Jesse M Zhang, and David Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457, 2018.
[19] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, 2016.
[20] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
[21] Justin Gilmer, Ryan P Adams, Ian Goodfellow, David Andersen, and George E Dahl. Motivating the rules of the game for adversarial example research. arXiv preprint arXiv:1807.06732, 2018.
[22] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018.
[23] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
[24] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[25] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 2013.
[26] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
[28] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015.
[29] Justin Khim and Po-Ling Loh. Adversarial risk bounds for binary classification via function transformation. arXiv preprint arXiv:1810.09519, 2018.
[30] Vladimir Koltchinskii et al. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
[31] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
[32] Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. Pages 36–42. IEEE, 2018.
[33] Vitaly Kuznetsov, Mehryar Mohri, and U Syed. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.
[34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[35] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[36] Wee Sun Lee, Peter L Bartlett, and Robert C Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132, 1996.
[37] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[38] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. arXiv preprint arXiv:1809.03063, 2018.
[39] Yu Maximov and Daria Reshetova. Tight risk bounds for multi-class margin classifiers. Pattern Recognition and Image Analysis, 26(4):673–680, 2016.
[40] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
[41] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[42] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
[43] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016.
[44] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
[45] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, 2018.
[46] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
[47] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
[48] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[49] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
[50] Arun Sai Suggala, Adarsh Prasad, Vaishnavh Nagarajan, and Pradeep Ravikumar. On adversarial risk and training. arXiv preprint arXiv:1806.02924, 2018.
[51] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[52] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
[53] Yizhen Wang, Somesh Jha, and Kamalika Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. arXiv preprint arXiv:1706.03922, 2017.
[54] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. arXiv preprint arXiv:1805.12514, 2018.
[55] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.
[56] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and Lasso. In Advances in Neural Information Processing Systems, 2009.
[57] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.
[58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[59] Yuchen Zhang, Jason D Lee, and Michael I Jordan. $\ell_1$-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, 2016.
A Proof of Theorem 2
First, we have
$$\mathfrak{R}_S(\mathcal{F}) := \frac{1}{n} \mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \sum_{i=1}^n \sigma_i \langle \mathbf{w}, \mathbf{x}_i \rangle\Bigg] = \frac{W}{n} \mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n \sigma_i \mathbf{x}_i\Bigg\|_q. \quad (9)$$
We then analyze $\mathfrak{R}_S(\widetilde{\mathcal{F}})$. Define $\widetilde{f}_{\mathbf{w}}(\mathbf{x}, y) := \min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y \langle \mathbf{w}, \mathbf{x}' \rangle$. Then, we have
$$\widetilde{f}_{\mathbf{w}}(\mathbf{x}, y) = \begin{cases} \min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \langle \mathbf{w}, \mathbf{x}' \rangle & y = 1, \\ -\max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \langle \mathbf{w}, \mathbf{x}' \rangle & y = -1. \end{cases}$$
When $y = 1$, we have
$$\widetilde{f}_{\mathbf{w}}(\mathbf{x}, 1) = \min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \sum_{i=1}^d w_i x'_i = \sum_{i=1}^d w_i \big[\mathbb{1}(w_i \ge 0)(x_i - \epsilon) + \mathbb{1}(w_i < 0)(x_i + \epsilon)\big] = \sum_{i=1}^d w_i (x_i - \mathrm{sgn}(w_i)\epsilon) = \langle \mathbf{w}, \mathbf{x} \rangle - \epsilon\|\mathbf{w}\|_1.$$
Similarly, when $y = -1$, we have
$$\widetilde{f}_{\mathbf{w}}(\mathbf{x}, -1) = -\max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \sum_{i=1}^d w_i x'_i = -\sum_{i=1}^d w_i \big[\mathbb{1}(w_i \ge 0)(x_i + \epsilon) + \mathbb{1}(w_i < 0)(x_i - \epsilon)\big] = -\sum_{i=1}^d w_i (x_i + \mathrm{sgn}(w_i)\epsilon) = -\langle \mathbf{w}, \mathbf{x} \rangle - \epsilon\|\mathbf{w}\|_1.$$
Thus, we conclude that $\widetilde{f}_{\mathbf{w}}(\mathbf{x}, y) = y\langle \mathbf{w}, \mathbf{x} \rangle - \epsilon\|\mathbf{w}\|_1$, and therefore
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \sum_{i=1}^n \sigma_i \big(y_i\langle \mathbf{w}, \mathbf{x}_i \rangle - \epsilon\|\mathbf{w}\|_1\big)\Bigg].$$
Define $\mathbf{u} := \sum_{i=1}^n \sigma_i y_i \mathbf{x}_i$ and $v := \epsilon \sum_{i=1}^n \sigma_i$. Then we have
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \langle \mathbf{w}, \mathbf{u} \rangle - v\|\mathbf{w}\|_1\Bigg].$$
Since the supremum of $\langle \mathbf{w}, \mathbf{u} \rangle - v\|\mathbf{w}\|_1$ over $\mathbf{w}$ can only be achieved when $\mathrm{sgn}(w_i) = \mathrm{sgn}(u_i)$, we know that
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \langle \mathbf{w}, \mathbf{u} \rangle - v\langle \mathbf{w}, \mathrm{sgn}(\mathbf{w})\rangle\Bigg] = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \langle \mathbf{w}, \mathbf{u} \rangle - v\langle \mathbf{w}, \mathrm{sgn}(\mathbf{u})\rangle\Bigg] = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\sup_{\|\mathbf{w}\|_p \le W} \langle \mathbf{w}, \mathbf{u} - v\,\mathrm{sgn}(\mathbf{u})\rangle\Bigg] = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\big[\|\mathbf{u} - v\,\mathrm{sgn}(\mathbf{u})\|_q\big] = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n \sigma_i y_i \mathbf{x}_i - \Big(\epsilon\sum_{i=1}^n \sigma_i\Big)\mathrm{sgn}\Big(\sum_{i=1}^n \sigma_i y_i \mathbf{x}_i\Big)\Bigg\|_q. \quad (10)$$
Now we prove an upper bound for $\mathfrak{R}_S(\widetilde{\mathcal{F}})$.
By the triangle inequality, we have
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) \le \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n \sigma_i y_i \mathbf{x}_i\Bigg\|_q + \frac{\epsilon W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\Big(\sum_{i=1}^n\sigma_i\Big)\mathrm{sgn}\Big(\sum_{i=1}^n\sigma_i y_i \mathbf{x}_i\Big)\Bigg\|_q = \mathfrak{R}_S(\mathcal{F}) + \frac{\epsilon W d^{1/q}}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\Big|\sum_{i=1}^n\sigma_i\Big|\Bigg] \le \mathfrak{R}_S(\mathcal{F}) + \frac{\epsilon W d^{1/q}}{\sqrt{n}},$$
where the last step is due to Khintchine's inequality. We then proceed to prove a lower bound for $\mathfrak{R}_S(\widetilde{\mathcal{F}})$. According to (10) and by symmetry (replacing $\boldsymbol{\sigma}$ with $-\boldsymbol{\sigma}$), we know that
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n(-\sigma_i)y_i\mathbf{x}_i - \Big(\epsilon\sum_{i=1}^n(-\sigma_i)\Big)\mathrm{sgn}\Big(\sum_{i=1}^n(-\sigma_i)y_i\mathbf{x}_i\Big)\Bigg\|_q = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n\sigma_i y_i \mathbf{x}_i + \Big(\epsilon\sum_{i=1}^n\sigma_i\Big)\mathrm{sgn}\Big(\sum_{i=1}^n\sigma_i y_i \mathbf{x}_i\Big)\Bigg\|_q. \quad (11)$$
Then, combining (10) and (11) and using the triangle inequality, we have
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) = \frac{W}{2n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\|\mathbf{u} - v\,\mathrm{sgn}(\mathbf{u})\|_q + \|\mathbf{u} + v\,\mathrm{sgn}(\mathbf{u})\|_q\Big] \ge \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n\sigma_i y_i\mathbf{x}_i\Bigg\|_q = \mathfrak{R}_S(\mathcal{F}).$$
(12)
Similarly, we have
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) \ge \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\Big(\epsilon\sum_{i=1}^n\sigma_i\Big)\mathrm{sgn}\Big(\sum_{i=1}^n\sigma_i y_i\mathbf{x}_i\Big)\Bigg\|_q = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\epsilon\Big|\sum_{i=1}^n\sigma_i\Big| \cdot \Bigg\|\mathrm{sgn}\Big(\sum_{i=1}^n\sigma_i y_i \mathbf{x}_i\Big)\Bigg\|_q\Bigg] = \frac{\epsilon W d^{1/q}}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg[\Big|\sum_{i=1}^n\sigma_i\Big|\Bigg].$$
By Khintchine's inequality, we know that there exists a universal constant $c > 0$ such that $\mathbb{E}_{\boldsymbol{\sigma}}[|\sum_{i=1}^n \sigma_i|] \ge c\sqrt{n}$. Therefore, we have $\mathfrak{R}_S(\widetilde{\mathcal{F}}) \ge \frac{c\epsilon W d^{1/q}}{\sqrt{n}}$. Combining with (12), we complete the proof.

B Multi-class Linear Classifiers
B.1 Proof of Theorem 3
According to the multi-class margin bound in [33], for any fixed $\gamma$, with probability at least $1 - \delta$, we have
$$\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\Big\{y \neq \arg\max_{y' \in [K]}[f(\mathbf{x})]_{y'}\Big\} \le \frac{1}{n}\sum_{i=1}^n \mathbb{1}\Big([f(\mathbf{x}_i)]_{y_i} \le \gamma + \max_{y' \neq y_i}[f(\mathbf{x}_i)]_{y'}\Big) + \frac{4K}{\gamma}\mathfrak{R}_S(\Pi_1(\mathcal{F})) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$
where $\Pi_1(\mathcal{F}) \subseteq \mathbb{R}^{\mathcal{X}}$ is defined as $\Pi_1(\mathcal{F}) := \{\mathbf{x} \mapsto [f(\mathbf{x})]_k : f \in \mathcal{F}, k \in [K]\}$. In the special case of linear classifiers $\mathcal{F} = \{f_W(\mathbf{x}) : \|W^\top\|_{p,\infty} \le W\}$, we can see that $\Pi_1(\mathcal{F}) = \{\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x}\rangle : \|\mathbf{w}\|_p \le W\}$. Thus, we have
$$\mathfrak{R}_S(\Pi_1(\mathcal{F})) = \frac{W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Bigg\|\sum_{i=1}^n \sigma_i \mathbf{x}_i\Bigg\|_q,$$
which completes the proof.

B.2 Proof of Theorem 4
The loss function in the adversarial setting is
$$\widetilde{\ell}(f_W(\mathbf{x}), y) = \max_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} \phi_\gamma(M(f_W(\mathbf{x}'), y)) = \phi_\gamma\Big(\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} M(f_W(\mathbf{x}'), y)\Big).$$
Since we consider linear classifiers, we have
$$\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} M(f_W(\mathbf{x}'), y) = \min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)}\min_{y' \neq y}(\mathbf{w}_y - \mathbf{w}_{y'})^\top\mathbf{x}' = \min_{y' \neq y}\min_{\mathbf{x}' \in \mathbb{B}^\infty_{\mathbf{x}}(\epsilon)}(\mathbf{w}_y - \mathbf{w}_{y'})^\top\mathbf{x}' = \min_{y' \neq y}(\mathbf{w}_y - \mathbf{w}_{y'})^\top\mathbf{x} - \epsilon\|\mathbf{w}_y - \mathbf{w}_{y'}\|_1. \quad (13)$$
Define $h^{(k)}_W(\mathbf{x}, y) := (\mathbf{w}_y - \mathbf{w}_k)^\top\mathbf{x} - \epsilon\|\mathbf{w}_y - \mathbf{w}_k\|_1 + \gamma\,\mathbb{1}(y = k)$. We now show that
$$\widetilde{\ell}(f_W(\mathbf{x}), y) = \max_{k \in [K]} \phi_\gamma\big(h^{(k)}_W(\mathbf{x}, y)\big). \quad (14)$$
To see this, note that according to (13), $\min_{\mathbf{x}'} M(f_W(\mathbf{x}'), y) = \min_{k \neq y} h^{(k)}_W(\mathbf{x}, y)$. If $\min_{k \neq y} h^{(k)}_W(\mathbf{x}, y) \le \gamma$, we have $\min_{k \neq y} h^{(k)}_W(\mathbf{x}, y) = \min_{k \in [K]} h^{(k)}_W(\mathbf{x}, y)$, since $h^{(y)}_W(\mathbf{x}, y) = \gamma$. On the other hand, if $\min_{k \neq y} h^{(k)}_W(\mathbf{x}, y) > \gamma$, then $\min_{k \in [K]} h^{(k)}_W(\mathbf{x}, y) = \gamma$; in this case, we have $\phi_\gamma(\min_{k \neq y} h^{(k)}_W(\mathbf{x}, y)) = \phi_\gamma(\min_{k \in [K]} h^{(k)}_W(\mathbf{x}, y)) = 0$. Therefore, we can see that (14) holds.

Define the $K$ function classes $\mathcal{F}_k := \{h^{(k)}_W(\mathbf{x}, y) : \|W^\top\|_{p,\infty} \le W\} \subseteq \mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$. Since $\phi_\gamma(\cdot)$ is $1/\gamma$-Lipschitz, according to the Ledoux–Talagrand contraction inequality [35] and Lemma 8.1 in [41], we have
$$\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}}) \le \frac{1}{\gamma}\sum_{k=1}^K \mathfrak{R}_S(\mathcal{F}_k). \quad (15)$$
We proceed to analyze $\mathfrak{R}_S(\mathcal{F}_k)$. The basic idea is similar to the proof of Theorem 2. We define $\mathbf{u}_y = \sum_{i=1}^n \sigma_i\mathbf{x}_i\mathbb{1}(y_i = y)$ and $v_y = \sum_{i=1}^n \sigma_i\mathbb{1}(y_i = y)$.
Then, we have
$$\mathfrak{R}_S(\mathcal{F}_k) = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\sup_{\|W^\top\|_{p,\infty}\le W}\sum_{i=1}^n \sigma_i\big((\mathbf{w}_{y_i}-\mathbf{w}_k)^\top\mathbf{x}_i - \epsilon\|\mathbf{w}_{y_i}-\mathbf{w}_k\|_1 + \gamma\mathbb{1}(y_i=k)\big)\Big]$$
$$= \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\sup_{\|W^\top\|_{p,\infty}\le W}\sum_{i=1}^n\sum_{y=1}^K \sigma_i\big((\mathbf{w}_{y}-\mathbf{w}_k)^\top\mathbf{x}_i - \epsilon\|\mathbf{w}_{y}-\mathbf{w}_k\|_1 + \gamma\mathbb{1}(y_i=k)\big)\mathbb{1}(y_i=y)\Big]$$
$$= \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\gamma\sum_{i=1}^n \sigma_i\mathbb{1}(y_i=k) + \sup_{\|W^\top\|_{p,\infty}\le W}\sum_{y\neq k}\big(\langle\mathbf{w}_y-\mathbf{w}_k,\, \mathbf{u}_y\rangle - \epsilon v_y\|\mathbf{w}_y-\mathbf{w}_k\|_1\big)\Big]$$
$$\le \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\sum_{y\neq k}\ \sup_{\|\mathbf{w}_k\|_p,\, \|\mathbf{w}_y\|_p \le W}\big(\langle\mathbf{w}_y-\mathbf{w}_k,\, \mathbf{u}_y\rangle - \epsilon v_y\|\mathbf{w}_y-\mathbf{w}_k\|_1\big)\Big] = \frac{1}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\sum_{y\neq k}\ \sup_{\|\mathbf{w}\|_p \le 2W}\big(\langle\mathbf{w},\, \mathbf{u}_y\rangle - \epsilon v_y\|\mathbf{w}\|_1\big)\Big] = \frac{2W}{n}\mathbb{E}_{\boldsymbol{\sigma}}\Big[\sum_{y\neq k}\|\mathbf{u}_y - \epsilon v_y\,\mathrm{sgn}(\mathbf{u}_y)\|_q\Big],$$
where the last equality is due to the same derivation as in the proof of Theorem 2. Let $n_y = \sum_{i=1}^n\mathbb{1}(y_i=y)$. Then, we apply the triangle inequality and Khintchine's inequality and obtain
$$\mathfrak{R}_S(\mathcal{F}_k) \le \frac{2W}{n}\sum_{y\neq k}\big(\mathbb{E}_{\boldsymbol{\sigma}}[\|\mathbf{u}_y\|_q] + \epsilon d^{1/q}\sqrt{n_y}\big).$$
Combining with (15), we obtain
$$\mathfrak{R}_S(\widetilde{\ell}_{\mathcal{F}}) \le \frac{2WK}{\gamma n}\sum_{y=1}^K\big(\mathbb{E}_{\boldsymbol{\sigma}}[\|\mathbf{u}_y\|_q] + \epsilon d^{1/q}\sqrt{n_y}\big) \le \frac{2WK}{\gamma}\Bigg[\frac{\epsilon\sqrt{K}\,d^{1/q}}{\sqrt{n}} + \frac{1}{n}\sum_{y=1}^K\mathbb{E}_{\boldsymbol{\sigma}}[\|\mathbf{u}_y\|_q]\Bigg],$$
where the last step is due to the Cauchy–Schwarz inequality ($\sum_{y=1}^K\sqrt{n_y} \le \sqrt{Kn}$).

C Neural Networks
C.1 Proof of Theorem 6
We first review a Rademacher complexity lower bound in [9].
Lemma 2 ([9]). Define the function class
$$\widehat{\mathcal{F}} = \Big\{\mathbf{x}\mapsto f_W(\mathbf{x}) : W=(W_1, W_2, \ldots, W_L),\ \prod_{h=1}^L\|W_h\|_\sigma \le r\Big\},$$
and $\widehat{\mathcal{F}}' = \{\mathbf{x}\mapsto\langle\mathbf{w},\mathbf{x}\rangle : \|\mathbf{w}\|_2 \le r\}$. Then we have $\widehat{\mathcal{F}}' \subseteq \widehat{\mathcal{F}}$, and thus there exists a universal constant $c>0$ such that $\mathfrak{R}_S(\widehat{\mathcal{F}}) \ge \frac{cr}{n}\|X\|_F$.

According to Lemma 2, in the adversarial setting, by defining
$$\widetilde{\mathcal{F}}' = \Big\{(\mathbf{x},y)\mapsto\min_{\mathbf{x}'\in\mathbb{B}^\infty_{\mathbf{x}}(\epsilon)} y\langle\mathbf{w},\mathbf{x}'\rangle : \|\mathbf{w}\|_2 \le r\Big\} \subseteq \mathbb{R}^{\mathcal{X}\times\{-1,+1\}},$$
we have $\widetilde{\mathcal{F}}' \subseteq \widetilde{\mathcal{F}}$. Therefore, there exists a universal constant $c>0$ such that
$$\mathfrak{R}_S(\widetilde{\mathcal{F}}) \ge \mathfrak{R}_S(\widetilde{\mathcal{F}}') \ge cr\Big(\frac{1}{n}\|X\|_F + \epsilon\sqrt{\frac{d}{n}}\Big),$$
where the last inequality is due to Theorem 2.

C.2 Proof of Lemma 1

Since $Q(\cdot,\cdot)$ is a linear function of its first argument, we have, for any $y, y' \in [K]$,
$$\max_{P\succeq 0,\ \mathrm{diag}(P)\le 1}\langle Q(\mathbf{w}_{2,y'}-\mathbf{w}_{2,y}, W_1), P\rangle \le \max_{P\succeq 0,\ \mathrm{diag}(P)\le 1}\langle Q(\mathbf{w}_{2,y'}, W_1), P\rangle + \max_{P\succeq 0,\ \mathrm{diag}(P)\le 1}\langle -Q(\mathbf{w}_{2,y}, W_1), P\rangle \le 2\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\ \mathrm{diag}(P)\le 1}\langle zQ(\mathbf{w}_{2,k}, W_1), P\rangle.$$
Then, for any $(x,y)$, we have
\begin{align*}
\max_{x'\in\mathbb{B}_\infty^{x}(\epsilon)}\mathbb{1}\big(y\neq\arg\max_{y'\in[K]}[f_W(x')]_{y'}\big)
&\le\phi_\gamma\Big(\min_{x'\in\mathbb{B}_\infty^{x}(\epsilon)}M(f_W(x'),y)\Big)\\
&\le\phi_\gamma\Big(\min_{y'\neq y}\ \min_{x'\in\mathbb{B}_\infty^{x}(\epsilon)}[f_W(x')]_y-[f_W(x')]_{y'}\Big)\\
&\le\phi_\gamma\Big(\min_{y'\neq y}[f_W(x)]_y-[f_W(x)]_{y'}-\frac{\epsilon}{4}\max_{y'\neq y}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle Q(w_{2,y'}-w_{2,y},W_1),P\rangle\Big)\\
&\le\phi_\gamma\Big(\min_{y'\neq y}[f_W(x)]_y-[f_W(x)]_{y'}-\frac{\epsilon}{2}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\Big)\\
&\le\phi_\gamma\Big(M(f_W(x),y)-\frac{\epsilon}{2}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\Big)=:\widehat{\ell}(f_W(x),y)\\
&\le\mathbb{1}\Big(M(f_W(x),y)-\frac{\epsilon}{2}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\le\gamma\Big),
\end{align*}
where the first inequality is due to the property of the ramp loss, the second inequality is by the definition of the margin, the third inequality is due to Theorem 7, the fourth inequality is due to (16), the fifth inequality is by the definition of the margin, and the last inequality is due to the property of the ramp loss.

C.3 Proof of Theorem 8
We study the Rademacher complexity of the function class $\widehat{\ell}_{\mathcal{F}}:=\{(x,y)\mapsto\widehat{\ell}(f_W(x),y):f_W\in\mathcal{F}\}$. Define $M_{\mathcal{F}}:=\{(x,y)\mapsto M(f_W(x),y):f_W\in\mathcal{F}\}$. Then we have
\[
\mathfrak{R}_S(\widehat{\ell}_{\mathcal{F}})\le\frac{1}{\gamma}\Big(\mathfrak{R}_S(M_{\mathcal{F}})+\frac{\epsilon}{2n}\mathbb{E}_{\sigma}\Big[\sup_{f_W\in\mathcal{F}}\sum_{i=1}^{n}\sigma_i\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\Big]\Big),\tag{17}
\]
where we use the Ledoux-Talagrand contraction inequality and the convexity of the supremum operation. For the first term, since we have $\|W_h\|_{1,1}\le b_h$, we have $\|W_h^\top\|_{2,1}\le b_h$. Then, we can apply the Rademacher complexity bound in [9] and obtain
\[
\mathfrak{R}_S(M_{\mathcal{F}})\le\Big(\frac{4}{n^{3/2}}+\frac{60\log(n)\log(2d_{\max})}{n}\Big)s_1 s_2\Big(\Big(\frac{b_1}{s_1}\Big)^{2/3}+\Big(\frac{b_2}{s_2}\Big)^{2/3}\Big)^{3/2}\|X\|_F.\tag{18}
\]
Now consider the second term in (17). According to [44], we always have
\[
\max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\ge 0.\tag{19}
\]
In addition, we know that when $P\succeq 0$ and $\mathrm{diag}(P)\le 1$, we have
\[
\|P\|_\infty\le 1,\tag{20}
\]
since $|P_{ij}|\le\sqrt{P_{ii}P_{jj}}\le 1$ for any positive semidefinite matrix $P$. Moreover, we have
\[
\|W_2\|_\infty\le\|W_2^\top\|_{1,1}\le b_2.\tag{21}
\]
Then, we obtain
\begin{align}
\frac{\epsilon}{2n}\mathbb{E}_{\sigma}\Big[\sup_{f_W\in\mathcal{F}}\sum_{i=1}^{n}\sigma_i\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\Big]
&\le\frac{\epsilon}{2n}\Big(\sup_{f_W\in\mathcal{F}}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\Big)\mathbb{E}_{\sigma}\Big[\Big|\sum_{i=1}^{n}\sigma_i\Big|\Big]\nonumber\\
&\le\frac{\epsilon}{2\sqrt{n}}\sup_{f_W\in\mathcal{F}}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\langle zQ(w_{2,k},W_1),P\rangle\nonumber\\
&\le\frac{\epsilon}{2\sqrt{n}}\sup_{f_W\in\mathcal{F}}\max_{k\in[K],\,z=\pm 1}\ \max_{P\succeq 0,\,\mathrm{diag}(P)\le 1}\|zQ(w_{2,k},W_1)\|_{1,1}\|P\|_\infty\nonumber\\
&\le\frac{\epsilon}{\sqrt{n}}\sup_{f_W\in\mathcal{F}}\max_{k\in[K]}\|W_1^\top\mathrm{diag}(w_{2,k})\|_{1,1}\nonumber\\
&\le\frac{\epsilon}{\sqrt{n}}\sup_{f_W\in\mathcal{F}}\|W_1\|_{1,1}\|W_2\|_\infty\le\frac{\epsilon b_1 b_2}{\sqrt{n}},\tag{22}
\end{align}
where the first inequality is due to (19), the second inequality is due to Khintchine's inequality, the third inequality is due to Hölder's inequality, the fourth inequality is due to the definition of $Q(\cdot,\cdot)$ and (20), the fifth inequality is a direct upper bound, and the last inequality is due to (21). Now we can combine (18) and (22) to obtain an upper bound for $\mathfrak{R}_S(\widehat{\ell}_{\mathcal{F}})$.
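The elementary facts behind (20) and the norm steps of (22) are easy to confirm numerically. A quick sketch (illustrative only, not part of the paper) checks the entrywise bound on feasible $P$, the Khintchine step $\mathbb{E}_\sigma|\sum_i\sigma_i|\le\sqrt{n}$, and the row-scaling bound used in the fifth inequality of (22):

```python
import numpy as np

rng = np.random.default_rng(0)

# (20): P PSD with diag(P) <= 1 implies |P_ij| <= sqrt(P_ii P_jj) <= 1.
A = rng.normal(size=(8, 8))
P = A @ A.T
P /= np.diag(P).max()          # rescale: P stays PSD and diag(P) <= 1
assert np.abs(P).max() <= 1 + 1e-9

# Khintchine step: E_sigma |sum_i sigma_i| <= sqrt(n) (Monte Carlo estimate).
n, trials = 100, 5000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))
assert np.abs(sigma.sum(axis=1)).mean() <= np.sqrt(n)

# Fifth inequality of (22): ||W1^T diag(v)||_{1,1} <= ||v||_inf * ||W1||_{1,1},
# where v plays the role of a row w_{2,k} of W2.
W1 = rng.normal(size=(6, 4))
v = rng.normal(size=6)
lhs = np.abs(W1.T @ np.diag(v)).sum()
rhs = np.abs(v).max() * np.abs(W1).sum()
assert lhs <= rhs + 1e-9
```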