Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
Zhiyuan Li∗  Yi Zhang∗  Sanjeev Arora∗†
Princeton University∗ & Institute for Advanced Study†
{zhiyuanli,y.zhang,arora}@cs.princeton.edu

ABSTRACT
Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of "better inductive bias." However, this has not been made mathematically rigorous, and the hurdle is that the fully-connected net can always simulate the convolutional net (for a fixed task). Thus the training algorithm plays a role. The current work describes a natural task on which a provable sample complexity gap can be shown, for standard training algorithms. We construct a single natural distribution on $\mathbb{R}^d \times \{\pm 1\}$ on which any orthogonal-invariant algorithm (i.e., fully-connected networks trained with most gradient-based methods from Gaussian initialization) requires $\Omega(d^2)$ samples to generalize, while $O(1)$ samples suffice for convolutional architectures. Furthermore, we demonstrate a single target function, learning which on all possible distributions leads to an $O(1)$ vs $\Omega(d^2/\varepsilon)$ gap. The proof relies on the fact that SGD on fully-connected networks is orthogonal equivariant. Similar results are achieved for $\ell_2$ regression and adaptive training algorithms, e.g., Adam and AdaGrad, which are only permutation equivariant.

1 INTRODUCTION
Deep convolutional nets ("ConvNets") are at the center of the deep learning revolution (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017). For many tasks, especially in vision, convolutional architectures perform significantly better than their fully-connected ("FC") counterparts, at least given the same amount of training data. Practitioners explain this phenomenon at an intuitive level by pointing out that convolutional architectures have better "inductive bias", which intuitively means the following: (i) ConvNets are a better match to the underlying structure of image data, and thus are able to achieve low training loss with far fewer parameters; (ii) models with a smaller total number of parameters generalize better.

Surprisingly, the above intuition about the better inductive bias of ConvNets over FC nets has never been made mathematically rigorous. The natural way to make it rigorous would be to show explicit learning tasks that require far more training samples on FC nets than on ConvNets. (Here "task" means, as usual in learning theory, a distribution on datapoints, and binary labels for them generated using a fixed labeling function.) Surprisingly, the standard repertoire of lower bound techniques in ML theory does not seem capable of demonstrating such a separation. The reason is that any ConvNet can be simulated by an FC net of sufficient width, since a training algorithm can just zero out unneeded connections and do weight sharing as needed. Thus the key issue is not expressiveness per se, but the combination of architecture plus the training algorithm. But if the training algorithm must be accounted for, the usual hurdle arises that we lack good mathematical understanding of the dynamics of deep net training (whether FC or ConvNet). How then can one establish the limitations of "FC nets + current training algorithms"? (Indeed, many lower bound techniques in PAC learning theory are information theoretic and ignore the training algorithm.)

The current paper makes significant progress on the above problem by exhibiting simple tasks that require an $\Omega(d^2)$ factor more training samples for FC nets than for ConvNets, where $d$ is the data dimension. (In fact this is shown even for 1-dimensional ConvNets; the lower bound easily extends to 2-D ConvNets.) The lower bound holds for FC nets trained with any of the popular algorithms listed in Table 1. (The reader can concretely think of vanilla SGD with Gaussian initialization of network weights, though the proof allows use of momentum, BatchNorm (Ioffe & Szegedy, 2015), $\ell_2$ regularization, and various learning rate schedules.)

Figure 1: Illustration of generalization performance of convolutional versus fully-connected models trained by SGD (test accuracy on Gaussian data, left, and CIFAR-10 data, right). Here the input data are 32 × 32 three-channel images and the binary label indicates for each image whether the first channel has larger $\ell_2$ norm than the second one. The input images are drawn from entry-wise independent Gaussian (left) and CIFAR-10 (right). In both cases, the convolutional net consists of a 3 × 3 convolution with one hidden channel + quadratic activation + 1 × 1 convolution with a single output channel + global average pooling, and the fully-connected net consists of a fully-connected layer with 3072 hidden nodes + quadratic activation + a fully-connected layer with a single node.
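The following is a small PyTorch sketch, not the authors' code, of the comparison described in the caption on the synthetic Gaussian variant of the task. The 3 × 3 and 1 × 1 kernel sizes, the $\ell_2$-norm labeling rule, and all hyperparameters here are illustrative assumptions rather than the exact settings behind Figure 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    # 3x3 conv (one hidden channel) -> quadratic activation -> 1x1 conv -> global average pooling
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 1, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, x):
        h = self.conv1(x) ** 2
        h = self.conv2(h)
        return h.mean(dim=(2, 3)).squeeze(1)

class FCNet(nn.Module):
    # fully-connected layer (3072 hidden units) -> quadratic activation -> fully-connected output
    def __init__(self, d=32):
        super().__init__()
        self.fc1 = nn.Linear(3 * d * d, 3072)
        self.fc2 = nn.Linear(3072, 1)

    def forward(self, x):
        h = self.fc1(x.flatten(1)) ** 2
        return self.fc2(h).squeeze(1)

def sample_batch(n, d=32):
    # Gaussian "images"; label +1 iff channel 0 has larger l2 norm than channel 1
    x = torch.randn(n, 3, d, d)
    y = (x[:, 0].pow(2).sum((1, 2)) > x[:, 1].pow(2).sum((1, 2))).float() * 2 - 1
    return x, y

def train_and_test(model, n_train=2000, steps=500, lr=1e-3):
    x, y = sample_batch(n_train)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.softplus(-y * model(x)).mean()   # logistic loss on the margin
        loss.backward()
        opt.step()
    xt, yt = sample_batch(10000)
    return (torch.sign(model(xt)) == yt).float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    print("conv test acc:", train_and_test(ConvNet()))
    print("fc   test acc:", train_and_test(FCNet()))
```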
Our proof relies on the fact that these popular algorithms lead to an orthogonal-equivariance property of the trained FC nets, which says that at the end of training the FC net, no matter how deep or how wide, will make the same predictions even if we apply an orthogonal transformation to all datapoints (i.e., both training and test). This notion is inspired by Ng (2004) (where it is named "orthogonal invariant"), which showed the power of logistic regression with $\ell_1$ regularization versus other learners. For a variety of learners (including kernels and FC nets) that paper described explicit tasks where the learner has $\Omega(d)$ higher sample complexity than logistic regression with $\ell_1$ regularization. The lower bound example and technique can also be extended to show a (weak) separation between FC nets and ConvNets (see Section 4.2). Our separation is quantitatively stronger than the result one gets using Ng (2004) because the sample complexity gap is $\Omega(d^2)$ vs $O(1)$, and not $\Omega(d)$ vs $O(1)$. But in a more subtle way our result is conceptually far stronger: the technique of Ng (2004) seems incapable of exhibiting a sample gap of more than $O(1)$ between ConvNets and FC nets in our framework. The reason is that the technique of Ng (2004) can exhibit a hard task for FC nets only after fixing the training algorithm. But there are infinitely many training algorithms once we account for hyperparameters associated in various epochs with LR schedules, $\ell_2$ regularizer and momentum. Thus Ng (2004)'s technique cannot exclude the possibility that the hard task for "FC net + Algorithm 1" is easy for "FC net + Algorithm 2". Note that we do not claim any issues with the results claimed in Ng (2004); merely that the technique cannot lead to a proper separation between ConvNets and FC nets, when the FC nets are allowed to be trained with any of the infinitely many training algorithms. (Section 4.2 spells out in more detail the technical difference between our technique and Ng's idea.)

The reader may now be wondering what is the single task that is easy for ConvNets but hard for FC nets trained with any standard algorithm? A simple example is the following: the data distribution in $\mathbb{R}^d$ is standard Gaussian, and the target labeling function is the sign of $\sum_{i=1}^{d/2} x_i^2 - \sum_{i=d/2+1}^{d} x_i^2$. Figure 1 shows that this task is indeed much more difficult for FC nets. Furthermore, the task is also hard in practice for data distributions other than Gaussian; the figure shows that a sizeable performance gap exists even on CIFAR images with such a target label.
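To make the example concrete, here is a minimal NumPy sketch (our illustration, not an experiment from the paper) that samples this task and verifies that a two-layer ConvNet of the form defined later in Section 3.1, with a size-1 filter, quadratic activation, patch-wise average pooling, and hand-set second-layer weights (+1, -1), represents the target labeling function exactly. This expressiveness fact is what drives the $O(1)$-sample upper bounds for ConvNets.

```python
import numpy as np

def sample_task(n, d, rng):
    # data distribution: standard Gaussian on R^d
    x = rng.standard_normal((n, d))
    # target: sign of sum_{i <= d/2} x_i^2 - sum_{i > d/2} x_i^2
    y = np.sign((x[:, : d // 2] ** 2).sum(1) - (x[:, d // 2:] ** 2).sum(1))
    return x, y

def convnet_predict(x, w, a, b):
    """Two-layer ConvNet (Section 3.1 form) with filter size k = 1 and r = 2 patches:
    convolve with w, square, average-pool each half, combine with weights a, add bias b."""
    d = x.shape[1]
    conv = w[0] * x                              # size-1 filter acts as a scaling
    patches = conv.reshape(x.shape[0], 2, d // 2)
    pooled = (patches ** 2).mean(axis=2)         # quadratic activation + average pooling
    return np.sign(pooled @ a + b)

rng = np.random.default_rng(0)
d = 20
x, y = sample_task(1000, d, rng)
# hand-picked weights realizing the target exactly
y_hat = convnet_predict(x, w=np.array([1.0]), a=np.array([1.0, -1.0]), b=0.0)
print("agreement with target:", (y_hat == y).mean())   # prints 1.0
```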
Extension to broader classes of algorithms. The orthogonal-equivariance property holds for many types of practical training algorithms, but not all. Notable exceptions are adaptive gradient methods (e.g., Adam and AdaGrad), the $\ell_1$ regularizer, and initialization methods that are not spherically symmetric. To prove a lower bound against FC nets with these algorithms, we identify a weaker property, permutation equivariance, which is satisfied by nets trained using such algorithms. We then demonstrate a single and natural task on $\mathbb{R}^d \times \{\pm 1\}$ that resembles real-life image texture classification, on which we prove any permutation-equivariant learning algorithm requires $\Omega(d)$ training examples to generalize, while Empirical Risk Minimization with $O(1)$ examples can learn a convolutional net.

Paper structure. In Section 2 we discuss related work. In Section 3, we define the notation and terminology. In Section 4, we give two warm-up examples and an overview of the proof technique for the main theorem. In Section 5, we present our main results on the lower bounds for orthogonal and permutation equivariant algorithms.
2 RELATED WORKS
Du et al. (2018) attempted to investigate the reason why convolutional nets are more sample efficient. Specifically, they prove that $O(1)$ samples suffice for learning a convolutional filter and also prove an $\Omega(d)$ min-max lower bound for learning the class of linear classifiers. Their lower bound is against learning a class of distributions, and their work fails to serve as a sample complexity separation, because their upper and lower bounds are proved on different classes of tasks.

Arjevani & Shamir (2016) also considered the notion of distribution-specific hardness of learning neural nets. They focused on proving running time complexity lower bounds against so-called "orthogonally invariant" and "linearly invariant" algorithms. However, here we focus on sample complexity.

Recently, there has been progress in showing lower bounds against learning with kernels. Wei et al. (2019) constructed a single task on which they proved a sample complexity separation between learning with neural networks vs. with neural tangent kernels. Notably, the lower bound is specific to neural tangent kernels (Jacot et al., 2018). Relatedly, Allen-Zhu & Li (2019) showed a sample complexity lower bound against all kernels for a family of tasks, i.e., learning $k$-XOR on the hypercube.

3 NOTATION AND PRELIMINARIES
We use $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{-1, 1\}$ to denote the domain of the data and labels, and $\mathcal{H} = \{h \mid h: \mathcal{X} \to \mathcal{Y}\}$ to denote the hypothesis class. Formally, given a joint distribution $P$, the error of a hypothesis $h \in \mathcal{H}$ is defined as $\mathrm{err}_P(h) := \Pr_{(\mathbf{x}, y) \sim P}[h(\mathbf{x}) \neq y]$. If $h$ is a random hypothesis, we define $\mathrm{err}_P(h) := \Pr_{(\mathbf{x}, y) \sim P, h}[h(\mathbf{x}) \neq y]$ for convenience. A class of joint distributions supported on $\mathcal{X} \times \mathcal{Y}$ is referred to as a problem, $\mathcal{P}$.

We use $\|\cdot\|_2$ to denote the spectral norm and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We use $A \preceq B$ to denote that $B - A$ is a positive semi-definite matrix. We also use $O(d)$ and $GL(d)$ to denote the $d$-dimensional orthogonal group and general linear group respectively. We use $B_p^{d \times d}$ to denote the unit Schatten-$p$ norm ball in $\mathbb{R}^{d \times d}$. We use $N(\mu, \Sigma)$ to denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For random variables $X$ and $Y$, we write $X \overset{d}{=} Y$ when $X$ is equal to $Y$ in distribution. In this work, we always use $P_\mathcal{X}$ to denote distributions on $\mathcal{X}$ and $P$ to denote distributions supported jointly on $\mathcal{X} \times \mathcal{Y}$.

Given an input distribution $P_\mathcal{X}$ and a hypothesis $h$, we define $P_\mathcal{X} \diamond h$ as the joint distribution on $\mathcal{X} \times \mathcal{Y}$ such that $(P_\mathcal{X} \diamond h)(S) = P_\mathcal{X}(\{\mathbf{x} \mid (\mathbf{x}, h(\mathbf{x})) \in S\})$ for all $S \subseteq \mathcal{X} \times \mathcal{Y}$. In other words, to sample $(X, Y) \sim P_\mathcal{X} \diamond h$ means to first sample $X \sim P_\mathcal{X}$ and then set $Y = h(X)$. For a family of input distributions $\mathcal{P}_\mathcal{X}$ and a hypothesis class $\mathcal{H}$, we define $\mathcal{P}_\mathcal{X} \diamond \mathcal{H} = \{P_\mathcal{X} \diamond h \mid P_\mathcal{X} \in \mathcal{P}_\mathcal{X}, h \in \mathcal{H}\}$. In this work every joint distribution $P$ can be written as $P_\mathcal{X} \diamond h$ for some $h$, i.e., $P_{\mathcal{Y}|\mathcal{X}}$ is deterministic.

For a set $S \subseteq \mathcal{X}$ and a 1-1 map $g: \mathcal{X} \to \mathcal{X}$, we define $g(S) = \{g(\mathbf{x}) \mid \mathbf{x} \in S\}$. We use $\circ$ to denote function composition: $(f \circ g)(\mathbf{x})$ is defined as $f(g(\mathbf{x}))$, and for function classes $\mathcal{F}, \mathcal{G}$, $\mathcal{F} \circ \mathcal{G} = \{f \circ g \mid f \in \mathcal{F}, g \in \mathcal{G}\}$. For any distribution $P_\mathcal{X}$ supported on $\mathcal{X}$, we define $P_\mathcal{X} \circ g$ as the distribution such that $(P_\mathcal{X} \circ g)(S) = P_\mathcal{X}(g(S))$. In other words, $X \sim P_\mathcal{X} \iff g^{-1}(X) \sim P_\mathcal{X} \circ g$, because for all $S \subseteq \mathcal{X}$, $\Pr_{X \sim P_\mathcal{X}}[g^{-1}(X) \in S] = \Pr_{X \sim P_\mathcal{X}}[X \in g(S)] = [P_\mathcal{X} \circ g](S)$. For any joint distribution $P$ of the form $P = P_\mathcal{X} \diamond h$, we define $P \circ g = (P_\mathcal{X} \circ g) \diamond (h \circ g)$. In other words, $(X, Y) \sim P \iff (g^{-1}(X), Y) \sim P \circ g$. For any distribution class $\mathcal{P}$ and group $\mathcal{G}$ acting on $\mathcal{X}$, we define $\mathcal{P} \circ \mathcal{G} = \{P \circ g \mid P \in \mathcal{P}, g \in \mathcal{G}\}$.

Definition 3.1.
A deterministic supervised learning algorithm $\mathcal{A}$ is a mapping from a sequence of training data, $\{(\mathbf{x}_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$, to a hypothesis $\mathcal{A}(\{(\mathbf{x}_i, y_i)\}_{i=1}^n) \in \mathcal{H} \subseteq \mathcal{Y}^\mathcal{X}$. The algorithm $\mathcal{A}$ could also be randomized, in which case the output $\mathcal{A}(\{(\mathbf{x}_i, y_i)\}_{i=1}^n)$ is a distribution on hypotheses. Two randomized algorithms $\mathcal{A}$ and $\mathcal{A}'$ are the same if for any input, their outputs have the same distribution in function space, which is denoted by $\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n) \overset{d}{=} \mathcal{A}'(\{\mathbf{x}_i, y_i\}_{i=1}^n)$.
Algorithm 1 Iterative algorithm $\mathcal{A}$
Require: Initial parameter distribution $P_{\mathrm{init}}$ supported in $\mathcal{W} = \mathbb{R}^m$, total iterations $T$, training dataset $\{\mathbf{x}_i, y_i\}_{i=1}^n$, parametric model $\mathcal{M}: \mathcal{W} \to \mathcal{H}$, iterative update rule $F(\mathbf{W}, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n)$
Ensure: Hypothesis $h: \mathcal{X} \to \mathcal{Y}$.
  Sample $\mathbf{W}^{(0)} \sim P_{\mathrm{init}}$.
  for $t = 0$ to $T - 1$ do
    $\mathbf{W}^{(t+1)} = F(\mathbf{W}^{(t)}, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n)$.
  return $h = \mathrm{sign}\big[\mathcal{M}[\mathbf{W}^{(T)}]\big]$.

Definition 3.2 (Equivariant Algorithms). A learning algorithm $\mathcal{A}$ is equivariant under a group $\mathcal{G}_\mathcal{X}$ (or $\mathcal{G}_\mathcal{X}$-equivariant) if and only if for any dataset $\{\mathbf{x}_i, y_i\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ and all $g \in \mathcal{G}_\mathcal{X}$, $\mathbf{x} \in \mathcal{X}$, $\mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n) \circ g = \mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)$, or equivalently $\mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n)(g(\mathbf{x})) = [\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)](\mathbf{x})$.

Definition 3.3 (Sample Complexity). Given a problem $\mathcal{P}$ and a randomized learning algorithm $\mathcal{A}$, for $\delta, \varepsilon \in [0, 1]$, we define the $(\varepsilon, \delta)$-sample complexity, denoted $N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta)$, as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, with probability $1 - \delta$ over the randomness of $\{\mathbf{x}_i, y_i\}_{i=1}^n$, $\mathrm{err}_P(\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)) \leq \varepsilon$. We also define the $\varepsilon$-expected sample complexity for a problem $\mathcal{P}$, denoted $N^*(\mathcal{A}, \mathcal{P}, \varepsilon)$, as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, $\mathbb{E}_{(\mathbf{x}_i, y_i) \sim P}[\mathrm{err}_P(\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n))] \leq \varepsilon$. By definition, we have $N^*(\mathcal{A}, \mathcal{P}, \varepsilon + \delta) \leq N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta) \leq N^*(\mathcal{A}, \mathcal{P}, \varepsilon\delta)$ for all $\varepsilon, \delta \in [0, 1]$.
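As a sanity check of Definition 3.2, the following small NumPy sketch (our illustration; the helper names are made up) empirically tests equivariance of a toy algorithm, 1-nearest-neighbour, under a randomly drawn orthogonal transformation. 1-nearest-neighbour is orthogonal equivariant because orthogonal maps preserve Euclidean distances.

```python
import numpy as np

def is_equivariant(alg, data, transforms, test_points, tol=1e-8):
    """Empirically test Definition 3.2: for each g, compare
    alg({g(x_i), y_i}) evaluated at g(x) with alg({x_i, y_i}) evaluated at x."""
    xs, ys = data
    for g in transforms:                       # each g is a d x d matrix acting on inputs
        h_orig = alg(xs, ys)
        h_trans = alg(xs @ g.T, ys)            # transform every training input
        for x in test_points:
            if abs(h_trans(g @ x) - h_orig(x)) > tol:
                return False
    return True

# toy deterministic algorithm: 1-nearest-neighbour prediction
def nearest_neighbour(xs, ys):
    return lambda x: ys[np.argmin(np.linalg.norm(xs - x, axis=1))]

rng = np.random.default_rng(0)
d, n = 8, 5
xs, ys = rng.standard_normal((n, d)), rng.choice([-1, 1], n)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))       # random orthogonal matrix
print(is_equivariant(nearest_neighbour, (xs, ys), [Q], rng.standard_normal((3, d))))  # True
```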
3.1 PARAMETRIC MODELS AND ITERATIVE ALGORITHMS

A parametric model $\mathcal{M}: \mathcal{W} \to \mathcal{H}$ is a functional mapping from a weight $\mathbf{W}$ to a hypothesis $\mathcal{M}[\mathbf{W}]: \mathcal{X} \to \mathcal{Y}$. Given a specific parametric model $\mathcal{M}$, a general iterative algorithm is defined as Algorithm 1. In this work, we will only use the two parametric models below, FC-NN and CNN.

FC Nets: An $L$-layer fully-connected neural network parameterized by its weights $\mathbf{W} = (W_1, W_2, \ldots, W_L)$ is a function $\textrm{FC-NN}[\cdot]: \mathbb{R}^d \to \mathbb{R}$, where $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $d_0 = d$, and $d_L = 1$:
$$\textrm{FC-NN}[\mathbf{W}](\mathbf{x}) = W_L \,\sigma\big(W_{L-1} \cdots \sigma(W_2\,\sigma(W_1\mathbf{x}))\big).$$
Here, $\sigma: \mathbb{R} \to \mathbb{R}$ can be any function, and we abuse notation so that $\sigma$ is also defined for vector inputs, in the sense that $[\sigma(\mathbf{x})]_i = \sigma(x_i)$.

ConvNets (CNN):
In this paper we will only use two-layer convolutional neural networks with one channel. Suppose $d = d'r$ for some integers $d', r$. A 2-layer CNN parameterized by its weights $\mathbf{W} = (\mathbf{w}, \mathbf{a}, b) \in \mathbb{R}^k \times \mathbb{R}^r \times \mathbb{R}$ is a function $\textrm{CNN}[\cdot]: \mathbb{R}^d \to \mathbb{R}$:
$$\textrm{CNN}[\mathbf{W}](\mathbf{x}) = \sum_{i=1}^{r} a_i\,\sigma\big([\mathbf{w} * \mathbf{x}]_{d'(i-1)+1:d'i}\big) + b,$$
where $*: \mathbb{R}^k \times \mathbb{R}^d \to \mathbb{R}^d$ is the (circular) convolution operator, defined as $[\mathbf{w} * \mathbf{x}]_i = \sum_{j=1}^{k} w_j\, x_{[(i+j-2) \bmod d]+1}$, and $\sigma: \mathbb{R}^{d'} \to \mathbb{R}$ is the composition of pooling and an element-wise non-linearity.
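Below is a minimal NumPy sketch (ours) of this two-layer CNN's forward pass. The exact circular-convolution indexing convention and the choice of a quadratic non-linearity with average pooling for σ are assumptions made for concreteness.

```python
import numpy as np

def circular_conv(w, x):
    # [w * x]_i = sum_j w_j x_{((i + j - 2) mod d) + 1} (1-indexed), written with 0-indexing
    d, k = len(x), len(w)
    return np.array([sum(w[j] * x[(i + j) % d] for j in range(k)) for i in range(d)])

def cnn_forward(x, w, a, b, sigma):
    """Two-layer, one-channel CNN: convolve, split into r patches of length d',
    apply sigma (pooling + non-linearity) to each patch, combine with weights a."""
    r = len(a)
    d_prime = len(x) // r
    conv = circular_conv(w, x)
    patches = conv.reshape(r, d_prime)
    return sum(a[i] * sigma(patches[i]) for i in range(r)) + b

# example sigma: quadratic activation followed by average pooling
sigma = lambda patch: np.mean(patch ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(12)                      # d = 12, split into r = 3 patches of length 4
out = cnn_forward(x, w=rng.standard_normal(3), a=rng.standard_normal(3), b=0.1, sigma=sigma)
print(out)
```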
3.2 EQUIVARIANCE AND TRAINING ALGORITHMS

This section gives an informal sketch of why FC nets trained with standard algorithms have certain equivariance properties. The high-level idea is that if the update rule of the network, or more generally the parametrized model, exhibits a certain symmetry per step, i.e., property 2 in Theorem C.1, then by induction it will hold until the last iteration. (For randomized algorithms, the condition becomes $\mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n) \circ g \overset{d}{=} \mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)$, which is stronger than $\mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n)(g(\mathbf{x})) \overset{d}{=} [\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)](\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$.)

                 | |M_ii| = 1             | Permutation    | Orthogonal       | Invertible
Algorithms       | AdaGrad, Adam          | AdaGrad, Adam  | SGD, Momentum    | Newton's method
Initialization   | Symmetric distribution | i.i.d.         | i.i.d. Gaussian  | All zero
Regularization   | ℓp norm                | ℓp norm        | ℓ2 norm          | None

Table 1: Examples of gradient-based equivariant training algorithms for FC networks. The initialization requirement is only for the first layer of the network.

Taking linear regression as an example, let $\mathbf{x}_i \in \mathbb{R}^d$, $i \in [n]$, be the data and $\mathbf{y} \in \mathbb{R}^n$ be the labels. The GD update for $L(\mathbf{w}) = \sum_{i=1}^{n}(\mathbf{x}_i^\top\mathbf{w} - y_i)^2 = \|X^\top\mathbf{w} - \mathbf{y}\|_2^2$ would be $\mathbf{w}_{t+1} = F(\mathbf{w}_t, X, \mathbf{y}) := \mathbf{w}_t - \eta X(X^\top\mathbf{w}_t - \mathbf{y})$. Now suppose there is another person trying to solve the same problem using GD with the same initial linear function, but they observe everything in a different basis, i.e., $X' = UX$ and $\mathbf{w}' = U\mathbf{w}$ for some orthogonal matrix $U$. Not surprisingly, they would get the same solution from GD, just in a different basis. Mathematically, this is because $\mathbf{w}'_t = U\mathbf{w}_t \implies \mathbf{w}'_{t+1} = F(\mathbf{w}'_t, UX, \mathbf{y}) = UF(\mathbf{w}_t, X, \mathbf{y}) = U\mathbf{w}_{t+1}$. In other words, they would make the same prediction for unseen data. Thus if the initial distribution of $\mathbf{w}_0$ is the same under all bases (i.e., under rotations), e.g., Gaussian $N(0, I_d)$, then $\mathbf{w}_0 \overset{d}{=} U\mathbf{w}_0 \implies F^t(\mathbf{w}_0, UX, \mathbf{y}) \overset{d}{=} UF^t(\mathbf{w}_0, X, \mathbf{y})$ for any iteration $t$, which means GD for linear regression is orthogonal equivariant.

To show orthogonal equivariance for gradient descent on general deep FC nets, it suffices to apply the above argument to each neuron in the first layer of the FC net. Equivariance for other training algorithms (see Table 1) can be derived in exactly the same way. The rigorous statements and proofs are deferred to Appendix C.
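A quick numerical check of this argument (our illustration, not from the paper): run GD on a least-squares problem, run it again on rotated data with a correspondingly rotated initialization, and confirm that the iterates stay rotated copies of each other and the predictions on a rotated test point agree.

```python
import numpy as np

def gd_linear_regression(X, y, w0, lr=0.01, steps=500):
    # X has one column per data point (d x n); minimize ||X^T w - y||^2
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * X @ (X.T @ w - y)
    return w

rng = np.random.default_rng(0)
d, n = 10, 30
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w0 = rng.standard_normal(d)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix

w = gd_linear_regression(X, y, w0)
w_rot = gd_linear_regression(U @ X, y, U @ w0)     # the same run in a rotated basis

x_test = rng.standard_normal(d)
print(np.allclose(w_rot, U @ w))                   # True: iterates stay rotated copies
print(np.isclose(w_rot @ (U @ x_test), w @ x_test))  # True: identical predictions
```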
4 WARM-UP EXAMPLES AND PROOF OVERVIEW

4.1 EXAMPLE: Ω(d) LOWER BOUND AGAINST ORTHOGONAL EQUIVARIANT METHODS
We start with a simple but insightful example of how equivariance alone could suffice for some non-trivial lower bounds.

We consider a task on $\mathbb{R}^d \times \{\pm 1\}$ which is the uniform distribution on the set $\{(\mathbf{e}_i y, y) \mid i \in \{1, 2, \ldots, d\}, y = \pm 1\}$, denoted by $P$. Each sample from $P$ is a (signed) one-hot vector in $\mathbb{R}^d$ and the sign of the non-zero coordinate determines its label. Now imagine our goal is to learn this task using an algorithm $\mathcal{A}$. After observing a training set of $n$ labeled points $S := \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the algorithm is asked to make a prediction on an unseen test point $\mathbf{x}$, i.e., $\mathcal{A}(S)(\mathbf{x})$. Here we are concerned with orthogonal equivariant algorithms: the prediction of the algorithm on the test point remains the same even if we rotate every $\mathbf{x}_i$ and the test point $\mathbf{x}$ by any orthogonal matrix $R$, i.e.,
$$\mathcal{A}(\{(R\mathbf{x}_i, y_i)\}_{i=1}^n)(R\mathbf{x}) \overset{d}{=} \mathcal{A}(\{(\mathbf{x}_i, y_i)\}_{i=1}^n)(\mathbf{x}).$$
Now we show this algorithm fails to generalize on task $P$ if it observes only $d/2$ training examples. The main idea here is that, for a fixed training set $S$, the prediction $\mathcal{A}(\{(\mathbf{x}_i, y_i)\}_{i=1}^n)(\mathbf{x})$ is determined solely by the inner products between $\mathbf{x}$ and the $\mathbf{x}_i$'s, due to orthogonal equivariance. (This can be made formal using the fact that the Gram matrix determines a set of vectors up to an orthogonal transformation.) That is, there exists a random function $f$ (which may depend on $S$) such that
$$\mathcal{A}(\{(\mathbf{x}_i, y_i)\}_{i=1}^n)(\mathbf{x}) \overset{d}{=} f(\mathbf{x}^\top\mathbf{x}_1, \ldots, \mathbf{x}^\top\mathbf{x}_n).$$
But the input distribution for this task is supported on one-hot vectors. Suppose $n < d/2$. Then at test time the probability is at least $1/2$ that the new data point $(\mathbf{x}, y) \sim P$ is such that $\mathbf{x}$ has zero inner product with all $n$ points seen in the training set $S$. This fact alone fixes the prediction of $\mathcal{A}$ to the value $f(0, \ldots, 0)$, whereas $y$ is independently and randomly chosen to be $\pm 1$. We conclude that $\mathcal{A}$ outputs the wrong answer with probability at least $1/4$.
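As a quick illustration of the counting step (ours, not the paper's), the sketch below estimates the probability that a fresh test point from this task has zero inner product with every point in a training set of size n = d/2; the argument above only needs this probability to be at least 1/2.

```python
import numpy as np

def collision_free_probability(d, n, trials=20000, seed=0):
    """Fraction of trials in which a fresh one-hot test point is orthogonal to
    all n training points drawn from the task P of Section 4.1."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        train_coords = rng.integers(0, d, size=n)   # coordinate index of each training point
        test_coord = rng.integers(0, d)
        if test_coord not in train_coords:
            hits += 1
    return hits / trials

d = 50
print(collision_free_probability(d, n=d // 2))   # roughly (1 - 1/d)^(d/2), about 0.60 > 1/2
```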
4.2 EXAMPLE: Ω(d²) LOWER BOUND IN THE WEAK SENSE

The warm-up example illustrates the main insight of Ng (2004), namely, that when an orthogonal equivariant algorithm is used to learn a certain task, it is actually being forced to simultaneously learn all rotated copies of that task. Concretely, let $h^* = \mathrm{sign}\big[\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\big]$. That an algorithm $\mathcal{A}$ is orthogonal equivariant (Definition 3.2) means that for any task $P = P_\mathcal{X} \diamond h^*$, where $P_\mathcal{X}$ is the input distribution and $h^*$ is the labeling function, $\mathcal{A}$ must have the same performance on $P$ and its rotated version $P \circ U = (P_\mathcal{X} \circ U) \diamond (h^* \circ U)$, where $U$ can be any orthogonal matrix. Therefore, if there is an orthogonal equivariant learning algorithm $\mathcal{A}$ that learns $h^*$ on all distributions, then $\mathcal{A}$ also learns every rotated copy $h^* \circ U$ of $h^*$ on every distribution $P_\mathcal{X}$, simply because $\mathcal{A}$ learns $h^*$ on the distribution $P_\mathcal{X} \circ U^{-1}$. Thus $\mathcal{A}$ learns the class of labeling functions $h^* \circ O(2d) := \{h(\mathbf{x}) = h^*(U\mathbf{x}) \mid U \in O(2d)\}$ on all distributions. (See the formal statement in Theorem 5.1.) By the standard lower bounds with VC dimension (see Theorem B.1), it takes at least $\Omega\big(\frac{\mathrm{VCdim}(h^* \circ O(2d))}{\varepsilon}\big)$ samples for $\mathcal{A}$ to guarantee $1 - \varepsilon$ accuracy. Thus it suffices to show the VC dimension $\mathrm{VCdim}(h^* \circ O(2d)) = \Omega(d^2)$, towards an $\Omega(d^2)$ sample complexity lower bound. (Ng (2004) picks a linear thresholding function as $h^*$, and thus $\mathrm{VCdim}(h^* \circ O(d))$ is only $O(d)$.)

Formally, we have the following theorem, whose proof is deferred to the appendix:

Theorem 4.1 (All distributions, single hypothesis). Let $\mathcal{P} = \{\text{all distributions}\} \diamond \{h^*\}$. For any orthogonal equivariant algorithm $\mathcal{A}$, $N(\mathcal{A}, \mathcal{P}, \varepsilon, \delta) = \Omega\big((d^2 + \ln\frac{1}{\delta})/\varepsilon\big)$, while there is a 2-layer ConvNet architecture such that $N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\big(\frac{1}{\varepsilon}\big(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\big)\big)$. Moreover, $\mathrm{ERM}_{\mathrm{CNN}}$ could be realized by gradient descent.

But as noted in the introduction, this does not imply there is some task hard for every training algorithm for the FC net. The VC dimension based lower bound implies for each algorithm $\mathcal{A}$ the existence of a fixed distribution $P_\mathcal{X} \in \mathcal{P}_\mathcal{X}$ and some orthogonal matrix $U_\mathcal{A}$ such that the task $(P_\mathcal{X} \circ U_\mathcal{A}^{-1}) \diamond h^*$ is hard for it. However, this does not preclude $(P_\mathcal{X} \circ U_\mathcal{A}^{-1}) \diamond h^*$ being easy for some other algorithm $\mathcal{A}'$.

4.3 PROOF OVERVIEW FOR FIXED DISTRIBUTION LOWER BOUNDS
At first sight, the issue highlighted above (and in the introduction) seems difficult to get around. One possible avenue is if the hard input distribution $P_\mathcal{X}$ in the task were invariant under all orthogonal transformations, i.e., $P_\mathcal{X} = P_\mathcal{X} \circ U$ for all orthogonal matrices $U$. Unfortunately, the distribution constructed in the proof of the lower bound with VC dimension is inherently discrete and cannot be made invariant to orthogonal transformations.

Our proof will use a fixed $P_\mathcal{X}$, namely the standard Gaussian, which is indeed invariant under orthogonal transformations. The proof also uses Benedek-Itai's lower bound, Theorem 4.2, and the main technical part of our proof is the lower bound on the packing number $D(\mathcal{H}, \rho, \varepsilon)$ defined below (see also Equation (2)).

For a function class $\mathcal{H}$, we use $\Pi_\mathcal{H}(n)$ to denote the growth function of $\mathcal{H}$, i.e., $\Pi_\mathcal{H}(n) := \sup_{\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathcal{X}} |\{(h(\mathbf{x}_1), h(\mathbf{x}_2), \ldots, h(\mathbf{x}_n)) \mid h \in \mathcal{H}\}|$. Denoting the VC dimension of $\mathcal{H}$ by $\mathrm{VCdim}(\mathcal{H})$, the Sauer-Shelah lemma gives $\Pi_\mathcal{H}(n) \leq \big(\frac{en}{\mathrm{VCdim}(\mathcal{H})}\big)^{\mathrm{VCdim}(\mathcal{H})}$ for $n \geq \mathrm{VCdim}(\mathcal{H})$. Let $\rho$ be a metric on $\mathcal{H}$. We define $\mathcal{N}(\mathcal{H}, \rho, \varepsilon)$ as the $\varepsilon$-covering number of $\mathcal{H}$ w.r.t. $\rho$, and $D(\mathcal{H}, \rho, \varepsilon)$ as the $\varepsilon$-packing number of $\mathcal{H}$ w.r.t. $\rho$. For a distribution $P_\mathcal{X}$, we use $\rho_{P_\mathcal{X}}(h, h') := \Pr_{X \sim P_\mathcal{X}}[h(X) \neq h'(X)]$ to denote the discrepancy between hypotheses $h$ and $h'$ w.r.t. $P_\mathcal{X}$.

Theorem 4.2 (Benedek-Itai's lower bound). For any algorithm $\mathcal{A}$ that $(\varepsilon, \delta)$-learns $\mathcal{H}$ with $n$ i.i.d. samples from a fixed distribution $P_\mathcal{X}$, it must hold that
$$\Pi_\mathcal{H}(n) \geq (1 - \delta)\, D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon). \quad (1)$$
Since $\Pi_\mathcal{H}(n) \leq 2^n$, we have $N(\mathcal{A}, P_\mathcal{X} \diamond \mathcal{H}, \varepsilon, \delta) \geq \log_2 D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon) + \log_2(1 - \delta)$, which is the original bound from Benedek & Itai (1991). Later, Long (1995) improved this bound for the regime $n \geq \mathrm{VCdim}(\mathcal{H})$ using the Sauer-Shelah lemma, i.e.,
$$N(\mathcal{A}, P_\mathcal{X} \diamond \mathcal{H}, \varepsilon, \delta) \geq \frac{\mathrm{VCdim}(\mathcal{H})}{e}\big((1 - \delta)\, D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon)\big)^{\frac{1}{\mathrm{VCdim}(\mathcal{H})}}. \quad (2)$$

Intuition behind Benedek-Itai's lower bound. We first fix the data distribution as $P_\mathcal{X}$. Suppose the $2\varepsilon$-packing is labeled as $\{h_1, \ldots, h_{D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon)}\}$ and the ground truth is chosen from this packing. Then $(\varepsilon, \delta)$-learning the hypothesis class $\mathcal{H}$ means the algorithm is able to recover the index of the ground truth with probability $1 - \delta$. Thus one can think of this learning process as a noisy channel which delivers $\log D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon)$ bits of information. Since the data distribution is fixed, the unlabeled data are independent of the ground truth, and the only information source is the labels. With some information-theoretic inequalities, we can show the number of labels, or samples (i.e., bits of information), satisfies $N(\mathcal{A}, P_\mathcal{X} \diamond \mathcal{H}, \varepsilon, \delta) \geq \log D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon) + \log(1 - \delta)$. A closer look yields Equation (2), because when $\mathrm{VCdim}(\mathcal{H}) < \infty$, only $\log \Pi_\mathcal{H}(n)$ instead of $n$ bits of information can be delivered.

5 LOWER BOUNDS
Below we first present a reduction from a special subclass of PAC learning to equivariant learning (Theorem 5.1), based on which we prove our main separation results, Theorems 4.1, 5.2, 5.3 and 5.4.
Theorem 5.1. If $\mathcal{P}_\mathcal{X}$ is a set of data distributions that is invariant under group $\mathcal{G}_\mathcal{X}$, i.e., $\mathcal{P}_\mathcal{X} \circ \mathcal{G}_\mathcal{X} = \mathcal{P}_\mathcal{X}$, then the following inequality holds (furthermore, it becomes an equality when $\mathcal{G}_\mathcal{X}$ is a compact group):
$$\inf_{\mathcal{A} \in \mathbb{A}_{\mathcal{G}_\mathcal{X}}} N^*(\mathcal{A}, \mathcal{P}_\mathcal{X} \diamond \mathcal{H}, \varepsilon) \geq \inf_{\mathcal{A} \in \mathbb{A}} N^*(\mathcal{A}, \mathcal{P}_\mathcal{X} \diamond (\mathcal{H} \circ \mathcal{G}_\mathcal{X}), \varepsilon). \quad (3)$$

Remark 5.1.
The sample complexity in standard PAC learning is usually defined against a hypothesis class $\mathcal{H}$ only, i.e., $\mathcal{P}_\mathcal{X}$ is the set of all possible input distributions. In that case, $\mathcal{P}_\mathcal{X}$ is always invariant under the group $\mathcal{G}_\mathcal{X}$, and thus Theorem 5.1 says that $\mathcal{G}_\mathcal{X}$-equivariant learning against hypothesis class $\mathcal{H}$ is as hard as learning against the hypothesis class $\mathcal{H} \circ \mathcal{G}_\mathcal{X}$ without the equivariance constraint.

5.1 Ω(d²) LOWER BOUND FOR ORTHOGONAL EQUIVARIANCE WITH A FIXED DISTRIBUTION
In this subsection we show the Ω(d²) vs O(1) separation on a single task in our main theorem (Theorem 5.2). With the same proof technique, we further show that we can get the correct dependency on ε for the lower bound, i.e., Ω(d²/ε), by considering a slightly larger function class, which can be learnt by ConvNets with O(d) samples. We also generalize this Ω(d²) vs O(d) separation to the case of ℓ2 regression with a different proof technique.

Theorem 5.2.
There is a single task, $P_\mathcal{X} \diamond h^*$, where $h^* = \mathrm{sign}\big[\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\big]$ and $P_\mathcal{X} = N(0, I_{2d})$, and a constant $\varepsilon_0 > 0$, independent of $d$, such that for any orthogonal equivariant algorithm $\mathcal{A}$, we have
$$N^*(\mathcal{A}, P_\mathcal{X} \diamond h^*, \varepsilon_0) = \Omega(d^2), \quad (4)$$
while there is a 2-layer ConvNet such that $N(\mathrm{ERM}_{\mathrm{CNN}}, P_\mathcal{X} \diamond h^*, \varepsilon, \delta) = O\big(\frac{1}{\varepsilon}\big(\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\big)\big)$. Moreover, $\mathrm{ERM}_{\mathrm{CNN}}$ could be realized by gradient descent.
Proof of Theorem 5.2.
Upper bound: Implied by the upper bound in Theorem 4.1.
Lower bound:
Note that $P_\mathcal{X} = N(0, I_{2d})$ is invariant under $O(2d)$, so by Theorem 5.1 it suffices to show that there is a constant $\varepsilon_0 > 0$ (independent of $d$) such that for any algorithm $\mathcal{A}$, it takes $\Omega(d^2)$ samples to learn the augmented function class $h^* \circ O(2d)$ w.r.t. $P_\mathcal{X} = N(0, I_{2d})$. Define $h_U = \mathrm{sign}\big[\mathbf{x}_{1:d}^\top U\,\mathbf{x}_{d+1:2d}\big]$ for all $U \in \mathbb{R}^{d \times d}$; by Lemma D.2, we have $\mathcal{H} = \{h_U \mid U \in O(d)\} \subseteq h^* \circ O(2d)$. Thus it suffices to show an $\Omega(d^2)$ sample complexity lower bound for the sub function class $\mathcal{H}$, i.e.,
$$N^*\big(\mathcal{A}, N(0, I_{2d}) \diamond \{\mathrm{sign}[\mathbf{x}_{1:d}^\top U\,\mathbf{x}_{d+1:2d}]\}, \varepsilon_0\big) = \Omega(d^2). \quad (5)$$
By Benedek-Itai's lower bound (Benedek & Itai, 1991) (Equation (1)), we know
$$N(\mathcal{A}, \mathcal{P}, \varepsilon_0, \delta) \geq \log\big((1 - \delta)\, D(\mathcal{H}, \rho_{P_\mathcal{X}}, 2\varepsilon_0)\big). \quad (6)$$
By Lemma D.4, there is some constant $C$ such that $D(\mathcal{H}, \rho_{P_\mathcal{X}}, \varepsilon) \geq \big(\frac{C}{\varepsilon}\big)^{\frac{d(d-1)}{2}}$ for all $\varepsilon > 0$. The high-level idea of Lemma D.4 is to first show that $\rho_{P_\mathcal{X}}(h_U, h_V) \geq \Omega\big(\frac{\|U - V\|_F}{\sqrt{d}}\big)$, and then to show that the packing number of orthogonal matrices in a small neighborhood of $I_d$ w.r.t. $\frac{\|\cdot\|_F}{\sqrt{d}}$ is roughly the same as that in the tangent space of the orthogonal manifold at $I_d$, i.e., the set of skew-symmetric matrices, which is of dimension $\frac{d(d-1)}{2}$ and has packing number $\big(\frac{C}{\varepsilon}\big)^{\frac{d(d-1)}{2}}$. The advantage of working in the tangent space is that we can apply the standard volume argument.

Finally, setting $\delta = \frac{1}{2}$, we conclude that $N^*(\mathcal{A}, \mathcal{P}, \varepsilon_0) \geq N(\mathcal{A}, \mathcal{P}, 2\varepsilon_0, \frac{1}{2}) \geq \frac{d(d-1)}{2}\log\frac{C}{2\varepsilon_0} - 1 = \Omega(d^2)$.

Indeed, we can improve the above lower bound by applying Equation (2), and get
$$N\big(\mathcal{A}, \mathcal{P}, \varepsilon, \tfrac{1}{2}\big) \geq \frac{d(d-1)}{2e}\Big(\frac{1}{2}\Big)^{\frac{2}{d(d-1)}}\Big(\frac{C}{\varepsilon}\Big)^{1 - \frac{2}{d(d-1)}} = \Omega\big(d^2\,\varepsilon^{-1 + \frac{2}{d(d-1)}}\big). \quad (7)$$
Note that the dependency on $\varepsilon$ in Equation (7), namely $\varepsilon^{-1 + \frac{2}{d(d-1)}}$, is not optimal, as opposed to $\varepsilon^{-1}$ in the upper bounds and other lower bounds. A possible reason for this might be that Theorem 4.2 (Long's improved version) is still not tight, and it might require a tighter probabilistic upper bound on the growth number $\Pi_\mathcal{H}(n)$, at least taking $P_\mathcal{X}$ into consideration, as opposed to the current upper bound using VC dimension only. We leave it as an open problem to show a single task $P$ with an $\Omega(d^2/\varepsilon)$ sample complexity lower bound for all orthogonal equivariant algorithms.

However, if the hypothesis class is of VC dimension $O(d)$, using a similar idea, we can prove an $\Omega(d^2/\varepsilon)$ sample complexity lower bound for equivariant algorithms, and an $O(d)$ upper bound for ConvNets.

Theorem 5.3 (Single distribution, multiple functions). There is a problem with a single input distribution, $\mathcal{P} = \{P_\mathcal{X}\} \diamond \mathcal{H} = \{N(0, I_d)\} \diamond \big\{\mathrm{sign}\big[\sum_{i=1}^{d}\alpha_i x_i^2\big] \mid \alpha_i \in \mathbb{R}\big\}$, such that for any orthogonal equivariant algorithm $\mathcal{A}$ and $\varepsilon > 0$, $N^*(\mathcal{A}, \mathcal{P}, \varepsilon) = \Omega(d^2/\varepsilon)$, while there is a 2-layer ConvNet architecture such that
$N(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon, \delta) = O\big(\frac{d\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}}{\varepsilon}\big)$.

Interestingly, we can show an analog of Theorem 5.3 for $\ell_2$ regression, i.e., when the algorithm observes not only the signs but also the values of the labels $y_i$. Here we define the $\ell_2$ loss of a function $h: \mathbb{R}^d \to \mathbb{R}$ as $\ell_P(h) = \mathbb{E}_{(\mathbf{x}, y) \sim P}\big[(h(\mathbf{x}) - y)^2\big]$, and the sample complexity $N^*(\mathcal{A}, \mathcal{P}, \varepsilon)$ for $\ell_2$ loss similarly as the smallest number $n \in \mathbb{N}$ such that for all $P \in \mathcal{P}$, $\mathbb{E}_{(\mathbf{x}_i, y_i) \sim P}[\ell_P(\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n))] \leq \varepsilon\,\mathbb{E}_{(\mathbf{x}, y) \sim P}[y^2]$. The last term $\mathbb{E}_{(\mathbf{x}, y) \sim P}[y^2]$ is added for normalization to avoid the scaling issue; otherwise any $\varepsilon > 0$ could be achieved trivially by predicting $0$ for all data.

Theorem 5.4 (Single distribution, multiple functions, $\ell_2$ regression). There is a problem with a single input distribution, $\mathcal{P} = \{P_\mathcal{X}\} \diamond \mathcal{H} = \{N(0, I_d)\} \diamond \big\{\sum_{i=1}^{d}\alpha_i x_i^2 \mid \alpha_i \in \mathbb{R}\big\}$, such that for any orthogonal equivariant algorithm $\mathcal{A}$ and $\varepsilon > 0$, $N^*(\mathcal{A}, \mathcal{P}, \varepsilon) \geq \frac{d(d+1)}{2} - \frac{d(d+2)}{2}\varepsilon$, while there is a 2-layer ConvNet architecture such that $N^*(\mathrm{ERM}_{\mathrm{CNN}}, \mathcal{P}, \varepsilon) \leq d$ for any $\varepsilon > 0$.

5.2 Ω(d) LOWER BOUND FOR PERMUTATION EQUIVARIANCE
In this subsection we present an Ω(d) lower bound for permutation equivariance via a different proof technique, direct coupling. The high-level idea of direct coupling is to show that with constant probability over $(X^n, \mathbf{x})$, we can find a $g \in \mathcal{G}_\mathcal{X}$ such that $g(X^n) = X^n$ but $\mathbf{x}$ and $g(\mathbf{x})$ have different labels, in which case no equivariant algorithm could make the correct prediction.

Theorem 5.5.
Let $\mathbf{t}_i = \mathbf{e}_i + \mathbf{e}_{i+1}$ and $\mathbf{s}_i = \mathbf{e}_i + \mathbf{e}_{i+2}$, and let $P$ be the uniform distribution on $\{(\mathbf{s}_i, 1)\}_{i=1}^d \cup \{(\mathbf{t}_i, -1)\}_{i=1}^d$, which is the classification problem for local textures in a 1-dimensional image with $d$ pixels. Then for any permutation equivariant algorithm $\mathcal{A}$, $N(\mathcal{A}, P, \frac{1}{4}, \frac{1}{4}) \geq N^*(\mathcal{A}, P, \frac{1}{2}) \geq \frac{d}{4}$. Meanwhile, $N(\mathrm{ERM}_{\mathrm{CNN}}, P, 0, \delta) \leq \log_2\frac{1}{\delta} + 2$, where $\mathrm{ERM}_{\mathrm{CNN}}$ stands for $\mathrm{ERM}_{\mathcal{F}_{\mathrm{CNN}}}$ for the function class of 2-layer ConvNets.
Remark 5.2.
The task could be understood as detecting whether there are two consecutive white pixels against a black background. For proof simplicity, we take a texture of length 2 as an illustrative example. It is straightforward to extend the same proof to more sophisticated local pattern detection problems of any constant length and to 2-dimensional images. For a vector $\mathbf{x} \in \mathbb{R}^d$, we define $x_i = x_{[(i-1) \bmod d]+1}$, i.e., indices wrap around cyclically.
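A tiny NumPy sketch (ours) of this texture task: it builds the s_i and t_i examples with cyclic indexing and classifies them perfectly with a hand-set width-2 convolutional detector that fires exactly when two consecutive white pixels are present.

```python
import numpy as np

def e(i, d):
    v = np.zeros(d)
    v[i % d] = 1.0          # cyclic indexing: coordinates wrap around
    return v

def make_texture_task(d):
    # t_i = e_i + e_{i+1} (two consecutive white pixels, label -1)
    # s_i = e_i + e_{i+2} (a one-pixel gap, label +1)
    xs = [e(i, d) + e(i + 1, d) for i in range(d)] + [e(i, d) + e(i + 2, d) for i in range(d)]
    ys = [-1.0] * d + [1.0] * d
    return np.array(xs), np.array(ys)

def conv_detector(x):
    # circular convolution with filter (1, 1), then max-pooling:
    # the maximum response is 2 iff some pair of adjacent pixels are both white
    responses = x + np.roll(x, -1)
    return -1.0 if responses.max() > 1.5 else 1.0

d = 16
xs, ys = make_texture_task(d)
preds = np.array([conv_detector(x) for x in xs])
print("accuracy of the width-2 detector:", (preds == ys).mean())   # 1.0
```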
6 CONCLUSION

We rigorously justify the common intuition that ConvNets can have better inductive bias than FC nets, by constructing a single natural distribution on which any FC net requires Ω(d²) samples to generalize if trained with most gradient-based methods starting from Gaussian initialization. On the same task, O(1) samples suffice for convolutional architectures. We further extend our results to permutation equivariant algorithms, including adaptive training algorithms like Adam and AdaGrad, ℓ1 regularization, etc. The separation becomes Ω(d) vs O(1) in this case.

ACKNOWLEDGMENTS
We acknowledge support from NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC.

REFERENCES
Zeyuan Allen-Zhu and Yuanzhi Li. What can ResNet learn efficiently, going beyond kernels? In Advances in Neural Information Processing Systems, pp. 9015–9025, 2019.

Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In International Conference on Machine Learning, pp. 908–916, 2016.

Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389, 1991.

Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, October 1989. ISSN 0004-5411. doi: 10.1145/76359.76371. URL https://doi.org/10.1145/76359.76371.

Simon S. Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R. Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pp. 373–383, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Philip M. Long. On the sample complexity of PAC learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.

Zongming Ma and Yihong Wu. Volume ratio, sparsity, and minimaxity under unitarily invariant norms. IEEE Transactions on Information Theory, 61(12):6939–6956, 2015.

Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 78, 2004.

Stanislaw J. Szarek. Metric entropy of homogeneous spaces. arXiv preprint math/9701213, 1997.

Michel Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596.

Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709–9721, 2019.

A SOME BASIC INEQUALITIES
Lemma A.1. For all $x \in [-1, 1]$, $\frac{\arccos x}{\sqrt{1 - x}} \geq \sqrt{2}$.

Proof. Let $x = \cos(t)$, $t \in [0, \pi]$. We have
$$\frac{\arccos(x)}{\sqrt{1 - x}} = \frac{t}{\sqrt{1 - \cos(t)}} = \frac{t}{\sqrt{2}\sin(t/2)} \geq \frac{t}{\sqrt{2}\,(t/2)} = \sqrt{2}.$$

Lemma A.2. There exists $C > 0$ such that for all $d \in \mathbb{N}^+$ and $M \in \mathbb{R}^{d \times d}$,
$$C\|M\|_F/\sqrt{d} \leq \mathbb{E}_{\mathbf{x} \sim S^{d-1}}[\|M\mathbf{x}\|_2] \leq \|M\|_F/\sqrt{d}. \quad (8)$$

Proof of Lemma A.2.
Upper bound: By the Cauchy-Schwarz inequality, we have
$$\mathbb{E}_{\mathbf{x} \sim S^{d-1}}[\|M\mathbf{x}\|_2] \leq \sqrt{\mathbb{E}_{\mathbf{x} \sim S^{d-1}}\big[\|M\mathbf{x}\|_2^2\big]} = \sqrt{\mathrm{tr}\Big[M\,\mathbb{E}_{\mathbf{x} \sim S^{d-1}}[\mathbf{x}\mathbf{x}^\top]\,M^\top\Big]} = \sqrt{\frac{\mathrm{tr}[MM^\top]}{d}} = \frac{\|M\|_F}{\sqrt{d}}.$$
Lower bound: Let $M = U\Sigma V^\top$ be the singular value decomposition of $M$, where $U, V$ are orthogonal matrices and $\Sigma$ is diagonal. Since $\|M\|_F = \|\Sigma\|_F$ and $\mathbb{E}_{\mathbf{x} \sim S^{d-1}}[\|M\mathbf{x}\|_2] = \mathbb{E}_{\mathbf{x} \sim S^{d-1}}[\|\Sigma\mathbf{x}\|_2]$, w.l.o.g. we only need to prove the lower bound for diagonal matrices. By Proposition 2.5.1 in (Talagrand, 2014), there is some constant $C$ such that
$$C\|\Sigma\|_F = C\sqrt{\sum_{i=1}^{d}\sigma_i^2} \leq \mathbb{E}_{\mathbf{x} \sim N(0, I_d)}\sqrt{\sum_{i=1}^{d} x_i^2\sigma_i^2} = \mathbb{E}_{\mathbf{x} \sim N(0, I_d)}[\|M\mathbf{x}\|_2].$$
By the Cauchy-Schwarz inequality, we have $\mathbb{E}_{\mathbf{x} \sim N(0, I_d)}[\|\mathbf{x}\|_2] \leq \sqrt{\mathbb{E}_{\mathbf{x} \sim N(0, I_d)}\big[\|\mathbf{x}\|_2^2\big]} = \sqrt{d}$. Therefore, we have
$$C\|\Sigma\|_F \leq \mathbb{E}_{\mathbf{x} \sim N(0, I_d)}[\|M\mathbf{x}\|_2] = \mathbb{E}_{\hat{\mathbf{x}} \sim S^{d-1}}[\|M\hat{\mathbf{x}}\|_2]\,\mathbb{E}_{\mathbf{x} \sim N(0, I_d)}[\|\mathbf{x}\|_2] \leq \mathbb{E}_{\hat{\mathbf{x}} \sim S^{d-1}}[\|M\hat{\mathbf{x}}\|_2]\,\sqrt{d}, \quad (9)$$
which completes the proof.

Lemma A.3.
For any $z > 0$, we have $\Pr_{x \sim N(0, \sigma^2)}(|x| \leq z) \leq \sqrt{\frac{2}{\pi}}\frac{z}{\sigma}$.

Proof.
$$\Pr_{x \sim N(0, \sigma^2)}(|x| \leq z) = \int_{-z}^{z}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{x^2}{2\sigma^2}\Big)\,dx \leq \sqrt{\frac{2}{\pi}}\frac{z}{\sigma}.$$

B UPPER AND LOWER BOUNDS FOR SAMPLE COMPLEXITY WITH VC DIMENSION
Theorem B.1 (Blumer et al. (1989)). If a learning algorithm $\mathcal{A}$ is consistent and ranged in $\mathcal{H}$, i.e., $\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n) \in \mathcal{H}$ and $\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n)(\mathbf{x}_i) = y_i$ for all $i \in [n]$, then for any distribution $P_\mathcal{X}$ and $0 < \varepsilon, \delta < 1$, we have
$$N(\mathcal{A}, P_\mathcal{X} \diamond \mathcal{H}, \varepsilon, \delta) = O\Big(\frac{\mathrm{VCdim}(\mathcal{H})\ln\frac{1}{\varepsilon} + \ln\frac{1}{\delta}}{\varepsilon}\Big). \quad (10)$$
Meanwhile, there is a distribution $P_\mathcal{X}$ supported on any subset $\{\mathbf{x}_0, \ldots, \mathbf{x}_{d-1}\}$ which can be shattered by $\mathcal{H}$, such that for any $0 < \varepsilon, \delta < 1$ and any algorithm $\mathcal{A}$, it holds that
$$N(\mathcal{A}, P_\mathcal{X} \diamond \mathcal{H}, \varepsilon, \delta) = \Omega\Big(\frac{\mathrm{VCdim}(\mathcal{H}) + \ln\frac{1}{\delta}}{\varepsilon}\Big). \quad (11)$$
C EQUIVARIANCE IN ALGORITHMS
In this section, we give sufficient conditions for an iterative algorithm (as defined in Algorithm 1) to be equivariant.
Theorem C.1.
Suppose $\mathcal{G}_\mathcal{X}$ is a group acting on $\mathcal{X} = \mathbb{R}^d$. The iterative algorithm $\mathcal{A}$ (as defined in Algorithm 1) is $\mathcal{G}_\mathcal{X}$-equivariant if the following conditions are met:

1. There is a group $\mathcal{G}_\mathcal{W}$ acting on $\mathcal{W}$ and a group isomorphism $\tau: \mathcal{G}_\mathcal{X} \to \mathcal{G}_\mathcal{W}$, such that $\mathcal{M}[\tau(g)(\mathbf{W})](g(\mathbf{x})) = \mathcal{M}[\mathbf{W}](\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, $\mathbf{W} \in \mathcal{W}$, $g \in \mathcal{G}_\mathcal{X}$. (One can think of $g$ as the rotation $U$ applied on the data $\mathbf{x}$ in linear regression and $\tau(U)$ as the rotation $U$ applied on $\mathbf{w}$.)
2. The update rule $F$ is invariant under any joint group action $(g, \tau(g))$, $g \in \mathcal{G}_\mathcal{X}$. In other words, $[\tau(g)](F(\mathbf{W}, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n)) = F([\tau(g)](\mathbf{W}), \mathcal{M}, \{g(\mathbf{x}_i), y_i\}_{i=1}^n)$.
3. The initialization $P_{\mathrm{init}}$ is invariant under the group $\mathcal{G}_\mathcal{W}$, i.e., $P_{\mathrm{init}} = P_{\mathrm{init}} \circ g^{-1}$ for all $g \in \mathcal{G}_\mathcal{W}$.

Here we want to emphasize that the three conditions in Theorem C.1 are natural and almost necessary. Condition 1 is the minimal expressiveness requirement for the model $\mathcal{M}$ to allow equivariance. Condition 3 is required for equivariance at initialization. Condition 2 is necessary for the induction.

Proof of Theorem C.1. For any $g \in \mathcal{G}_\mathcal{X}$, we sample $\mathbf{W}^{(0)} \sim P_{\mathrm{init}}$ and set $\widetilde{\mathbf{W}}^{(0)} = \tau(g)(\mathbf{W}^{(0)})$. By property (3), $\widetilde{\mathbf{W}}^{(0)} \overset{d}{=} \mathbf{W}^{(0)} \sim P_{\mathrm{init}}$. Let $\mathbf{W}^{(t+1)} = F\big(\mathbf{W}^{(t)}, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n\big)$ and $\widetilde{\mathbf{W}}^{(t+1)} = F\big(\widetilde{\mathbf{W}}^{(t)}, \mathcal{M}, \{g(\mathbf{x}_i), y_i\}_{i=1}^n\big)$ for $0 \leq t \leq T - 1$; we can show $\widetilde{\mathbf{W}}^{(t)} = \tau(g)(\mathbf{W}^{(t)})$ by induction using property (2). By the definition of Algorithm 1, we have $\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n) \overset{d}{=} \mathcal{M}[\mathbf{W}^{(T)}]$ and $\mathcal{M}[\widetilde{\mathbf{W}}^{(T)}] \circ g \overset{d}{=} \mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n) \circ g$. By property (1), we have $\mathcal{M}[\widetilde{\mathbf{W}}^{(T)}](g(\mathbf{x})) = \mathcal{M}[\tau(g)(\mathbf{W}^{(T)})](g(\mathbf{x})) = \mathcal{M}[\mathbf{W}^{(T)}](\mathbf{x})$. Therefore, $\mathcal{A}(\{\mathbf{x}_i, y_i\}_{i=1}^n) \overset{d}{=} \mathcal{M}[\mathbf{W}^{(T)}] = \mathcal{M}[\widetilde{\mathbf{W}}^{(T)}] \circ g \overset{d}{=} \mathcal{A}(\{g(\mathbf{x}_i), y_i\}_{i=1}^n) \circ g$, meaning $\mathcal{A}$ is $\mathcal{G}_\mathcal{X}$-equivariant.

Remark C.1.
Theorem C.1 can be extended to the stochastic case and the adaptive case which allows the algorithm to use information from the whole trajectory, i.e., the update rule could be generalized as $\mathbf{W}^{(t+1)} = F_t(\{\mathbf{W}^{(s)}\}_{s=1}^t, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n)$, as long as (the distribution of) each $F_t$ is invariant under joint transformations.

Below are two example applications of Theorem C.1. Other results in Table 1 could be achieved in the same way. For classification tasks, optimization algorithms often work with a differentiable surrogate loss $\ell: \mathbb{R} \to \mathbb{R}$ instead of the 0-1 loss, such that $\ell(yh(\mathbf{x})) \geq \mathbb{1}[yh(\mathbf{x}) \leq 0]$, and the total loss for a hypothesis $h$ and the training set, $\mathcal{L}(\mathcal{M}(\mathbf{W}); \{\mathbf{x}_i, y_i\}_{i=1}^n)$, is defined as $\sum_{i=1}^n \ell(y_i\,[\mathcal{M}(\mathbf{W})](\mathbf{x}_i))$. It is also denoted by $\mathcal{L}(\mathbf{W})$ when there is no confusion.

Definition C.1 (Gradient Descent for FC nets). We call Algorithm 1 Gradient Descent if $\mathcal{M} = \textrm{FC-NN}$ and $F = \mathrm{GD}_\mathcal{L}$, where $\mathrm{GD}_\mathcal{L}(\mathbf{W}) = \mathbf{W} - \eta\nabla\mathcal{L}(\mathbf{W})$ is called the one-step gradient descent update and $\eta > 0$ is the learning rate.

Algorithm 2 Gradient Descent for FC-NN (FC networks)
Require: Initial parameter distribution $P_{\mathrm{init}}$, total iterations $T$, training dataset $\{\mathbf{x}_i, y_i\}_{i=1}^n$, loss function $\ell$
Ensure: Hypothesis $h: \mathcal{X} \to \mathcal{Y}$.
  Sample $\mathbf{W}^{(0)} \sim P_{\mathrm{init}}$.
  for $t = 0$ to $T - 1$ do
    $\mathbf{W}^{(t+1)} = \mathbf{W}^{(t)} - \eta\sum_{i=1}^{n}\nabla\ell\big(\textrm{FC-NN}[\mathbf{W}^{(t)}](\mathbf{x}_i), y_i\big)$
  return $h = \mathrm{sign}\big[\textrm{FC-NN}[\mathbf{W}^{(T)}]\big]$.

Corollary C.2.
Fully-connected networks trained with (stochastic) gradient descent from i.i.d. Gaussian initialization are equivariant under the orthogonal group.
Proof of Corollary C.2.
We will verify the three conditions required in Theorem C.1 one by one. The only place we use the FC structure is for the first condition.
Lemma C.3.
There is a subgroup $\mathcal{G}_\mathcal{W}$ of $O(m)$ and a group isomorphism $\tau: \mathcal{G}_\mathcal{X} = O(d) \to \mathcal{G}_\mathcal{W}$, such that $\textrm{FC-NN}[\tau(R)(\mathbf{W})] \circ R = \textrm{FC-NN}[\mathbf{W}]$ for all $\mathbf{W} \in \mathcal{W}$, $R \in \mathcal{G}_\mathcal{X}$.

Proof of Lemma C.3.
By definition,
FC-NN [ W ]( x ) could be written FC-NN [ W L ]( σ ( W x )) ,which implies FC-NN [ W ]( x ) = FC-NN [ W R − , W L ]( R x ) , ∀ R ∈ O ( d ) , and thus we canpick τ ( R ) = O ∈ O ( m ) , where O ( W ) = [ W R − , W L ] , and G W = τ ( O ( d )) .A notable property of Gradient Descent is that it is invariant under orthogonal re-parametrization. For-mally, given loss function L : R m → R and parameters W ∈ R m , an orthogonal re-parametrizationof the problem is to replace ( L , W ) by ( L ◦ O − , OW ) , where O ∈ R m × m is an orthogonal matrix. Lemma C.4 (Gradient Descent is invariant under orthogonal re-parametization) . For any L , W andorthogonal matrix O ∈ R m × m , we have O GD L ( W ) = GD L◦ O − ( OW ) . Proof sketch of Lemma C.4.
Chain rule.For any R ∈ O ( d ) , and set O = τ ( R ) by Lemma C.3, [ L ◦ O − ]( W ) = (cid:80) ni =1 (cid:96) ( y i FC-NN [ O − ( W )]( x i )) = (cid:80) ni =1 (cid:96) ( y i FC-NN [ W ]( R x i )) . The second condition in Theo-rem C.1 is satisfied by plugging above equality into Lemma C.4.The third condition is also satisfied since the initialization distribution is i.i.d. Gaussian, which isknown to be orthogonal invariant. In fact, from the proof, it suffices to have the initialization of thefirst layer invariant under G X . Corollary C.5.
FC nets trained with Newton's method from zero initialization for the first layer and any initialization for the remaining parameters are $GL(d)$-equivariant, or equivariant under the group of invertible linear transformations.

Here, Newton's method means using $\mathrm{NT}_\mathcal{L}(\mathbf{W}) = \mathbf{W} - \eta(\nabla^2\mathcal{L}(\mathbf{W}))^{-1}\nabla\mathcal{L}(\mathbf{W})$ as the update rule, and we assume $\nabla^2\mathcal{L}(\mathbf{W})$ is invertible.

Proof of Corollary C.5. The proof is almost the same as that of Corollary C.2, except for the following modifications.
Condition 1:
If we replace $O(d)$, $O(m)$ by $GL(d)$, $GL(m)$ in the statement and proof of Lemma C.3, the lemma still holds.

Condition 2:
By the chain rule, one can verify that the update rule of Newton's method is invariant under invertible linear re-parametrization, i.e., $O\,\mathrm{NT}_\mathcal{L}(\mathbf{W}) = \mathrm{NT}_{\mathcal{L} \circ O^{-1}}(O\mathbf{W})$ for every invertible matrix $O$.

Condition 3:
Since the first layer is initialized to be 0, it is invariant under any linear transformation.
Remark C.2.
The above results can be easily extended to the case of momentum and $\ell_p$ regularization. For momentum, we only need to ensure that the following update rule, $\mathbf{W}^{(t+1)} = \mathrm{GDM}(\mathbf{W}^{(t)}, \mathbf{W}^{(t-1)}, \mathcal{M}, \{\mathbf{x}_i, y_i\}_{i=1}^n) = (1 + \gamma)\mathbf{W}^{(t)} - \gamma\mathbf{W}^{(t-1)} - \eta\nabla\mathcal{L}(\mathbf{W}^{(t)})$, also satisfies the property in Lemma C.4. For $\ell_p$ regularization, because $\|\mathbf{W}\|_p$ is independent of $\{\mathbf{x}_i, y_i\}_{i=1}^n$, we only need to ensure $\|\mathbf{W}\|_p = \|\tau(R)(\mathbf{W})\|_p$ for all $R \in \mathcal{G}_\mathcal{X}$, which is easy to check when $\mathcal{G}_\mathcal{X}$ only contains permutations or sign flips.

C.1 EXAMPLES OF EQUIVARIANCE FOR NON-ITERATIVE ALGORITHMS
To demonstrate the wide applicability of our lower bounds, we give two more examples of algorithmic equivariance where the algorithm is not iterative. The proofs are folklore.
Definition C.2.
Given a positive semi-definite kernel $K$, the kernel regression algorithm $\mathrm{REG}_K$ is defined as:
$$\mathrm{REG}_K(\{\mathbf{x}_i, y_i\}_{i=1}^n)(\mathbf{x}) := \mathbb{1}\big[K(\mathbf{x}, X_n) \cdot K(X_n, X_n)^\dagger\,\mathbf{y} \geq 0\big],$$
where $K(X_n, X_n) \in \mathbb{R}^{n \times n}$ with $[K(X_n, X_n)]_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{y} = [y_1, y_2, \ldots, y_n]^\top$ and $K(\mathbf{x}, X_n) = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_n)]$.

Kernel Regression: If the kernel $K$ is $\mathcal{G}_\mathcal{X}$-equivariant, i.e., $K(g(\mathbf{x}), g(\mathbf{y})) = K(\mathbf{x}, \mathbf{y})$ for all $g \in \mathcal{G}_\mathcal{X}$, $\mathbf{x}, \mathbf{y} \in \mathcal{X}$, then the algorithm $\mathrm{REG}_K$ is $\mathcal{G}_\mathcal{X}$-equivariant.

ERM: If $\mathcal{F} = \mathcal{F} \circ \mathcal{G}_\mathcal{X}$, and $\mathrm{argmin}_{h \in \mathcal{F}}\sum_{i=1}^{n}\mathbb{1}[h(\mathbf{x}_i) \neq y_i]$ is unique, then $\mathrm{ERM}_\mathcal{F}$ is $\mathcal{G}_\mathcal{X}$-equivariant.
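A small NumPy sketch (ours) of the kernel regression claim: with a rotation-invariant kernel such as the Gaussian RBF kernel, the predictor trained on rotated data and evaluated at the rotated test point matches the original predictor. (The sketch returns labels in {-1, +1} rather than the {0, 1} indicator of Definition C.2.)

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # K(x, y) = exp(-gamma ||x - y||^2), invariant under any orthogonal map
    return np.exp(-gamma * np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2))

def kernel_regression(xs, ys, x_test, kernel=rbf_kernel):
    # REG_K: threshold K(x, X_n) K(X_n, X_n)^+ y at zero
    K = kernel(xs, xs)
    k = kernel(x_test[None, :], xs)[0]
    score = k @ np.linalg.pinv(K) @ ys
    return 1.0 if score >= 0 else -1.0

rng = np.random.default_rng(0)
d, n = 6, 20
xs = rng.standard_normal((n, d))
ys = rng.choice([-1.0, 1.0], n)
x = rng.standard_normal(d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal transformation

pred = kernel_regression(xs, ys, x)
pred_rotated = kernel_regression(xs @ Q.T, ys, Q @ x)
print(pred == pred_rotated)   # True: REG_K with an invariant kernel is orthogonal equivariant
```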
D OMITTED PROOFS
D.1 PROOFS OF SAMPLE COMPLEXITY REDUCTION FOR GENERAL EQUIVARIANCE
Given a $\mathcal{G}_\mathcal{X}$-equivariant algorithm $\mathcal{A}$, by definition, $N^*(\mathcal{A}, \mathcal{P}, \varepsilon) = N^*(\mathcal{A}, \mathcal{P} \circ g^{-1}, \varepsilon)$ for all $g \in \mathcal{G}_\mathcal{X}$. Consequently, we have
$$N^*(\mathcal{A}, \mathcal{P}, \varepsilon) = N^*(\mathcal{A}, \mathcal{P} \circ \mathcal{G}_\mathcal{X}, \varepsilon). \quad (12)$$

Lemma D.1. Let $\mathbb{A}$ be the set of all algorithms and $\mathbb{A}_{\mathcal{G}_\mathcal{X}}$ be the set of all $\mathcal{G}_\mathcal{X}$-equivariant algorithms. The following inequality holds, and the equality is attained when $\mathcal{G}_\mathcal{X}$ is a compact group:
$$\inf_{\mathcal{A} \in \mathbb{A}_{\mathcal{G}_\mathcal{X}}} N^*(\mathcal{A}, \mathcal{P}, \varepsilon) \geq \inf_{\mathcal{A} \in \mathbb{A}} N^*(\mathcal{A}, \mathcal{P} \circ \mathcal{G}_\mathcal{X}, \varepsilon). \quad (13)$$

Proof of Lemma D.1.
Take infimum over A G X over the both side of Equation 12, and note that A G X ⊂ A , Inequality is immediate.Suppose the group G X is compact and let µ be the Haar measure on it, i.e. ∀ S ⊂ G X , g ∈ G X , µ ( S ) = µ ( g ◦ S ) . We claim for each algorithm A , the sample complexity of the following equivariant algorithm A (cid:48) is no higher than that of A on P (cid:5) G X : A (cid:48) ( { x i , y i } ni =1 ) = A ( { g ( x i ) , y i } ni =1 ) ◦ g, where g ∼ µ.
14y the definition of Haar measure, A (cid:48) is G X -equivariant. Moreover, for any fixed n ≥ , we have inf P ∈P E ( x i ,y i ) ∼ P [err P ( A (cid:48) ( { x i , y i } ni =1 ))] = inf P ∈P E g ∼ µ E ( x i ,y i ) ∼ P ◦ g − [err P ( A ( { x i , y i } ni =1 ))] ≥ inf P ∈P inf g ∈G X E ( x i ,y i ) ∼ P ◦ g − [err P ( A ( { x i , y i } ni =1 ))] = inf P ∈P◦G X E ( x i ,y i ) ∼ P [err P ( A ( { x i , y i } ni =1 ))] , which implies inf A∈ A GX N ∗ ( A , P , ε ) ≤ inf A∈ A N ∗ ( A , P ◦ G X , ε ) . Proof of Theorem 5.1.
Simply note that $(\mathcal{P}_\mathcal{X} \diamond \mathcal{H}) \circ \mathcal{G}_\mathcal{X} = \cup_{g \in \mathcal{G}_\mathcal{X}}(\mathcal{P}_\mathcal{X} \circ g) \diamond (\mathcal{H} \circ g) = \cup_{g \in \mathcal{G}_\mathcal{X}}\mathcal{P}_\mathcal{X} \diamond (\mathcal{H} \circ g) = \mathcal{P}_\mathcal{X} \diamond (\mathcal{H} \circ \mathcal{G}_\mathcal{X})$; the theorem is immediate from Lemma D.1.

D.2 PROOF OF THEOREM 4.1
Lemma D.2.
Define $h_U = \mathrm{sign}\big[\mathbf{x}_{1:d}^\top U\,\mathbf{x}_{d+1:2d}\big]$ for all $U \in \mathbb{R}^{d \times d}$. We have $\mathcal{H} = \{h_U \mid U \in O(d)\} \subseteq \mathrm{sign}\big[\sum_{i=1}^{d} x_i^2 - \sum_{i=d+1}^{2d} x_i^2\big] \circ O(2d)$.

Proof.
Note that (cid:20) UU (cid:62) (cid:21) = (cid:20) I d U (cid:62) (cid:21) · (cid:20) I d I d (cid:21) · (cid:20) I d U (cid:21) , and (cid:20) I d I d (cid:21) = (cid:34) √ I d − √ I d √ I d √ I d (cid:35) · (cid:20) I d − I d (cid:21) · (cid:34) √ I d √ I d − √ I d √ I d (cid:35) , thus for any U ∈ O ( d ) , ∀ x ∈ R d , h U ( x ) = sign (cid:2) x (cid:62) d U x d +1:2 d (cid:3) = sign (cid:20) x (cid:62) (cid:20) UU (cid:62) (cid:21) x (cid:21) = sign (cid:20) g U ( x ) (cid:62) (cid:20) I d − I d (cid:21) g U ( x ) (cid:21) ∈ h ∗ ◦ O (2 d ) , (14)where g U ( x ) = (cid:20) I d U (cid:21) · (cid:34) √ I d − √ I d √ I d √ I d (cid:35) · x is an orthogonal transformation on R d . Lemma D.3.
Define $h_U = \mathrm{sign}\big[\mathbf{x}_{1:d}^\top U\,\mathbf{x}_{d+1:2d}\big]$ for all $U \in \mathbb{R}^{d \times d}$, and $\mathcal{H} = \{h_U \mid U \in O(d)\}$. We have $\mathrm{VCdim}(\mathcal{H}) \geq \frac{d(d-1)}{2}$.

Proof.
Now we claim H shatters { e i + e d + j } ≤ i Suppose d = 2 d (cid:48) for some integer d (cid:48) , we construct P = P X (cid:5) H , where P X is the set of all possible distributions on X = R k , and H = { sign (cid:104)(cid:80) d (cid:48) i =1 x i − (cid:80) d (cid:48) i = d (cid:48) +1 x i (cid:105) } . By Lemma D.2, H (cid:48) = { sign (cid:2) x (cid:62) d U x d +1:2 d (cid:3) | U ∈ O ( d (cid:48) ) } ⊆H ◦ O ( d ) . By Theorem 5.1, we have inf A∈ A GX N ∗ ( A , P X (cid:5) H , ε ) ≥ inf A∈ A N ∗ ( A , P X (cid:5) ( H ◦ G X ) , ε ) ≥ inf A∈ A N ∗ ( A , P X (cid:5) H (cid:48) , ε ) (15)By the lower bound in Theorem B.1, we have inf A∈ A N ∗ ( A , P X (cid:5) H (cid:48) , ε ) ≥ VCdim ( H (cid:48) )+ln δ ε . ByLemma D.3 VCdim ( H (cid:48) ) ≥ d (cid:48) ( d (cid:48) − = Ω( d ) . Upper Bound: Take CNN as defined in Section 3.1 with d = 2 d (cid:48) , r = 2 , k =1 , σ : R d (cid:48) → R , σ ( x ) = (cid:80) d (cid:48) i =1 x i (square activation + average pooling), we have F CNN = (cid:110) sign (cid:104)(cid:80) i =1 a i (cid:80) d (cid:48) j =1 x i − d (cid:48) + j w + b (cid:105) | a , a , w , b ∈ R (cid:111) .Note that min h ∈F CNN err P ( h ) = 0 , ∀ P ∈ P , and the VC dimension of F is 3, by Theorem B.1, we have ∀ P ∈ P , w.p. − δ , err P ( ERM F CNN ( { x i , y i } ni =1 )) ≤ ε, if n = Ω (cid:0) ε (cid:0) log ε + log δ ) (cid:1)(cid:1) . Convergence guarantee for Gradient Descent: Similar to that in the proof of Theorem 5.3.D.3 P ROOFS OF T HEOREM Lemma D.4. Define h U = sign (cid:2) x (cid:62) d U x d +1:2 d (cid:3) , H = { h U | U ∈ O ( d ) } , and ρ ( U, V ) := ρ X ( h U , h V ) = P x ∼ N (0 ,I d ) [ h U ( x ) (cid:54) = h V ( x )] . There exists a constant C , such that the packingnumber D ( H , ρ X , ε ) = D ( O ( d ) , ρ, ε ) ≥ (cid:0) Cε (cid:1) d ( d − . Proof of Lemma D.4. The key idea here is to first lower bound ρ X ( U, V ) by (cid:107) U − V (cid:107) F / √ d andapply volume argument in the tangent space of I d in O ( d ) . We have ρ ( h U , h V ) = P x ∼ N (0 ,I d ) [ h U ( x ) (cid:54) = h V ( x )]= P x ∼ N (0 ,I d ) (cid:2)(cid:0) x (cid:62) d U x d +1:2 d (cid:1) (cid:0) x (cid:62) d V x d +1:2 d (cid:1) < (cid:3) = 1 π E x d ∼ N (0 ,I d ) (cid:34) arccos (cid:32) x (cid:62) d U V (cid:62) x d (cid:107) x d (cid:107) (cid:33)(cid:35) ≥ π E x d ∼ N (0 ,I d ) (cid:34)(cid:115) − x (cid:62) d U V (cid:62) x d (cid:107) x d (cid:107) (cid:35) ( by Lemma A.1 )= 1 π E x ∼ S d − (cid:104)(cid:112) − x (cid:62) U V (cid:62) x (cid:105) = 1 π E x ∼ S d − (cid:2)(cid:13)(cid:13) ( U (cid:62) − V (cid:62) ) x (cid:13)(cid:13) F (cid:3) ≥ C (cid:107) U − V (cid:107) F / √ d ( by Lemma A.2 ) (16)Below we show it suffices to pack in the 0.4 (cid:96) ∞ neighborhood of I d . Let so ( d ) be the Lie algebraof SO ( d ) , i.e., { M ∈ R d × d | M = − M (cid:62) } . We also define the matrix exponential mapping exp : R d × d → R d × d , where exp( A ) = A + A + A + · · · . It holds that exp( so ( d )) = SO ( d ) ⊆ O ( d ) .The benefit of covering in such neighborhood is that it allows us to translate the problem into thetangent space of I d by the following lemma. 16 emma D.5 (Implication of Lemma 4 in (Szarek, 1997)) . For any matrix A, B ∈ so ( d ) , satisfyingthat (cid:107) A (cid:107) ∞ ≤ π , (cid:107) B (cid:107) ∞ ≤ π , we have . (cid:107) A − B (cid:107) F ≤ (cid:107) exp( A ) − exp( B ) (cid:107) F ≤ (cid:107) A − B (cid:107) F . (17)Therefore, we have D ( H , ρ X , ε ) ≥ D ( O ( d ) , C (cid:107)·(cid:107) F / √ d, ε ) ≥ D ( so ( d ) ∩ π B d ∞ , C (cid:107)·(cid:107) F / √ d, . 
ε ) . (18)Note that so ( d ) is a d ( d − -dimensional subspace of R d , by Inverse Santalo’s inequality (Lemma 3,(Ma & Wu, 2015)), we have (cid:32) vol ( so ( d ) ∩ B d ∞ ) vol ( so ( d ) ∩ B d ) (cid:33) d ( d − ≥ C (cid:112) dim ( so ( d )) E G ∼ N (0 ,I d ) (cid:2)(cid:13)(cid:13) Π so ( d ) ( G ) (cid:13)(cid:13) ∞ (cid:3) . where vol is the d ( d − volume defined in the space of so ( d ) and Π so ( d ) ( G ) = G − G (cid:62) is theprojection operator onto the subspace so ( d ) . We further have E G ∼ N (0 ,I d ) (cid:2)(cid:13)(cid:13) Π so ( d ) ( G ) (cid:13)(cid:13) ∞ (cid:3) = E G ∼ N (0 ,I d ) (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) G − G (cid:62) (cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:21) ≤ E G ∼ N (0 ,I d ) [ (cid:107) G (cid:107) ∞ ] ≤ C √ d, where the last inequality is by Theorem 4.4.5, Vershynin (2018).Finally, we have D ( so ( d ) ∩ π B d ∞ , C (cid:107)·(cid:107) F / √ d, . ε )= D ( so ( d ) ∩ B d ∞ , (cid:107)·(cid:107) F , √ dεC π ) ≥ vol ( so ( d ) ∩ B d ∞ ) vol ( so ( d ) ∩ B d ) × (cid:18) C π √ dε (cid:19) d ( d − ≥ C C π (cid:113) d ( d − dε d ( d − := (cid:18) Cε (cid:19) d ( d − (19)D.4 P ROOF OF T HEOREM Proof of Theorem 5.3. Lower bound: Note P = { N (0 , I d ) } (cid:5) H , where H = { sign (cid:104)(cid:80) di =1 α i x i (cid:105) | α i ∈ R } . Since N (0 , I d ) is invariant under all orthogonal transformations, by Theorem 5.1, inf equivariant A N ∗ ( A , N (0 , I d ) ◦ H , ε ) = inf A N ∗ ( A , N (0 , I d ) (cid:5) ( H ◦ O ( d )) , ε ) . Furthermore, it canbe show that H ◦ O ( d ) = { sign (cid:104)(cid:80) i,j β ij x i x j (cid:105) | β ij ∈ R } , the sign functions of all quadratics in R d . Thus it suffices to show learning quadratic functions on Gaussian distribution needs Ω( d /ε ) samples for any algorithm (see Lemma D.6, where we assume the dimension d can be divided by ).17 pper bound: Take CNN as defined in Section 3.1 with d = d (cid:48) , r = 1 , k = 1 , σ : R → R , σ ( x ) = x (square activation + no pooling), we have F CNN = (cid:110) sign (cid:104)(cid:80) di =1 a i w x i + b (cid:105) | a i , w , b ∈ R (cid:111) = (cid:110) sign (cid:104)(cid:80) di =1 a i x i + b (cid:105) | a i , b ∈ R (cid:111) .Note that min h ∈F CNN err P ( h ) = 0 , ∀ P ∈ P , and the VC dimension of F is d + 1 , by Theorem B.1, wehave ∀ P ∈ P , w.p. − δ , err P ( ERM F CNN ( { x i , y i } ni =1 )) ≤ ε, if n = Ω (cid:0) ε (cid:0) d log ε + log δ ) (cid:1)(cid:1) . Convergence guarantee for Gradient Descent: We initialize all the parameters by i.i.d. standardgaussian and train the second layer by gradient descent only, i.e. set the LR of w as 0. (Notetraining the second layer only is still a orthogonal-equivariant algorithm for FC nets, thus it’s a validseparation.)For any convex non-increasing surrogate loss of 0-1 loss l satisfying l (0) ≥ , lim x →∞ l ( x ) = 0 e.g.logistic loss, we define the loss of the weight W as L ( W ) = n (cid:88) i =1 l ( F CNN [ W ]( x i ) y i ) = n (cid:88) i =1 l (cid:32) ( d (cid:88) i =1 a i x i + b ) y i (cid:33) , which is convex in a i and b . Note w (cid:54) = 0 with probability 1, which means the data are separableeven with fixed first layer, i.e. min a ,b L ( W ) = L ( W ) | a = a ∗ ,b =0 = 0 , where a ∗ is the ground truth.Thus with sufficiently small step size, GD converges to 0 loss solution. By the definition of surrogateloss, L ( W ) < implies for x i , l ( x i y i ) < and thus the training error is .D.5 P ROOF OF L EMMA D.6 Lemma D.6. 
For A ∈ R d × d , we define M A ∈ R d × d as M A = (cid:20) A I d (cid:21) , and h A : R d → {− , } as h A ( x ) = sign (cid:2) x (cid:62) d M A x d +1:4 d (cid:3) . Then for H = { h A | ∀ A ∈ R d × d } ⊆{ sign (cid:2) x (cid:62) A x ] |∀ A ∈ R d × d (cid:3) } , satisfies that it holds that for any d , algorithm A and ε > , N ∗ ( A , { N (0 , I d ) } (cid:5) H , ε ) = Ω( d ε ) . Proof of Lemma D.6. Below we will prove a Ω( (cid:0) ε (cid:1) d ) lower bound for packing number, i.e. D ( H , ρ X , ε ) = D ( R d × d , ρ, ε ) , where ρ ( U, V ) = ρ X ( h U , h V ) . Then we can apply Long’simproved version Equation (2) of Benedek-Itai’s lower bound and get a Ω( d /ε ) sample complexitylower bound. The reason that we can get the correct rate of ε is that the VCdim ( H ) is exactly equalto the exponent of the packing number. (cf. the proof of Theorem 5.2)Similar to the proof of Theorem 5.2, the key idea here is to first lower bound ρ ( U, V ) by (cid:107) U − V (cid:107) F / √ d and apply volume argument. Recall for A ∈ R d × d , we define M A ∈ R d × d as M A = (cid:20) A I d (cid:21) , and h A : R d → {− , } as h A ( x ) = sign (cid:2) x (cid:62) d M A x d +1:4 d (cid:3) . Then for H = { h A | ∀ A ∈ R d × d } . Below we will see it suffices to lower bound the packing num-ber of a subset of R d × d , i.e. I d + 0 . B d ∞ , where B d ∞ is the unit spectral norm ball. Clearly ∀ x , (cid:107) x (cid:107) = 1 , ∀ U ∈ I d + 0 . B d ∞ , . ≤ (cid:107) U x (cid:107) ≤ . .18hus ∀ U, V ∈ I d + 0 . B d ∞ we have, ρ X ( h U , h V ) = P x ∼ N (0 ,I d ) [ h U ( x ) (cid:54) = h V ( x )]= P x ∼ N (0 ,I d ) (cid:2)(cid:0) x (cid:62) d M U x d +1:4 d (cid:1) (cid:0) x (cid:62) d M V x d +1:4 d (cid:1) < (cid:3) = 1 π E x d ∼ N (0 ,I d ) (cid:34) arccos (cid:32) x (cid:62) d M U M (cid:62) V x d (cid:13)(cid:13) M (cid:62) U x d (cid:13)(cid:13) (cid:13)(cid:13) M (cid:62) V x d (cid:13)(cid:13) (cid:33)(cid:35) ≥ π E x d ∼ N (0 ,I d ) (cid:34)(cid:115) − x (cid:62) d M U M (cid:62) V x d (cid:13)(cid:13) M (cid:62) U x d (cid:13)(cid:13) (cid:13)(cid:13) M (cid:62) V x d (cid:13)(cid:13) (cid:35) ( by Lemma A.1 ) ≥ √ . π E x d ∼ N (0 ,I d ) (cid:20)(cid:113)(cid:13)(cid:13) M (cid:62) U x d (cid:13)(cid:13) (cid:13)(cid:13) M (cid:62) V x d (cid:13)(cid:13) − x (cid:62) d M U M (cid:62) V x d (cid:21) = 11 . π E x d ∼ N (0 ,I d ) (cid:20)(cid:113)(cid:13)(cid:13) ( M (cid:62) U − M (cid:62) V ) x d (cid:13)(cid:13) − (cid:0)(cid:13)(cid:13) M (cid:62) U x d (cid:13)(cid:13) − (cid:13)(cid:13) M (cid:62) V x d (cid:13)(cid:13) (cid:1) (cid:21) ≥ . π (cid:18) E x d ∼ N (0 ,I d ) (cid:2)(cid:13)(cid:13) ( M (cid:62) U − M (cid:62) V ) x d (cid:13)(cid:13) (cid:3) − E x d ∼ N (0 ,I d ) (cid:2)(cid:12)(cid:12)(cid:13)(cid:13) M (cid:62) U x d (cid:13)(cid:13) − (cid:13)(cid:13) M (cid:62) V x d (cid:13)(cid:13) (cid:12)(cid:12)(cid:3)(cid:19) ≥ C . π E x d ∼ N (0 ,I d ) (cid:2)(cid:13)(cid:13) ( M (cid:62) U − M (cid:62) V ) x d (cid:13)(cid:13) (cid:3) ( by Lemma D.7 ) ≥ C (cid:107) M U − M V (cid:107) F / √ d ( by Lemma A.2 )= C (cid:107) U − V (cid:107) F / √ d (20)It remains to lower bound the packing number. We have M (0 . B d ∞ , C (cid:107)·(cid:107) F / √ d, ε ) ≥ vol ( B d ∞ ) vol ( B d ) × (cid:18) . C √ dε (cid:19) d ≥ (cid:18) Cε (cid:19) d , (21)for some constant C . The proof is completed by plugging the above bound and VCdim ( H ) = d into Equation (2). Lemma D.7. 
Suppose $\mathbf x,\mathbf y\sim N(0,I_d)$ are independent, and $R,S\in\mathbb R^{d\times d}$ have spectral norm at most $1.1$ (as is the case for the matrices to which the lemma is applied above). Then
\[
\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\big]-\mathbb E_{\mathbf x,\mathbf y}\left[\left|\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}-\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}\right|\right]\ \ge\ C\,\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\big],\tag{22}
\]
for some constant $C>0$ independent of $R$, $S$ and $d$.

Proof of Lemma D.7. Note that
\begin{align*}
\left|\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}-\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}\right|
&=\frac{\big|\|R\mathbf x\|^2-\|S\mathbf x\|^2\big|}{\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}+\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}}\\
&=\big|\|R\mathbf x\|-\|S\mathbf x\|\big|\cdot\frac{\|R\mathbf x\|+\|S\mathbf x\|}{\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}+\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}}\\
&\le\|(R-S)\mathbf x\|\cdot\frac{\|R\mathbf x\|+\|S\mathbf x\|}{\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}+\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}}.
\end{align*}
Let $F(x,d)$ be the cdf of the chi-square distribution with $d$ degrees of freedom, i.e. $F(x,d)=\mathbb P_{\mathbf x\sim N(0,I_d)}\big[\|\mathbf x\|^2\le x\big]$. Setting $x=zd$ with $z<1$, we have $F(zd,d)\le(ze^{1-z})^{d/2}\le(ze^{1-z})^{1/2}$. Thus $\mathbb P_{\mathbf y}\big[\|\mathbf y\|^2\le d/2\big]<p_0$ for some constant $p_0<1$ independent of $d$, which implies that for any $\|\mathbf x\|\le10\sqrt d$ (or $C\sqrt d$ for any sufficiently large constant $C$),
\[
\mathbb E_{\mathbf y}\left[\frac{\|R\mathbf x\|+\|S\mathbf x\|}{\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}+\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}}\right]\ \le\ 1-\alpha_1
\]
for some constant $\alpha_1>0$. Therefore, for any such $\mathbf x$,
\[
\mathbb E_{\mathbf y}\left[\left|\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}-\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}\right|\right]\ \le\ (1-\alpha_1)\,\|(R-S)\mathbf x\|.
\]
Therefore, we have
\begin{align*}
&\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\big]-\mathbb E_{\mathbf x,\mathbf y}\left[\left|\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}-\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}\right|\right]\\
&\ge\ \mathbb E_{\mathbf x}\Big[\|(R-S)\mathbf x\|\,\mathbb 1\big[\|\mathbf x\|\le10\sqrt d\big]\Big]-\mathbb E_{\mathbf x,\mathbf y}\left[\left|\sqrt{\|R\mathbf x\|^2+\|\mathbf y\|^2}-\sqrt{\|S\mathbf x\|^2+\|\mathbf y\|^2}\right|\,\mathbb 1\big[\|\mathbf x\|\le10\sqrt d\big]\right]\\
&\ge\ \alpha_1\,\mathbb E_{\mathbf x}\Big[\|(R-S)\mathbf x\|\,\mathbb 1\big[\|\mathbf x\|\le10\sqrt d\big]\Big]\ \ge\ \alpha_1\alpha_2\,\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\big]
\end{align*}
for some constant $\alpha_2>0$; the first inequality uses that the integrand is pointwise nonnegative.

It remains to justify the last inequality, i.e. to exhibit a constant $\alpha_2>0$ with $\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\,\mathbb 1[\|\mathbf x\|^2\le100d]\big]\ge\alpha_2\,\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\big]$. Conditioned on $\|\mathbf x\|^2=r$, the vector $\mathbf x$ is uniform on the sphere of radius $\sqrt r$, so $T(r):=\mathbb E_{\mathbf x}\big[\|(R-S)\mathbf x\|\mid\|\mathbf x\|^2=r\big]=\sqrt r\,T(1)$. Since $\sqrt r\le r/(10\sqrt d)$ for $r\ge100d$, while $\int_0^\infty\sqrt r\,\mathrm dF(r,d)=\mathbb E\|\mathbf x\|\ge\sqrt d/2$ and $\int_0^\infty r\,\mathrm dF(r,d)=\mathbb E_{\mathbf x\sim N(0,I_d)}\big[\|\mathbf x\|^2\big]=d$, it suffices to show that $\int_{r=100d}^\infty r\,\mathrm dF(r,d)$ is a small constant multiple of $d$. Using integration by parts,
\[
\int_{r=100d}^\infty r\,\mathrm dF(r,d)=\Big[-(1-F(r,d))\,r\Big]_{r=100d}^\infty+\int_{r=100d}^\infty(1-F(r,d))\,\mathrm dr=100d\,(1-F(100d,d))+\int_{r=100d}^\infty(1-F(r,d))\,\mathrm dr.
\]
Here we use the other side of the chi-square tail bound: for $z>1$, $1-F(zd,d)<(ze^{1-z})^{d/2}<(ze^{1-z})^{1/2}$. We have
\[
\frac1d\int_{r=100d}^\infty r\,\mathrm dF(r,d)=100\,(1-F(100d,d))+\int_{z=100}^\infty(1-F(dz,d))\,\mathrm dz\ \le\ 100\,(100e^{-99})^{1/2}+\int_{z=100}^\infty(ze^{1-z})^{1/2}\,\mathrm dz.
\]
The first term, $100(100e^{-99})^{1/2}$, is much smaller than $1/2$. For the second term, since $z\le e^{z/2}$ for all $z\ge0$, we have $\int_{z=100}^\infty(ze^{1-z})^{1/2}\,\mathrm dz\le e^{1/2}\int_{z=100}^\infty e^{-z/4}\,\mathrm dz=4e^{1/2}e^{-25}\ll\tfrac12$. Thus such an $\alpha_2$ exists and is independent of $d$.
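The chi-square lower-tail bound invoked in the proof is easy to check numerically; the sketch below (purely illustrative, not part of the argument) compares the empirical probability $\mathbb P[\|\mathbf y\|^2\le d/2]$ with the analytic bound $(ze^{1-z})^{d/2}$ at $z=1/2$, confirming that it stays below a constant $p_0<1$ that does not grow with $d$.

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (10, 100, 1000):
    s = rng.chisquare(d, size=200_000)          # ||y||^2 for y ~ N(0, I_d) is chi-square with d dof
    emp = np.mean(s <= d / 2)                   # empirical P[||y||^2 <= d/2]
    bound = (0.5 * np.exp(0.5)) ** (d / 2)      # (z e^{1-z})^{d/2} at z = 1/2
    print(f"d={d:5d}  empirical={emp:.2e}  analytic bound={bound:.2e}")
```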
D.6 PROOFS OF THEOREM 5.4

Lemma D.8. Let $M\in\mathbb R^{d\times d}$. Then
\[
\mathbb E_{\mathbf x\sim N(0,I_d)}\big[(\mathbf x^\top M\mathbf x)^2\big]=2\left\|\frac{M+M^\top}{2}\right\|_F^2+\big(\mathrm{tr}[M]\big)^2.
\]

Proof of Lemma D.8.
\begin{align*}
\mathbb E_{\mathbf x\sim N(0,I_d)}\big[(\mathbf x^\top M\mathbf x)^2\big]
&=\mathbb E_{\mathbf x\sim N(0,I_d)}\sum_{i,j,i',j'}x_ix_jx_{i'}x_{j'}M_{ij}M_{i'j'}\\
&=\sum_{i\ne j}\big(M_{ij}^2+M_{ij}M_{ji}+M_{ii}M_{jj}\big)\Big(\mathbb E_{x\sim N(0,1)}\big[x^2\big]\Big)^2+\sum_iM_{ii}^2\,\mathbb E_{x\sim N(0,1)}\big[x^4\big]\\
&=\sum_{i\ne j}\big(M_{ij}^2+M_{ij}M_{ji}+M_{ii}M_{jj}\big)+3\sum_iM_{ii}^2
\ =\ 2\left\|\frac{M+M^\top}{2}\right\|_F^2+\big(\mathrm{tr}[M]\big)^2.
\end{align*}
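Lemma D.8 is simple to verify numerically. The following sketch (illustrative only) compares a Monte Carlo estimate of $\mathbb E[(\mathbf x^\top M\mathbf x)^2]$ for one random $M$ with the closed form, and prints the value $d(d+2)$ obtained after additionally averaging the closed form over $M$ with i.i.d. $N(0,1)$ entries, as used in step (1) of the proof of Theorem 5.4 below.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_mc = 6, 400_000

M = rng.standard_normal((d, d))
x = rng.standard_normal((n_mc, d))
quad = np.einsum('ni,ij,nj->n', x, M, x)          # x^T M x for each sample

lhs = np.mean(quad**2)                            # Monte Carlo E[(x^T M x)^2]
rhs = 2 * np.linalg.norm((M + M.T) / 2, 'fro')**2 + np.trace(M)**2
print("Monte Carlo E[(x^T M x)^2]:", lhs)
print("closed form               :", rhs)
print("average over M, d(d+2)    :", d * (d + 2))  # reference value used for E[y^2] in the proof
```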
Proof of Theorem 5.4. Lower bound: Similar to the proof of Theorem 5.3, it suffices to show that for any algorithm $\mathcal A$,
\[
N^*\big(\mathcal A,\{N(0,I_d)\}\circledast(\mathcal H\circ O(d)),\varepsilon\big)\ \ge\ \frac{d(d+1)}{2}-\frac{d(d+2)}{2}\,\varepsilon.
\]
Note that $\mathcal H\circ O(d)=\{\sum_{i,j}\beta_{ij}x_ix_j\mid\beta_{ij}\in\mathbb R\}$ is the set of all quadratic functions. For convenience we denote $h_M(\mathbf x)=\mathbf x^\top M\mathbf x$ for $M\in\mathbb R^{d\times d}$. We claim that any learning algorithm $\mathcal A$ taking at most $n$ samples must suffer expected loss at least $d(d+1)-2n$ when the ground-truth quadratic function is sampled with i.i.d. Gaussian coefficients, while the expected loss is at most $d(d+2)$ for the trivial algorithm that always predicts $0$. In other words, if the expected relative loss satisfies $\varepsilon\le\frac{d(d+1)-2n}{d(d+2)}$, we must have $N^*(\mathcal A,\mathcal P,\varepsilon)\ge n$; that is, $N^*(\mathcal A,\mathcal P,\varepsilon)\ge\frac{d(d+1)}{2}-\frac{d(d+2)}{2}\varepsilon$.

(1) Upper bound for $\mathbb E[y^2]$: by Lemma D.8,
\[
\mathbb E_{M\sim N(0,I_{d\times d})}\,\mathbb E_{\mathbf x\sim P_X,\,y=\mathbf x^\top M\mathbf x}\big[y^2\big]
=\mathbb E_{M}\left[2\left\|\frac{M+M^\top}{2}\right\|_F^2+(\mathrm{tr}[M])^2\right]=d(d+2).
\]

(2) Lower bound for the expected loss: the infimum of the test loss over all possible algorithms $\mathcal A$ is
\begin{align*}
&\inf_{\mathcal A}\ \mathbb E_{M\sim N(0,I_{d\times d})}\Big[\mathbb E_{(\mathbf x_i,y_i)\sim P_X\circledast h_M}\big[\ell_P\big(\mathcal A(\{\mathbf x_i,y_i\}_{i=1}^n)\big)\big]\Big]\\
&=\inf_{\mathcal A}\ \mathbb E_{M}\Big[\mathbb E_{(\mathbf x_i,y_i)\sim P_X\circledast h_M}\Big[\mathbb E_{\mathbf x,y\sim P_X\circledast h_M}\big[\big([\mathcal A(\{\mathbf x_i,y_i\}_{i=1}^n)](\mathbf x)-y\big)^2\big]\Big]\Big]\\
&=\inf_{\mathcal A}\ \mathbb E_{M}\Big[\mathbb E_{\mathbf x_i\sim P_X}\Big[\mathbb E_{\mathbf x\sim P_X}\big[\big([\mathcal A(\{\mathbf x_i,h_M(\mathbf x_i)\}_{i=1}^n)](\mathbf x)-h_M(\mathbf x)\big)^2\big]\Big]\Big]\\
&\ge\mathbb E_{\mathbf x_i,\mathbf x\sim P_X,\,M\sim N(0,I_{d\times d})}\Big[\mathrm{Var}_M\big[h_M(\mathbf x)\mid\{\mathbf x_i,h_M(\mathbf x_i)\}_{i=1}^n,\mathbf x\big]\Big]\\
&=\mathbb E_{\mathbf x_i,\mathbf x\sim P_X,\,M\sim N(0,I_{d\times d})}\Big[\mathrm{Var}_M\big[h_M(\mathbf x)\mid\{h_M(\mathbf x_i)\}_{i=1}^n\big]\Big],
\end{align*}
where the inequality becomes an equality for $[\mathcal A(\{\mathbf x_i,y_i\}_{i=1}^n)](\mathbf x)=\mathbb E_M\big[h_M(\mathbf x)\mid\{\mathbf x_i,y_i\}_{i=1}^n\big]$.

Thus it suffices to lower bound $\mathrm{Var}_M\big[h_M(\mathbf x)\mid\{h_M(\mathbf x_i)\}_{i=1}^n\big]$ for fixed $\{\mathbf x_i\}_{i=1}^n$ and $\mathbf x$. For convenience, let $\mathcal S_d=\{A\in\mathbb R^{d\times d}\mid A=A^\top\}$ be the linear space of all $d\times d$ symmetric matrices, equipped with the inner product $\langle A,B\rangle:=\mathrm{tr}[A^\top B]$, and let $\Pi_n:\mathbb R^{d\times d}\to\mathbb R^{d\times d}$ be the projection operator onto the orthogonal complement, inside $\mathcal S_d$, of the at most $n$-dimensional subspace spanned by $\{\mathbf x_i\mathbf x_i^\top\}_{i=1}^n$. By definition, we can expand
\[
\mathbf x\mathbf x^\top=\sum_{i=1}^n\alpha_i\,\mathbf x_i\mathbf x_i^\top+\Pi_n(\mathbf x\mathbf x^\top).
\]
Thus, even conditioned on $\{\mathbf x_i,y_i\}_{i=1}^n$ and $\mathbf x$,
\[
h_M(\mathbf x)=\mathrm{tr}[\mathbf x\mathbf x^\top M]=\sum_{i=1}^n\alpha_i\,\mathrm{tr}[\mathbf x_i\mathbf x_i^\top M]+\mathrm{tr}[\Pi_n(\mathbf x\mathbf x^\top)M]
\]
still follows a Gaussian distribution, with conditional variance $\big\|\Pi_n(\mathbf x\mathbf x^\top)\big\|_F^2$.

Note that we can always find symmetric matrices $E_i$ with $\|E_i\|_F=1$ and $\mathrm{tr}[E_i^\top E_j]=0$ for $i\ne j$ such that $\Pi_n(A)=\sum_{i=1}^kE_i\,\mathrm{tr}[E_i^\top A]$, where the rank $k$ of $\Pi_n$ is at least $\frac{d(d+1)}{2}-n$. Thus we have
\[
\mathbb E_{\mathbf x}\Big[\big\|\Pi_n(\mathbf x\mathbf x^\top)\big\|_F^2\Big]
=\mathbb E_{\mathbf x}\left\|\sum_{i=1}^kE_i\,\mathrm{tr}[E_i^\top\mathbf x\mathbf x^\top]\right\|_F^2
=\sum_{i=1}^k\mathbb E_{\mathbf x}\Big[\big\|E_i\,\mathrm{tr}[E_i^\top\mathbf x\mathbf x^\top]\big\|_F^2\Big]
=\sum_{i=1}^k\mathbb E_{\mathbf x}\big[(\mathbf x^\top E_i\mathbf x)^2\big]
\ \ge\ \sum_{i=1}^k2\|E_i\|_F^2\ \ge\ 2k\ \ge\ d(d+1)-2n,
\]
where the first inequality is by Lemma D.8 (each $E_i$ is symmetric). Thus the infimum of the expected test loss satisfies
\[
\inf_{\mathcal A}\ \mathbb E_{M}\Big[\mathbb E_{(\mathbf x_i,y_i)\sim P_X\circledast h_M}\big[\ell_P\big(\mathcal A(\{\mathbf x_i,y_i\}_{i=1}^n)\big)\big]\Big]
\ \ge\ \mathbb E_{\mathbf x_i\sim P_X}\Big[\mathbb E_{\mathbf x}\big[\big\|\Pi_n(\mathbf x\mathbf x^\top)\big\|_F^2\big]\Big]
\ \ge\ d(d+1)-2n.
\]

Upper bound: We use the same CNN construction as in the proof of Theorem 5.3, i.e. the function class is $\mathcal F_{\mathrm{CNN}}=\{\sum_{i=1}^da_iw^2x_i^2+b\mid a_i,w,b\in\mathbb R\}=\{\sum_{i=1}^da_ix_i^2+b\mid a_i,b\in\mathbb R\}$. Given $d+1$ samples, with probability $1$ the feature vectors $(x_1^2,x_2^2,\dots,x_d^2,1)$ of the samples are linearly independent, which means $\mathrm{ERM}_{\mathcal F_{\mathrm{CNN}}}$ recovers the ground truth and thus has $0$ loss.
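As a sanity check on the key quantity in the lower bound above, the sketch below (illustrative; `residual_norm_sq` is a hypothetical helper, not the authors' code) projects $\mathbf x\mathbf x^\top$ onto the orthogonal complement of $\mathrm{span}\{\mathbf x_i\mathbf x_i^\top\}$ via least squares and compares a Monte Carlo estimate of $\mathbb E_{\mathbf x}\|\Pi_n(\mathbf x\mathbf x^\top)\|_F^2$ with the bound $d(d+1)-2n$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, n_mc = 10, 20, 2000

Xtrain = rng.standard_normal((n, d))
basis = np.stack([np.outer(xi, xi).ravel() for xi in Xtrain], axis=1)   # shape (d^2, n)

def residual_norm_sq(x):
    # Pi_n(x x^T): remove the least-squares component of x x^T lying in span{x_i x_i^T}
    v = np.outer(x, x).ravel()
    coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
    return np.sum((v - basis @ coef) ** 2)

vals = [residual_norm_sq(x) for x in rng.standard_normal((n_mc, d))]
print("Monte Carlo E||Pi_n(x x^T)||_F^2:", np.mean(vals))
print("lower bound d(d+1) - 2n         :", d * (d + 1) - 2 * n)
```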
D.7 PROOF OF THEOREM 5.5

Proof of Theorem 5.5. Lower bound: We further define the permutation $\tau_i$ by $\tau_i(\mathbf x)=\mathbf x-(\mathbf e_{i+1}-\mathbf e_{i+2})(\mathbf e_{i+1}-\mathbf e_{i+2})^\top\mathbf x$, i.e. the transposition that swaps coordinates $i+1$ and $i+2$. Clearly, $\tau_i(\mathbf t_i)=\mathbf s_i$ and $\tau_i(\mathbf s_i)=\mathbf t_i$. For $i,j\in\{0,1,\dots,d-1\}$, define $d(i,j)=\min\{(i-j)\bmod d,\ (j-i)\bmod d\}$. It can be verified that if $d(i,j)\ge2$, then $\tau_i(\mathbf s_j)=\mathbf s_j$ and $\tau_i(\mathbf t_j)=\mathbf t_j$. For $\mathbf x\in\{\mathbf s_i,\mathbf t_i\}$ and $\mathbf x'\in\{\mathbf s_j,\mathbf t_j\}$, we define $d(\mathbf x,\mathbf x')=d(i,j)$.

Given $X_n,\mathbf y_n$, we have
\[
\mathbb P[B]:=\mathbb P_{\mathbf x}\big[d(\mathbf x,\mathbf x_i)\ge2,\ \forall i\big]\ \ge\ \frac{d-3n}{d}\ \ge\ \frac12\qquad\text{for }n\le d/6.
\]
Therefore, we have
\begin{align*}
\mathrm{err}_P\big(\mathcal A(X_n,\mathbf y_n)\big)
&=\mathbb P_{\mathbf x,y,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf x)\ne y\big]
\ \ge\ \mathbb P_{\mathbf x,y,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf x)\ne y\mid B\big]\,\mathbb P[B]\\
&\ge\ \tfrac12\,\mathbb P_{\mathbf x,y,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf x)\ne y\mid B\big]\\
&=\ \tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf s_i)\ne1\mid B\big]+\tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf t_i)\ne-1\mid B\big]\\
&=\ \tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(\tau_i(X_n),\mathbf y_n)(\tau_i(\mathbf s_i))\ne1\mid B\big]+\tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf t_i)\ne-1\mid B\big]&&\text{(by permutation equivariance)}\\
&=\ \tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf t_i)\ne1\mid B\big]+\tfrac14\,\mathbb P_{i,\mathcal A}\big[\mathcal A(X_n,\mathbf y_n)(\mathbf t_i)\ne-1\mid B\big]\ \ge\ \tfrac14,
\end{align*}
where the second-to-last equality uses that, conditioned on $B$, $\tau_i$ fixes every training point, so $\tau_i(X_n)=X_n$ and $\tau_i(\mathbf s_i)=\mathbf t_i$. Thus for any permutation-equivariant algorithm $\mathcal A$, $N^*\big(\mathcal A,\{P\},\tfrac14\big)\ge d/6=\Omega(d)$.

Upper bound: Take the CNN defined in Section 3.1 with $d'=d$, $r=1$, $k=2$, $\sigma:\mathbb R^d\to\mathbb R$, $\sigma(\mathbf x)=\sum_{i=1}^dx_i^2$; we have
\[
\mathcal F_{\mathrm{CNN}}=\Big\{\mathrm{sign}\Big[a\sum_{i=1}^d(w_1x_i+w_2x_{i-1})^2+b\Big]\ \Big|\ a,w_1,w_2,b\in\mathbb R\Big\}
\]
with cyclic indexing. Note that for all $h\in\mathcal F_{\mathrm{CNN}}$ and all $1\le i\le d$, the pre-sign output of $h$ is $a(2w_1^2+2w_2^2)+b$ on $\mathbf s_i$ and $a\big(w_1^2+w_2^2+(w_1+w_2)^2\big)+b$ on $\mathbf t_i$. Thus the probability that $\mathrm{ERM}_{\mathcal F_{\mathrm{CNN}}}$ does not achieve $0$ test error is at most the probability that all data in the training set are of type $\mathbf t_j$ or all are of type $\mathbf s_j$ (note the training error of $\mathrm{ERM}_{\mathcal F_{\mathrm{CNN}}}$ is $0$):
\[
\mathbb P\big[\mathbf x_i\in\{\mathbf s_j\}_{j=1}^d,\ \forall i\in[n]\big]+\mathbb P\big[\mathbf x_i\in\{\mathbf t_j\}_{j=1}^d,\ \forall i\in[n]\big]=2\times2^{-n}=2^{-n+1}.
\]

Convergence guarantee for Gradient Descent: We initialize all the parameters by i.i.d. standard Gaussian and train the second layer by gradient descent only, i.e. set the learning rate of $w_1,w_2$ to $0$. (Note that training the second layer only is still a permutation-equivariant algorithm for FC nets, thus it is a valid separation.) For any convex non-increasing surrogate loss $l$ of the 0-1 loss satisfying $l(0)\ge1$ and $\lim_{x\to\infty}l(x)=0$, e.g. the logistic loss, we define the loss of the weight $W$ as
\[
L(W)=\sum_{i=1}^nl\big(\mathcal F_{\mathrm{CNN}}[W](\mathbf x_i)\,y_i\big)
=N_s\cdot l\big(a(2w_1^2+2w_2^2)+b\big)+N_t\cdot l\big(-a\big(w_1^2+w_2^2+(w_1+w_2)^2\big)-b\big),
\]
where $N_s$ and $N_t$ denote the number of training examples of type $\mathbf s_j$ and $\mathbf t_j$, respectively. Note that $w_1w_2\ne0$ with probability $1$, so the two feature values differ and the data are separable even with the fixed first layer, i.e. $\inf_{a,b}L(W)=0$. Further note that $L(W)$ is convex in $a$ and $b$, which implies that with sufficiently small step size, GD drives the loss below $1$. By the definition of the surrogate loss, $L(W)<1$ implies $l\big(\mathcal F_{\mathrm{CNN}}[W](\mathbf x_i)y_i\big)<1$ for every $\mathbf x_i$, and thus the training error is $0$.
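Finally, the two feature values that drive the upper bound of Theorem 5.5 can be reproduced with a few lines of numpy. The sketch below assumes (this is not visible in the excerpt) that $\mathbf s_i$ and $\mathbf t_i$ are two-hot vectors whose ones are separated by one position and adjacent, respectively; under that assumption the size-2 cyclic convolution with quadratic activation and sum pooling evaluates to $2w_1^2+2w_2^2$ and $w_1^2+w_2^2+(w_1+w_2)^2$, which differ exactly when $w_1w_2\ne0$.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 12
w1, w2 = rng.standard_normal(2)     # frozen first-layer filter (w1, w2)

def conv_feature(x):
    # sum_j (w1 * x_j + w2 * x_{j-1})^2 with cyclic indexing
    return np.sum((w1 * x + w2 * np.roll(x, 1)) ** 2)

def two_hot(i, j):
    x = np.zeros(d)
    x[i % d] = x[j % d] = 1.0
    return x

s = two_hot(0, 2)   # assumed "separated" pair s_i
t = two_hot(0, 1)   # assumed "adjacent" pair t_i
print("feature(s):", conv_feature(s), "= 2 w1^2 + 2 w2^2            =", 2 * w1**2 + 2 * w2**2)
print("feature(t):", conv_feature(t), "= w1^2 + w2^2 + (w1 + w2)^2  =", w1**2 + w2**2 + (w1 + w2)**2)
```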