Generalization bounds for deep convolutional neural networks
Philip M. Long* and Hanie Sedghi*
Google Brain
{plong,hsedghi}@google.com

*Authors are ordered alphabetically.

ABSTRACT
We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss, and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps.
1 INTRODUCTION
Recently, substantial progress has been made regarding theoretical analysis of the generalization of deep learning models (see Neyshabur et al., 2015; Zhang et al., 2016; Dziugaite and Roy, 2017; Bartlett et al., 2017; Neyshabur et al., 2017; 2018; Arora et al., 2018; Golowich et al., 2018; Neyshabur et al., 2019; Wei and Ma, 2019a; Cao and Gu, 2019; Daniely and Granot, 2019). One interesting point that has been explored, with roots in (Bartlett, 1998), is that even if there are many parameters, the set of models that can be represented using weights with small magnitude is limited enough to provide leverage for induction (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018). Intuitively, if the weights start small, since the most popular training algorithms make small, incremental updates that get smaller as the training accuracy improves, there is a tendency for these algorithms to produce small weights. (For some deeper theoretical exploration of implicit bias in deep learning and related settings, see (Gunasekar et al., 2017; 2018a;b; Ma et al., 2018).) Even more recently, authors have proved generalization bounds in terms of the distance from the initial setting of the weights instead of the size of the weights (Dziugaite and Roy, 2017; Bartlett et al., 2017; Neyshabur et al., 2019; Nagarajan and Kolter, 2019). This is important because small initial weights may promote vanishing gradients; it is advisable instead to choose initial weights that maintain a strong but non-exploding signal as computation flows through the network (see LeCun et al., 2012; Glorot and Bengio, 2010; Saxe et al., 2013; He et al., 2015). A number of recent theoretical analyses have shown that, for a large network initialized in this way, accurate models can be found by traveling a short distance in parameter space (see Du et al., 2019b;a; Allen-Zhu et al., 2019; Zou et al., 2018; Lee et al., 2019). Thus, the distance from initialization may be expected to be significantly smaller than the magnitude of the weights. Furthermore, there is theoretical reason to expect that, as the number of parameters increases, the distance from initialization decreases. This motivates generalization bounds in terms of distance from initialization (Dziugaite and Roy, 2017; Bartlett et al., 2017).

Convolutional layers are used in all competitive deep neural network architectures applied to image processing tasks. The most influential generalization analyses in terms of distance from initialization have thus far concentrated on networks with fully connected layers. Since a convolutional layer has an alternative representation as a fully connected layer, these analyses apply in the case of convolutional networks, but, intuitively, the weight-tying employed in the convolutional layer constrains the set of functions computed by the layer. This additional restriction should be expected to aid generalization.

In this paper, we prove new generalization bounds for convolutional networks that take account of this effect. As in earlier analyses for the fully connected case, our bounds are in terms of the distance from the initial weights, and the number of parameters.
Additionally, our bounds are independent of the number of pixels in the input, or the height and width of the hidden feature maps.

Our most general bounds apply to networks including both convolutional and fully connected layers, and, as such, they also apply for purely fully connected networks. In contrast with earlier bounds for settings like the one considered here, our bounds are in terms of a sum over layers of the distance from initialization of the layer. Earlier bounds were in terms of a product of these distances, which led to an exponential dependence on depth. Our bounds have a linear dependence on depth, which is more aligned with practical observations.

As is often the case for generalization analyses, the central technical lemmas are bounds on covering numbers. Borrowing a technique due to Barron et al. (1999), these are proved by bounding the Lipschitz constant of the mapping from the parameters to the loss of the functions computed by the networks. (Our proof also borrows ideas from the analysis of the fully connected case, especially (Bartlett et al., 2017; Neyshabur et al., 2018).) Covering bounds may be applied to obtain a huge variety of generalization bounds. We present two examples for each covering bound. One is a standard bound on the difference between training and test error. Perhaps the more relevant bound has the flavor of "relative error"; it is especially strong when the training loss is small, as is often the case in modern practice. Our covering bounds are polynomial in the inverse of the granularity of the cover. Such bounds seem to be especially useful for bounding the relative error.

In particular, our covering bounds are of the form $(B/\epsilon)^W$, where $\epsilon$ is the granularity of the cover, $B$ is proportional to the Lipschitz constant of a mapping from parameters to functions, and $W$ is the number of parameters in the model. We apply a bound from the empirical process literature in terms of covering bounds of this form due to Giné and Guillou (2001), who paid particular attention to the dependence of estimation error on $B$. This bound may be helpful for other analyses of the generalization of deep learning in terms of different notions of distance from initialization. (Applying bounds in terms of Dudley's entropy integral in the standard way leads to an exponentially worse dependence on $B$.)

Related previous work.
Du et al. (2018) proved bounds for CNNs in terms of the number of parameters, for two-layer networks. Arora et al. (2018) analyzed the generalization of networks output by a compression scheme applied to CNNs. Zhou and Feng (2018) provided a generalization guarantee for CNNs satisfying a constraint on the rank of matrices formed from their kernels. Li et al. (2018) analyzed the generalization of CNNs under other constraints on the parameters. Lee and Raginsky (2018) provided a size-free bound for CNNs in a general unsupervised learning framework that includes PCA and codebook learning.
Related independent work.
Ledent et al. (2019) proved bounds for CNNs that also took account of the effect of weight-tying. (Their bounds retain the exponential dependence on the depth of the network from earlier work, and are otherwise qualitatively dissimilar to ours.) Wei and Ma (2019b) obtained bounds for fully connected networks with an improved dependence on the depth of the network. Daniely and Granot (2019) obtained improved bounds for constant-depth fully-connected networks. Jiang et al. (2019) conducted a wide-ranging empirical study of the dependence of the generalization gap on a variety of quantities, including distance from initialization.
Notation. If $K^{(i)}$ is the kernel of convolutional layer number $i$, then $\mathrm{op}(K^{(i)})$ refers to its operator matrix and $\mathrm{vec}(K^{(i)})$ denotes the vectorization of the kernel tensor $K^{(i)}$. For a matrix $M$, $\|M\|$ denotes the operator norm of $M$. For vectors, $\|\cdot\|$ represents the Euclidean norm, and $\|\cdot\|_1$ is the $L_1$ norm. For a multiset $S$ of elements of some set $Z$, and a function $g$ from $Z$ to $\mathbb{R}$, let $\mathbb{E}_S[g] = \frac{1}{m}\sum_{t=1}^m g(z_t)$. We will denote the function parameterized by $\Theta$ by $f_\Theta$.

2 BOUNDS FOR A BASIC SETTING
In this section, we provide a bound for a clean and simple setting. (Convolution is a linear operator and can thus be written as a matrix-vector product; the operator matrix of kernel $K$ refers to the matrix that describes convolving the input with kernel $K$. For details, see Sedghi et al., 2018.)

2.1 THE SETTING AND THE BOUNDS
In the basic setting, the input and all hidden layers have the same number $c$ of channels. Each input $x \in \mathbb{R}^{d \times d \times c}$ satisfies $\|\mathrm{vec}(x)\| \le 1$.

We consider a deep convolutional network, whose convolutional layers use zero-padding (see Goodfellow et al., 2016). Each layer but the last consists of a convolution followed by an activation function that is applied componentwise. The activations are $1$-Lipschitz and nonexpansive (examples include ReLU and tanh). The kernels of the convolutional layers are $K^{(i)} \in \mathbb{R}^{k \times k \times c \times c}$ for $i \in \{1, \ldots, L\}$. Let $K = (K^{(1)}, \ldots, K^{(L)})$ be the $L \times k \times k \times c \times c$-tensor obtained by concatenating the kernels for the various layers. Vector $w$ represents the last layer; the weights in the last layer are fixed with $\|w\| = 1$. Let $W = Lk^2c^2$ be the total number of trainable parameters in the network.

We let $K_0^{(1)}, \ldots, K_0^{(L)}$ take arbitrary fixed values (interpreted as the initial values of the kernels) subject to the constraint that, for all layers $i$, $\|\mathrm{op}(K_0^{(i)})\| = 1$. (This is often the goal of initialization schemes.) Let $K_0$ be the corresponding $L \times k \times k \times c \times c$ tensor. We provide a generalization bound in terms of distance from initialization, along with other natural parameters of the problem. The distance is measured with

$$\|K - K_0\|_\sigma \stackrel{\mathrm{def}}{=} \sum_{i=1}^L \|\mathrm{op}(K^{(i)}) - \mathrm{op}(K_0^{(i)})\|.$$

For $\beta > 0$, define $\mathcal{K}_\beta$ to be the set of kernel tensors within $\|\cdot\|_\sigma$ distance $\beta$ of $K_0$, and define $\mathcal{F}_\beta$ to be the set of functions computed by CNNs with kernels in $\mathcal{K}_\beta$. That is, $\mathcal{F}_\beta = \{f_K : \|K - K_0\|_\sigma \le \beta\}$.

Let $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ be a loss function such that $\ell(\cdot, y)$ is $\lambda$-Lipschitz for all $y$. An example is the $1/\lambda$-margin loss. For a function $f$ from $\mathbb{R}^{d \times d \times c}$ to $\mathbb{R}$, let $\ell_f(x, y) = \ell(f(x), y)$.

We will use $S$ to denote a set $\{(x_1, y_1), \ldots, (x_m, y_m)\} = \{z_1, \ldots, z_m\}$ of random training examples, where each $z_t = (x_t, y_t)$.

Theorem 2.1 (Basic bounds). For any $\eta > 0$, there is a $C > 0$ such that for any $\beta, \delta > 0$, $\lambda \ge 1$, for any joint probability distribution $P$ over $\mathbb{R}^{d \times d \times c} \times \mathbb{R}$, if a training set $S$ of $n$ examples is drawn independently at random from $P$, then, with probability at least $1 - \delta$, for all $f \in \mathcal{F}_\beta$,

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le (1 + \eta)\,\mathbb{E}_S[\ell_f] + \frac{C(W(\beta + \log(\lambda n)) + \log(1/\delta))}{n}$$

and, if $\beta \ge 1$, then

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f] + C\sqrt{\frac{W(\beta + \log(\lambda)) + \log(1/\delta)}{n}}$$

and otherwise

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f] + C\left(\beta\lambda\sqrt{\frac{W}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right).$$

If Theorem 2.1 is applied with the margin loss, then $\mathbb{E}_{z \sim P}[\ell_f(z)]$ is in turn an upper bound on the probability of misclassification on test data. Using the algorithm from (Sedghi et al., 2018), $\|\cdot\|_\sigma$ may be efficiently computed. Since $\|K - K_0\|_\sigma \le \|\mathrm{vec}(K) - \mathrm{vec}(K_0)\|_1$ (Sedghi et al., 2018), Theorem 2.1 yields the same bounds as a corollary if the definition of $\mathcal{F}_\beta$ is replaced with the analogous definition using $\|\mathrm{vec}(K) - \mathrm{vec}(K_0)\|_1$.
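As an aside on computation, $\|\cdot\|_\sigma$ can be evaluated exactly via the characterization of the singular values of a convolutional layer in Sedghi et al. (2018). The following is a minimal NumPy sketch of that computation; it assumes circular (wraparound) convolution, for which the characterization is exact, and the function names are ours rather than from that paper.

```python
import numpy as np

def conv_operator_norm(K, d):
    """Spectral norm of the (circular) convolution operator of kernel K.

    K has shape (k, k, c_in, c_out); d is the spatial size of the input.
    Per Sedghi et al. (2018), the singular values of op(K) are the
    singular values of the c_in x c_out matrices obtained by a 2-D FFT
    of the zero-padded kernel, one matrix per frequency pair (u, v).
    """
    k = K.shape[0]
    padded = np.zeros((d, d) + K.shape[2:], dtype=complex)
    padded[:k, :k] = K
    fourier = np.fft.fft2(padded, axes=(0, 1))  # shape (d, d, c_in, c_out)
    return max(np.linalg.norm(fourier[u, v], ord=2)
               for u in range(d) for v in range(d))

def sigma_distance(kernels, kernels0, d):
    """||K - K0||_sigma: since op is linear in the kernel,
    op(K) - op(K0) = op(K - K0), so each layer costs one norm computation."""
    return sum(conv_operator_norm(K - K0, d)
               for K, K0 in zip(kernels, kernels0))
```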
2.2 TOOLS

Definition 2.2.
For $d \in \mathbb{N}$, and a set $\mathcal{G}$ of real-valued functions with a common domain $Z$, we say that $\mathcal{G}$ is $(B, d)$-Lipschitz parameterized if there is a norm $\|\cdot\|$ on $\mathbb{R}^d$ and a mapping $\phi$ from the unit ball w.r.t. $\|\cdot\|$ in $\mathbb{R}^d$ to $\mathcal{G}$ such that, for all $\theta$ and $\theta'$ such that $\|\theta\| \le 1$ and $\|\theta'\| \le 1$, and all $z \in Z$,

$$|(\phi(\theta))(z) - (\phi(\theta'))(z)| \le B\|\theta - \theta'\|.$$

(For example, with the Euclidean norm, the class of linear functions $z \mapsto \theta^\top z$ with $\|\theta\| \le 1$, restricted to the unit ball, is $(1, d)$-Lipschitz parameterized, since $|\theta^\top z - \theta'^\top z| \le \|z\|\,\|\theta - \theta'\|$.)

The following lemma is essentially known. Its proof, which uses standard techniques (see Pollard, 1984; Talagrand, 1994; 1996; Barron et al., 1999; Van de Geer, 2000; Giné and Guillou, 2001; Mohri et al., 2018), is in Appendix A.
Lemma 2.3.
Suppose a set $\mathcal{G}$ of functions from a common domain $Z$ to $[0, M]$ is $(B, d)$-Lipschitz parameterized for $B > 0$ and $d \in \mathbb{N}$.

Then, for any $\eta > 0$, there is a $C$ such that, for all large enough $n \in \mathbb{N}$, for any $\delta > 0$, for any probability distribution $P$ over $Z$, if $S$ is obtained by sampling $n$ times independently from $P$, then, with probability at least $1 - \delta$, for all $g \in \mathcal{G}$,

$$\mathbb{E}_{z \sim P}[g(z)] \le (1 + \eta)\,\mathbb{E}_S[g] + \frac{CM(d\log(Bn) + \log(1/\delta))}{n}$$

and, if $B \ge 1$,

$$\mathbb{E}_{z \sim P}[g(z)] \le \mathbb{E}_S[g] + CM\sqrt{\frac{d\log B + \log\frac{2}{\delta}}{n}},$$

and, for all $B$,

$$\mathbb{E}_{z \sim P}[g(z)] \le \mathbb{E}_S[g] + CB\sqrt{\frac{d}{n}} + M\sqrt{\frac{\log\frac{2}{\delta}}{n}}.$$

2.3 PROOF OF THEOREM 2.1
Definition 2.4.
Let $\ell_{\mathcal{F}} = \{\ell_f : f \in \mathcal{F}\}$.

We will prove Theorem 2.1 by showing that $\ell_{\mathcal{F}_\beta}$ is $(\beta\lambda e^\beta, W)$-Lipschitz parameterized. This will be achieved through a series of lemmas.

Lemma 2.5.
Choose $K \in \mathcal{K}_\beta$ and a layer $j$. Suppose $\tilde{K} \in \mathcal{K}_\beta$ satisfies $K^{(i)} = \tilde{K}^{(i)}$ for all $i \ne j$. Then, for all examples $(x, y)$,

$$|\ell(f_K(x), y) - \ell(f_{\tilde{K}}(x), y)| \le \lambda e^\beta \left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\|.$$

Proof.
For each layer $i$, let $\beta_i = \|\mathrm{op}(K^{(i)}) - \mathrm{op}(K_0^{(i)})\|$.

Since $\ell$ is $\lambda$-Lipschitz w.r.t. its first argument, we have that $|\ell(f_K(x), y) - \ell(f_{\tilde{K}}(x), y)| \le \lambda|f_K(x) - f_{\tilde{K}}(x)|$, so it suffices to bound $|f_K(x) - f_{\tilde{K}}(x)|$. Let $g_{\mathrm{up}}$ be the function from the inputs of the whole network with parameters $K$ to the inputs of the convolution in layer $j$, and let $g_{\mathrm{down}}$ be the function from the output of this convolution to the output of the whole network, so that $f_K = g_{\mathrm{down}} \circ f_{\mathrm{op}(K^{(j)})} \circ g_{\mathrm{up}}$. Choose an input $x$ to the network, and let $u = g_{\mathrm{up}}(x)$. Recalling that $\|x\| \le 1$, and the non-linearities are nonexpansive, we have

$$\|u\| \le \prod_{i < j} \|\mathrm{op}(K^{(i)})\| \le \prod_{i < j} (1 + \beta_i),$$

since $\|\mathrm{op}(K^{(i)})\| \le \|\mathrm{op}(K_0^{(i)})\| + \beta_i = 1 + \beta_i$. By the same reasoning, together with $\|w\| = 1$, the map $g_{\mathrm{down}}$ is $\prod_{i > j}(1 + \beta_i)$-Lipschitz. Thus

$$|f_K(x) - f_{\tilde{K}}(x)| \le \left(\prod_{i \ne j} (1 + \beta_i)\right) \left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\| \le e^\beta \left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\|,$$

where the last inequality uses $1 + \beta_i \le e^{\beta_i}$ and $\sum_i \beta_i \le \beta$. Combining with the Lipschitz property of $\ell$ completes the proof.

Lemma 2.6.
For any $K, \tilde{K} \in \mathcal{K}_\beta$ and any example $(x, y)$,

$$|\ell(f_K(x), y) - \ell(f_{\tilde{K}}(x), y)| \le \lambda e^\beta \left\|K - \tilde{K}\right\|_\sigma.$$

Proof.
Consider transforming $K$ to $\tilde{K}$ by replacing one layer of $K$ at a time with the corresponding layer in $\tilde{K}$. Applying Lemma 2.5 to bound the distance traversed with each replacement and combining this with the triangle inequality gives

$$|\ell(f_K(x), y) - \ell(f_{\tilde{K}}(x), y)| \le \lambda e^\beta \sum_{j=1}^L \left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\| = \lambda e^\beta \left\|K - \tilde{K}\right\|_\sigma.$$

Now we are ready to prove our basic bound.
Proof (of Theorem 2.1).
Consider the mapping $\phi$ from the ball w.r.t. $\|\cdot\|_\sigma$ of radius $1$ in $\mathbb{R}^{Lk^2c^2}$ to $\ell_{\mathcal{F}_\beta}$ defined by $\phi(\theta) = \ell_{f_{K_0 + \beta\,\mathrm{vec}^{-1}(\theta)}}$, where $\mathrm{vec}^{-1}(\theta)$ is the reshaping of $\theta$ into an $L \times k \times k \times c \times c$-tensor. Lemma 2.6 implies that this mapping is $\beta\lambda e^\beta$-Lipschitz. Applying Lemma 2.3 completes the proof.

2.4 COMPARISONS
Since a convolutional network has an alternative parameterization as a fully connected network, the bounds of (Bartlett et al., 2017) have consequences for convolutional networks. To compare our bound with this, first, note that Theorem 2.1, together with standard model selection techniques, yields a

$$O\left(\sqrt{\frac{W(\|K - K_0\|_\sigma + \log(\lambda)) + \log(1/\delta)}{n}}\right) \qquad (1)$$

bound on $\mathbb{E}_{z \sim P}[\ell_f(z)] - \mathbb{E}_S[\ell_f(z)]$. (For more details, please see Appendix B.) Translating the bound of (Bartlett et al., 2017) to our setting and notation directly yields a bound on $\mathbb{E}_{z \sim P}[\ell_f(z)] - \mathbb{E}_S[\ell_f(z)]$ whose main terms are proportional to

$$\frac{\lambda\left(\prod_{i=1}^L \|\mathrm{op}(K^{(i)})\|\right)\left(\sum_{i=1}^L \frac{\|\mathrm{op}(K^{(i)})^\top - \mathrm{op}(K_0^{(i)})^\top\|_{2,1}^{2/3}}{\|\mathrm{op}(K^{(i)})\|^{2/3}}\right)^{3/2}\log(dcL) + \sqrt{\log(1/\delta)}}{\sqrt{n}} \qquad (2)$$

where, for a $p \times q$ matrix $A$, $\|A\|_{2,1} = \|(\|A_{:,1}\|, \ldots, \|A_{:,q}\|)\|_1$. One can get an idea of how this bound relates to (1) by comparing the bounds in a simple concrete case. Suppose that each of the convolutional layers of the network parameterized by $K_0$ computes the identity function, and that $K$ is obtained from $K_0$ by adding $\epsilon$ to each entry. In this case, disregarding edge effects, for all $i$, $\|\mathrm{op}(K^{(i)})\| = 1 + \epsilon k^2 c$ and $\|K - K_0\|_\sigma = \epsilon k^2 cL$ (as proved in Appendix C). Also, $\|\mathrm{op}(K^{(i)})^\top - \mathrm{op}(K_0^{(i)})^\top\|_{2,1} = (cd^2)(\epsilon\sqrt{ck^2}) = \epsilon c^{3/2} d^2 k$. We get additional simplification if we set $\epsilon = 1/k^2$. In this case, (2) gives a constant times

$$\frac{(c+1)^L \sqrt{c}\, d\,(d/k)\, L^{3/2}\, \lambda \log(dcL) + \sqrt{\log(1/\delta)}}{\sqrt{n}}$$

while (1) gives a constant times

$$\frac{c^{3/2} kL + ck\sqrt{\log(\lambda)} + \sqrt{\log(1/\delta)}}{\sqrt{n}}.$$

In this scenario, the new bound is independent of $d$, and grows more slowly with $\lambda$, $c$ and $L$. Note that $k \le d$ (and, typically, it is much less). This specific case illustrates a more general phenomenon that holds when the initialization is close to the identity, and changes to the parameters are on a similar scale.

Golowich et al. (2017) established bounds that improve on the bounds of (Bartlett et al., 2017) in some cases. Their bound requires a restriction on the activation function (albeit one that is satisfied by the ReLU). For large $n$, the main term of the natural consequence of Corollary 1 of their paper in the setting of this section, with the required additional assumption on the activation function, grows like

$$\frac{\lambda\sqrt{L}\prod_{i=1}^L \|\mathrm{op}(K^{(i)})\|_F}{\sqrt{n}} \approx \frac{\lambda\,(cd^2)^{L/2}\sqrt{L}}{\sqrt{n}},$$

when $\epsilon = 1/k^2$. (We note, however, that, in addition to not trying to take account of the convolutional structure, Golowich et al. (2017) also did not make an effort to obtain stronger bounds in the case that the distance from initialization is small. On the other hand, we suspect that modifying their proof analogously to (Bartlett et al., 2017) to do so would not remove the exponential dependence on $L$.)
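To make the depth dependence above concrete, here is a small back-of-the-envelope computation of the two simplified main terms (ours versus the translation of Bartlett et al. (2017)) in the $\epsilon = 1/k^2$ scenario; the problem sizes and $\lambda$ are hypothetical values chosen only for illustration, and constant factors are ignored.

```python
import numpy as np

c, d, k, lam, n = 3, 32, 3, 10.0, 50000  # hypothetical problem sizes

for L in (2, 4, 8, 16):
    # Main term of (1): linear in L.
    ours = (c**1.5 * k * L + c * k * np.sqrt(np.log(lam))) / np.sqrt(n)
    # Main term of (2): exponential in L through (c + 1)^L.
    spectral = (lam * (c + 1)**L * np.sqrt(c) * d * (d / k) * L**1.5
                * np.log(d * c * L)) / np.sqrt(n)
    print(f"L={L:2d}  ours={ours:10.3g}  spectral={spectral:10.3g}")
```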
3 A MORE GENERAL BOUND

In this section, we generalize Theorem 2.1.

3.1 THE SETTING
The more general setting concerns a neural network where the input is a $d \times d \times c$ tensor whose flattening has Euclidean norm at most $\chi$, and the network's output is an $m$-dimensional vector, which may be logits for predicting a one-hot encoding of an $m$-class classification problem.

The network is comprised of $L_c$ convolutional layers followed by $L_f$ fully connected layers. The $i$th convolutional layer includes a convolution, with kernel $K^{(i)} \in \mathbb{R}^{k_i \times k_i \times c_{i-1} \times c_i}$, followed by a componentwise non-linearity and an optional pooling operation. We assume that the non-linearity and any pooling operations are $1$-Lipschitz and nonexpansive. Let $V^{(i)}$ be the matrix of weights for the $i$th fully connected layer. Let $\Theta = (K^{(1)}, \ldots, K^{(L_c)}, V^{(1)}, \ldots, V^{(L_f)})$ be all of the parameters of the network. Let $L = L_c + L_f$.

We assume that $\ell(\cdot, y)$ is $\lambda$-Lipschitz for all $y$, and that $\ell(\hat{y}, y) \in [0, M]$ for all $\hat{y}$ and $y$. An example $(x, y)$ includes a $d \times d \times c$-tensor $x$ and $y \in \mathbb{R}^m$.

We let $K_0^{(1)}, \ldots, K_0^{(L_c)}, V_0^{(1)}, \ldots, V_0^{(L_f)}$ take arbitrary fixed values subject to the constraint that, for all convolutional layers $i$, $\|\mathrm{op}(K_0^{(i)})\| \le 1 + \nu$, and, for all fully connected layers $i$, $\|V_0^{(i)}\| \le 1 + \nu$. Let $\Theta_0 = (K_0^{(1)}, \ldots, K_0^{(L_c)}, V_0^{(1)}, \ldots, V_0^{(L_f)})$.

For $\Theta = (K^{(1)}, \ldots, K^{(L_c)}, V^{(1)}, \ldots, V^{(L_f)})$ and $\tilde{\Theta} = (\tilde{K}^{(1)}, \ldots, \tilde{K}^{(L_c)}, \tilde{V}^{(1)}, \ldots, \tilde{V}^{(L_f)})$, define

$$\|\Theta - \tilde{\Theta}\|_N = \left(\sum_{i=1}^{L_c} \|\mathrm{op}(K^{(i)}) - \mathrm{op}(\tilde{K}^{(i)})\|\right) + \sum_{i=1}^{L_f} \|V^{(i)} - \tilde{V}^{(i)}\|.$$

For $\beta, \nu \ge 0$, define $\mathcal{F}_{\beta,\nu}$ to be the set of functions computed by CNNs as described in this subsection with parameters within $\|\cdot\|_N$-distance $\beta$ of $\Theta_0$. Let $\mathcal{O}_{\beta,\nu}$ be the set of their parameterizations.

Theorem 3.1 (General Bound). For any $\eta > 0$, there is a constant $C$ such that the following holds. For any $\beta, \nu, \chi > 0$, for any $\delta > 0$, for any joint probability distribution $P$ over $\mathbb{R}^{d \times d \times c} \times \mathbb{R}^m$ such that, with probability 1, $(x, y) \sim P$ satisfies $\|\mathrm{vec}(x)\| \le \chi$, under the assumptions of this section, if a training set $S$ of $n$ examples is drawn independently at random from $P$, then, with probability at least $1 - \delta$, for all $f \in \mathcal{F}_{\beta,\nu}$,

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le (1 + \eta)\,\mathbb{E}_S[\ell_f] + \frac{CM(W(\beta + \nu L + \log(\chi\lambda\beta n)) + \log(1/\delta))}{n}$$

and, if $\chi\lambda\beta(1 + \nu + \beta/L)^L \ge 1$,

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f] + CM\sqrt{\frac{W(\beta + \nu L + \log(\chi\lambda\beta)) + \log(1/\delta)}{n}}$$

and a bound of

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f] + C\chi\lambda\beta(1 + \nu + \beta/L)^L\sqrt{\frac{W}{n}} + M\sqrt{\frac{\log\frac{2}{\delta}}{n}}$$

holds for all $\chi, \lambda, \beta > 0$.
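For concreteness, the distance $\|\Theta - \tilde{\Theta}\|_N$ combines operator-norm distances for the convolutional layers with ordinary spectral norms for the fully connected layers. A minimal sketch, reusing the (hypothetical) conv_operator_norm helper from the Section 2 sketch:

```python
import numpy as np

def n_distance(convs, convs0, fcs, fcs0, sizes):
    """|| Theta - Theta0 ||_N: summed operator-norm distances over layers.

    convs/convs0: per-layer kernels of shape (k_i, k_i, c_{i-1}, c_i);
    sizes: the spatial size of each convolutional layer's input (these
    can shrink under pooling); fcs/fcs0: fully connected weight matrices.
    """
    conv_part = sum(conv_operator_norm(K - K0, d)
                    for K, K0, d in zip(convs, convs0, sizes))
    fc_part = sum(np.linalg.norm(V - V0, ord=2)
                  for V, V0 in zip(fcs, fcs0))
    return conv_part + fc_part
```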
3.2 PROOF OF THEOREM 3.1

We will use $\|\cdot\|_N$ to witness the fact that $\ell_{\mathcal{F}_{\beta,\nu}}$ is $(\chi\lambda\beta(1 + \nu + \beta/L)^L, W)$-Lipschitz parameterized.

The first two lemmas concern the effect of changing a single layer. Their proofs are very similar to the proof of Lemma 2.5, and are in Appendices D and E.

Lemma 3.2. Choose $\Theta = (K^{(1)}, \ldots, K^{(L_c)}, V^{(1)}, \ldots, V^{(L_f)}), \tilde{\Theta} = (\tilde{K}^{(1)}, \ldots, \tilde{K}^{(L_c)}, \tilde{V}^{(1)}, \ldots, \tilde{V}^{(L_f)}) \in \mathcal{O}_{\beta,\nu}$ and a convolutional layer $j$. Suppose that $K^{(i)} = \tilde{K}^{(i)}$ for all convolutional layers $i \ne j$ and $V^{(i)} = \tilde{V}^{(i)}$ for all fully connected layers $i$. Then, for all examples $(x, y)$,

$$|\ell(f_\Theta(x), y) - \ell(f_{\tilde{\Theta}}(x), y)| \le \chi\lambda(1 + \nu + \beta/L)^L \left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\|.$$

Lemma 3.3.
Choose $\Theta = (K^{(1)}, \ldots, K^{(L_c)}, V^{(1)}, \ldots, V^{(L_f)}), \tilde{\Theta} = (\tilde{K}^{(1)}, \ldots, \tilde{K}^{(L_c)}, \tilde{V}^{(1)}, \ldots, \tilde{V}^{(L_f)}) \in \mathcal{O}_{\beta,\nu}$ and a fully connected layer $j$. Suppose that $K^{(i)} = \tilde{K}^{(i)}$ for all convolutional layers $i$ and $V^{(i)} = \tilde{V}^{(i)}$ for all fully connected layers $i \ne j$. Then, for all examples $(x, y)$,

$$|\ell(f_\Theta(x), y) - \ell(f_{\tilde{\Theta}}(x), y)| \le \chi\lambda(1 + \nu + \beta/L)^L \left\|V^{(j)} - \tilde{V}^{(j)}\right\|.$$

Now we prove a bound when all the layers can change between $\Theta$ and $\tilde{\Theta}$.

Lemma 3.4.
For any $\Theta = (K^{(1)}, \ldots, K^{(L_c)}, V^{(1)}, \ldots, V^{(L_f)}), \tilde{\Theta} = (\tilde{K}^{(1)}, \ldots, \tilde{K}^{(L_c)}, \tilde{V}^{(1)}, \ldots, \tilde{V}^{(L_f)}) \in \mathcal{O}_{\beta,\nu}$, for any example $(x, y)$,

$$|\ell(f_\Theta(x), y) - \ell(f_{\tilde{\Theta}}(x), y)| \le \chi\lambda(1 + \nu + \beta/L)^L \left\|\Theta - \tilde{\Theta}\right\|_N.$$

Proof.
Consider transforming $\Theta$ to $\tilde{\Theta}$ by replacing one layer at a time of $\Theta$ with the corresponding layer in $\tilde{\Theta}$. Applying Lemma 3.2 to bound the distance traversed with each replacement of a convolutional layer, and Lemma 3.3 to bound the distance traversed with each replacement of a fully connected layer, and combining this with the triangle inequality gives the lemma.

Now we are ready to prove our more general bound.

Proof (of Theorem 3.1).
Consider the mapping $\phi$ from the unit ball w.r.t. $\|\cdot\|_N$ to $\ell_{\mathcal{F}_{\beta,\nu}}$ defined by $\phi(\Theta) = \ell_{f_{\Theta_0 + \beta\Theta}}$. Lemma 3.4 implies that this mapping witnesses that $\ell_{\mathcal{F}_{\beta,\nu}}$ is $(\chi\lambda\beta(1 + \nu + \beta/L)^L, W)$-Lipschitz parameterized. Applying Lemma 2.3 completes the proof.

Figure 1: Generalization gaps for a 10-layer all-conv model on the CIFAR-10 dataset.

Figure 2: Generalization gap as a function of $W$.

3.3 MORE COMPARISONS
Theorem 3.1 applies in the case that there are no convolutional layers, i.e., for a fully connected network. In this subsection, we compare its bound in this case with the bound of (Bartlett et al., 2017). Because the bounds are in terms of different quantities, we compare them in a simple concrete case. In this case, for $D = cd^2$, each hidden layer has $D$ components, and there are $D$ classes. For all $i$, $V_0^{(i)} = I$ and $V^{(i)} = I + H/\sqrt{D}$, where $H$ is a Hadamard matrix (using the Sylvester construction), and $\chi = M = 1$. Then, dropping the superscripts, each layer $V$ has

$$\|V\| = 2, \quad \|V - V_0\| = 1, \quad \|V - V_0\|_{2,1} = D.$$

Further, in the notation of Theorem 3.1, $W = D^2 L$, $\beta = L$, and $\nu = 0$. Plugging into Theorem 3.1 yields a bound on the generalization gap proportional to

$$\frac{DL + D\sqrt{L\log(\lambda)} + \sqrt{\log(1/\delta)}}{\sqrt{n}}$$

where, in this case, the bound of (Bartlett et al., 2017) is proportional to

$$\frac{\lambda\, 2^L L^{3/2} D \log(DL) + \sqrt{\log(1/\delta)}}{\sqrt{n}}$$

and, when $D$ is large relative to $L$, Corollary 1 of (Golowich et al., 2017) (approximately) gives

$$\frac{\lambda\, 2^{L/2}\sqrt{L}\, D^{L/2} + \sqrt{\log(1/\delta)}}{\sqrt{n}}.$$
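The norm identities claimed for the Hadamard construction are easy to check numerically; a short sketch using SciPy's Sylvester-construction Hadamard matrix, with $D$ an arbitrary power of two:

```python
import numpy as np
from scipy.linalg import hadamard

D = 64                                  # any power of two
H = hadamard(D).astype(float)           # Sylvester construction; H @ H.T = D * I
V0 = np.eye(D)
V = V0 + H / np.sqrt(D)

print(np.linalg.norm(V, 2))                  # ||V||            = 2
print(np.linalg.norm(V - V0, 2))             # ||V - V0||       = 1
print(np.linalg.norm(V - V0, axis=0).sum())  # ||V - V0||_{2,1} = D
```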
Figure 3: $\|K - K_0\|_\sigma$ as a function of $W$: (a) mean and error bars, (b) median.

4 EXPERIMENTS

We trained a 10-layer all-convolutional model on the CIFAR-10 dataset. The architecture was similar to VGG (Simonyan and Zisserman, 2014). The network was trained with dropout regularization and an exponential learning rate schedule. We define the generalization gap as the difference between train error and test error. In order to analyze the effect of the number of network parameters on the generalization gap, we scaled up the number of channels in each layer, while keeping other elements of the architecture, including the depth, fixed. Each network was trained repeatedly, sweeping over different values of the initial learning rate and batch size. For each setting the results were averaged over five different random initializations. Figure 1 shows the generalization gap for different values of $W\|K - K_0\|_\sigma$. As in the bound of Theorem 3.1, the generalization gap increases with $W\|K - K_0\|_\sigma$. Figure 2 shows that as the network becomes more over-parametrized, the generalization gap remains almost flat with increasing $W$. This is expected due to the role of over-parametrization in generalization (Neyshabur et al., 2019). An explanation of this phenomenon that is consistent with the bound presented here is that, ultimately, increasing $W$ leads to a decrease in the value of $\|K - K_0\|_\sigma$; see Figure 3a. The fluctuations in Figure 3a are partly due to the fact that training neural networks is not a stable process. We provide the medians of $\|K - K_0\|_\sigma$ for different values of $W$ in Figure 3b.
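As an illustration of the bookkeeping behind Figures 1 and 3, the quantity on the x-axis of Figure 1 can be computed by snapshotting the kernels at initialization and applying the $\|\cdot\|_\sigma$ helper sketched in Section 2; the kernels below are random stand-ins chosen for illustration, not the trained CIFAR-10 models.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, c, L = 32, 3, 16, 10              # CIFAR-10-like sizes, for illustration

init = [rng.normal(0.0, 0.05, (k, k, c, c)) for _ in range(L)]
trained = [K + rng.normal(0.0, 0.01, K.shape) for K in init]  # stand-in for training

W = L * k * k * c * c
x_axis = W * sigma_distance(trained, init, d)  # helper from the Section 2 sketch
print(x_axis)
```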
ACKNOWLEDGMENTS

We thank Peter Bartlett, Jaeho Lee and Sam Schoenholz for valuable conversations, and anonymous reviewers for helpful comments.

REFERENCES
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. ICML, 2019.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 1999.

Peter Bartlett. Lecture notes for Stat 210B, 2013. ∼bartlett/courses/2013spring-stat210b/notes/14notes.pdf; downloaded 4/17/19.

Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.

Amit Daniely and Elad Granot. Generalization bounds for neural networks via approximate description length. In Advances in Neural Information Processing Systems, pages 12988–12996, 2019.

Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan R Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pages 373–383, 2018.

Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. ICML, 2019a.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. ICLR, 2019b.

Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. UAI, 2017.

Evarist Giné and Armelle Guillou. On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. In Annales de l'IHP Probabilités et statistiques, volume 37, pages 503–522, 2001.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299, 2018.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.

D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. ICLR, 2019.

A. N. Kolmogorov and V. M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.

Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

Antoine Ledent, Yunwen Lei, and Marius Kloft. Norm-based generalisation bounds for multi-class convolutional neural networks. arXiv 1905.12430, 2019.

Jaeho Lee and Maxim Raginsky. Learning finite-dimensional coding schemes with nonlinear reconstruction maps. arXiv preprint arXiv:1812.09658, 2018.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8570–8581, 2019.

Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. arXiv preprint arXiv:1806.05159, 2018.

Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In ICML, pages 3351–3360, 2018.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. ICLR, 2018.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. ICLR, 2019.

D. Pollard. Convergence of Stochastic Processes. Springer Verlag, Berlin, 1984.

D. Pollard. Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics and American Statistical Association, 1990.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. arXiv preprint arXiv:1805.10408, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76, 1994.

Michel Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126(3):505–563, 1996.

Sara A Van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

V. N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer Verlag, 1982.

V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. arXiv preprint arXiv:1905.03684, 2019a.

Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classification via an all-layer margin. ICLR, 2019b.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Pan Zhou and Jiashi Feng. Understanding generalization and optimization performance of deep CNNs. In ICML, pages 5955–5964, 2018.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
A PROOF OF LEMMA 2.3

Definition A.1. If $(X, \rho)$ is a metric space and $H \subseteq X$, we say that $G$ is an $\epsilon$-cover of $H$ with respect to $\rho$ if every $h \in H$ has a $g \in G$ such that $\rho(g, h) \le \epsilon$. Then $N_\rho(H, \epsilon)$ denotes the size of the smallest $\epsilon$-cover of $H$ w.r.t. $\rho$.

Definition A.2.
For a domain $Z$, define a metric $\rho_{\max}$ on pairs of functions from $Z$ to $\mathbb{R}$ by

$$\rho_{\max}(f, g) = \sup_{z \in Z} |f(z) - g(z)|.$$

We need two lemmas in terms of these covering numbers. The first is by now a standard bound from Vapnik-Chervonenkis theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Pollard, 1984). For example, it is a direct consequence of (Haussler, 1992, Theorem 3).
Lemma A.3.
For any $\eta > 0$, there is a constant $C$ depending only on $\eta$ such that the following holds. Let $G$ be an arbitrary set of functions from a common domain $Z$ to $[0, M]$. If there are constants $B$ and $d$ such that $N_{\rho_{\max}}(G, \epsilon) \le (B/\epsilon)^d$ for all $\epsilon > 0$, then, for all large enough $n \in \mathbb{N}$, for any $\delta > 0$, for any probability distribution $P$ over $Z$, if $S$ is obtained by sampling $n$ times independently from $P$, then, with probability at least $1 - \delta$, for all $g \in G$,

$$\mathbb{E}_{z \sim P}[g(z)] \le (1 + \eta)\,\mathbb{E}_S[g] + \frac{CM(d\log(Bn) + \log(1/\delta))}{n}.$$

We will also use the following, which is the combination of (2.5) and (2.7) of (Giné and Guillou, 2001).
Lemma A.4.
Let $G$ be an arbitrary set of functions from a common domain $Z$ to $[0, M]$. If there are constants $B \ge 1$ and $d$ such that $N_{\rho_{\max}}(G, \epsilon) \le (B/\epsilon)^d$ for all $\epsilon > 0$, then, for all large enough $n \in \mathbb{N}$, for any $\delta > 0$, for any probability distribution $P$ over $Z$, if $S$ is obtained by sampling $n$ times independently from $P$, then, with probability at least $1 - \delta$, for all $g \in G$,

$$\mathbb{E}_{z \sim P}[g(z)] \le \mathbb{E}_S[g] + CM\sqrt{\frac{d\log B + \log\frac{2}{\delta}}{n}},$$

where $C$ is an absolute constant.

The above bound only holds for $B \ge 1$. The following, which can be obtained by combining Talagrand's Lemma with the standard bound on Rademacher complexity in terms of the Dudley entropy integral (see (Van de Geer, 2000; Bartlett, 2013)), yields a bound for all $B$.

Lemma A.5.
Let $G$ be an arbitrary set of functions from a common domain $Z$ to $[0, M]$. If there are constants $B > 0$ and $d$ such that $N_{\rho_{\max}}(G, \epsilon) \le (B/\epsilon)^d$ for all $\epsilon > 0$, then, for all large enough $n \in \mathbb{N}$, for any $\delta > 0$, for any probability distribution $P$ over $Z$, if $S$ is obtained by sampling $n$ times independently from $P$, then, with probability at least $1 - \delta$, for all $g \in G$,

$$\mathbb{E}_{z \sim P}[g(z)] \le \mathbb{E}_S[g] + CB\sqrt{\frac{d}{n}} + M\sqrt{\frac{\log\frac{2}{\delta}}{n}},$$

where $C$ is an absolute constant.

It remains to bound $N_{\rho_{\max}}(G, \epsilon)$ for Lipschitz-parameterized classes. For this, we need the notion of a packing, which we now define.

Definition A.6.
For any metric space $(X, \rho)$ and any $H \subseteq X$, let $M_\rho(H, \epsilon)$ be the size of the largest subset of $H$ whose members are pairwise at a distance greater than $\epsilon$ w.r.t. $\rho$.

Lemma A.7 ((Kolmogorov and Tikhomirov, 1959)). For any metric space $(X, \rho)$, any $H \subseteq X$, and any $\epsilon > 0$, we have $N_\rho(H, \epsilon) \le M_\rho(H, \epsilon)$.

We will also need a lemma about covering a ball by smaller balls. This is probably also already known, and uses a standard proof (see Pollard, 1990, Lemma 4.1), but we haven't found a reference for it.
Lemma A.8.
Let
• $d$ be a positive integer,
• $\|\cdot\|$ be a norm on $\mathbb{R}^d$,
• $\rho$ be the metric induced by $\|\cdot\|$, and
• $\kappa, \epsilon > 0$.
A ball in $\mathbb{R}^d$ of radius $\kappa$ w.r.t. $\rho$ can be covered by $(3\kappa/\epsilon)^d$ balls of radius $\epsilon$.

Proof. We may assume without loss of generality that $\kappa > \epsilon$. Let $q > 0$ be the volume of the unit ball w.r.t. $\rho$ in $\mathbb{R}^d$. Then the volume of any $\alpha$-ball with respect to $\rho$ is $\alpha^d q$. Let $B$ be the ball of radius $\kappa$ in $\mathbb{R}^d$. The $\epsilon/2$-balls centered at the members of any $\epsilon$-packing of $B$ are disjoint. Since these centers are contained in $B$, the balls are contained in a ball of radius $\kappa + \epsilon/2$. Thus

$$M_\rho(B, \epsilon)\left(\frac{\epsilon}{2}\right)^d q \le \left(\kappa + \frac{\epsilon}{2}\right)^d q \le \left(\frac{3\kappa}{2}\right)^d q.$$

Solving for $M_\rho(B, \epsilon)$ and applying Lemma A.7 completes the proof.

We now prove Lemma 2.3. Let $\|\cdot\|$ be the norm witnessing the fact that $G$ is $(B, d)$-Lipschitz parameterized, let $B_1$ be the unit ball in $\mathbb{R}^d$ w.r.t. $\|\cdot\|$, and let $\rho$ be the metric induced by $\|\cdot\|$. Then, for any $\epsilon$, an $\epsilon/B$-cover of $B_1$ w.r.t. $\rho$ induces an $\epsilon$-cover of $G$ w.r.t. $\rho_{\max}$, so $N_{\rho_{\max}}(G, \epsilon) \le N_\rho(B_1, \epsilon/B)$. Applying Lemma A.8, this implies

$$N_{\rho_{\max}}(G, \epsilon) \le \left(\frac{3B}{\epsilon}\right)^d.$$

Then applying Lemma A.3, Lemma A.4 and Lemma A.5 completes the proof.
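The volume argument in Lemma A.8 is easy to probe numerically: greedily grow an $\epsilon$-packing of a $\kappa$-ball and compare its size against $(3\kappa/\epsilon)^d$; by Lemma A.7, a maximal packing is also a cover. A small Monte Carlo sketch (the greedy packing it finds is approximate, not provably maximal):

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa, eps = 2, 1.0, 0.25

centers = []
for _ in range(20000):
    x = rng.uniform(-kappa, kappa, size=d)
    # Keep points that lie in the ball and are eps-separated: a packing.
    if np.linalg.norm(x) <= kappa and all(np.linalg.norm(x - c) > eps
                                          for c in centers):
        centers.append(x)

print(len(centers), (3 * kappa / eps) ** d)  # packing size vs. (3*kappa/eps)^d
```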
B PROOF OF (1)
For $\delta > 0$, and for each $j \in \mathbb{N}$, let $\beta_j = 5 \times 2^j$ and $\delta_j = \delta/(2j^2)$. Taking a union bound over an application of Theorem 2.1 for each value of $j$, with probability at least $1 - \sum_j \delta_j \ge 1 - \delta$, for all $j$, and all $f \in \mathcal{F}_{\beta_j}$,

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le (1 + \eta)\,\mathbb{E}_S[\ell_f(z)] + \frac{C(W(\beta_j + \log(\lambda n)) + \log(j/\delta))}{n}$$

and

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f(z)] + C\sqrt{\frac{W(\beta_j + \log(\lambda)) + \log(j/\delta)}{n}}.$$

For any $K$, if we apply these bounds in the case of the least $j$ such that $\|K - K_0\|_\sigma \le \beta_j$, we get

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le (1 + \eta)\,\mathbb{E}_S[\ell_f(z)] + \frac{C(W(2\|K - K_0\|_\sigma + \log(\lambda n)) + \log(\log(\|K - K_0\|_\sigma)/\delta))}{n}$$

and

$$\mathbb{E}_{z \sim P}[\ell_f(z)] \le \mathbb{E}_S[\ell_f(z)] + C\sqrt{\frac{W(2\|K - K_0\|_\sigma + \log(\lambda)) + \log(\log(\|K - K_0\|_\sigma)/\delta)}{n}},$$

and simplifying completes the proof.

C THE OPERATOR NORM OF $\mathrm{op}(K^{(i)})$
Let $J = K^{(i)} - K_0^{(i)}$. Since $\|\mathrm{op}(K^{(i)})\| = 1 + \|\mathrm{op}(J)\|$, it suffices to find $\|\mathrm{op}(J)\|$.

For the rest of this section, we number indices from $0$, let $[d] = \{0, \ldots, d-1\}$, and define $\omega = \exp(2\pi i/d)$. To facilitate the application of matrix notation, pad the $k \times k \times c \times c$ tensor $J$ out with zeros to make a $d \times d \times c \times c$ tensor $\tilde{J}$.

The following lemma is an immediate consequence of Theorem 6 of Sedghi et al. (2018).

Lemma C.1 (Sedghi et al. (2018)). Let $F$ be the complex $d \times d$ matrix defined by $F_{ij} = \omega^{ij}$. For each $(u, v) \in [d] \times [d]$, let $P^{(u,v)}$ be the $c \times c$ matrix given by $P^{(u,v)}_{k\ell} = (F^\top \tilde{J}_{:,:,k,\ell} F)_{uv}$. Then

$$\|\mathrm{op}(J)\| = \max_{u,v} \|P^{(u,v)}\|.$$

First, note that, by symmetry, for each $u$ and $v$, all components of $P^{(u,v)}$ are the same. Thus,

$$\|P^{(u,v)}\| = c\,|P^{(u,v)}_{00}|. \qquad (3)$$

For any $u, v$,

$$|P^{(u,v)}_{00}| = \left|\sum_{p,q} \omega^{up}\omega^{vq}\,\tilde{J}_{p,q,0,0}\right| \le \epsilon k^2,$$

and $P^{(0,0)}_{00} = \epsilon k^2$. Combining this with (3) and Lemma C.1, $\|\mathrm{op}(J)\| = \epsilon ck^2$, which implies $\|\mathrm{op}(K^{(i)})\| = 1 + \epsilon ck^2$.
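This conclusion can be checked directly for small sizes by materializing the operator matrix of circular convolution with the all-$\epsilon$ kernel $J$ and computing its spectral norm; a brute-force sketch (circular convolution avoids the edge effects disregarded above, matching Lemma C.1):

```python
import numpy as np

d, k, c, eps = 8, 3, 2, 0.1
J = np.full((k, k, c, c), eps)          # the all-eps difference kernel

def circ_conv(x):
    """Circular convolution of x (shape (d, d, c)) with J."""
    out = np.zeros((d, d, c))
    for o in range(c):
        for i in range(c):
            for p in range(k):
                for q in range(k):
                    out[:, :, o] += J[p, q, i, o] * np.roll(
                        np.roll(x[:, :, i], p, axis=0), q, axis=1)
    return out

# Build op(J) column by column by convolving standard basis vectors.
M = np.zeros((d * d * c, d * d * c))
for idx in range(d * d * c):
    e = np.zeros(d * d * c)
    e[idx] = 1.0
    M[:, idx] = circ_conv(e.reshape(d, d, c)).ravel()

print(np.linalg.norm(M, 2), eps * c * k**2)  # both equal eps*c*k^2 = 1.8
```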
D PROOF OF LEMMA 3.2
For each convolutional layer $i$, let $\beta_i = \|\mathrm{op}(K^{(i)}) - \mathrm{op}(K_0^{(i)})\|$, and, for each fully connected layer $i$, let $\gamma_i = \|V^{(i)} - V_0^{(i)}\|$.

Since $\ell$ is $\lambda$-Lipschitz w.r.t. its first argument, we have that $|\ell(f_\Theta(x), y) - \ell(f_{\tilde{\Theta}}(x), y)| \le \lambda|f_\Theta(x) - f_{\tilde{\Theta}}(x)|$. Let $g_{\mathrm{up}}$ be the function from the inputs of the whole network with parameters $\Theta$ to the inputs of the convolution in layer $j$, and let $g_{\mathrm{down}}$ be the function from the output of this convolution to the output of the whole network, so that $f_\Theta = g_{\mathrm{down}} \circ f_{\mathrm{op}(K^{(j)})} \circ g_{\mathrm{up}}$. Choose an input $x$ to the network, and let $u = g_{\mathrm{up}}(x)$. Recalling that $\|x\| \le \chi$, and that the non-linearities and pooling operations are non-expansive, we have

$$\|u\| \le \chi \prod_{i < j} \|\mathrm{op}(K^{(i)})\| \le \chi \prod_{i < j} (1 + \nu + \beta_i),$$

and, similarly, $g_{\mathrm{down}}$ is $\left(\prod_{\text{conv.}\ i > j}(1 + \nu + \beta_i)\right)\left(\prod_{\text{f.c.}\ i}(1 + \nu + \gamma_i)\right)$-Lipschitz. Thus

$$|f_\Theta(x) - f_{\tilde{\Theta}}(x)| \le \chi\left(\prod_{\text{conv.}\ i \ne j}(1 + \nu + \beta_i)\right)\left(\prod_{\text{f.c.}\ i}(1 + \nu + \gamma_i)\right)\left\|\mathrm{op}(K^{(j)}) - \mathrm{op}(\tilde{K}^{(j)})\right\|.$$

Since $\sum_i \beta_i + \sum_i \gamma_i \le \beta$, the product of the $L - 1$ factors is maximized when they are all equal, so it is at most $(1 + \nu + \beta/(L-1))^{L-1} \le (1 + \nu + \beta/L)^L$. Combining with the Lipschitz property of $\ell$ completes the proof.
E PROOF OF LEMMA 3.3
For each convolutional layer $i$, let $\beta_i = \|\mathrm{op}(K^{(i)}) - \mathrm{op}(K_0^{(i)})\|$, and, for each fully connected layer $i$, let $\gamma_i = \|V^{(i)} - V_0^{(i)}\|$.

Since $\ell$ is $\lambda$-Lipschitz w.r.t. its first argument, we have that $|\ell(f_\Theta(x), y) - \ell(f_{\tilde{\Theta}}(x), y)| \le \lambda|f_\Theta(x) - f_{\tilde{\Theta}}(x)|$. Let $g_{\mathrm{up}}$ be the function from the inputs of the whole network with parameters $\Theta$ to the inputs of fully connected layer $j$, and let $g_{\mathrm{down}}$ be the function from the output of this layer to the output of the whole network, so that $f_\Theta = g_{\mathrm{down}} \circ f_{V^{(j)}} \circ g_{\mathrm{up}}$. Choose an input $x$ to the network, and let $u = g_{\mathrm{up}}(x)$. Recalling that $\|x\| \le \chi$, and that the non-linearities and pooling operations are non-expansive, we have

$$\|u\| \le \chi\left(\prod_i \|\mathrm{op}(K^{(i)})\|\right)\prod_{i < j} \|V^{(i)}\| \le \chi\left(\prod_i (1 + \nu + \beta_i)\right)\prod_{i < j}(1 + \nu + \gamma_i),$$

and, similarly, $g_{\mathrm{down}}$ is $\prod_{i > j}(1 + \nu + \gamma_i)$-Lipschitz. Thus

$$|f_\Theta(x) - f_{\tilde{\Theta}}(x)| \le \chi\left(\prod_i (1 + \nu + \beta_i)\right)\left(\prod_{i \ne j}(1 + \nu + \gamma_i)\right)\left\|V^{(j)} - \tilde{V}^{(j)}\right\|,$$

and, as in the proof of Lemma 3.2, since $\sum_i \beta_i + \sum_i \gamma_i \le \beta$, the product of the $L - 1$ factors is at most $(1 + \nu + \beta/L)^L$. Combining with the Lipschitz property of $\ell$ completes the proof.