Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error
Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, Tomoaki Nishimura
The University of Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Japan; iPride Co., Ltd., Japan; NTT DATA Mathematical Systems Inc., Japan; NTT Data Corporation, Japan
Contact: [email protected] (T. Suzuki), [email protected] (H. Abe), [email protected] (T. Nishimura)

Abstract
Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous background in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes, and we show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound for the compressed model and characterize the bias–variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.
Currently, deep learning is the most promising approach adopted by various machine learning applications such as computer vision, natural language processing, and audio processing. Along with the rapid development of deep learning techniques, network structures are becoming considerably more complicated. In addition to the model structure, the model size is also becoming larger, which prevents the implementation of deep neural network models on edge-computing devices for applications such as smartphone services, autonomous vehicle driving, and drone control. To overcome this problem, model compression techniques such as pruning, factorization [Denil et al., 2013; Denton et al., 2014], and quantization [Han et al., 2015] have been extensively studied in the literature.

Among these techniques, pruning is a typical approach that discards redundant nodes, e.g., by explicit regularization such as $\ell_1$ and $\ell_2$ penalization during training [Lebedev and Lempitsky, 2016; Wen et al., 2016; He et al., 2017]. It has been implemented as ThiNet [Luo et al., 2017], Net-Trim [Aghasi et al., 2017], NISP [Yu et al., 2018], and so on [Denil et al., 2013]. A similar effect can be realized by implicit randomized regularization such as DropConnect [Wan et al., 2013], which randomly removes connections during the training phase. However, only a few of these techniques (e.g., Net-Trim [Aghasi et al., 2017]) are supported by statistical learning theory. In particular, it is unclear which type of quantity controls the compression ability. On the theoretical side, compression-based generalization analysis is a promising approach for measuring the redundancy of a network [Arora et al., 2018; Zhou et al., 2019]. However, despite their theoretical novelty, the connection of these generalization error analyses to practically useful compression methods is not obvious.

In this paper, we develop a new compression-based generalization error bound and propose a new simple pruning method that is compatible with the generalization error analysis. Our method aims to minimize the information loss induced by compression; in particular, it minimizes the redundancy among nodes instead of merely looking at the amount of information of each individual node. It can be executed by simply observing the covariance matrix of the internal layers and is easy to implement. The proposed method is supported by a comprehensive theoretical analysis. Notably, the approximation error induced by compression is characterized by the notion of the statistical degrees of freedom [Mallows, 1973; Caponnetto and de Vito, 2007]. It represents the intrinsic dimensionality of a model and is determined by the eigenvalues of the covariance matrix between the nodes in each layer. Usually, we observe that the eigenvalues decrease rapidly (Fig. 1a) for several reasons, such as explicit regularization (Dropout [Wager et al., 2013], weight decay [Krogh and Hertz, 1992]) and implicit regularization [Hardt et al., 2016; Gunasekar et al., 2018], which means that the amount of important information processed in each layer is not large. In particular, the rapid decay of the eigenvalues leads to a low number of degrees of freedom. Then, we can effectively compress a trained network into a smaller one that has fewer parameters than the original. Behind the theory, there is essentially a connection to the random feature technique for kernel methods [Bach, 2017].
Compression error analysis is directly connected to generalization error analysis. The derived bound is actually much tighter than the naive VC-theory bound for the uncompressed network [Bartlett et al., 2017] and even tighter than recent compression-based bounds [Arora et al., 2018]. Further, there is a tradeoff between the bias and the variance, where the bias is induced by the network compression and the variance is induced by the variation in the training data. In addition, we show the superiority of our method and experimentally verify our theory with extensive numerical experiments. Our contributions are summarized as follows:
• We give a theoretical compression bound that is compatible with a practically useful pruning method, and we propose a new simple pruning method called spectral pruning for compressing deep neural networks.
• We characterize the model compression ability by utilizing the notion of the degrees of freedom, which represents the intrinsic dimensionality of the model. We also give a generalization error bound for a trained network compressed by our method and show that a bias–variance tradeoff induced by model compression appears. The obtained bound is fairly tight compared with existing compression-based bounds and much tighter than the naive VC-dimension bound.
Suppose that the training data $D_{\mathrm{tr}} = \{(x_i, y_i)\}_{i=1}^n$ are observed, where $x_i \in \mathbb{R}^{d_x}$ is an input and $y_i$ is an output that could be a real number ($y_i \in \mathbb{R}$), a binary label ($y_i \in \{\pm 1\}$), and so on. The training data are independently and identically distributed. To model the relationship between $x$ and $y$, we construct a deep neural network
$$f(x) = (W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)}),$$
where $W^{(\ell)} \in \mathbb{R}^{m_{\ell+1}\times m_\ell}$, $b^{(\ell)} \in \mathbb{R}^{m_{\ell+1}}$ ($\ell = 1,\dots,L$), and $\eta:\mathbb{R}\to\mathbb{R}$ is an activation function (applied in an element-wise manner; for a vector $x\in\mathbb{R}^d$, $\eta(x) = (\eta(x_1),\dots,\eta(x_d))^\top$). Here, $m_\ell$ is the width of the $\ell$-th layer, with $m_{L+1} = 1$ (output) and $m_1 = d_x$ (input). Let $\hat f$ be a trained network obtained from the training data $D_{\mathrm{tr}}$, whose parameters are denoted by $(\hat W^{(\ell)}, \hat b^{(\ell)})_{\ell=1}^L$, i.e., $\hat f(x) = (\hat W^{(L)}\eta(\cdot) + \hat b^{(L)}) \circ \cdots \circ (\hat W^{(1)}x + \hat b^{(1)})$. The input to the $\ell$-th layer (after activation) is denoted by $\phi^{(\ell)}(x) = \eta \circ (\hat W^{(\ell-1)}\eta(\cdot) + \hat b^{(\ell-1)}) \circ \cdots \circ (\hat W^{(1)}x + \hat b^{(1)})$. We do not specify how the network $\hat f$ is trained, and the following argument can be applied to any learning method such as the empirical risk minimizer, the Bayes estimator, or another estimator. We want to compress the trained network $\hat f$ into a smaller one $f^\sharp$ with widths $(m^\sharp_\ell)_{\ell=1}^L$ while keeping the test accuracy as high as possible.

To compress the trained network $\hat f$ into a smaller one $f^\sharp$, we propose a simple strategy called spectral pruning. The main idea of the method is to find the most informative subset of the nodes. The amount of information of a subset is measured by how well the selected nodes can explain the other nodes in the layer and recover the output to the next layer. For example, if some nodes are heavily correlated with each other, then only one of them will be selected by our method. The information redundancy can be computed from a covariance matrix between nodes and a simple regression problem. Unlike the methods in [Lebedev and Lempitsky, 2016; Wen et al., 2016; Aghasi et al., 2017], we do not need to solve a specific nonlinear optimization problem. Our method simultaneously minimizes the input information loss and the output information loss, which are defined below.
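The quantities used below are straightforward to compute. The following is a minimal NumPy sketch, not code from the paper: the function names, the ReLU choice, and the array layout are our own assumptions. It only illustrates how the layer inputs $\phi^{(\ell)}(x_i)$ and the noncentered empirical covariance $\hat\Sigma^{(\ell)}$ could be collected from a trained fully connected network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def layer_inputs(X, weights, biases):
    """Return phi^(l)(x_i) for every layer l, stacked as (n, m_l) matrices.

    X: (n, d_x) training inputs; weights[l]: (m_{l+1}, m_l); biases[l]: (m_{l+1},).
    phi^(1) is the raw input; phi^(l) for l >= 2 is the post-activation input to layer l.
    """
    phis = [X]
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):  # the last layer produces the output, not an internal phi
        h = relu(h @ W.T + b)
        phis.append(h)
    return phis

def empirical_covariance(phi):
    """Noncentered empirical covariance: Sigma_hat = (1/n) sum_i phi(x_i) phi(x_i)^T."""
    n = phi.shape[0]
    return phi.T @ phi / n
```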
(i) Input information loss. First, we explain the input information loss. Denote $\phi(x) = \phi^{(\ell)}(x)$ for simplicity, and let $\phi_J(x) = (\phi_j(x))_{j\in J} \in \mathbb{R}^{m^\sharp_\ell}$ be the subvector of $\phi(x)$ corresponding to an index set $J \in [m_\ell]^{m^\sharp_\ell}$, where $[m] := \{1,\dots,m\}$ (here, duplication of indices is allowed). The basic strategy is to solve the following optimization problem so that we can recover $\phi(x)$ from $\phi_J(x)$ as accurately as possible:
$$\hat A_J := (\hat A^{(\ell)}_J =)\ \mathop{\mathrm{argmin}}_{A\in\mathbb{R}^{m_\ell\times|J|}} \hat{\mathrm{E}}[\|\phi - A\phi_J\|^2] + \|A\|^2_\tau, \qquad (1)$$
where $\hat{\mathrm{E}}[\cdot]$ is the expectation with respect to the empirical distribution ($\hat{\mathrm{E}}[f] = \frac{1}{n}\sum_{i=1}^n f(x_i)$) and $\|A\|^2_\tau = \mathrm{Tr}[A I_\tau A^\top]$ for a regularization parameter $\tau \in \mathbb{R}^{|J|}_+$ with $I_\tau := \mathrm{diag}(\tau)$ (how to set the regularization parameter is given in Theorem 1). The optimal solution $\hat A_J$ can be expressed explicitly through the (noncentered) covariance matrix of the $\ell$-th layer of the trained network $\hat f$ on the empirical distribution, $\hat\Sigma := \hat\Sigma^{(\ell)} = \frac{1}{n}\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top$ (here, we omit the layer index $\ell$ for notational simplicity). Let $\hat\Sigma_{I,I'} \in \mathbb{R}^{K\times H}$ for $K,H\in\mathbb{N}$ be the submatrix of $\hat\Sigma$ for index sets $I\in[m_\ell]^K$ and $I'\in[m_\ell]^H$, i.e., $\hat\Sigma_{I,I'} = (\hat\Sigma_{i,j})_{i\in I,\, j\in I'}$. Let $F = \{1,\dots,m_\ell\}$ be the full index set. Then, we can easily see that $\hat A_J = \hat\Sigma_{F,J}(\hat\Sigma_{J,J} + I_\tau)^{-1}$. Hence, the full vector $\phi(x)$ can be decoded from $\phi_J(x)$ as $\phi(x) \approx \hat A_J\phi_J(x) = \hat\Sigma_{F,J}(\hat\Sigma_{J,J}+I_\tau)^{-1}\phi_J(x)$. To measure the approximation error, we define $L^{(\mathrm{A})}_\tau(J) = \min_{A\in\mathbb{R}^{m_\ell\times|J|}} \hat{\mathrm{E}}[\|\phi - A\phi_J\|^2] + \|A\|^2_\tau$. By substituting the explicit formula for $\hat A_J$ into the objective, this is reformulated as
$$L^{(\mathrm{A})}_\tau(J) = \mathrm{Tr}[\hat\Sigma_{F,F} - \hat\Sigma_{F,J}(\hat\Sigma_{J,J} + I_\tau)^{-1}\hat\Sigma_{J,F}].$$

(ii) Output information loss. Next, we explain the output information loss. Suppose that we aim to directly approximate the outputs $Z^{(\ell)}\phi$ for a weight matrix $Z^{(\ell)} \in \mathbb{R}^{m\times m_\ell}$ with an output size $m\in\mathbb{N}$. A typical situation is $Z^{(\ell)} = \hat W^{(\ell)}$, so that we approximate the output $\hat W^{(\ell)}\phi$ (the concrete setting of $Z^{(\ell)}$ is specified in Theorem 1). Then, we consider the objective
$$L^{(\mathrm{B})}_\tau(J) := \sum_{j=1}^m \min_{\alpha\in\mathbb{R}^{|J|}}\Big\{\hat{\mathrm{E}}[(Z^{(\ell)}_{j,:}\phi - \alpha^\top\phi_J)^2] + \|\alpha^\top\|^2_\tau\Big\} = \mathrm{Tr}\big\{Z^{(\ell)}[\hat\Sigma_{F,F} - \hat\Sigma_{F,J}(\hat\Sigma_{J,J}+I_\tau)^{-1}\hat\Sigma_{J,F}]Z^{(\ell)\top}\big\},$$
where $Z^{(\ell)}_{j,:}$ denotes the $j$-th row of the matrix $Z^{(\ell)}$. It can easily be checked that the optimal solution $\hat\alpha_J$ of the minimum in the definition of $L^{(\mathrm{B})}_\tau$ is given by $\hat\alpha_J = \hat A^\top_J Z^{(\ell)\top}_{j,:}$ for each $j = 1,\dots,m$.
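Continuing the sketch above, the closed-form decoder $\hat A_J$ and the two losses $L^{(\mathrm{A})}_\tau(J)$ and $L^{(\mathrm{B})}_\tau(J)$ can be evaluated directly from the covariance matrix. Again, this is an illustrative sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def info_losses(Sigma, J, tau, Z=None):
    """Input/output information losses for a candidate index set J.

    Sigma: (m, m) empirical covariance of phi in one layer.
    J: list of selected node indices; tau: (len(J),) regularization vector.
    Z: optional (m_out, m) matrix for the output-aware loss L^B; if None, only L^A is meaningful.
    """
    S_FJ = Sigma[:, J]                                   # Sigma_{F,J}
    S_JJ = Sigma[np.ix_(J, J)]                           # Sigma_{J,J}
    A_J = S_FJ @ np.linalg.inv(S_JJ + np.diag(tau))      # hat A_J = Sigma_{F,J}(Sigma_{J,J} + I_tau)^{-1}
    residual = Sigma - A_J @ S_FJ.T                      # Sigma_{F,F} - Sigma_{F,J}(...)^{-1}Sigma_{J,F}
    L_A = np.trace(residual)                             # minimized value of the regularized objective (1)
    L_B = None if Z is None else np.trace(Z @ residual @ Z.T)
    return A_J, L_A, L_B
```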
(iii) Combination of the input and output information losses. Finally, we combine the input and output information losses and aim to minimize this combination. To do so, we propose to use a convex combination of both criteria for a parameter $0\le\theta\le 1$ and to optimize it with respect to $J$ under a cardinality constraint $|J| = m^\sharp_\ell$ for a prespecified width $m^\sharp_\ell$ of the compressed network:
$$\min_J\ L^{(\theta)}_\tau(J) = \theta L^{(\mathrm{A})}_\tau(J) + (1-\theta)L^{(\mathrm{B})}_\tau(J) \quad \text{s.t.}\ J\in[m_\ell]^{m^\sharp_\ell}. \qquad (2)$$
We call this method spectral pruning. There are two hyperparameters, $\theta$ and the regularization parameter $\tau$; however, the experiments (Sec. 5) show that the method is robust against their choice. Let $J^\sharp_\ell$ be the optimal $J$ that minimizes the objective. This optimization problem is NP-hard, but an approximate solution is obtained by a greedy algorithm, since the problem reduces to maximization of a monotonic submodular function [Krause and Golovin, 2014]. That is, we start from $J = \emptyset$, sequentially choose the element $j^*\in[m_\ell]$ that maximally reduces the objective $L^{(\theta)}_\tau$, and add this element to $J$ ($J\leftarrow J\cup\{j^*\}$) until $|J| = m^\sharp_\ell$ is satisfied; a sketch of this procedure is given below. After we have chosen an index set $J^\sharp_\ell$ ($\ell = 2,\dots,L$) for each layer, we construct the compressed network $f^\sharp$ as $f^\sharp(x) = (W^{\sharp(L)}\eta(\cdot) + b^{\sharp(L)})\circ\cdots\circ(W^{\sharp(1)}x + b^{\sharp(1)})$, where $W^{\sharp(\ell)} = \hat W^{(\ell)}_{J^\sharp_{\ell+1},F}\hat A^{(\ell)}_{J^\sharp_\ell}$ and $b^{\sharp(\ell)} = \hat b^{(\ell)}_{J^\sharp_{\ell+1}}$. An application to a CNN is given in Appendix A. The method can be executed in a layer-wise manner; thus, it can be applied to networks with complicated structures such as ResNet.
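A possible form of the greedy selection for problem (2) is sketched below; it reuses `info_losses` from the previous sketch, and the choice $\tau = m^\sharp_\ell\lambda_\ell\tilde\tau^{(\ell)}$ anticipates the setting used in Theorem 1. The function name and the details (e.g., allowing duplicate picks, scanning all candidates at every step) are our own illustration of the procedure described above, not the authors' code.

```python
import numpy as np

def spectral_prune_layer(Sigma, m_sharp, lam, theta, Z=None):
    """Greedy (approximate) minimization of the combined loss (2) for one layer.

    Sigma: (m, m) layer covariance; m_sharp: target width; lam: lambda_l used in
    tau = m_sharp * lam * leverage_score; theta in [0, 1] mixes L^A and L^B.
    Returns the selected index set J and the decoding matrix hat A_J.
    """
    m = Sigma.shape[0]
    lev = np.diag(Sigma @ np.linalg.inv(Sigma + lam * np.eye(m)))
    lev = lev / lev.sum()                                # leverage scores, summing to one
    J = []
    for _ in range(m_sharp):
        best_j, best_val = None, np.inf
        for j in range(m):                               # duplication of indices is allowed
            cand = J + [j]
            tau = m_sharp * lam * lev[cand]
            _, L_A, L_B = info_losses(Sigma, cand, tau, Z)
            val = theta * L_A + (1.0 - theta) * (L_B if Z is not None else L_A)
            if val < best_val:
                best_j, best_val = j, val
        J.append(best_j)
    A_J, _, _ = info_losses(Sigma, J, m_sharp * lam * lev[J], Z)
    return J, A_J
```

Given the selection for each layer, the compressed weights would be assembled as in the construction above, e.g. `W_sharp = W_hat[J_next, :] @ A_J`.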
In this section, we give a theoretical guarantee for our method. First, we give the approximation error induced by our pruning procedure in Theorem 1. Next, we evaluate the generalization error of the compressed network in Theorem 2. More specifically, we introduce a quantity called the degrees of freedom [Mallows, 1973; Caponnetto and de Vito, 2007; Suzuki, 2018; Suzuki et al., 2020] that represents the intrinsic dimensionality of the model and determines the approximation accuracy.

For the theoretical analysis, we define a neural network model with norm constraints on the parameters $W^{(\ell)}$ and $b^{(\ell)}$ ($\ell = 1,\dots,L$). Let $R > 0$ and $R_b > 0$ be upper bounds on the parameters, and define the norm-constrained model as
$$\mathcal{F} := \Big\{(W^{(L)}\eta(\cdot)+b^{(L)})\circ\cdots\circ(W^{(1)}x+b^{(1)})\ \Big|\ \max_j\|W^{(\ell)}_{j,:}\| \le R/\sqrt{m_{\ell+1}},\ \|b^{(\ell)}\|_\infty \le R_b/\sqrt{m_{\ell+1}}\Big\},$$
where $W^{(\ell)}_{j,:}$ denotes the $j$-th row of the matrix $W^{(\ell)}$, $\|\cdot\|$ is the Euclidean norm, and $\|\cdot\|_\infty$ is the $\ell_\infty$-norm. We make the following assumption on the activation function, which is satisfied by ReLU and leaky ReLU [Maas et al., 2013].

Assumption 1. We assume that the activation function $\eta$ satisfies (1) scale invariance: $\eta(ax) = a\eta(x)$ for all $a > 0$ and $x\in\mathbb{R}^d$, and (2) 1-Lipschitz continuity: $|\eta(x) - \eta(x')| \le \|x - x'\|$ for all $x, x'\in\mathbb{R}^d$, where $d$ is arbitrary.

Here, we evaluate the approximation error incurred by our pruning procedure. Let $(m^\sharp_\ell)_{\ell=1}^L$ denote the widths of the layers of the compressed network $f^\sharp$. We characterize the approximation error between $f^\sharp$ and $\hat f$ on the basis of the degrees of freedom with respect to the empirical $L_2$-norm $\|g\|_n^2 := \frac{1}{n}\sum_{i=1}^n\|g(x_i)\|^2$, defined for a vector-valued function $g$. Recall that the empirical covariance matrix in the $\ell$-th layer is denoted by $\hat\Sigma^{(\ell)}$. We define the degrees of freedom as
$$\hat N_\ell(\lambda) := \mathrm{Tr}[\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda I)^{-1}] = \sum_{j=1}^{m_\ell}\hat\mu^{(\ell)}_j/(\hat\mu^{(\ell)}_j + \lambda),$$
where $(\hat\mu^{(\ell)}_j)_{j=1}^{m_\ell}$ are the eigenvalues of $\hat\Sigma^{(\ell)}$ sorted in decreasing order. Roughly speaking, this quantity counts the number of eigenvalues above $\lambda$, and thus it is monotonically decreasing in $\lambda$. The degrees of freedom play an essential role in investigating the predictive accuracy of ridge regression [Mallows, 1973; Caponnetto and de Vito, 2007; Bach, 2017]. To characterize the output information loss, we also define the output aware degrees of freedom with respect to a matrix $Z^{(\ell)}$ as
$$\hat N'_\ell(\lambda; Z^{(\ell)}) := \mathrm{Tr}[Z^{(\ell)}\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda I)^{-1}Z^{(\ell)\top}].$$
This quantity measures the intrinsic dimensionality of the output from the $\ell$-th layer for a weight matrix $Z^{(\ell)}$. If the covariance $\hat\Sigma^{(\ell)}$ and the matrix $Z^{(\ell)}$ are near low rank, $\hat N'_\ell(\lambda; Z^{(\ell)})$ becomes much smaller than $\hat N_\ell(\lambda)$. Finally, we define $N^\theta_\ell(\lambda) := \theta\hat N_\ell(\lambda) + (1-\theta)\hat N'_\ell(\lambda; Z^{(\ell)})$. To evaluate the approximation error induced by compression, we define $\lambda_\ell > 0$ as
$$\lambda_\ell = \inf\big\{\lambda \ge 0 \mid m^\sharp_\ell \ge N^\theta_\ell(\lambda)\log(80\hat N_\ell(\lambda))\big\}. \qquad (3)$$
(We are implicitly supposing $R, R_b \simeq 1$ so that $\|W^{(\ell)}\|_F, \|b^{(\ell)}\| = O(1)$.)
Conversely, we may determine $m^\sharp_\ell$ from $\lambda_\ell$ to obtain the theorems below. Along with the degrees of freedom, we define the leverage score $\tilde\tau^{(\ell)}\in\mathbb{R}^{m_\ell}$ as
$$\tilde\tau^{(\ell)}_j := \frac{1}{\hat N_\ell(\lambda_\ell)}\big[\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda_\ell I)^{-1}\big]_{j,j} \quad (j\in[m_\ell]).$$
Note that $\sum_{j=1}^{m_\ell}\tilde\tau^{(\ell)}_j = 1$ by the definition of the degrees of freedom. The leverage score can be seen as the contribution of node $j\in[m_\ell]$ to the degrees of freedom. For simplicity, we assume that $\tilde\tau^{(\ell)}_j > 0$ for all $\ell, j$ (otherwise, we just need to neglect any node with $\tilde\tau^{(\ell)}_j = 0$).

For the approximation error bound, we consider two situations: (i) (Backward procedure) spectral pruning is applied from $\ell = L$ down to $\ell = 2$ in order, and for pruning the $\ell$-th layer we may utilize the index set $J^\sharp_{\ell+1}$ selected in the $(\ell+1)$-th layer; and (ii) (Simultaneous procedure) spectral pruning is applied simultaneously for all $\ell = 2,\dots,L$. We provide a statement only for the backward procedure. The simultaneous procedure also achieves a similar bound with some modifications; the complete statement is given as Theorem 3 in Appendix B.

As for $Z^{(\ell)}$ in the output information loss, we set
$$Z^{(\ell)}_{k,:} = \sqrt{m_\ell q^{(\ell)}_{j_k}}\,\big(\max_{j'}\|\hat W^{(\ell)}_{j',:}\|\big)^{-1}\hat W^{(\ell)}_{j_k,:} \quad (k = 1,\dots,m^\sharp_{\ell+1}),$$
where we let $J^\sharp_{\ell+1} = \{j_1,\dots,j_{m^\sharp_{\ell+1}}\}$, and $q^{(\ell)}_j := (\tilde\tau^{(\ell+1)}_j)^{-1}\big/\sum_{j'\in J^\sharp_{\ell+1}}(\tilde\tau^{(\ell+1)}_{j'})^{-1}$ ($j\in J^\sharp_{\ell+1}$) and $q^{(\ell)}_j = 0$ (otherwise). Finally, we set the regularization parameter $\tau$ as $\tau \leftarrow m^\sharp_\ell\lambda_\ell\tilde\tau^{(\ell)}$.

Theorem 1 (Compression rate via the degrees of freedom). If we solve the optimization problem (2) with the additional constraint $\sum_{j\in J}(\tilde\tau^{(\ell)}_j)^{-1} \le m_\ell m^\sharp_\ell$ on the index set $J$, then the optimization problem is feasible, and the overall approximation error of $f^\sharp$ is bounded by
$$\|\hat f - f^\sharp\|_n \le \sum_{\ell=2}^{L}\Big(\bar R^{L-\ell+1}\sqrt{\textstyle\prod_{\ell'=\ell}^{L}\zeta_{\ell',\theta}}\Big)\sqrt{\lambda_\ell} \qquad (4)$$
for $\bar R = \sqrt{\hat c}\,R$, where $\hat c$ is a universal constant, and
$$\zeta_{\ell,\theta} := N^\theta_\ell(\lambda_\ell)\Big(\theta\,\frac{\max_{j\in[m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2}{\|(\hat W^{(\ell)})^\top I_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}} + (1-\theta)\,m_\ell\Big)^{-1}.$$
Here, $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm of a matrix (the largest absolute singular value).

The proof is given in Appendix B. To prove the theorem, we essentially rely on the theory of random features for kernel methods [Bach, 2017; Suzuki, 2018]. The main message of the theorem is that the approximation error induced by compression is directly controlled by the degrees of freedom. Since the degrees of freedom $\hat N_\ell(\lambda)$ are a monotonically decreasing function of $\lambda$, they become large as $\lambda$ decreases to $0$. The behavior of the eigenvalues determines how rapidly $\hat N_\ell(\lambda)$ increases as $\lambda \to 0$.
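For concreteness, the degrees of freedom and the width condition (3) can be evaluated numerically as follows. The sketch is ours: it uses the plain $\hat N_\ell(\lambda)$ in place of the $\theta$-weighted $N^\theta_\ell(\lambda)$ and searches over a user-supplied grid of $\lambda$ values, both of which are simplifying assumptions.

```python
import numpy as np

def degrees_of_freedom(Sigma, lam):
    """N_hat(lambda) = sum_j mu_j / (mu_j + lambda) over the eigenvalues mu_j of Sigma."""
    mu = np.clip(np.linalg.eigvalsh(Sigma), 0.0, None)   # clip tiny negative eigenvalues
    return float(np.sum(mu / (mu + lam)))

def lambda_for_width(Sigma, m_sharp, lam_grid):
    """Smallest lambda on the grid compatible with width m_sharp in the spirit of condition (3),
    i.e. m_sharp >= N_hat(lambda) * log(80 * N_hat(lambda))."""
    for lam in sorted(lam_grid):
        N = degrees_of_freedom(Sigma, lam)
        need = N * np.log(80.0 * N) if N > 0 else 0.0
        if m_sharp >= need:
            return lam
    return max(lam_grid)
```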
We can see that if the eigenvalues $\hat\mu^{(\ell)}_1 \ge \hat\mu^{(\ell)}_2 \ge \cdots$ decrease rapidly, then the approximation error $\lambda_\ell$ can be much smaller for a given model size $m^\sharp_\ell$. In other words, $f^\sharp$ can be much closer to the original network $\hat f$ if there are only a few large eigenvalues.

The quantity $\zeta_{\ell,\theta}$ characterizes how the approximation error $\lambda_{\ell'}$ of the lower layers $\ell'\le\ell$ propagates to the final output. We can see that a tradeoff between $\zeta_{\ell,\theta}$ and $\theta$ appears. By a simple evaluation, $N^\theta_\ell$ in the numerator of $\zeta_{\ell,\theta}$ is bounded by $m_\ell$; thus, $\theta = 1$ gives $\zeta_{\ell,\theta}\le m_\ell$. On the other hand, the term $\max_{j\in[m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2/\|(\hat W^{(\ell)})^\top I_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}$ takes a value between $1$ and $m_{\ell+1}$; thus, $\theta = 1$ is not necessarily the best choice for maximizing the denominator. From this consideration, we can see that the value of $\theta$ minimizing $\zeta_{\ell,\theta}$ lies strictly between $0$ and $1$, which supports our numerical result (Fig. 2b). In any situation, small degrees of freedom give a small $\zeta_{\ell,\theta}$, leading to a sharper bound.

Here, we derive the generalization error bound of the compressed network with respect to the population risk. We will see that a bias–variance tradeoff induced by network compression appears. As usual, we train a network through the training error $\hat\Psi(f) := \frac{1}{n}\sum_{i=1}^n\psi(y_i, f(x_i))$, where $\psi:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a loss function. Correspondingly, the expected error is denoted by $\Psi(f) := \mathrm{E}[\psi(Y, f(X))]$, where the expectation is taken with respect to $(X, Y)\sim P$. Our aim here is to bound the generalization error $\Psi(f^\sharp)$ of the compressed network. Let the marginal distribution of $X$ be $P_X$ and that of $Y$ be $P_Y$. First, we assume Lipschitz continuity of the loss function $\psi$.

Assumption 2.
The loss function $\psi$ is $\rho$-Lipschitz continuous: $|\psi(y, f) - \psi(y, f')| \le \rho|f - f'|$ ($\forall y\in\mathrm{supp}(P_Y)$, $\forall f, f'\in\mathbb{R}$). The support of $P_X$ is bounded: $\|x\| \le D_x$ ($\forall x\in\mathrm{supp}(P_X)$).

For a technical reason, we assume the following condition on the spectral pruning algorithm.
Assumption 3.
We assume that $0\le\theta\le 1$ is appropriately chosen so that $\zeta_{\ell,\theta}$ in Theorem 1 satisfies $\zeta_{\ell,\theta}\le 1$ almost surely, and that spectral pruning is solved under the condition $\sum_{j\in J}(\tilde\tau^{(\ell)}_j)^{-1} \le m_\ell m^\sharp_\ell$ on the index set $J$.

As for the choice of $\theta$, this assumption is always satisfied at least by the backward procedure. The linear constraint on $J$ is merely to ensure that the leverage scores of the chosen indices are balanced. Note that the bounds in Theorem 1 can be achieved even with this condition. If the $L_\infty$-norm of the networks is loosely evaluated, the generalization error bound of deep learning can be unrealistically large because the $L_\infty$-norm appears in its evaluation. However, we may consider a truncated estimator $[[\hat f(x)]] := \max\{-M, \min\{M, \hat f(x)\}\}$ for a sufficiently large $0 < M \le \infty$ to moderate the $L_\infty$-norm (if $M = \infty$, this does not affect anything). Note that the truncation procedure does not affect the classification error for a classification task. To bound the generalization error, for $(m^\sharp_2,\dots,m^\sharp_L)$ and $(\lambda_2,\dots,\lambda_L)$ satisfying relation (3), we define $\delta_1$ and $\delta_2$ as
$$\delta_1 = \sum_{\ell=2}^{L}\Big(\bar R^{L-\ell+1}\sqrt{\textstyle\prod_{\ell'=\ell}^{L}\zeta_{\ell',\theta}}\Big)\sqrt{\lambda_\ell}, \qquad \delta_2^2 = \frac{1}{n}\sum_{\ell=1}^{L} m^\sharp_\ell m^\sharp_{\ell+1}\log_+\!\Big(\frac{\hat G\max\{\bar R, \bar R_b\}}{\hat R_\infty}\Big),$$
where $\log_+(x) = \max\{1, \log(x)\}$, $\hat R_\infty := \min\{\bar R^L D_x + \sum_{\ell=1}^{L}\bar R^{L-\ell}\bar R_b,\ M\}$, and $\hat G := L\bar R^{L-1}D_x + \sum_{\ell=1}^{L}\bar R^{L-\ell}$, for $\bar R = \sqrt{\hat c}\,R$ and $\bar R_b = \sqrt{\hat c}\,R_b$ with the constant $\hat c$ introduced in Theorem 1. Let $R_{n,t} := \frac{1}{n}\big(t + \sum_{\ell=2}^{L}\log(m_\ell)\big)$ for $t > 0$. Then, we obtain the following generalization error bound for the compressed network $f^\sharp$.

Theorem 2 (Generalization error bound of the compressed network). Suppose that Assumptions 1, 2, and 3 are satisfied. Then, the spectral pruning method presented in Theorem 1 satisfies the following generalization error bound: there exists a universal constant $C > 0$ such that, for any $t > 0$,
$$\Psi([[f^\sharp]]) \le \hat\Psi([[\hat f]]) + \rho\Big\{\delta_1 + C\hat R_\infty\big(\delta_2 + \delta_2^2 + \sqrt{R_{n,t}}\big)\Big\} \lesssim \hat\Psi([[\hat f]]) + \sum_{\ell=2}^{L}\sqrt{\lambda_\ell} + \sqrt{\frac{\sum_{\ell=1}^{L}m^\sharp_{\ell+1}m^\sharp_\ell}{n}\log_+(\hat G)},$$
uniformly over all choices of $m^\sharp = (m^\sharp_2,\dots,m^\sharp_L)$, with probability $1 - e^{-t}$.

The proof is given in Appendix C. From this theorem, the generalization error of $f^\sharp$ is upper-bounded by the training error of the original network $\hat f$ (which is usually small) plus an additional term. By Theorem 1, $\delta_1$ represents the approximation error between $\hat f$ and $f^\sharp$; hence, it can be regarded as a bias. The second term $\delta_2$ is the variance term induced by the sample deviation. It is noteworthy that the variance term $\delta_2$ depends only on the size of the compressed network rather than the original network size. On the other hand, a naive application of the theorem implies $\Psi([[\hat f]]) - \hat\Psi([[\hat f]]) \le \tilde O\big(\sqrt{\frac{1}{n}\sum_{\ell=1}^{L}m_{\ell+1}m_\ell}\big)$ for the original network $\hat f$, which coincides with the VC-dimension based bound [Bartlett et al., 2017] but is much larger than $\delta_2$ when $m^\sharp_\ell \ll m_\ell$.
Therefore, the variance is significantly reduced by model compression, resulting in a much improved generalization error. Note that the relation between $\delta_1$ and $\delta_2$ is a tradeoff due to the monotonicity of the degrees of freedom. When $m^\sharp_\ell$ is large, the bias $\delta_1$ becomes small owing to the monotonicity of the degrees of freedom, but the variance $\delta_2(m^\sharp)$ will be large. Hence, we need to tune the sizes $(m^\sharp_\ell)_{\ell=1}^L$ to obtain the best generalization error by balancing the bias ($\delta_1$) and the variance ($\delta_2$).

The generalization error bound is uniformly valid over the choice of $m^\sharp$ (to ensure this, the term $R_{n,t}$ appears). Thus, $m^\sharp$ can be arbitrary and chosen in a data-dependent manner. This means that the bound is a posteriori, and the best choice of $m^\sharp$ can depend on the trained network.
A seminal work [Arora et al., 2018] showed a generalization error bound based on how well the network can be compressed. Although the theoretical insights provided by their analysis are quite instructive, the theory does not give a practical compression method; a random projection is proposed in the analysis, but it is not intended for practical use. The main difference is that their analysis exploits the near low rankness of the weight matrix $W^{(\ell)}$, while ours exploits the near low rankness of the covariance matrix $\hat\Sigma^{(\ell)}$. The two are not directly comparable; thus, we numerically compare the intrinsic dimensionality of both with a VGG-19 network trained on CIFAR-10. Table 1 summarizes the comparison. For our analysis, we used $\hat N_\ell(\lambda_\ell)\hat N_{\ell+1}(\lambda_{\ell+1})k^2$ as the intrinsic dimensionality of the $\ell$-th layer, where $k$ is the kernel size. This is the number of parameters in the $\ell$-th layer for the width $m^\sharp_\ell \simeq \hat N_\ell(\lambda_\ell)$, where $\lambda_\ell$ was set to a sufficiently small constant multiple of $\mathrm{Tr}[\hat\Sigma^{(\ell)}]$. (We omitted quantities related to the depth $L$ and log terms, but the intrinsic dimensionality of [Arora et al., 2018] also omits these factors.) We can see that the quantity based on our degrees of freedom gives significantly smaller values in almost all layers.

Layer | Original  | [Arora et al., 2018] | Spec Prun
1     | 1,728     | 1,645                | 1,013
4     | 147,456   | 644,654              | 84,499
6     | 589,824   | 3,457,882            | 270,216
9     | 1,179,648 | 36,920               | 50,768
12    | 2,359,296 | 22,735               | 4,583
15    | 2,359,296 | 26,584               | 3,886
Table 1: Comparison of the intrinsic dimensionality based on our degrees of freedom and the existing one, computed for a VGG-19 network trained on CIFAR-10.

The PAC-Bayes bound [Dziugaite and Roy, 2017; Zhou et al., 2019] is also a promising approach for obtaining a non-vacuous generalization error bound for a compressed network. However, these studies "assume" the existence of an effective compression method and do not provide any specific algorithm. [Suzuki, 2018; Suzuki et al., 2020] also pointed out the importance of the degrees of freedom for analyzing the generalization error of deep learning but did not give a practical algorithm.

In this section, we conduct numerical experiments to show the validity of our theory and the effectiveness of the proposed method.
We show how the rate of decrease of the eigenvalues affects the compression accuracy, to justify our theoretical analysis. We constructed a network (namely, NN3) consisting of three hidden fully connected layers with widths (300, ...), following the settings in [Aghasi et al., 2017], and trained it with 60,000 images from MNIST and 50,000 images from CIFAR-10. Figure 1a shows the magnitudes of the eigenvalues of the 3rd hidden layer of the networks trained on each dataset (plotted on a semilog scale). The eigenvalues are sorted in decreasing order and normalized by division by the maximum eigenvalue. We see that the eigenvalues for MNIST decrease much more rapidly than those for CIFAR-10. This indicates that MNIST is "easier" than CIFAR-10, because the degrees of freedom (an intrinsic dimensionality) of the network trained on MNIST are relatively smaller than those of the network trained on CIFAR-10.

[Figure 1: Eigenvalue distribution and compression ability of a fully connected network on MNIST and CIFAR-10. (a) Eigenvalue distributions in each layer for MNIST and CIFAR-10. (b) Approximation error with its s.d. versus the width $m^\sharp$.]

Figure 1b presents the (relative) compression error $\|\hat f - f^\sharp\|_n/\|\hat f\|_n$ versus the width $m^\sharp$ of the compressed network, where we compressed only the 3rd layer, $\lambda$ was fixed to a small constant multiple of $\mathrm{Tr}[\hat\Sigma^{(\ell)}]$, and $\theta$ was fixed. It shows a much more rapid decrease of the compression error for MNIST than for CIFAR-10 (about 100 times smaller). This is because MNIST has faster eigenvalue decay than CIFAR-10.

[Figure 2: Relation between accuracy and the hyperparameters $\lambda_\ell$ and $\theta$ on CIFAR-10, for compression ratios from 0.4 to 1.0. (a) Accuracy versus $\lambda_\ell$ for each compression rate. (b) Accuracy versus $\theta$ for each compression rate. The best $\lambda$ and $\theta$ are indicated by the star symbol.]

Figure 2a shows the relation between the test classification accuracy and $\lambda_\ell$. It is plotted for a VGG-13 network trained on CIFAR-10. We chose the width $m^\sharp_\ell$ that gave the best accuracy for each $\lambda_\ell$ under the constraint of the compression rate (relative number of parameters). We see that as the compression rate increases, the best $\lambda_\ell$ goes down. Our theorem tells us that $\lambda_\ell$ is related to the compression error through (3); that is, as the width goes up, $\lambda_\ell$ must go down. This experiment supports the theoretical evaluation. Figure 2b shows the relation between the test classification accuracy and the hyperparameter $\theta$. We can see that the best accuracy is achieved at an intermediate value of $\theta$ for all compression rates, which indicates the superiority of the "combination" of the input and output information losses and supports our theoretical bound. For low compression rates, the choice of $\lambda_\ell$ and $\theta$ does not affect the result much, which indicates robustness to the hyperparameter choice.

We applied our method to the ImageNet (ILSVRC2012) dataset [Deng et al., 2009].
We compared our method using the ResNet-50 network [He et al., 2016] (experiments for the VGG-16 network [Simonyan and Zisserman, 2014] are also shown in Appendix D.1). Our method was compared with the following pruning methods: ThiNet [Luo et al., 2017], NISP [Yu et al., 2018], and sparse regularization [He et al., 2017] (which we call Sparse-reg). As the initial ResNet network, we used two types of networks: ResNet-50-1 and ResNet-50-2. For training ResNet-50-1, we followed the experimental settings in [Luo et al., 2017] and [Yu et al., 2018]. During training, images were resized to 256 x 256 as in [Luo et al., 2017], and a 224 x 224 random crop was fed into the network. In the inference stage, we center-cropped the resized images to 224 x 224. For training ResNet-50-2, we followed [He et al., 2017]. In particular, images were resized such that the shorter side was 256, and a center crop of 224 x 224 pixels was used for testing. The augmentation for fine tuning was a 224 x 224 random crop and its mirror.

We compared against ThiNet and NISP for ResNet-50-1 (we call our model for this situation "Spec-ResA") and against Sparse-reg for ResNet-50-2 (we call our model for this situation "Spec-ResB") for fair comparison.
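The exact data pipelines used in the paper are not listed; the following torchvision sketch is only our illustration of the two preprocessing settings described above (normalization and dataset wiring omitted).

```python
from torchvision import transforms

# Training-time preprocessing in the ResNet-50-1 setting: resize to 256x256,
# take a 224x224 random crop, with horizontal mirroring used for fine tuning.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Inference-time preprocessing in the ResNet-50-2 setting: resize so the shorter
# side is 256, then take a 224x224 center crop.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```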
The size of the compressed network $f^\sharp$ was determined to be as close to the compared network as possible (except that, for ResNet-50-2, we did not adopt the "channel sampler" proposed by [He et al., 2017] in the first layer of the residual block; hence, our model became slightly larger). The accuracies of the compared methods are borrowed from the scores presented in each paper, and thus the base models differ because each original paper reported results for a different model. We employed the simultaneous procedure for compression. After pruning, we carried out fine tuning over 10 epochs, decreasing the learning rate in three stages: one value for the first four epochs, a smaller one for the next four epochs, and the smallest for the last two epochs. We employed $\lambda_\ell$ equal to a small constant multiple of $\mathrm{Tr}[\hat\Sigma^{(\ell)}]$ and an intermediate value of $\theta$.

Model            | Top-1  | Top-5
Spec-ResA        | 72.99% | 91.56%
Spec-ResB wo/ ft | 66.12% | -
Spec-ResB w/ ft  | 74.04% | -
Table 2: Performance comparison of our method and existing ones for ResNet-50 on ImageNet. "ft" indicates fine tuning after compression.

Table 2 summarizes the performance comparison for ResNet-50. We can see that, in both settings, our method outperforms the others in accuracy. This is an interesting result because ResNet-50 is already compact [Luo et al., 2017], and thus there is little room to produce better performance. Moreover, we remark that all layers were compressed simultaneously in our method, while the other methods were trained one layer after another. Since our method did not adopt the channel sampler proposed by [He et al., 2017], our model was a bit larger; however, we could obtain better performance by combining it with our method.

In this paper, we proposed a simple pruning algorithm for compressing a network and gave its approximation and generalization error bounds using the degrees of freedom. Unlike existing compression-based generalization error analyses, our analysis is compatible with a practically useful method and further gives a tighter intrinsic dimensionality bound. The proposed algorithm is easily implemented and only requires linear algebraic operations. The numerical experiments showed that the compression ability is related to the eigenvalue distribution and that our algorithm has favorable performance compared to existing methods.
Acknowledgements
TS was partially supported by MEXT Kakenhi (18K19793, 18H03201, and 20H00576) and JST-CREST, Japan.
References

[Aghasi et al., 2017] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg. Net-Trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems 30, pages 3180-3189, 2017.
[Arora et al., 2018] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the International Conference on Machine Learning, volume 80, pages 254-263. PMLR, 2018.
[Bach, 2017] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1-38, 2017.
[Bartlett et al., 2017] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.
[Caponnetto and de Vito, 2007] A. Caponnetto and E. de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
[Deng et al., 2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009, pages 248-255, 2009.
[Denil et al., 2013] M. Denil, B. Shakibi, L. Dinh, and N. De Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148-2156, 2013.
[Denton et al., 2014] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27, pages 1269-1277, 2014.
[Dziugaite and Roy, 2017] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.
[Giné and Koltchinskii, 2006] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143-1216, 2006.
[Giné and Nickl, 2015] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2015.
[Gunasekar et al., 2018] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9482-9491, 2018.
[Han et al., 2015] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[Hardt et al., 2016] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1225-1234. PMLR, 2016.
[He et al., 2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[He et al., 2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1389-1397, 2017.
[Hu et al., 2016] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[Iandola et al., 2016] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[Krause and Golovin, 2014] A. Krause and D. Golovin. Submodular function maximization, 2014.
[Krogh and Hertz, 1992] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950-957, 1992.
[Lebedev and Lempitsky, 2016] V. Lebedev and V. Lempitsky. Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554-2564, 2016.
[Ledoux and Talagrand, 1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991. MR1102015.
[Lin et al., 2013] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[Luo et al., 2017] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In International Conference on Computer Vision, 2017.
[Maas et al., 2013] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[Mallows, 1973] C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661-675, 1973.
[Mendelson, 2002] S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48:1977-1991, 2002.
[Simonyan and Zisserman, 2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Suzuki et al., 2020] T. Suzuki, H. Abe, and T. Nishimura. Compression based bound for non-compressed network: Unified generalization error analysis of large compressible deep neural network. In International Conference on Learning Representations, 2020.
[Suzuki, 2018] T. Suzuki. Fast generalization error bound of deep learning from a kernel perspective. In International Conference on Artificial Intelligence and Statistics, pages 1397-1406, 2018.
[Wager et al., 2013] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351-359, 2013.
[Wan et al., 2013] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 1058-1066. PMLR, 2013.
[Wen et al., 2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, pages 2074-2082, 2016.
[Yu et al., 2018] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194-9203, 2018.
[Zhou et al., 2016] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2921-2929. IEEE, 2016.
[Zhou et al., 2019] W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz. Non-vacuous generalization bounds at the ImageNet scale: A PAC-Bayesian compression approach. In International Conference on Learning Representations, 2019.
Appendix
A Extension to convolutional neural network
An extension of our method to convolutional layers is a bit tricky. There are several options, but to perform channel-wise pruning, we used the following "covariance matrix" between channels in the experiments. Suppose that a channel $k$ receives the input $\phi_{k;u,v}(x)$, where $1\le u\le I_w$ and $1\le v\le I_h$ indicate the spatial location; then the "covariance" between channels $k$ and $k'$ can be formulated as
$$\hat\Sigma_{k,k'} = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{I_wI_h}\sum_{u,v}\phi_{k;u,v}(x_i)\phi_{k';u,v}(x_i)\Big).$$
As for the covariance between an output channel $k'$ and an input channel $k$ (which corresponds to the $(k',k)$-th element of $Z^{(\ell)}\hat\Sigma_{F,J} = \mathrm{Cov}(Z^{(\ell)}\phi(X), \phi_J(X))$ in the fully connected situation), it can be calculated as
$$\hat\Sigma_{k',k} = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{I_wI_h}\sum_{u,v}\frac{1}{I'_{(u,v)}}\sum_{u',v':\,(u,v)\in\mathrm{Res}(u',v')}(Z^{(\ell)}\phi(x_i))_{k';u',v'}\,\phi_{k;u,v}(x_i)\Big),$$
where $\mathrm{Res}(u',v')$ is the receptive field of the location $(u',v')$ in the output channel $k'$, and $I'_{(u,v)}$ is the number of locations $(u',v')$ that contain $(u,v)$ in their receptive fields.
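A minimal NumPy sketch of the first, channel-wise covariance above; the array layout (n, C, H, W) and the function name are our own assumptions.

```python
import numpy as np

def channel_covariance(phi):
    """Channel-wise "covariance" for a convolutional layer.

    phi: (n, C, H, W) activations. Returns the (C, C) matrix
    Sigma_{k,k'} = (1/n) sum_i (1/(H*W)) sum_{u,v} phi[i,k,u,v] * phi[i,k',u,v].
    """
    n, C, H, W = phi.shape
    flat = phi.reshape(n, C, H * W)
    # sum the per-location outer products over samples and spatial positions, then average
    return np.einsum('ncs,nds->cd', flat, flat) / (n * H * W)
```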
B Proof of Theorem 1

The output of the $\ell$-th internal layer of $\hat f$ (before activation) is denoted by $\hat F_\ell(x) = (\hat W^{(\ell)}\eta(\cdot) + \hat b^{(\ell)})\circ\cdots\circ(\hat W^{(1)}x + \hat b^{(1)})$. We denote the set of row vectors of $Z^{(\ell)}$ by $\mathcal{Z}_\ell$, i.e., $\mathcal{Z}_\ell = \{Z^{(\ell)\top}_{1,:},\dots,Z^{(\ell)\top}_{m,:}\}$. Conversely, we may define $Z^{(\ell)}$ by specifying $\mathcal{Z}_\ell$. Here, we restate Theorem 1 in a complete form that covers both the backward procedure and the simultaneous procedure.

Theorem 3 (Restated). Assume that the regularization parameter $\tau$ in the pruning procedure (2) is defined by the leverage score, $\tau \leftarrow \tau^{(\ell)} := m^\sharp_\ell\lambda_\ell\tilde\tau^{(\ell)}$.

(i) Backward procedure: Let $\mathcal{Z}_\ell$ for the output information loss be
$$\mathcal{Z}_\ell = \Big\{\tfrac{\sqrt{m_\ell q^{(\ell)}_j}}{\max_{j'}\|\hat W^{(\ell)}_{j',:}\|}\hat W^{(\ell)}_{j,:}\ \Big|\ j\in J^\sharp_{\ell+1}\Big\},$$
where $q^{(\ell)}_j := (\tilde\tau^{(\ell+1)}_j)^{-1}/\sum_{j'\in J^\sharp_{\ell+1}}(\tilde\tau^{(\ell+1)}_{j'})^{-1}$ ($j\in J^\sharp_{\ell+1}$) and $q^{(\ell)}_j = 0$ (otherwise), and define
$$\zeta_{\ell,\theta} = N^\theta_\ell(\lambda_\ell)\Big(\theta\,\frac{\max_{j\in[m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2}{\|(\hat W^{(\ell)})^\top I_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}} + (1-\theta)\,m_\ell\Big)^{-1}.$$
Then, if we solve the optimization problem (2) with the additional constraint $\sum_{j\in J}(\tilde\tau^{(\ell)}_j)^{-1}\le m_\ell m^\sharp_\ell$ on the index set $J$, the optimization problem is feasible, and the overall approximation error of $f^\sharp$ is bounded by
$$\|\hat f - f^\sharp\|_n \le \sum_{\ell=2}^{L}\Big(\bar R^{L-\ell+1}\sqrt{\textstyle\prod_{\ell'=\ell}^{L}\zeta_{\ell',\theta}}\Big)\sqrt{\lambda_\ell} \qquad (5)$$
for $\bar R = \sqrt{\hat c}\,R$, where $\hat c$ is a universal constant.

(ii) Simultaneous procedure: Suppose that there exists $c_{\mathrm{scale}} > 0$ such that
$$\|\hat W^{(\ell)}_{j,:}\|^2 \le c_{\mathrm{scale}}\,R^2\,\tilde\tau^{(\ell+1)}_j, \qquad (6)$$
and we employ $\mathcal{Z}_\ell = \{\hat W^{(\ell)}_{j,:}/\|\hat W^{(\ell)}_{j,:}\|\ |\ j\in[m_{\ell+1}]\}$ for the output aware objective. Then, we have the same bound as (5) for $q^{(\ell)}_j = (\tilde\tau^{(\ell+1)}_j)^{-1}/\sum_{j'\in[m_{\ell+1}]}(\tilde\tau^{(\ell+1)}_{j'})^{-1}$ ($\forall j\in[m_{\ell+1}]$) and
$$\zeta_{\ell,\theta} = c_{\mathrm{scale}}\,N^\theta_\ell(\lambda_\ell)\Big(\theta\,m^\sharp_{\ell+1}\frac{\max_j q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2}{\|(\hat W^{(\ell)})^\top I_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}} + (1-\theta)\,m^\sharp_{\ell+1}\Big)^{-1}. \qquad (7)$$

The assumption (6) is rather strong, but it is always satisfied with $c_{\mathrm{scale}} = 1$ when $\lambda_\ell = 0$ and with $c_{\mathrm{scale}} = \mathrm{Tr}[\hat\Sigma^{(\ell+1)}]/(m_{\ell+1}\min_j\hat\Sigma^{(\ell+1)}_{(j,j)})$ when $\lambda_\ell = \infty$. Thus, it is satisfied if the variances of the nodes in the $(\ell+1)$-th layer are balanced, which is ensured if we apply batch normalization.
To derive the approximation error bound, we utilize the following proposition, which was essentially proven by [Bach, 2017]. The proposition states the connection between the degrees of freedom and the compression error; that is, it characterizes the width $m^\sharp_\ell$ sufficient to obtain a prespecified compression error $\lambda_\ell$. We will see that the eigenvalues essentially control this relation through the degrees of freedom.

Proposition 1. There exists a probability measure $q_\ell$ on $\{1,\dots,m_\ell\}$ such that, for any $\delta\in(0,1)$ and $\lambda > 0$, an i.i.d. sample $v_1,\dots,v_m\in\{1,\dots,m_\ell\}$ from $q_\ell$ satisfies, with probability $1-\delta$,
$$\inf_{\beta\in\mathbb{R}^m}\Big\|\alpha^\top\eta(\hat F_{\ell-1}(\cdot)) - \sum_{j=1}^{m}\beta_j\,q_\ell(v_j)^{-1/2}\,\eta(\hat F_{\ell-1}(\cdot))_{v_j}\Big\|_n^2 + m\lambda\|\beta\|^2 \le 4\lambda\,\alpha^\top\hat\Sigma_\ell(\hat\Sigma_\ell + \lambda I)^{-1}\alpha$$
for every $\alpha\in\mathbb{R}^{m_\ell}$, provided $m \ge \hat N_\ell(\lambda)\log(16\hat N_\ell(\lambda)/\delta)$. Moreover, the optimal solution $\hat\beta$ satisfies $\|\hat\beta\|^2 \le 4\|\alpha\|^2/m$.

Proof. This is basically a direct consequence of Proposition 1 in [Bach, 2017] and the discussion therein. The original statement does not include the regularization term $m\lambda\|\beta\|^2$ on the left hand side or the factor $\alpha^\top\hat\Sigma_\ell(\hat\Sigma_\ell+\lambda I)^{-1}\alpha$ on the right hand side. However, by carefully following the proof, the bound including these additional factors is indeed proven. The norm bound on $\hat\beta$ is guaranteed by the relation $m\lambda\|\hat\beta\|^2 \le 4\lambda\,\alpha^\top\hat\Sigma_\ell(\hat\Sigma_\ell+\lambda I)^{-1}\alpha \le 4\lambda\|\alpha\|^2$.

Proposition 1 implies the following lemma by the scale invariance of $\eta$, i.e., $\eta(ax) = a\eta(x)$ ($a > 0$).
Suppose that
$$\tau'_j = \frac{1}{\hat N_\ell(\lambda)}\sum_{l=1}^{m_\ell}U_{j,l}^2\frac{\hat\mu^{(\ell)}_l}{\hat\mu^{(\ell)}_l + \lambda} = \frac{1}{\hat N_\ell(\lambda)}\big[\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda I)^{-1}\big]_{j,j} \quad (j\in\{1,\dots,m_\ell\}), \qquad (8)$$
where $U = (U_{j,l})_{j,l}$ is the orthogonal matrix that diagonalizes $\hat\Sigma^{(\ell)}$, that is, $\hat\Sigma^{(\ell)} = U\,\mathrm{diag}(\hat\mu^{(\ell)}_1,\dots,\hat\mu^{(\ell)}_{m_\ell})\,U^\top$. For $\lambda > 0$ and any $0 < \delta < 1/2$, if $m \ge \hat N_\ell(\lambda)\log(16\hat N_\ell(\lambda)/\delta)$, then there exist $v_1,\dots,v_m\in\{1,\dots,m_\ell\}$ such that, for every $\alpha\in\mathbb{R}^{m_\ell}$,
$$\inf_{\beta\in\mathbb{R}^m}\Big\|\alpha^\top\eta(\hat F_{\ell-1}(\cdot)) - \sum_{j=1}^{m}\beta_j\,(\tau'_{v_j})^{-1/2}\,\eta(\hat F_{\ell-1}(\cdot))_{v_j}\Big\|_n^2 + m\lambda\|\beta\|^2 \le 4\lambda\,\alpha^\top\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda I)^{-1}\alpha, \qquad (9)$$
and
$$\sum_{j=1}^{m}(\tau'_{v_j})^{-1} \le (1-\delta)^{-1}\,m\times m_\ell.$$
Suppose that the measure $Q_\ell$ is the counting measure, $Q_\ell(J) = |J|$ for $J\subset\{1,\dots,m_\ell\}$, and $q_\ell$ is a density given by $q_\ell(j) = \tau'_j$ ($j\in\{1,\dots,m_\ell\}$) with respect to the base measure $Q_\ell$. Suppose that $v_1,\dots,v_m\in\{1,\dots,m_\ell\}$ is an i.i.d. sequence distributed from $q_\ell\,\mathrm{d}Q_\ell$; then [Bach, 2017] showed that this sequence satisfies the assertion given in Proposition 1. Notice that
$$\mathrm{E}_v\Big[\tfrac{1}{m}\sum_{j=1}^{m}q_\ell(v_j)^{-1}\Big] = \mathrm{E}_{v_1}[q_\ell(v_1)^{-1}] = \int_{[m_\ell]}q_\ell(v)^{-1}q_\ell(v)\,\mathrm{d}Q_\ell(v) = \int_{[m_\ell]}\mathrm{d}Q_\ell(v) = m_\ell;$$
thus, an i.i.d. sequence $\{v_1,\dots,v_m\}$ satisfies $\frac{1}{m}\sum_{j=1}^{m}q_\ell(v_j)^{-1} \le m_\ell/(1-\delta)$ with probability at least $\delta$ by Markov's inequality. Combining this with Proposition 1, an i.i.d. sequence $\{v_1,\dots,v_m\}$ together with $\tau'_{v_j} = q_\ell(v_j)$ satisfies both conditions in the statement with positive probability. This ensures the existence of sequences $\{v_j\}_{j=1}^m$ and $\{\tau'_{v_j}\}_{j=1}^m$ satisfying the assertion.
Lemma 1 with $\delta = 1/5$ states that if $m^\sharp_\ell \ge \hat N_\ell(\lambda_\ell)\log(80\hat N_\ell(\lambda_\ell))$, then there exists $J\in[m_\ell]^{m^\sharp_\ell}$ such that
$$\inf_{\alpha\in\mathbb{R}^{|J|}}\|z^\top\phi - \alpha^\top\phi_J\|_n^2 + \lambda_\ell|J|\,\|\alpha\|^2_{\tau'} \le 4\lambda_\ell\,z^\top\hat\Sigma_\ell(\hat\Sigma_\ell + \lambda_\ell I)^{-1}z \quad (\forall z\in\mathbb{R}^{m_\ell}), \quad\text{and}\quad \sum_{j\in J}(\tilde\tau^{(\ell)}_j)^{-1} \le m_\ell\times m^\sharp_\ell \qquad (10)$$
is satisfied (note that $\tau'$ given in Eq. (8) is equivalent to $\tilde\tau^{(\ell)}$).

Evaluation of $L^{(\mathrm{A})}_\tau(J)$: By setting $z = e_j$ ($j = 1,\dots,m_\ell$), where $e_j$ is the indicator vector with $1$ in its $j$-th component and $0$ elsewhere, and summing over $j = 1,\dots,m_\ell$, it holds that
$$L^{(\mathrm{A})}_\tau(J) = \inf_{A\in\mathbb{R}^{m_\ell\times|J|}}\|\phi - A\phi_J\|_n^2 + \lambda_\ell|J|\,\|A\|^2_{\tau'} \le 4\lambda_\ell\hat N_\ell(\lambda_\ell)$$
for the same $J$ as above. Here, the optimal $A$, denoted by $\hat A_J$, is given by $\hat A_J = \hat\Sigma_{F,J}(\hat\Sigma_{J,J} + I_\tau)^{-1}$.

Evaluation of $L^{(\mathrm{B})}_\tau(J)$: By letting $z\in\mathcal{Z}_\ell$ and summing up, we also have
$$L^{(\mathrm{B})}_\tau(J) = \inf_{B\in\mathbb{R}^{m^\sharp_{\ell+1}\times|J|}}\|Z^{(\ell)}\phi - B\phi_J\|_n^2 + \lambda_\ell|J|\,\|B\|^2_{\tau'} \le 4\lambda_\ell\,\mathrm{Tr}[Z^{(\ell)}\hat\Sigma_\ell(\hat\Sigma_\ell + \lambda_\ell I)^{-1}Z^{(\ell)\top}]$$
for the same $J$ as above. Recall again that the optimal $B$, denoted by $\hat B_J$, is given by $\hat B_J = Z^{(\ell)}\hat\Sigma_{F,J}(\hat\Sigma_{J,J} + I_\tau)^{-1} = Z^{(\ell)}\hat A_J$.

Combining the bounds for $L^{(\mathrm{A})}_\tau(J)$ and $L^{(\mathrm{B})}_\tau(J)$: By combining the above evaluations, we have that
$$\theta L^{(\mathrm{A})}_\tau(J^\sharp_\ell) + (1-\theta)L^{(\mathrm{B})}_\tau(J^\sharp_\ell) \le 4\lambda_\ell\big\{\theta\hat N_\ell(\lambda_\ell) + (1-\theta)\mathrm{Tr}[Z^{(\ell)}\hat\Sigma^{(\ell)}(\hat\Sigma^{(\ell)} + \lambda_\ell I)^{-1}Z^{(\ell)\top}]\big\},$$
where $J^\sharp_\ell$ is the minimizer of $\theta L^{(\mathrm{A})}_\tau(J) + (1-\theta)L^{(\mathrm{B})}_\tau(J)$ with respect to $J$. From now on, we let $\tau = \lambda_\ell m^\sharp_\ell\tau'$ ($= \lambda_\ell|J^\sharp_\ell|\tau'$) as defined in the main text.

(i) Backward procedure. From now on, we give the bound corresponding to the backward procedure. The proof consists of three parts: (i) evaluation of the compression error in each layer, (ii) evaluation of the norms of the weight matrices of the compressed network, and (iii) evaluation of the overall compression error across all layers. In (i), we use Lemma 1 to evaluate the compression error based on the eigenvalue distribution of the covariance matrix. In (ii), we again use Lemma 1 to bound the norm of the compressed network. This is important for evaluating the overall compression error because the norm controls how the compression error in each layer propagates to the final output.
In (iii), we combine the results in (i) and (ii) to obtain the overall compression error.

First, note that, for the choice $\mathcal{Z}_\ell = \big\{\tfrac{\sqrt{m_\ell q^{(\ell)}_j}}{\max_{j'}\|\hat W^{(\ell)}_{j',:}\|}\hat W^{(\ell)}_{j,:}\ \big|\ j\in J^\sharp_{\ell+1}\big\}$, it holds that
$$L^{(\mathrm{B})}_\tau(J^\sharp_\ell) \le 4\lambda_\ell\sum_{z\in\mathcal{Z}_\ell}z^\top\hat\Sigma_\ell(\hat\Sigma_\ell + \lambda_\ell I)^{-1}z \le 4\lambda_\ell\sum_{z\in\mathcal{Z}_\ell}\|z\|^2 \le 4\lambda_\ell m_\ell\sum_{j\in J^\sharp_{\ell+1}}q^{(\ell)}_j = 4\lambda_\ell m_\ell.$$

Compression error bound:
Here, we give the compression error bound of the backward-procedure. For the optimal $J^{\sharp}_{\ell}$, we have
\[
\inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}} \|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}}\|_n^2 + \|\alpha\|_{\tau}^2
= \hat W^{(\ell)}_{j,:}\bigl[\widehat\Sigma_{F,F} - \widehat\Sigma_{F,J^{\sharp}}(\widehat\Sigma_{J^{\sharp},J^{\sharp}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp},F}\bigr](\hat W^{(\ell)}_{j,:})^{\top}
= \mathrm{Tr}\Bigl\{\bigl[\widehat\Sigma_{F,F} - \widehat\Sigma_{F,J^{\sharp}}(\widehat\Sigma_{J^{\sharp},J^{\sharp}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp},F}\bigr](\hat W^{(\ell)}_{j,:})^{\top}\hat W^{(\ell)}_{j,:}\Bigr\},
\]
and the optimal $\alpha$ on the left hand side is given by $\hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}_{\ell}}$. Hence, it holds that
\begin{align*}
\sum_{j \in J^{\sharp}_{\ell+1}} \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\bigl\|(q^{(\ell)}_j)^{1/2}\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\bigr\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}
&= \sum_{j \in J^{\sharp}_{\ell+1}} q^{(\ell)}_j \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}\\
&= \sum_{j \in J^{\sharp}_{\ell+1}} q^{(\ell)}_j\,\mathrm{Tr}\Bigl\{\bigl[\widehat\Sigma_{F,F}-\widehat\Sigma_{F,J^{\sharp}_{\ell}}(\widehat\Sigma_{J^{\sharp}_{\ell},J^{\sharp}_{\ell}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp}_{\ell},F}\bigr](\hat W^{(\ell)}_{j,:})^{\top}\hat W^{(\ell)}_{j,:}\Bigr\}\\
&= \mathrm{Tr}\Bigl\{\bigl[\widehat\Sigma_{F,F}-\widehat\Sigma_{F,J^{\sharp}_{\ell}}(\widehat\Sigma_{J^{\sharp}_{\ell},J^{\sharp}_{\ell}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp}_{\ell},F}\bigr](\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\Bigr\}\\
&\le \mathrm{Tr}\bigl[\widehat\Sigma_{F,F}-\widehat\Sigma_{F,J^{\sharp}_{\ell}}(\widehat\Sigma_{J^{\sharp}_{\ell},J^{\sharp}_{\ell}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp}_{\ell},F}\bigr]\,\bigl\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\bigr\|_{\mathrm{op}}
= L^{(A)}_{\tau}(J^{\sharp}_{\ell})\,\bigl\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\bigr\|_{\mathrm{op}}\\
&= L^{(A)}_{\tau}(J^{\sharp}_{\ell})\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2}\,\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2
\;\le\; L^{(A)}_{\tau}(J^{\sharp}_{\ell})\,\frac{m_{\ell}\,\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2}\,\frac{R^2}{m_{\ell+1}m_{\ell}},
\end{align*}
where we used the assumption $\max_{j}\|\hat W^{(\ell)}_{j,:}\| \le R/\sqrt{m_{\ell+1}}$. In the same manner, we also have that
\begin{align*}
\sum_{j \in J^{\sharp}_{\ell+1}} q^{(\ell)}_j \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}
&= \frac{\max_{j'}\|\hat W^{(\ell)}_{j',:}\|^2}{m_{\ell}}\sum_{j \in J^{\sharp}_{\ell+1}}\inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\left\{\left\|\frac{\sqrt{m_{\ell}\,q^{(\ell)}_j}}{\max_{j'}\|\hat W^{(\ell)}_{j',:}\|}\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\right\|_n^2 + \|\alpha\|_{\tau}^2\right\}\\
&\le L^{(B)}_{\tau}(J^{\sharp}_{\ell})\,\frac{\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2}{m_{\ell}}
\;\le\; L^{(B)}_{\tau}(J^{\sharp}_{\ell})\,\frac{R^2}{m_{\ell}m_{\ell+1}}.
\end{align*}
These inequalities imply that
\begin{align*}
&\sum_{j \in J^{\sharp}_{\ell+1}} q^{(\ell)}_j\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\phi_{J^{\sharp}}\|_n^2 + \|\hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\|_{\tau}^2\Bigr\}\\
&\le \lambda_{\ell}\Bigl\{\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\mathrm{Tr}\bigl[Z^{(\ell)}\widehat\Sigma_{\ell}(\widehat\Sigma_{\ell}+\lambda_{\ell}\mathrm{I})^{-1}Z^{(\ell)\top}\bigr]\Bigr\}
\left[\theta\,\frac{m_{\ell}\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]\frac{R^2}{m_{\ell}m_{\ell+1}}\\
&\le 4\lambda_{\ell}\Bigl\{\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)m_{\ell}\Bigr\}
\left[\theta\,\frac{m_{\ell}\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j \in [m_{\ell+1}]}\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]\frac{R^2}{m_{\ell}m_{\ell+1}}
\;\le\; 4\lambda_{\ell}\,\zeta_{\ell,\theta}\,\frac{R^2}{m_{\ell+1}}. \tag{11}
\end{align*}

Norm bound of the coefficients:
Here, we give an upper bound on the norm of the weight matrices of the compressed network. From (11) and the definition $\tau^{(\ell)} = \lambda_{\ell} m^{\sharp}_{\ell}\tilde\tau^{(\ell)}$, we have that
\[
\sum_{j \in J^{\sharp}_{\ell+1}} q^{(\ell)}_j\,\|\hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\|_{\tau^{(\ell)}}^2
\;\le\; \frac{1}{\lambda_{\ell}m^{\sharp}_{\ell}}\cdot 4\lambda_{\ell}\zeta_{\ell,\theta}\frac{R^2}{m_{\ell+1}}
\;=\; \frac{4\zeta_{\ell,\theta}R^2}{m_{\ell+1}m^{\sharp}_{\ell}}.
\]
Here, by Eq. (10), the condition $\sum_{j \in J^{\sharp}_{\ell+1}}(\tilde\tau^{(\ell+1)}_j)^{-1} \le m_{\ell+1}/m^{\sharp}_{\ell+1}$ is feasible, and under this condition we also have that
\[
\sum_{j \in J^{\sharp}_{\ell+1}} (\tilde\tau^{(\ell+1)}_j)^{-1}\,\|\hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\|_{\tau^{(\ell)}}^2
\;\le\; \frac{4\zeta_{\ell,\theta}R^2}{m_{\ell+1}m^{\sharp}_{\ell}}\times\frac{m_{\ell+1}}{m^{\sharp}_{\ell+1}}
\;=\; \frac{4\zeta_{\ell,\theta}}{m^{\sharp}_{\ell+1}m^{\sharp}_{\ell}}R^2,
\]
where we used the definition of $q^{(\ell)}_j$. Similarly, the approximation error bound (11) can be rewritten as
\[
\sum_{j \in J^{\sharp}_{\ell+1}} (\tilde\tau^{(\ell+1)}_j)^{-1}\,\|\hat W^{(\ell)}_{j,:}\phi - \hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\phi_{J^{\sharp}}\|_n^2
\;\le\; \frac{4\lambda_{\ell}\zeta_{\ell,\theta}}{m^{\sharp}_{\ell+1}}R^2. \tag{12}
\]
For $\ell = L$, the same inequality holds with $m_{L+1} = m^{\sharp}_{L+1} = 1$ and $\tilde\tau^{(L+1)}_j = 1$ ($j = 1$).

Overall approximation error bound: Given these inequalities, we bound the overall approximation error. Let $J^{\sharp}_{\ell}$ be the optimal index set chosen by Spectral Pruning for the $\ell$-th layer, and let the parameters of the compressed network be denoted by
\[
W^{\sharp(\ell)} = \hat W^{(\ell)}_{J^{\sharp}_{\ell+1},[m_{\ell}]}\hat A_{J^{\sharp}_{\ell}} \in \mathbb{R}^{m^{\sharp}_{\ell+1}\times m^{\sharp}_{\ell}},
\qquad
b^{\sharp(\ell)} = \hat b^{(\ell)}_{J^{\sharp}_{\ell+1}} \in \mathbb{R}^{m^{\sharp}_{\ell+1}}.
\]
Then, it holds that
\[
f^{\sharp}(x) = (W^{\sharp(L)}\eta(\cdot) + b^{\sharp(L)})\circ\cdots\circ(W^{\sharp(1)}x + b^{\sharp(1)}).
\]
Due to the scale invariance of $\eta$, we also have
\[
f^{\sharp}(x) = \bigl(W^{\sharp(L)}\mathrm{I}_{\tilde\tau^{(L)}}\eta(\cdot) + b^{\sharp(L)}\bigr)\circ\bigl(\mathrm{I}_{\tilde\tau^{(L)}}^{-1}W^{\sharp(L-1)}\mathrm{I}_{\tilde\tau^{(L-1)}}\eta(\cdot) + \mathrm{I}_{\tilde\tau^{(L)}}^{-1}b^{\sharp(L-1)}\bigr)\circ\cdots\circ\bigl(\mathrm{I}_{\tilde\tau^{(2)}}^{-1}W^{\sharp(1)}x + \mathrm{I}_{\tilde\tau^{(2)}}^{-1}b^{\sharp(1)}\bigr).
\]
Hence, if we define $\widetilde W^{(\ell)} = \mathrm{I}_{\tilde\tau^{(\ell+1)}}^{-1}W^{\sharp(\ell)}\mathrm{I}_{\tilde\tau^{(\ell)}}$ and $\tilde b^{(\ell)} = \mathrm{I}_{\tilde\tau^{(\ell+1)}}^{-1}b^{\sharp(\ell)}$, then we obtain another representation of $f^{\sharp}$:
\[
f^{\sharp}(x) = (\widetilde W^{(L)}\eta(\cdot) + \tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(1)}x + \tilde b^{(1)}).
\]
In the same manner, the original trained network $\hat f$ is also rewritten as
\begin{align*}
\hat f(x) &= (\hat W^{(L)}\eta(\cdot) + \hat b^{(L)})\circ\cdots\circ(\hat W^{(1)}x + \hat b^{(1)})\\
&= \bigl(\hat W^{(L)}\mathrm{I}_{\tilde\tau^{(L)}}\eta(\cdot) + \hat b^{(L)}\bigr)\circ\bigl(\mathrm{I}_{\tilde\tau^{(L)}}^{-1}\hat W^{(L-1)}\mathrm{I}_{\tilde\tau^{(L-1)}}\eta(\cdot) + \mathrm{I}_{\tilde\tau^{(L)}}^{-1}\hat b^{(L-1)}\bigr)\circ\cdots\circ\bigl(\mathrm{I}_{\tilde\tau^{(2)}}^{-1}\hat W^{(1)}x + \mathrm{I}_{\tilde\tau^{(2)}}^{-1}\hat b^{(1)}\bigr)\\
&=: (\ddot W^{(L)}\eta(\cdot) + \ddot b^{(L)})\circ\cdots\circ(\ddot W^{(1)}x + \ddot b^{(1)}),
\end{align*}
where we defined $\ddot W^{(\ell)} := \mathrm{I}_{\tilde\tau^{(\ell+1)}}^{-1}\hat W^{(\ell)}\mathrm{I}_{\tilde\tau^{(\ell)}}$ and $\ddot b^{(\ell)} := \mathrm{I}_{\tilde\tau^{(\ell+1)}}^{-1}\hat b^{(\ell)}$.

Then, the difference between $f^{\sharp}$ and $\hat f$ can be decomposed as
\begin{align*}
f^{\sharp}(x) - \hat f(x)
&= (\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(1)}x+\tilde b^{(1)}) - (\ddot W^{(L)}\eta(\cdot)+\ddot b^{(L)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\\
&= \sum_{\ell=2}^{L}\Bigl\{(\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\widetilde W^{(\ell)}\eta(\cdot)+\tilde b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\\
&\qquad\quad - (\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\ddot W^{(\ell)}\eta(\cdot)+\ddot b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\Bigr\}.
\end{align*}
We evaluate the $\|\cdot\|_n$-norm of this difference. First, notice that Eq. (12) is equivalent to the following inequality:
\[
\bigl\|(\widetilde W^{(\ell)}\eta(\cdot)+\tilde b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}\,\cdot\,+\ddot b^{(1)})
- (\ddot W^{(\ell)}_{J^{\sharp}_{\ell+1},[m_{\ell}]}\eta(\cdot)+\ddot b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}\,\cdot\,+\ddot b^{(1)})\bigr\|_n^2
\;\le\; \hat c\,\lambda_{\ell}\zeta_{\ell,\theta}\,\frac{R^2}{m^{\sharp}_{\ell+1}}.
\]
(We can check that this inequality is correct even for $\ell = 2$.)
Next, by evaluating the Lipschitz continuity of the $\ell$-th layer of $f^{\sharp}$ as
\begin{align*}
\|\widetilde W^{(\ell)}g - \widetilde W^{(\ell)}g'\|_n^2
&= \frac{1}{n}\sum_{i=1}^n \|\widetilde W^{(\ell)}g(x_i) - \widetilde W^{(\ell)}g'(x_i)\|^2
= \frac{1}{n}\sum_{i=1}^n (g(x_i)-g'(x_i))^{\top}(\widetilde W^{(\ell)})^{\top}\widetilde W^{(\ell)}(g(x_i)-g'(x_i))\\
&\le \frac{1}{n}\sum_{i=1}^n \|g(x_i)-g'(x_i)\|^2\,\mathrm{Tr}\bigl[(\widetilde W^{(\ell)})^{\top}\widetilde W^{(\ell)}\bigr]
\;\le\; \hat c\,\zeta_{\ell,\theta}\,\frac{R^2}{m^{\sharp}_{\ell+1}m^{\sharp}_{\ell}}\,\|g - g'\|_n^2
\end{align*}
for $g, g' : \mathbb{R}^d \to \mathbb{R}^{m^{\sharp}_{\ell}}$, it holds that
\begin{align*}
&\bigl\|(\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\widetilde W^{(\ell)}\eta(\cdot)+\tilde b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\\
&\qquad - (\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\ddot W^{(\ell)}\eta(\cdot)+\ddot b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\bigr\|_n^2\\
&\le \prod_{\ell'=\ell+1}^{L}\hat c\,\zeta_{\ell',\theta}\,\frac{R^2}{m^{\sharp}_{\ell'+1}m^{\sharp}_{\ell'}}\;\cdot\;\hat c\,\lambda_{\ell}\zeta_{\ell,\theta}\,\frac{R^2}{m^{\sharp}_{\ell+1}}
\;\le\; \lambda_{\ell}\prod_{\ell'=\ell}^{L}\bigl(\hat c\,\zeta_{\ell',\theta}\bar R^2\bigr)
\;=\; \lambda_{\ell}\,\bar R^{2(L-\ell+1)}\prod_{\ell'=\ell}^{L}\hat c\,\zeta_{\ell',\theta}.
\end{align*}
Then, by summing up the square roots of these bounds for $\ell = 2,\dots,L$, we obtain the whole approximation error bound.
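The index set $J^{\sharp}_{\ell}$ used throughout this appendix is the minimizer of $\theta L^{(A)}_{\tau}(J) + (1-\theta)L^{(B)}_{\tau}(J)$ over index sets of size $m^{\sharp}_{\ell}$. Exact minimization is combinatorial; the sketch below uses a simple greedy heuristic as an illustrative stand-in (our own simplification, not a procedure prescribed by the paper), and it also takes $\tau' \equiv 1$ instead of the leverage-score weights of Eq. (8).

```python
import numpy as np

def combined_objective(Sigma, Z, J, lam, theta):
    """theta * L^(A)_tau(J) + (1 - theta) * L^(B)_tau(J) for a candidate index set J.

    Simplification: tau = lam * |J| * tau' with tau' = 1 (uniform weights)."""
    J = list(J)
    tau = lam * len(J) * np.ones(len(J))
    K = np.linalg.inv(Sigma[np.ix_(J, J)] + np.diag(tau))
    resid = Sigma - Sigma[:, J] @ K @ Sigma[J, :]
    return theta * np.trace(resid) + (1.0 - theta) * np.trace(Z @ resid @ Z.T)

def greedy_index_set(Sigma, Z, m_sharp, lam, theta=0.5):
    """Grow J one node at a time; a heuristic stand-in for the exact minimizer J_sharp."""
    J, remaining = [], list(range(Sigma.shape[0]))
    for _ in range(m_sharp):
        best = min(remaining, key=lambda j: combined_objective(Sigma, Z, J + [j], lam, theta))
        J.append(best)
        remaining.remove(best)
    return J
```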
(ii) Simultaneous-procedure

Here, we give the bounds corresponding to the simultaneous-procedure. The proof technique is quite similar to that of the backward-procedure; however, instead of the norm bounds derived for the backward-procedure, we derive $\ell_\infty$-type bounds for both the approximation error and the norm of the coefficients. We let $q^{(\ell)}_j = (\tilde\tau^{(\ell+1)}_j)^{-1}$ for $j = 1,\dots,m_{\ell+1}$.

As for the input-aware quantity $L^{(A)}_{\tau}$, it holds that
\begin{align*}
\sum_{j=1}^{m_{\ell+1}} \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\bigl\|(q^{(\ell)}_j)^{1/2}\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\bigr\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}
&= \sum_{j=1}^{m_{\ell+1}} q^{(\ell)}_j \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}\\
&= \sum_{j=1}^{m_{\ell+1}} q^{(\ell)}_j\,\hat W^{(\ell)}_{j,:}\bigl[\widehat\Sigma_{F,F}-\widehat\Sigma_{F,J^{\sharp}_{\ell}}(\widehat\Sigma_{J^{\sharp}_{\ell},J^{\sharp}_{\ell}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp}_{\ell},F}\bigr](\hat W^{(\ell)}_{j,:})^{\top}\\
&\le \bigl\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\bigr\|_{\mathrm{op}}\,\mathrm{Tr}\bigl\{\widehat\Sigma_{F,F}-\widehat\Sigma_{F,J^{\sharp}_{\ell}}(\widehat\Sigma_{J^{\sharp}_{\ell},J^{\sharp}_{\ell}}+\mathrm{I}_{\tau})^{-1}\widehat\Sigma_{J^{\sharp}_{\ell},F}\bigr\}\\
&\le c_{\mathrm{scale}}R^2\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2}\,L^{(A)}_{\tau}(J^{\sharp}_{\ell}).
\end{align*}
Moreover, as for the output-aware quantity $L^{(B)}_{\tau}$, we have that
\[
\sum_{j=1}^{m_{\ell+1}} q^{(\ell)}_j \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}
= \sum_{j=1}^{m_{\ell+1}} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2 \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\left\{\left\|\frac{1}{\|\hat W^{(\ell)}_{j,:}\|}\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\right\|_n^2 + \|\alpha\|_{\tau}^2\right\}
\;\le\; c_{\mathrm{scale}}R^2\,L^{(B)}_{\tau}(J^{\sharp}_{\ell}).
\]
By combining these inequalities, it holds that
\begin{align*}
\sum_{1\le j\le m_{\ell+1}} q^{(\ell)}_j \inf_{\alpha \in \mathbb{R}^{|J^{\sharp}_{\ell}|}}\Bigl\{\|\hat W^{(\ell)}_{j,:}\phi - \alpha^{\top}\phi_{J^{\sharp}_{\ell}}\|_n^2 + \|\alpha\|_{\tau}^2\Bigr\}
&\le \bigl[\theta L^{(A)}_{\tau}(J^{\sharp}_{\ell}) + (1-\theta)L^{(B)}_{\tau}(J^{\sharp}_{\ell})\bigr]
\left[\theta\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]c_{\mathrm{scale}}R^2\\
&\le c_{\mathrm{scale}}\lambda_{\ell}\bigl[\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\hat N'_{\ell}(\lambda_{\ell};\mathcal{N}_{\ell})\bigr]
\left[\theta\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]R^2.
\end{align*}
Therefore, by the definition of $q^{(\ell)}_j$ and $\tau$, it holds that
\[
\sum_{1\le j\le m_{\ell+1}} (\tilde\tau^{(\ell+1)}_j)^{-1}\,\|\hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\|_{\tau^{(\ell)}}^2
\;\le\; \frac{c_{\mathrm{scale}}\bigl[\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\hat N'_{\ell}(\lambda_{\ell};\mathcal{N}_{\ell})\bigr]}{m^{\sharp}_{\ell}}
\left[\theta\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]R^2
\;=\; \frac{4\zeta_{\ell,\theta}}{m^{\sharp}_{\ell+1}m^{\sharp}_{\ell}}R^2. \tag{13}
\]
Similarly, the approximation error bound can be evaluated as
\[
\sum_{j \in [m_{\ell+1}]} (\tilde\tau^{(\ell+1)}_j)^{-1}\,\|\hat W^{(\ell)}_{j,:}\phi - \hat W^{(\ell)}_{j,:}\hat A_{J^{\sharp}}\phi_{J^{\sharp}}\|_n^2
\;\le\; c_{\mathrm{scale}}\lambda_{\ell}\bigl[\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\hat N'_{\ell}(\lambda_{\ell};\mathcal{N}_{\ell})\bigr]
\left[\theta\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]R^2
\;=\; 4\lambda_{\ell}\zeta_{\ell,\theta}\,\frac{R^2}{m^{\sharp}_{\ell+1}}. \tag{14}
\]
This gives the following equivalent inequality:
\[
\sum_{j \in [m_{\ell}]}\bigl\|(\widetilde W^{(\ell)}_{j,:}\eta(\cdot)+\tilde b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}\,\cdot\,+\ddot b^{(1)})
- (\ddot W^{(\ell)}_{j,:}\eta(\cdot)+\ddot b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}\,\cdot\,+\ddot b^{(1)})\bigr\|_n^2
\;\le\; c_{\mathrm{scale}}\lambda_{\ell}\bigl[\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\hat N'_{\ell}(\lambda_{\ell};\mathcal{N}_{\ell})\bigr]R^2.
\]
Moreover, the norm bound (13) gives the following Lipschitz continuity bound for each layer: for $g, g' : \mathbb{R}^d \to \mathbb{R}^{m^{\sharp}_{\ell}}$,
\begin{align*}
\sum_{j \in [m_{\ell+1}]}\|\widetilde W^{(\ell)}_{j,:}g - \widetilde W^{(\ell)}_{j,:}g'\|_n^2
&= \sum_{j \in [m_{\ell+1}]}\frac{1}{n}\sum_{i=1}^n\bigl(\widetilde W^{(\ell)}_{j,:}g(x_i) - \widetilde W^{(\ell)}_{j,:}g'(x_i)\bigr)^2
\;\le\; \sum_{j \in [m_{\ell+1}]}\|\widetilde W^{(\ell)}_{j,:}\|^2\sum_{j' \in [m^{\sharp}_{\ell}]}\frac{1}{n}\sum_{i=1}^n\bigl(g_{j'}(x_i) - g'_{j'}(x_i)\bigr)^2\\
&\le \frac{c_{\mathrm{scale}}\bigl[\theta\hat N_{\ell}(\lambda_{\ell}) + (1-\theta)\hat N'_{\ell}(\lambda_{\ell};\mathcal{N}_{\ell})\bigr]}{m^{\sharp}_{\ell}}
\left[\theta\,\frac{\|(\hat W^{(\ell)})^{\top}\mathrm{I}_{q^{(\ell)}}\hat W^{(\ell)}\|_{\mathrm{op}}}{\max_{j} q^{(\ell)}_j\|\hat W^{(\ell)}_{j,:}\|^2} + (1-\theta)\right]R^2\,\|g - g'\|_n^2.
\end{align*}
Combining these inequalities, it holds that
\begin{align*}
&\bigl\|(\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\widetilde W^{(\ell)}\eta(\cdot)+\tilde b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\\
&\qquad - (\widetilde W^{(L)}\eta(\cdot)+\tilde b^{(L)})\circ\cdots\circ(\widetilde W^{(\ell+1)}\eta(\cdot)+\tilde b^{(\ell+1)})\circ(\ddot W^{(\ell)}\eta(\cdot)+\ddot b^{(\ell)})\circ(\ddot W^{(\ell-1)}\eta(\cdot)+\ddot b^{(\ell-1)})\circ\cdots\circ(\ddot W^{(1)}x+\ddot b^{(1)})\bigr\|_n^2\\
&\le \lambda_{\ell}\,\bar R^{2(L-\ell+1)}\,\frac{1}{\prod_{\ell'=\ell+1}^{L}m^{\sharp}_{\ell'}}\prod_{\ell'=\ell}^{L}
c_{\mathrm{scale}}\bigl[\theta\hat N_{\ell'}(\lambda_{\ell'}) + (1-\theta)\hat N'_{\ell'}(\lambda_{\ell'};\mathcal{N}_{\ell'})\bigr]
\left[\theta\,\frac{\|(\hat W^{(\ell')})^{\top}\mathrm{I}_{q^{(\ell')}}\hat W^{(\ell')}\|_{\mathrm{op}}}{\max_{j} q^{(\ell')}_j\|\hat W^{(\ell')}_{j,:}\|^2} + (1-\theta)\right].
\end{align*}
By summing up the square root of this for $\ell = 2,\dots,L$, we obtain the assertion.
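Both procedures are controlled by the degrees of freedom $\hat N_{\ell}(\lambda_{\ell}) = \mathrm{Tr}[\widehat\Sigma_{\ell}(\widehat\Sigma_{\ell}+\lambda_{\ell}\mathrm{I})^{-1}] = \sum_j \mu_j/(\mu_j+\lambda_{\ell})$, where $\mu_j$ are the eigenvalues of the layer covariance. The sketch below (ours) computes this quantity from the eigenvalues; the bisection helper is only one possible way to pick $\lambda_{\ell}$ for a target width and is an assumption on our part rather than the paper's prescription.

```python
import numpy as np

def degrees_of_freedom(Sigma, lam):
    """N_hat(lam) = Tr[Sigma (Sigma + lam I)^{-1}] = sum_j mu_j / (mu_j + lam)."""
    mu = np.clip(np.linalg.eigvalsh(Sigma), 0.0, None)  # guard against tiny negative eigenvalues
    return float(np.sum(mu / (mu + lam)))

def lambda_for_target_width(Sigma, m_sharp, lo=1e-12, hi=1e12, iters=100):
    """Bisection (in log scale) for a lambda with N_hat(lambda) close to m_sharp."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if degrees_of_freedom(Sigma, mid) > m_sharp:
            lo = mid  # N_hat is decreasing in lambda, so increase lambda
        else:
            hi = mid
    return np.sqrt(lo * hi)
```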
C Proof of Theorem 2 (Generalization error bound of the compressed network)
C.1 Notations
For a sequence of widths $m' = (m'_1,\dots,m'_L)$, let
\[
\hat{\mathcal{F}}_{m'} := \Bigl\{\, f(x) = (W^{(L)}\eta(\cdot)+b^{(L)})\circ\cdots\circ(W^{(1)}x+b^{(1)})
\;\Big|\; \|W^{(\ell)}\|_{\mathrm{F}} \le \bar R,\ \|b^{(\ell)}\| \le \bar R_b,\ W^{(\ell)} \in \mathbb{R}^{m'_{\ell+1}\times m'_{\ell}},\ b^{(\ell)} \in \mathbb{R}^{m'_{\ell+1}}\ (1 \le \ell \le L)\,\Bigr\}.
\]

Proposition 2.
Under Assumptions 1 and 2, the $\ell_\infty$-norm of $f \in \hat{\mathcal{F}}_{m'}$ is bounded as
\[
\|f\|_\infty \;\le\; \bar R^{L}D_x + \sum_{\ell=1}^{L}\bar R^{L-\ell}\bar R_b.
\]
The proof follows immediately by noting that the Lipschitz constant of each layer with respect to the Euclidean norm is bounded by $\|W^{(\ell)}\|_{\mathrm{F}}$.

By the scale invariance of the activation function $\eta$, $\hat{\mathcal{F}}_{m^{\sharp}}$ can be rewritten as
\[
\hat{\mathcal{F}}_{m^{\sharp}} = \Bigl\{\, f(x) = (W^{(L)}\eta(\cdot)+b^{(L)})\circ\cdots\circ(W^{(1)}x+b^{(1)})
\;\Big|\; \|W^{(\ell)}\|_{\mathrm{F}} \le \sqrt{\tfrac{m^{\sharp}_{\ell+1}}{m^{\sharp}_{\ell}}}\,\bar R,\ \|b^{(\ell)}\| \le \sqrt{m^{\sharp}_{\ell+1}}\,\bar R_b,\ W^{(\ell)} \in \mathbb{R}^{m^{\sharp}_{\ell+1}\times m^{\sharp}_{\ell}},\ b^{(\ell)} \in \mathbb{R}^{m^{\sharp}_{\ell+1}}\ (1 \le \ell \le L)\,\Bigr\}.
\]
Hence, from Theorem 1 and the argument in Appendix B, we can see that, under Assumption 3, it holds that $f^{\sharp} \in \hat{\mathcal{F}}_{m^{\sharp}}$ for both the backward-procedure and the simultaneous-procedure. Therefore, the compressed network $f^{\sharp}$ of both procedures with the constraint satisfies the $\ell_\infty$-bound $\|[[f^{\sharp}]]\|_\infty \le \hat R_\infty$.
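As a quick numerical illustration of Proposition 2 (our own sketch, with illustrative constants), the $\ell_\infty$-bound can be evaluated directly from the constants of the class:

```python
def sup_norm_bound(R_bar, R_bar_b, D_x, L):
    """Proposition 2: ||f||_inf <= R_bar^L * D_x + sum_{l=1}^{L} R_bar^(L-l) * R_bar_b."""
    return R_bar ** L * D_x + sum(R_bar ** (L - l) * R_bar_b for l in range(1, L + 1))

# Example with illustrative constants (not values from the paper):
# sup_norm_bound(R_bar=1.5, R_bar_b=0.1, D_x=1.0, L=4)
```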
Recall that the $\epsilon$-internal covering number of a (semi-)metric space $(T,d)$ is the minimum cardinality of a finite subset of $T$ such that every element of $T$ is within distance $\epsilon$ of the finite set with respect to the metric $d$. We denote by $N(\epsilon,T,d)$ the $\epsilon$-internal covering number of $(T,d)$. The covering number of the neural network model $\hat{\mathcal{F}}_{m'}$ can be evaluated as follows (see, for example, [Suzuki, 2018]):

Proposition 3.
The covering number of $\hat{\mathcal{F}}_{m'}$ is bounded as
\[
\log N(\epsilon,\hat{\mathcal{F}}_{m'},\|\cdot\|_\infty) \;\le\; C\sum_{\ell=1}^{L}m'_{\ell}m'_{\ell+1}\,\log_+\!\left(\frac{G\max\{\bar R,\bar R_b\}}{\epsilon}\right)
\]
for a universal constant $C > 0$.

We define
\[
\mathcal{G}_{m'} = \{\, g(x,y) = \psi(y,f(x)) \mid f \in \hat{\mathcal{F}}_{m'} \,\}
\]
for $m' = (m'_1,\dots,m'_L)$. Then, its Rademacher complexity can be bounded as follows.

Lemma 2.
Let $(\epsilon_i)_{i=1}^n$ be an i.i.d. Rademacher sequence, that is, $P(\epsilon_i = 1) = P(\epsilon_i = -1) = 1/2$. There exists a universal constant $C > 0$ such that
\[
\mathrm{E}\left[\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i g(x_i,y_i)\right|\right]
\le C\rho\left[\hat R_\infty\sqrt{\frac{\sum_{\ell=1}^{L}m'_{\ell}m'_{\ell+1}}{n}\log_+\!\left(\frac{G\max\{\bar R,\bar R_b\}}{\hat R_\infty}\right)}
\;\vee\;\hat R_\infty\,\frac{\sum_{\ell=1}^{L}m'_{\ell}m'_{\ell+1}}{n}\log_+\!\left(\frac{G\max\{\bar R,\bar R_b\}}{\hat R_\infty}\right)\right],
\]
where the expectation is taken with respect to $(\epsilon_i, x_i, y_i)$.

Proof. Since $\psi$ is $\rho$-Lipschitz continuous, the contraction inequality (Theorem 4.12 of [Ledoux and Talagrand, 1991]) gives an upper bound of the left hand side as
\[
\mathrm{E}\left[\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i g(x_i,y_i)\right|\right]
\le \rho\,\mathrm{E}\left[\sup_{f \in \hat{\mathcal{F}}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)\right|\right].
\]
We further bound the right hand side. By Theorem 3.1 of [Giné and Koltchinskii, 2006] or Lemma 2.3 of [Mendelson, 2002] combined with the covering number bound (Proposition 3), there exists a universal constant $C'$ such that
\[
\mathrm{E}\left[\sup_{f \in \hat{\mathcal{F}}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)\right|\right]
\le C'\left[\hat R_\infty\sqrt{\frac{\sum_{\ell=1}^{L}m'_{\ell}m'_{\ell+1}}{n}\log_+\!\left(\frac{G\max\{\bar R,\bar R_b\}}{\hat R_\infty}\right)}
\;\vee\;\hat R_\infty\,\frac{\sum_{\ell=1}^{L}m'_{\ell}m'_{\ell+1}}{n}\log_+\!\left(\frac{G\max\{\bar R,\bar R_b\}}{\hat R_\infty}\right)\right].
\]
This concludes the proof.

Now we are ready to prove the theorem.
Proof of Theorem 2.
Since $\mathcal{G}_{m'}$ is separable with respect to the $\|\cdot\|_\infty$-norm, by the standard symmetrization argument, we have that
\[
P\left\{\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n g(x_i,y_i) - \mathrm{E}_{X,Y}[g]\right|
\ge 2\,\mathrm{E}\left[\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i g(x_i,y_i)\right|\right] + 3\hat R_\infty\sqrt{\frac{t}{n}}\right\} \le e^{-t}
\]
for all $t > 0$ (see, for example, Theorem 3.4.5 of [Giné and Nickl, 2015]). Taking a uniform bound with respect to the choice of $m' \in [m_2]\times[m_3]\times\cdots\times[m_L]$, we have that
\[
P\left\{\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n g(x_i,y_i) - \mathrm{E}_{X,Y}[g]\right|
\ge 2\,\mathrm{E}\left[\sup_{g \in \mathcal{G}_{m'}}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i g(x_i,y_i)\right|\right] + 3\hat R_\infty\sqrt{\frac{t + \sum_{\ell=2}^{L}\log(m_\ell)}{n}}
\ \text{ for some } m' \in [m_2]\times\cdots\times[m_L]\right\} \le e^{-t}. \tag{15}
\]
Now, the generalization error of $f^{\sharp}$ can be decomposed as
\[
\Psi([[f^{\sharp}]]) = \underbrace{\Psi([[f^{\sharp}]]) - \hat\Psi([[f^{\sharp}]])}_{\clubsuit} + \underbrace{\hat\Psi([[f^{\sharp}]]) - \hat\Psi([[\hat f]])}_{\diamondsuit} + \hat\Psi([[\hat f]]).
\]
Since the truncation operation $[[\cdot]]$ does not increase the $\|\cdot\|_\infty$-norm, we can apply inequality (15) and Lemma 2 also to $[[f^{\sharp}]]$ to bound the term $\clubsuit$. The term $\diamondsuit$ can be bounded as
\[
\hat\Psi([[f^{\sharp}]]) - \hat\Psi([[\hat f]])
\le \frac{1}{n}\sum_{i=1}^n\bigl|\psi(y_i,[[f^{\sharp}(x_i)]]) - \psi(y_i,[[\hat f(x_i)]])\bigr|
\le \frac{1}{n}\sum_{i=1}^n \rho\,\bigl|[[f^{\sharp}(x_i)]] - [[\hat f(x_i)]]\bigr|
\le \rho\sqrt{\frac{1}{n}\sum_{i=1}^n\bigl([[f^{\sharp}(x_i)]] - [[\hat f(x_i)]]\bigr)^2}
= \rho\,\|[[f^{\sharp}]] - [[\hat f]]\|_n
\le \rho\,\|f^{\sharp} - \hat f\|_n \le \rho\,\delta.
\]
Combining these inequalities, we obtain the assertion.
D Additional numerical experiments
This section gives additional numerical experiments for compressing the network.
D.1 Compressing VGG-16 on ImageNet
Here, we also applied our method to compress a publicly available VGG-16 network [Simonyan and Zisserman, 2014] on the ImageNet dataset [Deng et al., 2009]. We used the ILSVRC2012 dataset, which consists of 1.3M training images and 50,000 validation images, each annotated with one of 1,000 categories. We applied our method to this network and compared it with existing methods, namely APoZ [Hu et al., 2016], SqueezeNet [Iandola et al., 2016], and ThiNet [Luo et al., 2017], all of which are applied to the same VGG-16 network. For a fair comparison, we followed the same experimental settings as [Luo et al., 2017]: the training data generation, data augmentation, and performance evaluation schemes.

The results are summarized in Table 3, which reports the Top-1/Top-5 classification accuracies, the number of parameters, and the number of FLOPs. For Spec-Conv, the number of channels in each layer of $f^{\sharp}$ was set to be the same as that of ThiNet-Conv. Spec-GAP is a method that replaces the FC layers of Spec-Conv with a global average pooling (GAP) layer [Lin et al., 2013; Zhou et al., 2016]; here, we again set the number of channels in each layer of Spec-GAP to be the same as that of ThiNet-GAP. We employed $\lambda_\ell \propto \mathrm{Tr}[\widehat\Sigma^{(\ell)}]$ with a fixed constant of proportionality and a fixed $\theta$ for our method. We see that in both situations our method outperforms ThiNet in terms of accuracy, which shows the effectiveness of the proposed method while it is also supported by the theoretical analysis.

Table 3: Performance comparison on the ImageNet dataset. Our proposed method is compared with APoZ-2 and ThiNet. Our method is indicated as "Spec-(type)".
Model                             Top-1      Top-5      #Param.    FLOPs
APoZ-2 [Hu et al., 2016]          70.15%     89.69%     51.24M     30.94B
ThiNet-Conv [Luo et al., 2017]    69.80%     89.53%     131.44M    9.58B
ThiNet-GAP [Luo et al., 2017]     67.34%     87.92%     8.32M      9.34B
Spec-Conv                         70.418%    90.094%
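For completeness, the sketch below illustrates one way the quantities used by our method could be computed for a convolutional layer such as those of VGG-16: the channels are treated as nodes and every spatial position of every image is treated as a sample when estimating the covariance, the per-layer regularization is taken proportional to the trace of that covariance, and the selected channels are contracted into the next layer's kernel. These are our implementation assumptions for illustration; in particular, the constant in `layer_lambda` is a placeholder, not the value used in the experiments.

```python
import numpy as np

def channel_covariance(activations):
    """Empirical covariance of conv-layer channels.

    activations: (N, C, H, W) feature maps collected on training images; each
    spatial position of each image is treated as one sample of the C-dimensional
    node-output vector (an implementation choice)."""
    N, C, H, W = activations.shape
    X = activations.transpose(0, 2, 3, 1).reshape(-1, C)  # (N*H*W, C)
    return X.T @ X / X.shape[0]

def layer_lambda(Sigma, scale=1e-3):
    """Regularization proportional to Tr[Sigma]; `scale` is an illustrative placeholder."""
    return scale * np.trace(Sigma)

def compress_conv_kernel(K_hat, A_J_in, J_out):
    """Contract the input-channel axis of a (C_out, C_in, kh, kw) kernel with the
    reconstruction matrix A_J_in (C_in, |J_in|) and keep the output channels J_out;
    this mirrors the dense construction W_sharp = W_hat[J_out, :] @ A_J for conv layers."""
    return np.einsum('oikl,ij->ojkl', K_hat[J_out], A_J_in)
```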