A Deeper Look into Convolutions via Pruning
Ilke Cugu*, Emre Akbas
Department of Computer Engineering, Middle East Technical University, 06800, Ankara, Turkey
*Corresponding author: [email protected] (I. Cugu)
Keywords: pruning, model compression, convolutional neural networks, deep learning, expert kernels, eigenvalues
Abstract
Convolutional neural networks (CNNs) are able to attain better visual recognition performance than fully connected neural networks despite having far fewer parameters, owing to their parameter sharing principle. Hence, modern architectures are designed to contain only a small number of fully-connected layers, often at the end, after multiple layers of convolutions. It is interesting to observe that we can replace large fully-connected layers with relatively small groups of tiny matrices applied on the entire image. Moreover, although this strategy already reduces the number of parameters, most of the convolutions can be eliminated as well, without any loss in recognition performance. However, there is no solid recipe for detecting the hidden subset of convolutional neurons that is responsible for the majority of the recognition work. Hence, in this work, we use matrix characteristics based on eigenvalues, in addition to the classical weight-based importance assignment approach for pruning, to shed light on the internal mechanisms of a widely used family of CNNs, namely residual neural networks (ResNets), for the image classification problem using the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets. The code is available at https://github.com/cuguilke/psykedelic.
1. Introduction
Modern deep neural networks are able to attain impressive performance on computer vision problems by using millions of parameters scattered across multiple layers in the form of convolutions. However, as we show in this study, one can often eliminate most of these parameters without any negative effect on the generalization error. A classical method of doing so is to prune the parameters that are effectively zero. However, in the case of pruning groups of parameters, such as the channels of a convolutional neuron, it is not clear how to assess their importance. In the literature, the de facto standard of assessing neuron importance is to check the average absolute value of its weights, which is an acceptable proposition for fully connected neural nets [23, 22, 66, 19, 41, 64, 67, 63, 69, 7]. However, for visual recognition, convolutional neural nets (CNNs) [13, 37] have already surpassed fully connected nets in terms of both accuracy and efficiency. CNNs reduced the number of parameters dramatically due to their parameter sharing principle and local connectivity. Although the general tendency is still to keep several fully connected (FC) layers right before the output layers [26, 64, 29], it has been shown that one can get rid of the FC layers entirely while preserving competitive performance [46, 31, 59]. Today, modern visual recognition architectures mostly consist of convolutions. Hence, we can think of convolutions as a very effective model compression method in and of itself. However, the story does not end there, since not all kernels are equally important for recognition. In the following, we refer to each n × n channel of a convolutional neuron as a kernel. In fact, as we also show here, depending on the difficulty of a problem, it is possible that only a small portion of the
kernels are actually vital for visual recognition [39]. When the absence of a kernel results in a significant rise in the generalization error, we call that kernel an expert kernel, and the emergence of expert kernels is often enforced via regularization, which is also used to limit the capacity for memorization [71]. However, several studies on the generalization properties of deep learning state that the convolutional case is more difficult to analyze [39, 3], so the theoretical advances mostly stay in the fully connected realm. Therefore, in this work, we study convolutions and ask the question: "How can we characterize or identify expert kernels?". To this end, we focus on the fact that kernels are basically tiny matrices applied on the entire image. Thus, we employ matrix characteristics as heuristics to assess the significance of a given kernel. In taxonomy, this motivation puts our work under the category of unstructured, heuristic-based, kernel-level pruning in the model compression literature. Our approach is to eliminate as many kernels as possible using a set of shared hyperparameters that enable the best heuristics to maintain the recognition performance of the unpruned version of a given model without the need for a fine-tuning phase after pruning. Then, the candidate heuristics are used to provide more insight into the expert kernels in convolutional neural nets. Discovering the correct recipe for recognizing expert kernels may influence both theory and practice, or, at least, it may validate our current assumptions. On the theory side, there are exciting empirical observations on overparameterization, such as the lottery ticket hypothesis [12] and the double descent in generalization error [6, 53]. These studies try to explain deep neural nets' ability to avoid overfitting despite having many more parameters than the number of available samples, and we argue that the definition of expert kernels can shed new light on the discussion of overparameterization in deep learning. In practice, the elimination of non-expert kernels may produce very sparse models. As we show in this study, the sparsity structure of
the pruned model heavily depends on the problem definition (i.e., the dataset); hence, in some cases, it is possible to remove large groups of neurons, resulting in a direct drop in the computational cost. Moreover, such models, when combined with the advances in high performance sparse matrix computations, can yield families of energy efficient neural nets [60].
In this paper, we present, to the best of our knowledge, the first extensive analysis of eigenvalue based heuristics for marking the expert kernels, the distribution of expert kernels through layers, and the change in their pruning behavior during the different phases of training. Our findings include: (1) identification of the spectral norm as a more robust heuristic (in the sense that it better preserves generalization accuracy) than the widely-used average absolute weight of a kernel, (2) the feasibility of ignoring the complex parts of the eigenvalues to push the pruning ratio further without a significant loss in recognition performance, (3) the dataset's footprint on the active parameter distribution across layers after pruning, (4) the observation of a shift in eigenvalues after the first drop in learning rate, causing some heuristics to fail.
This paper is organized as follows: we start by reviewing the relevant work in the literature in Section 2. Next, we introduce our candidate heuristics for assessing kernel quality in Section 3, the relevant regularization in Section 4, and our experimental setup in Section 5. Then, our analysis starts: (1) the standard performance comparison in Section 6; (2) since most of our heuristics are based on eigenvalues, we discuss real vs. complex eigenvalues in Section 7; (3) the which-heuristic-prunes-which-kernels analysis in Section 8; (4) the layer-wise pruning analysis in Section 9; and (5) the epoch-wise pruning analysis in Section 10.
2. Related Work
There are several ways of doing model compression in deep learning, such as quantization, binarization, low-rank approximation, knowledge distillation, architectural design, and pruning. In this work, we are only interested in pruning, since it is the only category that directly targets the inessential parts of deep neural nets, whereas the closest family of compression techniques in this regard focuses on approximating the overall computation (low-rank approximation). In the literature, some of the early examples of pruning research are done on saliency based methods [50, 38, 51, 24], and the interest continues as new, deeper architectures are being introduced [49, 10, 70]. In this line of work, the importance of a unit (weight/neuron/group) is determined by its direct effect on the training error. Although the required computation is immense, and hence the emphasis is on the development of good approximators, the reasoning behind this approach is solid and straightforward. Another line of work does pruning based on unit outputs. To our knowledge, the first study of this kind [62] looks for invariants. These invariants are defined as units whose outputs do not change across different input patterns. In other words, the units that act like bias terms are removed to save space. This is not a widely used technique anymore, perhaps because the large datasets of today make it hard for units to be invariant. Today, the focus is on (1) minimizing the reconstruction error of output feature maps per layer [28, 47], and (2) clustering units w.r.t. their activations [11, 4, 52]. We can also count the latter studies under another category, namely clustering-based pruning [55, 65, 63, 42]. A relatively similar study [27] uses the geometric median to set representative/centroid filters and prunes the ones that are close to the geometric median. The next set of studies [44, 68, 18, 72] leverage scaling factors such as the γ of batch norm [32], or introduce their own for different levels of analysis (neuron/group/layer) [30], to cut off the outputs of some units. The advantage of this method is that it removes the input dependency of activation based importance assignment. In other words, if the scaling factor is 0, it is guaranteed that the corresponding unit has no contribution to recognition. We can include mask based pruning methods [33, 57] under this category as well, since the basic principle is the same. Although quite useful in practice, this approach does not tell us anything about the nature of a "good unit" for recognition. The decision is left entirely to the optimization by simply adding a regularization term for the scaling factors. Similarly, a recent work [43] also leaves the importance assignment to the learning itself by employing a minimax game as in generative adversarial networks [17]. One way to study this black box is to introduce heuristics and test their validity. A well-known heuristic is the average of the absolute values of the weights of a unit [23, 22, 66, 19]. Hence, we take it as the baseline for our matrix characteristics based heuristics (see Section 3). There are also analysis papers related to ours, since the motivation is the same: to understand deep convolutional neural nets via pruning. Liu et al.'s work [45] presents an extensive empirical study of pruning algorithms to see if there is indeed an advantage of pruning over training the target architecture from scratch. Their methodology involves a fine-tuning phase after pruning to recover the loss in performance.
In that setting, it is shown that structured pruning can be omitted, and, instead, one can directly train the target architecture from scratch and achieve competitive performance. However, it seems this is not the case for unstructured pruning. In our study, we also employ an unstructured pruning scheme, but we do not have a fine-tuning phase, since our aim is to prune as much as possible while preserving the vanilla performance. On another note, pruning is argued to be important since one cannot achieve the same adversarial robustness or performance by training from scratch [69]. In addition, a recent study [7] confirms the observations of Liu et al. [45] and studies the differences between the representations of pruned nets and vanilla nets.
3. Expert Kernel Heuristics
Convolutional layers are quite interesting since they can express more representational power than their fully-connected counterparts despite having much fewer parameters. However, in the pruning literature, convolutional layers are often treated as if they were fully connected layers. A common approach is to compute the average absolute weight of a given kernel (in this work, a kernel refers to a single channel of a neuron in a convolutional layer). However, although this technique works well in practice, it is not clear whether the average weight is the real indicator of a kernel's redundancy. Hence, we propose several heuristics that we refer to as compression modes throughout the paper. Our compression modes, for a given n × n real kernel K, are:

• det: abs determinant = |det(K)|
If we take values smaller than 10⁻¹² as effectively 0, then this mode prunes kernels that are not full-rank. Therefore, its corresponding hypothesis is that an expert kernel is a full-rank matrix.

• det_gram: abs determinant of the Gramian matrix = |det(G)|, where G = KᵀK. Since we employ weight regularization (see Section 4), kernels tend to have very small weights even if they are not pruned. Combined with a threshold value of 10⁻²⁴, this mode checks the possible numerical instability caused by weight regularization. Formally,
det(G) = det(KᵀK) = det(Kᵀ) det(K) = det(K)²,
which is basically the same criterion as det. However, in Section 8, we will see that floating point constraints prevent the two compression modes from being identical.

• min_eig: min abs eigenvalue = min { |λᵢ(K)| : i ∈ {1, …, n} }
This mode basically tests the hypothesis of the det mode, since the smallest absolute eigenvalue is enough to check whether a matrix is full-rank. det alone is not enough for that, since it is the product of all eigenvalues, in which case the largest absolute eigenvalue may prevent an overkill.

• min_eig_real: min abs eigenvalue (real parts only) = min { |aᵢ(K)| : i ∈ {1, …, n} }, where λ = a + bi. As explained in Section 5, we are generally dealing with 3 × 3 kernels, and these matrices can have complex conjugate eigenvalues. Hence, this mode enables us to measure the relevance of the imaginary part of such eigenvalues.

• spectral_radius: max abs eigenvalue = max { |λᵢ(K)| : i ∈ {1, …, n} }
This mode measures the largest scaling of the eigenvectors of K while being a safe comparison factor for a possible overkill due to min_eig.

• spectral_radius_real: max abs eigenvalue (real parts only) = max { |aᵢ(K)| : i ∈ {1, …, n} }, where λ = a + bi. It has the same reasoning as min_eig_real.

• spectral_norm: max singular value = √(λ_max(G)), where G = Kᴴ K = KᵀK since the kernel K is a real matrix. This mode is quite interesting since it is the key component of residual network invertibility [5] and stable discriminator training in GANs [48]. Hence, naturally, we also want to test the applicability of the spectral norm as an importance indicator for pruning convolutional kernels. Notice that G is an n × n real symmetric matrix, so λᵢ(G) ∈ ℝ and λᵢ(G) ≥ 0 for all i ∈ {1, …, n}. Thus, there is no spectral_norm_real mode.

• weight: avg of abs weights = (1/n²) Σᵢ Σⱼ |wᵢⱼ|, where wᵢⱼ ∈ K. This is the control group since it is the de facto standard of pruning in practice.

Note that since the exact calculation of the eigenvalues of a matrix larger than 4 × 4 is, in general, not possible according to the insolvability theory of Abel [1] and Galois [14], some of the compression modes are only applicable to a restricted set of convolutional architectures.
On the other hand, most well-known models in the literature make extensive use of 3 × 3 kernels, and we employed a model family satisfying this constraint, as reported in Section 5.
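To make these definitions concrete, the following is a minimal sketch (our own illustration, not taken from the psykedelic repository) that scores a single kernel under each compression mode using NumPy; the keys mirror the mode names above.

```python
import numpy as np

def compression_scores(K):
    """Score one n x n real kernel K under each compression mode."""
    eig = np.linalg.eigvals(K)   # possibly complex conjugate pairs
    G = K.T @ K                  # Gramian matrix
    return {
        "det": abs(np.linalg.det(K)),
        "det_gram": abs(np.linalg.det(G)),
        "min_eig": np.abs(eig).min(),
        "min_eig_real": np.abs(eig.real).min(),
        "spectral_radius": np.abs(eig).max(),
        "spectral_radius_real": np.abs(eig.real).max(),
        "spectral_norm": np.sqrt(np.linalg.eigvalsh(G).max()),
        "weight": np.abs(K).mean(),
    }

# Example on a random 3x3 kernel:
scores = compression_scores(np.random.default_rng(0).normal(size=(3, 3)))
```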
4. Regularization for Pruning
Regularization in deep learning can be used for different purposes. For example, it can be used in combination with a specific activation function to limit the representational capacity of a model, or it can be used just to prevent the weights from getting too large [16]. In this work, we use L1 regularization to zero out most parameters so that only a small fraction of parameters would be doing most of the recognition work. One can think of this as creating a very competitive environment where recognition resources are accumulated in a minority of kernels, and we evaluate the proposed compression modes in terms of their indication quality for such a minority. As explained in Section 3, all compression modes apart from weight are controlled by eigenvalues. L1 regularization is a known technique for making the weights sparse [16] and, naturally, for making the average absolute weight small. In this section, we will prove that it can also be used to make the eigenvalues small.
Let kernel K be an n × n real matrix,

K =
⎛ w11 w12 … w1n ⎞
⎜ w21 w22 … w2n ⎟
⎜  ⋮    ⋮   ⋱  ⋮  ⎟
⎝ wn1 wn2 … wnn ⎠

define the radius rᵢ as the sum of the absolute values of the off-diagonal elements per row,

rᵢ = Σⱼ≠ᵢ |wᵢⱼ|

and let the Gershgorin disk Dᵢ be the closed disk in the complex plane centered at wᵢᵢ with radius rᵢ,

Dᵢ = { x ∈ ℂ : |x − wᵢᵢ| ≤ rᵢ }.

Then, the Gershgorin circle theorem [15] states that every eigenvalue of the kernel K lies within at least one of the Gershgorin disks Dᵢ. As the weights get smaller, the origins of the disks Dᵢ move towards zero. In addition, since the off-diagonal row sums get smaller, the disks shrink. In other words, excessive L1 regularization collapses all Gershgorin disks Dᵢ into points near zero. In conclusion, L1 regularization minimizes (1) the average absolute weights, and (2) the probability of having large eigenvalues. We will examine the subtle difference between the two under the microscope in our analysis sections.
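A quick numerical illustration of this argument (our own sketch, not part of the paper's pipeline): shrinking the weights of a kernel shrinks its Gershgorin disks, and every eigenvalue stays inside their union.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(3, 3))

for scale in (1.0, 0.1, 0.01):  # mimics increasingly strong L1 shrinkage
    M = scale * K
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)  # off-diagonal row sums
    eigs = np.linalg.eigvals(M)
    # every eigenvalue lies in at least one disk |x - w_ii| <= r_i
    inside = [np.any(np.abs(lam - centers) <= radii) for lam in eigs]
    print(scale, all(inside), np.abs(eigs).max())
```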
5. Experimental Setup
Datasets. We used 3 generic object classification datasets: CIFAR-10 [36], CIFAR-100 [36], and tiny-imagenet [40]. The CIFAR-10 and CIFAR-100 datasets are well known benchmarks for generic image classification. tiny-imagenet is a subset of ImageNet [58] originally formed as a class project benchmark, and a convenient next step after CIFAR-10 and CIFAR-100, since it provides a harder classification task (200 object classes) and we do not have enough resources to run our experiments on ImageNet. Image resolutions are 32 × 32 in CIFAR and 64 × 64 in tiny-imagenet. In the analysis sections, we will show that these 3 datasets are quite useful for testing the robustness of a hypothesis. We set the hyperparameters using validation sets that we form by randomly separating part of the training set such that each class is represented equally. Then, we report test results in the analysis part, for which we use the pre-determined test sets of CIFAR-10 and CIFAR-100 and the pre-determined validation set of tiny-imagenet (since we do not have access to the labels of its test set).
Models. We used a well known family of convolutional neural networks: ResNets [26]. In the original paper, ResNets come with 2 different architectures: a thin ResNet variant for CIFAR-10, and a wide ResNet variant for ImageNet. Although the latter is not specifically designed for the datasets we have, it is essential for testing the generalization of pruning hypotheses. Moreover, wide ResNets enable using ImageNet pre-trained weights to test the effect of model initialization on pruning. In this study, we employ ResNets {32, 56, 110} (thin variants) and ResNet50 (wide variant).

L1 penalty. This is the hyperparameter α in Eq. 1. We tried 4 values: {10⁻³, 4 × 10⁻⁴, 10⁻⁴, 10⁻⁵}. Naturally, among these, 10⁻³ results in the highest pruning rate and the lowest validation accuracy. 10⁻⁵ produced slightly better classification performance than 10⁻⁴, but with a significantly lower pruning rate. 4 × 10⁻⁴ was able to reach as high as ≈ 96% pruning rate (for example, using ResNet110 for CIFAR-10 with weight or spectral radius as the compression mode while preserving a reasonable validation accuracy of ≈ 81%), resulting in very sparse models. However, when we set the performance of training without weight regularization as the baseline, the gap between the validation accuracy of 4 × 10⁻⁴ and the baseline grows as the dataset gets more complex (CIFAR-10 → CIFAR-100 → tiny-imagenet). Therefore, in our analyses the L1 penalty is α = 10⁻⁴.

L1 loss term = α Σᵢ |wᵢ|    (1)

Significance threshold. In Section 3, we explain how each compression mode computes the significance score of a kernel. Instead of targeting kernels which have a significance score of exactly 0, we define a threshold and deem the kernels that have scores under this threshold insignificant. Hence, we name this hyperparameter the significance threshold. In Figs 1 and 2, we show the pruning ratio of each compression mode for different significance thresholds to make an informed decision on which values to take as effectively zero. In Fig 1, we see similar curves for thin ResNets, and, as expected, the pruning ratio increases as we go from ResNet32 to ResNet110. Moreover, since the det and det_gram modes represent multiplications of eigenvalues/matrices, their curves go significantly above the others, so it is important to scale their selected thresholds accordingly. Another interesting finding is that the CIFAR-100 task results in the lowest pruning ratio compared to CIFAR-10 and tiny-imagenet. We will analyze the effect of this phenomenon on recognition performance in Section 6. In Fig 2, the model at hand is ResNet50, which is wider than the ones mentioned before, so it has a different curve. With ResNet50, our goal is to examine the effect of initial weights on pruning. Surprisingly, the pruning ratio curve is similar under different initializations, and it is important to notice that imagenet_init is substantially different than the others since it can actually be used to correctly classify most objects in ImageNet. In addition, in both Figs 1 and 2, we see that the compression modes except det and det_gram draw quite similar curves, where the green-ish ones (min_eig based) go above the red-ish ones. This is the expected behavior of the extremums, and it is discussed in detail in Sections 6 and 8. As we want to prune the models as much as possible, we did preliminary experiments on a small targeted interval of significance thresholds (10⁻⁵ → 10⁻³). We found 10⁻⁴ to have a good balance between accuracy and pruning ratio.
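For reference, attaching the L1 penalty of Eq. 1 to a convolutional layer in Keras takes a single argument; a minimal sketch (the layer sizes are arbitrary, only kernel_regularizer matters here):

```python
from tensorflow.keras import layers, regularizers

conv = layers.Conv2D(
    filters=16,
    kernel_size=3,
    padding="same",
    kernel_regularizer=regularizers.l1(1e-4),  # alpha = 10^-4, as in our analyses
)
```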
Figure 1: Pruning ratio vs. threshold chart for thin ResNets (ResNet32/56/110; y-axis: pruning ratio (%)) under multiple random initializations, on CIFAR-10, CIFAR-100 and tiny-imagenet.
In conclusion, in our analysis, we set the significance threshold as follows (a minimal sketch of this thresholding is given at the end of this section):
• det: 10⁻¹² for 3×3 kernels, 10⁻⁴ for 1×1 kernels
• det_gram: 10⁻²⁴ for 3×3 kernels, 10⁻⁸ for 1×1 kernels
• others: 10⁻⁴

Statistical significance. We repeated each run 15+ times, and for the performance comparison this number goes up to 40+ times, simply due to the chronological order of the implementations. Note that each run implies a particular combination of the settings: f_settings(model, dataset, initialization mode). For different compression modes, the training part is shared. Once the training stops, all compression modes branch off. Hence, the experimental results do not have random parameter space noise across different compression modes.

Other hyperparameters. We wanted to keep all hyperparameters as constant as possible to rule out unwanted noise due to hyperparameter variance in controlled experiments, so unless specifically indicated, a given value for a hyperparameter holds for all settings (models, datasets, etc.):
• Optimization algorithm. In this study, we used the Adam optimizer [35] due to its fast convergence on the datasets we use.
• Learning rate. For ResNets {32, 56, 110}, the base learning rate is 10⁻³ whereas it is 10⁻⁴ for ResNet50, and for both families it is divided by 10 at epochs 80, 120 and 160.
• Batch size. We tried 32, 64, and 128, but we did not find a significant or consistent advantage for a particular selection. Therefore, in order to speed up the experiments, we selected 128 as the mini-batch size.
• Epochs. We monitored the validation accuracy of all models for all datasets, and decided to stop the training at a safe point (where the validation accuracy curve becomes flat). Hence, the number of epochs is 200, which is an extremely safe point for some cases where learning stabilizes very early on. In Section 10, we examine pruning through the epochs and try to shed some light onto the internal structure of CNNs during training.
• Initialization. We have 3 initialization modes:
– random_init: Models are initialized using the He normal initialization [25] option in Keras [8].
– static_init: We still use He initialization, but we import the same frozen initial weights for each run (we run the experiments multiple times to check the statistical significance of the results). This
mode provides a control group for the random_init experiments since we do not have variance in the initial conditions here.
– imagenet_init: This mode is only available for the ResNet50 experiments. Nevertheless, since our compression modes are tightly coupled with the properties of the convolution matrices in models, it is important to test the robustness of our propositions in an unfamiliar parameter space where the initial conditions are not random.

Figure 2: Pruning ratio vs. threshold chart for ResNet50 under different initializations (static_init, random_init, imagenet_init), on CIFAR-10, CIFAR-100 and tiny-imagenet.
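The thresholding referenced above reduces to a comparison per kernel. A hypothetical sketch follows (prune_mask and score_fn are our names, not the paper's; the actual implementation lives in the psykedelic repository):

```python
import numpy as np

def prune_mask(weights, score_fn, threshold=1e-4):
    """weights: conv weights of shape (h, w, in_ch, out_ch).
    Returns a boolean (in_ch, out_ch) mask, True = prune that kernel."""
    h, w, cin, cout = weights.shape
    mask = np.zeros((cin, cout), dtype=bool)
    for i in range(cin):
        for j in range(cout):
            mask[i, j] = score_fn(weights[:, :, i, j]) < threshold
    return mask

# e.g., the weight mode on a randomly generated layer:
mask = prune_mask(np.random.default_rng(0).normal(size=(3, 3, 8, 16)) * 1e-3,
                  score_fn=lambda K: np.abs(K).mean())
```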
6. Performance Comparison
Summary of findings:
• Any heuristic that uses the min abs eigenvalue is bad.
• Expert kernel ≠ full-rank matrix.
• Spectral norm is the strongest candidate since it causes almost no drop in classification accuracy.
• Spectral radius and the avg abs weight are the leading heuristics in terms of compression score (Table 4).

In this section, we compare the compression modes in terms of classification accuracy and pruning ratio, which is the number of pruned kernels divided by the total number of kernels. These two measures are inversely correlated. Hence, to provide a relatively unified view of compression performance, we include an additional measure called the compression score c, defined as

c = (acc_pruned / acc_vanilla) ∗ pruning ratio.

c may exceed the pruning ratio if pruning removes disruptive kernels that are causing misclassifications. However, in our experiments, we did not encounter such a scenario. Nevertheless, the higher the c, the better. At the end of a training, we first evaluate the classification accuracy without pruning (acc_vanilla); then each compression mode branches off from that same parameter state to evaluate its performance. In other words, the trained weights are the same for all modes, but the set of pruned kernels changes. If you trim a neural net and do not receive a severe penalty for it, it is a good indicator that the thrown-out parts were not necessary in the first place. However, we do not search for a set of desirable hyperparameters for each compression mode to get acc_pruned/acc_vanilla ≈ 1. Instead, we set up the environment so that at least one compression mode is able to prune the network while achieving performance competitive with the vanilla network (see the "None" category in Table 1).
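In code, the metric is a one-liner; a sketch with made-up numbers for illustration:

```python
def compression_score(acc_pruned, acc_vanilla, pruning_ratio):
    # c = (acc_pruned / acc_vanilla) * pruning ratio
    return (acc_pruned / acc_vanilla) * pruning_ratio

# e.g., pruned accuracy 0.85 vs. vanilla 0.92 at an 80% pruning ratio:
print(compression_score(0.85, 0.92, 0.80))  # ~0.739
```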
Table 1: Classification accuracy comparison of different compression modes (None, min_eig, min_eig_real, det, det_gram, spectral_radius, spectral_radius_real, spectral_norm, weight) for ResNet32/56/110 (static and random initializations) and ResNet50 (static, random and imagenet initializations) on CIFAR-10, CIFAR-100 and tiny-imagenet.
In Table 1, we see very similar performance measurements for the modes based on spectral radius, spectral norm, and the average absolute weight. The worst results are obtained by the modes that check the minimum absolute eigenvalue. Since the determinant combines both the negative and the positive results through the multiplication of eigenvalues, the det and det_gram modes oscillate between the extremums. This finding rules out the hypothesis that an expert kernel must be a full-rank matrix, under the assumption that a value smaller than the selected significance threshold is effectively 0. Because, if the minimum absolute eigenvalue is 0, the determinant is 0. However, since the values are not exactly 0, the spectral radius often protects the determinant from an overkill. We can clearly see the case where this is not enough in the ResNet50 results, where the gap between the extremums is large enough to significantly deteriorate the performance of the determinant based modes. In Table 2, naturally, we observe the opposite, and in Table 3 we get the combined effect of both viewpoints. For ResNets {32, 56, 110}, determinant based heuristics achieve competitive pruning performance. However, when we examine the ResNet50 case (especially the imagenet_init results), it is clear that any compression mode taking the minimum absolute eigenvalue into account is quite unstable. Therefore, we have a compelling case to turn our attention to the spectral radius, the spectral norm, and the average absolute weight. As a side note, according to the experimental results, ignoring the complex parts of the eigenvalues results in different behavior for the min_eig and spectral_radius modes, which will be discussed in detail in Section 7. Moreover, we see differences between the results of the det and det_gram modes. This indicates that there is indeed a relatively small numerical instability in the matrix operations in the given parameter space, and this will be confirmed in Section 8 in detail.
7. Real vs. Complex Eigenvalues
Summary of findings:
• spectral_radius and spectral_radius_real have no major difference in terms of their pruning statistics. Hence, it seems that the real part (scaling factor) of the largest absolute eigenvalue is the key part of spectral_radius as a kernel importance assignment heuristic.
• Since min_eig has already started to prune essential kernels, min_eig_real makes the performance even worse by pruning even more essential kernels.

Here, we are interested in using eigenvalues to understand the inner mechanisms of deep convolutional neural nets. One of the reasons is the geometric meaning of eigenvalues: real parts indicate scaling whereas complex parts indicate rotation. Hence, one might conjecture that the amount of scaling is the key indicator of an expert kernel. It is not easy to answer this question empirically, since the absence of a possible drop in performance when we ignore the complex parts does not guarantee a general rule. On the other hand, if we actually observe a performance loss, we can rule out this hypothesis. Therefore, we have the min_eig_real and spectral_radius_real modes in addition to the vanilla ones: min_eig and spectral_radius. The real-part-only modes are expected to prune more kernels than their vanilla counterparts since √(a² + b²) ≥ |a|, where the complex eigenvalue λ = a + bi.
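The inequality is easy to verify numerically; a small sketch of ours over random 3 × 3 kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    eigs = np.linalg.eigvals(rng.normal(size=(3, 3)))
    # the modulus of an eigenvalue always bounds the magnitude of its
    # real part, so real-part-only scores can only be lower, never higher
    assert np.all(np.abs(eigs) >= np.abs(eigs.real))
```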
Table 2: Compression ratio comparison of different compression modes across models (ResNet32/56/110, ResNet50), datasets (CIFAR-10, CIFAR-100, tiny-imagenet) and initializations.
Table 3: Compression score c comparison of different compression modes across models (ResNet32/56/110, ResNet50), datasets (CIFAR-10, CIFAR-100, tiny-imagenet) and initializations.

Table 4: The average compression score of each heuristic across multiple models, datasets and initializations:
1. weight
2. spectral_radius
3. spectral_radius_real
4. spectral_norm
5. det_gram
6. det
7. min_eig
8. min_eig_real

Moreover, when we only consider the real parts of the eigenvalues, their ordering may change, which means, for instance, that min_eig may target λ1 whereas min_eig_real targets λ2 for the same kernel, where |λ1| < |λ2| but |a1| > |a2|. Naturally, to be able to test the hypothesis mentioned earlier, we need to have a significant amount of complex eigenvalues in our CNNs. In Tables 5 and 6, we show the ratio of complex eigenvalues from three angles: (1) how many complex eigenvalues do we have? (total), (2) how many complex eigenvalues are targeted by the relevant compression modes? (target), and (3) how many pruned kernels are deemed unimportant by looking at complex eigenvalues? (pruned).
Table 5: Amount of total/targeted/pruned complex eigenvalues in ResNet50 for different initializations (random_init, imagenet_init) on CIFAR-10, CIFAR-100 and tiny-imagenet.
For ResNet50, 1×1 kernels dominate the picture. In Table 5, we see that the amount of complex eigenvalues in ResNet50 is quite small compared to thin ResNets, since the 1×1 kernels do not have any complex eigenvalues. Therefore, for the complex vs. real analysis, we decided to focus on thin ResNets. For thin ResNets, the total ratio of complex eigenvalues stays in a consistent range across different models, which is enough to affect the overall classification performance (Table 6). However, the total complex/real ratio may not be quite informative, since different eigenvalue-based compression modes focus on different eigenvalues. For example, if the smallest absolute eigenvalues of all kernels are real and all the largest ones are complex, min_eig and spectral_radius may behave significantly differently in terms of vanilla modes vs. their real-value-based counterparts. Hence, we include a category called target, which indicates the amount of complex values evaluated by each compression mode, whether they lead to pruning or not. According to Table 6, there is indeed a significant difference in the sets of values targeted by the vanilla and the real-value-based modes, and this observation holds for both min_eig and spectral_radius. Thus, when we check the results in Table 3, we see that min_eig_real has a noticeable disadvantage against min_eig. Nevertheless, spectral_radius remains stable whether we choose to focus only on real values or not. This, of course, would hold no importance if we did not prune any kernel due to its complex eigenvalue in the first place. Therefore, we also show the overall contribution of complex eigenvalues to the pruning. The fact that 20%-40% of the pruned kernels are selected by looking at complex values demonstrates that there was indeed enough room for a significant change in pruning performance between the vanilla and real-part-only modes (Section 6). In the end, we see that ignoring the complex parts leads to worse performance for the min_eig mode, whereas it has no significant effect on the performance of the spectral_radius mode. However, since min_eig has already started to prune essential kernels, it is no surprise that min_eig_real prunes even more essential kernels. This is not the case for spectral_radius, so we may suspect that the largest scaling factor is more relevant for kernel importance assignment. Although the generality of this heuristic is not guaranteed, it may be useful for practitioners who want to push the pruning ratio as much as possible without losing too much recognition performance.
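To give a flavor of how the total/target/pruned columns of Tables 5 and 6 can be tallied, here is an illustrative helper (our own bookkeeping, with min_eig as the example mode; threshold is the significance threshold of Section 5):

```python
import numpy as np

def complex_eig_stats(kernels, threshold=1e-4):
    total = n_complex = targeted = pruned = 0
    for K in kernels:
        eigs = np.linalg.eigvals(K)
        total += eigs.size
        n_complex += int(np.sum(eigs.imag != 0))
        lam = eigs[np.argmin(np.abs(eigs))]  # the eigenvalue min_eig looks at
        if lam.imag != 0:
            targeted += 1                    # the mode evaluated a complex value
            if np.abs(lam) < threshold:
                pruned += 1                  # and that value led to pruning
    return n_complex / total, targeted, pruned
```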
8. Set Analysis
In this section, we examine the sets of kernels pruned by each compression mode and their intersections. In our experiments, for all datasets, models, and initializations, there are only 10 distinct sets of kernels, which are pruned by:
1. (min_eig, weight, det, spectral_radius, spectral_norm, det_gram)
2. (min_eig, weight, det, spectral_radius, det_gram)
3. (min_eig, weight, det, det_gram)
4. (min_eig, det, spectral_radius, det_gram)
5. (min_eig, det, det_gram)
6. (min_eig, det_gram)
7. (min_eig, weight)
8. (min_eig, det)
9. min_eig
10. weight

There are 5 exceptional cases: (1, 2) ResNet50 and ResNet56 for CIFAR-100 using static_init, (3) ResNet32 for tiny-imagenet using random_init, and (4) ResNet50 for tiny-imagenet using imagenet_init. These cases also have kernels pruned by (min_eig, det, spectral_radius). In addition, (5) we may also observe kernels pruned only by det_gram due to the numerical instability born out of dealing with very small numbers.

We can draw several conclusions from these sets:
• The kernels pruned by the spectral norm form a subset of all other sets, so it is the safest heuristic among them:
S_min_eig ∩ S_weight ∩ S_det ∩ S_spectral_radius ∩ S_det_gram = S_spectral_norm
• The extremums created by the modes based on the largest and smallest absolute eigenvalues reflect themselves as:
S_min_eig ⊃ S_det ⊃ S_spectral_radius
• The numerical instability mentioned in Section 6 prevents S_det and S_det_gram from being identical. Moreover, due to the existence of (min_eig, det) and (min_eig, det_gram), one is not a subset of the other either:
S_det ≠ S_det_gram, S_det ⊅ S_det_gram, S_det_gram ⊅ S_det
Another outcome of this numerical problem:
S_det_gram ⊅ S_spectral_radius
• We can see the distinction between weight-based pruning and eigenvalue-based pruning:
S_min_eig ⊅ S_weight and S_weight ⊅ {S_min_eig, S_det, S_spectral_radius, S_det_gram}

Additional information, such as the ratio of each set with respect to the union of all sets, and the variance of these ratios across multiple runs, can be found in Appendix A.

Table 6: Amount of total/targeted/pruned complex eigenvalues in thin ResNets (ResNet32/56/110) on CIFAR-10, CIFAR-100 and tiny-imagenet.
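The subset relations above also follow from inequalities between the heuristics (e.g., the spectral norm upper-bounds both the spectral radius and the average absolute weight), so they can be sanity-checked on synthetic kernels; a sketch with our own helper names:

```python
import numpy as np

def pruned_set(kernels, score_fn, threshold=1e-4):
    return {i for i, K in enumerate(kernels) if score_fn(K) < threshold}

modes = {
    "weight": lambda K: np.abs(K).mean(),
    "min_eig": lambda K: np.abs(np.linalg.eigvals(K)).min(),
    "spectral_radius": lambda K: np.abs(np.linalg.eigvals(K)).max(),
    "spectral_norm": lambda K: np.linalg.norm(K, ord=2),
}
rng = np.random.default_rng(0)
kernels = [2e-4 * rng.normal(size=(3, 3)) for _ in range(1000)]
sets = {name: pruned_set(kernels, fn) for name, fn in modes.items()}

# spectral_norm prunes a subset of what every other listed mode prunes
assert all(sets["spectral_norm"] <= s for s in sets.values())
```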
9. Pruning per Layer
Summary of findings:
• The dataset has a footprint on the activity map (the active/total parameter ratio per layer) of a CNN.
• Thin ResNet experiments show that, within the same model family, the activity maps of different models resemble each other.
• ResNet50 experiments show that significantly different initializations affect the activity map.
• For randomly initialized models, the first layers have relatively smaller pruning ratios than the deeper layers.

In this section, we study how pruning affects different layers. In Fig 3, we show the amount of activity (active/total parameter ratio) in different layers of different models of the thin ResNet family. Considering the similarity of the overall activity patterns, we see that the dataset at hand has a footprint on the pruned architecture. In detail, row-wise similarity is relatively low (same model, different dataset) whereas column-wise similarity indicates a correlation between the activity patterns of different models from the same family on the same dataset. Of course, as the model gets smaller (ResNet32), the activity of the first layers gets higher than that of the larger models (ResNet56 and ResNet110). Nevertheless, it is interesting to see aligned spikes in activity, for example the spike at layers ResNet32:25, ResNet56:40, ResNet110:80 on CIFAR-100, since these alignments require the activities to stretch over the layers of models whose depths are significantly different. However, the generalization of this behavior is restricted to a somewhat vague model family definition (thin ResNets), as ResNet50 does not exhibit the same overall activity pattern. It seems that using fundamental building blocks (the thin ResNet block) to generate shallow/deep models leads to interesting activity distributions, which may be used to study the importance/effect of depth in neural nets from a different perspective. Another factor that affects the activity pattern is initialization. We show this effect in Fig 4. CNN training contains different sources of randomness that can affect the results. Hence, we include the static_init results to compare with the random_init results, to show how small these perturbations are in our experiments. Nevertheless, the key observation here is that significantly different initializations (imagenet_init vs. random_init) result in different activity patterns. In addition, imagenet_init is the only case where we observe a drop in the first layers' activities. On the other hand, with the exception of a sudden increase in the last layer activities on tiny-imagenet, we see a general proclivity towards low activity in the last layers, but the exact reason for this phenomenon and its generality to different datasets/models remain an open question.
10. Pruning through Epochs
Summary of findings:
• The first drop in learning rate is found to be a very interesting turning point for pruning:
– The difference between the min and the max abs eigenvalues of the kernels starts to increase. As a result, the heuristics that use the former start to prune more and suffer in classification performance.
– The pruning ratios of all heuristics start to increase until they reach approximately the same amount of pruning. Nevertheless, as the training continues, the min abs eigenvalue based heuristics eventually reach higher pruning ratios.

Here, we decided on the number of epochs (200) by looking at the learning curves of multiple experiments. Our policy was to keep training for a while after the learning curve became nearly flat. However, recently, training history has gained more importance due to interesting observations such as epoch-wise double descent [53]. Hence, at regular epoch intervals, we simulated pruning for each compression mode as if it were the end of the training, to construct a pruning history. Using this history, we ask two questions: (1) What would we see if we stopped the training at an earlier stage? (2) Can we use the compression score to estimate the utilization of a given model's capacity and then, for example, use it to decide whether to stop training?

Figure 3: Active parameter ratio of different compression modes through layers (input → output) for thin ResNets under multiple random initializations.
Figure 4: Active parameter ratio of different compression modes through layers (input → output) for ResNet50 under different initializations.

In Fig 5 and Fig 7, we show the pruning ratio of each compression mode through epochs. As expected (see Sections 6, 7, and 8), min_eig_real always prunes the most (Fig 6 and Fig 8). The interesting observation here is that, for thin ResNets, the other compression modes almost catch up with min_eig in terms of pruning ratio after the 90th epoch, which is right after we divide the learning rate by 10 (80th epoch). This is not the case for ResNet50, where the curves are quite similar to each other and the gap between them is never as big as in the thin ResNet case. However, ResNet50 contains a significant amount of 1×1 kernels (no difference between compression
A Deeper Look into Convolutions
Page 11 of 19 Deeper Look into Convolutions via Pruning R e s N e t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t p a r a m _ c o un t R e s N e t p a r a m _ c o un t Figure 5:
Pruning ratios of different thin ResNets through epochs for different datasets. R e s N e t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t p a r a m _ c o un t R e s N e t p a r a m _ c o un t Figure 6: min_eig_real always achieves the best pruning ratio for different thin ResNets through epochs using different datasets. modes), so it is expected for different modes to draw similarcurves.In Fig 9, we show classification accuracy of each com-pression mode through epochs for different thin ResNets.In comparison to pruning ratio charts, classification perfor-mances vary significantly which results in indistinguishableresults for the first 90 epochs. However, after that point,there is a strong divergence in performance for min_eig andmin_eig_real whereas other modes continue to perform quitesimilarly. This similarity is shown in Fig 10 where the colorof the best mode in the previous simulation point is reflectedby painting the next epoch interval. Note that det anddet_gram can also diverge from the red-ish curve dependingon the eigenvalue distribution, and we observe this prob-ability in Fig 11 as the blue-ish curves also diverge from the red-ish ones in later stages of training. As in the thinResNet case, we see frequent changes in the mode with thebest recognition performance in Fig 12.Compression score (see Section 6) combines the oppos-ing forces in model compression: (1) pruning ratio, (2) clas-sification accuracy. For thin ResNets, this combination man-ifests itself very clearly: in Fig 13, (1) for the first 80 epochscurves are aligned and clearly distinguishable as in Fig 5,then they merge between epochs - , (2) after 90 th epochgreen curves (min_eig and min_eig_real) diverges from therest as in Fig 9. In addition, ResNet50 also follows this pat-tern though with greater perturbation (Fig 15). We again seethree main clusters of lines: red, blue, green. They go rela-tively aligned at first as in Fig 7, then diverge as in Fig 11.Which compression mode is at the top? The ones with the Cugu and Akbas:
A Deeper Look into Convolutions
Page 12 of 19 Deeper Look into Convolutions via Pruning r a n d o m _ i n i t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t p a r a m _ c o un t Figure 7:
Pruning ratios of ResNet50 through epochs for different datasets under different initializations. r a n d o m _ i n i t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t p a r a m _ c o un t Figure 8: min_eig_real always achieves the best pruning ratio for ResNet50 through epochs under different initializations. highest compression score at the different stages of trainingare shown in Fig 14 and Fig 16. According to these results,winners change frequently and do not follow a general pat-tern. Hence, we evaluate winner based comparison not as informative as the clusters-of-modes analysis.In overall, we found an interesting point of behavioralchange in pruning, epoch . After the first drop in learn-ing rate, we identified an interesting shift in the eigen- R e s N e t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t a cc u r a c y R e s N e t a cc u r a c y Figure 9:
Classification accuracy of different thin ResNets through epochs for different datasets.Cugu and Akbas:
A Deeper Look into Convolutions
Page 13 of 19 Deeper Look into Convolutions via Pruning R e s N e t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t a cc u r a c y R e s N e t a cc u r a c y Figure 10:
Compression modes with the best classification accuracy through epochs for thin ResNets. r a n d o m _ i n i t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t a cc u r a c y Figure 11:
Classification accuracy of ResNet50 through epochs for different datasets under different initializations. value distribution resulting in (1) convergence of differ-ent modes’ compression scores, then (2) divergence ofminimum absolute eigenvalue based modes . Why is thisimportant? Nakkiran et al. [53] note that if one has a large enough model they can observe an epoch-wise double de-scent where the training is split into two halves with differ-ent behaviors. This two stage learning model also resem-bles the information bottleneck principle [61] and the critical r a n d o m _ i n i t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t a cc u r a c y Figure 12:
Compression modes with the best classification accuracy through epochs for ResNet50 under different initializations.Cugu and Akbas:
A Deeper Look into Convolutions
Page 14 of 19 Deeper Look into Convolutions via Pruning R e s N e t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t s c o r e R e s N e t s c o r e Figure 13:
Compression score of different thin ResNets through epochs for different datasets. R e s N e t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t s c o r e R e s N e t s c o r e Figure 14:
Compression modes with the best compression score through epochs for thin ResNets. r a n d o m _ i n i t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t s c o r e Figure 15:
Compression score of ResNet50 through epochs for different datasets under different initializations.Cugu and Akbas:
A Deeper Look into Convolutions
Page 15 of 19 Deeper Look into Convolutions via Pruning r a n d o m _ i n i t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t s c o r e Figure 16:
Compression modes with the best compression score through epochs for ResNet50 under different initializations. learning period [2]. It seems that, in deep learning litera-ture, the two phased learning is observed again and again indifferent concepts. Although, we cannot explain the behav-ior of the eigenvalues in our experiments, it is important toreport this incident since each new observation of the twostage learning model decreases the odds of experiencing amere coincidence. Hence, we leave extensive study on therelationship between eigenvalues and the learning rate thatresulted in the two phases shown in Figs 13 and 15 as a futurework.
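To make the pruning-history procedure concrete, the following is a minimal sketch of how such a history can be collected at epoch checkpoints; the helper names, the threshold value, and the checkpoint format are illustrative assumptions, not the exact implementation in our codebase.

```python
import numpy as np

def min_eig(k):          # smallest absolute eigenvalue of a kernel
    return np.abs(np.linalg.eigvals(k)).min()

def min_eig_real(k):     # smallest absolute real part of the eigenvalues
    return np.abs(np.linalg.eigvals(k).real).min()

def spectral_radius(k):  # largest absolute eigenvalue
    return np.abs(np.linalg.eigvals(k)).max()

def weight(k):           # classical average absolute weight
    return np.abs(k).mean()

MODES = {"min_eig": min_eig, "min_eig_real": min_eig_real,
         "spectral_radius": spectral_radius, "weight": weight}

def pruning_ratio(kernels, score_fn, threshold=1e-4):
    """Fraction of kernels whose score falls below the pruning threshold."""
    scores = np.array([score_fn(k) for k in kernels])
    return float((scores < threshold).mean())

def pruning_history(checkpoints):
    """checkpoints: one entry per simulation epoch, each a list of the
    small n x n kernel matrices extracted from the network."""
    history = {mode: [] for mode in MODES}
    for kernels in checkpoints:
        for mode, fn in MODES.items():
            history[mode].append(pruning_ratio(kernels, fn))
    return history

# Example with random 3x3 kernels standing in for two saved epochs:
rng = np.random.default_rng(0)
demo = [[rng.normal(size=(3, 3)) * 1e-3 for _ in range(64)] for _ in range(2)]
print(pruning_history(demo))
```

Plotting each entry of the returned history over epochs would yield per-mode curves analogous in form to those in Figs 5 and 7.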
11. Discussion
We can also find expert kernels in the human brain. It is well known that the adult brain contains special modules that activate for specific functions such as face recognition [34]. What is more interesting for us is that these experts emerge from a natural pruning phase [9]: "In normal development, the pathway for each eye is sculpted (or "pruned") down to the right number of connections, and those connections are sculpted in other ways, for example, to allow one to see patterns. By overproducing synapses then selecting the right connections, the brain develops an organized wiring diagram that functions optimally." A clear example is given by Haier et al. [21, 20], where it is shown that the high brain activity observed while learning to play Tetris is replaced by low brain activity once the subjects reach a certain level of expertise in the game. The authors of these papers state that the correlation between improvement on the task and decreasing brain glucose use suggests that those who honed their cognitive strategy to the fewest circuits improved the most. In fact, these studies are part of a wider picture known as the neural efficiency hypothesis of intelligence [54, 56]. Hence, we think pruning is imperative to understand the learning dynamics of convolutional neural nets.

In addition, while trying to set the optimal L2 regularization penalty, the 𝛼 = 4 × 10⁻⁴ and 𝛼 = 10⁻⁵ experiments revealed an interesting phenomenon. In general, model compression consists of three phases: (1) training with regularization, (2) pruning, (3) fine-tuning without (excessive) regularization. Although we do not use the fine-tuning phase in our analysis, we nonetheless ran fine-tuning trials in our preliminary experiments and observed opposite effects for the two penalties: the excessive pruning induced by 𝛼 = 4 × 10⁻⁴ damages the model capacity in such a way that fine-tuning actually decreases the performance of the model, whereas the 𝛼 = 10⁻⁵ selection can still benefit from fine-tuning. Nevertheless, this observation requires a detailed study focused entirely on pruning vs. model capacity, so we leave this as future work.
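For clarity, the three-phase recipe can be sketched in Keras as follows; the toy architecture, the per-filter pruning rule, and all hyperparameters other than the two 𝛼 values are placeholder assumptions for illustration.

```python
import numpy as np
from tensorflow import keras

def build(alpha):
    reg = keras.regularizers.l2(alpha)
    model = keras.Sequential([
        keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                            kernel_regularizer=reg, input_shape=(32, 32, 3)),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Phase 1: train with a comparatively strong L2 penalty.
model = build(alpha=4e-4)
# model.fit(x_train, y_train, epochs=120)

# Phase 2: prune, here with the classical per-filter average-absolute-weight
# rule and a placeholder threshold.
kernel, bias = model.layers[0].get_weights()
mask = np.abs(kernel).mean(axis=(0, 1, 2), keepdims=True) > 1e-4
model.layers[0].set_weights([kernel * mask, bias])

# Phase 3: optionally fine-tune with little or no regularization; in our
# preliminary trials this helped after alpha = 1e-5 training but hurt after
# the aggressive pruning induced by alpha = 4e-4.
# model.fit(x_train, y_train, epochs=20)
```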
12. Conclusions and Future Work
In this work, we study matrix characteristics for pruning in an attempt to answer the question: what is an expert kernel? In this pursuit, we found that (1) the spectral norm is a more robust heuristic (in the sense that it better preserves generalization accuracy) than the average absolute weight for pruning, (2) one can use the spectral radius of a kernel as a useful heuristic for pruning and push the compression ratio further by discarding the complex parts of the eigenvalues while preserving the recognition performance, (3) a kernel can be essential whether it is full-rank or not, (4) theoretically sound pruning heuristics may not hold in practice due to numerical instabilities born out of dealing with very small numbers, (5) datasets leave a significant footprint in the activity map (the active parameter distribution across layers after pruning) of a model family under homogeneous L2 regularization (each layer has the same regularization penalty), (6) the gap between the smallest and the largest eigenvalue of a kernel starts to increase after the first drop in the learning rate, which results in a divergence of the compression scores achieved by the minimum and maximum absolute eigenvalue based pruning heuristics. With these in mind, we leave an extensive study of neural activity through layers, the two stages of the eigenvalue distributions, and their connection to the learning rate for future research.
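As a starting point for the future work suggested by finding (6), the eigenvalue gap in question can be monitored in a few lines; load_kernels below is a hypothetical loader that extracts the n × n kernel matrices from a saved checkpoint.

```python
import numpy as np

def eig_gap(kernel):
    """Gap between the largest and smallest absolute eigenvalue."""
    mags = np.abs(np.linalg.eigvals(kernel))
    return float(mags.max() - mags.min())

def mean_gap_per_checkpoint(checkpoint_paths, load_kernels):
    """load_kernels(path) -> list of n x n kernel matrices (hypothetical)."""
    return [float(np.mean([eig_gap(k) for k in load_kernels(path)]))
            for path in checkpoint_paths]
```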
References

[1] Abel, N.H., 1826. Beweis der Unmöglichkeit, algebraische Gleichungen von höheren Graden als dem vierten allgemein aufzulösen. Journal für die reine und angewandte Mathematik 1, 65–84.
[2] Achille, A., Rovere, M., Soatto, S., 2018. Critical learning periods in deep networks, in: International Conference on Learning Representations.
[3] Arora, S., Ge, R., Neyshabur, B., Zhang, Y., 2018. Stronger generalization bounds for deep nets via a compression approach, in: International Conference on Machine Learning, pp. 254–263.
[4] Baykal, C., Liebenwein, L., Gilitschenski, I., Feldman, D., Rus, D., 2018. Data-dependent coresets for compressing neural networks with applications to generalization bounds, in: International Conference on Learning Representations.
[5] Behrmann, J., Grathwohl, W., Chen, R.T., Duvenaud, D., Jacobsen, J.H., 2019. Invertible residual networks, in: International Conference on Machine Learning, pp. 573–582.
[6] Belkin, M., Hsu, D., Ma, S., Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116, 15849–15854.
[7] Blakeney, C., Yan, Y., Zong, Z., 2020. Is pruning compression?: Investigating pruning via network layer similarity, in: The IEEE Winter Conference on Applications of Computer Vision, pp. 914–922.
[8] Chollet, F., 2015. Keras.
[9] Council, N.R., et al., 2000. How people learn: Brain, mind, experience, and school: Expanded edition. National Academies Press.
[10] Dong, X., Chen, S., Pan, S., 2017. Learning to prune deep neural networks via layer-wise optimal brain surgeon, in: Advances in Neural Information Processing Systems, pp. 4857–4867.
[11] Dubey, A., Chatterjee, M., Ahuja, N., 2018. Coreset-based neural network compression, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 454–470.
[12] Frankle, J., Carbin, M., 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks, in: International Conference on Learning Representations.
[13] Fukushima, K., Miyake, S., 1982. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, in: Competition and Cooperation in Neural Nets. Springer, pp. 267–285.
[14] Galois, É., Neumann, P.M., 2011. The mathematical writings of Évariste Galois. volume 6. European Mathematical Society.
[15] Gerschgorin, S., 1931. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika 7, 749–754.
[16] Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning.
[17] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
[18] Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E., 2018. MorphNet: Fast & simple resource-constrained structure learning of deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
[19] Guo, Y., Yao, A., Chen, Y., 2016. Dynamic network surgery for efficient DNNs, in: Advances in Neural Information Processing Systems, pp. 1379–1387.
[20] Haier, R.J., Siegel, B., Tang, C., Abel, L., Buchsbaum, M.S., 1992a. Intelligence and changes in regional cerebral glucose metabolic rate following learning. Intelligence 16, 415–426.
[21] Haier, R.J., Siegel Jr, B.V., MacLachlan, A., Soderling, E., Lottenberg, S., Buchsbaum, M.S., 1992b. Regional glucose metabolic changes after learning a complex visuospatial/motor task: a positron emission tomographic study. Brain Research 570, 134–143.
[22] Han, S., Mao, H., Dally, W.J., 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
[23] Han, S., Pool, J., Tran, J., Dally, W., 2015. Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, pp. 1135–1143.
[24] Hassibi, B., Stork, D.G., 1993. Second order derivatives for network pruning: Optimal brain surgeon, in: Advances in Neural Information Processing Systems, pp. 164–171.
[25] He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[26] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[27] He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y., 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
[28] He, Y., Zhang, X., Sun, J., 2017. Channel pruning for accelerating very deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
[29] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[30] Huang, Z., Wang, N., 2018. Data-driven sparse structure selection for deep neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320.
[31] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
[32] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, pp. 448–456.
[33] Junjie, L., Zhe, X., Runbin, S., Cheung, R.C., So, H.K., 2020. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers, in: International Conference on Learning Representations.
[34] Kanwisher, N., McDermott, J., Chun, M.M., 1997. The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of Neuroscience 17, 4302–4311.
[35] Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[36] Krizhevsky, A., et al., 2009. Learning multiple layers of features from tiny images.
[37] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541–551.
[38] LeCun, Y., Denker, J.S., Solla, S.A., 1990. Optimal brain damage, in: Advances in Neural Information Processing Systems, pp. 598–605.
[39] Li, C., Farkhoor, H., Liu, R., Yosinski, J., 2018. Measuring the intrinsic dimension of objective landscapes, in: International Conference on Learning Representations.
[40] Li, F.F., Karpathy, A., Johnson, J., 2017a. Tiny ImageNet visual recognition challenge.
[41] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P., 2017b. Pruning filters for efficient convnets.
[42] Li, Y., Gu, S., Gool, L.V., Timofte, R., 2019. Learning filter basis for convolutional neural network compression, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5623–5632.
[43] Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., Doermann, D., 2019. Towards optimal structured CNN pruning via generative adversarial learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799.
[44] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C., 2017. Learning efficient convolutional networks through network slimming, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
[45] Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T., 2018. Rethinking the value of network pruning, in: International Conference on Learning Representations.
[46] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[47] Luo, J.H., Wu, J., Lin, W., 2017. ThiNet: A filter level pruning method for deep neural network compression, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
[48] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y., 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
[49] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J., 2017. Pruning convolutional neural networks for resource efficient inference.
[50] Mozer, M.C., Smolensky, P., 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment, in: Advances in Neural Information Processing Systems, pp. 107–115.
[51] Mpitsos, G.J., Burton Jr, R.M., 1992. Convergence and divergence in neural networks: Processing of chaos and biological analogy. Neural Networks 5, 605–625.
[52] Mussay, B., Osadchy, M., Braverman, V., Zhou, S., Feldman, D., 2019. Data-independent neural pruning via coresets, in: International Conference on Learning Representations.
[53] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I., 2019. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292.
[54] Neubauer, A.C., Fink, A., 2009. Intelligence and neural efficiency. Neuroscience & Biobehavioral Reviews 33, 1004–1023.
[55] Nowlan, S.J., Hinton, G.E., 1992. Simplifying neural networks by soft weight-sharing. Neural Computation 4, 473–493.
[56] Nussbaumer, D., Grabner, R.H., Stern, E., 2015. Neural efficiency in working memory tasks: The impact of task demand. Intelligence 50, 196–208.
[57] Ramakrishnan, R.K., Sari, E., Nia, V.P., 2020. Differentiable mask for pruning convolutional and recurrent networks, in: 2020 17th Conference on Computer and Robot Vision (CRV), IEEE, pp. 222–229.
[58] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
[59] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[60] Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O., 2020. Green AI. Communications of the ACM 63, 54–63.
[61] Shwartz-Ziv, R., Tishby, N., 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[62] Sietsma, J., 1988. Neural net pruning: why and how, in: Proceedings of International Conference on Neural Networks, San Diego, CA, 1988, pp. 325–333.
[63] Son, S., Nah, S., Mu Lee, K., 2018. Clustering convolutional kernels to compress deep neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 216–232.
[64] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[65] Ullrich, K., Meeds, E., Welling, M., 2017. Soft weight-sharing for neural network compression.
[66] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H., 2016. Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, pp. 2074–2082.
[67] Yang, T.J., Chen, Y.H., Sze, V., 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695.
[68] Ye, J., Lu, X., Lin, Z., Wang, J.Z., 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers, in: International Conference on Learning Representations.
[69] Ye, S., Xu, K., Liu, S., Cheng, H., Lambrechts, J.H., Zhang, H., Zhou, A., Ma, K., Wang, Y., Lin, X., 2019. Adversarial robustness vs. model compression, or both?, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 111–120.
[70] Yu, R., Li, A., Chen, C.F., Lai, J.H., Morariu, V.I., Han, X., Gao, M., Lin, C.Y., Davis, L.S., 2018. NISP: Pruning networks using neuron importance score propagation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
[71] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2016. Understanding deep learning requires rethinking generalization.
[72] Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., Tian, Q., 2019. Variational convolutional neural network pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.
A. Appendix
Here we present the detailed results of our set analysis (see Section 8). In each bar chart below, the x-axis enumerates the subsets of compression modes that agree on pruning a given group of kernels (a: min_eig, b: weight, c: det, d: spectral_radius, e: spectral_norm, f: det_corr), and the y-axis reports the pruned/total parameter ratio (%).

Figure 17: Set analysis bar chart for ResNet32 on CIFAR-10 with static_init.
Figure 18: Set analysis bar chart for ResNet32 on CIFAR-100 with static_init.
Figure 19: Set analysis bar chart for ResNet32 on tiny-imagenet with static_init.
Figure 20: Set analysis bar chart for ResNet56 on CIFAR-10 with static_init.
Figure 21: Set analysis bar chart for ResNet56 on CIFAR-100 with static_init.
Figure 22: Set analysis bar chart for ResNet56 on tiny-imagenet with static_init.
Figure 23: Set analysis bar chart for ResNet110 on CIFAR-10 with static_init.
Figure 24: Set analysis bar chart for ResNet110 on CIFAR-100 with static_init.
Figure 25: Set analysis bar chart for ResNet110 on tiny-imagenet with static_init.
Figure 26: Set analysis bar chart for ResNet32 on CIFAR-10 with random_init.
Figure 27: Set analysis bar chart for ResNet32 on CIFAR-100 with random_init.
Figure 28: Set analysis bar chart for ResNet32 on tiny-imagenet with random_init.
Figure 29: Set analysis bar chart for ResNet56 on CIFAR-10 with random_init.
Figure 30: Set analysis bar chart for ResNet56 on CIFAR-100 with random_init.
Figure 31: Set analysis bar chart for ResNet56 on tiny-imagenet with random_init.
Figure 32: Set analysis bar chart for ResNet110 on CIFAR-10 with random_init.
Figure 33: Set analysis bar chart for ResNet110 on CIFAR-100 with random_init.
Figure 34: Set analysis bar chart for ResNet110 on tiny-imagenet with random_init.
Figure 35: Set analysis bar chart for ResNet50 on CIFAR-10 with static_init.
Figure 36: Set analysis bar chart for ResNet50 on CIFAR-100 with static_init.
Figure 37: Set analysis bar chart for ResNet50 on tiny-imagenet with static_init.
Figure 38: Set analysis bar chart for ResNet50 on CIFAR-10 with random_init.
Figure 39: Set analysis bar chart for ResNet50 on CIFAR-100 with random_init.
Figure 40: Set analysis bar chart for ResNet50 on tiny-imagenet with random_init.
Figure 41: Set analysis bar chart for ResNet50 on CIFAR-10 with imagenet_init.
Figure 42: Set analysis bar chart for ResNet50 on CIFAR-100 with imagenet_init.
Figure 43: Set analysis bar chart for ResNet50 on tiny-imagenet with imagenet_init.
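For reference, the subset categories on the x-axes of these charts can be derived from per-mode sets of pruned kernel indices as in the following sketch; the sets shown are toy values for illustration only.

```python
pruned = {  # mode -> indices of the kernels it prunes (toy values)
    "a": {0, 1, 2, 5}, "b": {0, 3}, "c": {0, 1, 5},
    "d": {0, 5}, "e": {0}, "f": {0, 1, 5},
}

def membership(idx):
    """The exact subset of modes that prune kernel idx."""
    return frozenset(m for m, s in pruned.items() if idx in s)

all_kernels = set().union(*pruned.values()) | {4}  # kernel 4: pruned by none
bins = {}
for idx in sorted(all_kernels):
    bins.setdefault(membership(idx), set()).add(idx)

# Each bin corresponds to one x-axis category such as {a,c,f} or {}.
for modes, kernels in sorted(bins.items(), key=lambda kv: -len(kv[0])):
    print("{" + ",".join(sorted(modes)) + "}", "->", sorted(kernels))
```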