A Deeper Look into Convolutions via Pruning
Ilke Cugu*, Emre Akbas
Department of Computer Engineering, Middle East Technical University, 06800, Ankara, Turkey
*Corresponding author: [email protected] (I. Cugu)
Keywords: pruning, model compression, convolutional neural networks, deep learning, expert kernels, eigenvalues
Abstract
Convolutional neural networks (CNNs) are able to attain better visual recognition performance than fully connected neural networks despite having far fewer parameters, owing to their parameter sharing principle. Hence, modern architectures are designed to contain only a small number of fully-connected layers, often at the end, after multiple layers of convolutions. It is interesting to observe that we can replace large fully-connected layers with relatively small groups of tiny matrices applied on the entire image. Moreover, although this strategy already reduces the number of parameters, most of the convolutions can be eliminated as well, without any loss in recognition performance. However, there is no solid recipe for detecting the hidden subset of convolutional neurons that is responsible for the majority of the recognition work. Hence, in this work, we use matrix characteristics based on eigenvalues, in addition to the classical weight-based importance assignment approach for pruning, to shed light on the internal mechanisms of a widely used family of CNNs, namely residual neural networks (ResNets), for the image classification problem using the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets. The code is available at https://github.com/cuguilke/psykedelic.
1. Introduction
Modern deep neural networks are able to attain impressive performance on computer vision problems by using millions of parameters scattered across multiple layers in the form of convolutions. However, as we show in this study, one can often eliminate most of these parameters without any negative effect on the generalization error. A classical method of doing so is to prune the parameters that are effectively zero. However, in the case of pruning groups of parameters, such as the channels of a convolutional neuron, it is not clear how to assess their importance. In the literature, the de facto standard of assessing neuron importance is to check the average absolute value of its weights, which is an acceptable proposition for fully connected neural nets [23, 22, 66, 19, 41, 64, 67, 63, 69, 7]. However, for visual recognition, convolutional neural nets (CNNs) [13, 37] have already surpassed fully connected nets in terms of both accuracy and efficiency. CNNs reduced the number of parameters dramatically due to their parameter sharing principle and local connectivity. Although the general tendency is still to keep several fully connected (FC) layers right before the output layers [26, 64, 29], it has been shown that one can get rid of the FC layers entirely while preserving competitive performance [46, 31, 59]. Today, modern visual recognition architectures mostly consist of convolutions. Hence, we can think of convolutions as a very effective model compression method in and of itself. However, the story does not end there, since not all kernels are equally important for recognition. In the following, we refer to each n × n channel of a convolutional neuron as a kernel. In fact, as we also show here, depending on the difficulty of a problem, it is possible that only a small portion of the
kernels are actually vital for visual recognition [39]. When the absence of a kernel results in a significant rise in the generalization error, we call that kernel an expert kernel, and the emergence of expert kernels is often enforced via regularization, which is also used to limit the capacity for memorization [71]. However, several studies on the generalization properties of deep learning state that the convolutional case is more difficult to analyze [39, 3], so the theoretical advances mostly stay in the fully connected realm. Therefore, in this work, we study convolutions and ask the question: "How can we characterize or identify expert kernels?". To this end, we focus on the fact that kernels are basically tiny matrices applied on the entire image. Thus, we employ matrix characteristics as heuristics to assess the significance of a given kernel. In taxonomy, this motivation puts our work under the category of unstructured, heuristic-based, kernel-level pruning in the model compression literature. Our approach is to eliminate as many kernels as possible using a set of shared hyperparameters that enable the best heuristics to maintain the recognition performance of the unpruned version of a given model without the need for a fine-tuning phase after pruning. Then, the candidate heuristics are used to provide more insight into the expert kernels in convolutional neural nets. Discovering the correct recipe for recognizing expert kernels may influence both theory and practice, or, at least, it may validate our current assumptions. On the theory side, there are exciting empirical observations on overparameterization, such as the lottery ticket hypothesis [12] and the double descent in generalization error [6, 53]. These studies try to explain deep neural nets' ability to avoid overfitting despite having many more parameters than the number of available samples, and we argue that the definition of expert kernels can shed new light on the discussion of overparameterization in deep learning. In practice, the elimination of non-expert kernels may produce very sparse models. As we show in this study, the sparsity structure of
the pruned model heavily depends on the problem definition (i.e., the dataset); hence, in some cases, it is possible to remove large groups of neurons, resulting in a direct drop in the computational cost. Moreover, such models, when combined with the advances in high performance sparse matrix computations, can yield families of energy efficient neural nets [60].
In this paper, we present, to the best of our knowledge, the first extensive analysis of eigenvalue based heuristics for marking the expert kernels, the distribution of expert kernels through layers, and the change in their pruning behavior during the different phases of training. Our findings include: (1) identification of the spectral norm as a more robust heuristic (in the sense that it better preserves generalization accuracy) than the widely-used average absolute weight of a kernel, (2) the feasibility of ignoring the complex parts of the eigenvalues to push the pruning ratio further without a significant loss in recognition performance, (3) the dataset's footprint on the active parameter distribution across layers after pruning, (4) the observation of a shift in eigenvalues after the first drop in learning rate, causing some heuristics to fail.
This paper is organized as follows: we start by reviewing the relevant work in the literature in Section 2. Next, we introduce our candidate heuristics for assessing kernel quality in Section 3, the relevant regularization in Section 4, and our experimental setup in Section 5. Then, our analysis starts: (1) the standard performance comparison in Section 6; (2) since most of our heuristics are based on eigenvalues, we discuss real vs. complex eigenvalues in Section 7; (3) the which-heuristic-prunes-which-kernels analysis in Section 8; (4) the layer-wise pruning analysis in Section 9; and (5) the epoch-wise pruning analysis in Section 10.
2. Related Work
There are several ways of doing model compression in deep learning, such as quantization, binarization, low-rank approximation, knowledge distillation, architectural design, and pruning. In this work, we are only interested in pruning, since it is the only category that directly targets the inessential parts of deep neural nets, whereas the closest family of compression techniques in this regard focuses on approximating the overall computation (low-rank approximation). In the literature, some of the early examples of pruning research are done on saliency based methods [50, 38, 51, 24], and the interest continues as new, deeper architectures are being introduced [49, 10, 70]. In this line of work, the importance of a unit (weight/neuron/group) is determined by its direct effect on the training error. Although the required computation is immense, and hence the emphasis is on the development of good approximators, the reasoning behind this approach is solid and straightforward. Another line of work does pruning based on unit outputs. To our knowledge, the first study of this kind [62] looks for invariants. These invariants are defined as units whose outputs do not change across different input patterns. In other words, the units that act like bias terms are removed to save space. This is not a widely used technique anymore, perhaps because the large datasets of today make it hard for units to be invariant. Today, the focus is on (1) minimizing the reconstruction error of output feature maps per layer [28, 47], and (2) clustering units w.r.t. their activations [11, 4, 52]. We can also count the latter studies under another category, namely clustering-based pruning [55, 65, 63, 42]. A relatively similar study [27] uses the geometric median to set representative/centroid filters and prunes the ones that are close to the geometric median. The next set of studies [44, 68, 18, 72] leverage scaling factors such as the γ of batch norm [32], or introduce their own for different levels of analysis (neuron/group/layer) [30], to cut off the outputs of some units. The advantage of this method is that it removes the input dependency of activation based importance assignment. In other words, if the scaling factor is 0, it is guaranteed that the corresponding unit has no contribution to recognition. We can include mask based pruning methods [33, 57] under this category as well, since the basic principle is the same. Although quite useful in practice, this approach does not tell us anything about the nature of a "good unit" for recognition. The decision is left entirely to the optimization by simply adding a regularization term for the scaling factors. Similarly, a recent work [43] also leaves the importance assignment to the learning itself by employing a minimax game as in generative adversarial networks [17]. One way to study this black box is to introduce heuristics and test their validity. A well-known heuristic is the average of the absolute values of the weights of a unit [23, 22, 66, 19]. Hence, we take it as the baseline for our matrix characteristics based heuristics (see Section 3). There are also analysis papers related to ours, since the motivation is the same: to understand deep convolutional neural nets via pruning. Liu et al.'s work [45] presents an extensive empirical study of pruning algorithms to see if there is indeed an advantage of pruning over training the target architecture from scratch. Their methodology involves a fine-tuning phase after pruning to recover the loss in performance.
In that setting, it is shown that structured pruning can be omitted, and, instead, one can directly train the target architecture from scratch and achieve competitive performance. However, it seems this is not the case for unstructured pruning. In our study, we also employ an unstructured pruning scheme, but we do not have a fine-tuning phase, since our aim is to prune as much as possible while preserving the vanilla performance. On another note, pruning is argued to be important since one cannot achieve the same adversarial robustness or performance by training from scratch [69]. In addition, a recent study [7] confirms the observations of Liu et al. [45] and studies the differences between the representations of pruned nets and vanilla nets.
3. Expert Kernel Heuristics
Convolutional layers are quite interesting since they can express more representational power than their fully-connected counterparts despite having much fewer parameters. However, in the pruning literature, convolutional layers are often treated as if they were fully connected layers. A common approach is to compute the average absolute weight of a given kernel (in this work, a kernel refers to a single channel of a neuron in a convolutional layer). However, although this technique works well in practice, it is not clear whether the average weight is the real indicator of a kernel's redundancy. Hence, we propose several heuristics that we refer to as compression modes throughout the paper. Our compression modes, for a given n × n real kernel K, are:

• det: abs determinant = |det(K)|
If we take values smaller than 10⁻¹² as effectively 0, then this mode prunes kernels that are not full-rank. Therefore, its corresponding hypothesis is that an expert kernel is a full-rank matrix.

• det_gram: abs determinant of the Gramian matrix = |det(G)|, where G = KᵀK. Since we employ weight regularization (see Section 4), kernels tend to have very small weights even if they are not pruned. Combined with a threshold value of 10⁻²⁴, this mode checks the possible numerical instability caused by weight regularization. Formally,
det(G) = det(KᵀK) = det(Kᵀ) det(K) = det(K)²,
which is basically the same criterion as det. However, in Section 8, we will see that floating point constraints prevent the two compression modes from being identical.

• min_eig: min abs eigenvalue = min { |λᵢ(K)| : i ∈ {1, …, n} }
This mode basically tests the hypothesis of the det mode, since the smallest absolute eigenvalue is enough to check whether a matrix is full-rank. det alone is not enough for that, since it is the product of all eigenvalues, in which case the largest absolute eigenvalue may prevent an overkill.

• min_eig_real: min abs eigenvalue (real parts only) = min { |aᵢ(K)| : i ∈ {1, …, n} }, where λ = a + bi. As explained in Section 5, we are generally dealing with 3 × 3 kernels, and these matrices can have complex conjugate eigenvalues. Hence, this mode enables us to measure the relevance of the imaginary part of such eigenvalues.

• spectral_radius: max abs eigenvalue = max { |λᵢ(K)| : i ∈ {1, …, n} }
This mode measures the largest scaling of the eigenvectors of K while being a safe comparison factor for a possible overkill due to min_eig.

• spectral_radius_real: max abs eigenvalue (real parts only) = max { |aᵢ(K)| : i ∈ {1, …, n} }, where λ = a + bi. It has the same reasoning as min_eig_real.

• spectral_norm: max singular value = √(λ_max(G)), where G = Kᴴ K = KᵀK since the kernel K is a real matrix. This mode is quite interesting since it is the key component of residual network invertibility [5] and stable discriminator training in GANs [48]. Hence, naturally, we also want to test the applicability of the spectral norm as an importance indicator for pruning convolutional kernels. Notice that G is an n × n real symmetric matrix, so λᵢ(G) ∈ ℝ and λᵢ(G) ≥ 0 for all i ∈ {1, …, n}. Thus, there is no spectral_norm_real mode.

• weight: avg of abs weights = (1/n²) Σᵢ Σⱼ |wᵢⱼ|, where wᵢⱼ ∈ K. This is the control group since it is the de facto standard of pruning in practice.

Note that since the exact calculation of the eigenvalues of a matrix larger than 4 × 4 is, in general, not possible according to the insolvability theory of Abel [1] and Galois [14], some of the compression modes are only applicable to a restricted set of convolutional architectures.
On the other hand, most well-known models in the literature make extensive use of 3 × 3 kernels, and we employed a model family satisfying this constraint, as reported in Section 5.
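To make these definitions concrete, the following is a minimal sketch (our own illustration, not taken from the psykedelic repository) that scores a single kernel under each compression mode using NumPy; the keys mirror the mode names above.

```python
import numpy as np

def compression_scores(K):
    """Score one n x n real kernel K under each compression mode."""
    eig = np.linalg.eigvals(K)   # possibly complex conjugate pairs
    G = K.T @ K                  # Gramian matrix
    return {
        "det": abs(np.linalg.det(K)),
        "det_gram": abs(np.linalg.det(G)),
        "min_eig": np.abs(eig).min(),
        "min_eig_real": np.abs(eig.real).min(),
        "spectral_radius": np.abs(eig).max(),
        "spectral_radius_real": np.abs(eig.real).max(),
        "spectral_norm": np.sqrt(np.linalg.eigvalsh(G).max()),
        "weight": np.abs(K).mean(),
    }

# Example on a random 3x3 kernel:
scores = compression_scores(np.random.default_rng(0).normal(size=(3, 3)))
```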
4. Regularization for Pruning
Regularization in deep learning can be used for different purposes. For example, it can be used in combination with a specific activation function to limit the representational capacity of a model, or it can be used just to prevent the weights from getting too large [16]. In this work, we use L1 regularization to zero out most parameters so that only a small fraction of parameters would be doing most of the recognition work. One can think of this as creating a very competitive environment where recognition resources are accumulated in a minority of kernels, and we evaluate the proposed compression modes in terms of their indication quality for such a minority. As explained in Section 3, all compression modes apart from weight are controlled by eigenvalues. L1 regularization is a known technique for making the weights sparse [16] and, naturally, for making the average absolute weight small. In this section, we will prove that it can also be used to make the eigenvalues small.
Let kernel K be an n × n real matrix,

K =
⎛ w11 w12 … w1n ⎞
⎜ w21 w22 … w2n ⎟
⎜  ⋮    ⋮   ⋱  ⋮  ⎟
⎝ wn1 wn2 … wnn ⎠

define the radius rᵢ as the sum of the absolute values of the off-diagonal elements per row,

rᵢ = Σⱼ≠ᵢ |wᵢⱼ|

and let the Gershgorin disk Dᵢ be the closed disk in the complex plane centered at wᵢᵢ with radius rᵢ,

Dᵢ = { x ∈ ℂ : |x − wᵢᵢ| ≤ rᵢ }.

Then, the Gershgorin circle theorem [15] states that every eigenvalue of the kernel K lies within at least one of the Gershgorin disks Dᵢ. As the weights get smaller, the origins of the disks Dᵢ move towards zero. In addition, since the off-diagonal row sums get smaller, the disks shrink. In other words, excessive L1 regularization collapses all Gershgorin disks Dᵢ into points near zero. In conclusion, L1 regularization minimizes (1) the average absolute weights, and (2) the probability of having large eigenvalues. We will examine the subtle difference between the two under the microscope in our analysis sections.
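A quick numerical illustration of this argument (our own sketch, not part of the paper's pipeline): shrinking the weights of a kernel shrinks its Gershgorin disks, and every eigenvalue stays inside their union.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(3, 3))

for scale in (1.0, 0.1, 0.01):  # mimics increasingly strong L1 shrinkage
    M = scale * K
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)  # off-diagonal row sums
    eigs = np.linalg.eigvals(M)
    # every eigenvalue lies in at least one disk |x - w_ii| <= r_i
    inside = [np.any(np.abs(lam - centers) <= radii) for lam in eigs]
    print(scale, all(inside), np.abs(eigs).max())
```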
5. Experimental Setup
Datasets. We used 3 generic object classification datasets: CIFAR-10 [36], CIFAR-100 [36], and tiny-imagenet [40]. The CIFAR-10 and CIFAR-100 datasets are well known benchmarks for generic image classification. tiny-imagenet is a subset of ImageNet [58] originally formed as a class project benchmark, and a convenient next step after CIFAR-10 and CIFAR-100, since it provides a harder classification task (200 object classes) and we do not have enough resources to run our experiments on ImageNet. Image resolutions are 32 × 32 in CIFAR and 64 × 64 in tiny-imagenet. In the analysis sections, we will show that these 3 datasets are quite useful for testing the robustness of a hypothesis. We set the hyperparameters using validation sets that we form by randomly separating part of the training set such that each class is represented equally. Then, we report test results in the analysis part, for which we use the pre-determined test sets of CIFAR-10 and CIFAR-100 and the pre-determined validation set of tiny-imagenet (since we do not have access to the labels of its test set).
Models. We used a well known family of convolutional neural networks: ResNets [26]. In the original paper, ResNets come with 2 different architectures: a thin ResNet variant for CIFAR-10, and a wide ResNet variant for ImageNet. Although the latter is not specifically designed for the datasets we have, it is essential for testing the generalization of pruning hypotheses. Moreover, wide ResNets enable using ImageNet pre-trained weights to test the effect of model initialization on pruning. In this study, we employ ResNets {32, 56, 110} (thin variants) and ResNet50 (wide variant).

L1 penalty. This is the hyperparameter α in Eq. 1. We tried 4 values: {10⁻³, 4 × 10⁻⁴, 10⁻⁴, 10⁻⁵}. Naturally, among these, 10⁻³ results in the highest pruning rate and the lowest validation accuracy. 10⁻⁵ produced slightly better classification performance than 10⁻⁴, but with a significantly lower pruning rate. 4 × 10⁻⁴ was able to reach as high as ≈ 96% pruning rate (for example, using ResNet110 for CIFAR-10 with weight or spectral radius as the compression mode while preserving a reasonable validation accuracy of ≈ 81%), resulting in very sparse models. However, when we set the performance of training without weight regularization as the baseline, the gap between the validation accuracy of 4 × 10⁻⁴ and the baseline grows as the dataset gets more complex (CIFAR-10 → CIFAR-100 → tiny-imagenet). Therefore, in our analyses the L1 penalty is α = 10⁻⁴.

L1 loss term = α Σᵢ |wᵢ|    (1)

Significance threshold. In Section 3, we explain how each compression mode computes the significance score of a kernel. Instead of targeting kernels which have a significance score of exactly 0, we define a threshold and deem the kernels that have scores under this threshold insignificant. Hence, we name this hyperparameter the significance threshold. In Figs 1 and 2, we show the pruning ratio of each compression mode for different significance thresholds to make an informed decision on which values to take as effectively zero. In Fig 1, we see similar curves for thin ResNets, and, as expected, the pruning ratio increases as we go from ResNet32 to ResNet110. Moreover, since the det and det_gram modes represent multiplications of eigenvalues/matrices, their curves go significantly above the others, so it is important to scale their selected thresholds accordingly. Another interesting finding is that the CIFAR-100 task results in the lowest pruning ratio compared to CIFAR-10 and tiny-imagenet. We will analyze the effect of this phenomenon on recognition performance in Section 6. In Fig 2, the model at hand is ResNet50, which is wider than the ones mentioned before, so it has a different curve. With ResNet50, our goal is to examine the effect of initial weights on pruning. Surprisingly, the pruning ratio curve is similar under different initializations, and it is important to notice that imagenet_init is substantially different than the others since it can actually be used to correctly classify most objects in ImageNet. In addition, in both Figs 1 and 2, we see that the compression modes except det and det_gram draw quite similar curves, where the green-ish ones (min_eig based) go above the red-ish ones. This is the expected behavior of the extremums, and it is discussed in detail in Sections 6 and 8. As we want to prune the models as much as possible, we did preliminary experiments on a small targeted interval of significance thresholds (10⁻⁵ → 10⁻³). We found 10⁻⁴ to have a good balance between accuracy and pruning ratio.
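For reference, attaching the L1 penalty of Eq. 1 to a convolutional layer in Keras takes a single argument; a minimal sketch (the layer sizes are arbitrary, only kernel_regularizer matters here):

```python
from tensorflow.keras import layers, regularizers

conv = layers.Conv2D(
    filters=16,
    kernel_size=3,
    padding="same",
    kernel_regularizer=regularizers.l1(1e-4),  # alpha = 10^-4, as in our analyses
)
```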
Figure 1: Pruning ratio vs. threshold chart for thin ResNets (ResNet32/56/110; y-axis: pruning ratio (%)) under multiple random initializations, on CIFAR-10, CIFAR-100 and tiny-imagenet.
In conclusion, in our analysis, we set the significance threshold as follows (a minimal sketch of this thresholding is given at the end of this section):
• det: 10⁻¹² for 3×3 kernels, 10⁻⁴ for 1×1 kernels
• det_gram: 10⁻²⁴ for 3×3 kernels, 10⁻⁸ for 1×1 kernels
• others: 10⁻⁴

Statistical significance. We repeated each run 15+ times, and for the performance comparison this number goes up to 40+ times, simply due to the chronological order of the implementations. Note that each run implies a particular combination of the settings: f_settings(model, dataset, initialization mode). For different compression modes, the training part is shared. Once the training stops, all compression modes branch off. Hence, the experimental results do not have random parameter space noise across different compression modes.

Other hyperparameters. We wanted to keep all hyperparameters as constant as possible to rule out unwanted noise due to hyperparameter variance in controlled experiments, so unless specifically indicated, a given value for a hyperparameter holds for all settings (models, datasets, etc.):
• Optimization algorithm. In this study, we used the Adam optimizer [35] due to its fast convergence on the datasets we use.
• Learning rate. For ResNets {32, 56, 110}, the base learning rate is 10⁻³ whereas it is 10⁻⁴ for ResNet50, and for both families it is divided by 10 at epochs 80, 120 and 160.
• Batch size. We tried 32, 64, and 128, but we did not find a significant or consistent advantage for a particular selection. Therefore, in order to speed up the experiments, we selected 128 as the mini-batch size.
• Epochs. We monitored the validation accuracy of all models for all datasets, and decided to stop the training at a safe point (where the validation accuracy curve becomes flat). Hence, the number of epochs is 200, which is an extremely safe point for some cases where learning stabilizes very early on. In Section 10, we examine pruning through the epochs and try to shed some light onto the internal structure of CNNs during training.
• Initialization. We have 3 initialization modes:
– random_init: Models are initialized using the He normal initialization [25] option in Keras [8].
– static_init: We still use He initialization, but we import the same frozen initial weights for each run (we run the experiments multiple times to check the statistical significance of the results). This
mode provides a control group for the random_init experiments since we do not have variance in the initial conditions here.
– imagenet_init: This mode is only available for the ResNet50 experiments. Nevertheless, since our compression modes are tightly coupled with the properties of the convolution matrices in models, it is important to test the robustness of our propositions in an unfamiliar parameter space where the initial conditions are not random.

Figure 2: Pruning ratio vs. threshold chart for ResNet50 under different initializations (static_init, random_init, imagenet_init), on CIFAR-10, CIFAR-100 and tiny-imagenet.
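The thresholding referenced above reduces to a comparison per kernel. A hypothetical sketch follows (prune_mask and score_fn are our names, not the paper's; the actual implementation lives in the psykedelic repository):

```python
import numpy as np

def prune_mask(weights, score_fn, threshold=1e-4):
    """weights: conv weights of shape (h, w, in_ch, out_ch).
    Returns a boolean (in_ch, out_ch) mask, True = prune that kernel."""
    h, w, cin, cout = weights.shape
    mask = np.zeros((cin, cout), dtype=bool)
    for i in range(cin):
        for j in range(cout):
            mask[i, j] = score_fn(weights[:, :, i, j]) < threshold
    return mask

# e.g., the weight mode on a randomly generated layer:
mask = prune_mask(np.random.default_rng(0).normal(size=(3, 3, 8, 16)) * 1e-3,
                  score_fn=lambda K: np.abs(K).mean())
```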
6. Performance Comparison
Summary of findings:
• Any heuristic that uses the min abs eigenvalue is bad.
• Expert kernel ≠ full-rank matrix.
• Spectral norm is the strongest candidate since it causes almost no drop in classification accuracy.
• Spectral radius and the avg abs weight are the leading heuristics in terms of compression score (Table 4).

In this section, we compare the compression modes in terms of classification accuracy and pruning ratio, which is the number of pruned kernels divided by the total number of kernels. These two measures are inversely correlated. Hence, to provide a relatively unified view of compression performance, we include an additional measure called the compression score c, defined as

c = (acc_pruned / acc_vanilla) ∗ pruning ratio.

c may exceed the pruning ratio if pruning removes disruptive kernels that are causing misclassifications. However, in our experiments, we did not encounter such a scenario. Nevertheless, the higher the c, the better. At the end of a training, we first evaluate the classification accuracy without pruning (acc_vanilla); then each compression mode branches off from that same parameter state to evaluate its performance. In other words, the trained weights are the same for all modes, but the set of pruned kernels changes. If you trim a neural net and do not receive a severe penalty for it, it is a good indicator that the thrown-out parts were not necessary in the first place. However, we do not search for a set of desirable hyperparameters for each compression mode to get acc_pruned/acc_vanilla ≈ 1. Instead, we set up the environment so that at least one compression mode is able to prune the network while achieving performance competitive with the vanilla network (see the "None" category in Table 1).
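In code, the metric is a one-liner; a sketch with made-up numbers for illustration:

```python
def compression_score(acc_pruned, acc_vanilla, pruning_ratio):
    # c = (acc_pruned / acc_vanilla) * pruning ratio
    return (acc_pruned / acc_vanilla) * pruning_ratio

# e.g., pruned accuracy 0.85 vs. vanilla 0.92 at an 80% pruning ratio:
print(compression_score(0.85, 0.92, 0.80))  # ~0.739
```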
Table 1: Classification accuracy comparison of different compression modes (None, min_eig, min_eig_real, det, det_gram, spectral_radius, spectral_radius_real, spectral_norm, weight) for ResNet32/56/110 (static and random initializations) and ResNet50 (static, random and imagenet initializations) on CIFAR-10, CIFAR-100 and tiny-imagenet.
In Table 1, we see very similar performance measurements for the modes based on spectral radius, spectral norm, and the average absolute weight. The worst results are obtained by the modes that check the minimum absolute eigenvalue. Since the determinant combines both the negative and the positive results through the multiplication of eigenvalues, the det and det_gram modes oscillate between the extremums. This finding rules out the hypothesis that an expert kernel must be a full-rank matrix, under the assumption that a value smaller than the selected significance threshold is effectively 0. Because, if the minimum absolute eigenvalue is 0, the determinant is 0. However, since the values are not exactly 0, the spectral radius often protects the determinant from an overkill. We can clearly see the case where this is not enough in the ResNet50 results, where the gap between the extremums is large enough to significantly deteriorate the performance of the determinant based modes. In Table 2, naturally, we observe the opposite, and in Table 3 we get the combined effect of both viewpoints. For ResNets {32, 56, 110}, determinant based heuristics achieve competitive pruning performance. However, when we examine the ResNet50 case (especially the imagenet_init results), it is clear that any compression mode taking the minimum absolute eigenvalue into account is quite unstable. Therefore, we have a compelling case to turn our attention to the spectral radius, the spectral norm, and the average absolute weight. As a side note, according to the experimental results, ignoring the complex parts of the eigenvalues results in different behavior for the min_eig and spectral_radius modes, which will be discussed in detail in Section 7. Moreover, we see differences between the results of the det and det_gram modes. This indicates that there is indeed a relatively small numerical instability in the matrix operations in the given parameter space, and this will be confirmed in Section 8 in detail.
7. Real vs. Complex Eigenvalues
Summary of findings:
• spectral_radius and spectral_radius_real have no major difference in terms of their pruning statistics. Hence, it seems that the real part (scaling factor) of the largest absolute eigenvalue is the key part of spectral_radius as a kernel importance assignment heuristic.
• Since min_eig has already started to prune essential kernels, min_eig_real makes the performance even worse by pruning even more essential kernels.

Here, we are interested in using eigenvalues to understand the inner mechanisms of deep convolutional neural nets. One of the reasons is the geometric meaning of eigenvalues: real parts indicate scaling whereas complex parts indicate rotation. Hence, one might conjecture that the amount of scaling is the key indicator of an expert kernel. It is not easy to answer this question empirically, since the absence of a possible drop in performance when we ignore the complex parts does not guarantee a general rule. On the other hand, if we actually observe a performance loss, we can rule out this hypothesis. Therefore, we have the min_eig_real and spectral_radius_real modes in addition to the vanilla ones: min_eig and spectral_radius. The real-part-only modes are expected to prune more kernels than their vanilla counterparts since √(a² + b²) ≥ |a|, where the complex eigenvalue λ = a + bi.
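The inequality is easy to verify numerically; a small sketch of ours over random 3 × 3 kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    eigs = np.linalg.eigvals(rng.normal(size=(3, 3)))
    # the modulus of an eigenvalue always bounds the magnitude of its
    # real part, so real-part-only scores can only be lower, never higher
    assert np.all(np.abs(eigs) >= np.abs(eigs.real))
```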
Table 2: Compression ratio comparison of different compression modes across models (ResNet32/56/110, ResNet50), datasets (CIFAR-10, CIFAR-100, tiny-imagenet) and initializations.
Table 3: Compression score c comparison of different compression modes across models (ResNet32/56/110, ResNet50), datasets (CIFAR-10, CIFAR-100, tiny-imagenet) and initializations.

Table 4: The average compression score of each heuristic across multiple models, datasets and initializations:
1. weight
2. spectral_radius
3. spectral_radius_real
4. spectral_norm
5. det_gram
6. det
7. min_eig
8. min_eig_real

Moreover, when we only consider the real parts of the eigenvalues, their ordering may change, which means, for instance, that min_eig may target λ1 whereas min_eig_real targets λ2 for the same kernel, where |λ1| < |λ2| but |a1| > |a2|. Naturally, to be able to test the hypothesis mentioned earlier, we need to have a significant amount of complex eigenvalues in our CNNs. In Tables 5 and 6, we show the ratio of complex eigenvalues from three angles: (1) how many complex eigenvalues do we have? (total), (2) how many complex eigenvalues are targeted by the relevant compression modes? (target), and (3) how many pruned kernels are deemed unimportant by looking at complex eigenvalues? (pruned).
Table 5: Amount of total/targeted/pruned complex eigenvalues in ResNet50 for different initializations (random_init, imagenet_init) on CIFAR-10, CIFAR-100 and tiny-imagenet.
For ResNet50, 1×1 kernels dominate the picture. In Table 5, we see that the amount of complex eigenvalues in ResNet50 is quite small compared to thin ResNets, since the 1×1 kernels do not have any complex eigenvalues. Therefore, for the complex vs. real analysis, we decided to focus on thin ResNets. For thin ResNets, the total ratio of complex eigenvalues stays in a consistent range across different models, which is enough to affect the overall classification performance (Table 6). However, the total complex/real ratio may not be quite informative, since different eigenvalue-based compression modes focus on different eigenvalues. For example, if the smallest absolute eigenvalues of all kernels are real and all the largest ones are complex, min_eig and spectral_radius may behave significantly differently in terms of vanilla modes vs. their real-value-based counterparts. Hence, we include a category called target, which indicates the amount of complex values evaluated by each compression mode, whether they lead to pruning or not. According to Table 6, there is indeed a significant difference in the sets of values targeted by the vanilla and the real-value-based modes, and this observation holds for both min_eig and spectral_radius. Thus, when we check the results in Table 3, we see that min_eig_real has a noticeable disadvantage against min_eig. Nevertheless, spectral_radius remains stable whether we choose to focus only on real values or not. This, of course, would hold no importance if we did not prune any kernel due to its complex eigenvalue in the first place. Therefore, we also show the overall contribution of complex eigenvalues to the pruning. The fact that 20%-40% of the pruned kernels are selected by looking at complex values demonstrates that there was indeed enough room for a significant change in pruning performance between the vanilla and real-part-only modes (Section 6). In the end, we see that ignoring the complex parts leads to worse performance for the min_eig mode, whereas it has no significant effect on the performance of the spectral_radius mode. However, since min_eig has already started to prune essential kernels, it is no surprise that min_eig_real prunes even more essential kernels. This is not the case for spectral_radius, so we may suspect that the largest scaling factor is more relevant for kernel importance assignment. Although the generality of this heuristic is not guaranteed, it may be useful for practitioners who want to push the pruning ratio as much as possible without losing too much recognition performance.
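To give a flavor of how the total/target/pruned columns of Tables 5 and 6 can be tallied, here is an illustrative helper (our own bookkeeping, with min_eig as the example mode; threshold is the significance threshold of Section 5):

```python
import numpy as np

def complex_eig_stats(kernels, threshold=1e-4):
    total = n_complex = targeted = pruned = 0
    for K in kernels:
        eigs = np.linalg.eigvals(K)
        total += eigs.size
        n_complex += int(np.sum(eigs.imag != 0))
        lam = eigs[np.argmin(np.abs(eigs))]  # the eigenvalue min_eig looks at
        if lam.imag != 0:
            targeted += 1                    # the mode evaluated a complex value
            if np.abs(lam) < threshold:
                pruned += 1                  # and that value led to pruning
    return n_complex / total, targeted, pruned
```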
8. Set Analysis
In this section, we examine the sets of kernels pruned by each compression mode and their intersections. In our experiments, for all datasets, models, and initializations, there are only 10 distinct sets of kernels, which are pruned by:
1. (min_eig, weight, det, spectral_radius, spectral_norm, det_gram)
2. (min_eig, weight, det, spectral_radius, det_gram)
3. (min_eig, weight, det, det_gram)
4. (min_eig, det, spectral_radius, det_gram)
5. (min_eig, det, det_gram)
6. (min_eig, det_gram)
7. (min_eig, weight)
8. (min_eig, det)
9. min_eig
10. weight

There are 5 exceptional cases: (1, 2) ResNet50 and ResNet56 for CIFAR-100 using static_init, (3) ResNet32 for tiny-imagenet using random_init, and (4) ResNet50 for tiny-imagenet using imagenet_init. These cases also have kernels pruned by (min_eig, det, spectral_radius). In addition, (5) we may also observe kernels pruned only by det_gram due to the numerical instability born out of dealing with very small numbers.

We can draw several conclusions from these sets:
• The kernels pruned by the spectral norm form a subset of all other sets, so it is the safest heuristic among them:
S_min_eig ∩ S_weight ∩ S_det ∩ S_spectral_radius ∩ S_det_gram = S_spectral_norm
• The extremums created by the modes based on the largest and smallest absolute eigenvalues reflect themselves as:
S_min_eig ⊃ S_det ⊃ S_spectral_radius
• The numerical instability mentioned in Section 6 prevents S_det and S_det_gram from being identical. Moreover, due to the existence of (min_eig, det) and (min_eig, det_gram), one is not a subset of the other either:
S_det ≠ S_det_gram, S_det ⊅ S_det_gram, S_det_gram ⊅ S_det
Another outcome of this numerical problem:
S_det_gram ⊅ S_spectral_radius
• We can see the distinction between weight-based pruning and eigenvalue-based pruning:
S_min_eig ⊅ S_weight and S_weight ⊅ {S_min_eig, S_det, S_spectral_radius, S_det_gram}

Additional information, such as the ratio of each set with respect to the union of all sets, and the variance of these ratios across multiple runs, can be found in Appendix A.

Table 6: Amount of total/targeted/pruned complex eigenvalues in thin ResNets (ResNet32/56/110) on CIFAR-10, CIFAR-100 and tiny-imagenet.
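The subset relations above also follow from inequalities between the heuristics (e.g., the spectral norm upper-bounds both the spectral radius and the average absolute weight), so they can be sanity-checked on synthetic kernels; a sketch with our own helper names:

```python
import numpy as np

def pruned_set(kernels, score_fn, threshold=1e-4):
    return {i for i, K in enumerate(kernels) if score_fn(K) < threshold}

modes = {
    "weight": lambda K: np.abs(K).mean(),
    "min_eig": lambda K: np.abs(np.linalg.eigvals(K)).min(),
    "spectral_radius": lambda K: np.abs(np.linalg.eigvals(K)).max(),
    "spectral_norm": lambda K: np.linalg.norm(K, ord=2),
}
rng = np.random.default_rng(0)
kernels = [2e-4 * rng.normal(size=(3, 3)) for _ in range(1000)]
sets = {name: pruned_set(kernels, fn) for name, fn in modes.items()}

# spectral_norm prunes a subset of what every other listed mode prunes
assert all(sets["spectral_norm"] <= s for s in sets.values())
```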
9. Pruning per Layer
Summary of findings:
• The dataset has a footprint on the activity map (the active/total parameter ratio per layer) of a CNN.
• Thin ResNet experiments show that, within the same model family, the activity maps of different models resemble each other.
• ResNet50 experiments show that significantly different initializations affect the activity map.
• For randomly initialized models, the first layers have relatively smaller pruning ratios than the deeper layers.

In this section, we study how pruning affects different layers. In Fig 3, we show the amount of activity (active/total parameter ratio) in different layers of different models of the thin ResNet family. Considering the similarity of the overall activity patterns, we see that the dataset at hand has a footprint on the pruned architecture. In detail, row-wise similarity is relatively low (same model, different dataset) whereas column-wise similarity indicates a correlation between the activity patterns of different models from the same family on the same dataset. Of course, as the model gets smaller (ResNet32), the activity of the first layers gets higher than that of the larger models (ResNet56 and ResNet110). Nevertheless, it is interesting to see aligned spikes in activity, for example the spike at layers ResNet32:25, ResNet56:40, ResNet110:80 on CIFAR-100, since these alignments require the activities to stretch over the layers of models whose depths are significantly different. However, the generalization of this behavior is restricted to a somewhat vague model family definition (thin ResNets), as ResNet50 does not exhibit the same overall activity pattern. It seems that using fundamental building blocks (the thin ResNet block) to generate shallow/deep models leads to interesting activity distributions, which may be used to study the importance/effect of depth in neural nets from a different perspective. Another factor that affects the activity pattern is initialization. We show this effect in Fig 4. CNN training contains different sources of randomness that can affect the results. Hence, we include the static_init results to compare with the random_init results, to show how small these perturbations are in our experiments. Nevertheless, the key observation here is that significantly different initializations (imagenet_init vs. random_init) result in different activity patterns. In addition, imagenet_init is the only case where we observe a drop in the first layers' activities. On the other hand, with the exception of a sudden increase in the last layer activities on tiny-imagenet, we see a general proclivity towards low activity in the last layers, but the exact reason for this phenomenon and its generality to different datasets/models remain an open question.
10. Pruning through Epochs
Summary of findings:
• The first drop in learning rate is found to be a very interesting turning point for pruning:
– The difference between the min and the max abs eigenvalues of the kernels starts to increase. As a result, the heuristics that use the former start to prune more and suffer in classification performance.
– The pruning ratios of all heuristics start to increase until they reach approximately the same amount of pruning. Nevertheless, as the training continues, the min abs eigenvalue based heuristics eventually reach higher pruning ratios.

Here, we decided on the number of epochs (200) by looking at the learning curves of multiple experiments. Our policy was to keep training for a while after the learning curve became nearly flat. However, recently, training history has gained more importance due to interesting observations such as epoch-wise double descent [53]. Hence, at regular epoch intervals, we simulated pruning for each compression mode as if it were the end of the training, to construct a pruning history. Using this history, we ask two questions: (1) What would we see if we stopped the training at an earlier stage? (2) Can we use the compression score to estimate the utilization of a given model's capacity and then, for example, use it to decide whether to stop training?

Figure 3: Active parameter ratio of different compression modes through layers (input → output) for thin ResNets under multiple random initializations.
Figure 4: Active parameter ratio of different compression modes through layers (input → output) for ResNet50 under different initializations.

In Fig 5 and Fig 7, we show the pruning ratio of each compression mode through epochs. As expected (see Sections 6, 7, and 8), min_eig_real always prunes the most (Fig 6 and Fig 8). The interesting observation here is that, for thin ResNets, the other compression modes almost catch up with min_eig in terms of pruning ratio after the 90th epoch, which is right after we divide the learning rate by 10 (80th epoch). This is not the case for ResNet50, where the curves are quite similar to each other and the gap between them is never as big as in the thin ResNet case. However, ResNet50 contains a significant amount of 1×1 kernels (no difference between compression
A Deeper Look into Convolutions
Page 11 of 19 Deeper Look into Convolutions via Pruning R e s N e t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t p a r a m _ c o un t R e s N e t p a r a m _ c o un t Figure 5:
Pruning ratios of different thin ResNets through epochs for different datasets. R e s N e t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t p a r a m _ c o un t R e s N e t p a r a m _ c o un t Figure 6: min_eig_real always achieves the best pruning ratio for different thin ResNets through epochs using different datasets. modes), so it is expected for different modes to draw similarcurves.In Fig 9, we show classification accuracy of each com-pression mode through epochs for different thin ResNets.In comparison to pruning ratio charts, classification perfor-mances vary significantly which results in indistinguishableresults for the first 90 epochs. However, after that point,there is a strong divergence in performance for min_eig andmin_eig_real whereas other modes continue to perform quitesimilarly. This similarity is shown in Fig 10 where the colorof the best mode in the previous simulation point is reflectedby painting the next epoch interval. Note that det anddet_gram can also diverge from the red-ish curve dependingon the eigenvalue distribution, and we observe this prob-ability in Fig 11 as the blue-ish curves also diverge from the red-ish ones in later stages of training. As in the thinResNet case, we see frequent changes in the mode with thebest recognition performance in Fig 12.Compression score (see Section 6) combines the oppos-ing forces in model compression: (1) pruning ratio, (2) clas-sification accuracy. For thin ResNets, this combination man-ifests itself very clearly: in Fig 13, (1) for the first 80 epochscurves are aligned and clearly distinguishable as in Fig 5,then they merge between epochs - , (2) after 90 th epochgreen curves (min_eig and min_eig_real) diverges from therest as in Fig 9. In addition, ResNet50 also follows this pat-tern though with greater perturbation (Fig 15). We again seethree main clusters of lines: red, blue, green. They go rela-tively aligned at first as in Fig 7, then diverge as in Fig 11.Which compression mode is at the top? The ones with the Cugu and Akbas:
A Deeper Look into Convolutions
Page 12 of 19 Deeper Look into Convolutions via Pruning r a n d o m _ i n i t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t p a r a m _ c o un t Figure 7:
Pruning ratios of ResNet50 through epochs for different datasets under different initializations. r a n d o m _ i n i t p a r a m _ c o un t CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t p a r a m _ c o un t Figure 8: min_eig_real always achieves the best pruning ratio for ResNet50 through epochs under different initializations. highest compression score at the different stages of trainingare shown in Fig 14 and Fig 16. According to these results,winners change frequently and do not follow a general pat-tern. Hence, we evaluate winner based comparison not as informative as the clusters-of-modes analysis.In overall, we found an interesting point of behavioralchange in pruning, epoch . After the first drop in learn-ing rate, we identified an interesting shift in the eigen- R e s N e t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t a cc u r a c y R e s N e t a cc u r a c y Figure 9:
Classification accuracy of different thin ResNets through epochs for different datasets.Cugu and Akbas:
A Deeper Look into Convolutions
Page 13 of 19 Deeper Look into Convolutions via Pruning R e s N e t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t a cc u r a c y R e s N e t a cc u r a c y Figure 10:
Compression modes with the best classification accuracy through epochs for thin ResNets. r a n d o m _ i n i t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t a cc u r a c y Figure 11:
Classification accuracy of ResNet50 through epochs for different datasets under different initializations. value distribution resulting in (1) convergence of differ-ent modes’ compression scores, then (2) divergence ofminimum absolute eigenvalue based modes . Why is thisimportant? Nakkiran et al. [53] note that if one has a large enough model they can observe an epoch-wise double de-scent where the training is split into two halves with differ-ent behaviors. This two stage learning model also resem-bles the information bottleneck principle [61] and the critical r a n d o m _ i n i t a cc u r a c y CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t a cc u r a c y Figure 12:
Compression modes with the best classification accuracy through epochs for ResNet50 under different initializations.Cugu and Akbas:
A Deeper Look into Convolutions
Page 14 of 19 Deeper Look into Convolutions via Pruning R e s N e t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t s c o r e R e s N e t s c o r e Figure 13:
Compression score of different thin ResNets through epochs for different datasets. R e s N e t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet R e s N e t s c o r e R e s N e t s c o r e Figure 14:
Compression modes with the best compression score through epochs for thin ResNets. r a n d o m _ i n i t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t s c o r e Figure 15:
Compression score of ResNet50 through epochs for different datasets under different initializations.Cugu and Akbas:
A Deeper Look into Convolutions
Page 15 of 19 Deeper Look into Convolutions via Pruning r a n d o m _ i n i t s c o r e CIFAR-10
CIFAR-100 tiny-imagenet i m a g e n e t _ i n i t s c o r e Figure 16:
Compression modes with the best compression score through epochs for ResNet50 under different initializations. learning period [2]. It seems that, in deep learning litera-ture, the two phased learning is observed again and again indifferent concepts. Although, we cannot explain the behav-ior of the eigenvalues in our experiments, it is important toreport this incident since each new observation of the twostage learning model decreases the odds of experiencing amere coincidence. Hence, we leave extensive study on therelationship between eigenvalues and the learning rate thatresulted in the two phases shown in Figs 13 and 15 as a futurework.
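To make the pruning-history procedure concrete, the following is a minimal sketch of how such a history can be collected at epoch checkpoints; the helper names, the threshold value, and the checkpoint format are illustrative assumptions, not the exact implementation in our codebase.

```python
import numpy as np

def min_eig(k):          # smallest absolute eigenvalue of a kernel
    return np.abs(np.linalg.eigvals(k)).min()

def min_eig_real(k):     # smallest absolute real part of the eigenvalues
    return np.abs(np.linalg.eigvals(k).real).min()

def spectral_radius(k):  # largest absolute eigenvalue
    return np.abs(np.linalg.eigvals(k)).max()

def weight(k):           # classical average absolute weight
    return np.abs(k).mean()

MODES = {"min_eig": min_eig, "min_eig_real": min_eig_real,
         "spectral_radius": spectral_radius, "weight": weight}

def pruning_ratio(kernels, score_fn, threshold=1e-4):
    """Fraction of kernels whose score falls below the pruning threshold."""
    scores = np.array([score_fn(k) for k in kernels])
    return float((scores < threshold).mean())

def pruning_history(checkpoints):
    """checkpoints: one entry per simulation epoch, each a list of the
    small n x n kernel matrices extracted from the network."""
    history = {mode: [] for mode in MODES}
    for kernels in checkpoints:
        for mode, fn in MODES.items():
            history[mode].append(pruning_ratio(kernels, fn))
    return history

# Example with random 3x3 kernels standing in for two saved epochs:
rng = np.random.default_rng(0)
demo = [[rng.normal(size=(3, 3)) * 1e-3 for _ in range(64)] for _ in range(2)]
print(pruning_history(demo))
```

Plotting each entry of the returned history over epochs would yield per-mode curves analogous in form to those in Figs 5 and 7.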
11. Discussion
We can also find expert kernels in the human brain. It is well known that the adult brain contains special modules that activate for specific functions such as face recognition [34]. What is more interesting for us is that these experts emerge from a natural pruning phase [9]: "In normal development, the pathway for each eye is sculpted (or "pruned") down to the right number of connections, and those connections are sculpted in other ways, for example, to allow one to see patterns. By overproducing synapses then selecting the right connections, the brain develops an organized wiring diagram that functions optimally." A clear example is given by Haier et al. [21, 20], where it is shown that the high brain activity observed while learning to play Tetris is replaced by low brain activity once the subjects reach a certain level of expertise in the game. The authors of these papers state that the correlation between improvement on the task and decreasing brain glucose use suggests that those who honed their cognitive strategy to the fewest circuits improved the most. In fact, these studies are part of a wider picture known as the neural efficiency hypothesis of intelligence [54, 56]. Hence, we think pruning is imperative to understand the learning dynamics of convolutional neural nets.

In addition, while trying to set the optimal L2 regularization penalty, the 𝛼 = 4 × 10⁻⁴ and 𝛼 = 10⁻⁵ experiments revealed an interesting phenomenon. In general, model compression consists of three phases: (1) training with regularization, (2) pruning, (3) fine-tuning without (excessive) regularization. Although we do not use the fine-tuning phase in our analysis, we nonetheless ran fine-tuning trials in our preliminary experiments and observed opposite effects for the two penalties: the excessive pruning induced by 𝛼 = 4 × 10⁻⁴ damages the model capacity in such a way that fine-tuning actually decreases the performance of the model, whereas the 𝛼 = 10⁻⁵ selection can still benefit from fine-tuning. Nevertheless, this observation requires a detailed study focused entirely on pruning vs. model capacity, so we leave this as future work.
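For clarity, the three-phase recipe can be sketched in Keras as follows; the toy architecture, the per-filter pruning rule, and all hyperparameters other than the two 𝛼 values are placeholder assumptions for illustration.

```python
import numpy as np
from tensorflow import keras

def build(alpha):
    reg = keras.regularizers.l2(alpha)
    model = keras.Sequential([
        keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                            kernel_regularizer=reg, input_shape=(32, 32, 3)),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Phase 1: train with a comparatively strong L2 penalty.
model = build(alpha=4e-4)
# model.fit(x_train, y_train, epochs=120)

# Phase 2: prune, here with the classical per-filter average-absolute-weight
# rule and a placeholder threshold.
kernel, bias = model.layers[0].get_weights()
mask = np.abs(kernel).mean(axis=(0, 1, 2), keepdims=True) > 1e-4
model.layers[0].set_weights([kernel * mask, bias])

# Phase 3: optionally fine-tune with little or no regularization; in our
# preliminary trials this helped after alpha = 1e-5 training but hurt after
# the aggressive pruning induced by alpha = 4e-4.
# model.fit(x_train, y_train, epochs=20)
```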
12. Conclusions and Future Work
In this work, we study matrix characteristics for pruning in an attempt to answer the question: what is an expert kernel? In this pursuit, we found that (1) the spectral norm is a more robust heuristic (in the sense that it better preserves generalization accuracy) than the average absolute weight for pruning, (2) one can use the spectral radius of a kernel as a useful heuristic for pruning and push the compression ratio further by discarding the complex parts of the eigenvalues while preserving the recognition performance, (3) a kernel can be essential whether it is full-rank or not, (4) theoretically sound pruning heuristics may not hold in practice due to numerical instabilities born out of dealing with very small numbers, (5) datasets leave a significant footprint in the activity map (the active parameter distribution across layers after pruning) of a model family under homogeneous L2 regularization (each layer has the same regularization penalty), (6) the gap between the smallest and the largest eigenvalue of a kernel starts to increase after the first drop in the learning rate, which results in a divergence of the compression scores achieved by the minimum and maximum absolute eigenvalue based pruning heuristics. With these in mind, we leave an extensive study of neural activity through layers, the two stages of the eigenvalue distributions, and their connection to the learning rate for future research.
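As a starting point for the future work suggested by finding (6), the eigenvalue gap in question can be monitored in a few lines; load_kernels below is a hypothetical loader that extracts the n × n kernel matrices from a saved checkpoint.

```python
import numpy as np

def eig_gap(kernel):
    """Gap between the largest and smallest absolute eigenvalue."""
    mags = np.abs(np.linalg.eigvals(kernel))
    return float(mags.max() - mags.min())

def mean_gap_per_checkpoint(checkpoint_paths, load_kernels):
    """load_kernels(path) -> list of n x n kernel matrices (hypothetical)."""
    return [float(np.mean([eig_gap(k) for k in load_kernels(path)]))
            for path in checkpoint_paths]
```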
References

[1] Abel, N.H., 1826. Beweis der Unmöglichkeit, algebraische Gleichungen von höheren Graden als dem vierten allgemein aufzulösen. Journal für die reine und angewandte Mathematik 1, 65–84.
[2] Achille, A., Rovere, M., Soatto, S., 2018. Critical learning periods in deep networks, in: International Conference on Learning Representations.
[3] Arora, S., Ge, R., Neyshabur, B., Zhang, Y., 2018. Stronger generalization bounds for deep nets via a compression approach, in: International Conference on Machine Learning, pp. 254–263.
[4] Baykal, C., Liebenwein, L., Gilitschenski, I., Feldman, D., Rus, D., 2018. Data-dependent coresets for compressing neural networks with applications to generalization bounds, in: International Conference on Learning Representations.
[5] Behrmann, J., Grathwohl, W., Chen, R.T., Duvenaud, D., Jacobsen, J.H., 2019. Invertible residual networks, in: International Conference on Machine Learning, pp. 573–582.
[6] Belkin, M., Hsu, D., Ma, S., Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116, 15849–15854.
[7] Blakeney, C., Yan, Y., Zong, Z., 2020. Is pruning compression?: Investigating pruning via network layer similarity, in: The IEEE Winter Conference on Applications of Computer Vision, pp. 914–922.
[8] Chollet, F., 2015. Keras.
[9] Council, N.R., et al., 2000. How people learn: Brain, mind, experience, and school: Expanded edition. National Academies Press.
[10] Dong, X., Chen, S., Pan, S., 2017. Learning to prune deep neural networks via layer-wise optimal brain surgeon, in: Advances in Neural Information Processing Systems, pp. 4857–4867.
[11] Dubey, A., Chatterjee, M., Ahuja, N., 2018. Coreset-based neural network compression, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 454–470.
[12] Frankle, J., Carbin, M., 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks, in: International Conference on Learning Representations.
[13] Fukushima, K., Miyake, S., 1982. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, in: Competition and Cooperation in Neural Nets. Springer, pp. 267–285.
[14] Galois, É., Neumann, P.M., 2011. The mathematical writings of Évariste Galois. volume 6. European Mathematical Society.
[15] Gerschgorin, S., 1931. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika 7, 749–754.
[16] Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning.
[17] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
[18] Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E., 2018. MorphNet: Fast & simple resource-constrained structure learning of deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
[19] Guo, Y., Yao, A., Chen, Y., 2016. Dynamic network surgery for efficient DNNs, in: Advances in Neural Information Processing Systems, pp. 1379–1387.
[20] Haier, R.J., Siegel, B., Tang, C., Abel, L., Buchsbaum, M.S., 1992a. Intelligence and changes in regional cerebral glucose metabolic rate following learning. Intelligence 16, 415–426.
[21] Haier, R.J., Siegel Jr, B.V., MacLachlan, A., Soderling, E., Lottenberg, S., Buchsbaum, M.S., 1992b. Regional glucose metabolic changes after learning a complex visuospatial/motor task: a positron emission tomographic study. Brain Research 570, 134–143.
[22] Han, S., Mao, H., Dally, W.J., 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
[23] Han, S., Pool, J., Tran, J., Dally, W., 2015. Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, pp. 1135–1143.
[24] Hassibi, B., Stork, D.G., 1993. Second order derivatives for network pruning: Optimal brain surgeon, in: Advances in Neural Information Processing Systems, pp. 164–171.
[25] He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[26] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[27] He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y., 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
[28] He, Y., Zhang, X., Sun, J., 2017. Channel pruning for accelerating very deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
[29] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[30] Huang, Z., Wang, N., 2018. Data-driven sparse structure selection for deep neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 304–320.
[31] Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K., 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
[32] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, pp. 448–456.
[33] Junjie, L., Zhe, X., Runbin, S., Cheung, R.C., So, H.K., 2020. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers, in: International Conference on Learning Representations.
[34] Kanwisher, N., McDermott, J., Chun, M.M., 1997. The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of Neuroscience 17, 4302–4311.
[35] Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[36] Krizhevsky, A., et al., 2009. Learning multiple layers of features from tiny images.
[37] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541–551.
[38] LeCun, Y., Denker, J.S., Solla, S.A., 1990. Optimal brain damage, in: Advances in Neural Information Processing Systems, pp. 598–605.
[39] Li, C., Farkhoor, H., Liu, R., Yosinski, J., 2018. Measuring the intrinsic dimension of objective landscapes, in: International Conference on Learning Representations.
[40] Li, F.F., Karpathy, A., Johnson, J., 2017a. Tiny ImageNet visual recognition challenge.
[41] Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P., 2017b. Pruning filters for efficient convnets.
[42] Li, Y., Gu, S., Gool, L.V., Timofte, R., 2019. Learning filter basis for convolutional neural network compression, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5623–5632.
[43] Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., Doermann, D., 2019. Towards optimal structured CNN pruning via generative adversarial learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2799.
[44] Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C., 2017. Learning efficient convolutional networks through network slimming, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
[45] Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T., 2018. Rethinking the value of network pruning, in: International Conference on Learning Representations.
[46] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[47] Luo, J.H., Wu, J., Lin, W., 2017. ThiNet: A filter level pruning method for deep neural network compression, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
[48] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y., 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
[49] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J., 2017. Pruning convolutional neural networks for resource efficient inference.
[50] Mozer, M.C., Smolensky, P., 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment, in: Advances in Neural Information Processing Systems, pp. 107–115.
[51] Mpitsos, G.J., Burton Jr, R.M., 1992. Convergence and divergence in neural networks: Processing of chaos and biological analogy. Neural Networks 5, 605–625.
[52] Mussay, B., Osadchy, M., Braverman, V., Zhou, S., Feldman, D., 2019. Data-independent neural pruning via coresets, in: International Conference on Learning Representations.
[53] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I., 2019. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292.
[54] Neubauer, A.C., Fink, A., 2009. Intelligence and neural efficiency. Neuroscience & Biobehavioral Reviews 33, 1004–1023.
[55] Nowlan, S.J., Hinton, G.E., 1992. Simplifying neural networks by soft weight-sharing. Neural Computation 4, 473–493.
[56] Nussbaumer, D., Grabner, R.H., Stern, E., 2015. Neural efficiency in working memory tasks: The impact of task demand. Intelligence 50, 196–208.
[57] Ramakrishnan, R.K., Sari, E., Nia, V.P., 2020. Differentiable mask for pruning convolutional and recurrent networks, in: 2020 17th Conference on Computer and Robot Vision (CRV), IEEE, pp. 222–229.
[58] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
[59] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[60] Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O., 2020. Green AI. Communications of the ACM 63, 54–63.
[61] Shwartz-Ziv, R., Tishby, N., 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[62] Sietsma, J., 1988. Neural net pruning: why and how, in: Proceedings of International Conference on Neural Networks, San Diego, CA, 1988, pp. 325–333.
[63] Son, S., Nah, S., Mu Lee, K., 2018. Clustering convolutional kernels to compress deep neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 216–232.
[64] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[65] Ullrich, K., Meeds, E., Welling, M., 2017. Soft weight-sharing for neural network compression.
[66] Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H., 2016. Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems, pp. 2074–2082.
[67] Yang, T.J., Chen, Y.H., Sze, V., 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695.
[68] Ye, J., Lu, X., Lin, Z., Wang, J.Z., 2018. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers, in: International Conference on Learning Representations.
[69] Ye, S., Xu, K., Liu, S., Cheng, H., Lambrechts, J.H., Zhang, H., Zhou, A., Ma, K., Wang, Y., Lin, X., 2019. Adversarial robustness vs. model compression, or both?, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 111–120.
[70] Yu, R., Li, A., Chen, C.F., Lai, J.H., Morariu, V.I., Han, X., Gao, M., Lin, C.Y., Davis, L.S., 2018. NISP: Pruning networks using neuron importance score propagation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
[71] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2016. Understanding deep learning requires rethinking generalization.
[72] Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., Tian, Q., 2019. Variational convolutional neural network pruning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789.
A. Appendix
Here we present the detailed results of our set analysis (see Section 8). In each bar chart below, the x-axis enumerates the subsets of compression modes that agree on pruning a given group of kernels (a: min_eig, b: weight, c: det, d: spectral_radius, e: spectral_norm, f: det_corr), and the y-axis reports the pruned/total parameter ratio (%).

Figure 17: Set analysis bar chart for ResNet32 on CIFAR-10 with static_init.
Figure 18: Set analysis bar chart for ResNet32 on CIFAR-100 with static_init.
Figure 19: Set analysis bar chart for ResNet32 on tiny-imagenet with static_init.
Figure 20: Set analysis bar chart for ResNet56 on CIFAR-10 with static_init.
Figure 21: Set analysis bar chart for ResNet56 on CIFAR-100 with static_init.
Figure 22: Set analysis bar chart for ResNet56 on tiny-imagenet with static_init.
Figure 23: Set analysis bar chart for ResNet110 on CIFAR-10 with static_init.
Figure 24: Set analysis bar chart for ResNet110 on CIFAR-100 with static_init.
Figure 25: Set analysis bar chart for ResNet110 on tiny-imagenet with static_init.
Figure 26: Set analysis bar chart for ResNet32 on CIFAR-10 with random_init.
Figure 27: Set analysis bar chart for ResNet32 on CIFAR-100 with random_init.
Figure 28: Set analysis bar chart for ResNet32 on tiny-imagenet with random_init.
Figure 29: Set analysis bar chart for ResNet56 on CIFAR-10 with random_init.
Figure 30: Set analysis bar chart for ResNet56 on CIFAR-100 with random_init.
Figure 31: Set analysis bar chart for ResNet56 on tiny-imagenet with random_init.
Figure 32: Set analysis bar chart for ResNet110 on CIFAR-10 with random_init.
Figure 33: Set analysis bar chart for ResNet110 on CIFAR-100 with random_init.
Figure 34: Set analysis bar chart for ResNet110 on tiny-imagenet with random_init.
Figure 35: Set analysis bar chart for ResNet50 on CIFAR-10 with static_init.
Figure 36: Set analysis bar chart for ResNet50 on CIFAR-100 with static_init.
Figure 37: Set analysis bar chart for ResNet50 on tiny-imagenet with static_init.
Figure 38: Set analysis bar chart for ResNet50 on CIFAR-10 with random_init.
Figure 39: Set analysis bar chart for ResNet50 on CIFAR-100 with random_init.
Figure 40: Set analysis bar chart for ResNet50 on tiny-imagenet with random_init.
Figure 41: Set analysis bar chart for ResNet50 on CIFAR-10 with imagenet_init.
Figure 42: Set analysis bar chart for ResNet50 on CIFAR-100 with imagenet_init.
Figure 43: Set analysis bar chart for ResNet50 on tiny-imagenet with imagenet_init.
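For reference, the subset categories on the x-axes of these charts can be derived from per-mode sets of pruned kernel indices as in the following sketch; the sets shown are toy values for illustration only.

```python
pruned = {  # mode -> indices of the kernels it prunes (toy values)
    "a": {0, 1, 2, 5}, "b": {0, 3}, "c": {0, 1, 5},
    "d": {0, 5}, "e": {0}, "f": {0, 1, 5},
}

def membership(idx):
    """The exact subset of modes that prune kernel idx."""
    return frozenset(m for m, s in pruned.items() if idx in s)

all_kernels = set().union(*pruned.values()) | {4}  # kernel 4: pruned by none
bins = {}
for idx in sorted(all_kernels):
    bins.setdefault(membership(idx), set()).add(idx)

# Each bin corresponds to one x-axis category such as {a,c,f} or {}.
for modes, kernels in sorted(bins.items(), key=lambda kv: -len(kv[0])):
    print("{" + ",".join(sorted(modes)) + "}", "->", sorted(kernels))
```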