A Gradient Flow Framework For Analyzing Network Pruning
Ekdeep Singh Lubana & Robert P. Dick
EECS Department, University of Michigan, Ann Arbor, MI 48105, USA
{eslubana, dickrp}@umich.edu

ABSTRACT
Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general gradient-flow-based framework that unifies state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. Code available at https://github.com/EkdeepSLubana/flowandprune.
1 INTRODUCTION
The use of Deep Neural Networks (DNNs) in intelligent edge systems has been enabled by extensive research on model compression. "Pruning" techniques are commonly used to remove "unimportant" filters to either preserve or promote specific, desirable model properties. Most pruning methods were originally designed to compress trained models, with the goal of reducing inference costs only. For example, Li et al. (2017); He et al. (2018) proposed to remove filters with small $\ell_1$/$\ell_2$ norm, thus ensuring minimal change in model output. Molchanov et al. (2017; 2019); Theis et al. (2018) proposed to preserve the loss of a model, generally using Taylor expansions around a filter's parameters to estimate change in loss as a function of its removal.

Recent works focus on pruning models at initialization (Frankle & Carbin (2019); Lee et al. (2019; 2020)) or after minimal training (You et al. (2020)), thus enabling reduction in both inference and training costs. To estimate the impact of removing a parameter, these methods use the same importance measures as designed for pruning trained models. Since such measures focus on preserving model outputs or loss, Wang et al. (2020) argue they are not well-motivated for pruning models early-on in training. However, in this paper, we demonstrate that if the relationship between importance measures used for pruning trained models and the evolution of model parameters is established, their use early-on in training can be better justified.

In particular, we employ gradient flow (gradient descent with infinitesimal learning rate) to develop a general framework that relates state-of-the-art importance measures used in network pruning through the norm of model parameters. This framework establishes the relationship between regularly used importance measures and the evolution of a model's parameters, thus demonstrating why measures designed to prune trained models also perform well early-on in training. More generally, our framework enables better understanding of what properties make a parameter dispensable according to a particular importance measure. Our findings follow. (i) Magnitude-based pruning measures remove parameters that contribute least to reduction in loss. This enables magnitude-based pruned models to achieve faster convergence than magnitude-agnostic measures. (ii) Loss-preservation based measures remove parameters with the least tendency to change, thus preserving first-order model evolution dynamics. This shows that loss-preservation is appropriate for pruning models early-on in training as well. (iii) Gradient-norm based pruning is linearly related to second-order model evolution dynamics. Increasing gradient norm via pruning for even slightly trained models can permanently damage earlier layers, producing poorly performing architectures. This behavior is a result of aggressively pruning filters that maximally increase model loss. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100.

2 RELATED WORK
Pruning frameworks generally define importance measures to estimate the impact of removing a parameter. Most popular importance measures are based on parameter magnitude (Li et al. (2017); He et al. (2018); Liu et al. (2017)) or loss preservation (Molchanov et al. (2019; 2017); Theis et al. (2018)). Recent works show that using these measures, models pruned at initialization (Lee et al. (2019); Wang et al. (2020); Hayou et al. (2020); Frankle & Carbin (2019)) or after minimal training (You et al. (2020)) achieve final performance similar to the original networks. Since measures for pruning trained models are motivated by output or loss preservation, Wang et al. (2020) argue they may not be well suited for pruning models early-on in training. They thus propose GraSP, a measure that promotes preservation of parameters that increase the gradient norm. Since loss-preservation and gradient-norm based measures are designed using first-order Taylor series approximations, they are exactly applicable under gradient flow only. This suggests using gradient flow to analyze evolution of a model can provide useful insights into regularly used importance measures for network pruning.

Despite its success, the foundations of network pruning are not well understood. Recent work has shown that good "subnetworks" that achieve similar performance to the original network exist within both trained (Ye et al. (2020)) and untrained models (Malach et al. (2020)). These works thus prove networks can be pruned without loss in performance, but do not indicate how a network should be pruned, i.e., which importance measures are preferable.

From an implementation standpoint, pruning approaches can be placed in two categories. The first, structured pruning (Li et al. (2017); He et al. (2018); Liu et al. (2017); Ye et al. (2018); Luo et al. (2019); Molchanov et al. (2019); Theis et al. (2018); Molchanov et al. (2017); Gao et al. (2019)), removes entire filters, simplifying efficient implementation in both hardware and software, where regular structures are preferred. The second, unstructured pruning (Han et al. (2016b); LeCun et al. (1990); Hassibi & Stork (1993)), is more fine-grained, operating at the level of individual parameters instead of filters. Unstructured pruning has recently been used to reduce computational complexity as well, but requires specially designed hardware (Han et al. (2016a)) or software (Elsen et al. (2020)). While results in this paper are applicable in both settings, our experimental evaluation focuses on structured pruning due to its higher relevance to practitioners.
3 PRELIMINARIES: CLASSES OF STANDARD IMPORTANCE MEASURES
In this section, we review the most successful classes of importance measures for network pruning. These measures will be our focus in subsequent sections. We use bold symbols to denote vectors and italicize scalar variables. Consider a model that is parameterized as $\Theta(t)$ at time $t$. We denote the gradient of the loss with respect to model parameters at time $t$ as $g(\Theta(t))$, the Hessian as $H(\Theta(t))$, and the model loss as $L(\Theta(t))$. A general model parameter is denoted as $\theta(t)$. The importance of a set of parameters $\Theta_p(t)$ is denoted as $I(\Theta_p(t))$.
Magnitude-based measures: Both $\ell_1$ norm (Li et al. (2017)) and $\ell_2$ norm (He et al. (2018)) have been successfully used as magnitude-focused importance measures and generally perform equally well. Due to its differentiability, $\ell_2$ norm can be analyzed using gradient flow and will be our focus in the following sections.

$$I(\Theta_p(t)) = \|\Theta_p(t)\|^2 = \sum_{\theta_i \in \Theta_p} (\theta_i(t))^2. \qquad (1)$$
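For structured pruning, Equation 1 is applied per filter: each filter's score is the sum of its squared weights. The following is a minimal PyTorch sketch of this scoring, written as our own illustrative code rather than the released implementation; the layer in the example is arbitrary.

```python
import torch
import torch.nn as nn

def magnitude_importance(conv: nn.Conv2d) -> torch.Tensor:
    """Squared l2 norm of each filter (Equation 1); one score per output channel."""
    weight = conv.weight.detach()              # shape: (out_channels, in_channels, k, k)
    return weight.pow(2).sum(dim=(1, 2, 3))    # shape: (out_channels,)

# Example: score the 64 filters of a single convolution layer.
conv = nn.Conv2d(3, 64, kernel_size=3)
print(magnitude_importance(conv).shape)        # torch.Size([64])
```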
Loss-preservation based measures: These measures determine the impact removing a set of parameters has on model loss, generally using a first-order Taylor decomposition. Most recent methods (Molchanov et al. (2019; 2017); Ding et al. (2019); Theis et al. (2018)) for pruning trained models are variants of this method, often using additional heuristics to improve their performance.

$$L(\Theta(t) - \Theta_p(t)) - L(\Theta(t)) \approx -\Theta_p^T(t)\, g(\Theta(t)). \qquad (2)$$

The equation above implies that the loss of a pruned model is higher (lower) than the original model if parameters with a negative (positive) value for $\Theta_p^T(t)\, g(\Theta(t))$ are removed. Thus, for preserving model loss, the following importance score should be used.

$$I(\Theta_p(t)) = \left|\Theta_p^T(t)\, g(\Theta(t))\right|. \qquad (3)$$
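Per filter, Equation 3 is the absolute inner product of the filter's weights with their gradient. A minimal sketch follows, assuming a backward pass has already populated the gradients; the function name and toy model are our own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def loss_preservation_importance(conv: nn.Conv2d) -> torch.Tensor:
    """|theta^T g| per filter (Equation 3); requires conv.weight.grad to be populated."""
    w, g = conv.weight.detach(), conv.weight.grad.detach()
    return (w * g).sum(dim=(1, 2, 3)).abs()

# Toy usage: one backward pass on random data to populate gradients.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
F.cross_entropy(model(x), y).backward()
print(loss_preservation_importance(model[0]))
```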
Increase in gradient-norm based measures: Wang et al. (2020) argue loss-preservation based methods are not well-motivated for pruning models early-on in training. They thus propose GraSP, an importance measure that prunes parameters whose removal increases the gradient norm and can enable fast convergence for a pruned model.

$$\|g(\Theta(t) - \Theta_p(t))\|^2 - \|g(\Theta(t))\|^2 \approx -2\,\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t)). \qquad (4)$$

The above equation implies that the gradient norm of a pruned model is higher than the original model if parameters with a negative value for $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ are removed. This results in the following importance score.

$$I(\Theta_p(t)) = \Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t)). \qquad (5)$$

As mentioned before, these importance measures were introduced for pruning trained models (except for GraSP), but are also used for pruning models early-on in training. In the following sections, we revisit the original goals for these measures, establish their relationship with evolution of model parameters over time, and provide clear justifications for their use early-on in training.
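Equation 5 requires a Hessian-gradient product, which can be obtained with double backpropagation instead of forming the Hessian explicitly. The sketch below follows this general recipe and is our own illustrative code, not the official GraSP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_norm_importance(model: nn.Module, conv: nn.Conv2d,
                             x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """theta^T H g per filter (Equation 5), using a Hessian-vector product to form Hg."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)   # g, kept differentiable
    gg = sum((gr * gr.detach()).sum() for gr in grads)             # g^T stop_grad(g)
    hg = torch.autograd.grad(gg, params)                           # d(g^T const)/dtheta = H g
    idx = [i for i, p in enumerate(params) if p is conv.weight][0]
    return (conv.weight.detach() * hg[idx]).sum(dim=(1, 2, 3))     # per-filter theta^T H g

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(gradient_norm_importance(model, model[0], x, y))
```

Pruning the filters with the lowest (most negative) scores under this measure increases the gradient norm; the gradient-norm preservation variant discussed in Section 4.3 instead ranks filters by the absolute value of this score.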
4 GRADIENT FLOW AND NETWORK PRUNING

Gradient flow, or gradient descent with infinitesimal learning rate, is a continuous-time version of gradient descent. The evolution over time of model parameters, gradient, and loss under gradient flow can be described as follows.

$$\text{(Parameters over time)}\;\; \frac{\partial \Theta(t)}{\partial t} = -g(\Theta(t)); \qquad \text{(Gradient over time)}\;\; \frac{\partial g(\Theta(t))}{\partial t} = -H(\Theta(t))\, g(\Theta(t)); \qquad \text{(Loss over time)}\;\; \frac{\partial L(t)}{\partial t} = -\|g(\Theta(t))\|^2. \qquad (6)$$

Recall that standard importance measures based on loss-preservation (Equation 3) or increase in gradient-norm (Equation 5) are derived using a first-order Taylor series approximation, making them exactly valid under the continuous scenario of gradient flow. This indicates analyzing the evolution of model parameters via gradient flow can provide useful insights into the relationships between different importance measures. To this end, we use gradient flow to develop a general framework that relates different classes of importance measures through the norm of model parameters. As we develop this framework, we explain the reasons why importance measures defined for pruning trained models are also highly effective when used for pruning early-on in training.

4.1 GRADIENT FLOW AND MAGNITUDE-BASED PRUNING
We first analyze the evolution of the $\ell_2$ norm of model parameters, a magnitude-based pruning measure, as a function of time under gradient flow. For a model initialized as $\Theta(0)$, with parameters $\Theta(T)$ at time $T$, we note the distance from initialization can be related to model loss as follows:

$$\|\Theta(T) - \Theta(0)\| = \left\|\int_0^T \frac{\partial \Theta(t)}{\partial t}\, dt\right\| = \left\|\int_0^T g(\Theta(t))\, dt\right\| \overset{(a)}{\leq} \int_0^T \|g(\Theta(t))\|\, dt \overset{(b)}{=} \int_0^T -\frac{\partial L(t)}{\partial t}\, dt = L(0) - L(T) \qquad (7)$$

$$\implies \|\Theta(T) - \Theta(0)\| \leq L(0) - L(T),$$

where (a) follows from the triangle inequality for norms and (b) follows from Equation 6. The inequality above is a modification of a previously known result by Nagarajan & Kolter (2017), who show that change in model parameters, as measured by distance from initialization, is bounded by the ratio of loss at initialization to the norm of the gradient at time $T$. Frequently used initialization techniques are zero-centered with small variance (Glorot & Bengio (2010); He et al. (2015)). This reduces the impact of initialization, making $\|\Theta(T) - \Theta(0)\| \approx \|\Theta(T)\|$. In fact, calculating the importance of filters in a model using $\ell_2$ norm versus using distance from initialization, we find the two measures have an average correlation of 0.994 across training epochs, and generally prune identical parameters.

Figure 1: Train/test accuracy curves for pruned ResNet-56 models on CIFAR-10 (left) and CIFAR-100 (right) over 25 rounds. Models are pruned using magnitude-based pruning (Magnitude), the proposed extension to loss preservation (Proposed), and loss-preservation based pruning (Loss-pres.). Magnitude-based pruning converges fastest, followed by the proposed measure. Curves for other models and numbers of rounds are shown in the Appendix.

We now use Equation 7 to relate magnitude-based pruning with model loss at any given time. Specifically, reorganizing Equation 7 yields

$$L(T) \leq L(0) - \|\Theta(T) - \Theta(0)\| \approx L(0) - \|\Theta(T)\|. \qquad (8)$$

Based on this analysis, we make the following observation.
Observation 1: The larger the magnitude of parameters at a particular instant, the smaller the model loss at that instant will be. If these large-magnitude parameters are preserved while pruning (instead of smaller ones), the pruned model's loss decreases faster.
Equation 8 shows that the norm of model parameters bounds model loss at any given time. Thus, by preserving large-magnitude parameters, magnitude-based pruning enables faster reduction in loss than magnitude-agnostic techniques. For example, we later show that loss-preservation based pruning removes the most slowly changing parameters, regardless of their magnitude. Thus, magnitude-based pruned models will generally converge faster than loss-preservation based pruned models.

To verify this analysis also holds under the SGD training used in practice, we train several VGG-13, MobileNet-V1, and ResNet-56 models on CIFAR-10 and CIFAR-100. Our pruning setup starts with randomly initialized models and uses a prune-and-train framework, where each round of pruning involves an epoch of training followed by pruning. A target amount of pruning (e.g., 75% filters) is divided evenly over a given number of pruning rounds. Throughout training, we use a small temperature value of 5 to ensure smooth changes to model parameters. We provide results for 1, 5, and 25 rounds for all models and datasets. The results after 1 round, where pruning is single-shot, demonstrate that our claims are general and not an artifact of allowing the model to compensate for its lost parameters; the results after 5 and 25 rounds, where pruning is distributed over a number of rounds, demonstrate that our claims also hold when models compensate for lost parameters.

The results are shown in Table 1. Magnitude-based pruning consistently performs better than loss-preservation based pruning. Furthermore, train/test convergence for magnitude-based pruned models is faster than that for loss-preservation based pruned models, as shown in Figure 1. These results validate our claim that magnitude-based pruning results in faster converging, better-performing models when pruning early-on in training.
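The prune-and-train schedule described above can be summarized in a short loop. The sketch below is a simplified outline under our own naming: `train_one_epoch` and `importance_fn` are placeholders, and pruned filters are zeroed out (and kept at zero) rather than physically removed.

```python
import torch
import torch.nn as nn

def prune_and_train(model: nn.Module, train_one_epoch, importance_fn,
                    target_fraction: float = 0.75, rounds: int = 25):
    """Spread a target filter-pruning fraction evenly over `rounds` train-then-prune rounds."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    masks = {m: torch.ones(m.out_channels, dtype=torch.bool) for m in convs}
    per_round = int(target_fraction * sum(m.out_channels for m in convs) / rounds)

    for _ in range(rounds):
        train_one_epoch(model)
        with torch.no_grad():                                  # keep previously pruned filters at zero
            for m in convs:
                m.weight[~masks[m]] = 0.0
        scores = []                                            # (score, layer, filter index)
        for m in convs:
            s = importance_fn(m)
            s[~masks[m]] = float("inf")                        # never re-select pruned filters
            scores += [(s[i].item(), m, i) for i in range(m.out_channels)]
        for _, m, i in sorted(scores, key=lambda t: t[0])[:per_round]:
            masks[m][i] = False
            with torch.no_grad():
                m.weight[i] = 0.0
                if m.bias is not None:
                    m.bias[i] = 0.0
    return masks
```

Here `importance_fn` can be, e.g., the magnitude score sketched earlier; single-shot pruning corresponds to `rounds=1`. A full implementation would also remove the pruned channels from downstream layers, which this sketch omits.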
Table 1: Accuracy of pruned models on CIFAR-10/CIFAR-100 (over 3 seeds; base model accuracies are reported in parentheses; standard deviations are small for all experiments; best results are in bold, second best are underlined). Magnitude-based pruning (Mag.) and the proposed extension to loss preservation (Proposed: $\sum_{\theta_i \in \Theta_p} |\theta_i(t)|\, |\theta_i(t)\, g(\theta_i(t))|$) consistently outperform plain loss-preservation based pruning (Loss). With more rounds, the proposed measure outperforms magnitude-based pruning too.

Pruning Rounds               1 round                     5 rounds                    25 rounds
                 % pruned    Mag.    Loss    Proposed    Mag.    Loss    Proposed    Mag.    Loss    Proposed
CIFAR-10
VGG-13 (93.1)      75%       92.05   92.01
MobileNet (92.3)   75%       91.71   91.17
ResNet-56 (93.1)   60%       91.41   91.09
CIFAR-100
VGG-13 (69.6)      65%       67.89   68.61
MobileNet (69.1)   65%       68.01   67.16
ResNet-56 (71.0)   50%       66.92   66.88
Extending existing importance measures: A fundamental understanding of existing importance measures can be exploited to extend and design new measures for pruning models early-on in training. For example, we modify loss-preservation based pruning to remove small-magnitude parameters that do not affect model loss, using the following importance measure: $\sum_{\theta_i \in \Theta_p} |\theta_i(t)|\, |\theta_i(t)\, g(\theta_i(t))|$. Using $|\theta_i(t)\, g(\theta_i(t))|$ to preserve loss, this measure retains training progress up to the current instant; using $|\theta_i(t)|$, this measure is biased towards removing small-magnitude parameters and should produce a model that converges faster than loss-preservation alone, thus improving accuracy. These expected properties are demonstrated in Table 1 and Figure 1: the proposed measure consistently outperforms loss-preservation based pruning and often outperforms magnitude-based pruning, especially when more rounds are used (see Table 1); train/test convergence rates for models pruned using this measure are better than those for loss-preservation pruning, and competitive with those of magnitude-based pruning (see Figure 1).
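Per filter, the proposed score only multiplies the loss-preservation term of Equation 3 by a magnitude factor; a short sketch (our own naming, with gradients assumed to be populated by a backward pass as before) follows.

```python
import torch
import torch.nn as nn

def proposed_importance(conv: nn.Conv2d) -> torch.Tensor:
    """Sum over the filter of |theta_i| * |theta_i * g_i|; biases removal toward small magnitudes."""
    w, g = conv.weight.detach(), conv.weight.grad.detach()
    return (w.abs() * (w * g).abs()).sum(dim=(1, 2, 3))
```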
4.2 GRADIENT FLOW AND LOSS-PRESERVATION BASED PRUNING

Loss-preservation based pruning methods use a first-order Taylor decomposition to determine the impact of pruning on model loss (see Equation 3). We now show that magnitude-based and loss-preservation based pruning measures are related by an order of time-derivative.

$$\frac{\partial \|\Theta(t)\|^2}{\partial t} = 2\,\Theta^T(t)\, \frac{\partial \Theta(t)}{\partial t} \overset{(a)}{=} -2\,\Theta^T(t)\, g(\Theta(t)), \qquad (9)$$

where (a) follows from Equation 6. This implies that

$$\left|\frac{\partial \|\Theta(t)\|^2}{\partial t}\right| = 2\left|\Theta^T(t)\, g(\Theta(t))\right|. \qquad (10)$$

Based on Equation 10, the following observations can be made.
Observation 2: Up to a constant, the magnitude of the time-derivative of the norm of model parameters (the score for magnitude-based pruning) is equal to the importance measure used for loss-preservation (Equation 3). Moreover, loss-preservation corresponds to removal of the slowest changing parameters.
Equation 10 implies that loss-preservation based pruning preserves more than model loss. It also preserves the first-order dynamics of a model's evolution. This result demonstrates that loss-preservation based importance is in fact a well-motivated measure for pruning models early-on in training, because it preserves a model's evolution trajectory. This explains why loss-preservation based pruning has been successfully used for pruning both trained (Molchanov et al. (2017; 2019)) and untrained (Lee et al. (2019)) models. We also draw another important conclusion pertaining to pruning of trained models: parameters that continue to change are more important for preserving model loss, while parameters with the least tendency to change are dispensable.
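Equation 9 can be checked numerically with ordinary SGD: over one small step of size $\eta$, a layer's squared norm changes by roughly $-2\eta\,\Theta^T g$, so filters with small $|\Theta^T g|$ are precisely the slowly changing ones. The snippet below is purely illustrative (arbitrary model, data, and step size).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))
x, y = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))
lr = 1e-3

w = model[0].weight
norm_before = w.detach().pow(2).sum().item()
F.cross_entropy(model(x), y).backward()
predicted = -2 * lr * (w.detach() * w.grad).sum().item()     # Equation 9, scaled by the step size

with torch.no_grad():                                         # one plain SGD step
    for p in model.parameters():
        p -= lr * p.grad

actual = w.detach().pow(2).sum().item() - norm_before
print(actual, predicted)                                      # the two values nearly agree
```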
Figure 2: Correlation between $|\sigma \Delta\sigma|$ and loss-preservation based importance (see Equation 3) at every 10th epoch for (a) VGG-13, (b) MobileNet-V1, and (c) ResNet-56. Also plotted is the distance between pruning masks (target ratio: 20% filters), as used by You et al. (2020) to decide when to prune a model. As the distance between pruning masks over consecutive epochs reduces, $|\sigma \Delta\sigma|$ becomes more correlated with loss-preservation importance.
Observation 3: Due to the closely related nature of magnitude-based and loss-preservation based measures, magnitude-based importance measures, when used with additional heuristics, can extend to preserve loss.
Pruning methods often use additional heuristics with existing importance measures to improve their efficacy. For example, You et al. (2020) recently showed that models become amenable to pruning early-on in training. To determine how much to train before pruning, they create a binary mask to represent filters that have low importance and are thus marked for pruning. A filter's importance is defined using the magnitude of its corresponding BatchNorm scale parameter. If the change in this binary mask is minimal over a predefined number of consecutive epochs, training is stopped and the model is pruned.

Due to the related nature of magnitude-based and loss-preservation based measures, as shown in Equation 10, we expect the use of additional heuristics can unknowingly extend an importance measure to preserve model properties that it was not intentionally designed to target. To demonstrate this concretely, we analyze the implications of the "train until minimal change" heuristic by You et al., as explained above. In particular, consider a certain scale parameter $\sigma$ that has a small magnitude and is thus marked for pruning. Its magnitude must continue to remain small to ensure the binary mask does not change. This implies the change in $\sigma$ over an epoch, or $\Delta\sigma$, must be small. These two constraints can be assembled in a single measure: $|\sigma \Delta\sigma|$. The continuous-time, per-iteration version of this product is $\left|\sigma \frac{\partial\sigma}{\partial t}\right|$. As shown in Equation 9 and Equation 10, $\left|\sigma \frac{\partial\sigma}{\partial t}\right|$ is the mathematical equivalent of loss-preservation based importance (Equation 3). Therefore, when applied per iteration, removing parameters with small magnitude, under the additional heuristic of minimal change, converts magnitude-based pruning into a loss-preservation strategy.
If parameters do not change much over several iterations, this argument further extends to change over an epoch. To confirm this, we train VGG-13, MobileNet-V1, and ResNet-56 models on CIFAR-100, record the distance between pruning masks for consecutive epochs, and calculate the correlation between loss-preservation based importance and $|\sigma \Delta\sigma|$. As shown in Figure 2, as the mask distance reduces, the correlation between the importance measures increases. This confirms that due to the intimate relationship between magnitude-based pruning and loss-preservation based pruning, when used with additional heuristics, magnitude-based pruning extends to preserve model loss as well.
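The check behind Figure 2 can be sketched as follows (a simplified version under our own naming; `train_one_epoch` is a placeholder that is assumed to leave `.grad` populated from its last batch): record every BatchNorm scale before and after an epoch, form $|\sigma \Delta\sigma|$, and correlate it against the loss-preservation score $|\sigma\, g(\sigma)|$ of Equation 3.

```python
import torch
import torch.nn as nn

def bn_scale_correlation(model: nn.Module, train_one_epoch) -> float:
    """Pearson correlation between |sigma * delta_sigma| and |sigma * grad_sigma| over BN scales."""
    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    before = torch.cat([m.weight.detach().clone() for m in bns])
    train_one_epoch(model)                                     # leaves .grad populated from the last batch
    after = torch.cat([m.weight.detach().clone() for m in bns])
    grads = torch.cat([m.weight.grad.detach().clone() for m in bns])

    mag_change = (after * (after - before)).abs()              # |sigma * delta_sigma|
    loss_pres = (after * grads).abs()                          # loss-preservation score for sigma
    return torch.corrcoef(torch.stack([mag_change, loss_pres]))[0, 1].item()
```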
4.3 GRADIENT FLOW AND GRADIENT-NORM BASED PRUNING

Having demonstrated the relationship between the first-order time derivative of the norm of model parameters and loss-preservation based pruning, we now consider the second-order time derivative of the norm of model parameters and demonstrate its connection with GraSP (Wang et al. (2020)), a method designed for use early-on in training to increase gradient-norm using pruning.

$$\frac{\partial^2 \|\Theta(t)\|^2}{\partial t^2} = \frac{\partial}{\partial t}\left(\frac{\partial \|\Theta(t)\|^2}{\partial t}\right) \overset{(a)}{=} -2\,\frac{\partial\left(\Theta^T(t)\, g(\Theta(t))\right)}{\partial t} = -2\left(g^T(\Theta(t))\,\frac{\partial \Theta(t)}{\partial t} + \Theta^T(t)\,\frac{\partial g(\Theta(t))}{\partial t}\right) \overset{(b)}{=} 2\left(\|g(\Theta(t))\|^2 + \Theta^T(t)\, H(\Theta(t))\, g(\Theta(t))\right), \qquad (11)$$

where (a) follows from Equation 9 and (b) follows from Equation 6. This implies that

$$\frac{\partial^2 \|\Theta(t)\|^2}{\partial t^2} = 2\left(\|g(\Theta(t))\|^2 + \Theta^T(t)\, H(\Theta(t))\, g(\Theta(t))\right). \qquad (12)$$

Equation 12 shows the second-order time derivative of the norm of model parameters is linearly related to the importance measure for increase in gradient-norm (Equation 5). In particular, parameters with a negative $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ reduce the "acceleration" at which a model approaches its final solution. Thus, removing them can speed up optimization.

Figure 3: $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ versus $\Theta_p^T(t)\, g(\Theta(t))$ for ResNet-56 models trained on CIFAR-100. Plots are shown for parameters (a) at initialization, (b) after 40 epochs of training, and (c) after complete (160 epochs) training. The correlation is averaged over 3 seeds and plots are for 1 seed. As shown, the measures are highly correlated throughout model training, indicating gradient-norm increase may severely affect model loss if a partially or completely trained model is pruned using $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$. Plots for other models are shown in the Appendix.

We now use the analysis above to demonstrate the limitations of increasing gradient-norm via pruning.
Observation 4: Increasing gradient-norm via pruning removes parameters that maximally increase model loss.
Extending the "acceleration" analogy, recall that if an object has increasingly negative velocity, its acceleration will be negative as well. As shown in Equation 9, the velocity of the norm of model parameters is a constant multiple of the first-order Taylor decomposition of model loss around the current parameter values. Thus, the parameter with the most negative value for $\Theta_p^T(t)\, g(\Theta(t))$ is likely to also have a large, negative value for $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$. From Equation 2, we observe that removing parameters with negative $\Theta_p^T(t)\, g(\Theta(t))$ increases model loss with respect to the original model. This indicates $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ may in fact increase the gradient norm by removing parameters that maximally increase model loss.

To test this claim, we plot $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ alongside $\Theta_p^T(t)\, g(\Theta(t))$ for VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-100 at various points in training and consistently find them to be highly correlated (see Figure 3). This strong correlation confirms that using $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ as an importance measure for network pruning increases the gradient norm by removing parameters that maximally increase model loss. Wang et al. (2020) remark in their work that it is possible that GraSP may increase the gradient-norm by increasing model loss. We provide evidence and rationale illuminating this remark: we show why preserving loss and increasing gradient-norm are antithetical. To mitigate this behavior, Wang et al. propose to use a large temperature value (200) before calculating the Hessian-gradient product. We now demonstrate the pitfalls of this solution and propose a more robust approach.
Observation 5: Preserving gradient-norm maintains second-order model evolution dynamics and results in better-performing models than increasing gradient-norm.
Equation 12 shows the measure $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ affects a model's second-order evolution dynamics. The analysis in Observation 4 shows $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ and $\Theta_p^T(t)\, g(\Theta(t))$ are highly correlated. These results together imply that preserving gradient-norm $\left(\left|\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))\right|\right)$ should preserve both second-order model evolution dynamics and model loss. Since the correlation with loss-preservation based importance is not perfect, this strategy is only an approximation of loss-preservation. However, it is capable of reducing the increase in model loss due to pruning.

Table 2: Accuracy of models pruned using different variants of gradient-norm based pruning on CIFAR-10/CIFAR-100 (over 3 seeds; base model accuracies are reported in parentheses; standard deviations are small for all models, except for GraSP (T=1) with 5/25 rounds, where they are larger; best results are in bold, second best are underlined). Gradient-norm preservation (|GraSP (T=1)|) outperforms both GraSP with large temperature (T=200) and without temperature (T=1).

Pruning Rounds               1 round                             5 rounds                            25 rounds
                 % pruned    GraSP     GraSP    |GraSP|          GraSP     GraSP    |GraSP|          GraSP     GraSP    |GraSP|
                             (T=200)   (T=1)    (T=1)            (T=200)   (T=1)    (T=1)            (T=200)   (T=1)    (T=1)
CIFAR-10
VGG-13 (93.1)      75%       91.62     91.32
MobileNet (92.3)   75%       89.46     90.03
ResNet-56 (93.1)   60%       91.01     90.81
CIFAR-100
VGG-13 (69.6)      65%       68.51     65.79
MobileNet (69.1)   65%       64.52     64.79
ResNet-56 (71.0)   50%       67.09     67.02

To demonstrate the efficacy of gradient-norm preservation and the effects of increase in model loss on GraSP, we prune VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. The results are shown in Table 2 and lead to the following conclusions. (i) When a single round of pruning is performed, the accuracy of GraSP is essentially independent of temperature. This implies that using temperature may be unnecessary if the model is close to initialization. (ii) In a few epochs of training, when reduction in loss can be attributed to training of earlier layers (Raghu et al. (2017)), GraSP without large temperature chooses to prune earlier layers aggressively. This permanently damages the model, and thus the accuracy of low-temperature GraSP decreases substantially with increasing training epochs. (iii) At high temperatures, the performance of pruned models is more robust to the number of rounds and pruning of earlier layers is reduced. However, the reduction in accuracy is still significant. (iv) These behaviors are mitigated by using the proposed alternative: gradient-norm preservation $\left(\left|\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))\right|\right)$. Since this measure preserves the gradient-norm and is correlated with loss-preservation, it focuses on ensuring the model's second-order training dynamics and loss remain the same, consistently resulting in the best performance.
5 CONCLUSION

In this paper, we revisit importance measures designed for pruning trained models to better justify their use early-on in training. Developing a general framework that relates these measures through the norm of model parameters, we analyze what properties of a parameter make it more dispensable according to each measure. This enables us to show that, from the lens of model evolution, the use of magnitude- and loss-preservation based measures is well-justified early-on in training. More specifically, by preserving parameters that enable fast convergence, magnitude-based pruning generally outperforms magnitude-agnostic methods. By removing parameters that have the least tendency to change, loss-preservation based pruning preserves first-order model evolution dynamics and is well-justified for pruning models early-on in training. We also explore implications of the intimate relationship between magnitude-based pruning and loss-preservation based pruning, demonstrating that one can evolve into the other as training proceeds. Finally, we analyze gradient-norm based pruning and show that it is linearly related to second-order model evolution dynamics. Due to this relationship, we find that increasing gradient-norm via pruning corresponds to removing parameters that maximally increase model loss. Since such parameters are concentrated in the initial layers early-on in training, this method can permanently damage a model's initial layers and undermine its ability to learn from later training. To mitigate this problem, we show the most robust approach is to prune parameters that preserve gradient norm, thus preserving a model's second-order evolution dynamics while pruning. In conclusion, our work shows the use of an importance measure for pruning models early-on in training is difficult to justify unless the measure's relationship with the evolution of a model's parameters over time is established. More generally, we believe new importance measures that specifically focus on pruning early-on should be directly motivated by a model's evolution dynamics.
ACKNOWLEDGEMENTS

We thank Puja Trivedi for numerous helpful discussions during the course of this project and several useful comments in the drafting of this paper.

REFERENCES
X. Ding, G. Ding, X. Zhou, Y. Guo, J. Han, and J. Liu. Global Sparse Momentum SGD for Pruning Very Deep Neural Networks. In Proc. Adv. in Neural Information Processing Systems, 2019.
E. Elsen, M. Dukhan, T. Gale, and K. Simonyan. Fast Sparse ConvNets. In Proc. Int. Conf. on Computer Vision and Pattern Recognition, 2020.
J. Frankle and M. Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proc. Int. Conf. on Learning Representations, 2019.
W. Gao, Y. Liu, C. Wang, and S. Oh. Rate Distortion For Model Compression: From Theory To Practice. In Proc. Int. Conf. on Machine Learning, 2019.
X. Glorot and Y. Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proc. Int. Conf. on Artificial Intelligence and Statistics, 2010.
S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proc. Int. Symp. on Computer Architecture, 2016a.
S. Han, J. Pool, J. Tran, and W. J. Dally. Learning Both Weights and Connections For Efficient Neural Networks. In Proc. Adv. in Neural Information Processing Systems, 2016b.
B. Hassibi and D. G. Stork. Second Order Derivatives For Network Pruning: Optimal Brain Surgeon. In Proc. Adv. in Neural Information Processing Systems, 1993.
S. Hayou, J. Ton, A. Doucet, and Y. W. Teh. Pruning Untrained Neural Networks: Principles and Analysis. arXiv preprint, arXiv:2002.08797v3 [cs.LG], 2020.
K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proc. Int. Conf. on Computer Vision, 2015.
Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft Filter Pruning For Accelerating Deep Convolutional Neural Networks. In Proc. Int. Joint Conf. on Artificial Intelligence, 2018.
Y. LeCun, J. S. Denker, and S. A. Solla. Optimal Brain Damage. In Proc. Adv. in Neural Information Processing Systems, 1990.
N. Lee, T. Ajanthan, and P. Torr. SNIP: Single-Shot Network Pruning Based on Connection Sensitivity. In Proc. Int. Conf. on Learning Representations, 2019.
N. Lee, T. Ajanthan, S. Gould, and P. Torr. A Signal Propagation Perspective For Pruning Neural Networks at Initialization. In Proc. Int. Conf. on Learning Representations, 2020.
H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning Filters For Efficient ConvNets. In Proc. Int. Conf. on Learning Representations, 2017.
Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning Efficient Convolutional Networks Through Network Slimming. In Proc. Int. Conf. on Computer Vision, 2017.
J. Luo, H. Zhang, H. Zhou, C. Xie, J. Wu, and W. Lin. ThiNet: Pruning CNN Filters For a Thinner Net. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2019.
E. Malach, G. Yehudai, S. Shalev-Shwartz, and O. Shamir. Proving the Lottery Ticket Hypothesis: Pruning is All You Need. In Proc. Int. Conf. on Machine Learning, 2020.
P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning Convolutional Neural Networks For Resource Efficient Inference. In Proc. Int. Conf. on Learning Representations, 2017.
P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance Estimation For Neural Network Pruning. In Proc. Int. Conf. on Computer Vision and Pattern Recognition, 2019.
V. Nagarajan and Z. Kolter. Generalization in Deep Networks: The Role of Distance from Initialization. In Workshop on Deep Learning: Bridging Theory and Practice, 2017.
M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis For Deep Learning Dynamics and Interpretability. In Proc. Adv. in Neural Information Processing Systems, 2017.
L. Theis, I. Korshunova, A. Tejani, and F. Huszár. Faster Gaze Prediction with Dense Networks and Fisher Pruning. In Proc. Euro. Conf. on Computer Vision, 2018.
C. Wang, G. Zhang, and R. Grosse. Picking Winning Tickets Before Training by Preserving Gradient Flow. In Proc. Int. Conf. on Learning Representations, 2020.
J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers. In Proc. Int. Conf. on Learning Representations, 2018.
M. Ye, C. Gong, L. Nie, D. Zhou, A. Klivans, and Q. Liu. Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. In Proc. Int. Conf. on Machine Learning, 2020.
H. You, C. Li, P. Xu, Y. Fu, Y. Wang, X. Chen, R. G. Baraniuk, Z. Wang, and Y. Lin. Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. In Proc. Int. Conf. on Learning Representations, 2020.
A ORGANIZATION
The appendix is organized as follows:
• Appendix B: Details the setup for training base models.
• Appendix C: Details the setup for training pruned models.
• Appendix D: Train/test curves for magnitude-based and loss-preservation based pruned models.
• Appendix E: More results on gradient-norm based pruning:
  - Section E.1: Scatter plots demonstrating correlation between $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ and $\Theta_p^T(t)\, g(\Theta(t))$.
  - Section E.2: Train/test curves for gradient-norm based variants used in the main paper.

B TRAINING BASE MODELS
We analyze models trained on the CIFAR-10 and CIFAR-100 datasets. All models are trained with the exact same setup. In all our experiments, we average results across 3 seeds.

Since pruning has a highly practical motivation, we argue analyzing a variety of models that capture different architecture classes is important. To this end, we use VGG-13 (a vanilla CNN), MobileNet-V1 (a low-redundancy CNN designed specifically for efficient computation), and ResNet-56 (a residual model to analyze effects of pruning under skip connections). The final setup is as follows:
• Optimizer: SGD,
• Momentum: . ,
• Weight decay: . ,
• Learning rate schedule: (0. , . , . ),
• Number of epochs for each learning rate: (80, , ),
• Batch size: .

C TRAINING PRUNED MODELS
The general setup for training pruned models is the same as that for training base models. However, for a fair comparison, if the prune-and-train framework takes $n$ rounds, we subtract $n$ epochs from the number of epochs allotted to the highest learning rate. This ensures the same amount of training budget for all models and pruning strategies. Note that even if this subtraction is not performed, the results do not vary much, as the model already converges (see Appendix D for train/test curves). The final setup is as follows.
• Optimizer: SGD,
• Momentum: . ,
• Weight decay: . ,
• Learning rate schedule: (0. , . , . ),
• Number of epochs for each learning rate: (80 − number of pruning rounds, , ),
• Batch size: .

Since we prune models in a structured manner, our target pruning ratios are chosen in terms of the number of filters to be pruned. A rough translation in terms of parameters follows:
• VGG-13: 75% filters (∼94% parameters) for CIFAR-10; 65% filters (∼89% parameters) for CIFAR-100,
• MobileNet-V1: 75% filters (∼94% parameters) for CIFAR-10; 65% filters (∼89% parameters) for CIFAR-100,
• ResNet-56: 60% filters (∼88% parameters) for CIFAR-10; 50% filters (∼83% parameters) for CIFAR-100.

The percentage of pruned parameters is much larger than the percentage of pruned filters because filters from deeper layers are generally pruned more heavily. These filters form the bulk of a model's parameters, thus resulting in high parameter percentages.
D TRAIN/TEST PLOTS
This section provides further demonstration that magnitude-based pruned models converge faster than loss-preservation based pruned models. For magnitude-based pruning, the importance is defined as the $\ell_2$ norm of a filter; for loss-preservation, the importance is defined as a variant of SNIP (Lee et al. (2019)) applied to an entire filter. The plots show train/test curves for VGG-13, MobileNet-V1, and ResNet-56 models for 1, 5, and 25 pruning rounds each. Also plotted are train/test curves for the proposed extension to loss-preservation, which intentionally biases loss-preservation towards removing small-magnitude parameters (see Section 4.1 for more details).

D.1 VGG-13

Figure 4: VGG-13: 1 round of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 5: VGG-13: 5 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 6: VGG-13: 25 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).

D.2 MOBILENET-V1

Figure 7: MobileNet-V1: 1 round of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 8: MobileNet-V1: 5 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 9: MobileNet-V1: 25 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).

D.3 RESNET-56

Figure 10: ResNet-56: 1 round of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 11: ResNet-56: 5 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 12: ResNet-56: 25 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).

E MORE RESULTS ON GRADIENT-NORM BASED PRUNING
This section provides further results on gradient-norm based pruning. The general implementation of these measures requires the calculation of Hessian-gradient products. Similar to the implementation by Wang et al. (2020), we define a constant amount of memory that stores randomly selected samples for all classes. For the original GraSP variant that increases gradient norm, we follow the original implementation and use a temperature of 200 during the calculation of the Hessian-gradient product. For the GraSP variant without large temperature, this temperature value is brought down to 1. Finally, for the gradient-norm preservation measure, the GraSP variant without large temperature is used alongside an absolute value operator to remove parameters that least change the gradient norm (i.e., least affect second-order model evolution dynamics).
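One simple way to build such a class-balanced batch is sketched below (our own illustrative code; the per-class sample count is arbitrary, and the dataset is assumed to return tensors). The three variants in Table 2 then differ only in whether the logits are divided by a temperature before the loss used for the Hessian-gradient product and whether the absolute value of the resulting score is taken.

```python
import torch
from torch.utils.data import Dataset

def class_balanced_batch(dataset: Dataset, num_classes: int, per_class: int = 10, seed: int = 0):
    """Randomly pick `per_class` samples from every class for the Hessian-gradient computation."""
    order = torch.randperm(len(dataset), generator=torch.Generator().manual_seed(seed)).tolist()
    buckets = {c: [] for c in range(num_classes)}
    for i in order:
        x, y = dataset[i]
        y = int(y)
        if len(buckets[y]) < per_class:
            buckets[y].append(x)
        if all(len(v) == per_class for v in buckets.values()):
            break
    xs = torch.stack([x for c in range(num_classes) for x in buckets[c]])
    ys = torch.tensor([c for c in range(num_classes) for _ in range(per_class)])
    return xs, ys
```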
E.1 SCATTER PLOTS FOR $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ VS. $\Theta_p^T(t)\, g(\Theta(t))$

This subsection provides scatter plots demonstrating the highly correlated nature of $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ and $\Theta_p^T(t)\, g(\Theta(t))$. The correlation is averaged over 3 seeds and plots are for 1 seed. As can be seen in the plots, the measures are highly correlated throughout model training, indicating gradient-norm increase may severely affect model loss if a partially trained or completely trained model is pruned using $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$.

Figure 13: $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ versus $\Theta_p^T(t)\, g(\Theta(t))$ for VGG-13 models trained on CIFAR-100. Plots are shown for parameters (a) at initialization, (b) after 40 epochs of training, and (c) after complete (160 epochs) training.

Figure 14: $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ versus $\Theta_p^T(t)\, g(\Theta(t))$ for MobileNet-V1 models trained on CIFAR-100. Plots are shown for parameters (a) at initialization, (b) after 40 epochs of training, and (c) after complete (160 epochs) training.

Figure 15: $\Theta_p^T(t)\, H(\Theta(t))\, g(\Theta(t))$ versus $\Theta_p^T(t)\, g(\Theta(t))$ for ResNet-56 models trained on CIFAR-100. Plots are shown for parameters (a) at initialization, (b) after 40 epochs of training, and (c) after complete (160 epochs) training.

E.2 TRAIN/TEST CURVES FOR GRADIENT-NORM BASED PRUNING VARIANTS
This subsection provides train/test curves for the different variants of gradient-norm based pruning methods considered in this paper. These curves demonstrate that the large-temperature GraSP variant does not substantially improve convergence rate, if ever, in comparison to gradient-norm preservation. On the other hand, it does result in significant performance loss when the model is even slightly trained.

E.2.1 VGG-13

Figure 16: VGG-13: 1 round of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 17: VGG-13: 5 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).
Figure 18: VGG-13: 25 rounds of pruning. CIFAR-10 (left); CIFAR-100 (right).

E.2.2 MOBILENET-V1