Going Beyond Classification Accuracy Metrics in Model Compression
Vinu Joseph, Shoaib Ahmed Siddiqui, Aditya Bhaskara, Ganesh Gopalakrishnan, Saurav Muralidharan, Michael Garland, Sheraz Ahmed, Andreas Dengel
University of Utah, TU Kaiserslautern, German Research Center for Artificial Intelligence (DFKI), NVIDIA
Abstract
Model compression is a ubiquitous tool that brings the power of modern deep learning to edge devices with power and latency constraints. The goal of model compression is to take a large reference neural network and output a smaller and less expensive compressed network that is functionally equivalent to the reference. Compression typically involves pruning and/or quantization, followed by re-training to maintain the reference accuracy. However, it has been observed that compression can lead to a considerable mismatch in the labels produced by the reference and the compressed models, resulting in bias and unreliability. To combat this, we present a framework that uses a teacher-student learning paradigm to better preserve labels. We investigate the role of additional terms in the loss function and show how to automatically tune the associated parameters. We demonstrate the effectiveness of our approach both quantitatively and qualitatively on multiple compression schemes and accuracy recovery algorithms using a set of 8 different real-world network architectures. We obtain a significant reduction of up to . × in the number of mismatches between the compressed and reference models, and up to . × in cases where the reference model makes the correct prediction.
1. Introduction
Modern deep learning owes much of its success to the ability to train large models by leveraging data sets of ever-increasing size [36]. The best-performing models for computer vision [65] and natural language processing applications [16] tend to have tens to hundreds of layers and hundreds of millions of parameters.

* Joint first authors. Supported in part by NSF Awards 1704715 and 1817073. This work was in part also supported by the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program. Vinu Joseph is supported in part by an NVIDIA Graduate Research Fellowship. The third author acknowledges the support from NSF CCF-2008688. Correspondence to Vinu Joseph: [email protected]

Figure 1: Comparing the number of Compression Impacted Pixels (CIPs) for different loss functions on a brain MRI FLAIR segmentation task [7]: (a) Uncompressed, (b) Cross Entropy, (c) Logit Pairing, (d) Combo. CIPs are pixels classified differently by the original and compressed models. The red contour represents the prediction, while the green contour represents the ground truth.

However, an increasing number of applications, including autonomous driving [53], surveillance [68], and voice assistance systems [2], demand ML models that can be deployed on low-power and low-resource devices, and typically have strong latency requirements [14, 44]. In such applications, the notion of model compression has gained popularity; at a high level, model compression involves taking a reference model and producing a compressed model that is lightweight in terms of computational requirements, while being functionally equivalent to the reference model (i.e., it produces the same classification outputs on all inputs). Model compression has been studied extensively in computer vision as well as other domains [12], leveraging techniques such as structured weight pruning [61, 51, 18, 25, 50, 22, 17, 23, 56, 32, 3, 52], quantization [72, 20] and low-rank factorization [42, 45, 67, 15]. A natural question, then, is: can compression schemes ensure that the compressed and reference networks are close at a semantic or feature level? This question is challenging because networks can have different numbers of layers, features per layer, and different connectivity structures. Moreover, the answer generally depends on the architecture of the reference network and the task at hand. The goal of our paper is to use ideas from knowledge distillation (such as logit pairing [4, 30]) to introduce new terms into the objective function of a compression scheme and help answer the above question. With new loss terms, the challenge now is to understand the relative importance of the terms and to measure the impact they have on the overall objective.
Compression Impacted Exemplars (CIEs)

To measure the impact of compression, we use a metric proposed in a series of work by Hooker et al. [31]: counting the number of classification mismatches between the reference and the compressed model. This number can be more informative than the accuracy of the compressed model alone. For example, a lot of research has focused on making models more unbiased, more fair, and more robust; if we have a reference model obtained using such methods, we would want compression methods to preserve these properties. In fact, we also consider a subset of CIEs (which we call CIE-Us, see Section 3) and report both CIE and CIE-U numbers. Note that [31] used the slightly different name of Compression Identified Exemplars; our modification is done to ensure compatibility with a similar notation for segmentation that we will define later.
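Operationally, CIEs and CIE-Us can be counted by comparing the predictions of the two models over a dataset. The following is a minimal PyTorch sketch; the function name and data-loader interface are our own illustration, not code from our released implementation.

```python
import torch

@torch.no_grad()
def count_cies(reference, compressed, loader, device="cuda"):
    """Count CIEs (reference and compressed models disagree) and CIE-Us
    (they disagree *and* the reference model is correct)."""
    reference.eval()
    compressed.eval()
    n_cie = n_cie_u = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        y_ref = reference(x).argmax(dim=1)   # reference model labels
        y_cmp = compressed(x).argmax(dim=1)  # compressed model labels
        mismatch = y_ref != y_cmp
        n_cie += int(mismatch.sum())
        n_cie_u += int((mismatch & (y_ref == y)).sum())
    return n_cie, n_cie_u
```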
Our key contributions in this work are as follows:

• Employing additional loss terms in the compression objective based on the teacher-student paradigm, so as to align the predictions of the reference and the compressed models. We show for the first time that such a pairing can be extended to other tasks, by considering semantic segmentation. Figure 1 shows an example of the effect of the different loss terms on the number of model mismatches.

• Analyzing automated strategies for tuning the hyperparameters associated with our multi-part loss functions. We demonstrate that our framework is robust to the choice of tuning strategies, and uniform weighting works as well as more intricate strategies across different datasets and reference network architectures.

• Through extensive experiments, we validate the effectiveness of our framework and show that it not only improves metrics such as the number of CIEs, but also yields better compression accuracy compared to previous approaches.

While the teacher-student paradigm is a coarse way to capture “semantic similarity” between the reference and compressed models, our results show that it can nonetheless be highly effective in reducing the number of CIEs. We also remark that our methodology can work with any compression scheme that allows us to specify a custom objective that can be optimized to produce a compressed model.
2. Background and Related Work
In this section, we provide a brief overview of recent model compression approaches, followed by a more detailed background on group sparsity-based model compression and CIE reduction.
Deep neural networks are heavy on computation and memory by design, creating an impediment to operating these networks on resource-constrained platforms. To alleviate this constraint, several branches of work have been proposed to reduce the size of an existing neural network. The most commonly employed approach is to reduce the number of weights, neurons, or layers in a network while maintaining approximately the same performance [39]. This approach was first explored on DNNs in early work such as [46, 27]. Studies conducted by [25, 24] showed that simple unstructured pruning can reduce the size of the network by pruning unimportant connections within the network. However, such unstructured pruning strategies produce large sparse weight matrices that are computationally inefficient unless equipped with specialized hardware [35]. To resolve this issue, structured pruning methods were proposed, where entire channels are pruned simultaneously to ensure that the pruned network can be naturally accelerated on commodity hardware [47, 32, 62]. More recently, Renda et al. [57] proposed the rewind algorithm, which is similar to simple fine-tuning of the network to regain the accuracy lost during the pruning step: the sparsity level of the model is updated in small steps, where each step increases the sparsity of the model and is followed by fine-tuning. The two major schemes for structured pruning are based on either filter pruning [40] or low-rank tensor factorization [49]. Both of these approaches enable direct acceleration of the networks, in contrast to unstructured pruning. Li et al. [49] explored the relationship between tensor factorization and general pruning methods, and proposed a unified approach based on a sparsity-inducing norm which can be interpreted as both tensor factorization and direct filter pruning. By simply changing the way the sparsity regularization is enforced, filter pruning and low-rank decomposition can be derived accordingly. This is particularly important for the compression of popular network architectures with shortcut connections (e.g., ResNet), where filter pruning cannot deal with the last convolutional layer in a ResBlock while low-rank decomposition methods can.
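For concreteness, here is a minimal PyTorch sketch of structured, filter-level pruning in the spirit of magnitude-based approaches such as [47]; the function name and interface are our own illustration, not code from any of the systems cited above.

```python
import torch
import torch.nn as nn

def prune_filters_by_l1(conv: nn.Conv2d, fraction: float) -> torch.Tensor:
    """Zero out the `fraction` of output filters with the smallest L1 norm.

    Whole filters (output channels) are removed, so the layer can later
    be physically shrunk and accelerated on commodity hardware. Returns
    a boolean mask over output channels (True = kept).
    """
    with torch.no_grad():
        norms = conv.weight.abs().sum(dim=(1, 2, 3))  # one L1 norm per filter
        k = int(fraction * norms.numel())
        mask = torch.ones_like(norms, dtype=torch.bool)
        if k > 0:
            drop = torch.topk(norms, k, largest=False).indices
            mask[drop] = False
            conv.weight[~mask] = 0.0  # zero the pruned filters in place
            if conv.bias is not None:
                conv.bias[~mask] = 0.0
    return mask
```

In a full pipeline, such a masking step would be followed by fine-tuning (or rewinding [57]) to recover the accuracy lost during pruning.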
Accuracy Recovery Algorithms
General accuracy recovery algorithms capable of handling a wide variety of compression techniques provide the foundation for modern compression systems. Prior work in this domain includes the LC algorithm [9], ADAM-ADMM [71] and DCP [73]. More recently, the Rewind [57] and Group-Sparsity [49] algorithms have been demonstrated to be state-of-the-art compression algorithms. Due to their compression-scheme-agnostic nature, we build upon these two methods in our paper to evaluate the proposed label-preservation-aware loss functions, as described in Section 3.1.
Network Distillation
Another branch of network compression, initially proposed by [30], attempts to distill knowledge from a large teacher network to a small student network. Under the assumption that the knowledge captured by a network is reflected in its output probability distribution, this line of work trains the student network to mimic the probability distribution produced by the teacher network. Since the networks are trained to output one-hot distributions, a temperature T is used to diffuse the probability mass. Advanced methods of distillation have achieved much more effective transfer by transferring not only the output logits but also information from the intermediate activations, as in [70, 58, 37, 1]. Although network distillation was presented as a general form of logit pairing, it is quite difficult to obtain improvements during distillation without spending considerable effort manually tuning the temperature T for the softmax layer. In contrast, pure logit pairing comes without any additional cost of manual hyperparameter tuning. We therefore employ pure logit pairing instead of knowledge distillation in our approach.

Group-Sparsity based Model Compression

We now briefly describe the key insight of the compression recovery algorithm used in our evaluation. The main idea in the Group-Sparsity recovery algorithm [49] is that filter pruning and filter decomposition both seek a compact approximation of the parameter tensors, despite the different operational forms with which they cope with different application scenarios. Consider a vectorized image patch $x \in \mathbb{R}^{m \times 1}$ and a group of $n$ filters $W = \{w_1, \cdots, w_n\} \in \mathbb{R}^{m \times n}$. Pruning methods remove output channels and approximate the original output $x^T W$ as $x^T C$, where $C \in \mathbb{R}^{m \times k}$ only has $k$ output channels. Filter decomposition methods approximate $W$ as the product of two filters $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, making $AB$ a rank-$k$ approximation of $W$. Thus, both pruning- and decomposition-based methods seek a compact approximation of the original network parameters, but adopt different strategies for the approximation. The weight parameters $W$ are usually trained with some regularization, such as weight decay, to constrain the hypothesis class. To obtain structured pruning of the filter, structured sparsity regularization is used to constrain the filter:

$\min_{W} \; L(y, \Phi(x; W)) + \mu D(W) + \lambda R(W)$,   (1)

where $D(\cdot)$ and $R(\cdot)$ represent the weight decay and sparsity regularization terms respectively, while $\mu$ and $\lambda$ are the regularization factors. Instead of directly regularizing the matrix $W$ [69, 48], we enforce group sparsity constraints by incorporating a sparsity-inducing matrix $A \in \mathbb{R}^{n \times n}$, which can be converted into the filter of a 1×1 convolution layer placed after the original layer. The original convolution $Z = X \times W$ then becomes $Z = X \times (W \times A)$. To obtain a structured sparse matrix, group sparsity regularization is enforced on $A$, and the loss function in Eqn. 1 becomes

$\min_{W, A} \; L(y, \Phi(x; W, A)) + \mu D(W) + \lambda R(A)$.   (2)

Solving the problem in Eqn. 2 results in structured group sparsity in matrix $A$. By considering matrices $W$ and $A$ together, the actual effect is that the original convolutional filter is compressed.
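As a minimal sketch of this idea (our own illustration, not the implementation of [49]), the original convolution can be wrapped with a sparsity-inducing 1×1 convolution A, and a group-lasso penalty on A's columns then plays the role of R(A) in Eqn. 2:

```python
import torch
import torch.nn as nn

class SparsifiedConv(nn.Module):
    """Wrap a convolution W with a sparsity-inducing 1x1 convolution A,
    so that Z = X x (W x A) as in Eqn. 2."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        n = conv.out_channels
        self.conv = conv
        self.A = nn.Conv2d(n, n, kernel_size=1, bias=False)
        with torch.no_grad():  # start A as the identity, leaving behavior unchanged
            self.A.weight.copy_(torch.eye(n).view(n, n, 1, 1))

    def forward(self, x):
        return self.A(self.conv(x))

def group_lasso(a_weight: torch.Tensor) -> torch.Tensor:
    """R(A): sum of L2 norms over the columns of the n x n matrix A.
    Driving an entire column to zero removes one output channel."""
    n = a_weight.size(0)
    return a_weight.view(n, n).norm(dim=0).sum()

# Sketch of the objective in Eqn. 2:
#   loss = task_loss + mu * weight_decay_term + lam * group_lasso(layer.A.weight)
```

After training, columns of A that have been driven to zero can be removed, and W × A can be merged back into a single, smaller convolution.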
Top-1 accuracy is just one among many possible ways of characterizing the quality of a compressed model. An alternative approach involves counting all the inputs for which the compressed model disagrees with the original, uncompressed model. Each such input is termed a Compression Impacted Exemplar (CIE), following the definition by Hooker et al. [31] (see also footnote 1). While CIE counts every disagreement, we pay particular attention to the subset on which the reference model is also correct, which we call CIE-U. In Section 3.1, we explore novel loss formulations that target both CIE and CIE-U reduction during compression.

Figure 2: High-level overview of the proposed system. A reference model is compressed within the hypothesis space defined by a Condensa compression scheme (box H); the label-preservation-aware loss function learner (box L) combines cross-entropy and logit pairing losses, selecting the loss weights via baseline methods (random, round-robin, uniform selection), non-gradient-based methods (using loss function statistics), or gradient-based methods (a learning setup for the loss weight terms); and accuracy recovery optimizers (Condensa-LC, Rewind, Group-Sparsity) produce the compressed model. Example ImageNet input: the reference model predicts the original label “stretcher”, while the compressed model predicts “mountain bike”.

CIE reduction has received relatively little attention from the research community, with recent work by Hooker et al. [31] being the only one that we are aware of that tries to identify and reduce CIEs. Their primary approach involves re-weighting CIEs, where they consider a mitigation strategy of fine-tuning the compressed model for a certain number (chosen to be 3000) of iterations while up-weighting the CIEs relative to the rest of the dataset. Their approach is sensitive to hyperparameters such as: (1) the choice of the number of fine-tuning iterations, (2) a threshold (90th percentile) above which all exemplars are up-weighted, and (3) an up-weighting value λ > 1 for CIEs. We believe that our approach is more principled for a number of reasons. First, we pose CIE mitigation as a general label-preservation problem and extensively explore several loss functions to mitigate it, without introducing any new hyperparameters beyond those initially used during model compression, and without changing any of the values of the original compression hyperparameters. Second, we are agnostic to the compression scheme and compression algorithm.

Networks that perform challenging tasks or multiple tasks often require a combination of losses to work. Considerable effort has been made towards understanding the role of different loss terms [33, 6, 11], and how best to combine them. While most of the prior work combines these losses using either ad-hoc or equal weights, researchers have recently tried to develop systematic methods to adjust the weights on the linear combination of loss components. These methods often require defining new loss functions [6] or changing the optimization procedure [11]. As we describe in Section 3.2, we compare three different hyperparameter tuning strategies to optimize our multi-part loss function.
3. Methodology
Figure 2 provides a high-level overview of the proposed system. Given a reference model $\overline{W}$, we wish to obtain a compressed model $W$ that: (i) obtains the same or better accuracy as $\overline{W}$, and (ii) minimizes the number of CIEs and CIE-Us. Recall that, as defined in Section 2.2, a CIE is any input $x$ on which the reference and compressed models disagree (i.e., $y_W \neq y_{\overline{W}}$), and a CIE-U is a CIE for which the reference model's output matches the ground-truth label while the compressed model's does not (i.e., $y_{\overline{W}} = y \wedge y_W \neq y$). The middle part of the figure (box labeled L) depicts the proposed learner that uses a label-preservation-aware loss function. L automatically optimizes the cross-entropy, distillation and logit pairing losses to realize CIE and CIE-U reductions in the compressed model $W$. Further, as depicted in boxes H and L, our approach is agnostic to the compression scheme and accuracy recovery algorithm used for compressing the reference model $\overline{W}$. We now describe the loss functions and the learning schemes used for selecting the associated hyperparameters.

Most known model compression approaches optimize the cross-entropy loss with respect to the target labels, i.e., they minimize

$L_{CE} = \frac{1}{N} \sum_{i=1}^{N} CE(\sigma(\Phi(x_i; W)), y_i)$,   (3)

where $\Phi(x_i; W)$ represents the output of the network (before softmax) parameterized by the compressed weights $W$, $\sigma$ represents the softmax function applied to the network's output, and $N$ denotes the number of examples in a mini-batch. Notice that this loss function does not explicitly encourage any alignment between the compressed and the uncompressed networks, and only requires the compressed network to produce the correct output labels.

To explicitly encourage an alignment between the compressed and uncompressed models, one approach is to introduce a logit pairing objective (as in teacher-student frameworks [4, 30]) which encourages the logits of the two models to be well-aligned with each other:

$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \| \Phi(x_i; W) - \Phi(x_i; \overline{W}) \|^2$,   (4)

where $\overline{W}$ represents the weights of the uncompressed model. The overall loss function can now be taken to be a combination:

$L = \alpha \cdot L_{CE} + \beta \cdot L_{MSE}$,   (5)

where $\alpha$ and $\beta$ are the corresponding weighting factors for the two losses. Here, the logit pairing term attempts to minimize the difference in the functional form of the two models, while the cross-entropy term can be viewed as placing an additional weight on terms corresponding to CIE-Us, where the uncompressed model makes the right prediction.

The MSE logit pairing objective is one way to ensure that the models have similar behavior. We also considered the following term, which applies to settings where the reference model is “unsure” of the class, i.e., the logits corresponding to two different classes are close in magnitude, making them difficult to disambiguate. To help in such cases, we consider an additional term that effectively minimizes the cross-entropy between the prediction of the compressed model and the label predicted by the uncompressed model. In other words,

$L_{CEPred} = \frac{1}{N} \sum_{i=1}^{N} CE(\sigma(\Phi(x_i; W)), \arg\max_j \Phi(x_i; \overline{W})_j)$,   (6)

where the second term is simply the prediction of the uncompressed model on the input $x_i$. When the logits for two classes are close, this term ensures that the compressed model better respects the ordering between the logit values. The final loss function that we optimize thus has the form

$L = \alpha \cdot L_{CE} + \beta \cdot L_{MSE} + \gamma \cdot L_{CEPred}$.   (7)
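For concreteness, the combined objective can be sketched in a few lines of PyTorch; the function name and signature below are our own illustration of Eqns. 3-7 rather than our released implementation, and the reference logits are detached so that only the compressed model is updated:

```python
import torch.nn.functional as F

def label_preservation_loss(logits_c, logits_r, targets,
                            alpha=1.0, beta=1.0, gamma=0.0):
    """L = alpha*L_CE + beta*L_MSE + gamma*L_CEPred (Eqn. 7).

    logits_c: Phi(x; W), outputs of the compressed model
    logits_r: Phi(x; W_bar), outputs of the reference model
    targets:  ground-truth labels y
    """
    logits_r = logits_r.detach()               # the teacher stays frozen
    l_ce = F.cross_entropy(logits_c, targets)  # Eqn. 3
    l_mse = F.mse_loss(logits_c, logits_r)     # Eqn. 4, logit pairing
    l_pred = F.cross_entropy(logits_c, logits_r.argmax(dim=1))  # Eqn. 6
    return alpha * l_ce + beta * l_mse + gamma * l_pred
```

Setting some of $\alpha$, $\beta$, $\gamma$ to zero recovers the individual loss subsets studied in our experiments.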
It is non-trivial to come up with the right weights for these multi-part loss functions. In the next section, we describe how we automatically tune the hyper-parameters $\alpha$, $\beta$ and $\gamma$.

We experiment with three different ways of setting the hyper-parameters $\alpha$, $\beta$, and $\gamma$: UNIFORM, LEARNABLE, and SOFTADAPT. To better understand the contributions of the individual loss terms to the final model, we also evaluate the 7 possible subsets obtained using the three loss terms in Eq. 7. These subsets correspond to setting some of the loss weights to zero, while optimizing the rest of the weights using the three methods we describe.
| Dataset | Architecture | Loss function | Uncompressed Acc. | Compressed Acc. | CIEs | CIE-Us |
|---|---|---|---|---|---|---|
| CIFAR-100 | ResNet-164 | CE (Distillation) | –.78% | 74.95% | 1925 | 717 |
| CIFAR-100 | ResNet-164 | Uniform (MSE) | –.78% | 76.–% | – | – |
| CIFAR-100 | ResNet-164 | Uniform (CE, MSE) | –.78% | 76.62% | 1237 (1.–×) | 386 (1.–×) |
| CIFAR-100 | ResNet-20 | CE (Distillation) | –.83% | 66.96% | 1903 | 639 |
| CIFAR-100 | ResNet-20 | Uniform (MSE) | –.83% | 68.–% | – | – |
| CIFAR-100 | ResNet-20 | Uniform (CE, MSE) | –.83% | 69.13% | 717 (2.–×) | 184 (3.–×) |
| CIFAR-100 | ResNeXt-164 | CE (Distillation) | –.87% | 74.40% | 2092 | 788 |
| CIFAR-100 | ResNeXt-164 | Uniform (MSE) | –.87% | 76.–% | – | – |
| CIFAR-100 | ResNeXt-164 | Uniform (CE, MSE) | –.87% | 75.95% | 1523 (1.–×) | 526 (1.–×) |
| CIFAR-100 | ResNeXt-20 | CE (Distillation) | –.95% | 70.98% | 1609 | 535 |
| CIFAR-100 | ResNeXt-20 | Uniform (MSE) | –.95% | 72.–% | – | – |
| CIFAR-100 | ResNeXt-20 | Uniform (CE, MSE) | –.95% | 72.16% | 792 (2×) | 202 (2.–×) |
| CIFAR-10 | DenseNet | CE (Distillation) | –.62% | 92.90% | 550 | 317 |
| CIFAR-10 | DenseNet | Uniform (MSE) | –.62% | 93.37% | 496 (1.–×) | 274 (1.–×) |
| CIFAR-10 | DenseNet | Uniform (CE, MSE) | –.62% | 94.–% | – | – |
| CIFAR-10 | ResNet-164 | CE (Distillation) | –.03% | 93.71% | 466 | 266 |
| CIFAR-10 | ResNet-164 | Uniform (MSE) | –.03% | 94.19% | 381 (1.–×) | 208 (1.–×) |
| CIFAR-10 | ResNet-164 | Uniform (CE, MSE) | –.03% | 94.–% | – | – |
| CIFAR-10 | ResNet-20 | CE (Distillation) | –.54% | 90.54% | 657 | 381 |
| CIFAR-10 | ResNet-20 | Uniform (MSE) | –.54% | 92.28% | 396 (1.–×) | 176 (2.–×) |
| CIFAR-10 | ResNet-20 | Uniform (CE, MSE) | –.54% | 92.–% | – | – |
| CIFAR-10 | ResNet-56 | CE (Distillation) | –.09% | 91.73% | 589 | 323 |
| CIFAR-10 | ResNet-56 | Uniform (MSE) | –.09% | 91.73% | 572 (1.–×) | 311 (1.–×) |
| CIFAR-10 | ResNet-56 | Uniform (CE, MSE) | –.09% | 92.08% | 530 (1.–×) | – |
| CIFAR-10 | ResNeXt-164 | CE (Distillation) | –.18% | 93.74% | 472 | 276 |
| CIFAR-10 | ResNeXt-164 | Uniform (MSE) | –.18% | 92.87% | 576 (0.–×) | 373 (0.–×) |
| CIFAR-10 | ResNeXt-164 | Uniform (CE, MSE) | –.18% | 93.83% | 462 (1.–×) | 269 (1.–×) |
| CIFAR-10 | ResNeXt-20 | CE (Distillation) | –.54% | 90.92% | 783 | 422 |
| CIFAR-10 | ResNeXt-20 | Uniform (MSE) | –.54% | 91.37% | 674 (1.–×) | 345 (1.–×) |
| CIFAR-10 | ResNeXt-20 | Uniform (CE, MSE) | –.54% | 92.–% | – | – |
| ImageNet | ResNet-50 | CE (Distillation) | –.01% | 76.35% | 4491 | 1185 |
| ImageNet | ResNet-50 | Uniform (MSE) | –.01% | 76.13% | 1890 (–×) | – |
| ImageNet | ResNet-50 | Uniform (CE, MSE) | –.01% | 76.48% | 2911 (1.–×) | 704 (1.–×) |
| Brain MRI | U-Net | Uniform (CE) | – | – | – | – |
| Brain MRI | U-Net | Uniform (MSE) | – | – | – | 13468 (1.–×) |
| Brain MRI | U-Net | Uniform (CE, MSE) | 0.8454 | – | – | 13556 (1.–×) |

Table 1: Summary of our main results. We compare the performance of our two best-performing losses and one baseline on 12 networks across the CIFAR-10/100, ImageNet and Brain MRI datasets. The full set of results for all 17 loss combinations is available in the supplementary material.

UNIFORM. Here, we assign a uniform weight to each of the selected loss terms: the weight is divided equally among the loss terms present, i.e., for subsets containing 1, 2, and 3 loss terms, the weights are 1.0, 0.5 and 0.33, respectively.

LEARNABLE. In the learnable variant, we treat the weights $\alpha$, $\beta$, $\gamma$ as parameters of the model, and optimize them as standard parameters using gradient descent. Since the gradient points in the steepest direction, using gradient descent will naturally favor the loss terms that are lowest in magnitude. Therefore, for the learnable weights to be fair, the relative magnitudes of the different loss terms should be about the same. We combat this by introducing a weight decay term on the loss weights (similar to standard optimization) and then projecting the weights to sum to 1; this avoids the collapse of all terms other than the minimum loss to 0:

$\alpha = \frac{e^{\alpha'}}{e^{\alpha'} + e^{\beta'} + e^{\gamma'}}$,

where $\alpha'$ is a parameter of the model. Once the final loss value is computed based on these weights, the parameters of the model (including $\alpha'$, $\beta'$ and $\gamma'$) are updated using SGD:

$\alpha' = \alpha' - \nabla_{\alpha'} \left( L + \eta \|\alpha'\| \right)$,

where $\eta$ represents the weight decay. We use a strong weight decay of 1.0 in our experiments to ensure that no single loss strictly dominates the others.

SOFTADAPT. Proposed by Heydari et al. [29], SOFTADAPT is a method to automatically tune the weights of a multi-part loss function. It can be tuned to assign the maximum weight to either the best-performing or the worst-performing loss, based on the value of a parameter $\eta$. Let $s_\alpha = L_{CE}(t) - L_{CE}(t-1)$ be the change in the loss value between two consecutive steps (and define $s_\beta$ and $s_\gamma$ analogously using the corresponding losses). They use the normalized version of SoftAdapt, which can be written as:

$\hat{s}_\alpha = \frac{s_\alpha}{(s_\alpha + s_\beta + s_\gamma) + \epsilon}$,

where $\epsilon$ is introduced for numerical stability. The final weight based on these normalized scores is:

$\alpha = \frac{e^{\eta \hat{s}_\alpha}}{e^{\eta \hat{s}_\alpha} + e^{\eta \hat{s}_\beta} + e^{\eta \hat{s}_\gamma}}$,

where $\eta$ selects whether to optimize the worst or the best loss, depending on whether its value is greater than or less than zero, respectively. We use $\eta = 1$ in our experiments, which equates to optimizing the worst-performing loss at every weight update step. Note that these weights are not optimized at every step of the optimization process, but are updated after every 10 optimization steps, with the corresponding average change taken into account.
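A compact sketch of this weight update follows; the interface is our own illustration of the scheme from Heydari et al. [29] as used here, not their released code.

```python
import torch

def softadapt_weights(curr_losses, prev_losses, eta=1.0, eps=1e-8):
    """Turn the recent change of each loss term into weights
    (alpha, beta, gamma). eta > 0 puts the most weight on the
    worst-performing loss, eta < 0 on the best-performing one."""
    s = torch.tensor(curr_losses) - torch.tensor(prev_losses)  # s_alpha, ...
    s_hat = s / (s.sum() + eps)               # normalized scores
    return torch.softmax(eta * s_hat, dim=0)  # final loss weights

# Example with hypothetical loss values averaged over the last 10 steps:
# alpha, beta, gamma = softadapt_weights([1.5, 0.3, 0.8], [1.2, 0.4, 0.9])
```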
4. Evaluation
We evaluate the efficacy of the proposed label-preservation-aware loss functions on a wide range of real-world tasks, network architectures, and datasets. Specifically, we report results for image classification on the CIFAR-10/100 [43] and ImageNet [60] datasets, and for semantic segmentation on the Brain MRI Segmentation dataset [7]. To demonstrate that our approach is not restricted to a particular type of architecture, we experiment with a range of different network architectures, including ResNet-20, ResNet-50 [28], ResNeXt [66], ResNet-56, DenseNet 12-40 [34] and U-Net [59]. As described in Section 3.1, we consider the CE, MSE, and CEPred losses and use three distinct algorithms to obtain optimal weights for each of these losses: UNIFORM, LEARNABLE, and SOFTADAPT. We evaluate each of these combinations and also include three additional baseline comparisons: (1) CE with distillation, which is used by Li et al. [49]; (2) ROUND-ROBIN-COMBO, which picks a loss to minimize in round-robin fashion; and (3) RANDOM-COMBO, which simply picks one loss at random for optimization.

Figure 3: Performance of each loss+optimizer combination on ResNet50 (ImageNet).
Hyper-Parameter Settings
For ResNet20 and ResNet56 on the CIFAR datasets, the residual block is the basic ResBlock with two 3×3 convolutional layers. For ResNet164 on CIFAR and ResNet50 on ImageNet, the residual block is a bottleneck block. ResNeXt20 and ResNeXt164 are configured with a fixed cardinality and bottleneck width. For CIFAR, we train the reference network with SGD using momentum and weight decay, decaying the learning rate in steps over the course of training. The ResNet50 model for ImageNet is obtained from the PyTorch pretrained model repository [55]. We use NVIDIA V100 GPUs to train all models. For compression, we follow the settings described in Li et al. [49]; in particular, we use their ℓ regularization factor and use different learning rates for W and A, with the ratio between $\eta_s$ and $\eta$ fixed as in [49].

Table 1 summarizes our main results. Here, we compare our best-performing loss+optimizer combination with the three baselines described above for each task, network architecture, and dataset. As shown in the table, we demonstrate significant reductions in the number of CIEs (up to . ×) and CIE-Us (up to . ×) using label-preservation-aware loss functions, while largely retaining reference top-1 accuracy. Further, we notice that one of the simplest loss+optimizer combinations, namely CE and MSE with uniform weights, works best in practice. We now dive deeper into how our proposed loss functions perform for each individual task and dataset. The full set of results spanning different tasks, network architectures, and datasets is included in the supplementary material.
Image Classification on ImageNet
To identify the individual contribution of each of the terms in the loss function, as well as of their different combinations, we performed detailed experiments over all our loss combinations on the networks and datasets shown in Table 1. Due to space restrictions, we only show results for ResNet50 (ImageNet) in this paper, in Figure 3; we include results for the other networks and datasets in the supplementary material. As shown in the figure, both UNIFORM-(MSE) and LEARNABLE-(CEpred, MSE) achieve CIE and CIE-U reductions of . × while improving upon baseline top-1 accuracy.

Image Classification on CIFAR-10/100
On CIFAR-10 and CIFAR-100, we achieve CIE reductions of up to . × and . ×, respectively, with a negligible drop in accuracy (less than 0.1%), as shown in Table 1. Compared to ImageNet, uniform weights with the CE and MSE losses perform best on CIFAR.

Semantic Segmentation on Brain MRI
Semantic seg-mentation is a per-pixel classification task which aims toassign the correct class to every pixel in the input. Thenotion of CIEs can be naturally extended to segmentation;in this case, each input pixel for which the reference andcompressed models disagree constitutes a
Compression Im-pacted Pixel (CIP) . We extend our proposed formulation toattempt to reduce CIPs, and evaluate our approach on thetask of semantic segmentation of brain MRI images. Thedataset comprises of brain MRI images along with man-ual FLAIR abnormality segmentation masks [7]. We usea generic U-Net architecture with two output channels, andtrain the complete model using conventional cross-entropyloss. We prune the model using unstructured pruning at asparsity of 81% and the rewind algorithm [57] (see Sec-tion 2 for a more detailed description of rewinding).Figure 4 shows the results for the segmentation task.Similar to classification, we notice a drastic reduction inthe number of CIPs when using logit pairing (MSE). Wealso see a positive impact of including CE along with MSE
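The following minimal sketch counts CIPs, assuming both models emit per-pixel class logits of shape (N, C, H, W); the function is our own illustration rather than our released implementation.

```python
import torch

@torch.no_grad()
def count_cips(reference, compressed, images):
    """Count Compression Impacted Pixels (CIPs): pixels whose predicted
    class differs between the reference and compressed segmentation models."""
    seg_ref = reference(images).argmax(dim=1)   # (N, H, W) label maps
    seg_cmp = compressed(images).argmax(dim=1)
    return int((seg_ref != seg_cmp).sum())
```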
Figure 4: CIPs for Brain MRI FLAIR segmentation.

Figure 4 shows the results for the segmentation task. Similar to classification, we notice a drastic reduction in the number of CIPs when using logit pairing (MSE). We also see a positive impact of including CE along with MSE on the Dice coefficient, as the model otherwise focuses only on reducing CIP-Us. This indicates that the proposed label-preservation-aware loss functions can naturally mitigate the impact of compression on the functional form of the classifier, even for tasks beyond image classification. We also include some specific qualitative examples of CIP reduction on brain MRI images in Figure 1.

Figure 5: Box-plot capturing the variation among CIEs over 10 random runs of ResNet20 on CIFAR-100.
To better understand how CIEs vary with different loss term weight initializations, we evaluated 10 random runs of our loss+optimizer combinations on ResNet20 (CIFAR-100). Figure 5 shows the results of this experiment in the form of a box-plot. We notice that the number of CIEs across the 10 random runs is fairly consistent for the majority of losses. We observe similar trends on other networks and datasets, and include these results in the supplementary material.
5. Conclusions
This paper has presented a novel method for identifying and reducing label mismatches during model compression. We introduce a label-preservation-aware loss formulation and corresponding optimization algorithms that systematically reduce compression impacted exemplars (CIEs). Our formulation carefully balances accuracy recovery with matching the functional form of the reference model, yielding dramatic reductions in label mismatches, especially in cases where the reference model makes a correct prediction (CIE-Us). We evaluate our approach on a wide range of tasks, network architectures, datasets, accuracy recovery algorithms and compression schemes to obtain up to a . × reduction in CIEs and a . × reduction in CIE-Us.

References

[1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019.
[2] M. Alam, M. D. Samad, L. Vidyaratne, A. Glandon, and K. M. Iftekharuddin. Survey on deep neural networks in speech and vision systems. Neurocomputing, 417:302–321, 2020.
[3] S. Anwar and W. Sung. Compact deep convolutional neural networks with coarse pruning. arXiv preprint arXiv:1610.09639, 2016.
[4] J. Ba and R. Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, volume 27, pages 2654–2662. Curran Associates, Inc., 2014.
[5] M. A. Badgeley, J. R. Zech, L. Oakden-Rayner, B. S. Glicksberg, M. Liu, W. Gale, M. V. McConnell, B. Percha, T. M. Snyder, and J. T. Dudley. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digital Medicine, 2(1):1–10, 2019.
[6] J. T. Barron. A general and adaptive robust loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4331–4339, 2019.
[7] M. Buda, A. Saha, and M. A. Mazurowski. Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Computers in Biology and Medicine, 109:218–225, 2019.
[8] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
[9] M. Á. Carreira-Perpiñán. Model compression as constrained optimization, with application to neural nets. Part I: General framework. arXiv preprint arXiv:1707.01209, 2017.
[10] K. Chekanov, P. Mamoshina, R. V. Yampolskiy, R. Timofte, M. Scheibye-Knudsen, and A. Zhavoronkov. Evaluating race and sex diversity in the world's largest companies using deep neural networks. arXiv preprint arXiv:1707.02353, 2017.
[11] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR, 2018.
[12] Y. Cheng, D. Wang, P. Zhou, and T. Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
[13] J. Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, 2018.
[14] D. K. Dennis, S. Gopinath, C. Gupta, A. Kumar, A. Kusupati, S. Patil, and H. Simhadri. EdgeML: Machine learning for resource-constrained edge devices. URL https://github.com/Microsoft/EdgeML. Retrieved January 2020.
[15] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5840–5848, 2017.
[18] J. Frankle and M. Carbin. The lottery ticket hypothesis: Training pruned neural networks. arXiv preprint arXiv:1803.03635, 2018.
[19] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[20] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[21] R. Gruetzemacher, A. Gupta, and D. Paradice. 3D deep learning for detecting pulmonary nodules in CT scans. Journal of the American Medical Informatics Association, 25(10):1301–1310, 2018.
[22] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 75–84. ACM, 2017.
[23] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE, 2016.
[24] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[25] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[26] D. Harwell. A face-scanning algorithm increasingly decides whether you deserve the job. The Washington Post, URL https://wapo.st/2X3bupO, 2019.
[27] B. Hassibi, D. G. Stork, and G. Wolff. Optimal brain surgeon: Extensions and performance comparisons. In Advances in Neural Information Processing Systems, pages 263–270, 1994.
[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[29] A. A. Heydari, C. A. Thompson, and A. Mehmood. SoftAdapt: Techniques for adaptive loss weighting of neural networks with multi-part loss functions. arXiv preprint arXiv:1912.12355, 2019.
[30] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[31] S. Hooker, N. Moorosi, G. Clark, S. Bengio, and E. Denton. Characterising bias in compressed models. arXiv preprint arXiv:2010.03058, 2020.
[32] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[33] C. Huang, S. Zhai, W. Talbott, M. A. Bautista, S.-Y. Sun, C. Guestrin, and J. Susskind. Addressing the loss-metric mismatch with adaptive loss alignment. arXiv preprint arXiv:1905.05895, 2019.
[34] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[35] N. Inc. Sparsity enables 50x performance acceleration in deep learning networks, 2020 (accessed Nov 16, 2020).
[36] W. Inc. Waymo releases a large dataset for autonomous driving, 25 June 2019.
[37] Y. Jang, H. Lee, S. J. Hwang, and J. Shin. Learning what and where to transfer. arXiv preprint arXiv:1905.05901, 2019.
[38] N. K. Jha, S. Mittal, and G. Mattela. The ramifications of making deep neural networks compact, pages 215–220, 2019.
[39] V. Joseph, G. L. Gopalakrishnan, S. Muralidharan, M. Garland, and A. Garg. A programmable approach to neural network compression. IEEE Micro, 40(5):17–25, 2020.
[40] V. Joseph, S. Muralidharan, and M. Garland. Condensa: Programmable model compression. https://nvlabs.github.io/condensa/, 2019. [Online; accessed 1-July-2019].
[41] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[42] J. Kossaifi, A. Toisoul, A. Bulat, Y. Panagakis, T. Hospedales, and M. Pantic. Factorized higher-order CNNs with an application to spatio-temporal emotion estimation, 2020.
[43] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset, 2014.
[44] A. Kusupati, M. Singh, K. Bhatia, A. Kumar, P. Jain, and M. Varma. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems, pages 9017–9028, 2018.
[45] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[46] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[47] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[48] J. Li, Q. Qi, J. Wang, C. Ge, Y. Li, Z. Yue, and H. Sun. OICSR: Out-in-channel sparsity regularization for compact deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7046–7055, 2019.
[49] Y. Li, S. Gu, C. Mayer, L. V. Gool, and R. Timofte. Group sparsity: The hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2020.
[50] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
[51] J. S. McCarley, R. Chakravarti, and A. Sil. Structured pruning of a BERT-based question answering model, 2020.
[52] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
[53] NHTSA. Tesla crash preliminary evaluation report, PE 16-007. Technical report, U.S. Department of Transportation, National Highway Traffic Safety Administration, Jan 2017.
[54] L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
[55] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[56] A. Polyak and L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[57] A. Renda, J. Frankle, and M. Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
[58] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[59] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[60] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[61] Z. Wang, J. Wohlwend, and T. Lei. Structured pruning of large language models, 2019.
[62] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
[63] M. Wilmanski, C. Kreucher, and A. Hero. Complex input convolutional neural networks for wide angle SAR ATR, pages 1037–1041. IEEE, 2016.
[64] H. Xie, D. Yang, N. Sun, Z. Chen, and Y. Zhang. Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognition, 85:109–119, 2019.
[65] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
[66] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[67] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369, 2013.
[68] L. W. Yang and C. Y. Su. Low-cost CNN design for intelligent surveillance system, pages 1–4. IEEE, 2018.
[69] J. Yoon and S. J. Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966, 2017.
[70] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
[71] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang. ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091, 2018.
[72] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
[73] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 883–894, 2018.

A. Additional Experimental Results
A.1. Results on CIFAR-10
A.1.1 DenseNet
Figure 6: DenseNet results on CIFAR-10
A.1.2 ResNet-20
Figure 7: Performance of each loss+optimizer combination on ResNet-20 (CIFAR-10)

A.1.3 ResNeXt-20
Figure 8: Performance of each loss+optimizer combination on ResNeXt-20 (CIFAR-10)
A.1.4 ResNet-56
Figure 9: Performance of each loss+optimizer combination on ResNet-56 (CIFAR-10)

A.1.5 ResNet-164
Figure 10: Performance of each loss+optimizer combination on ResNet-164 (CIFAR-10)
A.1.6 ResNeXt-164
Figure 11: Performance of each loss+optimizer combination on ResNeXt-164 (CIFAR-10)

A.2. Results on CIFAR-100
A.2.1 ResNet-20
Figure 12: Performance of each loss+optimizer combination on ResNet-20 (CIFAR-100)
A.2.2 ResNeXt-20
Figure 13: Performance of each loss+optimizer combination on ResNeXt-20 (CIFAR-100)

A.2.3 ResNet-164
Figure 14: Performance of each loss+optimizer combination on ResNet-164 (CIFAR-100)
A.2.4 ResNeXt-164
Figure 15: Performance of each loss+optimizer combination on ResNeXt-164 (CIFAR-100)

A.3. Results on Brain MRI FLAIR Segmentation