GNN-RL Compression: Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning
Sixing Yu, Arya Mazaheri, Ali Jannesari

Department of Computer Science, Iowa State University, Iowa, USA; Department of Computer Science, Technische Universität Darmstadt, Germany. Correspondence to: Sixing Yu <[email protected]>, Arya Mazaheri <[email protected]>, Ali Jannesari <[email protected]>.

Abstract
Model compression is an essential technique for deploying deep neural networks (DNNs) on power- and memory-constrained resources. However, existing model-compression methods often rely on human expertise and focus on parameters' local importance, ignoring the rich topology information within DNNs. In this paper, we propose a novel multi-stage graph embedding technique based on graph neural networks (GNNs) to identify DNN topology and use reinforcement learning (RL) to find a suitable compression policy. We performed resource-constrained (i.e., FLOPs) channel pruning and compared our approach with state-of-the-art compression methods on models ranging from over-parameterized DNNs (e.g., the ResNet family and VGG-16) to mobile-friendly DNNs (e.g., MobileNet-v1/v2 and ShuffleNet). The results demonstrate that our method can prune dense networks (e.g., VGG-16) by up to 80% of their original FLOPs. More importantly, our method outperformed state-of-the-art methods and achieved up to 1.84% higher accuracy for ShuffleNet-v1. Furthermore, following our approach, the pruned VGG-16 achieved a noticeable 1.38× speedup and a 141 MB GPU memory reduction.
1. Introduction
The demand for deploying DNN models on edge devices (e.g., mobile phones, robots, and self-driving cars) is expanding rapidly. However, the increasing memory and computing power requirements of DNNs make their deployment on edge devices a grand challenge. Thus, various custom-made DNN models have been introduced by experts to accommodate a DNN model with reasonably high accuracy on mobile devices (Howard et al., 2019; Tan & Le, 2019; Zhang et al., 2018b; Ma et al., 2018; Mehta et al., 2020; Huang et al., 2018). In addition to mobile-friendly deep networks, model optimization methods such as network pruning (Han et al., 2016; He et al., 2018), factorization (Sainath et al., 2013), knowledge distillation (Hinton et al., 2015), and parameter quantization (Han et al., 2016) help to shrink the DNN model size down to the target hardware capabilities. Among such methods, network pruning has shown to be considerably useful in model compression by introducing sparsity or eliminating channels or filters, yet it requires extensive knowledge and effort to find the perfect balance between accuracy and model size.

The main challenge of network pruning is to find the best pruning schedule or strategy for the layers of a network. Furthermore, a pruning strategy for a given DNN cannot be reused for other networks due to their different structures; each network demands a customized pruning strategy. Recently, He et al. (2018) leveraged reinforcement learning (RL) to automatically find the best pruning strategy. However, they used manually defined rules, such as the number of input/output channels, parameter size, and FLOPs, for the RL environment state vectors and ignored the rich structural information within the DNN. Yu et al. (2020) are the first to model a given DNN as a hierarchical graph and proposed a GNN-based encoder-decoder to embed DNN layers. However, their method learns the topology indirectly and does not consider topology changes during model compression. Moreover, existing RL-based model-compression methods require a manually defined pruning ratio to reach the desired model size reduction. Although the model accuracy is used within the RL agent's reward function, there is a negative correlation between the compression ratio and the reward. Thus, without any constraint, the RL agent tends to search for a tiny compression ratio to get a better reward.

Deep neural networks are already represented as computational graphs in deep-learning frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019).
2. Related Work
Within the context of this paper, researchers have already proposed various methods to compress DNN models, such as architecture design, network pruning, and quantization. Graph neural networks are also gaining momentum in these research fields. In the following, we review these methods.
Model Compression.
Extensive works focus on model compression and efficient deployment of DNNs, such as network pruning (Han et al., 2016; He et al., 2018), knowledge distillation (Hinton et al., 2015), and network quantization (Han et al., 2016; Courbariaux et al., 2016; Rastegari et al., 2016). Within the scope of this paper, we mainly consider network pruning. Structured (Anwar et al., 2017) and unstructured pruning (Zhang et al., 2018a; Guo et al., 2016) evaluate the importance of model parameters and remove those with a lower rank. Unstructured pruning promises a higher compression ratio through tensor sparsification; however, the potential speedup is only attainable on specialized AI accelerators. On the other hand, structured pruning eliminates entire filters or channels and therefore benefits all hardware platforms. For instance, the uniform, shallow, and deep empirical structured pruning policies (He et al., 2017; Li et al., 2016), as well as hand-crafted structured pruning methods such as SPP (Wang et al., 2017), FP (Li et al., 2016), and RNP (Lin et al., 2017), fall into the structured pruning category. SPP analyzes each layer and measures a reconstruction error to determine the pruning ratio. FP evaluates the performance of single-layer pruning, ranks the importance of layers, and prunes aggressively on low ranks. RNP groups all convolutional channels into sets and trains an RL agent to decide on the sets. However, hand-crafted pruning policies often fail to extend to new models and might lead to sub-optimal performance.

Recently, researchers have tended to leverage reinforcement learning to search for pruning policies automatically. Liu et al. (2020) proposed an ADMM-based (Boyd et al., 2011) structured weight pruning method and an innovative additional purification step for further weight reduction. He et al. (2018) proposed AMC for network pruning and leveraged reinforcement learning to predict each hidden layer's compression policy. However, they manually defined the DNN's embeddings and ignored the neural network's essential structural information. Yu et al. (2020) are the first to model DNNs as graphs and introduced a GNN-based graph encoder-decoder to embed DNNs' hidden layers. Nevertheless, their RL agent learns the topology information indirectly and is insensitive to the structural changes of DNNs while they are being pruned.
Graph Neural Networks (GNN).
GNNs and their variants (Kipf & Welling, 2017; Schlichtkrull et al., 2018) can learn graph embeddings and have been successfully used for link prediction (Liben-Nowell & Kleinberg, 2007) and node classification. However, these methods mainly focus on node embedding and are inherently flat, which makes them inefficient for hierarchical data. In this paper, we aim to learn the global topology information of DNNs. Thus, we propose a multi-stage GNN (m-GNN), which takes advantage of the repetitive motifs available in DNNs. m-GNN considers the edge features and has a novel learning-based pooling strategy to learn the global graph embedding.
Graph-based Neural Architecture Search (NAS).
Although this paper is not directly related to NAS, it is an active area of research wherein computationally expensive operations are replaced with more efficient alternatives. In particular, graph-based NAS methods apply GNNs and use graph-based neural-architecture encoding schemes to exploit the neural network's topology. They model neural architecture search spaces as graphs and aim to search for the best-performing neural network structure (Guo et al., 2019; Shi et al., 2020; Dudziak et al., 2021). Such methods inspired us to exploit the compression policy from the topology information of DNNs.
3. Approach
To prune a given DNN, the user provides the model size constraint (e.g., a FLOPs constraint). The DNN-Graph environment receives the constraint, takes the DNN's hierarchical computational graph as the environment state, and leverages the GNN-RL agent to search for a compression policy.

Figure 1 depicts a high-level overview of our method. The DNN-Graph environment episode is essentially a model compression iteration. As the red arrows show, the process starts from the original DNN. The model size evaluator first evaluates the size of the DNN. If the size constraint is not satisfied, the graph generator converts the DNN to a hierarchical computational graph. Then, the GNN-RL agent leverages m-GNN to learn pruning ratios (the compression policy) from the graph. The pruner prunes the DNN with these pruning ratios, and the next iteration begins from the compressed DNN. Each compression step changes the DNN's topology; thus, the DNN-Graph environment reconstructs a new hierarchical computational graph for the GNN-RL agent corresponding to the current compression state.

Once the compressed DNN satisfies the size constraint, the evaluator ends the episode, and the accuracy evaluator assesses the pruned DNN's accuracy as the episode reward for the GNN-RL agent. As opposed to existing RL-based methods (He et al., 2018; Yu et al., 2020; Liu et al., 2020), with the DNN-Graph environment, GNN-RL can automatically learn to reach the desired model size. Hence, it avoids manual adjustments and tiny compression ratios.

In the following, we explain the details of the m-GNN and the RL agent within our approach.
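To make the episode loop concrete, the following is a minimal sketch of one DNN-Graph episode. The helper names (`count_flops`, `graph_from_model`, `prune_channels`, `top1_error`) are our own placeholders, not the authors' implementation.

```python
# Minimal sketch of one DNN-Graph episode (helper names are hypothetical).
import copy

def run_episode(model, agent, flops_budget, env):
    """Prune iteratively until the FLOPs budget is met."""
    pruned = copy.deepcopy(model)
    while env.count_flops(pruned) > flops_budget:
        graph = env.graph_from_model(pruned)       # hierarchical computational graph (state)
        ratios = agent.act(graph)                  # per-layer pruning ratios (compression policy)
        pruned = env.prune_channels(pruned, ratios)
    reward = -env.top1_error(pruned)               # accuracy evaluator output -> episode reward
    return pruned, reward
```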
The representation of neural networks as computational graphs in deep-learning frameworks, such as TensorFlow and PyTorch, contains rich topology information. However, it may involve billions of operations (He et al., 2016), which makes the computational graph bloated.
Figure 1.
DNN-Graph environment. The graph generator converts the DNN into a graph. The model size evaluator evaluates the DNN's size. The accuracy evaluator measures the DNN's accuracy on the target dataset. The pruner module is responsible for pruning the DNN.

Nevertheless, computational graphs often contain repetitive sub-graphs (a.k.a. motifs) composed of primitive operations (e.g., add, multiply, and minus) together with high-level machine-learning operations (e.g., convolution, pooling, etc.).

Formally, we model the DNN as an $l$-layer hierarchical computational graph, such that at the $l$-th layer (the top layer) we have the hierarchical computational graph set $\mathcal{G}^l = \{G^l\}$, where each item is a computational graph $G^l = (V^l, E^l, \mathcal{G}^{l-1})$. Here, $V^l$ is the set of graph nodes corresponding to hidden states, $E^l$ is the set of directed edges, each with a specific edge type associated with an operation, and $\mathcal{G}^{l-1} = \{G^{l-1}_1, G^{l-1}_2, \ldots\}$ is the computational graph set at the $(l-1)$-th layer, which serves as the operation set at layer $l$. Within the first layer, we manually choose commonly used machine-learning operations as the primitive operations for $\mathcal{G}^0$.

As an example, Figure 2 illustrates the idea behind generating hierarchical computational graphs using a sample graph $G$, where the edges are operations and the nodes are hidden states. In the input graph, we choose three primitive operations $\mathcal{G}^0 = \{G^0_1, G^0_2, G^0_3\}$ corresponding to the three edge types. Then, we extract the repetitive subgraphs (i.e., $G^1_1$, $G^1_2$, and $G^1_3$), each denoting a compound operation, and decompose the graph $G$ into two hierarchical levels, as shown in Figure 2 (b) and (c). The level-1 computational graphs are motifs that correspond to the edges within the level-2 computational graph.

The hierarchical computational graph's size depends on the primitive operations we choose in $\mathcal{G}^0$. In our experiments, we choose the commonly used operations in machine learning as primitive operations (e.g., convolution, pooling, etc.).
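As a toy illustration of this hierarchy (with field names of our own choosing, not the paper's), a two-level structure in the spirit of Figure 2 can be written as:

```python
# Toy two-level hierarchical computational graph (illustrative field names).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CompGraph:
    num_nodes: int
    # Directed edges (src, dst, edge_type); an edge type at level t indexes
    # an operation, i.e., a graph, at level t-1.
    edges: List[Tuple[int, int, int]] = field(default_factory=list)

# Level 0: manually chosen primitive operations.
primitives = ["conv", "pool", "add"]

# Level 1: motifs composed of primitive operations.
motif_a = CompGraph(num_nodes=3, edges=[(0, 1, 0), (1, 2, 2)])  # conv followed by add
motif_b = CompGraph(num_nodes=2, edges=[(0, 1, 1)])             # a single pool

# Level 2 (top): edge types now refer to the level-1 motifs.
top = CompGraph(num_nodes=4, edges=[(0, 1, 0), (1, 2, 1), (2, 3, 0)])

hierarchy = [primitives, [motif_a, motif_b], [top]]
```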
Figure 2.
A two-level hierarchical computational graph and m-GNN. (a) A computational graph $G$ with motifs (the sub-graphs painted red, blue, and green). The nodes of the graph denote the feature maps of the input data, and the edges correspond to operations. (b) Level-1 hierarchical computational graphs. We extract the motifs $\{G^1_1, G^1_2, G^1_3\}$ from $G$ and split $G$ into two hierarchical levels; level 1 contains the motifs. (c) Level-2 hierarchical computational graph $G^2$. The edges in $G^2$ correspond to the motifs at level 1. The m-GNN embeds the two-level hierarchical computational graph in two stages. At stage 1, m-GNN embeds the motifs $\{G^1_1, G^1_2, G^1_3\}$ and applies their embeddings (i.e., $e^1_1$, $e^1_2$, and $e^1_3$) as the edge features of $G^2$. At stage 2, m-GNN embeds $G^2$, and its embedding $g$ is the final embedding of the two-level hierarchical computational graph.
3.2.1. Multi-stage GNN

Standard GNNs and their variants (Kipf & Welling, 2017) are inherently flat (Ying et al., 2018). Since we model a given DNN as an $l$-layer hierarchical computational graph (see Section 3.1), we propose a multi-stage GNN (m-GNN), which embeds the hierarchical graph in $l$ stages according to its hierarchical levels and analyzes the motifs.

As depicted in Figure 2, m-GNN first learns the lower-level embeddings and uses them as the corresponding edge features in the higher-level computational graphs. Instead of learning node embeddings, m-GNN aims to learn the global graph representation. We further introduce a novel learning-based pooling strategy for every stage of embedding. With m-GNN, each motif of the computational graph only needs to be embedded once, which is much more efficient and uses less memory than embedding a flat computational graph with a standard GNN.

Multi-stage embedding.
For the computational graphs $\mathcal{G}^t = \{G^t_1, G^t_2, \ldots, G^t_{N_t}\}$ at the $t$-th hierarchical layer, we embed each computational graph $G^t_i = (V^t_i, E^t_i, \mathcal{G}^{t-1})$, $i \in \{1, 2, \ldots, N_t\}$, as

$$e^t_i = \mathrm{EncoderGNN}^t\big(G^t_i, \mathcal{E}^{t-1}\big), \qquad (1)$$

where $e^t_i$ is the embedding vector of $G^t_i$, and $\mathcal{E}^{t-1} = \{e^{t-1}_j\}$, $j \in \{1, 2, \ldots, N_{t-1}\}$, is the set of embeddings of the computational graphs at level $t-1$, which serve as the edge-type features at level $t$. For layer 1, $\mathcal{E}^0$ contains the initial features (e.g., one-hot or random standard) of the manually selected primitive operations $\mathcal{G}^0$.

In the hierarchical computational graphs, each edge corresponds to a computational graph of the previous level and uses that graph's embedding as its edge feature. Furthermore, the graphs at the same hierarchical level share the GNN's parameters. At the top layer (the $l$-th layer) of the hierarchical graph, $\mathcal{G}^l = \{G^l\}$ contains only one computational graph, and its embedding is the DNN's final embedding $g$:

$$g = \mathrm{EncoderGNN}^l\big(G^l, \mathcal{E}^{l-1}\big). \qquad (2)$$
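A compact sketch of this staged encoding (Eqs. 1 and 2); `encoders[t]` stands in for the per-level shared GNN encoder, and the input follows the toy `hierarchy` structure sketched earlier:

```python
# Sketch of multi-stage embedding: level-(t-1) embeddings become the edge
# features of level t; the single top-level graph yields the DNN embedding g.
import torch

def multi_stage_embed(hierarchy, encoders, primitive_feats):
    """hierarchy[t] holds the graphs at level t (t >= 1); encoders[t] is the
    GNN shared by all graphs at level t; primitive_feats plays the role of E^0."""
    edge_feats = primitive_feats
    for t in range(1, len(hierarchy)):
        edge_feats = torch.stack(
            [encoders[t](g, edge_feats) for g in hierarchy[t]]   # Eq. (1)
        )
    return edge_feats[0]                                          # Eq. (2): final embedding g
```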
Message passing. In the multi-stage hierarchical embedding, we consider the edge features. In contrast, a standard graph convolutional network (GCN) (Kipf & Welling, 2017) only passes the node features, and its message-passing function can be formulated as

$$h^{l+1}_i = \sum_{j \in \mathcal{N}_i} c_i\, W^l h^l_j, \qquad (3)$$

where $h$ denotes the nodes' hidden states, $c_i$ is a constant coefficient, $\mathcal{N}_i$ is the set of node $i$'s neighbors, and $W^l$ is the GNN's learnable weight matrix. Instead of this standard message passing, in the multi-stage GNN we add the edge features:

$$h^{l+1}_i = \sum_{j \in \mathcal{N}_i} c_i\, W^l \big(h^l_j \circ e^{l-1}_k\big), \qquad (4)$$

where $e^{l-1}_k$ is the feature of edge $(i, j)$ and is also the embedding of the $k$-th graph at layer $l-1$, such that edge $(i, j)$ corresponds to the operation $G^{l-1}_k$. The operation $\circ$ denotes the element-wise product, which we selected for the convenience of multi-stage message passing, but the formulation is not limited to the element-wise product.
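A sketch of one such edge-aware layer in PyTorch over a dense adjacency representation; the normalization choice $c_i = 1/|\mathcal{N}_i|$ and the dense layout are our own assumptions for illustration:

```python
# One edge-aware message-passing step (Eq. 4) over a dense adjacency matrix.
import torch
import torch.nn as nn

class EdgeAwareConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)       # W^l

    def forward(self, h, adj, edge_type, edge_emb):
        # h: (N, in_dim) node states; adj: (N, N) 0/1 adjacency;
        # edge_type: (N, N) long tensor; edge_emb: (num_types, in_dim),
        # i.e., the embeddings of the lower-level graphs.
        e = edge_emb[edge_type]                      # (N, N, in_dim) feature of edge (i, j)
        msg = self.weight(h.unsqueeze(0) * e)        # W^l (h_j o e_k) for every pair (i, j)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return (adj.unsqueeze(-1) * msg).sum(dim=1) / deg   # sum over neighbors, c_i = 1/|N_i|
```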
Learning-based pooling. A standard GNN aims to learn the node embeddings of a graph (e.g., learn node representations and perform node classification). However, our goal is to learn the graph representation of a given DNN. Thus, we introduce a learning-based pooling method for the multi-stage GNN to pool node embeddings and learn the graph embedding. We define the graph embedding $e$ as

$$e = \sum_{i \in N} \alpha_i h_i, \qquad (5)$$

where $N$ is the set of nodes, $h_i$ is the $i$-th node embedding, and $\alpha_i$ is the learnable weight coefficient for $h_i$. In the multi-stage GNN, the computational graphs at the same hierarchical level share the GNN's parameters, but in the pooling, each computational graph has its own learnable pooling parameters $\alpha$.
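The pooling can be sketched as a module with one learnable coefficient per node; the uniform initialization is our own choice:

```python
# Learning-based pooling (Eq. 5): a learnable weight per node of one graph.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, num_nodes):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((num_nodes,), 1.0 / num_nodes))  # alpha_i

    def forward(self, h):                                   # h: (num_nodes, dim) node embeddings
        return (self.alpha.unsqueeze(-1) * h).sum(dim=0)    # e = sum_i alpha_i h_i
```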
3.2.2. Reinforcement Learning

We use the generated hierarchical computational graph $\mathcal{G}^l$ to represent the DNN's state and serve as the RL agent's environment state. Since pruning the model causes its underlying graph topology to change, we constantly update the graph $\mathcal{G}^l$ after each pruning step to help the RL agent find the pruning policy for the current state.

We employ the deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2016) together with m-GNN (GNN-RL) to learn the compression policy directly from the topology states. The actor and critic networks within the GNN-RL agent each contain an m-GNN graph encoder and a multi-layer perceptron. The graph encoder learns the graph embedding, and the multi-layer perceptron projects the embedding into the action space (i.e., the compression policy). The actor's output layer applies the sigmoid function to bound the actions within $(0, 1)$.

Specifically, we perform FLOPs-constrained model compression using structured channel pruning (filter pruning) on the DNN's convolutional layers, which are the most computationally intensive. Thus, the GNN-RL agent's action space $A \in \mathbb{R}^{N}$, where $N$ is the number of pruned layers, consists of the pruning ratios of the hidden layers: $A = \{a_i\}$, $i \in \{1, 2, \ldots, N\}$, where $a_i \in [0, 1)$ is the pruning ratio for the $i$-th layer. The GNN-RL agent produces the actions directly from the topology states:

$$g = \mathrm{GraphEncoder}(\mathcal{G}^l), \qquad (6)$$
$$A = \mathrm{MLP}(g), \qquad (7)$$

where $\mathcal{G}^l$ is the environment state, $g$ is the graph representation, and MLP is a multi-layer perceptron. The graph encoder learns the topology embedding, and the MLP projects the embedding into the hidden layers' pruning ratios. The reward function is defined as

$$R_{err} = -\mathrm{Error}, \qquad (8)$$

where Error is the compressed DNN's top-1 error on the validation set.
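Putting Eqs. 6–8 together, the actor can be sketched as a graph encoder followed by an MLP with a sigmoid output; the hidden size and layer count below are our own illustrative choices:

```python
# Sketch of the GNN-RL actor: topology embedding -> per-layer pruning ratios.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, graph_encoder, emb_dim, num_prunable_layers, hidden=128):
        super().__init__()
        self.encoder = graph_encoder                 # m-GNN graph encoder (Eq. 6)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_prunable_layers),
            nn.Sigmoid(),                            # bound actions to (0, 1)
        )

    def forward(self, graph):
        g = self.encoder(graph)                      # topology embedding g
        return self.mlp(g)                           # A = MLP(g), Eq. 7

def episode_reward(top1_error):
    return -top1_error                               # R_err = -Error, Eq. 8
```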
4. Experiments
To show the effectiveness of GNN-RL, we evaluate our approach on over-parameterized DNNs (e.g., ResNet-20/32/44/56/110 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015)) and mobile-friendly DNNs (e.g., MobileNet (Howard et al., 2017; Sandler et al., 2018) and ShuffleNet (Ma et al., 2018; Zhang et al., 2018b)). Additionally, to demonstrate the superiority of our proposed method, we compare GNN-RL with three sets of methods:

• Uniform, shallow, and deep empirical policies (He et al., 2017; Li et al., 2016).
• Hand-crafted channel reduction methods, such as SPP (Wang et al., 2017), FP (Li et al., 2016), and RNP (Lin et al., 2017).
• State-of-the-art RL-based AutoML methods, such as AMC (He et al., 2018), AGMC (Yu et al., 2020), and random search (RS) with RL.

We use a soft target update rate $\tau$ for the GNN-RL updates. In the first episodes, we warm up the agent with random actions; the agent then exploits for 150 episodes with exponentially decayed noise and trains the networks with a batch size of 64 and a replay buffer of size 2000.

The experiments involve multiple datasets, including CIFAR-10/100 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015). For CIFAR-10/100, we sample images from the test set as the validation set; for ILSVRC-2012, we likewise split images from the test set as the validation set. During the search, the DNN-Graph environment uses the compressed model's $R_{err}$ on the validation set as the GNN-RL agent's reward.
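The exploration schedule described above can be sketched as follows; the initial noise scale and decay constant are our own illustrative values, not reported hyperparameters:

```python
# Warm-up with random actions, then exploration with exponentially decayed noise.
import numpy as np

def select_action(actor, graph, episode, warmup, num_layers, sigma0=0.5, decay=0.98):
    if episode < warmup:
        return np.random.uniform(0.0, 1.0, size=num_layers)        # random warm-up action
    sigma = sigma0 * decay ** (episode - warmup)                    # exponentially decayed noise
    action = actor(graph).detach().cpu().numpy()
    return np.clip(action + np.random.normal(0.0, sigma, size=num_layers), 0.0, 1.0)
```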
We evaluate the effectiveness of GNN-RL on ResNet-20/32/44/56/110 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015), which fall into the over-parameterized networks category. With its residual connections, ResNet avoids vanishing gradients and allows efficient training of its deep layers. However, its deep structure and billions of parameters make ResNet a challenging network to deploy on edge devices. Similarly, the VGG-16 network contains compact and dense convolutional layers, where some layers have hundreds of filters, leading to a giant model size (528 MB of GPU memory for VGG-16). To compress these over-parameterized DNNs, we perform FLOPs-constrained channel pruning (filter pruning) on their convolutional layers.

We trained the ResNet-20/32/44/56/110 and VGG-16 models on the CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets, respectively. Since the validation accuracy on the ImageNet dataset is sensitive to the compression ratio, with high compression ratios the accuracy drops considerably without fine-tuning (in some cases, the pruned model without fine-tuning has very low validation accuracy). We therefore applied a one-epoch fine-tuning process in each RL search episode to ensure that GNN-RL gets a valuable reward when pruning VGG-16. When pruning ResNet-20/32/44/56/110, we share the pruning index between residual-connection layers to avoid channel mismatch.

Figure 3. The hidden layers' pruning ratios of the pruned ResNet-110 and ResNet-56. The horizontal axis is the layer index, and the vertical axis is the pruning ratio. The bars that touch the dotted line are the residual-connection layers. Since we share the pruning ratios between residual-connection layers to avoid a hidden-state dimension mismatch, the residual-connection layers' pruning ratios are uniform. For both ResNet-110 and ResNet-56, the residual-connection layers' pruning ratio is higher than that of most other layers.

Table 1 shows the top-1 test accuracy of the pruned models. We set the FLOPs constraint, and all the RL-based methods use $R_{err}$ as the reward. After pruning, we fine-tuned the DNNs for 100 epochs and only updated the pruned layers' parameters. The results show that GNN-RL outperforms all the baselines and achieves a higher test accuracy and compression ratio. For the ResNet-110/56/44 models, the model pruned by GNN-RL even achieves a higher test accuracy than the original model. After further investigation, we believe this is due to over-fitting of ResNet-110/56/44, as their accuracy on the training set was 100%. To verify this assumption, we performed a further experiment to explore the relationship between the FLOPs constraint and the accuracy of the DNNs. Figure 4 shows that a FLOPs ratio between 0.4 and 0.6 (compared with the original model's FLOPs) yields the highest test accuracy on ResNet-110. When the FLOPs reduction ratio exceeds 0.6, the test accuracy drops sharply.

In addition to the experiments above, we further analyzed the redundancy and importance of each layer. Figure 3 shows the hidden layers' pruning ratios on ResNet-110 and ResNet-56. ResNet contains residual-connection layers, which transfer hidden states directly from previous residual layers. Thus, the residual-connection layers are more redundant and informative, since they contain the information of both the current layer's hidden states and the previous layers'. The GNN-RL agent automatically learns that the residual-connection layers are more redundant and applies more pruning to them. Another insight from Figure 3 is that the GNN-RL agent applies more pruning to layers 45 to 65 of ResNet-110; similarly, layers 23 to 35 of ResNet-56 are pruned more heavily. Such an insight shows that the middle layers have less impact on model accuracy.
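The pruning-index sharing mentioned above can be sketched as computing one channel mask per residual group and reusing it for every layer in that group; the importance score below is a placeholder of our own:

```python
# Share one channel mask across all layers feeding the same residual shortcut
# so that the shortcut additions keep matching dimensions.
def shared_channel_mask(num_channels, pruning_ratio, scores):
    """Keep the highest-scoring channels; `scores` is any per-channel importance
    measure (e.g., the L2 norm of the corresponding filters)."""
    keep = max(1, int(round(num_channels * (1.0 - pruning_ratio))))
    order = sorted(range(num_channels), key=lambda c: scores[c], reverse=True)
    kept = set(order[:keep])
    return [c in kept for c in range(num_channels)]

# Usage: compute the mask once per residual group, then apply it to every
# layer in the group instead of pruning each layer independently.
```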
Figure 4.
Test accuracy of ResNet-110 using various FLOPs ratios.
Figure 5.
An example of pruning depth-wise filters in MobileNet-v1 blocks.
We evaluated GNN-RL on MobileNet-v1/v2 (Howard et al., 2017; Sandler et al., 2018) and ShuffleNet-v1/v2 (Zhang et al., 2018b; Ma et al., 2018), which are more suitable for devices with limited resources. Instead of using traditional convolutional operations, MobileNet-v1/v2 and ShuffleNet-v1/v2 are designed around more efficient convolutional blocks. To maintain the characteristics and high efficiency of these custom-designed blocks, we have developed specific pruning strategies for them.
Table 1.
Top-1 classification accuracy for pruned over-parameterized DNNs. The original ResNet-110/56/44/32/20 models are pre-trained on CIFAR-10 with top-1 test accuracies of 93.68%, 93.39%, 93.10%, 92.63%, and 91.73%, respectively. The original VGG-16 is pre-trained on ImageNet with a top-1 test accuracy of 70.5%. The FLOPs column is the preserved FLOPs ratio compared with the original model.

MODEL        METHOD    FLOPS   ACC. %   ΔACC.
ResNet-110   AGMC      50%     93.x     −
             RS        50%     87.x     −
             GNN-RL    —       —        —
ResNet-56    Uniform   50%     87.x     −
             Deep      50%     88.x     −
             AMC       50%     90.x     −
             AGMC      50%     92.x     −
             GNN-RL    —       —        —
ResNet-44    AGMC      50%     92.x     −
             RS        50%     88.x     −
             GNN-RL    —       —        —
ResNet-32    AGMC      50%     90.x     −
             RS        50%     89.x     −
             GNN-RL    —       —        —
ResNet-20    Deep      50%     79.x     −
             Shallow   50%     83.x     −
             Uniform   50%     84       −
             AMC       50%     86.x     −
             AGMC      50%     88.x     −
             GNN-RL    —       —        —
VGG-16       FP        —       —        −
             RNP       —       —        −
             SPP       —       —        −
             GNN-RL    —       —        —
50% 88 . − . GNN - RL − . VGG-16 FP . − . RNP . − . SPP . − . GNN - RL − . oped specific pruning strategies for them.4.2.1. P RUNING STRATEGY
MobileNet-v1.
The MobileNet-v1 block separates the convolution into depth-wise and point-wise convolutions (Howard et al., 2017). Each depth-wise filter operates on only one channel of the feature maps, while the point-wise operations are $1 \times 1$ convolutions that operate on the feature maps produced by the depth-wise convolutions. In our experiments, applying regular filter pruning to such layers causes information loss. As depicted in Figure 5, pruning the filter painted in grey causes its corresponding channel (the green one) to be deleted as well. To handle this, instead of pruning depth-wise and point-wise filters separately, we only prune the point-wise filters within MobileNet-v1 blocks.
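A sketch of how this point-wise-only selection can be implemented for a PyTorch MobileNet-v1 model; the selection rule is our own, written against standard `nn.Conv2d` attributes:

```python
# Select only the 1x1 point-wise convolutions as prunable layers and leave the
# depth-wise filters untouched.
import torch.nn as nn

def prunable_pointwise_layers(model):
    layers = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            is_depthwise = module.groups > 1 and module.groups == module.in_channels
            is_pointwise = module.kernel_size == (1, 1) and module.groups == 1
            if is_pointwise and not is_depthwise:
                layers.append((name, module))
    return layers
```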
MobileNet-v2. MobileNet-v2 is principally designed around MobileNet-v1 blocks with an additional linear expansion layer. The linear expansion layers are $1 \times 1$ convolutions without non-linear activation, and residual shortcuts connect every two linear expansion layers, linking the MobileNet-v1 blocks. Similar to MobileNet-v1, here we prune the linear expansion layers and the point-wise convolutional layers. Since the residual connections are between linear expansion layers, we share the linear expansion layers' pruning ratio.

Table 2.
Top-1 classification accuracies for pruned mobile-friendly DNNs. The original MobileNet-v1/v2 and ShuffleNet-v1/v2 are pre-trained on CIFAR-100 with top-1 test accuracies of 64.88%, 65.74%, 68.64%, and 68.85%, respectively. The FLOPs column is the preserved FLOPs ratio compared with the original model.
MODEL           METHOD   FLOPS   ACC. %   ΔACC.
ShuffleNet-v1   AGMC     60%     65.x     −
                RS       60%     63.x     −
                GNN-RL   —       —        —
ShuffleNet-v2   AGMC     60%     66.x     −
                RS       60%     65.x     −
                GNN-RL   —       —        —
MobileNet-v1    AGMC     80%     64.x     −
                RS       80%     63.x     −
                GNN-RL   —       —        —
MobileNet-v2    AGMC     80%     65.x     −
                RS       80%     65.x     −
                GNN-RL   —       —        —

ShuffleNet-v1/v2.
The ShuffleNet model uses blocks containing depth-wise and point-wise convolutions, channel shuffle, linear expansion, and residual connections. To avoid dimension mismatches when downsampling, we consider each ShuffleNet block as a whole and perform channel pruning inside the block. Within a ShuffleNet block, we do not prune the expansion layer (the output layer of the block), which preserves the number of output channels and keeps the feature-map dimensions when downsampling.
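A sketch of this block-level selection for ShuffleNet, assuming the convolutions of one block are collected in order; the helper name is hypothetical:

```python
# Prune only inside a ShuffleNet block: the block's expansion/output layer is
# skipped so the output channel count and downsampling dimensions are preserved.
def prunable_layers_in_block(block_convs):
    """block_convs: ordered list of (name, conv) pairs of one ShuffleNet block."""
    return block_convs[:-1]   # everything except the block's output (expansion) layer
```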
4.2.2. Results
Table 2 shows the FLOPs-constrained channel pruning results with 60% and 80% preserved FLOPs for ShuffleNet and MobileNet, respectively. We compared GNN-RL with AGMC (Yu et al., 2020) and random search (RS) with RL. We did not include AMC and the hand-crafted methods, since we designed specific pruning strategies for the mobile-friendly DNNs; we believe these strategies are incompatible with AMC's layer embeddings and hand-crafted rules, which would lead to an unfair comparison.

The MobileNet-v1/v2 and ShuffleNet-v1/v2 models are pre-trained on CIFAR-100 (Krizhevsky & Hinton, 2009). After pruning, we fine-tuned the compressed DNNs for 150 epochs. Our approach outperformed all the baselines. Although the networks are already very compact, with the FLOPs reduction applied to MobileNet-v2, GNN-RL still increases the top-1 accuracy.
Table 3.
The latency and GPU memory usage of the compressed models. The ResNet-110/56/44/32/20 models are tested on CIFAR-10, VGG-16 is tested on ImageNet, and MobileNet-v1/v2 and ShuffleNet-v1/v2 are tested on CIFAR-100. The FLOPs column is the preserved FLOPs ratio compared with the original model. Each model has two rows: the original model (100% FLOPs) and the GNN-RL-pruned model.

MODEL           FLOPS   LATENCY   GPU MEM.
VGG-16          100%    — ms      — MB
                20%     — ms      — MB
ResNet-110      100%    — ms      — MB
                48%     — ms      — MB
ResNet-56       100%    — ms      — MB
                46%     — ms      — MB
ResNet-44       100%    — ms      — MB
                49%     — ms      — MB
ResNet-32       100%    — ms      — MB
                49%     — ms      — KB
ResNet-20       100%    — ms      — MB
                49%     — ms      — KB
MobileNet-v1    100%    — ms      — MB
                79%     — ms      — MB
MobileNet-v2    100%    — ms      — MB
                79%     — ms      — MB
ShuffleNet-v1   100%    — ms      — MB
                58%     — ms      — MB
ShuffleNet-v2   100%    — ms      — MB
                54%     — ms      — MB

The inference latency and memory usage of compressed DNNs are essential metrics for deciding whether a DNN can be deployed on a given platform. Thus, we evaluated the pruned models' inference latency using PyTorch 1.7.1 on an Nvidia GTX 1080Ti GPU and recorded the GPU memory usage. The ResNet-110/56/44/32/20 models are measured on the CIFAR-10 test set with batch size 32, VGG-16 is evaluated on the ImageNet test set with batch size 32, and MobileNet-v1/v2 and ShuffleNet-v1/v2 are measured on CIFAR-100 with batch size 32.

Table 3 shows the inference accelerations and memory savings on our GPU. All the models pruned by GNN-RL achieve noteworthy inference acceleration and GPU memory reductions. In particular, the original VGG-16 model's GPU memory usage is 528 MB, since its dense layers contribute little to the FLOPs but lead to an extensive memory requirement. GNN-RL prunes the convolutional layers and significantly reduces the feature-map sizes, thus consuming 141 MB less memory than the original version. The inference acceleration on VGG-16 is also noticeable, with a 1.38× speedup on ImageNet. The inference acceleration for mobile-friendly DNNs may seem relatively insignificant; however, such models are designed for deployment on mobile devices, and we believe that our test GPU, with its extensive resources, does not take advantage of their mobile-friendly properties.
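The latency and memory numbers above can be reproduced with a measurement loop along the following lines (our own script, not the authors'); batch size 32 matches the setup described in the text:

```python
# Measure average inference latency and peak GPU memory for a model.
import time
import torch

@torch.no_grad()
def measure(model, input_shape=(3, 32, 32), batch_size=32, warmup=10, iters=100):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch_size, *input_shape, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(warmup):                      # warm-up runs are not timed
        model(x)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize(device)
    latency_ms = (time.time() - start) / iters * 1000.0
    peak_mem_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return latency_ms, peak_mem_mb
```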
5. Conclusion
This paper proposed a network compression approach called GNN-RL, which utilizes a graph neural network and a reinforcement-learning agent to exploit a topology-aware compression policy. We introduced a DNN-Graph environment that converts the compression state into a topology-changing process and allows GNN-RL to learn the desired compression ratio without human intervention. To efficiently embed DNNs and take advantage of motifs, we introduced m-GNN, a new multi-stage graph embedding method. In our experiments, GNN-RL was validated and verified on over-parameterized and mobile-friendly networks. For the over-parameterized models pruned by GNN-RL, the test accuracy of ResNet-110, ResNet-56, and ResNet-44 even outperformed the original models. For the mobile-friendly DNNs, the MobileNet-v2 pruned by GNN-RL increased the test accuracy compared with the original model. Additionally, all the pruned models accelerated inference and saved a considerable amount of memory.

References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proc. of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3), February 2017.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Dudziak, L., Chau, T., Abdelfattah, M. S., Lee, R., Kim, H., and Lane, N. D. BRP-NAS: Prediction-based NAS using GCNs. 2021.

Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient DNNs. In Proc. of the Advances in Neural Information Processing Systems, volume 29, pp. 1379–1387, 2016.

Guo, Y., Zheng, Y., Tan, M., Chen, Q., Chen, J., Zhao, P., and Huang, J. NAT: Neural architecture transformer for accurate and compact architectures. In Proc. of the Advances in Neural Information Processing Systems, volume 32, pp. 737–748, 2019.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. of the International Conference on Learning Representations (ICLR), 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. pp. 770–778, 2016.

He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proc. of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proc. of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In Proc. of the NIPS Deep Learning and Representation Learning Workshop, 2015.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for MobileNetV3. In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

Huang, G., Liu, S., Van der Maaten, L., and Weinberger, K. Q. CondenseNet: An efficient DenseNet using learned group convolutions. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. of the International Conference on Learning Representations (ICLR), 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.

Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In Proc. of the ICLR (Poster), 2016.

Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In Proc. of the Advances in Neural Information Processing Systems, pp. 2181–2191, 2017.

Liu, N., Ma, X., Xu, Z., Wang, Y., Tang, J., and Ye, J. AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates. In Proc. of the AAAI Conference on Artificial Intelligence, pp. 4876–4883, 2020.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proc. of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Mehta, S., Hajishirzi, H., and Rastegari, M. DiCENet: Dimension-wise convolutions for efficient networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Proc. of the Advances in Neural Information Processing Systems, volume 32, pp. 8026–8037, 2019.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. of the European Conference on Computer Vision, pp. 525–542, 2016.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., and Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6655–6659, 2013.

Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. pp. 4510–4520, 2018.

Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In The Semantic Web, pp. 593–607. Springer International Publishing, 2018.

Shi, H., Pi, R., Xu, H., Li, Z., Kwok, J. T., and Zhang, T. Bridging the gap between sample-based and one-shot neural architecture search with BONAS. 2020.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. of the International Conference on Learning Representations (ICLR), 2015.

Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proc. of the International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.

Wang, H., Zhang, Q., Wang, Y., and Hu, R. Structured probabilistic pruning for deep convolutional neural network acceleration. British Machine Vision Conference, 2017.

Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Proc. of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), pp. 4805–4815, 2018.

Yu, S., Mazaheri, A., and Jannesari, A. Auto graph encoder-decoder for model compression and network acceleration. 2020.

Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., and Wang, Y. A systematic DNN weight pruning framework using alternating direction method of multipliers. ECCV, 2018a.

Zhang, X., Zhou, X., Lin, M., and Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856, 2018b.