GNN-RL Compression: Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning
Sixing Yu, Arya Mazaheri, Ali Jannesari

Department of Computer Science, Iowa State University, Iowa, USA; Department of Computer Science, Technische Universität Darmstadt, Germany. Correspondence to: Sixing Yu <[email protected]>, Arya Mazaheri <[email protected]>, Ali Jannesari <[email protected]>.

Abstract
Model compression is an essential technique for deploying deep neural networks (DNNs) on power- and memory-constrained resources. However, existing model-compression methods often rely on human expertise and focus on parameters' local importance, ignoring the rich topology information within DNNs. In this paper, we propose a novel multi-stage graph embedding technique based on graph neural networks (GNNs) to identify DNN topology and use reinforcement learning (RL) to find a suitable compression policy. We performed resource-constrained (i.e., FLOPs) channel pruning and compared our approach with state-of-the-art compression methods on models ranging from over-parameterized DNNs (e.g., the ResNet family and VGG-16) to mobile-friendly DNNs (e.g., MobileNet-v1/v2 and ShuffleNet). The results demonstrate that our method can prune dense networks (e.g., VGG-16) by up to 80% of their original FLOPs. More importantly, our method outperformed state-of-the-art methods and achieved up to 1.84% higher accuracy for ShuffleNet-v1. Furthermore, following our approach, the pruned VGG-16 achieved a noticeable 1.38× speedup and a 141 MB GPU memory reduction.
1. Introduction
The demand for deploying DNN models on edge devices (e.g., mobile phones, robots, and self-driving cars) is expanding rapidly. However, the increasing memory and computing power requirements of DNNs make their deployment on edge devices a grand challenge. Thus, various custom-made DNN models have been introduced by experts to accommodate a DNN model with reasonably high accuracy on mobile devices (Howard et al., 2019; Tan & Le, 2019; Zhang et al., 2018b; Ma et al., 2018; Mehta et al., 2020; Huang et al., 2018). In addition to mobile-friendly deep networks, model optimization methods such as network pruning (Han et al., 2016; He et al., 2018), factorization (Sainath et al., 2013), knowledge distillation (Hinton et al., 2015), and parameter quantization (Han et al., 2016) help to shrink the DNN model size down to the target hardware capabilities. Among such methods, network pruning has shown to be considerably useful in model compression by introducing sparsity or eliminating channels or filters, yet it requires extensive knowledge and effort to find the perfect balance between accuracy and model size.

The main challenge of network pruning is to find the best pruning schedule or strategy for the layers of a network. Furthermore, a pruning strategy for a given DNN cannot be reused for other networks due to their different structures; each network demands a customized pruning strategy. Recently, He et al. (2018) leveraged reinforcement learning (RL) to automatically find the best pruning strategy. However, they used manually defined rules, such as the number of input/output channels, parameter size, and FLOPs, for the RL environment state vectors and ignored the rich structural information within the DNN. Yu et al. (2020) are the first to model a given DNN as a hierarchical graph and proposed a GNN-based encoder-decoder to embed DNN layers. However, their method learns the topology indirectly and does not consider topology changes during model compression. Moreover, existing RL-based model-compression methods require a manually defined pruning ratio to reach the desired model size reduction. Although the model accuracy is used within the RL agent's reward function, there is a negative correlation between the compression ratio and the reward. Thus, without any constraint, the RL agent tends to search for a tiny compression ratio to get a better reward.

Deep neural networks are already represented as computational graphs in deep-learning frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019).
2. Related Work
Within the context of this paper, researchers have already proposed various methods to compress DNN models, such as architecture design, network pruning, and quantization. Graph neural networks are also gaining momentum in these research fields. In the following, we review these methods.
Model Compression.
Extensive works focus on model compression and efficient deployment of DNNs, such as network pruning (Han et al., 2016; He et al., 2018), knowledge distillation (Hinton et al., 2015), and network quantization (Han et al., 2016; Courbariaux et al., 2016; Rastegari et al., 2016). Within the scope of this paper, we mainly consider network pruning. Structured (Anwar et al., 2017) and unstructured pruning (Zhang et al., 2018a; Guo et al., 2016) evaluate the importance of model parameters and remove those with a lower rank. Unstructured pruning promises a higher compression ratio through tensor sparsification; however, the potential speedup is only attainable on specialized AI accelerators. On the other hand, structured pruning eliminates entire filters or channels and therefore benefits all hardware platforms. For instance, the uniform, shallow, and deep empirical structured pruning policies (He et al., 2017; Li et al., 2016), as well as hand-crafted structured pruning methods such as SPP (Wang et al., 2017), FP (Li et al., 2016), and RNP (Lin et al., 2017), fall into the structured pruning category. SPP analyzes each layer and measures a reconstruction error to determine the pruning ratio. FP evaluates the performance of single-layer pruning, ranks the importance of layers, and prunes aggressively on low ranks. RNP groups all convolutional channels into sets and trains an RL agent to decide on the sets. However, hand-crafted pruning policies often fail to extend to new models and might lead to sub-optimal performance.

Recently, researchers have tended to leverage reinforcement learning to search for pruning policies automatically. Liu et al. (2020) proposed an ADMM-based (Boyd et al., 2011) structured weight pruning method and an innovative additional purification step for further weight reduction. He et al. (2018) proposed AMC for network pruning and leveraged reinforcement learning to predict each hidden layer's compression policy. However, they manually defined the DNN's embeddings and ignored the neural network's essential structural information. Yu et al. (2020) are the first to model DNNs as graphs and introduced a GNN-based graph encoder-decoder to embed DNNs' hidden layers. Nevertheless, their RL agent learns the topology information indirectly and is insensitive to the structural changes of DNNs while they are being pruned.
Graph Neural Networks (GNN).
GNNs and their variants (Kipf & Welling, 2017; Schlichtkrull et al., 2018) can learn graph embeddings and have been successfully used for link prediction (Liben-Nowell & Kleinberg, 2007) and node classification. However, these methods mainly focus on node embedding and are inherently flat, which makes them inefficient for hierarchical data. In this paper, we aim to learn the global topology information of DNNs. Thus, we propose a multi-stage GNN (m-GNN), which takes advantage of the repetitive motifs available in DNNs. m-GNN considers the edge features and has a novel learning-based pooling strategy to learn the global graph embedding.
Graph-based Neural Architecture Search (NAS).
Although this paper is not directly related to NAS, it is an active area of research wherein computationally expensive operations are replaced with more efficient alternatives. In particular, graph-based NAS methods apply GNNs and use graph-based neural-architecture encoding schemes to exploit the neural network's topology. They model neural architecture search spaces as graphs and aim to search for the best-performing neural network structure (Guo et al., 2019; Shi et al., 2020; Dudziak et al., 2021). Such methods inspired us to exploit the compression policy from the topology information of DNNs.
3. Approach
To prune a given DNN, the user provides the model size constraint (e.g., a FLOPs constraint). The DNN-Graph environment receives the constraint, takes the DNN's hierarchical computational graph as the environment state, and leverages the GNN-RL agent to search for a compression policy.

Figure 1 depicts a high-level overview of our method. The DNN-Graph environment episode is essentially a model compression iteration. As the red arrows show, the process starts from the original DNN. The model size evaluator first evaluates the size of the DNN. If the size constraint is not satisfied, the graph generator converts the DNN to a hierarchical computational graph. Then, the GNN-RL agent leverages m-GNN to learn pruning ratios (the compression policy) from the graph. The pruner prunes the DNN with these pruning ratios, and the next iteration begins from the compressed DNN. Each compression step changes the DNN's topology; thus, the DNN-Graph environment reconstructs a new hierarchical computational graph for the GNN-RL agent corresponding to the current compression state.

Once the compressed DNN satisfies the size constraint, the evaluator ends the episode, and the accuracy evaluator assesses the pruned DNN's accuracy as the episode reward for the GNN-RL agent. As opposed to existing RL-based methods (He et al., 2018; Yu et al., 2020; Liu et al., 2020), with the DNN-Graph environment, GNN-RL can automatically learn to reach the desired model size. Hence, it avoids manual adjustments and tiny compression ratios.

In the following, we explain the details of the m-GNN and the RL agent within our approach.
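To make the episode loop concrete, the following is a minimal sketch of one DNN-Graph episode. The helper names (`count_flops`, `graph_from_model`, `prune_channels`, `top1_error`) are our own placeholders, not the authors' implementation.

```python
# Minimal sketch of one DNN-Graph episode (helper names are hypothetical).
import copy

def run_episode(model, agent, flops_budget, env):
    """Prune iteratively until the FLOPs budget is met."""
    pruned = copy.deepcopy(model)
    while env.count_flops(pruned) > flops_budget:
        graph = env.graph_from_model(pruned)       # hierarchical computational graph (state)
        ratios = agent.act(graph)                  # per-layer pruning ratios (compression policy)
        pruned = env.prune_channels(pruned, ratios)
    reward = -env.top1_error(pruned)               # accuracy evaluator output -> episode reward
    return pruned, reward
```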
The representation of neural networks as computational graphs in deep-learning frameworks, such as TensorFlow and PyTorch, contains rich topology information. However, it may involve billions of operations (He et al., 2016), which makes the computational graph bloated.
Figure 1.
DNN-Graph environment. The graph generator converts the DNN into a graph. The model size evaluator evaluates the DNN's size. The accuracy evaluator measures the DNN's accuracy on the target dataset. The pruner module is responsible for pruning the DNN.

Nevertheless, computational graphs often contain repetitive sub-graphs (a.k.a. motifs) composed of primitive operations (e.g., add, multiply, and minus) together with high-level machine-learning operations (e.g., convolution, pooling, etc.).

Formally, we model the DNN as an $l$-layer hierarchical computational graph, such that at the $l$-th layer (the top layer) we have the hierarchical computational graph set $\mathcal{G}^l = \{G^l\}$, where each item is a computational graph $G^l = (V^l, E^l, \mathcal{G}^{l-1})$. Here, $V^l$ is the set of graph nodes corresponding to hidden states, $E^l$ is the set of directed edges, each with a specific edge type associated with an operation, and $\mathcal{G}^{l-1} = \{G^{l-1}_1, G^{l-1}_2, \ldots\}$ is the computational graph set at the $(l-1)$-th layer, which serves as the operation set at layer $l$. Within the first layer, we manually choose commonly used machine-learning operations as the primitive operations for $\mathcal{G}^0$.

As an example, Figure 2 illustrates the idea behind generating hierarchical computational graphs using a sample graph $G$, where the edges are operations and the nodes are hidden states. In the input graph, we choose three primitive operations $\mathcal{G}^0 = \{G^0_1, G^0_2, G^0_3\}$ corresponding to the three edge types. Then, we extract the repetitive subgraphs (i.e., $G^1_1$, $G^1_2$, and $G^1_3$), each denoting a compound operation, and decompose the graph $G$ into two hierarchical levels, as shown in Figure 2 (b) and (c). The level-1 computational graphs are motifs that correspond to the edges within the level-2 computational graph.

The hierarchical computational graph's size depends on the primitive operations we choose in $\mathcal{G}^0$. In our experiments, we choose the commonly used operations in machine learning as primitive operations (e.g., convolution, pooling, etc.).
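As a toy illustration of this hierarchy (with field names of our own choosing, not the paper's), a two-level structure in the spirit of Figure 2 can be written as:

```python
# Toy two-level hierarchical computational graph (illustrative field names).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CompGraph:
    num_nodes: int
    # Directed edges (src, dst, edge_type); an edge type at level t indexes
    # an operation, i.e., a graph, at level t-1.
    edges: List[Tuple[int, int, int]] = field(default_factory=list)

# Level 0: manually chosen primitive operations.
primitives = ["conv", "pool", "add"]

# Level 1: motifs composed of primitive operations.
motif_a = CompGraph(num_nodes=3, edges=[(0, 1, 0), (1, 2, 2)])  # conv followed by add
motif_b = CompGraph(num_nodes=2, edges=[(0, 1, 1)])             # a single pool

# Level 2 (top): edge types now refer to the level-1 motifs.
top = CompGraph(num_nodes=4, edges=[(0, 1, 0), (1, 2, 1), (2, 3, 0)])

hierarchy = [primitives, [motif_a, motif_b], [top]]
```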
Figure 2.
A two-level hierarchical computational graph and m-GNN. (a) A computational graph $G$ with motifs (the sub-graphs painted red, blue, and green). The nodes of the graph denote the feature maps of the input data, and the edges correspond to operations. (b) Level-1 hierarchical computational graphs. We extract the motifs $\{G^1_1, G^1_2, G^1_3\}$ from $G$ and split $G$ into two hierarchical levels; level 1 contains the motifs. (c) Level-2 hierarchical computational graph $G^2$. The edges in $G^2$ correspond to the motifs at level 1. The m-GNN embeds the two-level hierarchical computational graph in two stages. At stage 1, m-GNN embeds the motifs $\{G^1_1, G^1_2, G^1_3\}$ and applies their embeddings (i.e., $e^1_1$, $e^1_2$, and $e^1_3$) as the edge features of $G^2$. At stage 2, m-GNN embeds $G^2$, and its embedding $g$ is the final embedding of the two-level hierarchical computational graph.
3.2.1. Multi-stage GNN

Standard GNNs and their variants (Kipf & Welling, 2017) are inherently flat (Ying et al., 2018). Since we model a given DNN as an $l$-layer hierarchical computational graph (see Section 3.1), we propose a multi-stage GNN (m-GNN), which embeds the hierarchical graph in $l$ stages according to its hierarchical levels and analyzes the motifs.

As depicted in Figure 2, m-GNN first learns the lower-level embeddings and uses them as the corresponding edge features in the higher-level computational graphs. Instead of learning node embeddings, m-GNN aims to learn the global graph representation. We further introduce a novel learning-based pooling strategy for every stage of embedding. With m-GNN, each motif of the computational graph only needs to be embedded once, which is much more efficient and uses less memory than embedding a flat computational graph with a standard GNN.

Multi-stage embedding.
For the computational graphs $\mathcal{G}^t = \{G^t_1, G^t_2, \ldots, G^t_{N_t}\}$ at the $t$-th hierarchical layer, we embed each computational graph $G^t_i = (V^t_i, E^t_i, \mathcal{G}^{t-1})$, $i \in \{1, 2, \ldots, N_t\}$, as

$$e^t_i = \mathrm{EncoderGNN}^t\big(G^t_i, \mathcal{E}^{t-1}\big), \qquad (1)$$

where $e^t_i$ is the embedding vector of $G^t_i$, and $\mathcal{E}^{t-1} = \{e^{t-1}_j\}$, $j \in \{1, 2, \ldots, N_{t-1}\}$, is the set of embeddings of the computational graphs at level $t-1$, which serve as the edge-type features at level $t$. For layer 1, $\mathcal{E}^0$ contains the initial features (e.g., one-hot or random standard) of the manually selected primitive operations $\mathcal{G}^0$.

In the hierarchical computational graphs, each edge corresponds to a computational graph of the previous level and uses that graph's embedding as its edge feature. Furthermore, the graphs at the same hierarchical level share the GNN's parameters. At the top layer (the $l$-th layer) of the hierarchical graph, $\mathcal{G}^l = \{G^l\}$ contains only one computational graph, and its embedding is the DNN's final embedding $g$:

$$g = \mathrm{EncoderGNN}^l\big(G^l, \mathcal{E}^{l-1}\big). \qquad (2)$$
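A compact sketch of this staged encoding (Eqs. 1 and 2); `encoders[t]` stands in for the per-level shared GNN encoder, and the input follows the toy `hierarchy` structure sketched earlier:

```python
# Sketch of multi-stage embedding: level-(t-1) embeddings become the edge
# features of level t; the single top-level graph yields the DNN embedding g.
import torch

def multi_stage_embed(hierarchy, encoders, primitive_feats):
    """hierarchy[t] holds the graphs at level t (t >= 1); encoders[t] is the
    GNN shared by all graphs at level t; primitive_feats plays the role of E^0."""
    edge_feats = primitive_feats
    for t in range(1, len(hierarchy)):
        edge_feats = torch.stack(
            [encoders[t](g, edge_feats) for g in hierarchy[t]]   # Eq. (1)
        )
    return edge_feats[0]                                          # Eq. (2): final embedding g
```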
Message passing. In the multi-stage hierarchical embedding, we consider the edge features. In contrast, a standard graph convolutional network (GCN) (Kipf & Welling, 2017) only passes the node features, and its message-passing function can be formulated as

$$h^{l+1}_i = \sum_{j \in \mathcal{N}_i} c_i\, W^l h^l_j, \qquad (3)$$

where $h$ denotes the nodes' hidden states, $c_i$ is a constant coefficient, $\mathcal{N}_i$ is the set of node $i$'s neighbors, and $W^l$ is the GNN's learnable weight matrix. Instead of this standard message passing, in the multi-stage GNN we add the edge features:

$$h^{l+1}_i = \sum_{j \in \mathcal{N}_i} c_i\, W^l \big(h^l_j \circ e^{l-1}_k\big), \qquad (4)$$

where $e^{l-1}_k$ is the feature of edge $(i, j)$ and is also the embedding of the $k$-th graph at layer $l-1$, such that edge $(i, j)$ corresponds to the operation $G^{l-1}_k$. The operation $\circ$ denotes the element-wise product, which we selected for the convenience of multi-stage message passing, but the formulation is not limited to the element-wise product.
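A sketch of one such edge-aware layer in PyTorch over a dense adjacency representation; the normalization choice $c_i = 1/|\mathcal{N}_i|$ and the dense layout are our own assumptions for illustration:

```python
# One edge-aware message-passing step (Eq. 4) over a dense adjacency matrix.
import torch
import torch.nn as nn

class EdgeAwareConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)       # W^l

    def forward(self, h, adj, edge_type, edge_emb):
        # h: (N, in_dim) node states; adj: (N, N) 0/1 adjacency;
        # edge_type: (N, N) long tensor; edge_emb: (num_types, in_dim),
        # i.e., the embeddings of the lower-level graphs.
        e = edge_emb[edge_type]                      # (N, N, in_dim) feature of edge (i, j)
        msg = self.weight(h.unsqueeze(0) * e)        # W^l (h_j o e_k) for every pair (i, j)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return (adj.unsqueeze(-1) * msg).sum(dim=1) / deg   # sum over neighbors, c_i = 1/|N_i|
```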
Learning-based pooling. A standard GNN aims to learn the node embeddings of a graph (e.g., learn node representations and perform node classification). However, our goal is to learn the graph representation of a given DNN. Thus, we introduce a learning-based pooling method for the multi-stage GNN to pool node embeddings and learn the graph embedding. We define the graph embedding $e$ as

$$e = \sum_{i \in N} \alpha_i h_i, \qquad (5)$$

where $N$ is the set of nodes, $h_i$ is the $i$-th node embedding, and $\alpha_i$ is the learnable weight coefficient for $h_i$. In the multi-stage GNN, the computational graphs at the same hierarchical level share the GNN's parameters, but in the pooling, each computational graph has its own learnable pooling parameters $\alpha$.
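The pooling can be sketched as a module with one learnable coefficient per node; the uniform initialization is our own choice:

```python
# Learning-based pooling (Eq. 5): a learnable weight per node of one graph.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, num_nodes):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((num_nodes,), 1.0 / num_nodes))  # alpha_i

    def forward(self, h):                                   # h: (num_nodes, dim) node embeddings
        return (self.alpha.unsqueeze(-1) * h).sum(dim=0)    # e = sum_i alpha_i h_i
```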
3.2.2. Reinforcement Learning

We use the generated hierarchical computational graph $\mathcal{G}^l$ to represent the DNN's state and serve as the RL agent's environment state. Since pruning the model causes its underlying graph topology to change, we constantly update the graph $\mathcal{G}^l$ after each pruning step to help the RL agent find the pruning policy for the current state.

We employ the deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2016) together with m-GNN (GNN-RL) to learn the compression policy directly from the topology states. The actor and critic networks within the GNN-RL agent each contain an m-GNN graph encoder and a multi-layer perceptron. The graph encoder learns the graph embedding, and the multi-layer perceptron projects the embedding into the action space (i.e., the compression policy). The actor's output layer applies the sigmoid function to bound the actions within $(0, 1)$.

Specifically, we perform FLOPs-constrained model compression using structured channel pruning (filter pruning) on the DNN's convolutional layers, which are the most computationally intensive. Thus, the GNN-RL agent's action space $A \in \mathbb{R}^{N}$, where $N$ is the number of pruned layers, consists of the pruning ratios of the hidden layers: $A = \{a_i\}$, $i \in \{1, 2, \ldots, N\}$, where $a_i \in [0, 1)$ is the pruning ratio for the $i$-th layer. The GNN-RL agent produces the actions directly from the topology states:

$$g = \mathrm{GraphEncoder}(\mathcal{G}^l), \qquad (6)$$
$$A = \mathrm{MLP}(g), \qquad (7)$$

where $\mathcal{G}^l$ is the environment state, $g$ is the graph representation, and MLP is a multi-layer perceptron. The graph encoder learns the topology embedding, and the MLP projects the embedding into the hidden layers' pruning ratios. The reward function is defined as

$$R_{err} = -\mathrm{Error}, \qquad (8)$$

where Error is the compressed DNN's top-1 error on the validation set.
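Putting Eqs. 6–8 together, the actor can be sketched as a graph encoder followed by an MLP with a sigmoid output; the hidden size and layer count below are our own illustrative choices:

```python
# Sketch of the GNN-RL actor: topology embedding -> per-layer pruning ratios.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, graph_encoder, emb_dim, num_prunable_layers, hidden=128):
        super().__init__()
        self.encoder = graph_encoder                 # m-GNN graph encoder (Eq. 6)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_prunable_layers),
            nn.Sigmoid(),                            # bound actions to (0, 1)
        )

    def forward(self, graph):
        g = self.encoder(graph)                      # topology embedding g
        return self.mlp(g)                           # A = MLP(g), Eq. 7

def episode_reward(top1_error):
    return -top1_error                               # R_err = -Error, Eq. 8
```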
4. Experiments
To show the effectiveness of GNN-RL, we evaluate our approach on over-parameterized DNNs (e.g., ResNet-20/32/44/56/110 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015)) and mobile-friendly DNNs (e.g., MobileNet (Howard et al., 2017; Sandler et al., 2018) and ShuffleNet (Ma et al., 2018; Zhang et al., 2018b)). Additionally, to demonstrate the superiority of our proposed method, we compare GNN-RL with three sets of methods:

• Uniform, shallow, and deep empirical policies (He et al., 2017; Li et al., 2016).
• Hand-crafted channel reduction methods, such as SPP (Wang et al., 2017), FP (Li et al., 2016), and RNP (Lin et al., 2017).
• State-of-the-art RL-based AutoML methods, such as AMC (He et al., 2018), AGMC (Yu et al., 2020), and random search (RS) with RL.

We use a soft target update rate $\tau$ for the GNN-RL updates. In the first episodes, we warm up the agent with random actions; the agent then exploits for 150 episodes with exponentially decayed noise and trains the networks with a batch size of 64 and a replay buffer of size 2000.

The experiments involve multiple datasets, including CIFAR-10/100 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015). For CIFAR-10/100, we sample images from the test set as the validation set; for ILSVRC-2012, we likewise split images from the test set as the validation set. During the search, the DNN-Graph environment uses the compressed model's $R_{err}$ on the validation set as the GNN-RL agent's reward.
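The exploration schedule described above can be sketched as follows; the initial noise scale and decay constant are our own illustrative values, not reported hyperparameters:

```python
# Warm-up with random actions, then exploration with exponentially decayed noise.
import numpy as np

def select_action(actor, graph, episode, warmup, num_layers, sigma0=0.5, decay=0.98):
    if episode < warmup:
        return np.random.uniform(0.0, 1.0, size=num_layers)        # random warm-up action
    sigma = sigma0 * decay ** (episode - warmup)                    # exponentially decayed noise
    action = actor(graph).detach().cpu().numpy()
    return np.clip(action + np.random.normal(0.0, sigma, size=num_layers), 0.0, 1.0)
```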
We evaluate the effectiveness of GNN-RL on ResNet-20/32/44/56/110 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015), which fall into the over-parameterized networks category. With its residual connections, ResNet avoids vanishing gradients and allows efficient training of its deep layers. However, its deep structure and billions of parameters make ResNet a challenging network to deploy on edge devices. Similarly, the VGG-16 network contains compact and dense convolutional layers, where some layers have hundreds of filters, leading to a giant model size (528 MB of GPU memory for VGG-16). To compress these over-parameterized DNNs, we perform FLOPs-constrained channel pruning (filter pruning) on their convolutional layers.

We trained the ResNet-20/32/44/56/110 and VGG-16 models on the CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015) datasets, respectively. Since the validation accuracy on the ImageNet dataset is sensitive to the compression ratio, with high compression ratios the accuracy drops considerably without fine-tuning (in some cases, the pruned model without fine-tuning has very low validation accuracy). We therefore applied a one-epoch fine-tuning process in each RL search episode to ensure that GNN-RL gets a valuable reward when pruning VGG-16. When pruning ResNet-20/32/44/56/110, we share the pruning index between residual-connection layers to avoid channel mismatch.

Figure 3. The hidden layers' pruning ratios of the pruned ResNet-110 and ResNet-56. The horizontal axis is the layer index, and the vertical axis is the pruning ratio. The bars that touch the dotted line are the residual-connection layers. Since we share the pruning ratios between residual-connection layers to avoid a hidden-state dimension mismatch, the residual-connection layers' pruning ratios are uniform. For both ResNet-110 and ResNet-56, the residual-connection layers' pruning ratio is higher than that of most other layers.

Table 1 shows the top-1 test accuracy of the pruned models. We set the FLOPs constraint, and all the RL-based methods use $R_{err}$ as the reward. After pruning, we fine-tuned the DNNs for 100 epochs and only updated the pruned layers' parameters. The results show that GNN-RL outperforms all the baselines and achieves a higher test accuracy and compression ratio. For the ResNet-110/56/44 models, the model pruned by GNN-RL even achieves a higher test accuracy than the original model. After further investigation, we believe this is due to over-fitting of ResNet-110/56/44, as their accuracy on the training set was 100%. To verify this assumption, we performed a further experiment to explore the relationship between the FLOPs constraint and the accuracy of the DNNs. Figure 4 shows that a FLOPs ratio between 0.4 and 0.6 (compared with the original model's FLOPs) yields the highest test accuracy on ResNet-110. When the FLOPs reduction ratio exceeds 0.6, the test accuracy drops sharply.

In addition to the experiments above, we further analyzed the redundancy and importance of each layer. Figure 3 shows the hidden layers' pruning ratios on ResNet-110 and ResNet-56. ResNet contains residual-connection layers, which transfer hidden states directly from previous residual layers. Thus, the residual-connection layers are more redundant and informative, since they contain the information of both the current layer's hidden states and the previous layers'. The GNN-RL agent automatically learns that the residual-connection layers are more redundant and applies more pruning to them. Another insight from Figure 3 is that the GNN-RL agent applies more pruning to layers 45 to 65 of ResNet-110; similarly, layers 23 to 35 of ResNet-56 are pruned more heavily. Such an insight shows that the middle layers have less impact on model accuracy.
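The pruning-index sharing mentioned above can be sketched as computing one channel mask per residual group and reusing it for every layer in that group; the importance score below is a placeholder of our own:

```python
# Share one channel mask across all layers feeding the same residual shortcut
# so that the shortcut additions keep matching dimensions.
def shared_channel_mask(num_channels, pruning_ratio, scores):
    """Keep the highest-scoring channels; `scores` is any per-channel importance
    measure (e.g., the L2 norm of the corresponding filters)."""
    keep = max(1, int(round(num_channels * (1.0 - pruning_ratio))))
    order = sorted(range(num_channels), key=lambda c: scores[c], reverse=True)
    kept = set(order[:keep])
    return [c in kept for c in range(num_channels)]

# Usage: compute the mask once per residual group, then apply it to every
# layer in the group instead of pruning each layer independently.
```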
Figure 4.
Test accuracy of ResNet-110 using various FLOPs ratios.
Figure 5.
An example of pruning depth-wise filters in MobileNet-v1 blocks.
We evaluated GNN-RL on MobileNet-v1/v2 (Howard et al., 2017; Sandler et al., 2018) and ShuffleNet-v1/v2 (Zhang et al., 2018b; Ma et al., 2018), which are more suitable for devices with limited resources. Instead of using traditional convolutional operations, MobileNet-v1/v2 and ShuffleNet-v1/v2 are designed around more efficient convolutional blocks. To maintain the characteristics and high efficiency of these custom-designed blocks, we have developed specific pruning strategies for them.
Table 1.
Top-1 classification accuracy for pruned over-parameterized DNNs. The original ResNet-110/56/44/32/20 models are pre-trained on CIFAR-10 with top-1 test accuracies of 93.68%, 93.39%, 93.10%, 92.63%, and 91.73%, respectively. The original VGG-16 is pre-trained on ImageNet with a top-1 test accuracy of 70.5%. The FLOPs column is the preserved FLOPs ratio compared with the original model.

MODEL        METHOD    FLOPS   ACC. %   ΔACC.
ResNet-110   AGMC      50%     93.x     −
             RS        50%     87.x     −
             GNN-RL    —       —        —
ResNet-56    Uniform   50%     87.x     −
             Deep      50%     88.x     −
             AMC       50%     90.x     −
             AGMC      50%     92.x     −
             GNN-RL    —       —        —
ResNet-44    AGMC      50%     92.x     −
             RS        50%     88.x     −
             GNN-RL    —       —        —
ResNet-32    AGMC      50%     90.x     −
             RS        50%     89.x     −
             GNN-RL    —       —        —
ResNet-20    Deep      50%     79.x     −
             Shallow   50%     83.x     −
             Uniform   50%     84       −
             AMC       50%     86.x     −
             AGMC      50%     88.x     −
             GNN-RL    —       —        —
VGG-16       FP        —       —        −
             RNP       —       —        −
             SPP       —       —        −
             GNN-RL    —       —        —
50% 88 . − . GNN - RL − . VGG-16 FP . − . RNP . − . SPP . − . GNN - RL − . oped specific pruning strategies for them.4.2.1. P RUNING STRATEGY
MobileNet-v1.
The MobileNet-v1 block separates the convolution into depth-wise and point-wise convolutions (Howard et al., 2017). Each depth-wise filter operates on only one channel of the feature maps, while the point-wise operations are $1 \times 1$ convolutions that operate on the feature maps produced by the depth-wise convolutions. In our experiments, applying regular filter pruning to such layers causes information loss. As depicted in Figure 5, pruning the filter painted in grey causes its corresponding channel (the green one) to be deleted as well. To handle this, instead of pruning depth-wise and point-wise filters separately, we only prune the point-wise filters within MobileNet-v1 blocks.
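A sketch of how this point-wise-only selection can be implemented for a PyTorch MobileNet-v1 model; the selection rule is our own, written against standard `nn.Conv2d` attributes:

```python
# Select only the 1x1 point-wise convolutions as prunable layers and leave the
# depth-wise filters untouched.
import torch.nn as nn

def prunable_pointwise_layers(model):
    layers = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            is_depthwise = module.groups > 1 and module.groups == module.in_channels
            is_pointwise = module.kernel_size == (1, 1) and module.groups == 1
            if is_pointwise and not is_depthwise:
                layers.append((name, module))
    return layers
```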
MobileNet-v2. MobileNet-v2 is principally designed around MobileNet-v1 blocks with an additional linear expansion layer. The linear expansion layers are $1 \times 1$ convolutions without non-linear activation, and residual shortcuts connect every two linear expansion layers, linking the MobileNet-v1 blocks. Similar to MobileNet-v1, here we prune the linear expansion layers and the point-wise convolutional layers. Since the residual connections are between linear expansion layers, we share the linear expansion layers' pruning ratio.

Table 2.
Top-1 classification accuracies for pruned mobile-friendly DNNs. The original MobileNet-v1/v2 and ShuffleNet-v1/v2 are pre-trained on CIFAR-100 with top-1 test accuracies of 64.88%, 65.74%, 68.64%, and 68.85%, respectively. The FLOPs column is the preserved FLOPs ratio compared with the original model.
MODEL           METHOD   FLOPS   ACC. %   ΔACC.
ShuffleNet-v1   AGMC     60%     65.x     −
                RS       60%     63.x     −
                GNN-RL   —       —        —
ShuffleNet-v2   AGMC     60%     66.x     −
                RS       60%     65.x     −
                GNN-RL   —       —        —
MobileNet-v1    AGMC     80%     64.x     −
                RS       80%     63.x     −
                GNN-RL   —       —        —
MobileNet-v2    AGMC     80%     65.x     −
                RS       80%     65.x     −
                GNN-RL   —       —        —

ShuffleNet-v1/v2.
The ShuffleNet model uses blocks containing depth-wise and point-wise convolutions, channel shuffle, linear expansion, and residual connections. To avoid dimension mismatches when downsampling, we consider each ShuffleNet block as a whole and perform channel pruning inside the block. Within a ShuffleNet block, we do not prune the expansion layer (the output layer of the block), which preserves the number of output channels and keeps the feature-map dimensions when downsampling.
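A sketch of this block-level selection for ShuffleNet, assuming the convolutions of one block are collected in order; the helper name is hypothetical:

```python
# Prune only inside a ShuffleNet block: the block's expansion/output layer is
# skipped so the output channel count and downsampling dimensions are preserved.
def prunable_layers_in_block(block_convs):
    """block_convs: ordered list of (name, conv) pairs of one ShuffleNet block."""
    return block_convs[:-1]   # everything except the block's output (expansion) layer
```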
4.2.2. Results
Table 2 shows the FLOPs-constrained channel pruning results with 60% and 80% preserved FLOPs for ShuffleNet and MobileNet, respectively. We compared GNN-RL with AGMC (Yu et al., 2020) and random search (RS) with RL. We did not include AMC and the hand-crafted methods, since we designed specific pruning strategies for the mobile-friendly DNNs; we believe these strategies are incompatible with AMC's layer embeddings and hand-crafted rules, which would lead to an unfair comparison.

The MobileNet-v1/v2 and ShuffleNet-v1/v2 models are pre-trained on CIFAR-100 (Krizhevsky & Hinton, 2009). After pruning, we fine-tuned the compressed DNNs for 150 epochs. Our approach outperformed all the baselines. Although the networks are already very compact, with the FLOPs reduction applied to MobileNet-v2, GNN-RL still increases the top-1 accuracy.
Table 3.
The latency and GPU memory usage of the compressed models. The ResNet-110/56/44/32/20 models are tested on CIFAR-10, VGG-16 is tested on ImageNet, and MobileNet-v1/v2 and ShuffleNet-v1/v2 are tested on CIFAR-100. The FLOPs column is the preserved FLOPs ratio compared with the original model. Each model has two rows: the original model (100% FLOPs) and the GNN-RL-pruned model.

MODEL           FLOPS   LATENCY   GPU MEM.
VGG-16          100%    — ms      — MB
                20%     — ms      — MB
ResNet-110      100%    — ms      — MB
                48%     — ms      — MB
ResNet-56       100%    — ms      — MB
                46%     — ms      — MB
ResNet-44       100%    — ms      — MB
                49%     — ms      — MB
ResNet-32       100%    — ms      — MB
                49%     — ms      — KB
ResNet-20       100%    — ms      — MB
                49%     — ms      — KB
MobileNet-v1    100%    — ms      — MB
                79%     — ms      — MB
MobileNet-v2    100%    — ms      — MB
                79%     — ms      — MB
ShuffleNet-v1   100%    — ms      — MB
                58%     — ms      — MB
ShuffleNet-v2   100%    — ms      — MB
                54%     — ms      — MB

The inference latency and memory usage of compressed DNNs are essential metrics for deciding whether a DNN can be deployed on a given platform. Thus, we evaluated the pruned models' inference latency using PyTorch 1.7.1 on an Nvidia GTX 1080Ti GPU and recorded the GPU memory usage. The ResNet-110/56/44/32/20 models are measured on the CIFAR-10 test set with batch size 32, VGG-16 is evaluated on the ImageNet test set with batch size 32, and MobileNet-v1/v2 and ShuffleNet-v1/v2 are measured on CIFAR-100 with batch size 32.

Table 3 shows the inference accelerations and memory savings on our GPU. All the models pruned by GNN-RL achieve noteworthy inference acceleration and GPU memory reductions. In particular, the original VGG-16 model's GPU memory usage is 528 MB, since its dense layers contribute little to the FLOPs but lead to an extensive memory requirement. GNN-RL prunes the convolutional layers and significantly reduces the feature-map sizes, thus consuming 141 MB less memory than the original version. The inference acceleration on VGG-16 is also noticeable, with a 1.38× speedup on ImageNet. The inference acceleration for mobile-friendly DNNs may seem relatively insignificant; however, such models are designed for deployment on mobile devices, and we believe that our test GPU, with its extensive resources, does not take advantage of their mobile-friendly properties.
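The latency and memory numbers above can be reproduced with a measurement loop along the following lines (our own script, not the authors'); batch size 32 matches the setup described in the text:

```python
# Measure average inference latency and peak GPU memory for a model.
import time
import torch

@torch.no_grad()
def measure(model, input_shape=(3, 32, 32), batch_size=32, warmup=10, iters=100):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch_size, *input_shape, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(warmup):                      # warm-up runs are not timed
        model(x)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize(device)
    latency_ms = (time.time() - start) / iters * 1000.0
    peak_mem_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return latency_ms, peak_mem_mb
```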
5. Conclusion
This paper proposed a network compression approach called GNN-RL, which utilizes a graph neural network and a reinforcement-learning agent to exploit a topology-aware compression policy. We introduced a DNN-Graph environment that converts the compression state into a topology-changing process and allows GNN-RL to learn the desired compression ratio without human intervention. To efficiently embed DNNs and take advantage of motifs, we introduced m-GNN, a new multi-stage graph embedding method. In our experiments, GNN-RL was validated and verified on over-parameterized and mobile-friendly networks. For the over-parameterized models pruned by GNN-RL, the test accuracy of ResNet-110, ResNet-56, and ResNet-44 even outperformed the original models. For the mobile-friendly DNNs, the MobileNet-v2 pruned by GNN-RL increased the test accuracy compared with the original model. Additionally, all the pruned models accelerated inference and saved a considerable amount of memory.

References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proc. of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3), February 2017.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Dudziak, L., Chau, T., Abdelfattah, M. S., Lee, R., Kim, H., and Lane, N. D. BRP-NAS: Prediction-based NAS using GCNs. 2021.

Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient DNNs. In Proc. of the Advances in Neural Information Processing Systems, volume 29, pp. 1379–1387, 2016.

Guo, Y., Zheng, Y., Tan, M., Chen, Q., Chen, J., Zhao, P., and Huang, J. NAT: Neural architecture transformer for accurate and compact architectures. In Proc. of the Advances in Neural Information Processing Systems, volume 32, pp. 737–748, 2019.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. of the International Conference on Learning Representations (ICLR), 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. pp. 770–778, 2016.

He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proc. of the IEEE International Conference on Computer Vision, pp. 1389–1397, 2017.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. In Proc. of the European Conference on Computer Vision (ECCV), pp. 784–800, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In Proc. of the NIPS Deep Learning and Representation Learning Workshop, 2015.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for MobileNetV3. In Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324, 2019.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

Huang, G., Liu, S., Van der Maaten, L., and Weinberger, K. Q. CondenseNet: An efficient DenseNet using learned group convolutions. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761, 2018.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. of the International Conference on Learning Representations (ICLR), 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.

Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In Proc. of the ICLR (Poster), 2016.

Lin, J., Rao, Y., Lu, J., and Zhou, J. Runtime neural pruning. In Proc. of the Advances in Neural Information Processing Systems, pp. 2181–2191, 2017.

Liu, N., Ma, X., Xu, Z., Wang, Y., Tang, J., and Ye, J. AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates. In Proc. of the AAAI Conference on Artificial Intelligence, pp. 4876–4883, 2020.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proc. of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.

Mehta, S., Hajishirzi, H., and Rastegari, M. DiCENet: Dimension-wise convolutions for efficient networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Proc. of the Advances in Neural Information Processing Systems, volume 32, pp. 8026–8037, 2019.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. of the European Conference on Computer Vision, pp. 525–542, 2016.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., and Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6655–6659, 2013.

Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. pp. 4510–4520, 2018.

Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In The Semantic Web, pp. 593–607. Springer International Publishing, 2018.

Shi, H., Pi, R., Xu, H., Li, Z., Kwok, J. T., and Zhang, T. Bridging the gap between sample-based and one-shot neural architecture search with BONAS. 2020.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. of the International Conference on Learning Representations (ICLR), 2015.

Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proc. of the International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.

Wang, H., Zhang, Q., Wang, Y., and Hu, R. Structured probabilistic pruning for deep convolutional neural network acceleration. British Machine Vision Conference, 2017.

Ying, R., You, J., Morris, C., Ren, X., Hamilton, W. L., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Proc. of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), pp. 4805–4815, 2018.

Yu, S., Mazaheri, A., and Jannesari, A. Auto graph encoder-decoder for model compression and network acceleration. 2020.

Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., and Wang, Y. A systematic DNN weight pruning framework using alternating direction method of multipliers. ECCV, 2018a.

Zhang, X., Zhou, X., Lin, M., and Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856, 2018b.