WeightNet: Revisiting the Design Space of Weight Networks
Ningning Ma, Xiangyu Zhang*, Jiawei Huang, and Jian Sun
Hong Kong University of Science and Technology; MEGVII Technology
[email protected], {zhangxiangyu,huangjiawei,sunjian}@megvii.com
* Corresponding author

Abstract.
We present a conceptually simple, flexible, and effective framework for weight generating networks. Our approach is general: it unifies two currently distinct and extremely effective modules, SENet and CondConv, into the same framework in weight space. The method, called
WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the WeightNet, composed entirely of (grouped) fully-connected layers, to directly output the convolutional weight. WeightNet is easy and memory-conserving to train, since it operates on the kernel space instead of the feature space. Because of this flexibility, our method outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs. The framework on the flexible weight space has the potential to further improve the performance. Code is available at https://github.com/megvii-model/WeightNet.

Keywords:
CNN architecture design, weight generating network, conditional kernel
Designing the convolution weight is a key issue in convolutional neural networks (CNNs). The weight-generating methods [14,6,24], which produce the weight with a network and which we call weight networks, provide an insightful neural architecture design space. These approaches are conceptually intuitive, and easy and efficient to train. Our goal in this work is to present a simple and effective framework in the design space of weight networks, inspired by a rethinking of recent effective conditional networks.

Conditional networks (or dynamic networks) [2,11,38], which use extra sample-dependent modules to conditionally adjust the network, have achieved great success. SENet [11], an effective and robust attention module, helps many tasks achieve state-of-the-art results [9,31,32]. Conditionally Parameterized Convolution (CondConv) [38] uses over-parameterization to achieve great improvements while maintaining the computational complexity at the inference phase.
Fig. 1. Accuracy vs. FLOPs vs. Parameters comparisons on ImageNet, using ShuffleNetV2 [22], for StandardConv, SE, SK, CondConv (2x, 4x), and WeightNet. (a) The trade-off between accuracy and FLOPs; (b) the trade-off between accuracy and the number of parameters.
Both methods consist of two steps: first, they obtain an attention activation vector; then, using this vector, SE scales the feature channels, while CondConv performs a mixture of expert weights. Although they are usually treated as entirely distinct methods, they have some things in common. It is natural to ask: do they have any correlation?
We show that we can link these two extremely effective approaches by generalizing them in the weight network space. Our method, called
WeightNet, extends the first step by simply adding one more layer that generates the convolutional weight directly (see Fig. 2). The layer is a grouped fully-connected layer applied to the attention vector, generating the weight in a group-wise manner. To achieve this, we rethink SENet and CondConv and discover that the operations following the first step can be cast as a grouped fully-connected layer, of which the two methods are particular cases.

In that grouped layer, the output is directly the convolution weight, but the input size and the group number are variable. In CondConv the group number turns out to be the minimum value of one, and the input is small (4, 8, 16, etc.) to avoid rapid growth of the model size. In SENet the group number turns out to be the maximum value, equal to the number of input channels.

Although the two variants differ in seemingly minor ways, the differences have a large impact: together they control the parameter-FLOPs-accuracy trade-off, leading to surprisingly different performance. Accordingly, we introduce two hyperparameters, M and G, to control the input number and the group number, respectively. These two hyperparameters of the additional grouped fully-connected layer have not been observed and investigated before. By simply adjusting them, we can strike a better trade-off between representation capacity and the number of model parameters. We show the superiority of our method through experiments on ImageNet classification and COCO detection (Figure 1).

Our main contributions are: 1) First, we rethink the weight-generating manners of SENet and CondConv, for the first time, as complete fully-connected networks; 2) Second, only from this new perspective can we revisit the novel network design space in weight space, which provides more effective structures than those in the convolution design space (group-wise, point-wise, and depth-wise convolutions, etc.). In this new and rarely explored weight space, there could be new structures besides fully-connected layers, and there could also be more kinds of sparse matrices besides those in Fig. 4. We believe this is a promising direction and hope it will have a broad impact on the vision community.

Weight generation networks
Schmidhuber et al. [28] incorporate "fast" weights into recurrent connections in RNN methods. Dynamic filter networks [14] use filter-generating networks for video and stereo prediction. HyperNetworks [6] decouple the neural network according to a relationship found in nature, a genotype (the hypernetwork) and a phenotype (the main network): a small network produces the weights for the main network, which reduces the number of parameters while achieving respectable results. Meta networks [24] generate weights using a meta learner for rapid generalization across tasks. These methods [14,6,24,25] provide a worthy design space for weight-generating networks; our method follows this spirit and uses a WeightNet to generate the weights.
Conditional CNNs
Different from standard CNNs [29,7,30,40,4,10,27,9], conditional (or dynamic) CNNs [17,20,37,39,15] use dynamic kernels, widths, or depths conditioned on the input samples, showing great improvements. Spatial Transformer Networks [13] learn to warp the feature map in a parametric way. Yang et al. [38] proposed conditionally parameterized convolution to mix the experts voted by each sample's features. These methods are extremely effective, improving the Top-1 accuracy by more than 5% on the ImageNet dataset. Different from dynamic features or dynamic kernels, another series of works [35,12] focuses on dynamic depths of convolutional networks, skipping some layers for different samples.
Attention and gating mechanism
The attention mechanism [33,21,1,34,36] is also a kind of conditional network that adjusts the network depending on the input. Recently the attention mechanism has shown great improvements. Hu et al. [11] proposed a block-wise gating mechanism to enhance the representation ability, adopting a squeeze-and-excitation method to use global information and capture channel-wise dependencies. SENet achieves great success not only by winning the ImageNet challenge [5], but also by helping many structures achieve state-of-the-art performance [9,31,32]. In GaterNet [3], a gater network is used to predict binary masks for the backbone CNN, which results in a performance improvement. Besides, Li et al. [16] introduced a kernel selecting module that adds attention to kernels of different sizes to enhance the CNN's learning capability. In contrast, WeightNet is designed in kernel space, which is more time-conserving and memory-conserving than feature space.
WeightNet generalizes the two current extremely effective modules in weight space.
Fig. 2. The WeightNet structure.
The convolutional weight is generated by WeightNet, which is comprised entirely of (grouped) fully-connected layers. The pooling marker in the figure represents the dimension reduction (global average pooling) from feature space (C × H × W) to kernel space (C). 'FC' denotes a fully-connected layer, 'conv' denotes convolution, and 'WN' denotes the WeightNet.

Our method is conceptually simple: both SENet and CondConv generate the activation vector using a global average pooling (GAP) and one or two fully-connected layers followed by a non-linear sigmoid operation; to this we simply add one more grouped fully-connected layer to generate the weight directly (Fig. 2). This differs from the common practice of applying the vector to the feature space, and it avoids a memory-consuming training period. WeightNet is computationally efficient because of the dimension reduction (GAP) from the C × H × W dimensions to a 1-D dimension C. Evidently, WeightNet consists only of (grouped) fully-connected layers. We begin by introducing the matrix multiplication behavior of (grouped) fully-connected operations.

Grouped fully-connected operation
Conceptually, neurons in a fully-connected layer have full connections and thus can be computed with a matrix multiplication, in the form Y = WX (see Fig. 3 (a)). Further, neurons in a grouped fully-connected layer have group-wise sparse connections with the activations in the previous layer.

Formally, in Fig. 3 (b), the neurons are divided exactly into g groups, and each group (with i/g inputs and o/g outputs) performs a fully-connected operation (see the red box for example). One notable property of this operation, easily seen in the graphic illustration, is that the weight matrix becomes a sparse, block diagonal matrix, with a size of (o/g × i/g) in each block.

The grouped fully-connected operation is a general form of the fully-connected operation, in which the group number is one. Next, we show how it generalizes CondConv and SENet: we use a grouped fully-connected layer to replace the operations following the activation vector and directly output the generated weight.
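To make this equivalence concrete, the following PyTorch sketch (our illustration, not part of the official code release) checks that a grouped fully-connected layer with g groups, computed group by group as in the red box of Fig. 3 (b), equals a single matrix multiplication with a block diagonal weight:

```python
# Minimal sketch: a grouped fully-connected layer as a block-diagonal matmul.
import torch

def grouped_fc(x, weight, g):
    """x: (i,), weight: (g, o//g, i//g) -> output: (o,)."""
    i = x.numel()
    o = weight.shape[0] * weight.shape[1]
    x_groups = x.view(g, i // g, 1)           # split the inputs into g groups
    y_groups = torch.bmm(weight, x_groups)    # per-group matmul (red box in Fig. 3b)
    return y_groups.view(o)

i, o, g = 8, 12, 4
x = torch.randn(i)
w = torch.randn(g, o // g, i // g)

# Dense view: place the g blocks on the diagonal of an (o x i) sparse matrix.
w_dense = torch.block_diag(*[w[k] for k in range(g)])
assert torch.allclose(grouped_fc(x, w, g), w_dense @ x, atol=1e-6)
```

With g = 1 the block diagonal matrix degenerates to a full matrix, recovering the plain fully-connected layer of Fig. 3 (a).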
Denotation

We denote a convolution operation with input feature map X ∈ R^{C×h×w}, output feature map Y ∈ R^{C×h'×w'}, and convolution weight W' ∈ R^{C×C×k_h×k_w}. For simplicity, but without loss of generality, it is assumed that the number of input channels equals the number of output channels; here (h, w), (h', w'), (k_h, k_w) denote the 2-D heights and widths of the input, output, and kernel. We denote the convolution operation by the symbol (∗): Y_c = W'_c ∗ X. We use α to denote the attention activation vector in CondConv and SENet.
Fig. 3.
The matrix multiplication behaviors of the (grouped) fully-connected operations. Here i, o and g denote the number of input channels, the number of output channels and the group number. (a) A standard matrix multiplication representing a fully-connected layer. (b) With the weight as a block diagonal sparse matrix, it becomes a general grouped fully-connected layer. Each group (red box) is exactly a standard matrix multiplication as in (a), with i/g input channels and o/g output channels. Fig. (a) is a special case of Fig. (b) with g = 1.

Conditionally parameterized convolution (CondConv) [38] is a mixture of m expert weights, voted by an m-dimensional vector α that is sample-dependent and makes each sample's weight dynamic.

Formally, we begin by reviewing the first step of CondConv: it obtains α through a global average pooling and a fully-connected layer W_fc, followed by a sigmoid σ(·): α = σ(W_fc × (1/hw) Σ_{i∈h,j∈w} X_{c,i,j}), where (×) denotes matrix multiplication, W_fc ∈ R^{m×C}, and α ∈ R^{m×1}.

Next, we show that the subsequent mixture-of-experts operation in the original paper can essentially be replaced by a fully-connected layer. The weight is generated from multiple expert weights, W' = α_1·W_1 + α_2·W_2 + ... + α_m·W_m, with W_i ∈ R^{C×C×k_h×k_w} (i ∈ {1, 2, ..., m}). We rethink it as follows:

    W' = W^T × α,   where W = [W_1 W_2 ... W_m]    (1)

Here W ∈ R^{m × CCk_hk_w} denotes the matrix concatenation result and (×) denotes matrix multiplication (the fully-connected layer in Fig. 3a). Therefore, the weight is generated by simply adding one more layer (W) to the activation layer; that layer is a fully-connected layer with m inputs and C × C × k_h × k_w outputs. This differs from the practice of the original paper in the training phase, which is memory-consuming and suffers from a batching problem as m increases (the batch size has to be set to one when m is large).
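The following small sketch (our own check, not the official CondConv code) verifies the claim of Eq. (1): mixing m expert kernels with the activation vector α is the same as applying one fully-connected layer that maps α to the flattened kernel:

```python
# Sketch: CondConv's mixture of experts rewritten as a fully-connected layer.
import torch

m, C, kh, kw = 4, 16, 3, 3
alpha = torch.rand(m)                       # stands in for the sigmoid activation vector
experts = torch.randn(m, C, C, kh, kw)      # W_1 ... W_m

# Mixture-of-experts view: W' = alpha_1*W_1 + ... + alpha_m*W_m
w_mix = (alpha.view(m, 1, 1, 1, 1) * experts).sum(dim=0)

# Fully-connected view: W is the (m x C*C*kh*kw) concatenation of flattened experts,
# and W' = W^T @ alpha, reshaped back into a kernel.
W = experts.view(m, -1)
w_fc = (W.t() @ alpha).view(C, C, kh, kw)

assert torch.allclose(w_mix, w_fc, atol=1e-5)
```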
Squeeze-and-Excitation (SE) [11] is an extremely effective "plug-and-play" block that acts on the feature map. We integrate the SE module into the convolution kernels and discover that it can also be represented by adding one more grouped fully-connected layer to the activation vector α. We start by reviewing the generation process of α. It is similar to CondConv: a global average pooling and two fully-connected layers with a non-linear ReLU (δ) and a sigmoid (σ): α = σ(W_fc2 × δ(W_fc1 × (1/hw) Σ_{i∈h,j∈w} X_{c,i,j})), where W_fc1 ∈ R^{C/r×C}, W_fc2 ∈ R^{C×C/r}, and (×) denotes matrix multiplication. The two fully-connected layers here are mainly used to reduce the number of parameters: because α here is a C-dimensional vector, a single layer would be parameter-consuming.

In common practice the block is applied before or after a convolution layer: α is applied right before a convolution (on the input feature X), Y_c = W'_c ∗ (X · α), or right after a convolution (on the output feature Y), Y_c = (W'_c ∗ X) · α_c, where (·) denotes element-wise multiplication broadcast along the C axis. In contrast, at the kernel level, we analyze the case in which SE acts on W': Y_c = (W'_c · α_c) ∗ X. Therefore we rewrite the weight as W' · α; the (·) here is different from the (×) in Eq. 1. In that case, a dimension reduction is performed; in this case, there is no dimension reduction. It is essentially a grouped sparsely connected operation, a particular case of Fig. 3 (b), with C inputs, C × C × k_h × k_w outputs, and C groups.
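Analogously, the kernel-level SE case can be checked with a short sketch (again our own illustration): scaling filter c of a static kernel W' by α_c is a grouped fully-connected layer with C groups, where the block of group c is simply the flattened filter W'_c:

```python
# Sketch: SE applied on the kernel as a grouped fully-connected layer with C groups.
import torch

C, kh, kw = 8, 3, 3
alpha = torch.rand(C)                        # per-channel activation
w_static = torch.randn(C, C, kh, kw)         # static kernel W'

# Kernel-level SE: W'_c <- W'_c * alpha_c
w_se = w_static * alpha.view(C, 1, 1, 1)

# Grouped-FC view: C groups, each with 1 input (alpha_c) and C*kh*kw outputs.
blocks = [w_static[c].reshape(-1, 1) for c in range(C)]   # (C*kh*kw) x 1 blocks
w_blockdiag = torch.block_diag(*blocks)                   # (C*C*kh*kw) x C sparse matrix
w_grouped = (w_blockdiag @ alpha).view(C, C, kh, kw)

assert torch.allclose(w_se, w_grouped, atol=1e-6)
```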
Thus far, we note that the group number in the general grouped fully-connected layer (Fig. 3 b) can take values ranging from 1 to the channel number: it has a minimum value of one and a maximum value equal to the number of input channels. It therefore generalizes CondConv, where the group number takes the minimum value (one), and SENet, where it takes the maximum value (the input channel number). We conclude that they are two extreme cases of the general grouped fully-connected layer (Fig. 4).

Fig. 4. The diagrams of the different cases of the block diagonal matrix (Fig. 3b) that can represent the weights of the grouped fully-connected layer in CondConv, SENet and the general WeightNet. They output the same fixed size (the convolution kernel's size C × C × k_h × k_w), but have different group numbers: (a) the group number takes the minimum value of one (CondConv); (b) the group number takes the maximum value, equal to the input size C (SENet); since (a) and (b) are extreme cases, (c) shows a general group number between 1 and the input size (WeightNet).

We summarize the configurations of the grouped fully-connected layer in Table 1 and generalize them using two additional hyperparameters M and G. To make the group number more flexible, we set it by combining the channel number C with a constant hyperparameter G. Moreover, another hyperparameter M is used to control the input number; thus M and G together control the parameter-accuracy trade-off. The layer in CondConv is a special case with M = m/C, G = 1/C, while for SENet M = 1, G = 1. We constrain M × C and G × C to be integers, and M is divisible by G in this case. It is notable that these two hyperparameters have been right there but had not been noticed and investigated before.

Table 1. Summary of the configurations of the grouped fully-connected layer. λ is the ratio of the input size to the group number, representing the main increase in parameters.

Model     | Input size | Group number | λ   | Output size
CondConv  | m          | 1            | m   | C × C × k_h × k_w
SENet     | C          | C            | 1   | C × C × k_h × k_w
WeightNet | M × C      | G × C        | M/G | C × C × k_h × k_w

Implementation details
For the generation step of the activation vector α: since α is an (M × C)-dimensional vector, it may be large and parameter-consuming; therefore, we use two fully-connected layers with a reduction ratio r. The process is similar to that of the two methods above, a global average pooling and two fully-connected layers followed by a non-linear sigmoid (σ): α = σ(W_fc2 × W_fc1 × (1/hw) Σ_{i∈h,j∈w} X_{c,i,j}), where W_fc1 ∈ R^{C/r×C}, W_fc2 ∈ R^{MC×C/r}, (×) denotes matrix multiplication, and r has a default value of 16.

In the second step, we adopt a grouped fully-connected layer with M × C inputs, C × C × k_h × k_w outputs, and G × C groups. We note that this structure is a straightforward design; more complex structures have the potential to further improve the performance, but that is beyond the focus of this work.
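A minimal PyTorch sketch of these two steps is given below. It reflects our reading of the text rather than the official implementation (see the linked repository for that); the 1x1 grouped convolution acting on the pooled vector plays the role of the grouped fully-connected layer:

```python
# Sketch of the two WeightNet steps: alpha generation and the grouped FC layer.
import torch
import torch.nn as nn

C, kh, kw = 32, 3, 3
M, G, r = 2, 2, 16

fc1 = nn.Conv2d(C, C // r, kernel_size=1, bias=False)       # W_fc1: C -> C/r
fc2 = nn.Conv2d(C // r, M * C, kernel_size=1, bias=False)   # W_fc2: C/r -> M*C
wn  = nn.Conv2d(M * C, C * C * kh * kw, kernel_size=1,
                groups=G * C, bias=False)                    # grouped fully-connected layer

x = torch.randn(4, C, 56, 56)                                # a batch of feature maps
gap = x.mean(dim=(2, 3), keepdim=True)                       # GAP: (B, C, 1, 1)
alpha = torch.sigmoid(fc2(fc1(gap)))                         # (B, M*C, 1, 1)
weight = wn(alpha).view(-1, C, C, kh, kw)                    # one convolution kernel per sample
```

How the per-sample kernels are then applied to the feature maps is described in the 'Training with batch dimension' paragraph below.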
Complexity analysis
The structure of WeightNet decouples the convolution computation and the weight computation into two separate branches (see Fig. 2). Because the spatial dimensions (h × w) are reduced before feeding into the weight branch, the computational cost (FLOPs) lies mainly in the convolution branch. The FLOPs complexities of the convolution and weight branches are O(hwCCk_hk_w) and O(MCCk_hk_w/G), respectively; the latter is relatively negligible. The parameter complexities of the two branches are zero and O(M/G × C × C × k_h × k_w), respectively, i.e., M/G times that of a normal convolution. We use λ to denote this ratio (Table 1).
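As a small worked example of these estimates (with our own toy numbers C = 64, h = w = 14, 3×3 kernels, M = 2, G = 2, i.e. λ = M/G = 1):

```python
# Worked example of the FLOPs/parameter estimates for the two branches.
C, h, w, kh, kw, M, G = 64, 14, 14, 3, 3, 2, 2

conv_flops    = h * w * C * C * kh * kw                     # convolution branch
weight_flops  = M * C * C * kh * kw // G                    # weight branch (grouped FC)
weight_params = (M * C) * (C * C * kh * kw) // (G * C)      # = (M/G) * C*C*kh*kw

print(conv_flops, weight_flops)   # 7225344 vs 36864: the weight branch is negligible
print(weight_params)              # 36864, i.e. lambda times the parameters of a normal conv
```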
Training with batch dimension

The weight generated by WeightNet has a batch dimension; here we briefly introduce the training method related to the batch dimension. We denote the batch size by B and reshape the input X of the convolution layer to (1, B × C, h, w). X thus has B × C channels, which means we regard different samples in the same batch as different channels. Next, we reshape the generated weight W to (B, C, C, k_h, k_w). The convolution then becomes a group convolution with a group number of B, where the inputs and outputs of each group are both equal to C. Therefore, we use the same memory-conserving method for both the training and inference phases, which is different from CondConv.
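A sketch of this batched trick (our own illustration, with made-up shapes) is shown below; it folds the batch axis into the channel axis and runs one group convolution with B groups, then checks the result against a per-sample loop:

```python
# Sketch: per-sample kernels applied via a single group convolution with groups=B.
import torch
import torch.nn.functional as F

B, C, h, w, kh, kw = 4, 32, 56, 56, 3, 3
x = torch.randn(B, C, h, w)
weight = torch.randn(B, C, C, kh, kw)                # one kernel per sample (e.g. from WeightNet)

x_folded = x.reshape(1, B * C, h, w)                 # samples in the batch become channels
w_folded = weight.reshape(B * C, C, kh, kw)
y = F.conv2d(x_folded, w_folded, padding=(kh // 2, kw // 2), groups=B)
y = y.reshape(B, C, h, w)                            # per-sample convolution outputs

# Reference: loop over the batch and convolve each sample with its own kernel.
y_ref = torch.stack([F.conv2d(x[b:b + 1], weight[b], padding=1)[0] for b in range(B)])
assert torch.allclose(y, y_ref, atol=1e-3)
```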
In this section, we evaluate WeightNet on classification and COCO detection tasks [19]. For classification, we conduct experiments on a light-weight CNN model, ShuffleNetV2 [22], and a deeper model, ResNet50 [7]. For detection, we evaluate our method's performance with distinct backbone models under RetinaNet. Finally, we conduct ablation studies and investigate the properties of WeightNet in various aspects.

We conduct image classification experiments on the ImageNet 2012 classification dataset, which includes 1000 classes [26]. Our models are first trained on the training set of 1.28 million images and then evaluated on the 50k images of the validation set. All ShuffleNetV2 [22] models are trained with the same settings as [22]. For ResNet50, we use a linearly decayed learning rate schedule starting at 0.1, a batch size of 256, a weight decay of 1e-4, and 600k iterations.
ShuffleNetV2
To investigate the performance of our method on light-weight convolutional networks, we conduct experiments based on the recent effective network ShuffleNetV2 [22]. For a fair comparison, we retrain all the models ourselves, using the same code base. We replace the standard convolution kernels in each bottleneck with our proposed WeightNet, and control the FLOPs and the number of parameters for a fair comparison.
Table 2. ImageNet classification results of WeightNet on ShuffleNetV2 [22]. For a fair comparison, we fix λ to 1× so that the experiments are under the same FLOPs and the same number of parameters.

Model               | Params | FLOPs | Top-1 err.
ShuffleNetV2 (0.5×) | 1.4M   | 41M   | 39.7
 + WeightNet (1×)   | 1.5M   | 41M   | –
ShuffleNetV2 (1×)   | 2.2M   | 138M  | 30.9
 + WeightNet (1×)   | 2.4M   | 139M  | –
ShuffleNetV2 (1.5×) | 3.5M   | 299M  | 27.4
 + WeightNet (1×)   | 3.9M   | 301M  | –
ShuffleNetV2 (2×)   | 5.5M   | 557M  | 25.5
 + WeightNet (1×)   | 6.1M   | 562M  | –

Table 3. ImageNet classification results of WeightNet on ShuffleNetV2 [22]. The comparison is under the same FLOPs, regardless of the number of parameters. To obtain the optimum performance, we set λ to {8×, 4×, 4×, 4×}, respectively.

Model               | Params | FLOPs | Top-1 err.
ShuffleNetV2 (0.5×) | 1.4M   | 41M   | 39.7
 + WeightNet (8×)   | 2.7M   | 42M   | –
ShuffleNetV2 (1×)   | 2.2M   | 138M  | 30.9
 + WeightNet (4×)   | 5.1M   | 141M  | –
ShuffleNetV2 (1.5×) | 3.5M   | 299M  | 27.4
 + WeightNet (4×)   | 9.6M   | 307M  | –
ShuffleNetV2 (2×)   | 5.5M   | 557M  | 25.5
 + WeightNet (4×)   | 18.1M  | 573M  | –

As shown in Table 1, λ is used to control the number of parameters in a convolution. For simplicity, we fix G = 2 when adjusting λ. In our experiments, λ takes several sizes {1×, 2×, 4×, 8×}. To make the number of channels conveniently divisible by G when scaling the number of parameters, we slightly adjust the number of channels for ShuffleNetV2 1× and 2×.

We evaluate WeightNet from two aspects. Table 2 reports the performance of our method with parameters taken into account. The experiments illustrate that our method has significant advantages over the other counterparts under the same FLOPs and the same number of parameters. ShuffleNetV2 0.5× gains 3% Top-1 accuracy without additional computation budget.

In Table 3, we report the advantages of applying our method to ShuffleNetV2 of different sizes. In practice, storage space is usually sufficient; therefore, without loss of fairness, we only constrain the FLOPs to be the same and tolerate the increase of parameters. ShuffleNetV2 (0.5×) gains 5.7% Top-1 accuracy, showing further significant improvement by adding a minority of parameters, and ShuffleNetV2 (2×) gains 2.0% Top-1 accuracy.

To further investigate the improvement of our method, we compare it with some recent effective conditional CNN methods under the same FLOPs and the same number of parameters. For the network settings of CondConv [38], we replace the standard convolutions in the bottlenecks with CondConv and change the number of experts as described in CondConv to adjust the parameters; as the number of experts grows, the number of parameters grows. To reveal the model capacity under the same number of parameters, for our proposed WeightNet we control the number of parameters by changing λ. Table 4 describes the comparison between our method and the other effective counterparts, from which we observe that our method outperforms the other conditional CNN methods under the same budgets. The Accuracy-Parameters trade-off and the Accuracy-FLOPs trade-off are shown in Figure 1.

Table 4. Comparison with recently effective attention modules on ShuffleNetV2 [22] and ResNet50 [7]. We show results on ImageNet.
Model                    | Params | FLOPs | Top-1 err.
ShuffleNetV2 [22] (0.5×) | 1.4M   | 41M   | 39.7
 + SE [11]               | 1.4M   | 41M   | 37.5
 + SK [16]               | 1.5M   | 42M   | 37.5
 + CondConv [38] (2×)    | 1.5M   | 41M   | 37.3
 + WeightNet (1×)        | 1.5M   | 41M   | –
 + CondConv [38] (4×)    | 1.8M   | 41M   | 35.9
 + WeightNet (2×)        | 1.8M   | 41M   | –
ShuffleNetV2 [22] (1.5×) | 3.5M   | 299M  | 27.4
 + SE [11]               | 3.9M   | 299M  | 26.4
 + SK [16]               | 3.9M   | 306M  | 26.1
 + CondConv [38] (2×)    | 5.2M   | 303M  | 26.3
 + WeightNet (1×)        | –      | –     | –
 + CondConv [38] (4×)    | 8.7M   | 306M  | 26.1
 + WeightNet (2×)        | –      | –     | –
ShuffleNetV2 [22] (2.0×) | 5.5M   | 557M  | 25.5
 + WeightNet (2×)        | 10.1M  | 565M  | –
ResNet50 [7]             | 25.5M  | 3.86G | 24.0
 + SE [11]               | 26.7M  | 3.86G | 22.8
 + CondConv [38] (2×)    | 72.4M  | 3.90G | 23.4
 + WeightNet (1×)        | 31.1M  | 3.89G | –

From the results, we can see that SE and CondConv boost the base models of all sizes significantly. However, CondConv brings its major improvements at the smaller sizes; as the model becomes larger, its advantage shrinks. For example, CondConv performs better than SE on ShuffleNetV2 0.5×, but SE performs better on ShuffleNetV2 2×. In contrast, we find that our method is uniformly better than SE and CondConv.

To reduce overfitting while increasing parameters, we add dropout [8] for models with more than 3M parameters. As described in Section 3.3, λ represents the increase of parameters, so we measure the capacity of the networks by changing the parameter multiplier λ in {1×, 2×, 4×, 8×}. We further analyze the effect of λ and of the grouping hyperparameter G on each filter in the ablation study section.
For larger classification models, we conduct experiments on ResNet50 [7]. In a similar way, we replace the standard convolution kernels in the ResNet50 bottlenecks with our proposed WeightNet. Besides, we train the conditional CNNs with the same training settings as the base ResNet50 network.

In Table 4, based on the ResNet50 model, we compare our method with SE [11] and CondConv [38] under the same computational budgets. It is shown that our method still performs better than the other conditional convolution modules. We run CondConv (2×) on ResNet50; the results reveal that it brings no further improvement compared with SE, although CondConv has a larger number of parameters. We run our method (1×), which adds only a limited number of parameters, and it shows a further improvement over SE. Moreover, ShuffleNetV2 [22] (2×) with our method performs better than ResNet50, with only 40% of the parameters and 14.6% of the FLOPs.

Table 5. Object detection results compared with the baseline backbones. We show RetinaNet [18] results on COCO.
Backbone                 | Params | FLOPs | mAP
ShuffleNetV2 [22] (0.5×) | 1.4M   | 41M   | 22.5
 + WeightNet (4×)        | 2.0M   | 41M   | –
ShuffleNetV2 [22] (1.0×) | 2.2M   | 138M  | 29.2
 + WeightNet (4×)        | 4.8M   | 141M  | –
ShuffleNetV2 [22] (1.5×) | 3.5M   | 299M  | 30.8
 + WeightNet (2×)        | 5.7M   | 303M  | –
ShuffleNetV2 [22] (2.0×) | 5.5M   | 557M  | 33.0
 + WeightNet (2×)        | 9.7M   | 565M  | –

Table 6. Object detection results compared with other conditional CNN backbones. We show RetinaNet [18] results on COCO.
Backbone                 | Params | FLOPs | mAP
ShuffleNetV2 [22] (0.5×) | 1.4M   | 41M   | 22.5
 + SE [11]               | 1.4M   | 41M   | 25.0
 + SK [16]               | 1.5M   | 42M   | 24.5
 + CondConv [38] (2×)    | 1.5M   | 41M   | 25.8
 + CondConv [38] (4×)    | 1.8M   | 41M   | 25.0
 + CondConv [38] (8×)    | 2.3M   | 42M   | 26.4
 + WeightNet (4×)        | 2.0M   | 41M   | –

We evaluate the performance of our method on the COCO detection task [19]. The COCO dataset has 80 object categories. We use the trainval35k set for training and the minival set for testing. For a fair comparison, we train all the models with the same settings. The batch size is set to 2, the weight decay to 1e-4 and the momentum to 0.9. We use anchors at 3 scales and 3 aspect ratios and a 600-pixel train and test image scale. We conduct experiments on RetinaNet [18] using ShuffleNetV2 [22] as the backbone feature extractor, and compare the backbone models of our method with the standard CNN models.

Table 5 illustrates the improvement of our method over standard convolution in the RetinaNet framework. For simplicity we set G = 1 and adjust the size of WeightNet to 4× or 2×. As we can see, our method improves the mAP significantly by adding a minority of parameters. ShuffleNetV2 (0.5×) with WeightNet (4×) improves the mAP by 4.6 while adding few parameters under the same FLOPs.

To compare the performance of WeightNet and CondConv [38] under the same number of parameters, we use ShuffleNetV2 0.5× as the backbone and investigate the performance of all CondConv sizes. Table 6 reveals the clear advantage of our method over CondConv: it outperforms CondConv uniformly under the same computational budgets. As a result, our method is indeed robust and fundamental across different tasks.

The influence of λ. By tuning λ, we control the number of parameters. We investigate the influence of λ on ImageNet Top-1 accuracy, conducting experiments on the ShuffleNetV2 structure. Table 7 shows the results. We find that the optimal λ for ShuffleNetV2 {0.5×, 1×, 1.5×, 2×} are {8, 4, 4, 4}, respectively.

Table 7. Ablation on λ. The table shows the ImageNet Top-1 err. results. The experiments are conducted on ShuffleNetV2 [22]. By increasing λ in the range {1, 2, 4, 8}, the FLOPs do not change and the number of parameters increases.

Model               | λ = 1 | λ = 2 | λ = 4 | λ = 8
ShuffleNetV2 (0.5×) | 36.7  | 35.5  | 34.4  | –
ShuffleNetV2 (1.0×) | 28.8  | 28.1  | –     | –
ShuffleNetV2 (1.5×) | 25.6  | 25.2  | –     | –
ShuffleNetV2 (2.0×) | 24.1  | 23.7  | –     | –

Table 8. Ablation on G. We tune the group hyperparameter G to {1, 2, 4} and keep λ = 1. The results are ImageNet Top-1 err. The experiments are conducted on ShuffleNetV2 [22].

Model               | G | Params | FLOPs | Top-1 err.
ShuffleNetV2 (0.5×) | 1 | 1.4M   | 41M   | 37.18
                    | 2 | 1.5M   | 41M   | 36.73
                    | 4 | 1.5M   | 41M   | –
ShuffleNetV2 (1.0×) | 1 | 2.3M   | 139M  | 29.09
                    | 2 | 2.4M   | 139M  | 28.77
                    | 4 | 2.6M   | 139M  | –
Table 9. Ablation study on different stages. A ✓ means the convolutions in that stage are integrated with our proposed WeightNet. Columns: Stage2 | Stage3 | Stage4 | Top-1 err.
Table 10.
Ablation study on the number of global average pooling operators in the whole network. We conduct ShuffleNetV2 0.5× experiments on the ImageNet dataset. We compare the cases of one global average pooling in 1) each stage, 2) each block, and 3) each layer. GAP represents global average pooling in this table.

               | Top-1 err.
Stage-wise GAP | 37.01
Block-wise GAP | 35.47
Layer-wise GAP | 35.04

The model capacity has an upper bound as we increase λ, and there exists a choice of λ that achieves the optimal performance.

The influence of G. To investigate the influence of G, we conduct experiments on ImageNet based on ShuffleNetV2. Table 8 illustrates the influence of G. We keep λ equal to 1 and change G to {1, 2, 4}. From the results we conclude that increasing G has a positive influence on the model capacity.

WeightNet on different stages.
As WeightNet makes the weights of each convolution layer change dynamically for distinct samples, we investigate the influence of each stage. We change the static convolutions of each stage to our method respectively, as Table 9 shows. From the results, we conclude that the last stage has a much larger influence than the other stages, and the performance is best when we change the convolutions in the last two stages to our method.
The number of the global average pooling operator.
Sharing the global average pooling (GAP) operator helps improve the speed of the conditional convolution network. We compare the following three usages of the GAP operator: one GAP for each layer, shared GAPs within a block, and shared GAPs within a stage. We conduct experiments on the ShuffleNetV2 [22] (0.5×) baseline with WeightNet (2×). Table 10 compares these three usages. The results indicate that increasing the number of GAPs improves the model capacity.
Fig. 5.
Analysis of the weights generated by our WeightNet. The figure illustrates the weights of 1,000 samples. The samples belong to 20 classes, which are represented by 20 different colors. Each point represents the weights of one sample.

Weight similarity among distinct samples.
We randomly select 20 classes from the ImageNet validation set, which has 1,000 classes in total. Each class has 50 samples, so there are 1,000 samples in total. We extract the weights of the last convolution layer in Stage 4 from a well-trained ShuffleNetV2 (2×) with our WeightNet. We project the weights of each sample from a high-dimensional space to a 2-dimensional space by t-SNE [23], as shown in Figure 5. We use 20 different colors to distinguish samples from the 20 distinct classes.

We observe two characteristics. First, different samples have distinct weights. Second, there are roughly 20 point clusters, and the weights of samples in the same class are closer to each other, which indicates that the weights of our method capture more class-specific information than a static convolution.
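The projection itself can be reproduced with a short sketch (ours, with stand-in data) along the following lines:

```python
# Sketch: t-SNE projection of per-sample generated weights, as in Fig. 5.
import numpy as np
from sklearn.manifold import TSNE

n_samples, weight_dim = 1000, 4096                   # 20 classes x 50 samples, flattened kernels
weights = np.random.randn(n_samples, weight_dim)     # stand-in for the extracted WeightNet kernels
labels = np.repeat(np.arange(20), 50)                # class id of each sample

embedding = TSNE(n_components=2).fit_transform(weights)  # (1000, 2) points
# Scatter embedding[:, 0] vs embedding[:, 1] with one color per label to reproduce Fig. 5.
```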
Channel similarity.

We conduct experiments to show the channel similarity of our method; we use the similarity between the different filters of a convolution weight to represent the channel similarity. Lower channel similarity would improve the representative ability of CNN models and the channel representative capacity. We find strong evidence that our method has lower channel similarity.

We analyze the kernel of the last convolution layer in the last stage of ShuffleNetV2 [22] (0.5×), where the channel number is 96; thus there are 96 filters in that convolution kernel. We compute the cosine similarities of the filters pair by pair, which comprise a 96 × 96 cosine similarity matrix.

Fig. 6. Cosine similarity matrix. A 96 × 96 matrix represents the 96 filters' pair-by-pair similarities; a smaller value (darker color) means lower similarity. (a) The standard convolution kernel's similarity matrix; (b-f) WeightNet kernels' similarity matrices. The colors in (b-f) are obviously much darker than in (a), meaning lower similarity.

In Figure 6, we compare the channel similarity of WeightNet and standard convolution. We first compute the cosine similarity matrix of a standard convolution kernel and display it in Figure 6-(a). Then, for our method, because different samples do not share the same kernel, we randomly choose 5 samples from distinct classes of the ImageNet validation set and show the corresponding similarity matrices in Figure 6-(b,c,d,e,f). The results clearly illustrate that our method has lower channel similarity.
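The similarity measure itself is simple to compute; the sketch below (our own, with a random stand-in kernel) produces the 96 × 96 matrix visualized in Fig. 6:

```python
# Sketch: pairwise cosine similarity of the filters of a convolution kernel.
import torch
import torch.nn.functional as F

def filter_cosine_similarity(kernel):
    """kernel: (C_out, C_in, kh, kw) -> (C_out, C_out) cosine similarity matrix."""
    flat = kernel.reshape(kernel.shape[0], -1)   # one row per filter
    flat = F.normalize(flat, dim=1)              # unit-length rows
    return flat @ flat.t()                       # pairwise cosine similarities

sim = filter_cosine_similarity(torch.randn(96, 96, 3, 3))
print(sim.shape)                                 # torch.Size([96, 96])
```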
This study connects two distinct but extremely effective methods, SENet and CondConv, in weight space, and unifies them into the same framework, which we call WeightNet. In the simple WeightNet comprised entirely of (grouped) fully-connected layers, the grouping manners of SENet and CondConv are two extreme cases, which brings in two hyperparameters, M and G, that had not been observed and investigated before. By simply adjusting them, we obtain a straightforward structure that achieves better trade-off results. More complex structures within the framework have the potential to further improve the performance, and we hope this simple framework in weight space helps ease future research; it should be a fruitful area for future work.

Acknowledgements
This work is supported by the National Key Research and Development Program of China (No. 2017YFA0700800) and the Beijing Academy of Artificial Intelligence (BAAI).
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
3. Chen, Z., Li, Y., Bengio, S., Si, S.: GaterNet: Dynamic filter selection in convolutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205 (2018)
4. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258 (2017)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
6. Ha, D., Dai, A., Le, Q.V.: HyperNetworks. arXiv preprint arXiv:1609.09106 (2016)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
8. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
9. Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1314–1324 (2019)
10. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
12. Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844 (2017)
13. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
14. Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: Advances in Neural Information Processing Systems. pp. 667–675 (2016)
15. Keskin, C., Izadi, S.: SplineNets: Continuous neural decision graphs. In: Advances in Neural Information Processing Systems. pp. 1994–2004 (2018)
16. Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 510–519 (2019)
17. Lin, J., Rao, Y., Lu, J., Zhou, J.: Runtime neural pruning. In: Advances in Neural Information Processing Systems. pp. 2181–2191 (2017)
18. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
19. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
20. Liu, L., Deng, J.: Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
21. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
22. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
23. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
24. Munkhdalai, T., Yu, H.: Meta networks. In: Proceedings of the 34th International Conference on Machine Learning. pp. 2554–2563. JMLR.org (2017)
25. Platanios, E.A., Sachan, M., Neubig, G., Mitchell, T.: Contextual parameter generation for universal neural machine translation. arXiv preprint arXiv:1808.08493 (2018)
26. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
27. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)
28. Schmidhuber, J.: Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4