Learning Shared Filter Bases for Efficient ConvNets
Daeyeon Kim
Dept. of Embedded Systems Engineering, Incheon National University, Incheon, South Korea 22012
[email protected]

Woochul Kang∗
Dept. of Embedded Systems Engineering, Incheon National University, Incheon, South Korea 22012
[email protected]

∗ Corresponding author
Abstract
Modern convolutional neural networks (ConvNets) achieve state-of-the-art performance on many computer vision tasks. However, such high performance requires millions of parameters and high computational costs. Recently, inspired by the iterative structure of modern ConvNets, such as ResNets, parameter sharing among repetitive convolution layers has been proposed to reduce parameter size. However, naive sharing of convolution filters poses many challenges, such as overfitting and vanishing/exploding gradients. Furthermore, parameter sharing often increases computational complexity due to additional operations. In this paper, we propose to exploit the linear structure of convolution filters for effective and efficient sharing of parameters among iterative convolution layers. Instead of sharing convolution filters themselves, we hypothesize that a filter basis of linearly decomposed convolution layers is a more effective unit for sharing parameters, since a filter basis is an intrinsic and reusable building block constituting diverse high-dimensional convolution filters. The representation power and peculiarity of individual convolution layers are further increased by adding a small number of layer-specific non-shared components to the filter basis. We show empirically that enforcing orthogonality on shared filter bases mitigates the difficulty of training shared parameters. Experimental results show that our approach achieves significant reductions in both model parameters and computational costs while maintaining competitive, and often better, performance than non-shared baseline networks.
1 Introduction

In the past few years, the accuracy of convolutional neural networks (ConvNets) has improved continuously [1][2][3][4]. However, the cost of these networks has also increased significantly, both in parameter size and in computational complexity. To address this problem, many model compression and acceleration approaches have been proposed [5][6][7]. Among them, low-rank approximation of convolution filters has been applied intensively to various networks [5][8][9][10][11][12]. In low-rank approximation approaches, the linear structure of filters is exploited to decompose trained filters into linear combinations of a filter basis. Such linearly decomposed ConvNets suggest efficient network structures in terms of both parameter size and computational complexity. Further, unlike other compression approaches such as pruning [6][7], low-rank approximation can be performed with minimal modification to the original network.

However, in previous low-rank approximation approaches, filter bases are used only to approximate already-trained convolution filters. Unlike previous studies, we propose to exploit only the linear structure of convolution filters, not the trained weights, for efficient and effective sharing of network parameters. In our work, each typical convolution layer of a network is replaced with decomposed layers that have a low-rank filter basis and coefficients for their linear combinations. Instead of keeping a different filter basis for each of these decomposed convolution layers, a single filter basis (or a small number of bases) is shared across the layers to save parameters. We hypothesize that these filter bases are more intrinsic and reusable building blocks constituting a subspace of high-dimensional convolution filters and, hence, are more appropriate for sharing across many convolution layers.

However, one major challenge is that repetitive use of shared filter bases might cause vanishing and exploding gradients, problems often found in recurrent neural networks (RNNs) [13][14]. Another challenge in sharing filter bases is that the filters constructed from linear combinations of a shared filter basis all lie in the same linear subspace. If all filters are in a single low-dimensional subspace, this can suppress the peculiarity of individual convolution layers and the overall representation power of the network.

To address these challenges, we make the following contributions. First, we propose a hybrid approach to sharing filter bases, in which a small number of layer-specific non-shared filter basis components are combined with shared filter basis components. With this hybrid scheme, the constructed filters can be positioned in different subspaces that reflect the peculiarity of individual convolution layers. We argue that this layer-specific variation increases the representation power of the network even while a large portion of parameters is shared. Our second contribution is a training method for shared filter bases. We show that a shared filter basis can cause vanishing and exploding gradients, and that this problem can be controlled to a large extent by enforcing orthogonality in the filter basis. To enforce the orthogonality of filter bases, we propose an orthogonality regularizer for training ConvNets with shared filter bases.

We validate our proposed solution empirically on image classification tasks with the CIFAR and ImageNet datasets.
Our experimental results demonstrate that our method can reduce a significant amount of parameters and computational costs while achieving competitive, and often better, performance compared to the original ConvNet models. For example, in heavily overparameterized networks on CIFAR, our method saves up to 63.8% of parameters and 33.4% of FLOPs while achieving lower test errors than much deeper ResNet models.

The remainder of this paper is organized as follows: In Section 2, we discuss related work. In Section 3, we review the linear structure of convolution filters and give the details of our filter basis sharing method. In Section 4, experiments on classification tasks are presented. Section 5 concludes the paper and discusses future work.

2 Related Work

Model compression and efficient convolution block design:
Reducing the storage and inference time of ConvNets has been an important research topic for both resource-constrained mobile/embedded systems and energy-hungry data centers. A number of techniques have been developed, such as knowledge distillation [15][16], filter pruning [17][18][7][19], low-rank factorization [5][8], quantization [20], and kernel clustering [21][22], to name a few. In particular, filter decomposition [5][8][9][10][11][12] approximates original filters with computationally efficient low-rank tensors. Though these compression techniques are effective in reducing resource usage, they have been suggested as post-processing steps applied to original networks after initial training. Often these compression steps are tricky and require long fine-tuning times [11][10]. Moreover, they often cannot recover the original models' accuracy, and they incur higher accuracy loss at higher compression ratios. By contrast, networks with our filter basis sharing method are end-to-end trainable without pretrained weights, and our method often outperforms the counterpart original models while achieving significant savings in both parameters and computational costs.

Some compact networks, such as SqueezeNet [23], ShuffleNet [24], and MobileNet [23][25], show that a delicately designed internal structure of convolution blocks achieves better performance with lower computational complexity. Our work also suggests an efficient block structure for convolution layers. However, unlike these works, our block design aims to expose more reusable and shareable building blocks, or filter bases, by decomposing the convolution layers of widely used networks.
Recursive networks and parameter sharing:
Recurrent neural networks (RNNs) [26] have been well studied for temporal and sequential data. As a generalization of RNNs, recursive variants of ConvNets are used extensively for visual tasks [27][28][29][30].
Figure 1: Illustration of the proposed filter basis sharing method: (a) full convolution; (b) decomposed convolution. Unlike the normal convolution in (a), our method in (b) replaces the original layer (given by $W$) with two layers (given by the filter basis $W_{basis}$ and coefficients $\alpha$). While most components of $W_{basis}$ are shared across many convolution layers, some are not shared and are unique to each layer, allowing layer-specific peculiarity and more representation power. Intermediate basis feature maps (of $R$ channels) generated with $W_{basis}$ are linearly combined with the coefficients $\alpha$.

For instance, Eigen et al. [31] explore recursive convolutional architectures that share filters across multiple convolution layers. They show that recurrence with deeper layers tends to increase performance. However, their recursive architecture performs worse than independent convolution layers due to overfitting. In most previous works, the filters themselves are shared across layers. In contrast, we propose to share filter bases, which are more fundamental and reusable building blocks from which layer-specific filters are constructed. More recently, Jastrzębski et al. [14] show that the iterative refinement of features in ResNets suggests that deep networks can potentially leverage intensive parameter sharing. Guo et al. [32] introduce a gate unit that determines whether to jump out of the recursive loop of convolution blocks to save computational resources. These works show that training recursive networks with naively shared blocks leads to poor performance due to exploding and vanishing gradients, as in RNNs [13][33]. To mitigate this problem, they suggest an unshared batch normalization strategy. In our work, we propose an orthogonality regularization of shared filter bases to further address this problem.

Savarese et al.'s work [34] is most similar to ours. In their parameter sharing scheme, different layers of ConvNets are defined by linear combinations of parameter tensors from a global bank of templates. Though similar, our method proposes filter bases as more fine-grained and reusable units for parameter sharing, and it allows combining non-shared layer-specific components into filter bases to express the peculiarity of each convolution layer. Our results show that these layer-specific non-shared components are critical to achieving high performance. Further, our filter basis sharing method achieves significant savings not just in parameters, but also in computational costs.

3 Proposed Method

In this section, we discuss how to share the parameters of ConvNets effectively by decomposing typical convolution layers into more reusable units, or filter bases. We also discuss how to train shared filter bases effectively.
Consider a convolution layer with $S$ input channels, $T$ output channels, and a set of filters $W = \{ W_t \in \mathbb{R}^{k \times k \times S},\ t \in [1..T] \}$. Each filter $W_t$ can be decomposed using a lower-rank filter basis $W_{basis}$ and coefficients $\alpha$:

$$W_t = \sum_{r=1}^{R} \alpha_{rt} W^{r}_{basis}, \qquad (1)$$

where $W_{basis} = \{ W^{r}_{basis} \in \mathbb{R}^{k \times k \times S},\ r \in [1..R] \}$ is a filter basis and $\alpha = \{ \alpha_{rt} \in \mathbb{R},\ r \in [1..R],\ t \in [1..T] \}$ is a set of scalar coefficients. In Equation 1, $R$ is the rank of the basis. In a typical convolution layer, the output feature maps $V_t,\ t \in [1..T]$, with $V \in \mathbb{R}^{w \times h \times T}$, are obtained by convolution between the input feature maps $U \in \mathbb{R}^{w \times h \times S}$ and the filters $W_t$. With Equation 1, this convolution can be rewritten as:

$$V_t = U * W_t = U * \sum_{r=1}^{R} \alpha_{rt} W^{r}_{basis} \qquad (2)$$
$$= \sum_{r=1}^{R} \alpha_{rt} \left( U * W^{r}_{basis} \right), \quad t \in [1..T]. \qquad (3)$$

In Equation 3, the order of the convolution operation and the linear combination of the filter basis is exchanged, using the linearity of the convolution operator. This shows that a standard convolution layer can be replaced with two successive convolution layers, as shown in Figure 1(b). The first decomposed convolution layer performs $R$ convolutions between $W^{r}_{basis},\ r \in [1..R]$, and the input feature maps $U$, generating intermediate basis feature maps $V_{basis} \in \mathbb{R}^{w \times h \times R}$. The second decomposed convolution layer performs point-wise $1 \times 1$ convolutions that linearly combine the $R$ intermediate feature maps $V_{basis}$ to generate the output feature maps $V$. In previous works, the primary goal of such decomposition is to reduce computational complexity [11][12]. For example, the computational complexity of the original convolution is $O(whk^2ST)$, while the decomposed operation takes $O(wh(k^2SR + RT))$. As long as $R < T$, the decomposed convolution has lower computational complexity than the original convolution.
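To make the two-layer form of Equation 3 concrete, the following is a minimal PyTorch sketch; the class name and defaults are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecomposedConv2d(nn.Module):
    """Two-layer replacement for a k x k convolution, following Equation 3."""

    def __init__(self, in_channels, out_channels, rank,
                 kernel_size=3, stride=1, padding=1):
        super().__init__()
        # First layer: the filter basis W_basis, i.e., R filters of size k x k x S.
        self.basis = nn.Conv2d(in_channels, rank, kernel_size,
                               stride=stride, padding=padding, bias=False)
        # Second layer: point-wise 1 x 1 convolution holding the coefficients
        # alpha_rt that linearly combine the R basis feature maps.
        self.coeffs = nn.Conv2d(rank, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        # Equation 3: V_t = sum_r alpha_rt (U * W^r_basis)
        return self.coeffs(self.basis(x))

# Cost check for k = 3, S = T = 256, R = 64: the original layer holds
# k^2*S*T = 589,824 weights; the decomposed pair holds k^2*S*R + R*T = 163,840,
# mirroring O(wh k^2 ST) vs. O(wh(k^2 SR + RT)).
y = DecomposedConv2d(256, 256, rank=64)(torch.randn(1, 256, 32, 32))
assert y.shape == (1, 256, 32, 32)
```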
In typical ConvNets, convolution layers have different filters $W$ and, hence, each decomposed convolution layer has its own filter basis $W_{basis}$ and coefficients $\alpha$. In contrast, our primary goal in decomposing convolution layers is to share a single filter basis (or a small number of filter bases) across many convolution layers. Some previous works, such as [14][35], propose to share the convolution filters $W$ themselves across multiple layers, so the peculiarity of individual convolution layers is expressed only through layer-specific non-shared batch normalization. In contrast, we argue that a filter basis $W_{basis}$ is a more intrinsic and reusable building block that can be shared effectively, since a filter basis constitutes a subspace in which the high-dimensional filters of many convolution layers can be approximated.

Though the components of a basis only need to be independent and span a vector subspace, some specific bases are more convenient and appropriate for specific purposes. For the purpose of sharing a filter basis, we need to find an optimal filter basis $W_{basis}$ that expedites the training of the filters of shared convolution layers. Although this optimization can be done with typical stochastic gradient descent (SGD), one problem is that exploding/vanishing gradients might prevent efficient search of the optimization space. More formally, consider a series of $N$ decomposed convolution layers in which a filter basis $W_{basis}$ is shared $N$ times. Let $x_i$ be the input of the $i$-th convolution layer, and let $a_{i+1}$ be the output of the convolution of $x_i$ with the filter basis $W_{basis}$:

$$a_{i+1}(x_i) = W_{basis}^{\top} x_i. \qquad (4)$$

In Equation 4, $W_{basis} \in \mathbb{R}^{k^2 S \times R}$ is a reshaped filter basis with the basis components as its columns. We assume that the input $x$ is properly adapted (e.g., with im2col) so that convolutions are expressed as matrix-matrix multiplications. Since $W_{basis}$ is shared across $N$ convolution layers, the gradient of $W_{basis}$ for a loss function $L$ is:

$$\frac{\partial L}{\partial W_{basis}} = \sum_{i=1}^{N} \frac{\partial L}{\partial a_N} \prod_{j=i}^{N-1} \left( \frac{\partial a_{j+1}}{\partial a_j} \right) \frac{\partial a_i}{\partial W_{basis}}, \qquad (5)$$

where

$$\frac{\partial a_{j+1}}{\partial a_j} = \frac{\partial a_{j+1}}{\partial x_j} \frac{\partial x_j}{\partial a_j} = W_{basis}^{\top} \frac{\partial x_j}{\partial a_j}. \qquad (6)$$

If we plug $W_{basis}^{\top} \frac{\partial x_j}{\partial a_j}$ from Equation 6 into Equation 5, we see that $\prod_j \frac{\partial a_{j+1}}{\partial a_j}$ is the term that makes gradients unstable, since $W_{basis}$ is multiplied many times. These exploding/vanishing gradients can be controlled to a large extent by keeping $W_{basis}$ close to orthogonal [33]. For instance, if $W_{basis}$ admits an eigendecomposition, $[W_{basis}]^N$ can be rewritten as:

$$[W_{basis}]^N = [Q \Lambda Q^{-1}]^N = Q \Lambda^N Q^{-1}, \qquad (7)$$

where $\Lambda$ is a diagonal matrix with the eigenvalues on its diagonal and $Q$ is the matrix of corresponding eigenvectors. If $W_{basis}$ is orthogonal, $[W_{basis}]^N$ neither explodes nor vanishes, since all eigenvalues of an orthogonal matrix have absolute value 1. Similarly, an orthogonal shared basis ensures that forward signals neither explode nor vanish. We also need to ensure that the norm of $\frac{\partial x_j}{\partial a_j}$ in Equation 5 is bounded [13] for stability during the forward and backward passes. It has been shown that batch normalization after the non-linear activation at each convolution layer ensures healthy norms [36][14][32].

For training networks, the orthogonality of shared bases can be enforced with an orthogonality regularizer.
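The sharing scheme just described might look as follows in PyTorch: a single basis convolution is reused by every layer in a group, while coefficients and batch normalization stay layer-specific. This is a sketch under assumed names, with residual connections omitted.

```python
import torch
import torch.nn as nn

class SharedBasisGroup(nn.Module):
    """N decomposed layers that reuse one filter basis, each with its own
    coefficients and (non-shared) batch normalization."""

    def __init__(self, channels, rank, num_layers):
        super().__init__()
        # A single basis convolution: its weights receive the summed gradients
        # of Equation 5, one contribution per layer that reuses it.
        self.shared_basis = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.coeffs = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_layers)])
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(channels) for _ in range(num_layers)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        for coeff, bn in zip(self.coeffs, self.bns):
            # Same basis at every iteration, layer-specific mixing on top.
            x = self.relu(bn(coeff(self.shared_basis(x))))
        return x
```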
For instance, when each residual block group of a ResNet shares a filter basis for its convolution layers, the objective function $L_R$ can be defined to add an orthogonality regularizer to the original loss $L$:

$$L_R = L + \lambda \sum_{g=1}^{G} \left\lVert W^{(g)\top}_{basis} \cdot W^{(g)}_{basis} - I \right\rVert, \qquad (8)$$

where $W^{(g)}_{basis}$ is the shared filter basis of the $g$-th residual block group and $\lambda$ is a hyperparameter.

In our filter basis sharing approach, the filters of many convolution layers are constructed by linear combination of a shared filter basis, as in Equation 1. This implies that those high-dimensional filters all lie in the same low-dimensional subspace. If the rank of a filter basis is too low, it is very challenging to find a subspace that can express the individual peculiarity of many layers' filters. Conversely, if the rank of a shared filter basis is too high (e.g., $R \geq T$), the gain in computational complexity from decomposing filters is lost. One way to increase the representational power of each convolution layer, while keeping its computational complexity low, is to add a small number of layer-specific components to the filter basis. We therefore build a filter basis $W_{basis}$ using not only shared components but also non-shared components:

$$W_{basis} = W_{bs\_shared} \cup W_{bs\_unique}, \qquad (9)$$

where $W_{bs\_shared} = \{ W^{r}_{bs\_shared} \in \mathbb{R}^{k \times k \times S},\ r \in [1..n] \}$ are shared filter basis components and $W_{bs\_unique} = \{ W^{r}_{bs\_unique} \in \mathbb{R}^{k \times k \times S},\ r \in [n+1..R] \}$ are per-layer non-shared filter basis components. With this hybrid scheme, the filters of different convolution layers are placed in different layer-specific subspaces. One disadvantage of this hybrid scheme is that non-shared filter basis components require more parameters. The ratio of non-shared basis components can be varied to control this tradeoff, but our results in Section 4 show that only a few per-layer non-shared basis components are enough to achieve high performance.

4 Experiments

In this section, we perform a series of experiments on image classification tasks. Using ResNets [3] as base models, we train networks with the proposed method and compare them with the baseline networks. We also analyze the effect of the orthogonality regularizer and the hybrid scheme.
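Before presenting results, the following sketch illustrates how the hybrid basis of Equation 9 and the regularized loss of Equation 8 might be implemented. Function names and the default value of lambda are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def hybrid_basis_conv(x, shared_basis, unique_basis, coeffs):
    """Equation 9: W_basis = W_bs_shared U W_bs_unique. Both basis convolutions
    run on x; the 1 x 1 coefficients mix all n + (R - n) basis feature maps."""
    v = torch.cat([shared_basis(x), unique_basis(x)], dim=1)
    return coeffs(v)

def orthogonality_penalty(shared_basis):
    """||W^T W - I|| for a shared basis conv, with its weight tensor of shape
    (R, S, k, k) reshaped to the (k^2 S) x R matrix used in Equation 4."""
    w = shared_basis.weight.flatten(1).t()     # (k*k*S) x R
    gram = w.t() @ w                           # R x R
    eye = torch.eye(gram.size(0), device=gram.device)
    return torch.norm(gram - eye)

def regularized_loss(logits, targets, shared_bases, lam=1e-3):
    """Equation 8: task loss plus the penalty summed over the shared basis of
    every residual block group g (the value of lam here is an assumed default)."""
    return F.cross_entropy(logits, targets) + lam * sum(
        orthogonality_penalty(b) for b in shared_bases)
```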
Throughout the experiments, we use ResNets [3] as base networks, replacing their $3 \times 3$ convolution layers with decomposed convolution layers that share filter bases. Since the residual block groups of a ResNet have different numbers of channels and kernel sizes, our networks share a filter basis only within the same group. In each group with $n$ residual blocks, the first block has a different stride and, hence, does not share a filter basis. Each residual block of the baseline ResNets has two $3 \times 3$ convolution layers, so each group of our networks has $2(n-1)$ decomposed convolution layers sharing a filter basis. Throughout the experiments, we denote by ResNet$L$-S$s$U$u$ a ResNet with $L$ layers whose filter basis has $s$ shared components and $u$ layer-specific non-shared components in the first residual block group. This ratio between $s$ and $u$ is maintained for all residual block groups. However, since each group of a ResNet has $2\times$ more filters than its prior group, each group of our networks has a $2\times$ higher filter basis rank than its prior group. For instance, the second residual block group of ResNet34-S16U1 has $2 \times 16 = 32$ shared components of the filter basis, and each convolution layer in that group has $2 \times 1 = 2$ layer-specific components. Hence, each decomposed convolution layer in the group uses 34 (32 shared and 2 non-shared) filter basis components.

The CIFAR-10 and CIFAR-100 datasets contain 50,000 and 10,000 three-channel $32 \times 32$ images for training and testing, respectively. For training, we follow a training scheme similar to [3]. Standard data augmentation and normalization are applied to the input data. Networks are trained for 300 epochs with an SGD optimizer, a weight decay of 5e-4, and a momentum of 0.9. The learning rate is initialized to 0.1 and is decayed by a factor of 10 at 50% and 75% of the epochs.

Table 1 shows the results on CIFAR-100. Networks trained with the proposed method clearly outperform their ResNet counterparts in every aspect. For instance, ResNet34-S32U1 requires only 36.2% of the parameters and 66.6% of the FLOPs of the counterpart ResNet34. Furthermore, ResNet34-S32U1 achieves an even lower test error (21.79%) than the much deeper ResNet50 (22.36%). To show the generality of our approach, we apply the proposed method to DenseNet [37] and ResNeXt [38] as well. Although the overall gain is not as great as for ResNets, we still observe reduced resource usage in both networks. For instance, ResNeXt50-S64U4 outperforms the counterpart ResNeXt50 while saving 16.7% of parameters and 12.1% of FLOPs, respectively. In ResNeXt, the gain is limited because it mainly exploits group convolutions; each group convolution is decomposed for filter basis sharing in our network. Similarly, in DenseNet, each $3 \times 3$ convolution layer has a relatively small number of output channels and, hence, the overall gain is not as pronounced as for ResNet.

Table 1: Error (%) on CIFAR-100. '†' denotes that orthogonality regularization is not applied.

Model                   Params   FLOPs    Error
ResNet18                11.22M   1.11G    23.25
ResNet34                21.33M   2.33G    22.49
ResNet50                23.71M   2.61G    22.36
DenseNet121              7.05M   1.81G    21.95
ResNeXt50               23.17M   2.71G    20.71
ResNet34-S8U1            5.87M   0.79G    23.11
ResNet34-S16U0           6.20M   1.02G    23.43
ResNet34-S16U1           6.49M   1.05G    22.64
ResNet34-S32U0           7.44M   1.52G    22.32
ResNet34-S32U1†              –       –        –
DenseNet121-S16U1            –       –        –
Parameter Sharing [34]  12M      10.49G   19.13
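For reference, the group-wise rank-doubling rule described in the setup above can be written out as a small helper; the function name and the group count are hypothetical, shown only to make the ResNet$L$-S$s$U$u$ notation concrete.

```python
def basis_sizes(s, u, num_groups=4):
    """(shared, unique) basis component counts per residual block group:
    group g doubles the counts of its prior group."""
    return [(s * 2 ** g, u * 2 ** g) for g in range(num_groups)]

# ResNet34-S16U1: the second group (g = 1) gets 2*16 = 32 shared and
# 2*1 = 2 layer-specific components, i.e., 34 per decomposed layer.
print(basis_sizes(16, 1))   # [(16, 1), (32, 2), (64, 4), (128, 8)]
```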
Table 2: Error (%) on CIFAR-10. '⋆' denotes having 2 shared bases in each group. '†' denotes that orthogonality regularization is not applied.

Model                 Params   FLOPs   Error
ResNet32 [3]          0.46M    0.14G   7.51
ResNet56 [3]          0.85M    0.25G   6.97
ResNet110 [3]         1.73M    0.51G   6.43
ResNet32-S8U1         0.15M    0.10G   8.08
ResNet32-S16U1        0.20M    0.16G   7.43
ResNet32-S16U1⋆           –        –      –
ResNet56-S8U1         0.20M    0.17G   7.52
ResNet56-S16U0        0.22M    0.28G   7.84
ResNet56-S16U1†           –        –      –
ResNet56-S16U1⋆           –        –      –
Filter Pruning [7]    0.77M    0.18G   6.94
Basis Learning [12]   0.20M    0.46G   6.60
The results on CIFAR-10 are presented in Table 2. Unlike the networks on CIFAR-100, networks on CIFAR-10 have far fewer channels (e.g., 16 channels in the first residual block group) and, hence, projecting filters onto such a low-dimensional subspace can limit network performance. For instance, in ResNet32-S8U1, filters are projected onto a 9-dimensional subspace consisting of 8 shared and 1 layer-specific filter basis components. Consequently, ResNet32-S8U1 yields a higher test error (8.08%) than its counterpart ResNet32 (7.51%). By increasing the rank of the filter bases, better accuracy can be achieved at the cost of increased FLOPs. For instance, ResNet32-S16U1's test error (7.43%) is lower than ResNet32's (7.51%), but ResNet32-S16U1 requires 14.2% more FLOPs than ResNet32. For deeper networks such as ResNet56, a filter basis must be shared by many residual blocks in a group, which can hurt performance. For example, every filter basis in ResNet56-S16U1 is shared by 8 two-layer residual blocks, or 16 convolution layers.
Figure 2: Test errors vs. the number of parameters and FLOPs on CIFAR-100, as the number of shared basis components ($s$) and non-shared basis components ($u$) is varied. Using more shared basis components results in better performance. In contrast, using more non-shared components does not always improve performance.

Due to this excessive sharing, though ResNet56-S16U1 saves 41.3% of parameters, its test error (7.46%) is higher than the counterpart ResNet56's (6.97%). To remedy this problem, we introduce a variant in which each residual block group of the network uses 2 shared bases; one basis is shared by the first convolution layers of all residual blocks, and the other is shared by the second convolution layers of the same blocks. In Table 2, networks with a '⋆' mark denote this variant. Though this variant slightly increases the parameters of the network, it prevents excessive sharing of parameters. For example, although ResNet56-S16U1⋆ needs 0.04M more parameters for the additional shared bases, it still saves 63% of the parameters of the counterpart ResNet56 and achieves a significantly lower test error, 6.30%. It should be noted that this test error of ResNet56-S16U1⋆ is even lower than that of the much deeper ResNet110 (6.43%). These results demonstrate that the original networks on CIFAR are heavily overparameterized and that our method is highly effective in compressing such overparameterized networks.
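A minimal sketch of this 2-basis '⋆' variant follows; the class name and wiring are assumptions, with batch normalization and downsampling omitted for brevity.

```python
import torch.nn as nn

class TwoBasisGroup(nn.Module):
    """Each group keeps two shared bases: one reused by the first convolution
    of every residual block, the other by the second convolution."""

    def __init__(self, channels, rank, num_blocks):
        super().__init__()
        self.basis1 = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.basis2 = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.coeffs1 = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_blocks)])
        self.coeffs2 = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_blocks)])

    def forward(self, x):
        for c1, c2 in zip(self.coeffs1, self.coeffs2):
            out = c1(self.basis1(x)).relu()   # first conv of the block
            out = c2(self.basis2(out))        # second conv of the block
            x = (x + out).relu()              # simplified residual add
        return x
```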
Figure 3: Cosine similarities of the bases and coefficients of ResNet34-S16U1 (2nd and 3rd groups), (a) without and (b) with orthogonality regularization. In the upper row, the X and Y axes index each group's vectorized filter basis components (both shared and non-shared). The lower row shows the corresponding coefficients in the groups. Brighter colors correspond to higher similarity.

Figure 2 shows test errors as parameters and FLOPs are increased by varying the number of shared/non-shared basis components. In general, higher performance is expected with more parameters. We observe that this presumption holds for shared basis components: when the number of shared basis components $s$ is varied from 8 to 32, the test error drops sharply from 23.1% to 21.7%. However, non-shared basis components show counter-intuitive results. Although a small number of non-shared basis components (e.g., $u = 1$) is clearly beneficial to performance, higher values of $u$ do not always lead to higher performance. For instance, when $u = 4$, both ResNet34-S16U$u$ and ResNet34-S32U$u$ show the worst performance. This result demonstrates the difficulty of training networks with more parameters; further study is required on this problem.

To analyze the effect of the orthogonality regularizer, Figure 3 illustrates the absolute cosine similarities of all filter basis components and coefficients of the 2nd and 3rd residual block groups of ResNet34-S16U1. In the upper row, the X and Y axes index the shared basis components first and the layer-specific non-shared basis components next. As expected, the shared filter basis components have almost zero cosine similarity when the orthogonality regularization of Equation 8 is applied. The bottom row shows the absolute cosine similarities of the coefficients of the corresponding groups. We can clearly see that the coefficients exhibit lower similarities when the orthogonality regularizer is applied. Without the orthogonality regularization, interesting grid patterns are observed in the coefficients. This repetitive grid pattern might be related to ResNets' nature of iterative refinement [14]. However, since these bases and coefficients are used to build layer-specific filters, we conjecture that such high cosine similarities imply higher redundancy in the networks. With the orthogonality regularization, such repetitive patterns are less evident, implying less redundancy.

We evaluate our method on the ILSVRC2012 dataset [39], which has 1000 classes. The dataset consists of 1.28M training and 50K validation images. For training, we follow the ImageNet training procedure of [40]. The networks are trained for 120 epochs with an SGD optimizer and a mini-batch size of 256. We use a weight decay of 1e-4 and a momentum of 0.9. The learning rate starts at 0.1 and decays by a factor of 10 at the 45th, 75th, and 110th epochs. For data augmentation, we apply random 224 × 224 crops and horizontal flips. For this experiment, we use ResNet34 as the base model and replace its 3 × 3 convolution layers with decomposed convolution layers sharing filter bases.

Table 3: Error (%) on ImageNet. '⋆' denotes having 2 shared bases in each residual block group.

Model              Params   FLOPs   top-1   top-5
ResNet18           11.69M   3.64G   30.24   10.92
ResNet34           21.80M   7.34G   26.70    8.58
ResNet34-S16U1      6.96M   3.44G   30.04   10.50
ResNet34-S32U1      8.20M   4.98G   28.42    9.55
ResNet34-S32U1⋆         –       –   27.69    9.11
Table 3 shows the results. Although our networks do not outperform the counterpart ResNet34 in accuracy, they show competitive performance while using fewer computational resources. For example, ResNet34-S32U1⋆ achieves 27.69%/9.11% top-1/top-5 validation errors using only 44.7% of the parameters and 67.8% of the FLOPs of ResNet34. Though we expect better performance with higher-rank filter bases, we leave further investigation of this tradeoff to future work.
5 Conclusion

In this work, we propose sharing the filter bases of decomposed convolution layers for efficient and effective parameter sharing in ConvNets. The usual gradient exploding/vanishing problem of recursive networks is addressed by the proposed orthogonality regularizer. Further, we increase the representation power of each convolution layer by combining a small number of non-shared components into the filter bases. Experiments on CIFAR and ImageNet show that our method reduces parameters and computational costs substantially while achieving competitive performance. In particular, on heavily overparameterized networks on CIFAR, our method outperforms much deeper counterpart original networks. The proposed filter basis sharing method could be further extended to other kinds of common convolutions, such as 1 × 1 and depthwise convolutions.

References

[1] Krizhevsky, A., I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[2] Szegedy, C., W. Liu, Y. Jia, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[3] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[4] Hu, J., L. Shen, G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[5] Denton, E. L., W. Zaremba, J. Bruna, et al. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[6] Han, S., J. Pool, J. Tran, et al. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[7] Li, H., A. Kadav, I. Durdanovic, et al. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
[8] Jaderberg, M., A. Vedaldi, A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[9] Lebedev, V., Y. Ganin, M. Rakhuba, et al. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations, 2015.
[10] Kim, Y.-D., E. Park, S. Yoo, et al. Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations (ICLR), 2016.
[11] Zhang, X., J. Zou, K. He, et al. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.
[12] Li, Y., S. Gu, L. Van Gool, et al. Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[13] Pascanu, R., T. Mikolov, Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[14] Jastrzębski, S., D. Arpit, N. Ballas, et al. Residual connections encourage iterative inference. In International Conference on Learning Representations, 2018.
[15] Hinton, G., O. Vinyals, J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Chen, G., W. Choi, X. Yu, et al. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751, 2017.
[17] LeCun, Y., J. S. Denker, S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[18] Polyak, A., L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[19] He, Y., X. Zhang, J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
[20] Han, S., H. Mao, W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2016.
[21] Son, S., S. Nah, K. Mu Lee. Clustering convolutional kernels to compress deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 216–232, 2018.
[22] Li, Y., S. Lin, B. Zhang, et al. Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2019.
[23] Howard, A. G., M. Zhu, B. Chen, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[24] Zhang, X., X. Zhou, M. Lin, et al. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[25] Sandler, M., A. Howard, M. Zhu, et al. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[26] Graves, A., A. Mohamed, G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.
[27] Socher, R., C. C. Lin, C. Manning, et al. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[28] Liang, M., X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
[29] Xingjian, S., Z. Chen, H. Wang, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[30] Zamir, A. R., T.-L. Wu, L. Sun, et al. Feedback networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817. IEEE, 2017.
[31] Eigen, D., J. Rolfe, R. Fergus, et al. Understanding deep architectures using a recursive convolutional network. In International Conference on Learning Representations (ICLR), 2014.
[32] Guo, Q., Z. Yu, Y. Wu, et al. Dynamic recursive neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5142–5151, 2019.
[33] Vorontsov, E., C. Trabelsi, S. Kadoury, et al. On orthogonality and learning recurrent networks with long term dependencies. In Proceedings of the 34th International Conference on Machine Learning, pages 3570–3578, 2017.
[34] Savarese, P., M. Maire. Learning implicitly recurrent CNNs through parameter sharing. In International Conference on Learning Representations, 2019.
[35] Köpüklü, O., M. Babaee, S. Hörmann, et al. Convolutional neural networks with layer reuse. In IEEE International Conference on Image Processing (ICIP), pages 345–349. IEEE, 2019.
[36] Ioffe, S., C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[37] Huang, G., Z. Liu, L. Van Der Maaten, et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[38] Xie, S., R. Girshick, P. Dollár, et al. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[39] Russakovsky, O., J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.