Learning Shared Filter Bases for Efficient ConvNets
Daeyeon Kim
Dept. of Embedded Systems Engineering, Incheon National University, Incheon, South Korea 22012
[email protected]

Woochul Kang∗
Dept. of Embedded Systems Engineering, Incheon National University, Incheon, South Korea 22012
[email protected]

∗ Corresponding author
Abstract
Modern convolutional neural networks (ConvNets) achieve state-of-the-art performance on many computer vision tasks. However, such high performance requires millions of parameters and high computational costs. Recently, inspired by the iterative structure of modern ConvNets, such as ResNets, parameter sharing among repetitive convolution layers has been proposed to reduce parameter size. However, naive sharing of convolution filters poses many challenges, such as overfitting and vanishing/exploding gradients. Furthermore, parameter sharing often increases computational complexity due to additional operations. In this paper, we propose to exploit the linear structure of convolution filters for effective and efficient sharing of parameters among iterative convolution layers. Instead of sharing convolution filters themselves, we hypothesize that a filter basis of linearly decomposed convolution layers is a more effective unit for sharing parameters, since a filter basis is an intrinsic and reusable building block constituting diverse high-dimensional convolution filters. The representation power and peculiarity of individual convolution layers are further increased by adding a small number of layer-specific non-shared components to the filter basis. We show empirically that enforcing orthogonality on shared filter bases mitigates the difficulty of training shared parameters. Experimental results show that our approach achieves significant reductions in both model parameters and computational costs while maintaining competitive, and often better, performance than non-shared baseline networks.
1 Introduction

In the past few years, the accuracy of convolutional neural networks (ConvNets) has improved continuously [1][2][3][4]. However, the cost of these networks has also increased significantly, both in parameter size and in computational complexity. To address this problem, many model compression and acceleration approaches have been proposed [5][6][7]. Among them, low-rank approximation of convolution filters has been applied intensively to various networks [5][8][9][10][11][12]. In low-rank approximation approaches, the linear structure of filters is exploited to decompose trained filters into linear combinations of a filter basis. Such linearly decomposed ConvNets suggest efficient network structures in terms of both parameter size and computational complexity. Further, unlike other compression approaches such as pruning [6][7], low-rank approximation can be performed with minimal modification to the original network.

However, in previous low-rank approximation approaches, filter bases are used only to approximate already-trained convolution filters. Unlike previous studies, we propose to exploit only the linear structure of convolution filters, not the trained weights, for efficient and effective sharing of network parameters. In our work, each typical convolution layer of a network is replaced with decomposed layers that have a low-rank filter basis and coefficients for their linear combinations. Instead of keeping a different filter basis for each of these decomposed convolution layers, a single filter basis (or a small number of bases) is shared across the layers to save parameters. We hypothesize that these filter bases are more intrinsic and reusable building blocks constituting a subspace of high-dimensional convolution filters and, hence, are more appropriate for sharing across many convolution layers.

However, one major challenge is that repetitive use of shared filter bases might cause vanishing and exploding gradients, problems often found in recurrent neural networks (RNNs) [13][14]. Another challenge in sharing filter bases is that the filters constructed from linear combinations of a shared filter basis all lie in the same linear subspace. If all filters are in a single low-dimensional subspace, this can suppress the peculiarity of individual convolution layers and the overall representation power of the network.

To address these challenges, we make the following contributions. First, we propose a hybrid approach to sharing filter bases, in which a small number of layer-specific non-shared filter basis components are combined with shared filter basis components. With this hybrid scheme, the constructed filters can be positioned in different subspaces that reflect the peculiarity of individual convolution layers. We argue that this layer-specific variation increases the representation power of the network even while a large portion of parameters is shared. Our second contribution is a training method for shared filter bases. We show that a shared filter basis can cause vanishing and exploding gradients, and that this problem can be controlled to a large extent by enforcing orthogonality in the filter basis. To enforce the orthogonality of filter bases, we propose an orthogonality regularizer for training ConvNets with shared filter bases.

We validate our proposed solution empirically on image classification tasks with the CIFAR and ImageNet datasets.
Our experimental results demonstrate that our method can reduce a significant amount of parameters and computational costs while achieving competitive, and often better, performance compared to the original ConvNet models. For example, in heavily overparameterized networks on CIFAR, our method saves up to 63.8% of parameters and 33.4% of FLOPs while achieving lower test errors than much deeper ResNet models.

The remainder of this paper is organized as follows: In Section 2, we discuss related work. In Section 3, we review the linear structure of convolution filters and give the details of our filter basis sharing method. In Section 4, experiments on classification tasks are presented. Section 5 concludes the paper and discusses future work.

2 Related Work

Model compression and efficient convolution block design:
Reducing the storage and inference time of ConvNets has been an important research topic for both resource-constrained mobile/embedded systems and energy-hungry data centers. A number of techniques have been developed, such as knowledge distillation [15][16], filter pruning [17][18][7][19], low-rank factorization [5][8], quantization [20], and kernel clustering [21][22], to name a few. In particular, filter decomposition [5][8][9][10][11][12] approximates original filters with computationally efficient low-rank tensors. Though these compression techniques are effective in reducing resource usage, they have been suggested as post-processing steps applied to original networks after initial training. Often these compression steps are tricky and require long fine-tuning times [11][10]. Moreover, they often cannot recover the original models' accuracy, and they incur higher accuracy loss at higher compression ratios. By contrast, networks with our filter basis sharing method are end-to-end trainable without pretrained weights, and our method often outperforms the counterpart original models while achieving significant savings in both parameters and computational costs.

Some compact networks, such as SqueezeNet [23], ShuffleNet [24], and MobileNet [23][25], show that a delicately designed internal structure of convolution blocks achieves better performance with lower computational complexity. Our work also suggests an efficient block structure for convolution layers. However, unlike these works, our block design aims to expose more reusable and shareable building blocks, or filter bases, by decomposing the convolution layers of widely used networks.
Recursive networks and parameter sharing:
Recurrent neural networks (RNNs) [26] have been well studied for temporal and sequential data. As a generalization of RNNs, recursive variants of ConvNets are used extensively for visual tasks [27][28][29][30].
Figure 1: Illustration of the proposed filter basis sharing method: (a) full convolution; (b) decomposed convolution. Unlike the normal convolution in (a), our method in (b) replaces the original layer (given by $W$) with two layers (given by the filter basis $W_{basis}$ and coefficients $\alpha$). While most components of $W_{basis}$ are shared across many convolution layers, some are not shared and are unique to each layer, allowing layer-specific peculiarity and more representation power. Intermediate basis feature maps (of $R$ channels) generated with $W_{basis}$ are linearly combined with the coefficients $\alpha$.

For instance, Eigen et al. [31] explore recursive convolutional architectures that share filters across multiple convolution layers. They show that recurrence with deeper layers tends to increase performance. However, their recursive architecture performs worse than independent convolution layers due to overfitting. In most previous works, the filters themselves are shared across layers. In contrast, we propose to share filter bases, which are more fundamental and reusable building blocks from which layer-specific filters are constructed. More recently, Jastrzębski et al. [14] show that the iterative refinement of features in ResNets suggests that deep networks can potentially leverage intensive parameter sharing. Guo et al. [32] introduce a gate unit that determines whether to jump out of the recursive loop of convolution blocks to save computational resources. These works show that training recursive networks with naively shared blocks leads to poor performance due to exploding and vanishing gradients, as in RNNs [13][33]. To mitigate this problem, they suggest an unshared batch normalization strategy. In our work, we propose an orthogonality regularization of shared filter bases to further address this problem.

Savarese et al.'s work [34] is most similar to ours. In their parameter sharing scheme, different layers of ConvNets are defined by linear combinations of parameter tensors from a global bank of templates. Though similar, our method proposes filter bases as more fine-grained and reusable units for parameter sharing, and it allows combining non-shared layer-specific components into filter bases to express the peculiarity of each convolution layer. Our results show that these layer-specific non-shared components are critical to achieving high performance. Further, our filter basis sharing method achieves significant savings not just in parameters, but also in computational costs.

3 Proposed Method

In this section, we discuss how to share the parameters of ConvNets effectively by decomposing typical convolution layers into more reusable units, or filter bases. We also discuss how to train shared filter bases effectively.
Consider a convolution layer with $S$ input channels, $T$ output channels, and a set of filters $W = \{ W_t \in \mathbb{R}^{k \times k \times S},\ t \in [1..T] \}$. Each filter $W_t$ can be decomposed using a lower-rank filter basis $W_{basis}$ and coefficients $\alpha$:

$$W_t = \sum_{r=1}^{R} \alpha_{rt} W^{r}_{basis}, \qquad (1)$$

where $W_{basis} = \{ W^{r}_{basis} \in \mathbb{R}^{k \times k \times S},\ r \in [1..R] \}$ is a filter basis and $\alpha = \{ \alpha_{rt} \in \mathbb{R},\ r \in [1..R],\ t \in [1..T] \}$ is a set of scalar coefficients. In Equation 1, $R$ is the rank of the basis. In a typical convolution layer, the output feature maps $V_t,\ t \in [1..T]$, with $V \in \mathbb{R}^{w \times h \times T}$, are obtained by convolution between the input feature maps $U \in \mathbb{R}^{w \times h \times S}$ and the filters $W_t$. With Equation 1, this convolution can be rewritten as:

$$V_t = U * W_t = U * \sum_{r=1}^{R} \alpha_{rt} W^{r}_{basis} \qquad (2)$$
$$= \sum_{r=1}^{R} \alpha_{rt} \left( U * W^{r}_{basis} \right), \quad t \in [1..T]. \qquad (3)$$

In Equation 3, the order of the convolution operation and the linear combination of the filter basis is exchanged, using the linearity of the convolution operator. This shows that a standard convolution layer can be replaced with two successive convolution layers, as shown in Figure 1(b). The first decomposed convolution layer performs $R$ convolutions between $W^{r}_{basis},\ r \in [1..R]$, and the input feature maps $U$, generating intermediate basis feature maps $V_{basis} \in \mathbb{R}^{w \times h \times R}$. The second decomposed convolution layer performs point-wise $1 \times 1$ convolutions that linearly combine the $R$ intermediate feature maps $V_{basis}$ to generate the output feature maps $V$. In previous works, the primary goal of such decomposition is to reduce computational complexity [11][12]. For example, the computational complexity of the original convolution is $O(whk^2ST)$, while the decomposed operation takes $O(wh(k^2SR + RT))$. As long as $R < T$, the decomposed convolution has lower computational complexity than the original convolution.
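To make the two-layer form of Equation 3 concrete, the following is a minimal PyTorch sketch; the class name and defaults are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecomposedConv2d(nn.Module):
    """Two-layer replacement for a k x k convolution, following Equation 3."""

    def __init__(self, in_channels, out_channels, rank,
                 kernel_size=3, stride=1, padding=1):
        super().__init__()
        # First layer: the filter basis W_basis, i.e., R filters of size k x k x S.
        self.basis = nn.Conv2d(in_channels, rank, kernel_size,
                               stride=stride, padding=padding, bias=False)
        # Second layer: point-wise 1 x 1 convolution holding the coefficients
        # alpha_rt that linearly combine the R basis feature maps.
        self.coeffs = nn.Conv2d(rank, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        # Equation 3: V_t = sum_r alpha_rt (U * W^r_basis)
        return self.coeffs(self.basis(x))

# Cost check for k = 3, S = T = 256, R = 64: the original layer holds
# k^2*S*T = 589,824 weights; the decomposed pair holds k^2*S*R + R*T = 163,840,
# mirroring O(wh k^2 ST) vs. O(wh(k^2 SR + RT)).
y = DecomposedConv2d(256, 256, rank=64)(torch.randn(1, 256, 32, 32))
assert y.shape == (1, 256, 32, 32)
```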
In typical ConvNets, convolution layers have different filters $W$ and, hence, each decomposed convolution layer has its own filter basis $W_{basis}$ and coefficients $\alpha$. In contrast, our primary goal in decomposing convolution layers is to share a single filter basis (or a small number of filter bases) across many convolution layers. Some previous works, such as [14][35], propose to share the convolution filters $W$ themselves across multiple layers, so the peculiarity of individual convolution layers is expressed only through layer-specific non-shared batch normalization. In contrast, we argue that a filter basis $W_{basis}$ is a more intrinsic and reusable building block that can be shared effectively, since a filter basis constitutes a subspace in which the high-dimensional filters of many convolution layers can be approximated.

Though the components of a basis only need to be independent and span a vector subspace, some specific bases are more convenient and appropriate for specific purposes. For the purpose of sharing a filter basis, we need to find an optimal filter basis $W_{basis}$ that expedites the training of the filters of shared convolution layers. Although this optimization can be done with typical stochastic gradient descent (SGD), one problem is that exploding/vanishing gradients might prevent efficient search of the optimization space. More formally, consider a series of $N$ decomposed convolution layers in which a filter basis $W_{basis}$ is shared $N$ times. Let $x_i$ be the input of the $i$-th convolution layer, and let $a_{i+1}$ be the output of the convolution of $x_i$ with the filter basis $W_{basis}$:

$$a_{i+1}(x_i) = W_{basis}^{\top} x_i. \qquad (4)$$

In Equation 4, $W_{basis} \in \mathbb{R}^{k^2 S \times R}$ is a reshaped filter basis with the basis components as its columns. We assume that the input $x$ is properly adapted (e.g., with im2col) so that convolutions are expressed as matrix-matrix multiplications. Since $W_{basis}$ is shared across $N$ convolution layers, the gradient of $W_{basis}$ for a loss function $L$ is:

$$\frac{\partial L}{\partial W_{basis}} = \sum_{i=1}^{N} \frac{\partial L}{\partial a_N} \prod_{j=i}^{N-1} \left( \frac{\partial a_{j+1}}{\partial a_j} \right) \frac{\partial a_i}{\partial W_{basis}}, \qquad (5)$$

where

$$\frac{\partial a_{j+1}}{\partial a_j} = \frac{\partial a_{j+1}}{\partial x_j} \frac{\partial x_j}{\partial a_j} = W_{basis}^{\top} \frac{\partial x_j}{\partial a_j}. \qquad (6)$$

If we plug $W_{basis}^{\top} \frac{\partial x_j}{\partial a_j}$ from Equation 6 into Equation 5, we see that $\prod_j \frac{\partial a_{j+1}}{\partial a_j}$ is the term that makes gradients unstable, since $W_{basis}$ is multiplied many times. These exploding/vanishing gradients can be controlled to a large extent by keeping $W_{basis}$ close to orthogonal [33]. For instance, if $W_{basis}$ admits an eigendecomposition, $[W_{basis}]^N$ can be rewritten as:

$$[W_{basis}]^N = [Q \Lambda Q^{-1}]^N = Q \Lambda^N Q^{-1}, \qquad (7)$$

where $\Lambda$ is a diagonal matrix with the eigenvalues on its diagonal and $Q$ is the matrix of corresponding eigenvectors. If $W_{basis}$ is orthogonal, $[W_{basis}]^N$ neither explodes nor vanishes, since all eigenvalues of an orthogonal matrix have absolute value 1. Similarly, an orthogonal shared basis ensures that forward signals neither explode nor vanish. We also need to ensure that the norm of $\frac{\partial x_j}{\partial a_j}$ in Equation 5 is bounded [13] for stability during the forward and backward passes. It has been shown that batch normalization after the non-linear activation at each convolution layer ensures healthy norms [36][14][32].

For training networks, the orthogonality of shared bases can be enforced with an orthogonality regularizer.
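The sharing scheme just described might look as follows in PyTorch: a single basis convolution is reused by every layer in a group, while coefficients and batch normalization stay layer-specific. This is a sketch under assumed names, with residual connections omitted.

```python
import torch
import torch.nn as nn

class SharedBasisGroup(nn.Module):
    """N decomposed layers that reuse one filter basis, each with its own
    coefficients and (non-shared) batch normalization."""

    def __init__(self, channels, rank, num_layers):
        super().__init__()
        # A single basis convolution: its weights receive the summed gradients
        # of Equation 5, one contribution per layer that reuses it.
        self.shared_basis = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.coeffs = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_layers)])
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(channels) for _ in range(num_layers)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        for coeff, bn in zip(self.coeffs, self.bns):
            # Same basis at every iteration, layer-specific mixing on top.
            x = self.relu(bn(coeff(self.shared_basis(x))))
        return x
```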
For instance, when each residual block group of a ResNet shares a filter basis for its convolution layers, the objective function $L_R$ can be defined to add an orthogonality regularizer to the original loss $L$:

$$L_R = L + \lambda \sum_{g=1}^{G} \left\lVert W^{(g)\top}_{basis} \cdot W^{(g)}_{basis} - I \right\rVert, \qquad (8)$$

where $W^{(g)}_{basis}$ is the shared filter basis of the $g$-th residual block group and $\lambda$ is a hyperparameter.

In our filter basis sharing approach, the filters of many convolution layers are constructed by linear combination of a shared filter basis, as in Equation 1. This implies that those high-dimensional filters all lie in the same low-dimensional subspace. If the rank of a filter basis is too low, it is very challenging to find a subspace that can express the individual peculiarity of many layers' filters. Conversely, if the rank of a shared filter basis is too high (e.g., $R \geq T$), the gain in computational complexity from decomposing filters is lost. One way to increase the representational power of each convolution layer, while keeping its computational complexity low, is to add a small number of layer-specific components to the filter basis. We therefore build a filter basis $W_{basis}$ using not only shared components but also non-shared components:

$$W_{basis} = W_{bs\_shared} \cup W_{bs\_unique}, \qquad (9)$$

where $W_{bs\_shared} = \{ W^{r}_{bs\_shared} \in \mathbb{R}^{k \times k \times S},\ r \in [1..n] \}$ are shared filter basis components and $W_{bs\_unique} = \{ W^{r}_{bs\_unique} \in \mathbb{R}^{k \times k \times S},\ r \in [n+1..R] \}$ are per-layer non-shared filter basis components. With this hybrid scheme, the filters of different convolution layers are placed in different layer-specific subspaces. One disadvantage of this hybrid scheme is that non-shared filter basis components require more parameters. The ratio of non-shared basis components can be varied to control this tradeoff, but our results in Section 4 show that only a few per-layer non-shared basis components are enough to achieve high performance.

4 Experiments

In this section, we perform a series of experiments on image classification tasks. Using ResNets [3] as base models, we train networks with the proposed method and compare them with the baseline networks. We also analyze the effect of the orthogonality regularizer and the hybrid scheme.
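Before presenting results, the following sketch illustrates how the hybrid basis of Equation 9 and the regularized loss of Equation 8 might be implemented. Function names and the default value of lambda are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def hybrid_basis_conv(x, shared_basis, unique_basis, coeffs):
    """Equation 9: W_basis = W_bs_shared U W_bs_unique. Both basis convolutions
    run on x; the 1 x 1 coefficients mix all n + (R - n) basis feature maps."""
    v = torch.cat([shared_basis(x), unique_basis(x)], dim=1)
    return coeffs(v)

def orthogonality_penalty(shared_basis):
    """||W^T W - I|| for a shared basis conv, with its weight tensor of shape
    (R, S, k, k) reshaped to the (k^2 S) x R matrix used in Equation 4."""
    w = shared_basis.weight.flatten(1).t()     # (k*k*S) x R
    gram = w.t() @ w                           # R x R
    eye = torch.eye(gram.size(0), device=gram.device)
    return torch.norm(gram - eye)

def regularized_loss(logits, targets, shared_bases, lam=1e-3):
    """Equation 8: task loss plus the penalty summed over the shared basis of
    every residual block group g (the value of lam here is an assumed default)."""
    return F.cross_entropy(logits, targets) + lam * sum(
        orthogonality_penalty(b) for b in shared_bases)
```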
Throughout the experiments, we use ResNets [3] as base networks, replacing their $3 \times 3$ convolution layers with decomposed convolution layers that share filter bases. Since the residual block groups of a ResNet have different numbers of channels and kernel sizes, our networks share a filter basis only within the same group. In each group with $n$ residual blocks, the first block has a different stride and, hence, does not share a filter basis. Each residual block of the baseline ResNets has two $3 \times 3$ convolution layers, so each group of our networks has $2(n-1)$ decomposed convolution layers sharing a filter basis. Throughout the experiments, we denote by ResNet$L$-S$s$U$u$ a ResNet with $L$ layers whose filter basis has $s$ shared components and $u$ layer-specific non-shared components in the first residual block group. This ratio between $s$ and $u$ is maintained for all residual block groups. However, since each group of a ResNet has $2\times$ more filters than its prior group, each group of our networks has a $2\times$ higher filter basis rank than its prior group. For instance, the second residual block group of ResNet34-S16U1 has $2 \times 16 = 32$ shared components of the filter basis, and each convolution layer in that group has $2 \times 1 = 2$ layer-specific components. Hence, each decomposed convolution layer in the group uses 34 (32 shared and 2 non-shared) filter basis components.

The CIFAR-10 and CIFAR-100 datasets contain 50,000 and 10,000 three-channel $32 \times 32$ images for training and testing, respectively. For training, we follow a training scheme similar to [3]. Standard data augmentation and normalization are applied to the input data. Networks are trained for 300 epochs with an SGD optimizer, a weight decay of 5e-4, and a momentum of 0.9. The learning rate is initialized to 0.1 and is decayed by a factor of 10 at 50% and 75% of the epochs.

Table 1 shows the results on CIFAR-100. Networks trained with the proposed method clearly outperform their ResNet counterparts in every aspect. For instance, ResNet34-S32U1 requires only 36.2% of the parameters and 66.6% of the FLOPs of the counterpart ResNet34. Furthermore, ResNet34-S32U1 achieves an even lower test error (21.79%) than the much deeper ResNet50 (22.36%). To show the generality of our approach, we apply the proposed method to DenseNet [37] and ResNeXt [38] as well. Although the overall gain is not as great as for ResNets, we still observe reduced resource usage in both networks. For instance, ResNeXt50-S64U4 outperforms the counterpart ResNeXt50 while saving 16.7% of parameters and 12.1% of FLOPs, respectively. In ResNeXt, the gain is limited because it mainly exploits group convolutions; each group convolution is decomposed for filter basis sharing in our network. Similarly, in DenseNet, each $3 \times 3$ convolution layer has a relatively small number of output channels and, hence, the overall gain is not as pronounced as for ResNet.

Table 1: Error (%) on CIFAR-100. '†' denotes that orthogonality regularization is not applied.

Model                   Params   FLOPs    Error
ResNet18                11.22M   1.11G    23.25
ResNet34                21.33M   2.33G    22.49
ResNet50                23.71M   2.61G    22.36
DenseNet121              7.05M   1.81G    21.95
ResNeXt50               23.17M   2.71G    20.71
ResNet34-S8U1            5.87M   0.79G    23.11
ResNet34-S16U0           6.20M   1.02G    23.43
ResNet34-S16U1           6.49M   1.05G    22.64
ResNet34-S32U0           7.44M   1.52G    22.32
ResNet34-S32U1†              –       –        –
DenseNet121-S16U1            –       –        –
Parameter Sharing [34]  12M      10.49G   19.13
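For reference, the group-wise rank-doubling rule described in the setup above can be written out as a small helper; the function name and the group count are hypothetical, shown only to make the ResNet$L$-S$s$U$u$ notation concrete.

```python
def basis_sizes(s, u, num_groups=4):
    """(shared, unique) basis component counts per residual block group:
    group g doubles the counts of its prior group."""
    return [(s * 2 ** g, u * 2 ** g) for g in range(num_groups)]

# ResNet34-S16U1: the second group (g = 1) gets 2*16 = 32 shared and
# 2*1 = 2 layer-specific components, i.e., 34 per decomposed layer.
print(basis_sizes(16, 1))   # [(16, 1), (32, 2), (64, 4), (128, 8)]
```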
Table 2: Error (%) on CIFAR-10. '⋆' denotes having 2 shared bases in each group. '†' denotes that orthogonality regularization is not applied.

Model                 Params   FLOPs   Error
ResNet32 [3]          0.46M    0.14G   7.51
ResNet56 [3]          0.85M    0.25G   6.97
ResNet110 [3]         1.73M    0.51G   6.43
ResNet32-S8U1         0.15M    0.10G   8.08
ResNet32-S16U1        0.20M    0.16G   7.43
ResNet32-S16U1⋆           –        –      –
ResNet56-S8U1         0.20M    0.17G   7.52
ResNet56-S16U0        0.22M    0.28G   7.84
ResNet56-S16U1†           –        –      –
ResNet56-S16U1⋆           –        –      –
Filter Pruning [7]    0.77M    0.18G   6.94
Basis Learning [12]   0.20M    0.46G   6.60
The results on CIFAR-10 are presented in Table 2. Unlike the networks on CIFAR-100, networks on CIFAR-10 have far fewer channels (e.g., 16 channels in the first residual block group) and, hence, projecting filters onto such a low-dimensional subspace can limit network performance. For instance, in ResNet32-S8U1, filters are projected onto a 9-dimensional subspace consisting of 8 shared and 1 layer-specific filter basis components. Consequently, ResNet32-S8U1 yields a higher test error (8.08%) than its counterpart ResNet32 (7.51%). By increasing the rank of the filter bases, better accuracy can be achieved at the cost of increased FLOPs. For instance, ResNet32-S16U1's test error (7.43%) is lower than ResNet32's (7.51%), but ResNet32-S16U1 requires 14.2% more FLOPs than ResNet32. For deeper networks such as ResNet56, a filter basis must be shared by many residual blocks in a group, which can hurt performance. For example, every filter basis in ResNet56-S16U1 is shared by 8 two-layer residual blocks, or 16 convolution layers.
Figure 2: Test errors vs. the number of parameters and FLOPs on CIFAR-100, as the number of shared basis components ($s$) and non-shared basis components ($u$) is varied. Using more shared basis components results in better performance. In contrast, using more non-shared components does not always improve performance.

Due to this excessive sharing, though ResNet56-S16U1 saves 41.3% of parameters, its test error (7.46%) is higher than the counterpart ResNet56's (6.97%). To remedy this problem, we introduce a variant in which each residual block group of the network uses 2 shared bases; one basis is shared by the first convolution layers of all residual blocks, and the other is shared by the second convolution layers of the same blocks. In Table 2, networks with a '⋆' mark denote this variant. Though this variant slightly increases the parameters of the network, it prevents excessive sharing of parameters. For example, although ResNet56-S16U1⋆ needs 0.04M more parameters for the additional shared bases, it still saves 63% of the parameters of the counterpart ResNet56 and achieves a significantly lower test error, 6.30%. It should be noted that this test error of ResNet56-S16U1⋆ is even lower than that of the much deeper ResNet110 (6.43%). These results demonstrate that the original networks on CIFAR are heavily overparameterized and that our method is highly effective in compressing such overparameterized networks.
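A minimal sketch of this 2-basis '⋆' variant follows; the class name and wiring are assumptions, with batch normalization and downsampling omitted for brevity.

```python
import torch.nn as nn

class TwoBasisGroup(nn.Module):
    """Each group keeps two shared bases: one reused by the first convolution
    of every residual block, the other by the second convolution."""

    def __init__(self, channels, rank, num_blocks):
        super().__init__()
        self.basis1 = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.basis2 = nn.Conv2d(channels, rank, 3, padding=1, bias=False)
        self.coeffs1 = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_blocks)])
        self.coeffs2 = nn.ModuleList(
            [nn.Conv2d(rank, channels, 1, bias=False) for _ in range(num_blocks)])

    def forward(self, x):
        for c1, c2 in zip(self.coeffs1, self.coeffs2):
            out = c1(self.basis1(x)).relu()   # first conv of the block
            out = c2(self.basis2(out))        # second conv of the block
            x = (x + out).relu()              # simplified residual add
        return x
```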
Figure 3: Cosine similarities of the bases and coefficients of ResNet34-S16U1 (2nd and 3rd groups), (a) without and (b) with orthogonality regularization. In the upper row, the X and Y axes index each group's vectorized filter basis components (both shared and non-shared). The lower row shows the corresponding coefficients in the groups. Brighter colors correspond to higher similarity.

Figure 2 shows test errors as parameters and FLOPs are increased by varying the number of shared/non-shared basis components. In general, higher performance is expected with more parameters. We observe that this presumption holds for shared basis components: when the number of shared basis components $s$ is varied from 8 to 32, the test error drops sharply from 23.1% to 21.7%. However, non-shared basis components show counter-intuitive results. Although a small number of non-shared basis components (e.g., $u = 1$) is clearly beneficial to performance, higher values of $u$ do not always lead to higher performance. For instance, when $u = 4$, both ResNet34-S16U$u$ and ResNet34-S32U$u$ show the worst performance. This result demonstrates the difficulty of training networks with more parameters; further study is required on this problem.

To analyze the effect of the orthogonality regularizer, Figure 3 illustrates the absolute cosine similarities of all filter basis components and coefficients of the 2nd and 3rd residual block groups of ResNet34-S16U1. In the upper row, the X and Y axes index the shared basis components first and the layer-specific non-shared basis components next. As expected, the shared filter basis components have almost zero cosine similarity when the orthogonality regularization of Equation 8 is applied. The bottom row shows the absolute cosine similarities of the coefficients of the corresponding groups. We can clearly see that the coefficients exhibit lower similarities when the orthogonality regularizer is applied. Without the orthogonality regularization, interesting grid patterns are observed in the coefficients. This repetitive grid pattern might be related to ResNets' nature of iterative refinement [14]. However, since these bases and coefficients are used to build layer-specific filters, we conjecture that such high cosine similarities imply higher redundancy in the networks. With the orthogonality regularization, such repetitive patterns are less evident, implying less redundancy.

We evaluate our method on the ILSVRC2012 dataset [39], which has 1000 classes. The dataset consists of 1.28M training and 50K validation images. For training, we follow the ImageNet training procedure of [40]. The networks are trained for 120 epochs with an SGD optimizer and a mini-batch size of 256. We use a weight decay of 1e-4 and a momentum of 0.9. The learning rate starts at 0.1 and decays by a factor of 10 at the 45th, 75th, and 110th epochs. For data augmentation, we apply random 224 × 224 crops and horizontal flips. For this experiment, we use ResNet34 as the base model and replace its 3 × 3 convolution layers with decomposed convolution layers sharing filter bases.

Table 3: Error (%) on ImageNet. '⋆' denotes having 2 shared bases in each residual block group.

Model              Params   FLOPs   top-1   top-5
ResNet18           11.69M   3.64G   30.24   10.92
ResNet34           21.80M   7.34G   26.70    8.58
ResNet34-S16U1      6.96M   3.44G   30.04   10.50
ResNet34-S32U1      8.20M   4.98G   28.42    9.55
ResNet34-S32U1⋆         –       –   27.69    9.11
Table 3 shows the results. Although our networks do not outperform the counterpart ResNet34 in accuracy, they show competitive performance while using fewer computational resources. For example, ResNet34-S32U1⋆ achieves 27.69%/9.11% top-1/top-5 validation errors using only 44.7% of the parameters and 67.8% of the FLOPs of ResNet34. Though we expect better performance with higher-rank filter bases, we leave further investigation of this tradeoff to future work.
5 Conclusion

In this work, we propose sharing the filter bases of decomposed convolution layers for efficient and effective parameter sharing in ConvNets. The usual gradient exploding/vanishing problem of recursive networks is addressed by the proposed orthogonality regularizer. Further, we increase the representation power of each convolution layer by combining a small number of non-shared components into the filter bases. Experiments on CIFAR and ImageNet show that our method reduces parameters and computational costs substantially while achieving competitive performance. In particular, on heavily overparameterized networks on CIFAR, our method outperforms much deeper counterpart original networks. The proposed filter basis sharing method could be further extended to other kinds of common convolutions, such as 1 × 1 and depthwise convolutions.

References

[1] Krizhevsky, A., I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[2] Szegedy, C., W. Liu, Y. Jia, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[3] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[4] Hu, J., L. Shen, G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[5] Denton, E. L., W. Zaremba, J. Bruna, et al. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[6] Han, S., J. Pool, J. Tran, et al. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[7] Li, H., A. Kadav, I. Durdanovic, et al. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
[8] Jaderberg, M., A. Vedaldi, A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[9] Lebedev, V., Y. Ganin, M. Rakhuba, et al. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations, 2015.
[10] Kim, Y.-D., E. Park, S. Yoo, et al. Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations (ICLR), 2016.
[11] Zhang, X., J. Zou, K. He, et al. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.
[12] Li, Y., S. Gu, L. Van Gool, et al. Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
[13] Pascanu, R., T. Mikolov, Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[14] Jastrzębski, S., D. Arpit, N. Ballas, et al. Residual connections encourage iterative inference. In International Conference on Learning Representations, 2018.
[15] Hinton, G., O. Vinyals, J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Chen, G., W. Choi, X. Yu, et al. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pages 742–751, 2017.
[17] LeCun, Y., J. S. Denker, S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[18] Polyak, A., L. Wolf. Channel-level acceleration of deep face representations. IEEE Access, 3:2163–2175, 2015.
[19] He, Y., X. Zhang, J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
[20] Han, S., H. Mao, W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2016.
[21] Son, S., S. Nah, K. Mu Lee. Clustering convolutional kernels to compress deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 216–232, 2018.
[22] Li, Y., S. Lin, B. Zhang, et al. Exploiting kernel sparsity and entropy for interpretable CNN compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2809, 2019.
[23] Howard, A. G., M. Zhu, B. Chen, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[24] Zhang, X., X. Zhou, M. Lin, et al. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[25] Sandler, M., A. Howard, M. Zhu, et al. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[26] Graves, A., A. Mohamed, G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.
[27] Socher, R., C. C. Lin, C. Manning, et al. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136, 2011.
[28] Liang, M., X. Hu. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3367–3375, 2015.
[29] Xingjian, S., Z. Chen, H. Wang, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[30] Zamir, A. R., T.-L. Wu, L. Sun, et al. Feedback networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1808–1817. IEEE, 2017.
[31] Eigen, D., J. Rolfe, R. Fergus, et al. Understanding deep architectures using a recursive convolutional network. In International Conference on Learning Representations (ICLR), 2014.
[32] Guo, Q., Z. Yu, Y. Wu, et al. Dynamic recursive neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5142–5151, 2019.
[33] Vorontsov, E., C. Trabelsi, S. Kadoury, et al. On orthogonality and learning recurrent networks with long term dependencies. In Proceedings of the 34th International Conference on Machine Learning, pages 3570–3578, 2017.
[34] Savarese, P., M. Maire. Learning implicitly recurrent CNNs through parameter sharing. In International Conference on Learning Representations, 2019.
[35] Köpüklü, O., M. Babaee, S. Hörmann, et al. Convolutional neural networks with layer reuse. In IEEE International Conference on Image Processing (ICIP), pages 345–349. IEEE, 2019.
[36] Ioffe, S., C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[37] Huang, G., Z. Liu, L. Van Der Maaten, et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[38] Xie, S., R. Girshick, P. Dollár, et al. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[39] Russakovsky, O., J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.