IC Networks: Remodeling the Basic Unit for Convolutional Neural Networks
Junyi An, Fengshan Liu, Jian Zhao, Furao Shen∗
Department of Computer Science and Technology, Nanjing University, Nanjing, China
School of Electronic Science and Engineering, Nanjing University, Nanjing, China
{junyian, liufengshan}@smail.nju.edu.cn, {frshen, jianzhao}@nju.edu.cn

Abstract
Convolutional neural network (CNN) is a class of artificial neural networks widely used in computer vision tasks. Most CNNs achieve excellent performance by stacking certain types of basic units. In addition to increasing the depth and width of the network, designing more effective basic units has become an important research topic. Inspired by the elastic collision model in physics, we present a general structure that can be integrated into existing CNNs to improve their performance. We term it the "Inter-layer Collision" (IC) structure. Compared to the traditional convolution structure, the IC structure introduces nonlinearity and feature recalibration into the linear convolution operation, which can capture more fine-grained features. In addition, a new training method, namely weak logit distillation (WLD), is proposed to speed up the training of IC networks by extracting knowledge from pre-trained basic models. In the ImageNet experiment, we integrate the IC structure into ResNet-50 and reduce its top-1 error, matching the top-1 error of the deeper ResNet-101 with nearly half of the FLOPs.

1 Introduction

Convolutional neural networks (CNNs) have made great achievements in the field of computer vision. The success of AlexNet [Krizhevsky et al., 2012] and VGGNet [Simonyan and Zisserman, 2015] shows the superiority of deep networks, leading to a trend of building larger and deeper networks. However, this method is inefficient in improving network performance. On the one hand, increasing the depth and width brings a huge computational burden and causes a series of problems, such as model degradation. On the other hand, because the relationship between different hyper-parameters is complicated, the increased number of hyper-parameters makes it more difficult to design deep networks.
Therefore, the research focus in recent years has gradually shifted to improving the representation ability of basic network units in order to design more efficient CNN architectures.

The convolutional layer is a basic unit of CNNs. By stacking a series of convolutional layers together with non-linear activation layers, CNNs are able to produce image representations that capture abstract features and cover global theoretical receptive fields. For each convolutional layer, a filter can capture a certain type of feature in the input. However, not all features contribute to a given task. Recent studies have shown that a network can obtain more powerful representation ability through feature recalibration, which emphasizes informative features and suppresses less useful ones [Bell et al., 2016; Hu et al., 2018]. Besides, there is evidence that introducing non-linear kernel methods into the convolutional layer can improve the generalization ability of the network [Wang et al., 2019; Zoumpourlis et al., 2017]. However, kernel methods may cause overfitting by complicating the networks, and they introduce a large amount of computation.

We argue that the convolutional layer can also benefit from a simple non-linear representation, and we propose a structure that introduces a non-linear operation into the convolutional layer and also performs feature recalibration to enhance its representation ability. Starting from the most basic neural network structure, where a linear transformation and a non-linear activation function are applied to the input successively, we use a non-linear operation to enrich the representation of the linear part. The proposed structure divides the input space into multiple subspaces that can represent different linear transformations, providing more patterns for the subsequent activation function.
We call this structure the inter-layer collision (IC) neuron, since it is inspired by the elastic collision model that we use to mimic the way information is transmitted between adjacent layers.

We build the IC convolutional layer by combining the IC neuron with the convolution operation. The structure of the IC layer is depicted in Figure 1, where F_conv denotes the convolution operation mapping the input X to the feature maps U. An IC branch used to divide the input space can be easily combined with the F_conv branch. When the input X passes through the IC branch, the F_ex operation first extracts local features by aggregating input features in local regions, representing the local distribution of channel-wise input features. Then, the local features are recalibrated by the adjustment operation F_ad to increase the flexibility of the representation. Finally, the local features pass through an F_com operation, which combines the two branches to generate an extra linear transformation. The final output U′ of the IC layer, with the same spatial dimensions (H × W) as U, can be fed directly into subsequent layers of the network. In addition, the IC branch introduces only a small computational cost, because its operations use lightweight structures.

Figure 1: The structure of a complete IC layer.

We construct a set of IC networks by integrating IC layers into existing models. Our experiments show that the IC networks yield significant improvements over the basic models with little additional computational burden. However, training from scratch may take a lot of time and suffer from complex hyper-parameter settings, especially when the basic models are large. From [Kirkpatrick et al., 2017], we find that the basic models may have parameter configurations similar to those of the corresponding IC models. Therefore, we propose a training method, namely the weak logit distillation (WLD), which distills the knowledge of pre-trained basic models.
By combining the optimal basic models with a loss using weak soft targets, we show that WLD needs only a few training rounds to match or even exceed the result of training from scratch. In summary, our contributions are:

• We propose a novel structure called the IC layer that can be used to build CNN architectures. We prove its effectiveness in enhancing representation and integrate it into several existing models. The experiments show the universality and superiority of the IC layer.

• We propose a method called WLD that guides the learning of IC networks. Its novelty lies in extracting knowledge from smaller teacher models through the knowledge distillation (KD) method. The experiments show that with WLD, the IC networks achieve higher performance with a shorter training time. In particular, IC-ResNet-50 integrates the IC layer into ResNet-50 and reduces its top-1 error to the level of ResNet-101.

2 Related Work

Effective computational unit.
Designing effective basic CNN units is a significant topic, since it reduces the difficulty of designing architectures: basic units can be dropped directly into existing models. [Hu et al., 2018] proposed the SE block, which uses feature recalibration to improve the performance of existing networks. However, SE blocks are usually combined with building blocks rather than with more basic structures. [Wang et al., 2019] introduced the non-linear kernel method to convolutional layers to improve representation. Although that work bypassed the explicit calculation of the high-dimensional features via a kernel trick, the complexity is clearly increased. In contrast to these works, our proposed IC branch introduces the non-linear representation through a lightweight structure. Besides, it is combined with a single convolutional layer, so it can be applied to a wider range of CNN architectures.
Knowledge Distillation.
Recent KD methods use feature distillation [Heo et al., 2019] or self-supervision [Xu et al., 2020] to extract deep knowledge from a larger teacher model. Different from them, our training method WLD distills knowledge from a smaller pre-trained model while tolerating the gap between the teacher and student models. WLD is novel in that it uses the KD method to solve a different task: learning the optimal representation quickly and precisely when new components are introduced into a CNN. The loss of WLD follows the common KD loss proposed in [Hinton et al., 2015].
3 Method

In this section, we first show how the IC structure works and how it combines with existing CNNs. Then, we introduce the WLD method to optimize IC networks. Finally, we analyze the influence of the IC structure on model complexity.
The MP neuron [McCulloch and Pitts, 1990] is the most commonly used neuron model. It can be formulated as f(Σ_{i=1}^n w_i x_i + b), where a linear transformation and a non-linear activation function f(·) are applied to the n-D input successively; w_i and b denote the learnable weights and the bias, respectively. To facilitate the non-linear representation of neural models, we propose a new neuron model by replacing the linear transformation with a non-linear one:

  y = f( Σ_{i=1}^n w_i x_i + σ( Σ_{i=1}^n (w_i − 1) x_i + b_1 ) + b_2 )
    = f( Σ_{i=1}^n w_i x_i + σ( Σ_{i=1}^n w_i x_i − x_sum + b_1 ) + b_2 ),   (1)

where f(·) denotes a non-linear activation function and σ(·) is the rectified linear unit (ReLU) [Nair and Hinton, 2010]. b_1 and b_2 are two independent biases used to adjust the center of the model distribution, and x_sum = Σ_{i=1}^n x_i denotes the summation of all the input features. We term the structure defined in eq. (1) the inter-layer collision (IC) neuron, since the idea is inspired by the physical elastic collision model, in which the speeds of the two objects after collision are (2m_1/(m_1+m_2))v and (2m_1/(m_1+m_2) − 1)v (details in Appendix C). We treat 2m_1/(m_1+m_2) as the learnable weight w_i and introduce some mathematical adjustments.

By introducing the σ(·) operation, the neuron model can definitely increase the number of non-linear patterns. We use the term H = Σ_{i=1}^n (w_i − 1) x_i to represent a hyperplane (H = 0) in the n-dimensional Euclidean space, which divides the input space into two half-spaces. Omitting the bias terms, eq. (1) can be rewritten as follows:

  y = f( 2 Σ_{i=1}^n w_i x_i − Σ_{i=1}^n x_i )   if H ≥ 0,
  y = f( Σ_{i=1}^n w_i x_i )                     if H < 0.   (2)

Intuitively, the IC neuron has a stronger representation ability than the MP neuron, since it can produce two different linear representations before the activation operation f(·). To facilitate understanding, we use 2-D data as input and ReLU as the activation function to show how the two kinds of neurons generate non-linear boundaries. Figure 2(a)(b) shows that the single MP neuron and the IC neuron divide the 2-D Euclidean space into multiple subspaces, each of which represents a fixed linear transformation. We observe that a single IC neuron can divide one more subspace to represent a different linear transformation. Furthermore, we map the XOR problem, a typical linearly inseparable problem, onto a 2-D plane to explain the difference between the two kinds of neurons. For the ReLU MP neuron, it is clear that at least two neurons are required to solve the XOR problem. However, the non-linear boundary of the IC neuron is a broken line, giving a single neuron the possibility of solving the XOR problem. Figure 2(c) gives a solution with a single IC neuron, dividing the whole plane into three spaces, where (0,0) and (1,1) are represented by zero and (1,0) and (0,1) lie in two spaces with similar representations.

Figure 2: The value of f(x_1, x_2) given x_1, x_2. (a): f(x_1, x_2) = σ(x_1 − x_2). (b): f(x_1, x_2) = σ(x_1 − x_2 + σ(−x_1)). (c): f(x_1, x_2) = σ(w_1 x_1 + w_2 x_2 + b_2 + σ((w_1 − 1) x_1 + (w_2 − 1) x_2 + b_1)), where the weights w_1 = w_2 and the biases b_1, b_2 shift the boundary across the whole space.

Although a neuron using the hyperplane H = 0 can increase the number of non-linear patterns, the calculation of the hyperplane is limited by the weights w_i, making it inflexible in dividing different subspaces. Hence, the subspaces divided by the hyperplane are usually not optimal, and the weights easily converge to a local minimum. To add more representation flexibility to the hyperplane H = 0, we improve eq. (1) by introducing an adjustment weight w′:

  y = f( Σ_{i=1}^n w_i x_i + σ( Σ_{i=1}^n w_i x_i − w′ x_sum + b_1 ) + b_2 )
    = f( wᵀx + σ( wᵀx − w′ · 1ᵀx + b_1 ) + b_2 ),   (3)

where w and x represent the weight vector and the input vector, respectively, and 1 is an all-one vector. w′ can be regarded as the intrinsic weight of one neuron, which is different from the weights w_i connecting two neurons. There are then two independent parameters in the calculation of the hyperplane H = 0: w′ changes the direction of the hyperplane, and the bias b_1 shifts the hyperplane in the whole space. We give a theoretical analysis of the adjustable range of w′:

Theorem 1. Let w = (w_1, ..., w_n)ᵀ and 1 = (1, ..., 1)ᵀ be two n-D vectors. By adjusting w′, the hyperplane Σ_{i=1}^n (w_i − w′) x_i = 0 can be rotated π rad around the cross product of w and 1 when the two vectors are linearly independent.

Theorem 1 implies that Σ_{i=1}^n (w_i − w′) x_i = 0 can represent almost all the hyperplanes parallel to the cross product of w and 1, providing flexible strategies for dividing spaces. Besides, the IC neuron keeps a significant advantage over the MP neuron in that the Σ_{i=1}^n w_i x_i term can flexibly represent any linear combination in the input space. We provide the following theorem:

Theorem 2.
In a closed n-D input space, for any given MP neuron (w_1, ..., w_n, b), there is always an IC neuron (w_1, ..., w_n, w′, b_1, b_2) that can completely represent this MP neuron.

The proofs of Theorems 1 and 2 are provided in Appendix A. In summary, by adjusting the relationship between w and w′, the IC neuron can retain the representation ability of the MP neuron and flexibly segment linear representation spaces for some complex distributions.

The convolutional kernel, a filter used to capture the latent features in input signals, can be regarded as a combination of the MP model and a sliding window. To simplify the notation, we omit the activation operator and bias term. The output feature u_i ∈ R^{H×W} of the standard convolutional layer is given by

  u_i = w_i ⊗ X,   (4)

where X ∈ R^{H′×W′×C′} is the input feature map, w_i ∈ R^{k×k×C′} is a filter kernel that belongs to W = [w_1, ..., w_C], and ⊗ denotes the convolution operator. To apply the IC neuron model to eq. (4), we replace the kernel w_i with an IC kernel [w_i, w′_i]:

  u_i = [w_i, w′_i] ⊗ X = w_i ⊗ X + σ( w_i ⊗ X − w′_i × (1 ⊗ X) ),   (5)

where 1 is an all-one tensor with the same size as w_i. The input feature map may contain hundreds of channels, and the 1 ⊗ X term mixes the features of all channels in the same proportion. Therefore, we distinguish different features by a grouped convolution trick:

  u_i = w_i ⊗ X + σ( w_i ⊗ X − (1̂ ⊗ X) · w′_i ),   (6)

where 1̂ ⊗ denotes the depthwise convolution [Chollet, 2017], which separates 1 and X into C′ independent channels and performs channel-wise convolution. The adjustment weight w′_i becomes a vector of size C′, recalibrating the features of 1̂ ⊗ X by channel-wise multiplication (·). Note that all the convolution operators (⊗ and 1̂⊗) in eq. (6) share the same stride and padding. The structure in eq. (6), which we term the IC layer, has two main advantages compared to the traditional convolutional layer:

• According to Section 3.1, the single kernel [w_i, w′_i] can represent more linear patterns before activation, which enhances the representation ability of the convolutional layer.

• The information 1̂ ⊗ X contains some low-level features from the previous layer. It helps the filters to learn high-level features faster, since it provides an approximate distribution of the pixels in the input feature maps.

The IC layer can be easily integrated into existing models. Consider the commonly used ResNets [He et al., 2016] as examples. The basic block used by ResNet-18 and ResNet-34 has two 3×3 convolutional layers. The bottleneck block used by deeper ResNets has three convolutional layers (1×1, 3×3, 1×1). Note that the information 1̂ ⊗ X equals X when the kernel size is 1×1, so it provides no extra features for the filter. Therefore, we prefer to replace only the 3×3 layers in building blocks to build IC-ResNets. The combination with other popular models is introduced in Section 4, where we show the universality and superiority of the IC layer.

In order to further understand why the IC layer can capture more fine-grained features, we compare the Grad-CAM visualizations [Selvaraju et al., 2017] of the IC models with their basic models. As shown in Figure 3, the IC networks tend to focus on more relevant regions with more object details. More importantly, the features of IC networks and basic networks have some similarities.

Figure 3: The Grad-CAM visualizations for different models. The images are randomly picked from the ImageNet validation set.
We think that although the features captured by IC networks have finer texture information, their focus on the image is similar to that of the features captured by the basic networks.

Motivated by the observed similarity, we propose a method to guide the learning of IC networks using the knowledge of pre-trained basic networks. Suppose we have a basic network B with a pre-trained set of parameters θ∗, and our goal is to train a corresponding IC network IC-B with better performance. According to Theorem 2 and the argument that a network has many configurations of parameters with the same performance [Kirkpatrick et al., 2017], we assume that IC-B can have a configuration close to θ∗. Based on this hypothesis, we load the set θ∗ into IC-B and add a scaling factor α to control the influence of σ(·). The IC layer in IC-B is initialized as

  u_i = w∗_i ⊗ X + α × σ( w∗_i ⊗ X − (1̂ ⊗ X) · w′_i ),   (7)

where w∗_i is a weight loaded from θ∗ and w′_i is randomly initialized. Since the w∗_i ⊗ X term can capture the features independently, α is set to zero at the beginning to let IC-B start with the same performance as B. After initialization, we fine-tune all the parameters. The σ(·) term benefits from the pre-trained information, making the adjustment weights converge quickly. We use the scaling factor α to gradually amplify the influence of σ(·) during training. There are two strategies to adjust α: the first is to increase its value manually, but this requires accurate hyperparameters; the second is to treat α as a parameter trained together with the other parameters. Our experience shows that the best value of α usually lies in a small sub-interval of (0, 1).

To further use the knowledge of B, we follow the approach of KD. We use the basic network B as the teacher model and IC-B as the student model. The soft targets predicted by a well-optimized teacher model provide extra information compared to the ground truth. To obtain the soft targets of B, temperature scaling [Hinton et al., 2015] is used to soften the peaky softmax distribution:

  p_i(x; τ) = e^{s_i(x)/τ} / Σ_k e^{s_k(x)/τ},   (8)

where x is the data sample, i is the category index, s_i(x) is the score logit that x obtains on category i, and τ is the temperature. The knowledge distillation loss L_kd, measured by KL-divergence, is

  L_kd = −τ² Σ_{x∈D_x} Σ_{i=1}^C p_i^t(x; τ) log p_i^s(x; τ),   (9)

where t and s denote the teacher and student models, C is the total number of classes, and D_x indicates the dataset.
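Eqs. (8)–(9) can be sketched directly in PyTorch. This is a hedged sketch rather than the authors' exact implementation: averaging over the mini-batch instead of summing over D_x is our choice, and the τ² factor follows the standard formulation of [Hinton et al., 2015].

```python
import torch
import torch.nn.functional as F

def soft_targets(logits, tau):
    """Eq. (8): temperature-scaled softmax over the class logits."""
    return F.softmax(logits / tau, dim=1)

def kd_loss(student_logits, teacher_logits, tau):
    """Eq. (9): tau^2-weighted cross-entropy between the teacher's soft
    targets and the student's temperature-scaled log-probabilities,
    averaged over the mini-batch."""
    p_t = soft_targets(teacher_logits, tau)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return -(tau ** 2) * (p_t * log_p_s).sum(dim=1).mean()
```

By Gibbs' inequality this loss is minimized when the student reproduces the teacher's soft distribution exactly; it is precisely this dependence that WLD later relaxes with the tolerance constant e.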
The complete loss function L for training IC-B is a combination of the standard cross-entropy loss L_ce and the knowledge distillation loss L_kd:

  L = (1 − λ) L_ce + λ max(L_kd − e, 0),   (10)

where λ is a balancing weight and e is a constant used to increase the tolerance for the gap between the soft targets of B and IC-B. Different from the traditional KD method, our method does not extract the knowledge from a deeper network with better performance. Therefore, we do not need the output distributions of the teacher and student models to be exactly equal: the loss term max(L_kd − e, 0) allows deviation between p^s and p^t. Besides, during training, we gradually reduce the value of λ to make IC-B less dependent on the teacher network. Eq. (10) provides a strategy to introduce extra information for training IC networks. Combined with loading pre-trained parameters and fine-tuning, the IC networks achieve high performance after a few learning rounds. We term this learning method the weak logit distillation (WLD), since it weakens the impact of the soft targets to reduce the dependence on the teacher network.

Model          Top-1 err.      Top-5 err.     GFLOPs  Params
ResNet-18      30.24 / 27.88   10.92 / 9.42   1.82    11.7M
IC-ResNet-18   – / –           – / –          –       –

Table 1: IC-ResNets performance results on the ImageNet validation set. The error rates (%) use single-crop/10-crop testing.
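Before counting its overhead, the IC layer of eq. (6) can be made concrete in PyTorch. The sketch below is our own minimal reading, not the authors' released code (class and attribute names are assumptions): F_ex is the fixed all-one depthwise kernel 1̂, F_ad is the adjustment weight w′ acting as a 1×1 convolution, and F_com is the additive ReLU combination of the two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICConv2d(nn.Module):
    """Sketch of an IC layer: F_conv plus the IC branch of eq. (6)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # F_conv: the ordinary convolution branch, w ⊗ X.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        # F_ex: a fixed all-one depthwise kernel that sums input features
        # over each local region, channel by channel (1̂ ⊗ X).
        self.register_buffer(
            "ones", torch.ones(in_ch, 1, kernel_size, kernel_size))
        # F_ad: the adjustment weight w', a 1x1 convolution that
        # recalibrates the channel-wise local sums.
        self.w_prime = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.stride, self.padding, self.groups = stride, padding, in_ch

    def forward(self, x):
        u = self.conv(x)
        local_sum = F.conv2d(x, self.ones, stride=self.stride,
                             padding=self.padding, groups=self.groups)
        # F_com: combine the two branches through a ReLU non-linearity.
        return u + torch.relu(u - self.w_prime(local_sum))
```

With kernel_size 3 and padding 1 the output keeps the spatial size of U, so the module is a drop-in replacement for an `nn.Conv2d` inside a ResNet basic block.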
For a standard convolutional layer with k × k receptive fields, the transformation

  X ∈ R^{H′×W′×C′} → U ∈ R^{H×W×C}   (11)

needs k × k × C′ × C parameters. In the IC layer, we calculate the sum of each channel without additional parameters, and the weight W′ = [w′_1, w′_2, ..., w′_C] adds 1 × 1 × C′ × C parameters. Therefore, the number of parameters added by the IC layer is only 1/(k × k) of the original layer.

For computation, the IC layer adds a depthwise convolution and the learnable weight W′, which can be regarded as a 1 × 1 convolution. The depthwise convolution calculates the channel-wise sum of the input with an all-one kernel; since this operation is done only once, the increased computational complexity is the same as adding a single convolutional filter, i.e., 1/C of the original layer. The weight W′ is a 1 × 1 convolution, which uses 1/(k × k) of the computation of a k × k convolution [Sifre and Mallat, 2014]. Therefore, the approximate extra computation is 1/C + 1/(k × k) of the original layer.

4 Experiments

In this section, we investigate the effectiveness of different IC architectures through a series of comparative experiments. Besides, we evaluate the WLD method for training IC networks.
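The parameter and FLOP accounting above can be double-checked with a few lines of arithmetic (multiply-accumulate counts, biases ignored; the helper names are ours, introduced only for illustration):

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulates of a k x k convolution on an h x w output."""
    return k * k * c_in * c_out * h * w

def ic_extra_params(c_in, c_out):
    """The IC branch only adds the 1 x 1 adjustment weight W'."""
    return 1 * 1 * c_in * c_out

def ic_extra_flops(k, c_in, c_out, h, w):
    """All-one depthwise sum (computed once) plus the 1 x 1 convolution W'."""
    return k * k * c_in * h * w + c_in * c_out * h * w

k, c, h, w = 3, 256, 14, 14
param_ratio = ic_extra_params(c, c) / conv_params(k, c, c)
flop_ratio = ic_extra_flops(k, c, c, h, w) / conv_flops(k, c, c, h, w)
print(param_ratio, flop_ratio)  # 1/(k*k) and 1/C + 1/(k*k)
```

For k = 3 and C = 256 this gives roughly 11.1% extra parameters and 11.5% extra multiply-accumulates per replaced layer, consistent with the ratios derived above.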
We use the ILSVRC 2012 classification dataset [Russakovsky et al., 2015], which consists of more than 1 million color images in 1000 classes, split into 1.28M training images and 50K validation images. We use three versions of ResNet (ResNet-18, ResNet-34, ResNet-50) to build the corresponding IC networks. For a fair comparison, all our ImageNet experiments are conducted with the same environment settings: the optimizer is stochastic gradient descent (SGD) [LeCun et al., 1989] with weight decay and momentum, and the learning rate is divided by 10 at fixed epoch intervals. All experiments are implemented with the PyTorch [Paszke et al., 2019] framework on a server with NVIDIA TITAN Xp GPUs.

To compare with previous research [He et al., 2016; Wang et al., 2019], we apply both single-crop and 10-crop testing. We build the IC-ResNet-18, IC-ResNet-34 and IC-ResNet-50 by replacing all the 3×3 convolutional layers in blocks with 3×3 IC layers. The validation error and FLOPs are reported in Table 1. We observe that IC-ResNet-18 and IC-ResNet-34 clearly reduce the 10-crop top-1 and top-5 errors of their baselines with only a small increase in computation, validating the effectiveness of the IC layer. The IC-ResNet-50 retains the 1×1 convolutional layers in bottleneck blocks; its 10-crop top-1 and top-5 errors improve on those of ResNet-50, while its extra FLOPs are only a small fraction of the original model. For the deeper ResNets, we believe that the deeper IC-ResNets can obtain similar results, since they all use the same block as ResNet-50.

Model               Top-1 err.   Top-5 err.
IC-ResNet-18        28.82        9.98
IC-ResNet-34        25.78        8.02
IC-ResNet-50        23.07        6.67
EfficientNet-B0     23.7         6.8
IC-EfficientNet-B0  –            –
EfficientNet-B1     21.2         5.6
IC-EfficientNet-B1  –            –
EfficientNet-B2     20.2         5.1
IC-EfficientNet-B2  –            –

Table 2: The single-crop error rates (%) of WLD results on the ImageNet validation set.

Training with WLD
To evaluate the WLD method, we construct a set of experiments to train the IC-ResNets and three versions of EfficientNet [Tan and Le, 2019] (EfficientNet-B0, EfficientNet-B1 and EfficientNet-B2). The training uses a shorter schedule than training from scratch, with the learning rate divided by 10 at fixed intervals; the tolerance e is fixed, and the balancing weight λ is gradually reduced during training. Table 2 shows that WLD achieves the accuracy of training from scratch with fewer training rounds. Remarkably, by training with WLD, the 10-crop result of IC-ResNet-50 matches the error rates of the deeper ResNet-101 network with nearly half of the computational burden.

This set of experiments proves our hypothesis in Section 3.3 and provides an understandable conclusion: although the low learning rate confines the connection weights to the vicinity of the pre-trained model, the WLD method can still find optimal adjustment weights to improve the representation of IC networks. In particular, the EfficientNets, which come from neural architecture search, are difficult to train from scratch. Through the pre-trained models and the WLD method, the corresponding IC-EfficientNets achieve higher performance with a simple hyperparameter configuration.

Model           Top-1/Top-5 err.   GFLOPs/Params
IC-ResNet-50    23.07/6.68         4.33/26.8M
IC-ResNet-50-B  23.02/6.66         5.27/31.2M
IC-ResNet-50-C  22.96/6.65         6.10/36.2M

Table 3: Performance results of the IC networks with 1×1 IC layers on the ImageNet validation set.
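The warm start of eq. (7) can be sketched as follows: load the pre-trained kernel w∗ into the main branch, initialize w′ randomly, and start α at zero so the IC layer initially reproduces the basic convolution exactly. This is our minimal stand-in, not the authors' code, and it assumes α is a learnable parameter (the second strategy described in Section 3.3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarmStartICConv2d(nn.Module):
    """IC layer initialized from a pre-trained convolution, eq. (7)."""

    def __init__(self, pretrained_conv):
        super().__init__()
        c_out, c_in, k, _ = pretrained_conv.weight.shape
        self.stride = pretrained_conv.stride
        self.padding = pretrained_conv.padding
        self.groups = c_in
        # Main branch: copy the pre-trained kernel w*.
        self.conv = nn.Conv2d(c_in, c_out, k, stride=self.stride,
                              padding=self.padding, bias=False)
        with torch.no_grad():
            self.conv.weight.copy_(pretrained_conv.weight)
        # IC branch: all-one depthwise kernel and a random w'.
        self.register_buffer("ones", torch.ones(c_in, 1, k, k))
        self.w_prime = nn.Conv2d(c_in, c_out, 1, bias=False)
        # alpha starts at zero and is fine-tuned with the other parameters.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        u = self.conv(x)
        local_sum = F.conv2d(x, self.ones, stride=self.stride,
                             padding=self.padding, groups=self.groups)
        return u + self.alpha * torch.relu(u - self.w_prime(local_sum))
```

At α = 0 the output equals w∗ ⊗ X, so IC-B starts from B's accuracy; fine-tuning then grows α and trains w′ from the pre-trained vicinity.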
The effectiveness of 1×1 IC layers

To evaluate 1×1 IC layers, we extend IC-ResNet-50 to build the IC-ResNet-50-B, which additionally replaces the first 1×1 layer in each bottleneck block, and the IC-ResNet-50-C, which replaces all the 1×1 layers. As shown in Table 3, although IC-ResNet-50-B and IC-ResNet-50-C exceed IC-ResNet-50, they noticeably increase the FLOPs of the model. Besides, we observe overfitting in the two IC models: the training accuracy improves significantly but the test accuracy does not. Combined with the analysis in Section 3.2, we argue that 1×1 IC layers focus more on increasing model capacity than on introducing new feature information. When the number of channels of the 1×1 IC layers is relatively large, the models are more likely to overfit and bring an expensive computational burden.
We further investigate the universality of the IC layer by integrating it into other modern architectures. These experiments are conducted on the CIFAR10 dataset, which consists of 60K 32×32 colour images in 10 classes, split into 50K training images and 10K testing images. The learning rate is divided by 10 twice during training, and the optimizer settings are the same as in the ImageNet experiment.

We integrate the IC layers into VGGNets (VGG-16 and VGG-19), MobileNet [Howard et al., 2017], SENets (SE-ResNet-18 and SE-ResNet-50) and ResNeXt [Xie et al., 2017] (2x64d version). Specially, the adjacent convolution layers in MobileNet (a depthwise convolution layer and a pointwise convolution layer) are treated as one convolutional layer when integrating the IC layer, because there is a close relationship between the adjacent layers. The results are listed in Table 4: the IC layers improve the representation of all the basic models, showing the universality of the IC layer. Besides, we observe that the IC networks converge faster than the basic models in both the ImageNet and CIFAR-10 experiments; the training curves are shown in Appendix B.

Model          Top-1 err.   Model                Top-1 err.
VGG-16         6.36         SE-ResNet-18         5.08
IC-VGG-16      –            IC-SE-ResNet-18      –
VGG-19         6.46         SE-ResNet-50         5.10
IC-VGG-19      –            IC-SE-ResNet-50      –
MobileNet      9.92         ResNeXt (2x64d)      4.62
IC-MobileNet   –            IC-ResNeXt (2x64d)   –

Table 4: Results on CIFAR10 with various IC models. We use our environment settings to reproduce the baseline results.

Framework      Backbone       mAP
Faster R-CNN   ResNet-50      79.5
               IC-ResNet-50   –
RetinaNet      ResNet-50      77.3
               IC-ResNet-50   –

Table 5: Results on the PASCAL VOC 2007+2012 test set.
We further assess the generalization of IC networks on the task of object detection using the PASCAL VOC 2007+2012 detection benchmark [Everingham et al., 2010]. This dataset consists of about 5K train/val images and 5K test images over 20 object categories. We use IC-ResNet-50 as the backbone network to capture features, with weights initialized from the IC-ResNet-50 trained by WLD on ImageNet. The detection frameworks we use are Faster R-CNN [Ren et al., 2015] and RetinaNet [Lin et al., 2017]. We use the same configuration for both IC-ResNet-50 and ResNet-50, as described in [Chen et al., 2019], and evaluate the detection mean Average Precision (mAP), the standard metric for object detection. As shown in Table 5, IC-ResNet-50 outperforms ResNet-50 on both the Faster R-CNN and RetinaNet frameworks. The ResNet-50 results we report are the same as in previous work. These experiments demonstrate that IC networks can be easily integrated into object detection and achieve better performance with negligible additional cost. We believe that IC networks can show their superiority across a broad range of vision tasks and datasets.

5 Conclusion

In this paper, we propose the IC structure, which brings non-linearity and feature recalibration to the convolution operation. By dividing the input space, the IC structure has a stronger representation ability than the traditional convolution structure. We build IC networks by integrating the IC structure into state-of-the-art models. Besides, we propose the WLD method to facilitate the training of IC networks; training with WLD bypasses the need for complex hyperparameter design and reaches convergence quickly. A wide range of experiments shows the effectiveness of IC networks across multiple datasets and tasks. Finally, we expect to integrate the IC structure into more architectures and further improve the performance of computer vision tasks.
References

[Bell et al., 2016] S. Bell, C. L. Zitnick, K. Bala, and R. B. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, pages 2874–2883, 2016.

[Chen et al., 2019] K. Chen, J. Q. Wang, J. M. Pang, Y. H. Cao, Y. Xiong, X. X. Li, S. Y. Sun, W. S. Feng, Z. W. Liu, J. R. Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155, 2019.

[Chollet, 2017] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.

[Everingham et al., 2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[He et al., 2016] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[Heo et al., 2019] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi. A comprehensive overhaul of feature distillation. In ICCV, pages 1921–1930, 2019.

[Hinton et al., 2015] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, 2015.

[Howard et al., 2017] A. G. Howard, M. L. Zhu, B. Chen, D. Kalenichenko, W. J. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.

[Hu et al., 2018] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.

[Kirkpatrick et al., 2017] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[Krizhevsky et al., 2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.

[LeCun et al., 1989] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[Lin et al., 2017] T. Lin, P. Goyal, R. B. Girshick, K. M. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.

[McCulloch and Pitts, 1990] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 52(1-2):99–115, 1990.

[Nair and Hinton, 2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.

[Paszke et al., 2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.

[Ren et al., 2015] S. Q. Ren, K. M. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.

[Russakovsky et al., 2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. H. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[Selvaraju et al., 2017] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.

[Sifre and Mallat, 2014] L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.

[Simonyan and Zisserman, 2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[Tan and Le, 2019] M. X. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, volume 97, pages 6105–6114, 2019.

[Wang et al., 2019] C. Wang, J. F. Yang, L. H. Xie, and J. S. Yuan. Kervolutional neural networks. In CVPR, pages 31–40, 2019.

[Xie et al., 2017] S. N. Xie, R. Girshick, P. Dollár, Z. W. Tu, and K. M. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.

[Xu et al., 2020] G. D. Xu, Z. W. Liu, X. X. Li, and C. C. Loy. Knowledge distillation meets self-supervision. In ECCV, volume 12354, pages 588–604, 2020.

[Zoumpourlis et al., 2017] G. Zoumpourlis, A. Doumanoglou, N. Vretos, and P. Daras. Non-linear convolution filters for CNN-based learning. In ICCV, 2017.