Is Each Layer Non-trivial in CNN? (Student Abstract)
Wei Wang, Yanjie Zhu, Zhuoxu Cui, Dong Liang,
Center for Medical AI, Institute of Biomedical and Health Engineering, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China. [email protected], [email protected], 86-755-86392243
Abstract
Convolutional neural network (CNN) models have achieved great success in many fields. With the advent of ResNet, networks used in practice are getting deeper and wider. However, is each layer non-trivial in these networks? To answer this question, we trained a network on the training set, replaced individual convolution kernels with zeros, and tested the resulting models on the test set. Comparing the experimental results with the baseline, we show that similar or even identical performance can be reached. Although convolution kernels are the core of a network, we demonstrate that some of them are trivial, and that they appear in a regular pattern in ResNet.
Introduction
The structures of neural networks are becoming more and more complex. There are two basic forms: short-connection and no-connection. Short-connection: ResNet (He et al. 2015). No-connection: VGG (Simonyan and Zisserman 2014). In particular, a long-connection can be seen as a special no-connection within a local area. Long-connection: UNet (Ronneberger, Fischer, and Brox 2015), SegNet (Badrinarayanan, Kendall, and Cipolla 2017). We say a layer is trivial if the performance changes only slightly after its convolution kernel is replaced with zeros, and non-trivial otherwise. It is obvious that each layer in the no-connection form is important, but many layers in the short-connection form (ResNet) are trivial (see the sketch below). The main contribution of this paper can be summarized as follows: we demonstrate that, when the model is over-parameterized, the non-trivial layers of ResNet are mainly concentrated in the feature decomposition layers, i.e., the layers that change the number of channels.
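To make the distinction concrete, here is a minimal PyTorch sketch (not from the paper; the block classes and tensor sizes are illustrative assumptions) showing that zeroing the convolution kernel destroys the output of a no-connection block but leaves the identity path of a short-connection block intact.

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    # No-connection form (VGG-style): the output depends entirely on the convolution.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

class ResidualBlock(nn.Module):
    # Short-connection form (ResNet-style): the identity path bypasses the convolution.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(x + self.bn(self.conv(x)))

x = torch.randn(1, 16, 8, 8)
plain, res = PlainBlock(16).eval(), ResidualBlock(16).eval()
with torch.no_grad():
    plain.conv.weight.zero_()                           # zero the kernel: the plain block loses its input
    res.conv.weight.zero_()                             # zero the kernel: the residual block still sees x
    print(plain(x).abs().max().item())                  # 0.0: output no longer depends on x
    print((res(x) - torch.relu(x)).abs().max().item())  # 0.0: the identity path is preserved
```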
Analyze the Convolution Kernels of ResNet Replaced by 0

The ResNet residual unit can be formulated as:
$$x_{l+1} = \sigma(x_l + BN(\sigma(BN(x_l \ast w'_l)) \ast w''_l))$$
$$x_{l+1} = \sigma(BN(x_l \ast w^{\ast}_l) + BN(\sigma(BN(x_l \ast w'_l)) \ast w''_l))$$
Replacing one of the convolution kernels in the residual unit with 0 can be written as:
$$x'_{l+1} = \sigma(x_l + BN(\sigma(BN(x_l \ast 0)) \ast w''_l)) = \sigma(x_l + BN(\sigma(\beta') \ast w''_l))$$
$$x''_{l+1} = \sigma(x_l + BN(\sigma(BN(x_l \ast w'_l)) \ast 0)) = \sigma(x_l + \beta'')$$
$$x'_{l+1} = \sigma(BN(x_l \ast w^{\ast}_l) + BN(\sigma(BN(x_l \ast 0)) \ast w''_l)) = \sigma(BN(x_l \ast w^{\ast}_l) + BN(\sigma(\beta') \ast w''_l))$$
$$x''_{l+1} = \sigma(BN(x_l \ast w^{\ast}_l) + BN(\sigma(BN(x_l \ast w'_l)) \ast 0)) = \sigma(BN(x_l \ast w^{\ast}_l) + \beta'')$$
$$x'''_{l+1} = \sigma(BN(x_l \ast 0) + BN(\sigma(BN(x_l \ast w'_l)) \ast w''_l)) = \sigma(\beta''' + BN(\sigma(BN(x_l \ast w'_l)) \ast w''_l))$$
Notation: $x_l$, $x_{l+1}$: the input and output feature maps of the $l$-th residual unit. $x'_{l+1}$, $x''_{l+1}$, $x'''_{l+1}$: the output feature maps of the $l$-th residual unit after replacement. $w'_l$, $w''_l$: the first and second convolution kernels of the $l$-th residual unit. $w^{\ast}_l$: the convolution kernel on the shortcut of the $l$-th residual unit. $\ast$: convolution operation. $BN$: batch normalization operation. $\beta'$, $\beta''$, $\beta'''$: the biases produced by the corresponding BN layers.
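As an illustration of the case $x''_{l+1} = \sigma(x_l + \beta'')$, the following minimal PyTorch sketch (not part of the paper; the `ResidualUnit` class and tensor shapes are assumptions for illustration) zeroes the second kernel of an untrained basic residual unit and checks that the output reduces to $\sigma(x_l + \beta'')$.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    # Basic ResNet unit: x_{l+1} = relu(x_l + BN(conv2(relu(BN(conv1(x_l))))))
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))   # sigma(BN(x_l * w'_l))
        out = self.bn2(self.conv2(out))             # BN(... * w''_l)
        return torch.relu(x + out)                  # sigma(x_l + ...)

unit = ResidualUnit(16).eval()                      # eval() so BN uses its running statistics
x = torch.randn(1, 16, 8, 8)
with torch.no_grad():
    unit.conv2.weight.zero_()                       # replace w''_l with 0
    y_zeroed = unit(x)
    beta2 = unit.bn2(torch.zeros(1, 16, 1, 1))      # per-channel constant the paper denotes beta''
    print(torch.allclose(y_zeroed, torch.relu(x + beta2)))  # True: x''_{l+1} = sigma(x_l + beta'')
```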
Experiment
We chose ResNet34 and PSPNet-ResNet34 (Zhao et al. 2016) to conduct a classification task on Cifar-10 (Krizhevsky 2012) and an image segmentation task on T1 (Fahmy et al. 2019), respectively. The baselines are 84% (accuracy) and 87% (Dice), respectively. We conducted three groups of experiments. First, we replaced each layer's convolution kernel with 0, one layer at a time (see Figure 1 in the supplementary material). Second, within each layer block, which refers to the consecutive layers sharing the same number of channels, we replaced all convolution kernels with 0 except those of the feature decomposition layers and their adjacent layers (see Figure 2 in the supplementary material). Third, we replaced the feature decomposition layers of the short-connections with 0 (see Figure 3 in the supplementary material). A sketch of the first experiment's procedure is given below.
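The following is a minimal sketch of the first experiment's procedure, not the authors' released code; the trained model, the test loader, and the `evaluate` helper are assumed stand-ins, and only the plain classification setting is shown.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, device="cpu"):
    # Top-1 accuracy on a test loader assumed to yield (image, label) batches.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.numel()
    return correct / total

def ablate_each_conv(trained_model, loader):
    # First experiment: zero one convolution kernel at a time and re-test on the test set.
    results = {}
    for name, module in trained_model.named_modules():
        if isinstance(module, nn.Conv2d):
            saved = module.weight.data.clone()
            module.weight.data.zero_()             # replace this layer's kernel with 0
            results[name] = evaluate(trained_model, loader)
            module.weight.data.copy_(saved)        # restore before testing the next layer
    return results
```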
Results
The classification results on Cifar-10 are shown in Figure 1 and Tables 1 and 2. The segmentation results on T1 are shown in Figure 2 and Tables 3 and 4.

Figure 1: The first experimental results for Cifar-10.

Layer block      ACC
Layer block 1    0.51
Layer block 2    0.61
Layer block 3    0.83
Layer block 4    0.84
Table 1: The second experimental results for Cifar-10.

Layer block      ACC
Layer block 2    0.28
Layer block 3    0.33
Layer block 4    0.16
Table 2: The third experimental results for Cifar-10.

Figure 2: The first experimental results for T1.

Layer block      Dice
Layer block 1    0.82
Layer block 2    0.86
Layer block 3    0.82
Layer block 4    0.00
Table 3: The second experimental results for T1.

Layer block      Dice
Layer block 2    0.00
Layer block 3    0.00
Layer block 4    0.00
Table 4: The third experimental results for T1.

According to the structure of ResNet and the results in Figure 1 and Figure 2, it is obvious that the convolution kernels of the feature decomposition layers are non-trivial, while the rest are trivial. Table 1 and Table 3 also confirm our conjecture. Table 2 and Table 4 further demonstrate that the feature decomposition layers of the short-connections are non-trivial.

Discussion
We argue that ResNet is a continuous process of feature decomposition and information storage. ResNet shows different patterns of non-trivialness in the front and back of the network for different tasks. Generally, the classification task needs to learn enough information about global abstract features. Since enough information has already been learned in the front layers, the back layers are no longer non-trivial. Segmentation requires information for each pixel, so the back layers remain non-trivial.
Conclusion
When there are redundant parameters in ResNet, not all layers of the network are non-trivial; some layers may not be needed once the network parameters have learned enough information. The feature decomposition layers and the identity mappings are important. In particular, the feature decomposition layers are responsible for feature decomposition, the identity mappings are responsible for information storage, and the residual layers are responsible for adjusting the features to fit the final target. According to the above conclusions and experiments, when the model is over-parameterized, we can eliminate unnecessary layers in ResNet and improve training efficiency while preserving performance.
References
Badrinarayanan, V.; Kendall, A.; and Cipolla, R. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Fahmy, et al. 2019. Journal of Cardiovascular Magnetic Resonance.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual Learning for Image Recognition. arXiv e-prints arXiv:1512.03385.
Krizhevsky, A. 2012. Learning Multiple Layers of Features from Tiny Images. University of Toronto.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv e-prints arXiv:1505.04597.
Simonyan, K.; and Zisserman, A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-prints arXiv:1409.1556.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2016. Pyramid Scene Parsing Network. arXiv e-prints.