RegNet: Self-Regulated Network for Image Classification
Jing Xu, Yu Pan, Xinglin Pan, Steven Hoi, Fellow, IEEE, Zhang Yi, Fellow, IEEE, and Zenglin Xu*
Abstract—The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradients flow through building blocks, the simple shortcut connection mechanism limits the ability of re-exploring new potentially complementary features due to the additive function. To address this issue, in this paper we propose to introduce a regulator module as a memory mechanism to extract complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., convolutional LSTMs or convolutional GRUs), which are shown to be good at extracting spatio-temporal information. We name the new regulated networks RegNet. The regulator module can be easily implemented and appended to any ResNet architecture. We also apply the regulator module to improve the Squeeze-and-Excitation ResNet, to show the generalization ability of our method. Experimental results on three image classification datasets have demonstrated the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.
Index Terms—Residual Networks, Convolutional Recurrent Neural Networks, Convolutional Neural Networks

Jing Xu and Zenglin Xu are with the School of Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 510085, Guangdong, China. Yu Pan and Xinglin Pan are with the SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610031, China. Steven Hoi is with the School of Information Systems (SIS), Singapore Management University, Singapore. Zhang Yi is with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China. Zenglin Xu is the corresponding author (e-mail: [email protected]).
I. INTRODUCTION
Convolutional neural networks (CNNs) have achieved abundant breakthroughs in a number of computer vision tasks [1]. Since AlexNet [2] won the ImageNet competition in 2012, various new architectures have been proposed, including VGGNet [3], GoogLeNet [4], ResNet [5], DenseNet [6], and the recent NASNet [7]. Among these deep architectures, ResNet and its variants [8]–[11] have attracted significant attention with outstanding performance in both low-level and high-level vision tasks. The remarkable success of ResNets is mainly due to the shortcut connection mechanism, which makes the training of a deeper network possible: gradients can directly flow through building blocks, and the gradient vanishing problem can be avoided in some sense. However, the shortcut connection mechanism makes each block focus on learning its respective residual output, where the inner-block information communication is somehow ignored and some reusable information learned from previous blocks tends to be forgotten in later blocks. To illustrate this point, we visualize the output (residual) feature maps learned by consecutive blocks in ResNet in Fig. 1(a). It can be seen that, due to the summation operation among blocks, the adjacent outputs O_t, O_{t+1}, and O_{t+2} look very similar to each other, which indicates that little new information has been learned through consecutive blocks.

A potential solution to the above problems is to capture the spatio-temporal dependency between building blocks while constraining the growth of parameters. To this end, we introduce a new regulator mechanism in parallel to the shortcuts in ResNets for controlling the memory information passed to the next building block. In detail, we adopt convolutional RNNs ("ConvRNNs") [12] as the regulator to encode the spatio-temporal memory. We name the new architecture RNN-Regulated Residual Networks, or "RegNet" for short. As shown in Fig. 1(a), at the i-th building block, a recurrent unit in the convolutional RNN takes the feature from the current building block as its input (denoted by I_i), and then encodes both the input and the serial information to generate the hidden state (denoted by H_i); the hidden state is concatenated with the input for reuse in the next convolution operation (leading to the output feature O_i), and is also transported to the next recurrent unit. To better understand the role of the regulator, we visualize the feature maps in Fig. 1(a). We can see that the H_i generated by the ConvRNN complements the input features I_i. After conducting convolution on the concatenated features of H_i and I_i, the proposed model obtains more meaningful features with richer edge information O_i than ResNet does. To quantitatively evaluate the information contained in the feature maps, we test their classification ability on test data (by adding an average pooling layer and the last fully connected layer to the O_i of the last three blocks). As shown in Fig. 1(b), the new architecture achieves higher prediction accuracy, which indicates the effectiveness of the regulator from ConvRNNs.

Thanks to the parallel structure of the regulator module, the RNN-based regulator is easy to implement and applicable to other ResNet-based structures, such as SE-ResNet [11], Wide ResNet [8], Inception-ResNet [9], ResNeXt [10], the Dual Path Network (DPN) [13], and so on. Without loss of generality, as another instance to demonstrate the effectiveness of the proposed regulator, we also apply the ConvRNN module to improve the Squeeze-and-Excitation ResNet (shortened to "SE-RegNet").

For evaluation, we apply our model to the task of image classification on three highly competitive benchmark datasets,
including CIFAR-10, CIFAR-100, and ImageNet. In comparison with the ResNet and SE-ResNet, our experimental results have demonstrated that the proposed architecture can significantly improve the classification accuracy on all the datasets. We further show that the regulator can reduce the required depth of ResNets while reaching the same level of accuracy.

Fig. 1. (a) Visualization of feature maps in ResNet [5] and RegNet. We visualize the output feature maps O_i of the i-th building blocks, i ∈ {t, t+1, t+2}. In RegNets, I_i denotes the input feature maps and H_i denotes the hidden state generated by the ConvRNN at step i. By applying convolution operations over the concatenation of I_i with H_i, we get the regulated outputs (denoted by O_i) of the i-th building block. (b) Prediction on test data based on the output feature maps of consecutive building blocks. At test time, we add an average pooling layer and the last fully connected layer to the outputs of the last three building blocks in ResNet-20 and RegNet-20 to get the classification results. The output of each block, aided with the memory information, results in higher classification accuracy.
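To make the probe behind Fig. 1(b) concrete, the following is a minimal sketch of attaching average pooling and the final fully connected layer to an intermediate block output and measuring test accuracy. The attribute names `model.stem`, `model.blocks`, and `model.fc` are hypothetical, and the probe assumes the probed blocks share the channel width expected by the final classifier (true for the last stage of ResNet-20).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def probe_block_accuracy(model, block_idx, loader, device="cpu"):
    """Classify from the output O_i of block `block_idx` using the model's own FC layer."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        x = model.stem(images.to(device))            # initial convolution (hypothetical name)
        for block in model.blocks[: block_idx + 1]:
            x = block(x)                             # O_i of building block i
        x = nn.functional.adaptive_avg_pool2d(x, 1).flatten(1)
        logits = model.fc(x)                         # reuse the last fully connected layer
        correct += (logits.argmax(1) == labels.to(device)).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total
```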
II. RELATED WORK

Deep neural networks have achieved empirical breakthroughs in machine learning. However, training networks with sufficient depth is a very tricky problem. The shortcut connection has been proposed to address the difficulty in optimization to some extent [5], [14]. Via the shortcut, information can flow across layers without attenuation. A pioneering work is the Highway Network [14], which implements shortcut connections using a gating mechanism. In addition, the ResNet [5] explicitly requests that building blocks fit a residual mapping, which is assumed to be easier to optimize. Due to the powerful capabilities of ResNets in vision tasks, a number of variants have been proposed, including WRN [8], Inception-ResNet [9], ResNeXt [10], WResNet [15], and so on. ResNet and ResNet-based models have achieved impressive, record-breaking performance in many challenging tasks. In object detection, 50- and 101-layered ResNets are usually used as basic feature extractors in many models: Faster R-CNN [16], RetinaNet [17], Mask R-CNN [18], and so on. The most recent models for image super-resolution, such as SRResNet [19], EDSR, and MDSR [20], are all based on ResNets with small modifications. Meanwhile, in [21], the ResNet is introduced to remove rain streaks and obtains state-of-the-art performance.

Despite the success in many applications, ResNets still suffer from the depth issue [22]. DenseNet, proposed by [6], concatenates the input features with the output features along a densely connected path in order to encourage the network to reuse all of the feature maps of previous layers. Obviously, not all feature maps need to be reused in future layers, and consequently the densely connected network also leads to some redundancy with extra computational costs. Recently, the Dual Path Network [13] and the Mixed Link Network [23] have offered trade-offs between ResNets and DenseNets. In addition, some module-based architectures have been proposed to improve the performance of the original ResNet. SENet [11] proposes a lightweight module to obtain the channel-wise attention of intermediate feature maps. CBAM [24] and BAM [25] design modules to infer attention maps along both channel and spatial dimensions. Despite their success, those modules try to regulate the intermediate feature maps based on attention information learned by the intermediate features themselves, so the full utilization of the historical spatio-temporal information of previous features still remains an open problem.

On the other hand, convolutional RNNs (shortened to ConvRNN), such as ConvLSTM [12] and ConvGRU [26], have been used to capture spatio-temporal information in a number of applications, such as rain removal [27], video super-resolution [28], video compression [29], and video object detection and segmentation [30], [31]. Most of those works embed ConvRNNs into models to capture the dependency information in a sequence of images. In order to regulate the information flow of ResNet, we propose to leverage ConvRNNs as a separate module aimed at extracting spatio-temporal information complementary to the original feature maps of ResNets.
III. OUR MODEL

In this section, we first revisit the background of ResNets and two advanced ConvRNNs, ConvLSTM and ConvGRU. Then we present the proposed RegNet architectures.
A. ResNet
The degradation problem, which makes traditional networks hard to converge, is exposed when the architecture goes deeper. This problem can be mitigated by ResNet [5] to some extent. Building blocks are the basic architecture of ResNet: as shown in Fig. 2(b), each block fits a residual mapping instead of directly fitting the original underlying mapping shown in Fig. 2(a); a minimal sketch of such a block follows the figure. The deep residual network obtained by stacking building blocks has achieved excellent performance in image classification, which proves the competence of the residual mapping.
Fig. 2. (a) The original underlying mapping; (b) the residual mapping in ResNet [5].
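As a reference point for the modules that follow, here is a minimal sketch of the standard ResNet building block of Fig. 2(b): two 3 × 3 convolutions fit a residual mapping F(x), and the shortcut adds the input back.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard (non-bottleneck) ResNet building block: out = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # residual mapping F(x)
        return torch.relu(out + x)           # shortcut connection
```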
B. ConvRNN and its Variants
RNNs and their classical variants, LSTM and GRU, have achieved great success in the field of sequence processing. To tackle spatio-temporal problems, we adopt the basic ConvRNN and its variants ConvLSTM and ConvGRU, which are transformed from the vanilla RNNs by replacing their fully connected operators with convolutional operators. Furthermore, to reduce the computational overhead, we carefully design the convolutional operation in ConvRNNs. In our implementation, the ConvRNN can be formulated as

H_t = tanh( ^{2N}W^N_h ∗ [X_t, H_{t−1}] + b_h ),    (1)

where X_t is the input 3D feature map, H_{t−1} is the hidden state obtained from the earlier output of the ConvRNN, and H_t is the output 3D feature map at this step. Both the input X_t and the output H_t of the ConvRNN have N channels. Additionally, ^{2N}W^N ∗ X denotes a convolution operation between weights W and input X with input channel 2N and output channel N. To make the ConvRNN more efficient, inspired by [30], [32], given an input X with 2N channels, we conduct the convolution operation in two steps:
(1) Divide the input X with 2N channels into N groups, and use grouped convolutions [33] with 1 × 1 kernels to process each group separately, fusing the input channels.
(2) Divide the feature map obtained by (1) into N groups, and use grouped convolutions with 3 × 3 kernels to process each group separately, capturing the spatial information per input channel.
Directly applying the original convolutions with 3 × 3 kernels suffers from high computational complexity. As detailed in Table I, the new modification reduces the required computation by 18N/11 times with comparable results. Similarly, all the convolutions in ConvGRU and ConvLSTM are replaced with this light-weight modification.
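The following is a minimal sketch of Eq. (1) with the two-step grouped convolution described above. The class name is ours, and the channel interleaving is our assumption about how the 2N channels are divided into N groups (each group pairing one input channel with one state channel); the paper does not specify the grouping order.

```python
import torch
import torch.nn as nn

class LightConvRNNCell(nn.Module):
    """Vanilla ConvRNN cell (Eq. (1)) with the light-weight two-step grouped convolution."""
    def __init__(self, channels: int):
        super().__init__()
        n = channels
        # Step (1): 1x1 grouped conv over the 2N-channel concat [X_t, H_{t-1}],
        # N groups, so each group fuses one input channel with one state channel.
        self.fuse = nn.Conv2d(2 * n, n, kernel_size=1, groups=n, bias=False)
        # Step (2): 3x3 grouped conv with N groups (depthwise-style), capturing
        # the spatial information per channel.
        self.spatial = nn.Conv2d(n, n, kernel_size=3, padding=1, groups=n, bias=True)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Interleave x and h channel-wise so group g sees the pair (x_g, h_g).
        z = torch.stack((x, h), dim=2).flatten(1, 2)   # (B, 2N, H, W)
        return torch.tanh(self.spatial(self.fuse(z)))  # Eq. (1)
```

A per-pixel cost count matches the ratio quoted above: a plain 3 × 3 convolution from 2N to N channels costs 18N² multiply-adds, while the two grouped steps cost 2N + 9N = 11N, a reduction of 18N/11 times.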
To deal with the CIFAR-10/100 datasets and the ImageNet dataset, [5] proposed two kinds of ResNet building blocks: the non-bottleneck building block and the bottleneck building block.
TABLE I: Performance of RegNet-20 with ConvGRU as regulators on CIFAR-10. We compare the test error rates between the traditional 3 × 3 kernels and our new modification (columns: kernel type, err., Params, FLOPs).
Fig. 3. The RegNet module is shown in 3(a); the bottleneck RegNet block is shown in 3(b). T denotes the number of building blocks as well as the total number of time steps of the ConvRNN.

Based on those, by applying ConvRNNs as regulators, we get the RNN-Regulated ResNet building module and the bottleneck RNN-Regulated ResNet building module, correspondingly.
1) RNN-Regulated ResNet Module (RegNet module):
The illustration of the RegNet module is shown in Fig. 3(a). Here, we choose ConvLSTM for exposition. H_{t−1} denotes the earlier output of the ConvLSTM, and H_t is the output of the ConvLSTM at the t-th module. X^t_i denotes the i-th feature map at the t-th module. The t-th RegNet (ConvLSTM) module can be expressed as

X^t_1 = ReLU(BN(W^t_{01} ∗ X^t_0 + b^t_{01})),
[H_t, C_t] = ReLU(BN(ConvLSTM(X^t_1, [H_{t−1}, C_{t−1}]))),
X^t_2 = ReLU(BN(W^t_{12} ∗ Concat[X^t_1, H_t])),
X^t_3 = BN(W^t_{23} ∗ X^t_2 + b^t_{23}),
X^{t+1}_0 = ReLU(X^t_0 + X^t_3),    (2)

where W^t_{ij} denotes the convolutional kernel mapping feature map X^t_i to X^t_j, and b^t_{ij} denotes the corresponding bias. Both W^t_{01} and W^t_{23} are 3 × 3 convolutional kernels, and W^t_{12} is a 1 × 1 kernel. BN(·) indicates batch normalization, and Concat[·] refers to the concatenation operation. Notice that in Eq. (2) the input feature X^t_1 and the previous output of the ConvLSTM, H_{t−1}, are the inputs of the ConvLSTM in the t-th module. According to these inputs, the ConvLSTM automatically decides whether the information in the memory cell will be propagated to the output hidden feature map H_t.
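As a minimal sketch of Eq. (2), the module below uses the `LightConvRNNCell` from the earlier sketch as the regulator in place of a full ConvLSTM (so the [H, C] pair collapses to a single hidden state H); all class and attribute names are ours.

```python
import torch
import torch.nn as nn

class RegNetModule(nn.Module):
    """One non-bottleneck RegNet module (Eq. (2)), with a vanilla ConvRNN regulator."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels
        self.conv01 = nn.Conv2d(c, c, 3, padding=1, bias=False)   # W_01, 3x3
        self.bn01 = nn.BatchNorm2d(c)
        self.rnn = LightConvRNNCell(c)                            # regulator (stand-in for ConvLSTM)
        self.bn_h = nn.BatchNorm2d(c)
        self.conv12 = nn.Conv2d(2 * c, c, 1, bias=False)          # W_12, 1x1 fusion after Concat
        self.bn12 = nn.BatchNorm2d(c)
        self.conv23 = nn.Conv2d(c, c, 3, padding=1, bias=False)   # W_23, 3x3
        self.bn23 = nn.BatchNorm2d(c)

    def forward(self, x0, h_prev):
        x1 = torch.relu(self.bn01(self.conv01(x0)))
        h = torch.relu(self.bn_h(self.rnn(x1, h_prev)))           # hidden state H_t
        x2 = torch.relu(self.bn12(self.conv12(torch.cat([x1, h], dim=1))))
        x3 = self.bn23(self.conv23(x2))
        return torch.relu(x0 + x3), h                             # shortcut add, state to next module
```

Stacking T such modules and threading the returned hidden state through them yields one regulated stage, matching the T time steps noted in Fig. 3.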
2) Bottleneck RNN-Regulated ResNet Module (bottleneck RegNet module):
The bottleneck RegNet module, based on the bottleneck ResNet building block, is shown in Fig. 3(b). The bottleneck building block was introduced in [5] for dealing with images of large size. Based on it, the t-th bottleneck RegNet module can be expressed as

X^t_1 = ReLU(BN(W^t_{01} ∗ X^t_0 + b^t_{01})),
[H_t, C_t] = ReLU(BN(ConvLSTM(X^t_1, [H_{t−1}, C_{t−1}]))),
X^t_2 = ReLU(BN(W^t_{12} ∗ X^t_1 + b^t_{12})),
X^t_3 = ReLU(BN(W^t_{23} ∗ Concat[X^t_2, H_t])),
X^t_4 = BN(W^t_{34} ∗ X^t_3 + b^t_{34}),
X^{t+1}_0 = ReLU(X^t_0 + X^t_4),    (3)

where W^t_{01} and W^t_{34} are the two 1 × 1 kernels, W^t_{12} is the 3 × 3 bottleneck kernel, and W^t_{23} is a 1 × 1 kernel for fusing features in our model.
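A minimal sketch of Eq. (3) follows, again with `LightConvRNNCell` standing in for the ConvLSTM. The `expansion` factor of 4 follows the usual bottleneck convention and is our assumption, not stated in the text above.

```python
import torch
import torch.nn as nn

class BottleneckRegNetModule(nn.Module):
    """One bottleneck RegNet module (Eq. (3)) with a vanilla ConvRNN regulator."""
    expansion = 4  # assumed output width = 4 * inner width, as in standard bottlenecks

    def __init__(self, inner: int):
        super().__init__()
        c, out = inner, inner * self.expansion
        self.conv01 = nn.Conv2d(out, c, 1, bias=False)           # W_01, 1x1 reduce
        self.bn01 = nn.BatchNorm2d(c)
        self.rnn = LightConvRNNCell(c)                           # regulator
        self.bn_h = nn.BatchNorm2d(c)
        self.conv12 = nn.Conv2d(c, c, 3, padding=1, bias=False)  # W_12, 3x3 bottleneck
        self.bn12 = nn.BatchNorm2d(c)
        self.conv23 = nn.Conv2d(2 * c, c, 1, bias=False)         # W_23, 1x1 fusion after Concat
        self.bn23 = nn.BatchNorm2d(c)
        self.conv34 = nn.Conv2d(c, out, 1, bias=False)           # W_34, 1x1 expand
        self.bn34 = nn.BatchNorm2d(out)

    def forward(self, x0, h_prev):
        x1 = torch.relu(self.bn01(self.conv01(x0)))
        h = torch.relu(self.bn_h(self.rnn(x1, h_prev)))          # hidden state H_t
        x2 = torch.relu(self.bn12(self.conv12(x1)))
        x3 = torch.relu(self.bn23(self.conv23(torch.cat([x2, h], dim=1))))
        x4 = self.bn34(self.conv34(x3))
        return torch.relu(x0 + x4), h                            # shortcut add, state to next module
```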
TABLE II: Architectures for the CIFAR-10/100 datasets. By setting n ∈ {3, 5, 9} in the (6n + 2)-layered design, we get the {20, 32, 56}-layered RegNets.

name     output size   (6n + 2)-layered RegNet
conv_0   32 × 32       3 × 3, 16
conv_1   32 × 32       ConvRNN + [3 × 3; 3 × 3] × n
conv_2   16 × 16       ConvRNN + [3 × 3; 3 × 3] × n
conv_3   8 × 8         ConvRNN + [3 × 3; 3 × 3] × n
         1 × 1         AP, FC, softmax

TABLE III: Classification error rates (%) on CIFAR-10/100. Best results are marked in bold.

model                  C10    C100
ResNet-20 [5]          8.38   31.72
RegNet-20 (ConvRNN)    7.60   30.03
RegNet-20 (ConvGRU)    7.42   –
RegNet-20 (ConvLSTM)   –      –

IV. EXPERIMENTS
In this section, we evaluate the effectiveness of the proposed ConvRNN regulator on three benchmark datasets: CIFAR-10, CIFAR-100, and ImageNet. We implement the algorithms in PyTorch. The small-scale models for CIFAR are trained on a single NVIDIA 1080 Ti GPU, and the large-scale models for ImageNet are trained on 4 NVIDIA 1080 Ti GPUs.
A. Experiments on CIFAR
The CIFAR datasets [34] consist of RGB images of 32 × 32 pixels. Each dataset contains 50k training images and 10k test images. The images in CIFAR-10 and CIFAR-100 are drawn from 10 and 100 classes, respectively. We train on the training set and evaluate on the test set. By applying ConvRNNs to ResNet and SE-ResNet, we obtain the RegNet and SE-RegNet models, respectively. Here, we use the 20-layered RegNet and SE-RegNet to demonstrate the wide applicability of our method. The SE-RegNet building module in Fig. 3(a) is used to analyze the CIFAR datasets. The structural details of SE-RegNet are shown in Table II. The inputs of the network are 32 × 32 images. In each conv_i layer, i ∈ {1, 2, 3}, there are n RegNet building modules stacked sequentially and connected together by a ConvRNN. In summary, there are 3 ConvRNNs in our architecture, and each ConvRNN acts on the n RegNet building modules of its stage. The reduction ratio r in the SE block is 8. In this experiment, we use SGD with a momentum of 0.9 and a weight decay of 1e-4. We train with a batch size of 64 for 150 epochs. The initial learning rate is 0.1 and is divided by 10 at epoch 80. Data augmentation as in [35] is used in training. The results of SE-ResNet on CIFAR are based on our implementation, since they were not reported in [11].
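A minimal sketch of the CIFAR training schedule just described (SGD, momentum 0.9, weight decay 1e-4, batch size 64, 150 epochs, learning rate 0.1 divided by 10 at epoch 80); `model` and `train_loader` are assumed to exist.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(150):
    for images, labels in train_loader:   # batch size 64
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                      # lr: 0.1 -> 0.01 at epoch 80
```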
1) Results on CIFAR:
The classification errors on the CIFAR-10/100 test sets are shown in Table III. We can see from the results that, with the same number of layers, both RegNet and SE-RegNet outperform the original models by a significant margin. Compared with ResNet-20, our RegNet-20 with ConvLSTM decreases the error rate by 1.51% on CIFAR-10 and 2.04% on CIFAR-100. At the same time, compared with SE-ResNet-20, our SE-RegNet-20 with ConvLSTM decreases the error rate by 1.04% on CIFAR-10 and 2.12% on CIFAR-100. Using ConvGRU as the regulator reaches the same level of accuracy as ConvLSTM. Because the vanilla ConvRNN lacks a gating mechanism, it performs slightly worse, but it still makes great progress compared with the baseline model.
2) Parameter Analysis:
For a fair comparison, we evaluate our models by taking the number of parameters as the reference for comparison. As shown in Table IV, we list the test errors of 20-, 32-, and 56-layered ResNets and their respective RegNet counterparts on CIFAR-10/100. With minimal additional parameters, both our RegNet with ConvGRU and our RegNet with ConvLSTM surpass the ResNet by a large margin. Our 20-layered RegNet with an extra 0.04M parameters even outperforms the 32-layered ResNet on both CIFAR-10/100: our 20-layered RegNet (ConvLSTM), with 0.32M parameters, reaches a 7.28% error rate on CIFAR-10, surpassing the 32-layered ResNet, which has 0.47M parameters and a 7.54% error rate. Fig. 4 demonstrates the parameter-efficiency comparison between RegNet and ResNet. It shows that our RegNets are more parameter-efficient than simply stacking layers in the vanilla ResNet. On both CIFAR-10/100, our RegNets (GRU) achieve performance comparable to ResNet-56 with nearly half the parameters.
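The parameter counts compared above (e.g., 0.32M for RegNet-20 with ConvLSTM vs. 0.47M for ResNet-32) can be reproduced for any `nn.Module` with a one-liner like the following.

```python
def count_params(model):
    """Number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"{count_params(model) / 1e6:.2f}M parameters")
```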
3) Positions of Feature Reuse:
In this subsection, we perform an ablation experiment to further analyze the effect of the position of feature reuse, i.e., which layer benefits most from a ConvRNN. Previous studies [36] show that features in earlier layers are more general, while features in later layers are more specific. As shown in Table II, the conv_1, conv_2, and conv_3 layers are separated by down-sampling operations, which makes the features in conv_1 more low-level and those in conv_3 more specific to classification. The classification results are shown in Table V.
TABLE IV: Test error rates on CIFAR-10/100. We use ConvGRU and ConvLSTM as regulators of ResNet. We list the parameter increase of each architecture at the right corner of its error rate.

                 C-10                          C-100
layer   ResNet  +ConvGRU  +ConvLSTM    ResNet  +ConvGRU  +ConvLSTM
20      8.38    7.42      7.28         31.72   –         –
32      7.54    6.60      –            –       –         –
56      6.78    6.39      –            –       –         –

Fig. 4. Comparison of parameter efficiency on CIFAR-10 between RegNet and ResNet [5]. Both panels plot test error (%) for ResNet, RegNet (ConvGRU), and RegNet (ConvLSTM). In both 4(a) and 4(b), the curves of our RegNet always lie below those of ResNet [5], showing that with the same number of parameters our models have stronger expressive ability.

TABLE V: Test error rates on CIFAR-10/100. We use ConvGRU and ConvLSTM as regulators of ResNet. We list the parameter increase of the architectures. In each of our RegNet^(i) models, there is only one ConvRNN, applied in layer conv_i, i ∈ {1, 2, 3}.

model               C-10 err.  Params    C-100 err.  Params
ResNet [5]          8.38       0.273M    31.72       0.278M
RegNet^(1) (GRU)    7.52       0.279M    30.40       0.285M
RegNet^(2) (GRU)    7.48       0.285M    30.34       0.291M
RegNet^(3) (GRU)    7.49       0.306M    30.30       0.312M
RegNet^(1) (LSTM)   7.56       0.281M    30.23       0.286M
RegNet^(2) (LSTM)   7.49       0.290M    30.28       0.296M
RegNet^(3) (LSTM)   7.52       0.325M    29.92       0.331M

In each model, only one ConvRNN is applied. We name these models RegNet^(i), i ∈ {1, 2, 3}, denoting that a ConvRNN is applied only in layer conv_i while the original ResNet structure is maintained in the other layers. For a fair comparison, we evaluate the models by taking the number of parameters as the reference. We can see from the results that using ConvRNNs in a lower layer (conv_1) is more parameter-efficient than in a higher layer (conv_3): with a smaller parameter increase, the lower layers bring about nearly the same improvement in accuracy as the higher layers. Compared with ResNet, our RegNet^(1) (GRU) decreases the test error from 8.38% to 7.52% (−0.86%) on CIFAR-10 with an additional 0.006M parameters, and from 31.72% to 30.40% (−1.32%) on CIFAR-100 with an additional 0.007M parameters. This significant improvement with minimal additional parameters further proves the effectiveness of the proposed method. The concatenation operation in our model can fuse features together to explore new features [13], which is more important for the general features in lower layers.
TABLE VI: Single-crop validation error rates on ImageNet and complexity comparisons. Both ResNet and RegNet are 50-layer. ResNet* means we reproduced the result ourselves.

model        top-1 err.  top-5 err.  Params   FLOPs
ResNet [5]   24.7        7.8         26.6M    4.14G
ResNet*      –           –           –        –
RegNet       –           –           31.3M    5.12G

TABLE VII: Single-crop error rates on the ImageNet validation set for state-of-the-art models. ResNet-50* means the re-implementation result from our experiments.

model                    top-1   top-5   Params(M)  FLOPs(G)
WRN-18 (widen=2.0) [8]   25.58   8.06    45.6       6.70
DenseNet-169 [6]         23.80   6.85    28.9       7.7
SE-ResNet-50 [11]        23.29   6.62    26.7       4.14
ResNet-50 [5]            24.7    7.8     –          –
ResNet-50*               –       –       –          –

B. Experiments on ImageNet
We evaluate our model on the ImageNet 2012 dataset [3], which consists of 1.28 million training images and 50k validation images from 1000 classes. Following previous papers, we report top-1 and top-5 classification errors on the validation set. Due to the limited resources of our GPUs, and without loss of generality, we run the experiments on ResNets and RegNets only.

The bottleneck RegNet building modules are applied to ImageNet. We use 4 ConvRNNs in RegNet-50: ConvRNN_i, i ∈ {1, 2, 3, 4}, controls the {3, 4, 6, 3} bottleneck RegNet modules of the corresponding stage, respectively. In this experiment, we use SGD with a momentum of 0.9 and a weight decay of 1e-4. We train with a batch size of 128 for 90 epochs. The initial learning rate is 0.06 and is divided by 10 at epochs 50 and 70. The input of the network is 224 × 224 images, randomly cropped from the resized original images or their horizontal flips. Data augmentation as in [27] is used in training. We evaluate our model by applying a center crop of 224 × 224.

We evaluate the efficiency of the baseline ResNet-50 and its respective RegNet counterpart. The comparison is based on computational overhead. As shown in Table VI, with an additional 4.7M parameters, RegNet outperforms the baseline model by 1.38% on top-1 accuracy and 0.85% on top-5 accuracy. Table VII shows the error rates of some state-of-the-art models on the ImageNet validation set. Compared with the baseline ResNet, our RegNet-50 with 31.3M parameters and 5.12G FLOPs not only surpasses ResNet-50 but also outperforms ResNet-101, which has 44.6M parameters and 7.9G FLOPs. Since the proposed regulator module is essentially a beneficial complement to the shortcut mechanism in ResNets, one can easily apply the regulator module to other ResNet-based models, such as SE-ResNet, WRN-18 [8], ResNeXt [10], the Dual Path Network (DPN) [13], etc. Due to computation resource limitations, we leave the implementation of the regulator module in these ResNet extensions as future work.
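A minimal sketch of the ImageNet data pipeline just described: random 224 × 224 crops of resized images with horizontal flips for training, and a 224 × 224 center crop for evaluation. The normalization statistics are the usual ImageNet values, which the text does not specify.

```python
import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]  # standard ImageNet statistics (assumed)
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = T.Compose([
    T.RandomResizedCrop(224),          # random crop from the resized image
    T.RandomHorizontalFlip(),          # horizontal flip augmentation
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
val_tf = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),                 # 224x224 center crop at evaluation
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```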
V. CONCLUSIONS

In this paper, we proposed to employ a regulator module with convolutional RNNs to extract complementary features for improving the representation power of ResNets. Experimental results on three image-classification datasets have demonstrated the promising performance of the proposed architecture in comparison with standard ResNets and Squeeze-and-Excitation ResNets, as well as other state-of-the-art architectures. In the future, we intend to further improve the efficiency of the proposed architecture and to apply the regulator module to other ResNet-based architectures [8]–[10] to increase their capacity. Besides, we will further explore RegNets for other challenging tasks, such as object detection [16], [17], image super-resolution [19], [20], and so on.

ACKNOWLEDGMENT

This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204).
REFERENCES

[1] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, p. 1995, 1995.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[6] G. Huang, Z. Liu, and K. Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016.
[7] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," CoRR, vol. abs/1707.07012, 2017.
[8] S. Zagoruyko and N. Komodakis, "Wide residual networks," CoRR, vol. abs/1605.07146, 2016.
[9] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016.
[10] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," CoRR, vol. abs/1611.05431, 2016.
[11] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017.
[12] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," CoRR, vol. abs/1506.04214, 2015.
[13] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks," CoRR, vol. abs/1707.01629, 2017. [Online]. Available: http://arxiv.org/abs/1707.01629
[14] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," CoRR, vol. abs/1505.00387, 2015.
[15] F. Shen, R. Gan, and G. Zeng, "Weighted residuals for very deep networks," in International Conference on Systems, 2016, pp. 936–941.
[16] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015.
[17] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," CoRR, vol. abs/1708.02002, 2017.
[18] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," CoRR, vol. abs/1703.06870, 2017.
[19] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CoRR, vol. abs/1609.04802, 2016.
[20] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in CVPR Workshops, July 2017.
[21] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley, "Removing rain from single images via a deep detail network," in CVPR. IEEE Computer Society, 2017, pp. 1715–1723.
[22] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," CoRR, vol. abs/1603.09382, 2016.
[23] W. Wang, X. Li, J. Yang, and T. Lu, "Mixed link networks," CoRR, vol. abs/1802.01808, 2018.
[24] S. Woo, J. Park, J. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," CoRR, vol. abs/1807.06521, 2018.
[25] J. Park, S. Woo, J. Lee, and I. S. Kweon, "BAM: Bottleneck attention module," CoRR, vol. abs/1807.06514, 2018.
[26] N. Ballas, L. Yao, C. Pal, and A. C. Courville, "Delving deeper into convolutional networks for learning video representations," CoRR, vol. abs/1511.06432, 2015.
[27] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha, "Recurrent squeeze-and-excitation context aggregation net for single image deraining," CoRR, vol. abs/1807.05698, 2018.
[28] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, and J. Ma, "Multi-memory convolutional neural network for video super-resolution," IEEE TIP, vol. 28, no. 5, pp. 2530–2544, May 2019.
[29] Y. Xu, L. Gao, K. Tian, S. Zhou, and H. Sun, "Non-local ConvLSTM for video compression artifact reduction," CoRR, vol. abs/1910.12286, 2019.
[30] M. Liu, M. Zhu, M. White, Y. Li, and D. Kalenichenko, "Looking fast and slow: Memory-guided mobile video object detection," CoRR, vol. abs/1903.10172, 2019.
[31] M. Siam, S. Valipour, M. Jägersand, and N. Ray, "Convolutional gated recurrent networks for video segmentation," CoRR, vol. abs/1611.05435, 2016.
[32] IEEE Computer Society, 2018.
[33] T. Zhang, G. Qi, B. Xiao, and J. Wang, "Interleaved group convolutions for deep neural networks," CoRR, vol. abs/1707.02725, 2017.
[34] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[35] G. Lebanon and S. V. N. Vishwanathan, Eds., Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9–12, 2015, ser. JMLR Workshop and Conference Proceedings, vol. 38. JMLR.org, 2015. [Online]. Available: http://jmlr.org/proceedings/papers/v38/
[36] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems 27, 2014.