Funnel Activation for Visual Recognition
Ningning Ma 1, Xiangyu Zhang 2*, and Jian Sun 2
1 Hong Kong University of Science and Technology
2 MEGVII Technology
[email protected], {zhangxiangyu,sunjian}@megvii.com
* Corresponding author

Abstract.
We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form y = max(x, T(x)), where T(·) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks. Code is available at https://github.com/megvii-model/FunnelAct.

Keywords: funnel activation, visual recognition, CNN
1 Introduction

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in many visual recognition tasks, such as image classification, object detection, and semantic segmentation. As popularized in the CNN framework, one major kind of layer is the convolution layer; another is the non-linear activation layer.

First, in the convolution layers, capturing the spatial dependency adaptively is challenging, and many advances in more complex and effective convolutions have been proposed to grasp the local context adaptively in images [7,18]. These advances achieve great success especially on dense prediction tasks (e.g., semantic segmentation, object detection). Driven by the advances in more complex convolutions and their less efficient implementations, a question arises:
Could regular convolutions achieve similar accuracy, to grasp the challenging complex images?
Second, usually right after capturing spatial dependency linearly in a convolution layer, an activation layer acts as a scalar non-linear transformation. Many insightful activations have been proposed [31,14,5,25], but improving the performance on visual tasks is challenging; as a result, the most widely used activation is still the Rectified Linear Unit (ReLU) [32].

Fig. 1. Effectiveness and generalization performance (comparing FReLU, Swish, and PReLU). We set the ReLU network as the baseline and show the relative improvement of accuracy on the three basic tasks in computer vision: image classification (Top-1 accuracy), object detection (mAP), and semantic segmentation (mean IU). We use ResNet-50 [15] as the backbone, pre-trained on the ImageNet dataset, to evaluate the generalization performance on the COCO and CityScape datasets. FReLU is more effective and transfers better on all three tasks.

Driven by the distinct roles of the convolution layers and activation layers, another question arises:
Could we design an activation specifically for visual tasks?
To answer both questions raised above, we show that a simple but effective visual activation, together with regular convolutions, can also achieve significant improvements on both dense and sparse predictions (e.g., image classification, see Fig. 1). To achieve these results, we identify spatial insensitivity in activations as the main obstacle impeding visual tasks from achieving significant improvements, and we propose a new visual activation that eliminates this barrier. In this work, we present a simple but effective visual activation that extends ReLU and PReLU to a 2D visual activation.

Spatial insensitivity is rarely addressed in modern activations for visual tasks. As popularized in the ReLU activation, non-linearity is performed using a max(·) function whose condition is the hand-designed zero, thus in the scalar form y = max(x, 0). Our Funnel activation (FReLU) extends the spirit of ReLU/PReLU by adding a spatial condition (see Fig. 2), which is simple to implement and only adds a negligible computational overhead. Formally, the form of our proposed method is y = max(x, T(x)), where T(x) represents a simple and efficient spatial contextual feature extractor. By using the spatial condition in activations, it simply extends ReLU and PReLU to a visual parametric ReLU with a pixel-wise modeling capacity.

Our proposed visual activation acts as an efficient but much more effective alternative to previous activation approaches. To demonstrate its effectiveness, we replace the normal ReLU in classification networks, and we use the pre-trained backbone to show its generality on the other two basic vision tasks: object detection and semantic segmentation. The results show that FReLU not only improves performance on a single task but also transfers well to other visual tasks.
2 Related Work

Scalar activations
Scalar activations are activations with a single input and a single output, in the form y = f(x). The Rectified Linear Unit (ReLU) [13,23,32] is the most widely used scalar activation on various tasks [26,38], in the form y = max(x, 0). The Sigmoid non-linearity has the form σ(x) = 1/(1 + e^{−x}), and the Tanh non-linearity has the form tanh(x) = 2σ(2x) − 1. These activations are not widely used in deep CNNs, mainly because they saturate and kill gradients, and also involve expensive operations (exponentials, etc.). Many advances followed [25,39,1,16,35,10,46], and a recent searching technique contributed a new searched scalar activation called Swish [36], found by combining a comprehensive set of unary and binary functions. Its form is y = x · Sigmoid(x); it outperforms other scalar activations on some structures and datasets, and many searched results show great potential.
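To make the scalar forms above concrete, the short sketch below (our own illustration, not from the paper) evaluates ReLU, Sigmoid, and Swish on a few sample values and numerically checks the stated identity tanh(x) = 2σ(2x) − 1.

```python
import math

def sigmoid(x):  # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):     # y = max(x, 0)
    return max(x, 0.0)

def swish(x):    # y = x * sigmoid(x), the searched activation of [36]
    return x * sigmoid(x)

for x in (-2.0, -0.5, 0.0, 1.5):
    # tanh(x) = 2 * sigmoid(2x) - 1, the identity stated above
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  swish={swish(x):+.3f}")
```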
Contextual conditional activations

Besides the scalar activations, which depend only on the neuron itself, conditional activations are many-to-one functions that activate the neurons conditioned on contextual information. A representative method is Maxout [12], which extends the layer to multiple branches and selects the maximum. Most activations apply a non-linearity on the linear dot product between the weights and the data, i.e., f(wᵀx + b); Maxout instead computes max(w₁ᵀx + b₁, w₂ᵀx + b₂) and generalizes ReLU and Leaky ReLU into the same framework. With dropout [17], the Maxout network shows improvement. However, it increases the complexity considerably, multiplying the numbers of parameters and multiply-adds by the number of branches.

Contextual gating methods [8,44] use contextual information to enhance the efficacy, especially in RNN-based methods, where the feature dimension is relatively small. There are also CNN-based methods [34]; since 2D features have a large dimension, the method is used after a feature reduction. The contextually conditioned activations are usually channel-wise methods. However, in this paper, we find that spatial dependency is also important in the non-linear activation functions. We use the light-weight depth-wise separable convolution to help reduce the additional complexity.
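As an illustration of the many-to-one conditioning described above, the following sketch (ours, with hypothetical dimensions) implements a two-branch Maxout layer, max(w₁ᵀx + b₁, w₂ᵀx + b₂), in PyTorch.

```python
import torch
import torch.nn as nn

class Maxout2(nn.Module):
    """Two-branch Maxout [12]: computes max(w1^T x + b1, w2^T x + b2).

    Doubles the parameters and multiply-adds of a single linear layer,
    which is the complexity cost discussed above.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.branch1 = nn.Linear(in_features, out_features)
        self.branch2 = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.max(self.branch1(x), self.branch2(x))

x = torch.randn(4, 16)          # batch of 4, hypothetical 16-d features
print(Maxout2(16, 8)(x).shape)  # torch.Size([4, 8])
```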
Spatial dependency modeling
Learning better spatial dependency is challenging. Some approaches use different shapes of convolution kernels [41,42,40] to aggregate different ranges of spatial dependencies; however, this requires a multi-branch structure that decreases efficiency. Advances in convolution kernels such as atrous convolution [18] and dilated convolution [47] also lead to better performance by increasing the receptive field. Another type of method learns the spatial dependency adaptively, such as STN [22], active convolution [24], and deformable convolution [7]. These methods adaptively use spatial transformations to refine short-range dependencies, especially for dense vision tasks (e.g., object detection, semantic segmentation). Our simple FReLU even outperforms them without complex convolutions. Moreover, non-local networks provide methods to capture long-range dependencies, and GCNet [3] provides a spatial attention mechanism to better use the spatial global context. Long-range modeling methods achieve better performance but still require additional blocks in the original network structure, which decreases efficiency. Our method addresses this issue in the non-linear activations, solving it better and more efficiently.
Receptive field
The region and size of the receptive field are essential in visual recognition tasks [50,33]. Work on the effective receptive field [29,11] finds that different pixels contribute unequally and that the center pixels have a larger impact. Therefore, many methods have been proposed to implement an adaptive receptive field [7,51,49]. These methods achieve the adaptive receptive field and improve performance by involving additional branches in the architecture, such as more complex convolutions or attention mechanisms. Our method achieves the same goal in a simpler and more efficient manner by introducing the receptive field into the non-linear activations. With the more adaptive receptive field, we can approximate the layouts of common complex shapes, thus achieving even better results than the complex convolutions while using efficient regular convolutions.
3 Funnel Activation

FReLU is designed specifically for visual tasks and is conceptually simple: the condition is a hand-designed zero for ReLU and a parametric px for PReLU; we modify it to a 2D funnel-like condition dependent on the spatial context. The visual condition helps extract the fine spatial layout of an object. Next, we introduce the key elements of FReLU, including the funnel condition and the pixel-wise modeling capacity, which are the main missing parts in ReLU and its variants.

ReLU
We begin by briefly reviewing the ReLU activation. ReLU, in the form max(x, 0), adopts max(·) to serve as the non-linearity and uses a hand-designed zero as the condition. The non-linear transformation acts as a supplement to linear transformations such as convolution and fully-connected layers.
Fig. 2. Funnel activation. We propose a novel activation for visual recognition, called FReLU, that follows the spirit of ReLU/PReLU and extends them to 2D by adding a visual funnel condition T(x). (a) ReLU with a condition zero: MAX(x, 0); (b) PReLU with a parametric condition: MAX(x, px); (c) FReLU with a visual parametric condition: MAX(x, T(x)).

PReLU
As an advanced variant of ReLU, PReLU has the original form max(x, 0) + p · min(x, 0), where p is a learnable parameter initialized as 0.25. However, in most cases p < 1; under this assumption, we rewrite it to the form max(x, px) (p < 1). Since p is a channel-wise parameter, it can be interpreted as a 1×1 depth-wise convolution, regardless of the bias terms.
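The rewrite above can be checked numerically; a minimal sketch (ours) verifies that for p < 1 the original form max(x, 0) + p · min(x, 0) equals max(x, px):

```python
import torch

p = 0.25                      # PReLU slope, initialized as 0.25 and < 1
x = torch.randn(1000)

original = torch.clamp(x, min=0) + p * torch.clamp(x, max=0)  # max(x,0) + p*min(x,0)
rewritten = torch.max(x, p * x)                               # max(x, px)
assert torch.allclose(original, rewritten)                    # identical when p < 1
```

Funnel condition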
FReLU adopts the same max(·) as the simple non-linear function. For the condition part, FReLU extends it to be a 2D condition dependent on the spatial context of each pixel (see Fig. 2). This is in contrast to most recent methods, whose condition depends on the pixel itself (e.g., [31,14]) or on the channel context (e.g., [12]). Our approach follows the spirit of ReLU, using max(·) to obtain the maximum between x and a condition.

Formally, we define the funnel condition as T(x). To implement the spatial condition, we use a Parametric Pooling Window to create the spatial dependency; specifically, we define the activation function as:

f(x_{c,i,j}) = max(x_{c,i,j}, T(x_{c,i,j}))    (1)

T(x_{c,i,j}) = x^ω_{c,i,j} · p^ω_c    (2)

Here, x_{c,i,j} is the input pixel of the non-linear activation f(·) on the c-th channel at the 2D spatial position (i, j); the function T(·) denotes the funnel condition; x^ω_{c,i,j} denotes a k_h × k_w Parametric Pooling Window centered on x_{c,i,j}; p^ω_c denotes the coefficients on this window, which are shared within the same channel; and (·) denotes dot multiplication.
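A minimal PyTorch sketch of Eqs. (1)-(2) follows, assuming, as the implementation details later in the paper state, that the Parametric Pooling Window T(·) is realized as a k×k depth-wise convolution followed by batch normalization; names and defaults are ours, not the official released code.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel activation: y = max(x, T(x)), Eqs. (1)-(2).

    T(x) is the Parametric Pooling Window: a k x k depth-wise convolution
    (one window of coefficients per channel, shared within the channel),
    followed by BN as in the implementation details.
    """
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2,  # keep the spatial size unchanged
            groups=channels,           # depth-wise: per-channel window p^w_c
            bias=False)
        self.bn = nn.BatchNorm2d(channels)
        # Gaussian initialization so T(x) starts near zero (see
        # "Parameter initialization" below); std is our assumption.
        nn.init.normal_(self.conv.weight, std=0.01)

    def forward(self, x):
        return torch.max(x, self.bn(self.conv(x)))  # per-pixel funnel condition

x = torch.randn(2, 64, 56, 56)  # hypothetical feature map
print(FReLU(64)(x).shape)       # torch.Size([2, 64, 56, 56])
```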
Fig. 3.
Graphic depiction of how the per-pixel funnel condition can achieve a pixel-wise modeling capacity. The distinct sizes of squares represent the distinct activate fields of each pixel in the top activation layers. (a) A normal activate field has equal sizes of squares per pixel and can only describe horizontal and vertical layouts. In contrast, the max(·) allows each pixel to choose between looking around or not in each layer; after a sufficient number of layers, the pixels have many different sizes of squares. Therefore, the different sizes of squares can approximate (b) the shape of an oblique line and (c) the shape of an arc, which are more common natural object layouts.

Pixel-wise modeling capacity
Our definition of the funnel condition allows the network to generate spatial conditions in the non-linear activations for every pixel. The network conducts non-linear transformations and creates spatial dependencies simultaneously. This differs from common practice, which creates spatial dependency in the convolution layer and conducts non-linear transformations separately; in that case, the activations do not depend on spatial conditions explicitly, while in our case, with the funnel condition, they do.

As a result, the pixel-wise condition gives the network a pixel-wise modeling capacity: the function max(·) gives each pixel a choice between looking at the spatial context or not. Formally, consider a network {F₁, F₂, ..., F_n} with n FReLU layers, where each FReLU layer F_i has a k × k parametric window. For brevity, we analyze only the FReLU layers, disregarding the convolution layers. Because of the max selection between the 1 × 1 and k × k fields, each pixel after F₁ has an activate field set {1, 1+r} (r = k − 1); after the F_n layer, the set becomes {1, 1+r, 1+2r, ..., 1+nr}, which gives more choices to each pixel and can approximate any layout if n is sufficiently large. With many distinct sizes of the activate field, the distinct sizes of squares can approximate the shapes of the oblique line and the arc (see Fig. 3). As we know, the layouts of objects in images are usually not horizontal or vertical but in the shape of oblique lines or arcs, so extracting the spatial structure of objects is addressed naturally by the pixel-wise modeling capacity provided by the spatial condition. We show by experiments that it captures irregular and detailed object layouts better in complex tasks (see Fig. 4).
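The growth of the activate-field set can be checked by brute force; a small sketch (ours) enumerates every per-layer choice between the 1×1 field and the k×k window for n layers and confirms that the reachable sizes are exactly {1, 1+r, ..., 1+nr} with r = k − 1.

```python
from itertools import product

def reachable_fields(n, k):
    """Activate-field sizes reachable after n FReLU layers with k x k windows.

    At each layer the max(.) lets a pixel keep its field (the 1x1 choice)
    or grow it by r = k - 1 (the k x k window choice).
    """
    r = k - 1
    return sorted({1 + sum(choices) for choices in product((0, r), repeat=n)})

# Three 3x3 FReLU layers (k = 3, r = 2): expect {1, 3, 5, 7} = {1, 1+r, 1+2r, 1+3r}
print(reachable_fields(3, 3))  # [1, 3, 5, 7]
```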
Our proposed change is simple: we avoid the hand-designed condition in activations and replace it with a simple and effective spatial 2D condition. The visual activation leads to significant improvements, as shown in Fig. 1. We first change the ReLU activations in the classification task on the ImageNet dataset. We use ResNet [15] as the classification network and use the pre-trained networks as backbones for the other tasks: object detection and semantic segmentation.

All the regions x^ω_{c,i,j} in the same channel share the same coefficients p^ω_c; therefore, FReLU adds only a slight additional number of parameters. The region represented by x^ω_{c,i,j} is a sliding window whose size is set to 3 × 3 by default, thus:

x^ω_{c,i,j} · p^ω_c = Σ_{i−1 ≤ h ≤ i+1, j−1 ≤ w ≤ j+1} x_{c,h,w} · p_{c,h,w}    (3)

Parameter initialization
We use Gaussian initialization for the window parameters. The condition values therefore start close to zero, which does not change the original network's properties too much. We also investigated parameter-free cases (e.g., max pooling, average pooling), which do not show improvement; this shows the importance of the additional parameters.
Parameter computation
Assume a K'_h × K'_w convolution with an input feature of size C × H × W and an output of size C × H' × W'. Its number of parameters is C·C·K'_h·K'_w, and its FLOPs (floating point operations) are C·C·K'_h·K'_w·H·W. To this we add our funnel condition with window K_h × K_w: the additional number of parameters is C·K_h·K_w, and the additional FLOPs are C·K_h·K_w·H·W. For simplification we assume K = K_h = K_w and K' = K'_h = K'_w. Therefore, the original parameter complexity is O(C²K'²); after adopting FReLU, it becomes O(C²K'² + CK²). The original FLOPs complexity is O(C²K'²HW); after adopting the visual activation, it becomes O(C²K'²HW + CK²HW). Usually, C is much larger than K and K', so the additional complexity is negligible, as we also observe in practice (more details in Table 1). Moreover, since the funnel condition is a k_h × k_w sliding window, we implement it using the highly optimized depth-wise separable convolution operator, followed by a BN [21] layer.
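Plugging in typical numbers makes the negligible overhead concrete; a short calculation (ours, with an illustrative layer size) compares the extra parameters CK² of the funnel condition against the C²K'² of the convolution it accompanies.

```python
C, K_conv, K_frelu, H, W = 256, 3, 3, 56, 56   # illustrative layer dimensions

conv_params  = C * C * K_conv**2               # C*C*K'_h*K'_w
extra_params = C * K_frelu**2                  # C*K_h*K_w for the funnel window
conv_flops   = conv_params * H * W
extra_flops  = extra_params * H * W

print(f"extra params: {extra_params} / {conv_params} "
      f"= {extra_params / conv_params:.2%}")   # ~0.39%: K^2 / (C*K'^2) = 1/256
print(f"extra FLOPs:  {extra_flops / conv_flops:.2%}")
```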
4 Experiments

4.1 Image Classification

To evaluate the effectiveness of our visual activation, we first conduct experiments on the ImageNet 2012 classification dataset [9,37], which comprises 1.28 million training images and 50K validation images. Our visual activation is easy to adopt in network structures by simply replacing the ReLUs in the original CNN structure. First, we evaluate the activation on different sizes of ResNet [15], using the original implementation of the network structure. Spatial dependency is especially important in the shallow layers; for the small 224 × 224 input size, we replace the ReLUs in all the stages except the last stage, which has a small 7 × 7 feature size.
Table 1. Comparisons with other effective activations [14,36] on ResNets [15] on ImageNet 2012. Image size 224×224. Single crop. We evaluate the Top-1 error rate on the test set.

Model      | Activation | #Params | FLOPs | Top-1 Err.
ResNet-101 | ReLU       | 44.4M   | 7.6G  | 22.8
ResNet-101 | PReLU      | 44.4M   | 7.6G  | 22.7
ResNet-101 | Swish      | 44.4M   | 7.6G  | 22.7
ResNet-101 | FReLU      | 44.5M   | 7.6G  | 22.1

For the training settings, we use a batch size of 256, 600k iterations, a learning rate of 0.1 with a linear decay schedule, a weight decay of 1e-4, and a dropout [17] rate of 0.1. We report the Top-1 error rate on the validation set. For a fair comparison, we run all the results on the same code base.
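For reference, the training recipe above maps onto a standard PyTorch optimizer and schedule; a hedged sketch (the model is a stand-in, the dropout placement is omitted, and the momentum value is our assumption since the paper states momentum only for detection):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.Flatten())  # stand-in for ResNet

total_iters = 600_000
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Linear decay of the 0.1 learning rate to zero over 600k iterations;
# call scheduler.step() once per training iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: 1.0 - it / total_iters)
```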
Comparisons with scalar activations
We conduct a comprehensive comparison on ResNets [15] of different depths (ResNet-50 and ResNet-101). We take ReLU as the baseline and include its variant PReLU for comparison. Further, we compare our visual activation with Swish [36], an activation found by NAS [52,53] techniques that has shown a positive influence on various model structures compared with many scalar activations. Table 1 shows the comparison: our visual activation outperforms all of them with a negligible additional complexity, improving the top-1 accuracy by 1.6% on ResNet-50 and 0.7% on ResNet-101. It is remarkable that, as the model size and depth increase, other scalar activations show limited improvement while the visual activation still improves significantly: Swish and PReLU improve accuracy by only 0.1% on ResNet-101, whereas the visual activation improves it by 0.7%.
Comparison on light-weight CNNs
Besides deep CNNs, we compare the visual activation with other effective activations on recent light-weight CNNs such as MobileNets [19] and ShuffleNets [30], using the same training settings as in [30]. Although the model sizes are extremely small, FReLU improves top-1 accuracy by 2.5% while adding only a slight amount of additional FLOPs.

Table 2.
Comparisons among other effective activations [14,36] on light-weight CNNs (MobileNet [19], ShuffleNetV2 [30]) on ImageNet 2012. Image size 224×224. Single crop. We evaluate the Top-1 error rate on the test set.

Model        | Activation | #Params | FLOPs | Top-1 Err.
ShuffleNetV2 | ReLU       | 1.4M    | 41M   | 39.6
ShuffleNetV2 | PReLU      | 1.4M    | 41M   | 39.1
ShuffleNetV2 | Swish      | 1.4M    | 41M   | 38.7
ShuffleNetV2 | FReLU      | 1.4M    | 45M   | 37.1
4.2 Object Detection

To evaluate the generalization performance of the visual activation on different tasks, we conduct object detection experiments on the COCO dataset [28], which has 80 object categories. We use the trainval35k set for training and the minival set for testing.

Table 3.
Comparisons of different activations on COCO object detection. We use ResNet-50 [15] and ShuffleNetV2 (1.5×) [30] with different activations as the pre-trained backbones, and the RetinaNet [27] detector.

Model        | Activation | #Params | FLOPs | AP   | AP50 | AP75 | APs  | APm  | APl
ResNet-50    | ReLU       | 25.5M   | 3.86G | 35.2 | 53.7 | 37.5 | 18.8 | 39.7 | 48.8
ResNet-50    | Swish      | 25.5M   | 3.86G | 35.8 | 54.1 | 38.7 | 18.6 | 40.0 | 49.4
ResNet-50    | FReLU      | 25.5M   | 3.87G | 36.6 | –    | –    | –    | –    | –
ShuffleNetV2 | ReLU       | 3.5M    | 299M  | 31.7 | 49.4 | 33.7 | 15.3 | 35.1 | 45.2
ShuffleNetV2 | Swish      | 3.5M    | 299M  | 32.0 | 49.9 | 34.0 | 16.2 | 35.2 | 45.2
ShuffleNetV2 | FReLU      | 3.7M    | 318M  | 32.8 | –    | –    | –    | –    | –
We present results with the RetinaNet [27] detector. For a fair comparison, we train all the models in the same code base with the same settings: a batch size of 2, a weight decay of 1e-4, and a momentum of 0.9; we use anchors at 3 scales and 3 aspect ratios, and a 600-pixel train and test image scale. For the backbone, we use the pre-trained models from Section 4.1 as feature extractors and compare the generality of the different activations.

Table 3 shows the comparison among the different activations: our visual activation improves mAP by 1.4% over the ReLU backbone and by 0.8% over the Swish backbone. It is worth mentioning that FReLU outperforms all the other counterparts significantly on small, medium, and large objects alike.
Table 4.
Comparisons on the semantic segmentation task on the CityScape dataset. We use PSPNet [48] as the framework and ResNet-50 [15] as the backbone. The pre-trained backbones are from Table 1.

Category      | ReLU | Swish [36] | FReLU
mean IU       | 77.2 | 77.5       | 78.9
road          | 98.0 | 98.1       | 98.1
sidewalk      | 84.2 | 85.0       | 84.7
building      | 92.3 | 92.5       | 92.7
wall          | 55.0 | 56.3       | 59.5
fence         | 59.0 | 59.6       | 60.9
pole          | 63.3 | 63.6       | 64.3
traffic light | 71.4 | 72.1       | 72.2
traffic sign  | 79.0 | 80.0       | 79.9
vegetation    | 92.4 | 92.7       | 92.8
terrain       | 65.0 | 64.0       | 64.5
sky           | 94.7 | 94.9       | 94.8
person        | 82.1 | 83.1       | 83.2
rider         | 62.3 | 65.5       | 64.7
car           | 95.1 | 94.8       | 95.3
truck         | 77.7 | 70.1       | 79.8
bus           | 84.9 | 84.0       | 87.8
train         | 63.3 | 68.8       | 74.6
motorcycle    | 68.3 | 69.4       | 69.8
bicycle       | 78.2 | 78.4       | 78.7
We also show the comparison on light-weight CNNs. As in the ResNet-50 comparison, we use pre-trained ShuffleNetV2 backbones equipped with the different activations, comparing FReLU mainly with ReLU and the effective activation Swish [36]. Table 3 shows that the visual activation also clearly outperforms the ReLU and Swish backbones, improving mAP by 1.1% and 0.8%, respectively; moreover, it increases the performance on all object sizes.
4.3 Semantic Segmentation

We further present semantic segmentation results on the CityScape dataset [6], a semantic urban scene understanding dataset with 19 categories. It has 5,000 finely annotated images: 2,975 for training, 500 for validation, and 1,525 for testing.

We use PSPNet [48] as the segmentation framework. For the training settings, we use the poly learning rate policy [4] with a base learning rate of 0.01 and a power of 0.9, a weight decay of 1e-4, and 8 GPUs with a batch size of 2 on each GPU. To evaluate the generality of the pre-trained models from Section 4.1, we use the pre-trained ResNet-50 [15] backbone models with different activations, comparing FReLU with Swish and ReLU.

In Table 4, we show the comparison with the scalar activations. We observe that our visual activation outperforms ReLU and the searched Swish by 1.7% and 1.4% mean IU, respectively. Moreover, our visual activation yields significant improvements on both large and small objects, especially on categories such as 'train', 'bus', and 'wall'.

For a better view of the improved performance, Fig. 4 shows prediction results on the testing dataset. Only changing the backbone activations yields an obvious improvement: the boundaries of both large and small objects are well segmented, because the pixel-wise modeling capacity can handle both global and detailed regions (see Fig. 3). We note that modern recognition frameworks are finely designed with the ReLU activation, so the visual activation still has great potential for further improving the results, which is beyond the focus of this work.

Fig. 4. Visualization of semantic segmentation with ResNet-50 [15]-PSPNet [48] using different activations in the backbone (columns: Image, GroundTruth, ReLU, Swish, FReLU). We clip the CityScape images to make the differences clearer (best viewed with enlarged images). FReLU has better long-range (large or slender objects) and short-range (small objects) understanding due to its better context-capturing capacity, and it captures irregular and detailed object layouts much better in complex cases. We note that modern frameworks are finely optimized with ReLU; nevertheless, only changing the backbone gives obvious improvements, suggesting potential for further gains if the frameworks are redesigned for the visual activation.
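The poly learning-rate policy used for the segmentation training above follows lr = base_lr · (1 − iter/max_iter)^power; a minimal sketch (ours, with an illustrative iteration budget):

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    """Poly learning rate policy [4]: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# max_iter of 90k is illustrative only, not a value from the paper.
for it in (0, 45_000, 89_999):
    print(it, round(poly_lr(it, 90_000), 6))
```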
4.4 Ablation Study

The previous sections demonstrated the performance of FReLU compared with other effective activations. To investigate our visual activation further, we conduct ablation studies: we first discuss the properties of the visual activation, and then its compatibility with existing methods.
Our funnel activation mainly has two components: 1) the funnel condition, and 2) the max(·) non-linearity. We investigate the effect of each component separately.

Table 5.
Ablation on the different spatial condition manners and the different non-linearity manners. The experiments are conducted on ResNet-50 [15]. Models A, B, and C compare different visual conditions with/without parameters. Model D replaces max with sum; to this we add a ReLU, or it will not converge. Model E separates and evaluates the performance of the spatial condition itself. DW(x) represents the 3×3 depth-wise separable convolution.

Model | Activation            | Top-1 Err.
A     | Max(x, ParamPool(x))  | 22.4
B     | Max(x, MaxPool(x))    | 24.4
C     | Max(x, AvgPool(x))    | 24.5
D     | Sum(x, ParamPool(x))  | 23.6
E     | Max(DW(x), 0)         | 23.7
Table 6. Ablation on different normalization methods after the spatial condition layer, which is implemented by a depth-wise convolution. We adopt Batch Normalization (BN) [21], Layer Normalization (LN) [2], Instance Normalization (IN) [43], and Group Normalization (GN) [45] after the spatial condition layer. ImageNet results on ShuffleNetV2 0.5×.

Normalization | Top-1 Err.
none          | 37.6
BN            | 37.1
LN            | 36.5
IN            | 38.0
GN            | 36.5

Ablation on the spatial condition
First, we compare different manners of the spatial condition. Besides the parametric pooling manner that we use, and to investigate the importance of the additional parameters, we compare two pooling manners without additional parameters: max pooling and average pooling. We simply replace the parametric pooling with these two non-parametric manners and evaluate the results on the ImageNet dataset. Table 5 (A, B, C) shows the importance of the parametric pooling: without the additional parameters, top-1 accuracy decreases by more than 2%, performing even worse than the baseline that does not use a spatial condition. Table 6 shows the comparison of different normalizations after the spatial condition.
Ablation on the non-linearity
Second, we compare different uses of the non-linearity. In our method, the max(·) function performs the non-linearity while simultaneously capturing the visual dependency. In contrast, we compare with manners that capture the visual dependency and the non-linearity separately. For spatial context capturing, we use two manners: 1) the parametric pooling as before, linearly summed with the original feature, and 2) simply adding a depth-wise separable convolution layer; for the non-linear transformation, we use the ReLU function. Table 5 (A, D, E) shows the results: compared with the baseline, the spatial context by itself improves accuracy by about 0.3%, but used as the non-linear condition in our method, it further improves accuracy by more than 1%. Therefore, performing the spatial dependency and the non-linearity separately is not as effective as doing them simultaneously.
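The ablated variants in Table 5 differ only in how the spatial condition and the non-linearity are combined; a sketch (ours) of models A, D, and E, reusing a depth-wise window for T(·) as in the FReLU sketch earlier:

```python
import torch
import torch.nn as nn

def window(channels, k=3):
    # The 3x3 per-channel window; ParamPool and DW share this shape here.
    return nn.Conv2d(channels, channels, k, padding=k // 2,
                     groups=channels, bias=False)

class ModelA(nn.Module):     # Max(x, ParamPool(x)): the full FReLU
    def __init__(self, c):
        super().__init__()
        self.t = window(c)
    def forward(self, x):
        return torch.max(x, self.t(x))

class ModelD(nn.Module):     # Sum(x, ParamPool(x)), plus ReLU or it will not converge
    def __init__(self, c):
        super().__init__()
        self.t = window(c)
    def forward(self, x):
        return torch.relu(x + self.t(x))

class ModelE(nn.Module):     # Max(DW(x), 0): spatial context, then a plain ReLU
    def __init__(self, c):
        super().__init__()
        self.dw = window(c)
    def forward(self, x):
        return torch.relu(self.dw(x))

x = torch.randn(2, 32, 28, 28)
for m in (ModelA(32), ModelD(32), ModelE(32)):
    assert m(x).shape == x.shape
```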
Ablation on the window size

In the parametric pooling window, the size of the window decides the size of the area each pixel looks at. We simply change the window size in the funnel condition and compare different sizes among {1×1, 3×3, 5×5, 7×7}; the case of 1×1 reduces the condition to a per-pixel form without spatial context. Besides the square windows, we take the irregular windows 1×3 and 3×1 and consider using the sum and the max of them as the condition. Table 7 (B, E, F) shows the comparison. The results show that irregular window sizes also achieve the optimum performance, since they have a more flexible pixel-wise modeling capacity (Fig. 3).

Table 7. Ablation on the window size. We change the window size in the funnel condition and evaluate the top-1 error rate on the ImageNet dataset using the ResNet-50 [15] structure; rows A-F compare the 1×1, 3×3, 5×5, and 7×7 windows and the Sum/Max combinations of the 1×3 and 3×1 windows.

Table 8. Ablation on different layers. We replace the ReLU with FReLU after the 1×1 and the 3×3 convolution layers of ResNet-50 and MobileNet; the results are top-1 error rates on ImageNet.

To adopt the new activation into convolutional networks, we have to choose which layers and which stages to adopt it in. Moreover, we also investigate the compatibility with existing effective approaches such as SENet.
Compatibility with different convolution layers
First, we compare positions after different convolution layers; that is, we investigate the effect of FReLU placed after the 1×1 convolution layers and after the 3×3 convolution layers. We conduct the experiments on ResNet-50 [15] and ShuffleNetV2 [30], replacing the ReLU after the 1×1 and the 3×3 convolution layers, respectively; Table 8 shows the comparison.
Secondly, we investigate the compatibility with different stages of the CNN structures. The visual activations are especially important in layers with high spatial dimensions. In a classification network, the shallow layers have larger spatial dimensions and the deeper layers have larger channel dimensions, so there may be differences when we apply the visual activations in different stages. Stage 5 of ResNet-50 with a 224×224 input has a relatively small 7×7 feature size, which mainly contains channel dependency instead of spatial dependency; therefore, we adopt the visual activations on Stages {2, 3, 4} of ResNet-50, as Table 9 shows. The results reveal that adopting FReLU in the shallow stages has a larger effect, while the deeper stages show a smaller effect; adopting FReLU on all of them gives the optimum top-1 accuracy.

Table 9. Ablation of the visual activation on different stages (Stages {2, 3, 4} in ResNet-50 [15]). In each stage we replace each ReLU with our visual activation. The results are top-1 error rates on ImageNet. Image size 224×224.
At last, we compare the performance with SENet [20] and show the compatibility with it. Without complex advances in the CNN architecture, our activation achieves significant improvements on all three vision tasks simply together with regular convolution layers; we therefore compare the visual activation with the recent, highly effective attention module SENet. Table 10 shows the result: although SENet uses an additional block to enhance the model capacity, it is remarkable that the simple visual activation even outperforms SENet. We also wish the proposed visual activation to co-exist with other techniques such as the SE module; we adopt the SE module in the last stage of ResNet-50 to avoid overfitting. Table 10 also shows this co-existence: together with SENet, the funnel activation improves accuracy by a further 0.3%.

Table 10. Ablation comparisons of the compatibility between FReLU and SENet [20] on ResNet-50 [15]. The results are top-1 error rates on ImageNet. Image size 224×224. Single crop.
5 Conclusion

In this work, we presented a funnel activation specifically designed for visual tasks, which easily captures complex layouts through its pixel-wise modeling capacity. Our approach is simple, effective, and readily compatible with other techniques, providing a new alternative activation for image recognition tasks. We note that ReLU has been so influential that many state-of-the-art architectures have been designed for it; their settings may therefore not be optimal for the funnel activation, and there remains large potential for further improvements.
Acknowledgements
This work is supported by The National Key Research and Development Program of China (No. 2017YFA0700800) and Beijing Academy of Artificial Intelligence (BAAI).
References
1. Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P.: Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830 (2014)
2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
3. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017)
5. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
6. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)
7. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 764–773 (2017)
8. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 933–941. JMLR.org (2017)
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
10. Elfwing, S., Uchibe, E., Doya, K.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, 3–11 (2018)
11. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256 (2010)
12. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
13. Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947–951 (2000)
14. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
16. Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with Gaussian error linear units (2016)
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
18. Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets, pp. 286–297. Springer (1990)
19. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
20. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
22. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
23. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: 2009 IEEE 12th International Conference on Computer Vision. pp. 2146–2153. IEEE (2009)
24. Jeon, Y., Kim, J.: Active convolution: Learning the shape of convolution for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4201–4209 (2017)
25. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: Advances in Neural Information Processing Systems. pp. 971–980 (2017)
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
27. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
28. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
29. Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 4898–4906 (2016)
30. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
31. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)
32. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
33. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3456–3465 (2017)
34. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with PixelCNN decoders. In: Advances in Neural Information Processing Systems. pp. 4790–4798 (2016)
35. Qiu, S., Xu, X., Cai, B.: FReLU: Flexible rectified linear units for improving convolutional neural networks. In: 2018 24th International Conference on Pattern Recognition (ICPR). pp. 1223–1228. IEEE (2018)
36. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
37. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)