Funnel Activation for Visual Recognition
Ningning Ma 1, Xiangyu Zhang 2*, and Jian Sun 2
1 Hong Kong University of Science and Technology
2 MEGVII Technology
[email protected], {zhangxiangyu,sunjian}@megvii.com
* Corresponding author

Abstract.
We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a negligible overhead of spatial condition. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form y = max(x, T(x)), where T(·) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in the visual recognition tasks. Code is available at https://github.com/megvii-model/FunnelAct.

Keywords: funnel activation, visual recognition, CNN
1 Introduction

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in many visual recognition tasks, such as image classification, object detection, and semantic segmentation. As popularized in the CNN framework, one major kind of layer is the convolution layer; another is the non-linear activation layer.

First, in the convolution layers, capturing the spatial dependency adaptively is challenging, and many advances in more complex and effective convolutions have been proposed to grasp the local context adaptively in images [7,18]. These advances achieve great success especially on dense prediction tasks (e.g., semantic segmentation, object detection). Driven by the advances in more complex convolutions and their less efficient implementations, a question arises:
Could regular convolutions achieve similar accuracy, to grasp the challenging complex images?
Second, usually right after capturing spatial dependency linearly in a convolution layer, an activation layer acts as a scalar non-linear transformation. Many insightful activations have been proposed [31,14,5,25], but improving the performance on visual tasks is challenging; as a result, the most widely used activation is still the Rectified Linear Unit (ReLU) [32].

Fig. 1. Effectiveness and generalization performance (comparing FReLU, Swish, and PReLU). We set the ReLU network as the baseline and show the relative improvement of accuracy on the three basic tasks in computer vision: image classification (Top-1 accuracy), object detection (mAP), and semantic segmentation (mean IU). We use ResNet-50 [15] as the backbone, pre-trained on the ImageNet dataset, to evaluate the generalization performance on the COCO and CityScape datasets. FReLU is more effective and transfers better on all three tasks.

Driven by the distinct roles of the convolution layers and activation layers, another question arises:
Could we design an activation specifically for visual tasks?
To answer both questions raised above, we show that a simple but effective visual activation, together with regular convolutions, can also achieve significant improvements on both dense and sparse predictions (e.g., image classification, see Fig. 1). To achieve these results, we identify spatial insensitivity in activations as the main obstacle impeding visual tasks from achieving significant improvements, and we propose a new visual activation that eliminates this barrier. In this work, we present a simple but effective visual activation that extends ReLU and PReLU to a 2D visual activation.

Spatial insensitivity is rarely addressed in modern activations for visual tasks. As popularized in the ReLU activation, non-linearity is performed using a max(·) function whose condition is the hand-designed zero, thus in the scalar form y = max(x, 0). Our Funnel activation (FReLU) extends the spirit of ReLU/PReLU by adding a spatial condition (see Fig. 2), which is simple to implement and only adds a negligible computational overhead. Formally, the form of our proposed method is y = max(x, T(x)), where T(x) represents a simple and efficient spatial contextual feature extractor. By using the spatial condition in activations, it simply extends ReLU and PReLU to a visual parametric ReLU with a pixel-wise modeling capacity.

Our proposed visual activation acts as an efficient but much more effective alternative to previous activation approaches. To demonstrate its effectiveness, we replace the normal ReLU in classification networks, and we use the pre-trained backbone to show its generality on the other two basic vision tasks: object detection and semantic segmentation. The results show that FReLU not only improves performance on a single task but also transfers well to other visual tasks.
2 Related Work

Scalar activations
Scalar activations are activations with a single input and a single output, in the form y = f(x). The Rectified Linear Unit (ReLU) [13,23,32] is the most widely used scalar activation on various tasks [26,38], in the form y = max(x, 0). The Sigmoid non-linearity has the form σ(x) = 1/(1 + e^{−x}), and the Tanh non-linearity has the form tanh(x) = 2σ(2x) − 1. These activations are not widely used in deep CNNs, mainly because they saturate and kill gradients, and also involve expensive operations (exponentials, etc.). Many advances followed [25,39,1,16,35,10,46], and a recent searching technique contributed a new searched scalar activation called Swish [36], found by combining a comprehensive set of unary and binary functions. Its form is y = x · Sigmoid(x); it outperforms other scalar activations on some structures and datasets, and many searched results show great potential.
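To make the scalar forms above concrete, the short sketch below (our own illustration, not from the paper) evaluates ReLU, Sigmoid, and Swish on a few sample values and numerically checks the stated identity tanh(x) = 2σ(2x) − 1.

```python
import math

def sigmoid(x):  # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):     # y = max(x, 0)
    return max(x, 0.0)

def swish(x):    # y = x * sigmoid(x), the searched activation of [36]
    return x * sigmoid(x)

for x in (-2.0, -0.5, 0.0, 1.5):
    # tanh(x) = 2 * sigmoid(2x) - 1, the identity stated above
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  swish={swish(x):+.3f}")
```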
Contextual conditional activations

Besides the scalar activations, which depend only on the neuron itself, conditional activations are many-to-one functions that activate the neurons conditioned on contextual information. A representative method is Maxout [12], which extends the layer to multiple branches and selects the maximum. Most activations apply a non-linearity on the linear dot product between the weights and the data, i.e., f(wᵀx + b); Maxout instead computes max(w₁ᵀx + b₁, w₂ᵀx + b₂) and generalizes ReLU and Leaky ReLU into the same framework. With dropout [17], the Maxout network shows improvement. However, it increases the complexity considerably, multiplying the numbers of parameters and multiply-adds by the number of branches.

Contextual gating methods [8,44] use contextual information to enhance the efficacy, especially in RNN-based methods, where the feature dimension is relatively small. There are also CNN-based methods [34]; since 2D features have a large dimension, the method is used after a feature reduction. The contextually conditioned activations are usually channel-wise methods. However, in this paper, we find that spatial dependency is also important in the non-linear activation functions. We use the light-weight depth-wise separable convolution to help reduce the additional complexity.
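As an illustration of the many-to-one conditioning described above, the following sketch (ours, with hypothetical dimensions) implements a two-branch Maxout layer, max(w₁ᵀx + b₁, w₂ᵀx + b₂), in PyTorch.

```python
import torch
import torch.nn as nn

class Maxout2(nn.Module):
    """Two-branch Maxout [12]: computes max(w1^T x + b1, w2^T x + b2).

    Doubles the parameters and multiply-adds of a single linear layer,
    which is the complexity cost discussed above.
    """
    def __init__(self, in_features, out_features):
        super().__init__()
        self.branch1 = nn.Linear(in_features, out_features)
        self.branch2 = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.max(self.branch1(x), self.branch2(x))

x = torch.randn(4, 16)          # batch of 4, hypothetical 16-d features
print(Maxout2(16, 8)(x).shape)  # torch.Size([4, 8])
```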
Spatial dependency modeling
Learning better spatial dependency is challenging. Some approaches use different shapes of convolution kernels [41,42,40] to aggregate different ranges of spatial dependencies; however, this requires a multi-branch structure that decreases efficiency. Advances in convolution kernels such as atrous convolution [18] and dilated convolution [47] also lead to better performance by increasing the receptive field. Another type of method learns the spatial dependency adaptively, such as STN [22], active convolution [24], and deformable convolution [7]. These methods adaptively use spatial transformations to refine short-range dependencies, especially for dense vision tasks (e.g., object detection, semantic segmentation). Our simple FReLU even outperforms them without complex convolutions. Moreover, non-local networks provide methods to capture long-range dependencies, and GCNet [3] provides a spatial attention mechanism to better use the spatial global context. Long-range modeling methods achieve better performance but still require additional blocks in the original network structure, which decreases efficiency. Our method addresses this issue in the non-linear activations, solving it better and more efficiently.
Receptive field
The region and size of the receptive field are essential in visual recognition tasks [50,33]. Work on the effective receptive field [29,11] finds that different pixels contribute unequally and that the center pixels have a larger impact. Therefore, many methods have been proposed to implement an adaptive receptive field [7,51,49]. These methods achieve the adaptive receptive field and improve performance by involving additional branches in the architecture, such as more complex convolutions or attention mechanisms. Our method achieves the same goal in a simpler and more efficient manner by introducing the receptive field into the non-linear activations. With the more adaptive receptive field, we can approximate the layouts of common complex shapes, thus achieving even better results than the complex convolutions while using efficient regular convolutions.
3 Funnel Activation

FReLU is designed specifically for visual tasks and is conceptually simple: the condition is a hand-designed zero for ReLU and a parametric px for PReLU; we modify it to a 2D funnel-like condition dependent on the spatial context. The visual condition helps extract the fine spatial layout of an object. Next, we introduce the key elements of FReLU, including the funnel condition and the pixel-wise modeling capacity, which are the main missing parts in ReLU and its variants.

ReLU
We begin by briefly reviewing the ReLU activation. ReLU, in the form max(x, 0), adopts max(·) to serve as the non-linearity and uses a hand-designed zero as the condition. The non-linear transformation acts as a supplement to linear transformations such as convolution and fully-connected layers.
Fig. 2. Funnel activation. We propose a novel activation for visual recognition, called FReLU, that follows the spirit of ReLU/PReLU and extends them to 2D by adding a visual funnel condition T(x). (a) ReLU with a condition zero: MAX(x, 0); (b) PReLU with a parametric condition: MAX(x, px); (c) FReLU with a visual parametric condition: MAX(x, T(x)).

PReLU
As an advanced variant of ReLU, PReLU has the original form max(x, 0) + p · min(x, 0), where p is a learnable parameter initialized as 0.25. However, in most cases p < 1; under this assumption, we rewrite it to the form max(x, px) (p < 1). Since p is a channel-wise parameter, it can be interpreted as a 1×1 depth-wise convolution, regardless of the bias terms.
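The rewrite above can be checked numerically; a minimal sketch (ours) verifies that for p < 1 the original form max(x, 0) + p · min(x, 0) equals max(x, px):

```python
import torch

p = 0.25                      # PReLU slope, initialized as 0.25 and < 1
x = torch.randn(1000)

original = torch.clamp(x, min=0) + p * torch.clamp(x, max=0)  # max(x,0) + p*min(x,0)
rewritten = torch.max(x, p * x)                               # max(x, px)
assert torch.allclose(original, rewritten)                    # identical when p < 1
```

Funnel condition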
FReLU adopts the same max(·) as the simple non-linear function. For the condition part, FReLU extends it to be a 2D condition dependent on the spatial context of each pixel (see Fig. 2). This is in contrast to most recent methods, whose condition depends on the pixel itself (e.g., [31,14]) or on the channel context (e.g., [12]). Our approach follows the spirit of ReLU, using max(·) to obtain the maximum between x and a condition.

Formally, we define the funnel condition as T(x). To implement the spatial condition, we use a Parametric Pooling Window to create the spatial dependency; specifically, we define the activation function as:

f(x_{c,i,j}) = max(x_{c,i,j}, T(x_{c,i,j}))    (1)

T(x_{c,i,j}) = x^ω_{c,i,j} · p^ω_c    (2)

Here, x_{c,i,j} is the input pixel of the non-linear activation f(·) on the c-th channel at the 2D spatial position (i, j); the function T(·) denotes the funnel condition; x^ω_{c,i,j} denotes a k_h × k_w Parametric Pooling Window centered on x_{c,i,j}; p^ω_c denotes the coefficients on this window, which are shared within the same channel; and (·) denotes dot multiplication.
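A minimal PyTorch sketch of Eqs. (1)-(2) follows, assuming, as the implementation details later in the paper state, that the Parametric Pooling Window T(·) is realized as a k×k depth-wise convolution followed by batch normalization; names and defaults are ours, not the official released code.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel activation: y = max(x, T(x)), Eqs. (1)-(2).

    T(x) is the Parametric Pooling Window: a k x k depth-wise convolution
    (one window of coefficients per channel, shared within the channel),
    followed by BN as in the implementation details.
    """
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2,  # keep the spatial size unchanged
            groups=channels,           # depth-wise: per-channel window p^w_c
            bias=False)
        self.bn = nn.BatchNorm2d(channels)
        # Gaussian initialization so T(x) starts near zero (see
        # "Parameter initialization" below); std is our assumption.
        nn.init.normal_(self.conv.weight, std=0.01)

    def forward(self, x):
        return torch.max(x, self.bn(self.conv(x)))  # per-pixel funnel condition

x = torch.randn(2, 64, 56, 56)  # hypothetical feature map
print(FReLU(64)(x).shape)       # torch.Size([2, 64, 56, 56])
```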
Fig. 3.
Graphic depiction of how the per-pixel funnel condition can achieve a pixel-wise modeling capacity. The distinct sizes of squares represent the distinct activate fields of each pixel in the top activation layers. (a) A normal activate field has equal sizes of squares per pixel and can only describe horizontal and vertical layouts. In contrast, the max(·) allows each pixel to choose between looking around or not in each layer; after a sufficient number of layers, the pixels have many different sizes of squares. Therefore, the different sizes of squares can approximate (b) the shape of an oblique line and (c) the shape of an arc, which are more common natural object layouts.

Pixel-wise modeling capacity
Our definition of the funnel condition allows the network to generate spatial conditions in the non-linear activations for every pixel. The network conducts non-linear transformations and creates spatial dependencies simultaneously. This differs from common practice, which creates spatial dependency in the convolution layer and conducts non-linear transformations separately; in that case, the activations do not depend on spatial conditions explicitly, while in our case, with the funnel condition, they do.

As a result, the pixel-wise condition gives the network a pixel-wise modeling capacity: the function max(·) gives each pixel a choice between looking at the spatial context or not. Formally, consider a network {F₁, F₂, ..., F_n} with n FReLU layers, where each FReLU layer F_i has a k × k parametric window. For brevity, we analyze only the FReLU layers, disregarding the convolution layers. Because of the max selection between the 1 × 1 and k × k fields, each pixel after F₁ has an activate field set {1, 1+r} (r = k − 1); after the F_n layer, the set becomes {1, 1+r, 1+2r, ..., 1+nr}, which gives more choices to each pixel and can approximate any layout if n is sufficiently large. With many distinct sizes of the activate field, the distinct sizes of squares can approximate the shapes of the oblique line and the arc (see Fig. 3). As we know, the layouts of objects in images are usually not horizontal or vertical but in the shape of oblique lines or arcs, so extracting the spatial structure of objects is addressed naturally by the pixel-wise modeling capacity provided by the spatial condition. We show by experiments that it captures irregular and detailed object layouts better in complex tasks (see Fig. 4).
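The growth of the activate-field set can be checked by brute force; a small sketch (ours) enumerates every per-layer choice between the 1×1 field and the k×k window for n layers and confirms that the reachable sizes are exactly {1, 1+r, ..., 1+nr} with r = k − 1.

```python
from itertools import product

def reachable_fields(n, k):
    """Activate-field sizes reachable after n FReLU layers with k x k windows.

    At each layer the max(.) lets a pixel keep its field (the 1x1 choice)
    or grow it by r = k - 1 (the k x k window choice).
    """
    r = k - 1
    return sorted({1 + sum(choices) for choices in product((0, r), repeat=n)})

# Three 3x3 FReLU layers (k = 3, r = 2): expect {1, 3, 5, 7} = {1, 1+r, 1+2r, 1+3r}
print(reachable_fields(3, 3))  # [1, 3, 5, 7]
```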
Our proposed change is simple: we avoid the hand-designed condition in activations and replace it with a simple and effective spatial 2D condition. The visual activation leads to significant improvements, as shown in Fig. 1. We first change the ReLU activations in the classification task on the ImageNet dataset. We use ResNet [15] as the classification network and use the pre-trained networks as backbones for the other tasks: object detection and semantic segmentation.

All the regions x^ω_{c,i,j} in the same channel share the same coefficients p^ω_c; therefore, FReLU adds only a slight additional number of parameters. The region represented by x^ω_{c,i,j} is a sliding window whose size is set to 3 × 3 by default, thus:

x^ω_{c,i,j} · p^ω_c = Σ_{i−1 ≤ h ≤ i+1, j−1 ≤ w ≤ j+1} x_{c,h,w} · p_{c,h,w}    (3)

Parameter initialization
We use Gaussian initialization for the window parameters. The condition values therefore start close to zero, which does not change the original network's properties too much. We also investigated parameter-free cases (e.g., max pooling, average pooling), which do not show improvement; this shows the importance of the additional parameters.
Parameter computation
Assume a K'_h × K'_w convolution with an input feature of size C × H × W and an output of size C × H' × W'. Its number of parameters is C·C·K'_h·K'_w, and its FLOPs (floating point operations) are C·C·K'_h·K'_w·H·W. To this we add our funnel condition with window K_h × K_w: the additional number of parameters is C·K_h·K_w, and the additional FLOPs are C·K_h·K_w·H·W. For simplification we assume K = K_h = K_w and K' = K'_h = K'_w. Therefore, the original parameter complexity is O(C²K'²); after adopting FReLU, it becomes O(C²K'² + CK²). The original FLOPs complexity is O(C²K'²HW); after adopting the visual activation, it becomes O(C²K'²HW + CK²HW). Usually, C is much larger than K and K', so the additional complexity is negligible, as we also observe in practice (more details in Table 1). Moreover, since the funnel condition is a k_h × k_w sliding window, we implement it using the highly optimized depth-wise separable convolution operator, followed by a BN [21] layer.
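Plugging in typical numbers makes the negligible overhead concrete; a short calculation (ours, with an illustrative layer size) compares the extra parameters CK² of the funnel condition against the C²K'² of the convolution it accompanies.

```python
C, K_conv, K_frelu, H, W = 256, 3, 3, 56, 56   # illustrative layer dimensions

conv_params  = C * C * K_conv**2               # C*C*K'_h*K'_w
extra_params = C * K_frelu**2                  # C*K_h*K_w for the funnel window
conv_flops   = conv_params * H * W
extra_flops  = extra_params * H * W

print(f"extra params: {extra_params} / {conv_params} "
      f"= {extra_params / conv_params:.2%}")   # ~0.39%: K^2 / (C*K'^2) = 1/256
print(f"extra FLOPs:  {extra_flops / conv_flops:.2%}")
```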
4 Experiments

4.1 Image Classification

To evaluate the effectiveness of our visual activation, we first conduct experiments on the ImageNet 2012 classification dataset [9,37], which comprises 1.28 million training images and 50K validation images. Our visual activation is easy to adopt in network structures by simply replacing the ReLUs in the original CNN structure. First, we evaluate the activation on different sizes of ResNet [15], using the original implementation of the network structure. Spatial dependency is especially important in the shallow layers; for the small 224 × 224 input size, we replace the ReLUs in all the stages except the last stage, which has a small 7 × 7 feature size.
Table 1. Comparisons with other effective activations [14,36] on ResNets [15] on ImageNet 2012. Image size 224×224. Single crop. We evaluate the Top-1 error rate on the test set.

Model      | Activation | #Params | FLOPs | Top-1 Err.
ResNet-101 | ReLU       | 44.4M   | 7.6G  | 22.8
ResNet-101 | PReLU      | 44.4M   | 7.6G  | 22.7
ResNet-101 | Swish      | 44.4M   | 7.6G  | 22.7
ResNet-101 | FReLU      | 44.5M   | 7.6G  | 22.1

For the training settings, we use a batch size of 256, 600k iterations, a learning rate of 0.1 with a linear decay schedule, a weight decay of 1e-4, and a dropout [17] rate of 0.1. We report the Top-1 error rate on the validation set. For a fair comparison, we run all the results on the same code base.
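For reference, the training recipe above maps onto a standard PyTorch optimizer and schedule; a hedged sketch (the model is a stand-in, the dropout placement is omitted, and the momentum value is our assumption since the paper states momentum only for detection):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.Flatten())  # stand-in for ResNet

total_iters = 600_000
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Linear decay of the 0.1 learning rate to zero over 600k iterations;
# call scheduler.step() once per training iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: 1.0 - it / total_iters)
```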
Comparisons with scalar activations
We conduct a comprehensive comparison on ResNets [15] of different depths (ResNet-50 and ResNet-101). We take ReLU as the baseline and include its variant PReLU for comparison. Further, we compare our visual activation with Swish [36], an activation found by NAS [52,53] techniques that has shown a positive influence on various model structures compared with many scalar activations. Table 1 shows the comparison: our visual activation outperforms all of them with a negligible additional complexity, improving the top-1 accuracy by 1.6% on ResNet-50 and 0.7% on ResNet-101. It is remarkable that, as the model size and depth increase, other scalar activations show limited improvement while the visual activation still improves significantly: Swish and PReLU improve accuracy by only 0.1% on ResNet-101, whereas the visual activation improves it by 0.7%.
Comparison on light-weight CNNs
Besides deep CNNs, we compare the visual activation with other effective activations on recent light-weight CNNs such as MobileNets [19] and ShuffleNets [30], using the same training settings as in [30]. Although the model sizes are extremely small, FReLU improves top-1 accuracy by 2.5% while adding only a slight amount of additional FLOPs.

Table 2.
Comparisons among other effective activations [14,36] on light-weight CNNs (MobileNet [19], ShuffleNetV2 [30]) on ImageNet 2012. Image size 224×224. Single crop. We evaluate the Top-1 error rate on the test set.

Model        | Activation | #Params | FLOPs | Top-1 Err.
ShuffleNetV2 | ReLU       | 1.4M    | 41M   | 39.6
ShuffleNetV2 | PReLU      | 1.4M    | 41M   | 39.1
ShuffleNetV2 | Swish      | 1.4M    | 41M   | 38.7
ShuffleNetV2 | FReLU      | 1.4M    | 45M   | 37.1
4.2 Object Detection

To evaluate the generalization performance of the visual activation on different tasks, we conduct object detection experiments on the COCO dataset [28], which has 80 object categories. We use the trainval35k set for training and the minival set for testing.

Table 3.
Comparisons of different activations on COCO object detection. We use ResNet-50 [15] and ShuffleNetV2 (1.5×) [30] with different activations as the pre-trained backbones, and the RetinaNet [27] detector.

Model        | Activation | #Params | FLOPs | AP   | AP50 | AP75 | APs  | APm  | APl
ResNet-50    | ReLU       | 25.5M   | 3.86G | 35.2 | 53.7 | 37.5 | 18.8 | 39.7 | 48.8
ResNet-50    | Swish      | 25.5M   | 3.86G | 35.8 | 54.1 | 38.7 | 18.6 | 40.0 | 49.4
ResNet-50    | FReLU      | 25.5M   | 3.87G | 36.6 | –    | –    | –    | –    | –
ShuffleNetV2 | ReLU       | 3.5M    | 299M  | 31.7 | 49.4 | 33.7 | 15.3 | 35.1 | 45.2
ShuffleNetV2 | Swish      | 3.5M    | 299M  | 32.0 | 49.9 | 34.0 | 16.2 | 35.2 | 45.2
ShuffleNetV2 | FReLU      | 3.7M    | 318M  | 32.8 | –    | –    | –    | –    | –
We present results with the RetinaNet [27] detector. For a fair comparison, we train all the models in the same code base with the same settings: a batch size of 2, a weight decay of 1e-4, and a momentum of 0.9; we use anchors at 3 scales and 3 aspect ratios, and a 600-pixel train and test image scale. For the backbone, we use the pre-trained models from Section 4.1 as feature extractors and compare the generality of the different activations.

Table 3 shows the comparison among the different activations: our visual activation improves mAP by 1.4% over the ReLU backbone and by 0.8% over the Swish backbone. It is worth mentioning that FReLU outperforms all the other counterparts significantly on small, medium, and large objects alike.
Table 4.
Comparisons on the semantic segmentation task on the CityScape dataset. We use PSPNet [48] as the framework and ResNet-50 [15] as the backbone. The pre-trained backbones are from Table 1.

Category      | ReLU | Swish [36] | FReLU
mean IU       | 77.2 | 77.5       | 78.9
road          | 98.0 | 98.1       | 98.1
sidewalk      | 84.2 | 85.0       | 84.7
building      | 92.3 | 92.5       | 92.7
wall          | 55.0 | 56.3       | 59.5
fence         | 59.0 | 59.6       | 60.9
pole          | 63.3 | 63.6       | 64.3
traffic light | 71.4 | 72.1       | 72.2
traffic sign  | 79.0 | 80.0       | 79.9
vegetation    | 92.4 | 92.7       | 92.8
terrain       | 65.0 | 64.0       | 64.5
sky           | 94.7 | 94.9       | 94.8
person        | 82.1 | 83.1       | 83.2
rider         | 62.3 | 65.5       | 64.7
car           | 95.1 | 94.8       | 95.3
truck         | 77.7 | 70.1       | 79.8
bus           | 84.9 | 84.0       | 87.8
train         | 63.3 | 68.8       | 74.6
motorcycle    | 68.3 | 69.4       | 69.8
bicycle       | 78.2 | 78.4       | 78.7
We also show the comparison on light-weight CNNs. As in the ResNet-50 comparison, we use pre-trained ShuffleNetV2 backbones equipped with the different activations, comparing FReLU mainly with ReLU and the effective activation Swish [36]. Table 3 shows that the visual activation also clearly outperforms the ReLU and Swish backbones, improving mAP by 1.1% and 0.8%, respectively; moreover, it increases the performance on all object sizes.
4.3 Semantic Segmentation

We further present semantic segmentation results on the CityScape dataset [6], a semantic urban scene understanding dataset with 19 categories. It has 5,000 finely annotated images: 2,975 for training, 500 for validation, and 1,525 for testing.

We use PSPNet [48] as the segmentation framework. For the training settings, we use the poly learning rate policy [4] with a base learning rate of 0.01 and a power of 0.9, a weight decay of 1e-4, and 8 GPUs with a batch size of 2 on each GPU. To evaluate the generality of the pre-trained models from Section 4.1, we use the pre-trained ResNet-50 [15] backbone models with different activations, comparing FReLU with Swish and ReLU.

In Table 4, we show the comparison with the scalar activations. We observe that our visual activation outperforms ReLU and the searched Swish by 1.7% and 1.4% mean IU, respectively. Moreover, our visual activation yields significant improvements on both large and small objects, especially on categories such as 'train', 'bus', and 'wall'.

For a better view of the improved performance, Fig. 4 shows prediction results on the testing dataset. Only changing the backbone activations yields an obvious improvement: the boundaries of both large and small objects are well segmented, because the pixel-wise modeling capacity can handle both global and detailed regions (see Fig. 3). We note that modern recognition frameworks are finely designed with the ReLU activation, so the visual activation still has great potential for further improving the results, which is beyond the focus of this work.

Fig. 4. Visualization of semantic segmentation with ResNet-50 [15]-PSPNet [48] using different activations in the backbone (columns: Image, GroundTruth, ReLU, Swish, FReLU). We clip the CityScape images to make the differences clearer (best viewed with enlarged images). FReLU has better long-range (large or slender objects) and short-range (small objects) understanding due to its better context-capturing capacity, and it captures irregular and detailed object layouts much better in complex cases. We note that modern frameworks are finely optimized with ReLU; nevertheless, only changing the backbone gives obvious improvements, suggesting potential for further gains if the frameworks are redesigned for the visual activation.
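The poly learning-rate policy used for the segmentation training above follows lr = base_lr · (1 − iter/max_iter)^power; a minimal sketch (ours, with an illustrative iteration budget):

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    """Poly learning rate policy [4]: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# max_iter of 90k is illustrative only, not a value from the paper.
for it in (0, 45_000, 89_999):
    print(it, round(poly_lr(it, 90_000), 6))
```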
4.4 Ablation Study

The previous sections demonstrated the performance of FReLU compared with other effective activations. To investigate our visual activation further, we conduct ablation studies: we first discuss the properties of the visual activation, and then its compatibility with existing methods.
Our funnel activation mainly has two components: 1) the funnel condition, and 2) the max(·) non-linearity. We investigate the effect of each component separately.

Table 5.
Ablation on the different spatial condition manners and the different non-linearity manners. The experiments are conducted on ResNet-50 [15]. Models A, B, and C compare different visual conditions with/without parameters. Model D replaces max with sum; to this we add a ReLU, or it will not converge. Model E separates and evaluates the performance of the spatial condition itself. DW(x) represents the 3×3 depth-wise separable convolution.

Model | Activation            | Top-1 Err.
A     | Max(x, ParamPool(x))  | 22.4
B     | Max(x, MaxPool(x))    | 24.4
C     | Max(x, AvgPool(x))    | 24.5
D     | Sum(x, ParamPool(x))  | 23.6
E     | Max(DW(x), 0)         | 23.7
Table 6. Ablation on different normalization methods after the spatial condition layer, which is implemented by a depth-wise convolution. We adopt Batch Normalization (BN) [21], Layer Normalization (LN) [2], Instance Normalization (IN) [43], and Group Normalization (GN) [45] after the spatial condition layer. ImageNet results on ShuffleNetV2 0.5×.

Normalization | Top-1 Err.
none          | 37.6
BN            | 37.1
LN            | 36.5
IN            | 38.0
GN            | 36.5

Ablation on the spatial condition
First, we compare different manners of the spatial condition. Besides the parametric pooling manner that we use, and to investigate the importance of the additional parameters, we compare two pooling manners without additional parameters: max pooling and average pooling. We simply replace the parametric pooling with these two non-parametric manners and evaluate the results on the ImageNet dataset. Table 5 (A, B, C) shows the importance of the parametric pooling: without the additional parameters, top-1 accuracy decreases by more than 2%, performing even worse than the baseline that does not use a spatial condition. Table 6 shows the comparison of different normalizations after the spatial condition.
Ablation on the non-linearity
Second, we compare different uses of the non-linearity. In our method, the max(·) function performs the non-linearity while simultaneously capturing the visual dependency. In contrast, we compare with manners that capture the visual dependency and the non-linearity separately. For spatial context capturing, we use two manners: 1) the parametric pooling as before, linearly summed with the original feature, and 2) simply adding a depth-wise separable convolution layer; for the non-linear transformation, we use the ReLU function. Table 5 (A, D, E) shows the results: compared with the baseline, the spatial context by itself improves accuracy by about 0.3%, but used as the non-linear condition in our method, it further improves accuracy by more than 1%. Therefore, performing the spatial dependency and the non-linearity separately is not as effective as doing them simultaneously.
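The ablated variants in Table 5 differ only in how the spatial condition and the non-linearity are combined; a sketch (ours) of models A, D, and E, reusing a depth-wise window for T(·) as in the FReLU sketch earlier:

```python
import torch
import torch.nn as nn

def window(channels, k=3):
    # The 3x3 per-channel window; ParamPool and DW share this shape here.
    return nn.Conv2d(channels, channels, k, padding=k // 2,
                     groups=channels, bias=False)

class ModelA(nn.Module):     # Max(x, ParamPool(x)): the full FReLU
    def __init__(self, c):
        super().__init__()
        self.t = window(c)
    def forward(self, x):
        return torch.max(x, self.t(x))

class ModelD(nn.Module):     # Sum(x, ParamPool(x)), plus ReLU or it will not converge
    def __init__(self, c):
        super().__init__()
        self.t = window(c)
    def forward(self, x):
        return torch.relu(x + self.t(x))

class ModelE(nn.Module):     # Max(DW(x), 0): spatial context, then a plain ReLU
    def __init__(self, c):
        super().__init__()
        self.dw = window(c)
    def forward(self, x):
        return torch.relu(self.dw(x))

x = torch.randn(2, 32, 28, 28)
for m in (ModelA(32), ModelD(32), ModelE(32)):
    assert m(x).shape == x.shape
```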
Ablation on the window size

In the parametric pooling window, the size of the window decides the size of the area each pixel looks at. We simply change the window size in the funnel condition and compare different sizes among {1×1, 3×3, 5×5, 7×7}; the case of 1×1 reduces the condition to a per-pixel form without spatial context. Besides the square windows, we take the irregular windows 1×3 and 3×1 and consider using the sum and the max of them as the condition. Table 7 (B, E, F) shows the comparison. The results show that irregular window sizes also achieve the optimum performance, since they have a more flexible pixel-wise modeling capacity (Fig. 3).

Table 7. Ablation on the window size. We change the window size in the funnel condition and evaluate the top-1 error rate on the ImageNet dataset using the ResNet-50 [15] structure; rows A-F compare the 1×1, 3×3, 5×5, and 7×7 windows and the Sum/Max combinations of the 1×3 and 3×1 windows.

Table 8. Ablation on different layers. We replace the ReLU with FReLU after the 1×1 and the 3×3 convolution layers of ResNet-50 and MobileNet; the results are top-1 error rates on ImageNet.

To adopt the new activation into convolutional networks, we have to choose which layers and which stages to adopt it in. Moreover, we also investigate the compatibility with existing effective approaches such as SENet.
Compatibility with different convolution layers
First, we compare positions after different convolution layers; that is, we investigate the effect of FReLU placed after the 1×1 convolution layers and after the 3×3 convolution layers. We conduct the experiments on ResNet-50 [15] and ShuffleNetV2 [30], replacing the ReLU after the 1×1 and the 3×3 convolution layers, respectively; Table 8 shows the comparison.
Secondly, we investigate the compatibility with different stages of the CNN structures. The visual activations are especially important in layers with high spatial dimensions. In a classification network, the shallow layers have larger spatial dimensions and the deeper layers have larger channel dimensions, so there may be differences when we apply the visual activations in different stages. Stage 5 of ResNet-50 with a 224×224 input has a relatively small 7×7 feature size, which mainly contains channel dependency instead of spatial dependency; therefore, we adopt the visual activations on Stages {2, 3, 4} of ResNet-50, as Table 9 shows. The results reveal that adopting FReLU in the shallow stages has a larger effect, while the deeper stages show a smaller effect; adopting FReLU on all of them gives the optimum top-1 accuracy.

Table 9. Ablation of the visual activation on different stages (Stages {2, 3, 4} in ResNet-50 [15]). In each stage we replace each ReLU with our visual activation. The results are top-1 error rates on ImageNet. Image size 224×224.
At last, we compare the performance with SENet [20] and show the compatibility with it. Without complex advances in the CNN architecture, our activation achieves significant improvements on all three vision tasks simply together with regular convolution layers; we therefore compare the visual activation with the recent, highly effective attention module SENet. Table 10 shows the result: although SENet uses an additional block to enhance the model capacity, it is remarkable that the simple visual activation even outperforms SENet. We also wish the proposed visual activation to co-exist with other techniques such as the SE module; we adopt the SE module in the last stage of ResNet-50 to avoid overfitting. Table 10 also shows this co-existence: together with SENet, the funnel activation improves accuracy by a further 0.3%.

Table 10. Ablation comparisons of the compatibility between FReLU and SENet [20] on ResNet-50 [15]. The results are top-1 error rates on ImageNet. Image size 224×224. Single crop.
5 Conclusion

In this work, we presented a funnel activation specifically designed for visual tasks, which easily captures complex layouts through its pixel-wise modeling capacity. Our approach is simple, effective, and readily compatible with other techniques, providing a new alternative activation for image recognition tasks. We note that ReLU has been so influential that many state-of-the-art architectures have been designed for it; their settings may therefore not be optimal for the funnel activation, and there remains large potential for further improvements.
Acknowledgements
This work is supported by The National Key Research and Development Program of China (No. 2017YFA0700800) and Beijing Academy of Artificial Intelligence (BAAI).
References
1. Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P.: Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830 (2014)
2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
3. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017)
5. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
6. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)
7. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 764–773 (2017)
8. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 933–941. JMLR.org (2017)
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
10. Elfwing, S., Uchibe, E., Doya, K.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks 107, 3–11 (2018)
11. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256 (2010)
12. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
13. Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947–951 (2000)
14. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
16. Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with Gaussian error linear units (2016)
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
18. Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets, pp. 286–297. Springer (1990)
19. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
20. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141 (2018)
21. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
22. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. pp. 2017–2025 (2015)
23. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: 2009 IEEE 12th International Conference on Computer Vision. pp. 2146–2153. IEEE (2009)
24. Jeon, Y., Kim, J.: Active convolution: Learning the shape of convolution for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4201–4209 (2017)
25. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. In: Advances in Neural Information Processing Systems. pp. 971–980 (2017)
26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
27. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
28. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
29. Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 4898–4906 (2016)
30. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
31. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proc. ICML. vol. 30, p. 3 (2013)
32. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
33. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with attentive deep local features. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3456–3465 (2017)
34. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with PixelCNN decoders. In: Advances in Neural Information Processing Systems. pp. 4790–4798 (2016)
35. Qiu, S., Xu, X., Cai, B.: FReLU: Flexible rectified linear units for improving convolutional neural networks. In: 2018 24th International Conference on Pattern Recognition (ICPR). pp. 1223–1228. IEEE (2018)
36. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017)
37. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)