iqiyi Submission to ActivityNet Challenge 2019 Kinetics-700 Challenge: Hierarchical Group-wise Attention

Qian Liu, Dongyang Cai, Jie Liu, Nan Ding, Tao Wang
iqiyi, Inc.
Abstract
In this report, the method for the iqiyi submission to the task of the ActivityNet 2019 Kinetics-700 challenge is described. Three models are involved in the model ensemble stage: TSN, HG-NL and StNet. We propose the hierarchical group-wise non-local (HG-NL) module for frame-level feature aggregation for video classification. The standard non-local (NL) module is effective at aggregating frame-level features for video classification but exhibits low parameter efficiency and high computational cost. The HG-NL method involves a hierarchical group-wise structure and generates multiple attention maps to enhance performance. Based on this hierarchical group-wise structure, the proposed method attains competitive accuracy with fewer parameters and smaller computational cost than the standard NL module. For the ActivityNet 2019 Kinetics-700 challenge, after model ensemble, we finally obtain an averaged top-1 and top-5 error percentage of 28.444% on the test set.
1. Introduction
Video classification is one of the challenging tasks in computer vision. Public challenges and available video datasets accelerate research progress, especially the ActivityNet series of challenges and the related datasets. In recent years, deep convolutional neural networks (CNNs) have brought remarkable improvements in video classification accuracy [7, 1, 5, 3].

In this report, the method for the iqiyi submission to the trimmed activity recognition (Kinetics) task of the ActivityNet Large Scale Activity Recognition Challenge 2019 is described. The Kinetics-700 dataset covers 700 human action classes and consists of approximately 650,000 video clips; each clip lasts around 10 seconds.

In our model ensemble stage, three models are involved: TSN [7], HG-NL and StNet [4]. We propose the hierarchical group-wise non-local (HG-NL) module for frame-level feature aggregation for video classification. Frequently used aggregation methods include maximum, even averaging and weighted averaging. The NL module in [8] can also be used to aggregate frame-level features. However, it exhibits low parameter efficiency and high computational cost, as discussed in detail later in this paper.

We address the problem of building a highly efficient self-attention based frame-level feature aggregation module. The Hierarchical Group-wise Non-Local (HG-NL) module for frame-level feature aggregation is proposed. Compared with the NL module in [8], the HG-NL module has fewer parameters and smaller computational cost. The proposed module involves a hierarchical group-wise structure, which comprises primary grouped convolutions and a secondary grouped matrix multiplication. Moreover, HG-NL generates multiple attention maps: it produces one attention map for each feature group in the entire feature matrix and can thus mine the non-local information in the features in finer detail.
2. Method
In this section, the HG-NL is presented in detail.
Considering a video $v$, a sequence of frames $\{s_1, s_2, \dots, s_n\}$ ($n$ is the length of the sequence) is extracted from the entire video via some specific rules. The feature information of a single frame is obtained via a pre-trained convolutional network:

$$f_i = C(s_i), \qquad (1)$$

where $s_i$ denotes the $i$-th frame, $f_i$ is the feature information of $s_i$, and $C(\cdot)$ denotes the ConvNet operation. The compact video-level features can then be obtained via aggregating the features from multiple frames:

$$F_v = \mathrm{Agg}(f_1, f_2, \dots, f_n), \qquad (2)$$

where $\mathrm{Agg}(\cdot)$ is the aggregating function, $n$ is the length of the frame sequence, and $F_v$ denotes the compact video-level features.
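As a concrete illustration of Eqs. (1) and (2), the following minimal PyTorch sketch extracts per-frame features with a pre-trained ConvNet and aggregates them by plain averaging. It is not the authors' code; the ResNet-50 backbone is an illustrative stand-in (the paper's experiments use SE-ResNeXt-101).

```python
import torch
import torchvision.models as models

# Minimal sketch of Eqs. (1)-(2): f_i = C(s_i), F_v = Agg(f_1, ..., f_n).
# ResNet-50 is an illustrative stand-in for the paper's SE-ResNeXt-101.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()      # keep the m-dimensional pooled features
backbone.eval()

def aggregate_average(frames: torch.Tensor) -> torch.Tensor:
    """frames: (n, 3, H, W), the n frames s_1..s_n sampled from one video."""
    with torch.no_grad():
        f = backbone(frames)           # (n, m): per-frame features, Eq. (1)
    return f.mean(dim=0)               # (m,): even averaging as Agg, Eq. (2)
```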
Figure 1. Self-Attention (Non-local) Based Frame-level Feature Aggregation ($\oplus$ denotes element-wise sum and $\otimes$ denotes matrix multiplication).

Table 1. The number of parameters in NL and HG-NL ($m = 1024$, $g_1 = 16$ and $g_2 = 8$). The parameters of HG-NL are about 8 to 14 times fewer than in the NL method, and roughly 70 times fewer if parameters are shared across groups in a grouped convolutional layer in HG-NL. (Rows: $W_q$, $W_k$, $W_v$; columns: NL under two settings of $\hat{m}$, HG-NL, and HG-NL with shared parameters.)

In a self-attention module, the response at a position is computed as a weighted average over all positions in an embedding space. As a representative module of the attention mechanism, the NL module of [8] is adopted here to aggregate frame-level features; it is able to capture long-range dependencies across the frames.

Let $F' = [f_1, f_2, \dots, f_n] \in \mathbb{R}^{m \times n}$, where $m$ is the length of each frame's feature vector. $F \in \mathbb{R}^{m \times n \times 1}$, which denotes the feature information of the $n$ frames, can be obtained by reshaping $F'$ to size $m \times n \times 1$ (corresponding to $C \times H \times W$). Then, an attention map of size $n \times n$ containing the relationships between every pair of frames can be obtained:

$$A = \mathrm{softmax}(Q^{T} K), \qquad (3)$$

where $Q = W_q F$, $K = W_k F$, and the weight matrices $W_q \in \mathbb{R}^{\hat{m} \times m}$ and $W_k \in \mathbb{R}^{\hat{m} \times m}$ are learned parameters. Commonly, $W_q$ and $W_k$ are implemented as $1 \times 1$ convolutions. The output based on the attention map $A$ is

$$F_o = V A, \qquad (4)$$

where $V = W_v F$ and the weight matrix $W_v \in \mathbb{R}^{m \times m}$ is also implemented as a $1 \times 1$ convolution. After this, $F_{weight}$ can be obtained:

$$F_{weight} = s \cdot F_o + F, \qquad (5)$$

where $s$ is a scale parameter and the output $F_{weight}$ has the same size as the input signal $F$. The video-level feature $F_v$ is obtained by evenly averaging $F_{weight}$:

$$F_v = \mathrm{avg}(F_{weight}). \qquad (6)$$

Figure 1 shows the schema of the NL module for frame-level feature aggregation.
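A minimal PyTorch sketch of this NL aggregation pipeline (Eqs. (3)-(6)) is given below. The reduced dimension m_hat of $W_q$ and $W_k$ and the batch handling are assumptions; the sketch only mirrors the formulation above.

```python
import torch
import torch.nn as nn

class NLAggregation(nn.Module):
    """Sketch of NL frame-feature aggregation, Eqs. (3)-(6). The reduced
    dimension m_hat is an assumption (the source does not fix its value)."""
    def __init__(self, m: int = 1024, m_hat: int = 512):
        super().__init__()
        # W_q, W_k, W_v as 1x1 convolutions over the (m, n) feature matrix
        self.w_q = nn.Conv1d(m, m_hat, kernel_size=1)
        self.w_k = nn.Conv1d(m, m_hat, kernel_size=1)
        self.w_v = nn.Conv1d(m, m, kernel_size=1)
        self.s = nn.Parameter(torch.zeros(1))             # scale s in Eq. (5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, n) frame-level features F
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        a = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (b, n, n), Eq. (3)
        f_o = v @ a                                       # (b, m, n), Eq. (4)
        f_weight = self.s * f_o + x                       # scale + residual, Eq. (5)
        return f_weight.mean(dim=2)                       # F_v = avg(F_weight), Eq. (6)
```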
Analysis of NL module

The NL module is effective for aggregating frame-level features. However, it exhibits low parameter efficiency and high computational cost. The number of parameters in the NL module is computed as follows. For the convolution layers corresponding to $W_q$, $W_k$ and $W_v$, the numbers of parameters are $(1 \times 1 \times m \times \hat{m} + \hat{m})$, $(1 \times 1 \times m \times \hat{m} + \hat{m})$ and $(1 \times 1 \times m \times m + m)$, respectively. When $m = 1024$, the parameter count of the NL module is as shown in Table 1: depending on the setting of $\hat{m}$, it is about 1.31M or about 2M. This is quite large for practical use. In contrast, many backbone networks have a very small number of parameters, such as MobileNetV2 [6] (3.4M), MobileNetV2-1.4 (6.9M), MF-Net-2D (5.8M), MF-Net-3D [2] (8.0M) and I3D-RGB [1] (12.1M).

As for the computational complexity, when $m = 1024$, the total number of multiply-adds (MAdds) required in the convolution layers of NL is about $1.31n$M and $2.1n$M for the two settings of $\hat{m}$ in Table 1 (for a $1 \times 1$ convolution, the MAdds roughly equal the weight count times the $n$ positions). Therefore, it makes sense to reduce the parameter redundancy and computational cost of the NL module.

Table 2. MAdds (multiply-adds) of NL and HG-NL. Each convolution layer in HG-NL has $g_1$ (= 16) or $g_2$ (= 8) times fewer MAdds than in NL; the MAdds of the other, non-convolution layers remain roughly unchanged. (Rows: $W_q F$, $W_k F$, $W_v F$, $Q^T K$ / $\mathrm{Gmm}(Q, K)$, $\mathrm{softmax}(\cdot)$ / $\mathrm{Relu}(\cdot)$, $VA$ / $\mathrm{Gmm}(V, A)$.)

Figure 2. Hierarchical Group-wise Non-local module ($g_1 = 2 g_2$). $V$, $Q$ and $K$ are obtained via grouped convolutions; $A$ and $F_o$ are obtained via grouped matrix multiplication.

In order to reduce parameter redundancy and computational cost, the Hierarchical Group-wise Non-local (HG-NL) module for frame-level feature aggregation is proposed. HG-NL has a hierarchical group-wise structure and generates several attention maps. The HG-NL module performs frame-level feature aggregation as follows.

Firstly, in HG-NL, the weight matrices $W_q$ and $W_k$ are implemented as $1 \times 1$ grouped convolutions with the number of groups being $g_1$. The grouped convolutions greatly reduce the parameters and the number of operations measured in MAdds.

After this, the attention map $A$ is computed:

$$A = \mathrm{Relu}(\mathrm{Gmm}(Q, K)), \qquad (7)$$

where $\mathrm{Gmm}(\cdot)$ denotes grouped matrix multiplication with the number of groups being $g_2$, and $A \in \mathbb{R}^{g_2 \times n \times n}$ comprises $g_2$ attention maps, each of size $n \times n$. As shown in Figure 2, the grouped matrix multiplication in Eq. (7) produces one attention map for each feature group in $V$, so the number of attention maps reaches $g_2$. This mines the non-local information in the features in more detail and more effectively, whereas in NL only a single attention map occurs. Besides, the softmax is removed in HG-NL: the computation of $\mathrm{Relu}(\cdot)$ in Eq. (7) is lightweight, and it still provides the non-linearity for the HG-NL module.

Then, keeping the same groups as in the grouped matrix multiplication of Eq. (7), the weight matrix $W_v$ is implemented as a $1 \times 1$ grouped convolution with the number of groups being $g_2$, and $F_o$ is computed via grouped matrix multiplication with $g_2$ groups:

$$F_o = \mathrm{Gmm}(V, A). \qquad (8)$$

At last, $F_v$ can be obtained from $F_o$ via Eq. (5) and Eq. (6) above. Figure 2 shows the schema of the HG-NL module ($g_1 = 2 g_2$). In general, let $g_1 = r \cdot g_2$, where $r$ is a ratio.
Then the relationship between $g_1$ (the primary grouped convolutions) and $g_2$ (the secondary grouped matrix multiplication) forms the hierarchical group-wise structure. Consider the values of $g_1$ and $g_2$. Even though multiple attention maps can mine the non-local information in more detail, each attention map covers too narrow a slice of the feature information if the group number is too big. On the other hand, a bigger group number means fewer parameters and MAdds. Therefore, $g_1$ and $g_2$ are in general set to different values. As a special case, when $g_1$ equals $g_2$, the effect of HG-NL is the same as processing each feature group of $F$ via an NL module individually.
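The HG-NL computation described above can be sketched in PyTorch as follows, assuming $\hat{m} = m$ as in the experiments (Section 3) and realizing the grouped matrix multiplication with batched views; this is one possible implementation, not necessarily the authors'.

```python
import torch
import torch.nn as nn

class HGNLAggregation(nn.Module):
    """Sketch of HG-NL, Eqs. (7)-(8): g1-grouped 1x1 convolutions for Q/K,
    a g2-grouped one for V, and g2-grouped matrix multiplication (Gmm)
    producing g2 attention maps. m_hat = m is assumed (Section 3)."""
    def __init__(self, m: int = 1024, g1: int = 16, g2: int = 8):
        super().__init__()
        assert m % g1 == 0 and m % g2 == 0
        self.g2 = g2
        self.w_q = nn.Conv1d(m, m, kernel_size=1, groups=g1)  # primary
        self.w_k = nn.Conv1d(m, m, kernel_size=1, groups=g1)  # grouped convs
        self.w_v = nn.Conv1d(m, m, kernel_size=1, groups=g2)
        self.s = nn.Parameter(torch.zeros(1))                 # scale of Eq. (5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, m, n = x.shape                                     # x: (b, m, n)
        # Secondary grouping: split channels into g2 groups for Gmm
        q = self.w_q(x).view(b, self.g2, m // self.g2, n)
        k = self.w_k(x).view(b, self.g2, m // self.g2, n)
        v = self.w_v(x).view(b, self.g2, m // self.g2, n)
        # Eq. (7): one n x n attention map per group; ReLU replaces softmax
        a = torch.relu(q.transpose(2, 3) @ k)                 # (b, g2, n, n)
        f_o = (v @ a).reshape(b, m, n)                        # Gmm(V, A), Eq. (8)
        f_weight = self.s * f_o + x                           # Eq. (5)
        return f_weight.mean(dim=2)                           # F_v, Eq. (6)
```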
Analysis of HG-NL module

For the convolution layers corresponding to $W_q$, $W_k$ and $W_v$, the numbers of parameters are $g_1 \times (1 \times 1 \times (m/g_1) \times (\hat{m}/g_1) + \hat{m}/g_1)$, $g_1 \times (1 \times 1 \times (m/g_1) \times (\hat{m}/g_1) + \hat{m}/g_1)$ and $g_2 \times (1 \times 1 \times (m/g_2) \times (m/g_2) + m/g_2)$, respectively. As shown in Table 1, when $m = 1024$, $g_1 = 16$ and $g_2 = 8$, HG-NL requires about 8 to 14 times fewer parameters than NL, which has roughly 1.31M to 2.1M parameters depending on $\hat{m}$. If parameters are shared across the groups of a grouped convolutional layer in HG-NL, the parameter count of each convolution layer is further reduced by a factor of $g_1$ or $g_2$. Besides, as shown in Table 2, when $m = 1024$, each convolution layer in HG-NL requires $g_1$ or $g_2$ times fewer MAdds than its counterpart in NL, while the MAdds of the other, non-convolution layers remain roughly unchanged. Thus, HG-NL reduces the model redundancy and computational cost while achieving accuracy competitive with NL.
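Continuing the two sketches above, the parameter gap can be checked directly; the exact ratio depends on the assumed m_hat, so the numbers below only illustrate the grouped-convolution saving.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

nl = NLAggregation(m=1024, m_hat=512)       # ~2.1M parameters
hg = HGNLAggregation(m=1024, g1=16, g2=8)   # ~0.27M parameters
print(count_params(nl), count_params(hg))   # roughly an 8x reduction here
```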
Since no fully-connected layers are included in the network architecture of HG-NL, $n$ (the number of frames) can be adjusted arbitrarily. Thus, in the evaluation phase of the proposed HG-NL module, the number of frames selected from a video for predicting the label does not need to be fixed to the same value as in the training phase.
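For example, with the HG-NL sketch above, the same module instance can score 3-frame training inputs and 25-frame evaluation inputs without any change:

```python
import torch

agg = HGNLAggregation(m=1024, g1=16, g2=8)
train_feats = torch.randn(4, 1024, 3)    # n = 3 frames per video (training)
test_feats = torch.randn(4, 1024, 25)    # n = 25 frames (evaluation)
assert agg(train_feats).shape == agg(test_feats).shape == (4, 1024)
```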
3. Experiments
In this section, we report some experimental results of our method on the Kinetics-700 dataset. All models are pre-trained on the Kinetics-600 training set and fine-tuned on the Kinetics-700 training set. SE-ResNeXt-101 is adopted as the backbone network. Due to the limited time, we exploit only RGB information. In our experiments, the full-length video is divided into several equal segments, and some frames are randomly selected from each segment. During training, the number of segments is set to 3 and one frame is randomly selected from each segment. During evaluation, we follow the same testing setup as in TSN [7].
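The segment-wise sampling described above can be sketched as follows; the helper is hypothetical, not the authors' pipeline, and assumes the clip has at least as many frames as segments.

```python
import random

def sample_training_frames(num_frames: int, num_segments: int = 3) -> list:
    """Split the clip into equal segments and draw one random frame index
    from each, as in the training setup above."""
    seg_len = num_frames // num_segments
    return [random.randrange(i * seg_len, (i + 1) * seg_len)
            for i in range(num_segments)]

# e.g. a 300-frame (~10 s) Kinetics clip -> one index from each third
print(sample_training_frames(300))
```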
In the TSN experiments, the initial learning rate is set to 0.001 and decayed by a factor of 10 at epoch 20 and epoch 30. Training runs for a maximum of 40 epochs.
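In PyTorch terms, this schedule corresponds to a standard multi-step decay; the optimizer choice and momentum below are assumptions, and only the learning-rate values come from the text.

```python
import torch

net = torch.nn.Linear(1024, 700)  # placeholder for the actual video network
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[20, 30], gamma=0.1)  # divide lr by 10 at epochs 20 and 30

for epoch in range(40):           # maximum of 40 epochs
    # ... one training epoch ...
    scheduler.step()
```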
In the HG-NL experiments, $\hat{m} = m$, $g_1 = 16$ and $g_2 = 8$. Due to time limits, we fine-tuned the HG-NL on the Kinetics-700 training set for only 8 epochs, starting from the model pre-trained by TSN above. The initial learning rate is set to 0.001 and decayed by a factor of 10 at epoch 4 and epoch 6; training runs for a maximum of 8 epochs. The results are shown in Table 3: HG-NL obtains a top-1 accuracy of 62.12%, compared with 61.83% for TSN, on the Kinetics-700 validation set.

For StNet [4], the Temporal Modeling Block and the Temporal Xception Block are used in our network, and we adopt the same input as TSN. Because of the time limits, we trained the network for only 20 epochs on the Kinetics-700 dataset. On the Kinetics-700 validation set, StNet reaches 55.7% top-1 and 78.3% top-5 accuracy under the train-phase setting (3-frame test).
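The test-phase protocol (25 segments, Table 3) can be sketched as follows; `backbone`, `agg` and a linear `classifier` are the hypothetical pieces from the earlier sketches, and even spacing of the 25 frames is an assumption about the TSN-style setup.

```python
import torch

def predict_video(frames, backbone, agg, classifier, num_segments=25):
    """frames: (num_frames, 3, H, W) for one clip; returns class scores."""
    idx = torch.linspace(0, frames.shape[0] - 1, num_segments).long()
    feats = backbone(frames[idx])              # (25, m) frame features
    video_feat = agg(feats.t().unsqueeze(0))   # (1, m) via NL/HG-NL aggregation
    return torch.softmax(classifier(video_feat), dim=-1)  # (1, 700)
```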
Three models are involved in the model ensemble stage: TSN [7], HG-NL and StNet [4]. Our team finally obtains an averaged top-1 and top-5 error percentage of 28.444% on the Kinetics-700 test set.

Table 3. Results of the models on the Kinetics-700 validation set.

Model | Val top-1 (%), train phase (3 segments) | Val top-1 (%), test phase (25 segments)
TSN | 57.38 | 61.83
HG-NL | 57.713 | 62.12
StNet | 55.7 | -

Table 4. Results on the Kinetics-700 test set. The avg. error is an averaged top-1 and top-5 error.

Model | avg. error
Model Ensemble | 0.28444
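The report does not specify the fusion scheme used for the ensemble; a plain equal-weight averaging of per-model softmax scores, as sketched below, is one common choice and is an assumption here.

```python
import torch

def ensemble_scores(score_list):
    """score_list: per-model (num_videos, 700) softmax score tensors."""
    return torch.stack(score_list).mean(dim=0)

# fused = ensemble_scores([tsn_scores, hgnl_scores, stnet_scores])
# top1 = fused.argmax(dim=1); top5 = fused.topk(5, dim=1).indices
```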
4. Conclusion
In this report, our team's solution to the task of the ActivityNet 2019 Kinetics-700 challenge is described. Experimental results have evidenced the effectiveness of the proposed HG-NL method: HG-NL achieves better accuracy than the TSN baseline. With the help of the hierarchical group-wise structure, the HG-NL module has 8 to 70 times fewer parameters and several times smaller computational complexity than the NL module. After model ensemble, our team finally obtains an averaged top-1 and top-5 error percentage of 28.444% on the Kinetics-700 test set.
References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299-6308, 2017.
[2] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. In The European Conference on Computer Vision (ECCV), September 2018.
[3] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
[4] D. He, Z. Zhou, C. Gan, F. Li, X. Liu, Y. Li, L. Wang, and S. Wen. StNet: Local and global spatial-temporal modeling for action recognition. arXiv preprint arXiv:1811.01549, 2018.
[5] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533-5541, 2017.
[6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510-4520, 2018.
[7] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20-36. Springer, 2016.
[8] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794-7803, 2018.