CBR-Net: Cascade Boundary Refinement Network for Action Detection: Submission to ActivityNet Challenge 2020 (Task 1)

Xiang Wang, Baiteng Ma, Zhiwu Qing, Yongpeng Sang, Changxin Gao, Shiwei Zhang*, Nong Sang*

School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
School of Cyber Science and Engineering, Huazhong University of Science and Technology
DAMO Academy, Alibaba Group

{u201613707, btm, qzw, ypsang, cgao, nsang}@[email protected]

Abstract
In this report, we present our solution for the task of temporal action localization (detection) (Task 1) in the ActivityNet Challenge 2020. The purpose of this task is to temporally localize the intervals where actions of interest occur and predict the action categories in a long untrimmed video. Our solution mainly includes three components: 1) feature encoding: we apply three kinds of backbones, including TSN [6], SlowFast [2] and I3D [1], all pretrained on the Kinetics dataset [1]. With these models, we extract snippet-level video representations; 2) proposal generation: we choose BMN [4] as our baseline, based on which we design a Cascade Boundary Refinement Network (CBR-Net) to conduct proposal detection. CBR-Net mainly contains two modules: a temporal feature encoding module, which applies a BiLSTM to encode long-term temporal information, and a CBR module, which refines the proposal precision under different parameter settings; 3) action localization: in this stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal. Moreover, we apply different ensemble strategies to improve the performance of the designed solution, by which we achieve 42.788% on the testing set of the ActivityNet v1.3 dataset in terms of the mean Average Precision metric and achieve Rank 1 in the competition.
1. The proposed method
In this section, we present our solution in detail. Firstly, we introduce the models applied for video representation. Secondly, we present the proposed CBR-Net for proposal generation. Thirdly, we discuss the ensemble strategies in our solution.

* Corresponding authors.
We first extract features from image frames, and these features are used as the input of CBR-Net. We extract features from input videos via the TSN [6], I3D [1] and SlowFast [2] models, and form them into a 2D temporal feature sequence. Following the previous works [5, 4, 3, 7], we construct the feature sequences at the same temporal scale to make the training operation more efficient.
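As an illustration of what constructing feature sequences at the same temporal scale involves, the sketch below linearly interpolates a variable-length snippet feature sequence to a fixed length, a common practice in the cited works. The function name and target length are our own illustrative choices, not the authors' code.

```python
def rescale_feature_sequence(features, target_len=100):
    """Linearly interpolate a variable-length feature sequence
    (list of per-snippet feature vectors) to a fixed temporal scale."""
    src_len = len(features)
    dim = len(features[0])
    out = []
    for t in range(target_len):
        # Map target index t to a fractional source position.
        pos = t * (src_len - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * features[lo][d] + w * features[hi][d]
                    for d in range(dim)])
    return out

# Usage: a 7-snippet, 2-D feature sequence rescaled to 4 snippets.
seq = [[float(i), float(2 * i)] for i in range(7)]
fixed = rescale_feature_sequence(seq, target_len=4)
```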
Temporal Segment Network.
Temporal segment network (TSN [6]) is a simple but efficient framework for the action recognition task, based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy with video-level supervision to enable efficient and effective model learning using the whole action video. TSN samples a fixed number of sparse segments from one video to model the long-term temporal structure, and the final video-level prediction is the average of the logits of each clip. In this competition, we experimented with the temporal segment network by sampling 16 frames for each clip.
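The sparse sampling strategy above can be sketched as follows. This is an illustrative helper (names and details are our own), not the TSN implementation:

```python
import random

def tsn_sample_indices(num_frames, num_segments, deterministic=False):
    """TSN-style sparse sampling: split the video into equal segments
    and pick one frame index from each, covering the whole duration."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = k * seg_len
        if deterministic:
            # Center of each segment (typical at test time).
            idx = int(start + seg_len / 2)
        else:
            # Random offset inside the segment (typical at training time).
            idx = int(start + random.random() * seg_len)
        indices.append(min(idx, num_frames - 1))
    return indices
```

Because one index is drawn per segment, the samples always span the full video, which is what lets TSN model long-range structure from only a handful of frames.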
Inflated 3D ConvNet.
Inflated 3D ConvNet (I3D [1]) is based on 2D ConvNet inflation: the filters and pooling kernels of a 2D CNN are expanded to 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet-pretrained architecture designs and even their parameters. In this competition, we experimented with the Inflated 3D ConvNet, sampling 16 frames from one clip to extract the features. At the same time, we use a Kinetics-pretrained model to initialize our I3D network.
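The inflation idea can be shown with a minimal sketch: a 2D kernel is repeated along time and rescaled so that a temporally constant video produces the same activations as the 2D network did on a single frame. The helper below is our own illustration, not the I3D code:

```python
def inflate_2d_kernel(kernel_2d, time_depth):
    """Inflate a 2D conv kernel (H x W, nested lists) to 3D (T x H x W)
    by repeating it along time and dividing by the temporal depth, so
    summing the 3D kernel over time recovers the original 2D kernel."""
    return [[[w / time_depth for w in row] for row in kernel_2d]
            for _ in range(time_depth)]

# Usage: inflate a 2x2 pretrained 2D kernel to temporal depth 4.
k2d = [[1.0, 2.0], [3.0, 4.0]]
k3d = inflate_2d_kernel(k2d, time_depth=4)
```

This rescaling is what lets ImageNet-pretrained 2D weights serve directly as the initialization of the 3D network.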
Slowfast Network.
Slowfast network (SlowFast [2]) involves (i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can learn useful temporal information for video recognition. For details about the architecture, please refer to the original publication [2]. In this competition, we keep the frame rate of the input videos at 15 FPS to extract video frames, and each clip input to the SlowFast network [2] contains 32 frames.

Figure 1. The framework of CBR-Net. First, the extracted features are the input of the Proposal Subnet, whose target is to generate proposals with a high recall rate. Then the obtained proposals and the extracted features are input into the Boundary Refinement Subnet together for fine-tuning the boundary.
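The two-rate sampling can be sketched as below. The report fixes 32-frame clips at 15 FPS; the speed ratio alpha = 4 and the helper itself are illustrative assumptions, not details from the report:

```python
def slowfast_sample(num_frames, clip_len_fast=32, alpha=4):
    """Sample frame indices for the two SlowFast pathways: the Fast
    pathway takes clip_len_fast evenly strided frames, and the Slow
    pathway keeps every alpha-th of them (alpha x lower frame rate)."""
    stride = max(num_frames // clip_len_fast, 1)
    fast = [min(i * stride, num_frames - 1) for i in range(clip_len_fast)]
    slow = fast[::alpha]
    return slow, fast

# Usage: a 128-frame clip yields 32 Fast and 8 Slow frame indices.
slow, fast = slowfast_sample(128)
```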
In this section, we introduce the CBR-Net designed in the competition, as shown in Figure 1. CBR-Net mainly contains two components, a temporal feature encoding module and a CBR module, which will be introduced in detail in this section. CBR-Net is designed based on the Boundary Matching Network (BMN [4]), so we first present an overview of BMN to make the report easier to understand.
Boundary Matching Network.
BMN is mainly composed of two modules: a temporal evaluation module and a proposal evaluation module. The goal of the temporal evaluation module is to evaluate the starting and ending probabilities for all temporal locations in the untrimmed video by constructing two temporal 1D convolutional layers on the feature maps. These boundary probability sequences are used for generating proposals during post-processing. The goal of the proposal evaluation module is to generate a Boundary-Matching (BM) confidence map, which contains confidence scores for densely distributed proposals. The BM confidence map takes the starting time point of an action as its x-coordinate and the duration of the action as its y-coordinate. The temporal evaluation module and the proposal evaluation module are jointly trained in a unified framework in BMN.
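The layout of the BM confidence map can be illustrated with a small decoding sketch. The indexing convention (rows as durations, columns as starting points) follows the axis description above, but the helper itself is our own illustration rather than BMN code:

```python
def bm_map_to_proposals(bm_map, tscale):
    """Decode a BM confidence map (bm_map[duration][start]) into
    proposals. The entry at (d, s) scores the proposal starting at
    snippet s and lasting d+1 snippets; times are normalized to [0, 1]."""
    proposals = []
    for d, row in enumerate(bm_map):
        for s, score in enumerate(row):
            start, end = s, s + d + 1
            if end <= tscale:  # discard proposals running past the video
                proposals.append((start / tscale, end / tscale, score))
    return proposals

# Usage: a toy 2-duration x 4-start map over 4 snippets.
bm = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
props = bm_map_to_proposals(bm, tscale=4)
```

Note the dense enumeration: every valid (start, duration) pair gets a confidence score, which is why enlarging the map for finer boundaries is expensive.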
Temporal Feature Encoding module.
The Temporal Feature Encoding module is a simple but effective module in our CBR-Net. The goal of this module is to encode temporal information, mainly by constructing BiLSTM layers on the feature map to excavate the relationships between different time points in the feature map.
Cascade Boundary Refinement module.
By analyzing the BMN network, we can find that the lengths and boundaries of the proposals output by BMN are fixed. To obtain a more precise boundary, the size of the BM confidence map in BMN needs to be increased, which poses a computational challenge and makes BMN inefficient. To solve this problem, we propose the Cascade Boundary Refinement (CBR) module. The CBR module takes the coarse-boundary proposals output from the Proposal Subnet as input; the details of the CBR module are shown in Figure 2. The goal of the CBR module is to output proposals with finer boundaries.
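A minimal sketch of the cascade idea follows: each head predicts boundary offsets from the current proposal, and later heads operate on already-refined boundaries. The heads here are toy closures (the real heads are learned networks), and the scores echo the 0.43 / 0.72 / 0.94 progression shown in Figure 2:

```python
def cascade_refine(proposal, offset_heads):
    """Cascade boundary refinement sketch: each head consumes the
    current (start, end) and returns (d_start, d_end, score); stages
    are applied sequentially so later heads see refined boundaries."""
    start, end = proposal
    scores = []
    for head in offset_heads:
        d_start, d_end, score = head(start, end)
        start, end = start + d_start, end + d_end
        scores.append(score)
    return (start, end), scores

# Toy heads (hypothetical): each nudges boundaries halfway toward (10, 20).
def make_head(step, score):
    def head(start, end):
        return (10.0 - start) * step, (20.0 - end) * step, score
    return head

refined, scores = cascade_refine((8.0, 24.0), [make_head(0.5, 0.43),
                                               make_head(0.5, 0.72),
                                               make_head(0.5, 0.94)])
```

The cascade structure refines boundaries without enlarging the BM confidence map, which is the efficiency argument made above.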
In the competition, we use our Cascade Boundary Refinement Network (CBR-Net) and the previous state-of-the-art works, the Boundary Sensitive Network (BSN [5]) and the Boundary Matching Network (BMN [4]), to conduct a model ensemble. We then integrate all of the model results to obtain the final results. A multi-feature fusion strategy is also used in the competition. In particular, we found that the ensemble strategies are highly effective for improving detection performance in the competition.
Figure 2. Details of the Boundary Refinement Subnet. For the input proposals, this subnet outputs the results of fine-tuning the boundary and confidence through the multi-layer cascade structure. Here, a three-layer structure is shown.
Method           Validation (mAP)   Testing (mAP)
BSN (baseline)   30.03              32.84
BSN              32.8               -
BMN (baseline)   33.85              36.42
BMN              36.5               -
CBR-Net          38.0               -
Ensemble         40.1               42.788
Table 1. Temporal action detection results. Performance comparison between models and the final test results.
2. Experiments
ActivityNet-1.3 is a large dataset for general temporal action detection, which contains 19,994 videos annotated with 200 action classes and was used in the ActivityNet Challenge 2016, 2017, 2018, and 2019. ActivityNet-1.3 is divided into training, validation, and testing sets by a ratio of 2:1:1. In the competition, we train the model on the original training split and verify the model results on the original validation split. The result of the competition is the result of the model on the testing set, whose labels are not disclosed.
In the temporal action detection task, mean Average Precision (mAP) is adopted as the evaluation metric, where we calculate the Average Precision (AP) on each action category respectively. On ActivityNet-1.3, the mAP averaged over the tIoU thresholds [0.5:0.05:0.95] is used.
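Concretely, the temporal IoU between a proposal and a ground-truth segment, and the threshold grid being averaged over, can be written as:

```python
def temporal_iou(p, g):
    """IoU between two temporal segments p=(start, end), g=(start, end)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

# ActivityNet averages mAP over ten tIoU thresholds 0.5, 0.55, ..., 0.95.
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

A proposal counts as a true positive at a given threshold only if its temporal IoU with an unmatched ground-truth segment of the same class reaches that threshold.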
Figure 3. Visualization examples of proposals on the ActivityNet-1.3 dataset.
In this section, we compare the performance of several current advanced models with that of our CBR-Net in the competition. As shown in Table 1, the rows marked "(baseline)" denote the results reported in the original papers; these baselines can be greatly improved by using our feature extraction and some model training and fusion tricks. Among them, BSN improved by 2.77% mAP compared to the original paper, and BMN improved by 2.65% mAP. In particular, we argue that our proposed CBR-Net is well suited for the task of temporal action detection, and it turns out that our CBR-Net exceeds the current state-of-the-art models, achieving 38.0% mAP. Figure 3 shows examples of proposals generated by our CBR-Net. In the competition, on the testing set, our final results integrated the results of BSN, BMN, and CBR-Net, and finally reached 42.788% mAP.
From the results in Table 1, we can find that CBR-Net outperforms the previous state-of-the-art method by 1.5%, which further demonstrates the effectiveness of the proposed method. Moreover, in the ensemble stage, we can exploit the complementarity among the different methods, i.e., BMN, BSN and CBR-Net, to improve the detection performance.
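One simple way such complementarity can be exploited, sketched under our own assumptions (the report does not specify its fusion procedure), is to pool the proposals of all models with per-model weights and merge near-duplicates:

```python
def ensemble_proposals(model_outputs, weights, iou_thresh=0.9):
    """Hypothetical ensemble sketch: pool proposals from all models
    with per-model weights on their confidence scores, then greedily
    drop near-duplicates (tIoU above iou_thresh), keeping the highest
    weighted score. model_outputs is one list of (start, end, score)
    tuples per model."""
    pooled = [(s, e, sc * w)
              for props, w in zip(model_outputs, weights)
              for (s, e, sc) in props]
    pooled.sort(key=lambda p: -p[2])
    kept = []
    for s, e, sc in pooled:
        if all(_tiou((s, e), (ks, ke)) < iou_thresh for ks, ke, _ in kept):
            kept.append((s, e, sc))
    return kept

def _tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Usage: two models agree on one proposal and disagree on another.
outs = [[(0.0, 10.0, 0.9), (20.0, 30.0, 0.4)], [(0.0, 10.0, 0.8)]]
kept = ensemble_proposals(outs, weights=[0.5, 0.5])
```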
3. Conclusion
In this work, we propose a novel action detection network enhanced with a temporal encoding module and a CBR module for the temporal action detection task. The experimental results show that our CBR-Net solution significantly improves the detection performance. At the same time, in the competition, we also ensemble some other previous networks for better performance.

References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[2] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211, 2019.
[3] C. Lin, J. Li, Y. Wang, Y. Tai, D. Luo, Z. Cui, C. Wang, J. Li, F. Huang, and R. Ji. Fast learning of temporal action proposal via dense boundary generator. arXiv preprint arXiv:1911.04127, 2019.
[4] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3889–3898, 2019.
[5] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[6] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[7] S. Zhang, H. Peng, L. Yang, J. Fu, and J. Luo. Learning sparse 2D temporal adjacent networks for temporal action localization. arXiv preprint arXiv:1912.03612, 2019.