CBR-Net: Cascade Boundary Refinement Network for Action Detection: Submission to ActivityNet Challenge 2020 (Task 1)

Xiang Wang, Baiteng Ma, Zhiwu Qing, Yongpeng Sang, Changxin Gao, Shiwei Zhang*, Nong Sang*

School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
School of Cyber Science and Engineering, Huazhong University of Science and Technology
DAMO Academy, Alibaba Group

{u201613707, btm, qzw, ypsang, cgao, nsang}@[email protected]

Abstract
In this report, we present our solution for the task of temporal action localization (detection) (Task 1) in the ActivityNet Challenge 2020. The purpose of this task is to temporally localize the intervals where actions of interest occur and predict the action categories in a long untrimmed video. Our solution mainly includes three components: 1) feature encoding: we apply three kinds of backbones, including TSN [6], SlowFast [2] and I3D [1], all pretrained on the Kinetics dataset [1]. With these models, we extract snippet-level video representations; 2) proposal generation: we choose BMN [4] as our baseline, based on which we design a Cascade Boundary Refinement Network (CBR-Net) to conduct proposal detection. CBR-Net mainly contains two modules: a temporal feature encoding module, which applies a BiLSTM to encode long-term temporal information, and a CBR module, which refines the proposal precision under different parameter settings; 3) action localization: in this stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal. Moreover, we apply different ensemble strategies to improve the performance of the designed solution, by which we achieve 42.788% on the testing set of the ActivityNet v1.3 dataset in terms of the mean Average Precision metric and achieve Rank 1 in the competition.
1. The proposed method
In this section, we present our solution in detail. Firstly, we introduce the models applied for video representation. Secondly, we present the proposed CBR-Net for proposal generation. Thirdly, we discuss the ensemble strategies in our solution.

* Corresponding authors.
We first extract features from image frames, and these features are used as the input of CBR-Net. We extract features from input videos via the TSN [6], I3D [1] and SlowFast [2] models, and form them into a 2D temporal feature sequence. Following the previous works [5, 4, 3, 7], we construct the feature sequences at the same temporal scale to make the training operation more efficient.
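As an illustration of what constructing feature sequences at the same temporal scale involves, the sketch below linearly interpolates a variable-length snippet feature sequence to a fixed length, a common practice in the cited works. The function name and target length are our own illustrative choices, not the authors' code.

```python
def rescale_feature_sequence(features, target_len=100):
    """Linearly interpolate a variable-length feature sequence
    (list of per-snippet feature vectors) to a fixed temporal scale."""
    src_len = len(features)
    dim = len(features[0])
    out = []
    for t in range(target_len):
        # Map target index t to a fractional source position.
        pos = t * (src_len - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * features[lo][d] + w * features[hi][d]
                    for d in range(dim)])
    return out

# Usage: a 7-snippet, 2-D feature sequence rescaled to 4 snippets.
seq = [[float(i), float(2 * i)] for i in range(7)]
fixed = rescale_feature_sequence(seq, target_len=4)
```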
Temporal Segment Network.
Temporal segment network (TSN [6]) is a simple but efficient framework for the action recognition task, based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy with video-level supervision to enable efficient and effective model learning using the whole action video. TSN samples a fixed number of sparse segments from one video to model the long-term temporal structure, and the final video-level prediction is the average of the logits of each clip. In this competition, we experimented with the temporal segment network by sampling 16 frames for each clip.
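The sparse sampling strategy above can be sketched as follows. This is an illustrative helper (names and details are our own), not the TSN implementation:

```python
import random

def tsn_sample_indices(num_frames, num_segments, deterministic=False):
    """TSN-style sparse sampling: split the video into equal segments
    and pick one frame index from each, covering the whole duration."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = k * seg_len
        if deterministic:
            # Center of each segment (typical at test time).
            idx = int(start + seg_len / 2)
        else:
            # Random offset inside the segment (typical at training time).
            idx = int(start + random.random() * seg_len)
        indices.append(min(idx, num_frames - 1))
    return indices
```

Because one index is drawn per segment, the samples always span the full video, which is what lets TSN model long-range structure from only a handful of frames.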
Inflated 3D ConvNet.
Inflated 3D ConvNet (I3D [1]) is based on 2D ConvNet inflation: the filters and pooling kernels of a 2D CNN are expanded to 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet-pretrained architecture designs and even their parameters. In this competition, we experimented with the Inflated 3D ConvNet, sampling 16 frames from one clip to extract the features. At the same time, we use a Kinetics-pretrained model to initialize our I3D network.
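The inflation idea can be shown with a minimal sketch: a 2D kernel is repeated along time and rescaled so that a temporally constant video produces the same activations as the 2D network did on a single frame. The helper below is our own illustration, not the I3D code:

```python
def inflate_2d_kernel(kernel_2d, time_depth):
    """Inflate a 2D conv kernel (H x W, nested lists) to 3D (T x H x W)
    by repeating it along time and dividing by the temporal depth, so
    summing the 3D kernel over time recovers the original 2D kernel."""
    return [[[w / time_depth for w in row] for row in kernel_2d]
            for _ in range(time_depth)]

# Usage: inflate a 2x2 pretrained 2D kernel to temporal depth 4.
k2d = [[1.0, 2.0], [3.0, 4.0]]
k3d = inflate_2d_kernel(k2d, time_depth=4)
```

This rescaling is what lets ImageNet-pretrained 2D weights serve directly as the initialization of the 3D network.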
Slowfast Network.
Slowfast network (SlowFast [2]) involves (i) a Slow pathway, operating at a low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it can learn useful temporal information for video recognition. For details about the architecture, please refer to the original publication [2]. In this competition, we keep the frame rate of the input videos at 15 FPS to extract video frames, and each clip input to the SlowFast network [2] contains 32 frames.

Figure 1. The framework of CBR-Net. First, the extracted features are the input of the Proposal Subnet, whose target is to generate proposals with a high recall rate. Then the obtained proposals and the extracted features are input into the Boundary Refinement Subnet together for fine-tuning the boundary.
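The two-rate sampling can be sketched as below. The report fixes 32-frame clips at 15 FPS; the speed ratio alpha = 4 and the helper itself are illustrative assumptions, not details from the report:

```python
def slowfast_sample(num_frames, clip_len_fast=32, alpha=4):
    """Sample frame indices for the two SlowFast pathways: the Fast
    pathway takes clip_len_fast evenly strided frames, and the Slow
    pathway keeps every alpha-th of them (alpha x lower frame rate)."""
    stride = max(num_frames // clip_len_fast, 1)
    fast = [min(i * stride, num_frames - 1) for i in range(clip_len_fast)]
    slow = fast[::alpha]
    return slow, fast

# Usage: a 128-frame clip yields 32 Fast and 8 Slow frame indices.
slow, fast = slowfast_sample(128)
```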
In this section, we introduce the CBR-Net designed in the competition, as shown in Figure 1. CBR-Net mainly contains two components, a temporal feature encoding module and a CBR module, which will be introduced in detail in this section. CBR-Net is designed based on the Boundary Matching Network (BMN [4]), so we first present an overview of BMN to make the report easier to understand.
Boundary Matching Network.
BMN is mainly composed of two modules: a temporal evaluation module and a proposal evaluation module. The goal of the temporal evaluation module is to evaluate the starting and ending probabilities for all temporal locations in the untrimmed video by constructing two temporal 1D convolutional layers on the feature maps. These boundary probability sequences are used for generating proposals during post-processing. The goal of the proposal evaluation module is to generate a Boundary-Matching (BM) confidence map, which contains confidence scores for densely distributed proposals. The BM confidence map takes the starting time point of an action as its x-coordinate and the duration of the action as its y-coordinate. The temporal evaluation module and the proposal evaluation module are jointly trained in a unified framework in BMN.
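The layout of the BM confidence map can be illustrated with a small decoding sketch. The indexing convention (rows as durations, columns as starting points) follows the axis description above, but the helper itself is our own illustration rather than BMN code:

```python
def bm_map_to_proposals(bm_map, tscale):
    """Decode a BM confidence map (bm_map[duration][start]) into
    proposals. The entry at (d, s) scores the proposal starting at
    snippet s and lasting d+1 snippets; times are normalized to [0, 1]."""
    proposals = []
    for d, row in enumerate(bm_map):
        for s, score in enumerate(row):
            start, end = s, s + d + 1
            if end <= tscale:  # discard proposals running past the video
                proposals.append((start / tscale, end / tscale, score))
    return proposals

# Usage: a toy 2-duration x 4-start map over 4 snippets.
bm = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
props = bm_map_to_proposals(bm, tscale=4)
```

Note the dense enumeration: every valid (start, duration) pair gets a confidence score, which is why enlarging the map for finer boundaries is expensive.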
Temporal Feature Encoding module.
The Temporal Feature Encoding module is a simple but effective module in our CBR-Net. The goal of this module is to encode temporal information, mainly by constructing BiLSTM layers on the feature map to excavate the relationships between different time points in the feature map.
Cascade Boundary Refinement module.
By analyzing the BMN network, we can find that the lengths and boundaries of the proposals output by BMN are fixed. To obtain a more precise boundary, the size of the BM confidence map in BMN needs to be increased, which poses a computational challenge and makes BMN inefficient. To solve this problem, we propose the Cascade Boundary Refinement (CBR) module. The CBR module takes the coarse-boundary proposals output from the Proposal Subnet as input; the details of the CBR module are shown in Figure 2. The goal of the CBR module is to output proposals with finer boundaries.
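A minimal sketch of the cascade idea follows: each head predicts boundary offsets from the current proposal, and later heads operate on already-refined boundaries. The heads here are toy closures (the real heads are learned networks), and the scores echo the 0.43 / 0.72 / 0.94 progression shown in Figure 2:

```python
def cascade_refine(proposal, offset_heads):
    """Cascade boundary refinement sketch: each head consumes the
    current (start, end) and returns (d_start, d_end, score); stages
    are applied sequentially so later heads see refined boundaries."""
    start, end = proposal
    scores = []
    for head in offset_heads:
        d_start, d_end, score = head(start, end)
        start, end = start + d_start, end + d_end
        scores.append(score)
    return (start, end), scores

# Toy heads (hypothetical): each nudges boundaries halfway toward (10, 20).
def make_head(step, score):
    def head(start, end):
        return (10.0 - start) * step, (20.0 - end) * step, score
    return head

refined, scores = cascade_refine((8.0, 24.0), [make_head(0.5, 0.43),
                                               make_head(0.5, 0.72),
                                               make_head(0.5, 0.94)])
```

The cascade structure refines boundaries without enlarging the BM confidence map, which is the efficiency argument made above.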
In the competition, we use our Cascade Boundary Refinement Network (CBR-Net) and the previous state-of-the-art works, the Boundary Sensitive Network (BSN [5]) and the Boundary Matching Network (BMN [4]), to conduct a model ensemble. We then integrate all of the model results to obtain the final results. A multi-feature fusion strategy is also used in the competition. In particular, we found that the ensemble strategies are highly effective for improving detection performance in the competition.
Figure 2. Details of the Boundary Refinement Subnet. For the input proposals, this subnet outputs the results of fine-tuning the boundary and confidence through the multi-layer cascade structure. Here, a three-layer structure is shown.
Method           Validation (mAP)   Testing (mAP)
BSN (baseline)   30.03              32.84
BSN              32.8               -
BMN (baseline)   33.85              36.42
BMN              36.5               -
CBR-Net          38.0               -
Ensemble         40.1               42.788
Table 1. Temporal action detection results. Performance comparison between models and the final test results.
2. Experiments
ActivityNet-1.3 is a large dataset for general temporal action detection, which contains 19,994 videos annotated with 200 action classes and was used in the ActivityNet Challenge 2016, 2017, 2018, and 2019. ActivityNet-1.3 is divided into training, validation, and testing sets by a ratio of 2:1:1. In the competition, we train the model on the original training split and verify the model results on the original validation split. The result of the competition is the result of the model on the testing set, whose labels are not disclosed.
In the temporal action detection task, mean Average Precision (mAP) is adopted as the evaluation metric, where we calculate the Average Precision (AP) on each action category respectively. On ActivityNet-1.3, the mAP averaged over the tIoU thresholds [0.5:0.05:0.95] is used.
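Concretely, the temporal IoU between a proposal and a ground-truth segment, and the threshold grid being averaged over, can be written as:

```python
def temporal_iou(p, g):
    """IoU between two temporal segments p=(start, end), g=(start, end)."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

# ActivityNet averages mAP over ten tIoU thresholds 0.5, 0.55, ..., 0.95.
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]
```

A proposal counts as a true positive at a given threshold only if its temporal IoU with an unmatched ground-truth segment of the same class reaches that threshold.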
Figure 3. Visualization examples of proposals on the ActivityNet-1.3 dataset.
In this section, we compare the performance of several current advanced models with that of our CBR-Net in the competition. As shown in Table 1, the rows marked "(baseline)" denote the results reported in the original papers; these baselines can be greatly improved by using our feature extraction and some model training and fusion tricks. Among them, BSN improved by 2.77% mAP compared to the original paper, and BMN improved by 2.65% mAP. In particular, we argue that our proposed CBR-Net is well suited for the task of temporal action detection, and it turns out that our CBR-Net exceeds the current state-of-the-art models, achieving 38.0% mAP. Figure 3 shows examples of proposals generated by our CBR-Net. In the competition, on the testing set, our final results integrated the results of BSN, BMN, and CBR-Net, and finally reached 42.788% mAP.
From the results in Table 1, we can find that CBR-Net outperforms the previous state-of-the-art method by 1.5%, which further demonstrates the effectiveness of the proposed method. Moreover, in the ensemble stage, we can exploit the complementarity among the different methods, i.e., BMN, BSN and CBR-Net, to improve the detection performance.
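One simple way such complementarity can be exploited, sketched under our own assumptions (the report does not specify its fusion procedure), is to pool the proposals of all models with per-model weights and merge near-duplicates:

```python
def ensemble_proposals(model_outputs, weights, iou_thresh=0.9):
    """Hypothetical ensemble sketch: pool proposals from all models
    with per-model weights on their confidence scores, then greedily
    drop near-duplicates (tIoU above iou_thresh), keeping the highest
    weighted score. model_outputs is one list of (start, end, score)
    tuples per model."""
    pooled = [(s, e, sc * w)
              for props, w in zip(model_outputs, weights)
              for (s, e, sc) in props]
    pooled.sort(key=lambda p: -p[2])
    kept = []
    for s, e, sc in pooled:
        if all(_tiou((s, e), (ks, ke)) < iou_thresh for ks, ke, _ in kept):
            kept.append((s, e, sc))
    return kept

def _tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Usage: two models agree on one proposal and disagree on another.
outs = [[(0.0, 10.0, 0.9), (20.0, 30.0, 0.4)], [(0.0, 10.0, 0.8)]]
kept = ensemble_proposals(outs, weights=[0.5, 0.5])
```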
3. Conclusion
In this work, we propose a novel action detection network enhanced with a temporal encoding module and a CBR module for the temporal action detection task. The experimental results show that our CBR-Net solution significantly improves the detection performance. At the same time, in the competition, we also ensemble some other previous networks for better performance.

References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[2] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211, 2019.
[3] C. Lin, J. Li, Y. Wang, Y. Tai, D. Luo, Z. Cui, C. Wang, J. Li, F. Huang, and R. Ji. Fast learning of temporal action proposal via dense boundary generator. arXiv preprint arXiv:1911.04127, 2019.
[4] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3889–3898, 2019.
[5] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[6] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
[7] S. Zhang, H. Peng, L. Yang, J. Fu, and J. Luo. Learning sparse 2D temporal adjacent networks for temporal action localization. arXiv preprint arXiv:1912.03612, 2019.