A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization
Ashraful Islam , Chengjiang Long , Richard J. Radke Rensselaer Polytechnic Institute JD Digits AI [email protected], [email protected], [email protected]
Abstract
Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at: https://github.com/asrafulashiq/hamnet.
Introduction
Temporal action localization refers to the task of predicting the start and end times of all action instances in a video. There has been remarkable progress in fully-supervised temporal action localization (Tran et al. 2017; Zhao et al. 2017; Chao et al. 2018; Lin et al. 2018; Xu et al. 2019). However, annotating the precise temporal ranges of all action instances in a video dataset is expensive, time-consuming, and error-prone. On the contrary, weakly supervised temporal action localization (WTAL) can greatly simplify the data collection and annotation cost.

WTAL aims at localizing and classifying all action instances in a video given only video-level category labels during the training stage. Most existing WTAL methods rely on the multiple instance learning (MIL) paradigm (Paul, Roy, and Roy-Chowdhury 2018; Liu, Jiang, and Wang 2019; Islam and Radke 2020).
Figure 1: The existing MIL framework does not necessarily capture the full extent of an action instance. In this example of a diving activity, (a) shows the ground-truth localization, and (b) shows the prediction from an MIL-based WTAL framework. The MIL framework only captures the most discriminative part of the diving activity, ignoring the beginning and ending parts of the full action.

In this paradigm, a video consists of several snippets; snippet-level class scores, commonly known as Class Activation Sequences (CAS), are calculated and then temporally pooled to obtain video-level class scores. The action proposals are generated by thresholding the snippet-level class scores. However, this framework has a major issue: it does not necessarily capture the full extent of an action instance. As training is performed to minimize the video-level classification loss, the network predicts higher CAS values for the discriminative parts of actions, ignoring the less discriminative parts. For example, an action might consist of several sub-actions (Hou, Sukthankar, and Shah 2017). In the MIL paradigm, only a particular sub-action might be detected, ignoring the other parts of the action.

An illustrative example of a diving activity is presented in Figure 1. We observe that only the most discriminative location of the full diving activity is captured by the MIL framework. Capturing only the most distinctive part of an action is sufficient to produce high video-level classification accuracy, but does not necessarily result in good temporal localization performance. Another issue with the existing framework is modeling the background activity effectively so that background frames are not included in the temporal localization prediction. It has been shown previously that background activity plays an important role in action localization (Lee, Uh, and Byun 2020). Without differentiating the background frames from the foreground ones, the network might include the background frames to minimize the foreground classification loss, resulting in many false positive localization predictions.

In this paper, we propose a new WTAL framework named HAM-Net with a hybrid attention mechanism to solve the above-mentioned issues. Attention mechanisms have been successfully used in deep learning (Islam et al. 2020; Vaswani et al. 2017; Shi et al. 2020a). HAM-Net produces soft, semi-soft and hard attentions to detect the full temporal span of action instances and to model background activity, as illustrated in Figure 2.

Our framework consists of (1) a classification branch that predicts class activation scores for all action instances including background activity, and (2) an attention branch that predicts the "action-ness" score of each video snippet. The snippet-level class activation scores are also modulated by three snippet-level attention scores, and temporally pooled to produce video-level class scores.

To capture the full action instance, we drop the more discriminative parts of the video and focus on the less discriminative parts. We achieve this by calculating semi-soft attention scores and hard attention scores for all snippets in the video. The semi-soft attention scores drop the more discriminative portions of the video by assigning a zero value to the snippets whose soft-attention score is greater than a threshold, while the scores for the other portions remain the same as the soft-attention scores.
The video-level classification scores guided by the semi-soft attention contain only foreground classes. On the other hand, the hard-attention score drops the more discriminative parts of the video and sets the attention scores of the less discriminative parts to one, which ensures that the video-level class scores guided by this hard attention contain both foreground and background classes. Both the semi-soft and hard attentions encourage the model to learn the full temporal boundary of an action in the video.

To summarize, our contributions are threefold: (1) we propose a novel framework with a hybrid attention mechanism to model an action in its entirety; (2) we present a background modeling strategy using attention scores guided by an auxiliary background class; and (3) we achieve state-of-the-art performance on both the THUMOS14 (Jiang et al. 2014) and ActivityNet (Caba Heilbron et al. 2015) datasets. Specifically, we outperform state-of-the-art methods by 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset.
Related Work
Action Analysis with Full Supervision
Due to the representation capability of deep learning based models, and the availability of large-scale datasets (Jiang et al. 2014; Caba Heilbron et al. 2015; Sigurdsson et al. 2016; Gu et al. 2018; Kay et al. 2017), significant progress has been made in the domain of video action recognition. To capture motion cues, the two-stream network (Simonyan and Zisserman 2014) incorporated optical flow (Horn and Schunck 1981) as a separate stream along with RGB frames. 3D convolutional networks have demonstrated better representations for video (Carreira and Zisserman 2017; Tran et al. 2015, 2017). For fully-supervised temporal action localization, several recent methods adopt a two-stage strategy (Tran et al. 2017; Zhao et al. 2017; Chao et al. 2018; Lin et al. 2018).
Weakly Supervised Temporal Action Localization
In terms of existing WTAL methods, UntrimmedNets (Wang et al. 2017) introduced a classification module for predicting a classification score for each snippet, and a selection module to select relevant video segments. On top of that, STPN (Nguyen et al. 2018) added a sparsity loss and class-specific proposals. AutoLoc (Shou et al. 2018) introduced the outer-inner contrastive loss to effectively predict temporal boundaries. W-TALC (Paul, Roy, and Roy-Chowdhury 2018) and Islam and Radke (Islam and Radke 2020) incorporated distance metric learning strategies. MAAN (Yuan et al. 2019) proposed a marginalized average aggregation module and latent discriminative probabilities to reduce the difference between the most salient regions and the others. TSM (Yu et al. 2019) modeled each action instance as a multi-phase process to effectively characterize action instances. WSGN (Fernando, Tan, and Bilen 2020) assigns a weight to each frame prediction based on both local and global statistics. DGAM (Shi et al. 2020b) used a conditional Variational Auto-Encoder (VAE) to separate attention, action, and non-action frames. CleanNet (Liu et al. 2019) introduced an action proposal evaluator that provides pseudo-supervision by leveraging the temporal contrast in snippets. 3C-Net (Narayan et al. 2019) adopted three loss terms to ensure separability, to enhance discriminability, and to delineate adjacent action sequences. Moreover, BaS-Net (Lee, Uh, and Byun 2020) and Nguyen et al. (Nguyen, Ramanan, and Fowlkes 2019) modeled background activity by introducing an auxiliary background class. However, none of these approaches explicitly resolve the issue of modeling an action instance in its entirety.

To model action completeness, Hide-and-Seek (Singh and Lee 2017) hid parts of the video to discover other relevant parts, and Liu et al. (Liu, Jiang, and Wang 2019) proposed a multi-branch network where each branch predicts distinctive action parts. Our approach has a similar motivation, but differs in that we hide the most discriminative parts of the video instead of random parts.
Proposed Method
Problem Formulation
Assume a training video V containing activity instances chosen from n_c activity classes. A particular activity can occur in the video multiple times. Only the video-level action labels are given. Denote the video-level activity label as y ∈ {0, 1}^{n_c}, where y_j = 1 only if there is at least one instance of the j-th action class in the video, and y_j = 0 if there is no instance of the j-th activity. Note that neither the frequency nor the order of the action instances in the video is provided. Our goal is to create a model that is trained only with video-level action classes, and that predicts the temporal locations of activity instances during evaluation, i.e., for a testing video it outputs a set of tuples (t_s, t_e, ψ, c), where t_s and t_e are the start and end frames of an action, c is the action label, and ψ is the activity score.
Figure 2: Overview of our proposed framework HAM-Net. Snippet-level features of both RGB and flow frames are extracted and separately fed into a classification branch and an attention branch with a hybrid attention mechanism. Three attention scores are calculated: soft attention, semi-soft attention, and hard attention, which are multiplied with the snippet-level classification scores to obtain attention-guided class scores. The network is trained using four attention-guided losses: base classification loss (BCL), soft attention loss (SAL), semi-soft attention loss (SSAL), and hard attention loss (HAL), as well as a sparsity loss and a guide loss.

Snippet-Level Classification
In our proposed HAM-Net, as illustrated in Figure 2, we first divide each video into non-overlapping snippets to extract snippet-level features. Using a snippet-level representation rather than a frame-level representation allows us to use existing 3D convolutional feature extractors that can effectively model temporal dependencies in the video. Following the two-stream strategy (Carreira and Zisserman 2017; Feichtenhofer, Pinz, and Zisserman 2016) for action recognition, we extract snippet-level features for both the RGB and flow streams, denoted as x_i^RGB ∈ R^D and x_i^Flow ∈ R^D respectively. We concatenate both streams to obtain the full snippet feature x_i ∈ R^{2D} for the i-th snippet, resulting in a high-level representation that contains both appearance and motion cues.

To determine the temporal locations of all activities in the video, we calculate snippet-level classification scores from the classification branch, which is a convolutional neural network that outputs the class logits commonly known as Class Activation Sequences (CAS) (Shou et al. 2018). We denote the snippet-level CAS over all classes for the i-th snippet as s_i ∈ R^{c+1}. Here, the (c+1)-th class is the background class. Since we only have the video-level class labels as ground truth, we need to pool the snippet-level scores s_i to obtain video-level class scores. There are several pooling strategies in the literature to obtain video-level scores from snippet-level scores. We adopt the top-k strategy (Islam and Radke 2020; Paul, Roy, and Roy-Chowdhury 2018) in our setting. Specifically, the temporal pooling is performed by aggregating the top-k values along the temporal dimension for each class:

v_j = \max_{l \subset \{1,2,\ldots,T\},\, |l| = k} \frac{1}{k} \sum_{i \in l} s_i(j)    (1)

Next, we calculate the video-level class scores by applying a softmax operation along the class dimension:

p_j = \frac{\exp(v_j)}{\sum_{j'=1}^{c+1} \exp(v_{j'})}    (2)

where j = 1, 2, ..., c+1. The base classification loss is calculated as the cross entropy between the ground-truth video-level class labels y and the predicted scores p:

L_{BCL} = - \sum_{j=1}^{c+1} y_j \log(p_j)    (3)

Note that every untrimmed video contains some background portions where no actions occur. These background portions are modeled as a separate class in the classification branch. Hence, the ground-truth background class y_{c+1} = 1 in Eqn. 3. One major issue of this approach is that there are no negative samples for the background class, and the model cannot learn background activity by only optimizing with positive samples. To overcome this issue, we propose a hybrid attention mechanism in the attention branch to further explore the "action-ness" score of each segment.
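To make the pooling and loss concrete, here is a minimal PyTorch sketch of Eqns. 1–3. This is an illustrative re-implementation, not the released HAM-Net code; the tensor shapes and the epsilon for numerical stability are our own choices.

```python
import torch
import torch.nn.functional as F

def topk_pool(cas, k):
    """Eqn. 1: for each class, average the k highest snippet scores.

    cas: (T, C) snippet-level class activation sequence (C = c + 1 classes).
    Returns a (C,) vector of video-level logits v.
    """
    k = min(k, cas.shape[0])
    topk_vals, _ = torch.topk(cas, k=k, dim=0)   # (k, C)
    return topk_vals.mean(dim=0)                 # (C,)

def base_classification_loss(cas, y, k=50):
    """Eqns. 2-3: softmax over classes, then cross entropy against the
    video-level label vector y (multi-hot, with y[-1] = 1 for background)."""
    v = topk_pool(cas, k)                        # (C,)
    p = F.softmax(v, dim=0)                      # Eqn. 2
    return -(y * torch.log(p + 1e-8)).sum()      # Eqn. 3

# Toy usage: 100 snippets, 20 foreground classes + 1 background class.
cas = torch.randn(100, 21)
y = torch.zeros(21)
y[3] = 1.0    # one foreground action present
y[-1] = 1.0   # background is present in every untrimmed video
loss = base_classification_loss(cas, y)
```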
Hybrid Attention Mechanism for Weak Supervision

To suppress background classes in the video, we incorporate an attention module to differentiate foreground and background actions, following the background modeling strategy in several weakly-supervised action detection papers (Nguyen, Ramanan, and Fowlkes 2019; Lee, Uh, and Byun 2020; Liu et al. 2019). The goal is to predict an attention score for each snippet that is lower for frames where there is no activity instance (i.e., background activity) and higher for other regions. Although the classification branch predicts the probability of background action in the snippets, a separate attention module is more effective for differentiating between the foreground and background classes for several reasons. First, most of the actions in a video occur in regions with high motion cues; the attention branch can initially detect the background region from motion features alone. Second, it is easier for a network to learn two classes (foreground vs. background) rather than a large number of classes with weak supervision.
Soft Attention Score
The input to the attention module is the snippet-level feature x_i, and it returns a single foreground attention score a_i:

a_i = g(x_i; Θ),    (4)

where a_i ∈ [0, 1], and g(·; Θ) is a function with parameters Θ that is designed as two temporal convolution layers followed by a sigmoid activation layer.

To create negative samples for the background class, we multiply the snippet-level class logit (i.e., CAS) s_i(j) for each class j with the snippet-level attention score a_i for the i-th snippet, and obtain attention-guided snippet-level class scores s^{attn}_i(j) = s_i(j) ⊗ a_i, where ⊗ is the element-wise product. s^{attn} serves as a set of snippets without any background activity, which can be considered as negative samples for the background class. Following Eqns. 1 and 2, we obtain video-level attention-guided class scores p^{attn}_j for class label j:

v^{attn}_j = \max_{l \subset \{1,2,\ldots,T\},\, |l| = k} \frac{1}{k} \sum_{i \in l} s^{attn}_i(j)    (5)

p^{attn}_j = \frac{\exp(v^{attn}_j)}{\sum_{j'=1}^{c+1} \exp(v^{attn}_{j'})}    (6)

where j = 1, 2, ..., c+1. Note that p^{attn}_j does not contain any background class, since the background class has been suppressed by the attention score a_i. From p^{attn}_j, we calculate the soft attention-guided loss (SAL):

L_{SAL} = - \sum_{j=1}^{c+1} y^f_j \log(p^{attn}_j)    (7)

Here, y^f contains only the foreground activities, i.e., the background class y^f_{c+1} = 0, since the attention score suppresses the background activity.
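A sketch of the attention module and the SAL term of Eqns. 4–7 is shown below. It is illustrative rather than the authors' code: the hidden width, the LeakyReLU between the two convolutions, and the top-k value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBranch(nn.Module):
    """g(x; Theta) of Eqn. 4: two temporal convolutions followed by a
    sigmoid, producing one "action-ness" score per snippet."""

    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (T, D) snippet features
        a = self.net(x.t().unsqueeze(0))       # (1, 1, T)
        return a.squeeze(0).squeeze(0)         # (T,) attention in [0, 1]

def soft_attention_loss(cas, attn, y_fg, k=50):
    """Eqns. 5-7: modulate the CAS by the attention, pool with top-k, and
    compare against the foreground-only label y_fg (background entry = 0)."""
    s_attn = cas * attn.unsqueeze(-1)                         # (T, C)
    topk, _ = torch.topk(s_attn, k=min(k, cas.shape[0]), dim=0)
    p_attn = F.softmax(topk.mean(dim=0), dim=0)               # Eqn. 6
    return -(y_fg * torch.log(p_attn + 1e-8)).sum()           # Eqn. 7
```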
Semi-Soft Attention Score

Given the snippet-level class score s_i and the soft-attention score a_i for the i-th snippet, we calculate the semi-soft attention score by thresholding the soft attention a_i with a particular value γ ∈ [0, 1]:

a^{semi-soft}_i = \begin{cases} a_i, & \text{if } a_i < γ \\ 0, & \text{otherwise} \end{cases}

Note that the semi-soft attention a^{semi-soft}_i both drops the most discriminative regions and attends to the foreground snippets only; hence, the semi-soft attention-guided video-level class scores will only contain foreground activities. This design helps to better model the background, as discussed in the ablation studies section.

Denote the video-level class scores associated with the semi-soft attention as p^{semi-soft}_j, where j = 1, 2, ..., c+1. We calculate the semi-soft attention loss:

L_{SSAL} = - \sum_{j=1}^{c+1} y^f_j \log(p^{semi-soft}_j)    (8)

where y^f is the ground-truth label without background activity, i.e., y^f_{c+1} = 0, since the semi-soft attention suppresses the background snippets along with removing the most discriminative regions.
Hard Attention Score

In contrast to the semi-soft attention, the hard attention score is calculated as

a^{hard}_i = \begin{cases} 1, & \text{if } a_i < γ \\ 0, & \text{otherwise} \end{cases}

With the hard attention scores, we obtain another set of video-level class scores by multiplying them with the original snippet-level logits s_i(j) and temporally pooling the scores following Eqn. 1 and Eqn. 2. We obtain the hard-attention loss:

L_{HAL} = - \sum_{j=1}^{c+1} y_j \log(p^{hard}_j)    (9)

where y is the ground-truth label with background activity, i.e., y_{c+1} = 1, since the hard attention does not suppress the background snippets; rather, it only removes the more discriminative regions of a video.
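Both dropping-based attentions reduce to a simple thresholding of the soft attention. A small sketch follows; the default threshold is illustrative, since the exact value of γ is a tuned hyper-parameter.

```python
import torch

def semi_soft_and_hard_attention(attn, gamma=0.2):
    """Selective snippet dropping.

    attn:  (T,) soft attention scores in [0, 1].
    gamma: drop threshold (illustrative default; tuned in practice).

    semi-soft (used for Eqn. 8): keep a_i where a_i < gamma and zero out the
        most discriminative snippets; paired with foreground-only labels.
    hard (used for Eqn. 9): set the kept snippets to exactly 1; paired with
        labels that still include the background class.
    """
    keep = (attn < gamma).float()
    return attn * keep, keep   # (a_semi_soft, a_hard)
```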
Loss Functions

Finally, we train our proposed HAM-Net using the following joint loss function:

L = λ_1 L_{BCL} + λ_2 L_{SAL} + λ_3 L_{SSAL} + λ_4 L_{HAL} + α L_{sparse} + β L_{guide}    (10)

where L_{sparse} is the sparsity loss, L_{guide} is the guide loss, and λ_1, λ_2, λ_3, λ_4, α, and β are hyper-parameters.

The sparsity loss L_{sparse} is based on the assumption that an action is recognizable from a sparse subset of the video segments (Nguyen et al. 2018). The sparsity loss is calculated as the L1 norm of the soft-attention scores:

L_{sparse} = \sum_{i=1}^{T} |a_i|    (11)

Regarding the guide loss L_{guide}, we consider the soft-attention score a_i as a form of binary classification score for each snippet, where there are only two classes, foreground and background, the probabilities of which are captured by a_i and 1 - a_i. Hence, 1 - a_i can be considered as the probability of the i-th snippet containing background activity. On the other hand, the background class is also captured by the class activation logits s_i(·) ∈ R^{c+1}. To guide the background class activation to follow the background attention, we first calculate the probability of a particular segment being background activity,

\bar{s}_i(c+1) = \frac{\exp(s_i(c+1))}{\sum_{j=1}^{c+1} \exp(s_i(j))}    (12)

and then add a guide loss so that the absolute difference between the background class probability and the background attention is minimized:

L_{guide} = \sum_{i=1}^{T} \left| 1 - a_i - \bar{s}_i(c+1) \right|    (13)
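A sketch of the remaining loss terms and the joint objective of Eqn. 10 is shown below. The weights are placeholders only; the paper selects them by grid search per dataset.

```python
import torch
import torch.nn.functional as F

def sparsity_loss(attn):
    """Eqn. 11: L1 norm of the soft attention scores."""
    return attn.abs().sum()

def guide_loss(cas, attn):
    """Eqns. 12-13: the per-snippet background probability from the
    classification branch (last class of the softmax) should match 1 - a_i."""
    bg_prob = F.softmax(cas, dim=-1)[:, -1]          # (T,)
    return (1.0 - attn - bg_prob).abs().sum()

def total_loss(l_bcl, l_sal, l_ssal, l_hal, l_sparse, l_guide,
               lambdas=(1.0, 1.0, 0.2, 0.2), alpha=0.1, beta=0.1):
    """Eqn. 10: weighted sum of all six terms (weights here are placeholders;
    only the 0.2 for SSAL/HAL is suggested by the ablation study)."""
    l1, l2, l3, l4 = lambdas
    return (l1 * l_bcl + l2 * l_sal + l3 * l_ssal + l4 * l_hal
            + alpha * l_sparse + beta * l_guide)
```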
Temporal Action Localization

For temporal localization, we first discard classes whose video-level class score is less than a fixed threshold. For the remaining classes, we discard the background snippets by thresholding the soft attention scores a_i for all snippets i, and obtain class-agnostic action proposals by selecting the one-dimensional connected components of the remaining snippets. Denote the candidate action locations as {(t_s, t_e, ψ, c)}, where t_s is the start time, t_e is the end time, and ψ is the classification score for class c. We calculate the classification score following the outer-inner score of AutoLoc (Shou et al. 2018). Note that for calculating class-specific scores, we use the attention-guided class logits s^{attn}_c:

ψ = ψ_{inner} - ψ_{outer} + ζ p^{attn}_c    (14)

ψ_{inner} = \mathrm{Avg}\big(s^{attn}_c(t_s : t_e)\big)    (15)

ψ_{outer} = \mathrm{Avg}\big(s^{attn}_c(t_s - l_m : t_s) + s^{attn}_c(t_e : t_e + l_m)\big)    (16)

where ζ is a hyper-parameter, l_m is a margin proportional to the proposal length t_e - t_s, p^{attn}_c is the video-level score for class c, and s^{attn}_c(·) is the snippet-level class logit for class c. We apply different thresholds for obtaining action proposals, and remove overlapping segments with non-maximum suppression.
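A sketch of the outer-inner proposal scoring of Eqns. 14–16; the outer margin fraction and the ζ value used here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def oic_score(cas_c, p_attn_c, t_s, t_e, zeta=0.3, margin_ratio=0.25):
    """Outer-inner contrastive score for one proposal of class c.

    cas_c:    (T,) attention-guided class logits s^attn_c.
    p_attn_c: scalar video-level score for class c.
    t_s, t_e: start/end snippet indices of the proposal (t_e > t_s).
    """
    T = len(cas_c)
    l_m = max(1, int(round(margin_ratio * (t_e - t_s))))     # outer margin
    inner = cas_c[t_s:t_e].mean()                            # Eqn. 15
    outer = np.concatenate([cas_c[max(0, t_s - l_m):t_s],
                            cas_c[t_e:min(T, t_e + l_m)]])
    outer = outer.mean() if outer.size > 0 else 0.0          # Eqn. 16
    return inner - outer + zeta * p_attn_c                   # Eqn. 14
```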
Experiments

Experimental Settings
Datasets
We evaluate our approach on two popular action localization datasets: THUMOS14 (Jiang et al. 2014) and ActivityNet1.2 (Caba Heilbron et al. 2015).
THUMOS14 contains 200 validation videos for training and 213 test videos for evaluation, covering 20 action categories. This is a challenging dataset with around 15.5 activity segments and 71% background activity per video.
The ActivityNet1.2 dataset contains 4,819 videos for training and 2,382 videos for testing, covering 200 action classes. It contains around 1.5 activity instances per video (10 times sparser than THUMOS14) and 36% background activity per video.
Evaluation Metrics
For evaluation, we use the standard protocol and report mean Average Precision (mAP) at various intersection over union (IoU) thresholds. The evaluation code provided by ActivityNet (Caba Heilbron et al. 2015) is used to calculate the evaluation metrics.
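For reference, the overlap measure behind these metrics is the temporal IoU between a predicted segment and a ground-truth segment; a short, self-contained sketch:

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two (start, end) segments."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# A prediction covering only the discriminative middle of a long action
# scores a low IoU against the full ground-truth extent:
print(temporal_iou((4.0, 6.0), (2.0, 9.0)))   # 2 / 7, i.e. about 0.29
```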
Implementation Details
For feature extraction, we sample the video streams into non-overlapping 16-frame chunks for both the RGB and the flow stream. Flow streams are created using the TV-L1 algorithm (Wedel et al. 2008). We use the I3D network (Carreira and Zisserman 2017) pre-trained on the Kinetics dataset (Kay et al. 2017) to extract both RGB and flow features, and concatenate them to obtain 2048-dimensional snippet-level features. During training we randomly sample 500 snippets for THUMOS14 and 80 snippets for ActivityNet, and during evaluation we take all the snippets. The classification branch is designed as two temporal convolution layers with kernel size 3, each followed by LeakyReLU activation, and a final linear fully-connected layer for predicting class logits. The attention branch consists of two temporal convolution layers with kernel size 3 followed by a sigmoid layer to predict attention scores between 0 and 1.

We use the Adam (Kingma and Ba 2015) optimizer with learning rate 0.00001, and train for 100 epochs on THUMOS14 and 20 epochs on ActivityNet. For THUMOS14 we use k = 50 for top-k temporal pooling; for ActivityNet we use k = 4 and apply additional average pooling to post-process the final CAS. The loss weights λ_1, λ_2, λ_3, λ_4, α, β and the drop threshold γ are determined by grid search for each dataset. For action localization, we set the thresholds from 0.1 to 0.9 with a step of 0.05, and perform non-maximum suppression to remove overlapping segments. All experiments are performed on a single NVIDIA RTX 2080Ti GPU.
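A minimal PyTorch sketch of the two branches as described above; the hidden width and the LeakyReLU slope are not specified in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class HAMNetBranches(nn.Module):
    """Classification branch (two temporal convs + linear classifier over
    c foreground classes plus background) and attention branch (two temporal
    convs + sigmoid), both operating on 2048-d I3D snippet features."""

    def __init__(self, feat_dim=2048, num_classes=20, hidden=2048):
        super().__init__()
        self.cls_convs = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Linear(hidden, num_classes + 1)
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                      # x: (B, T, 2048)
        h = self.cls_convs(x.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        cas = self.classifier(h)                               # (B, T, c+1) CAS
        attn = self.attention(x.transpose(1, 2)).squeeze(1)    # (B, T) action-ness
        return cas, attn

# Toy usage: a batch of 2 videos with 500 snippets each.
model = HAMNetBranches()
cas, attn = model(torch.randn(2, 500, 2048))
```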
Ablation Studies

We conduct a set of ablation studies on the THUMOS14 dataset to analyze the performance contribution of each component of our proposed HAM-Net. Table 1 shows the performance of our method with respect to different loss terms. We use "AVG mAP" as the performance metric, which is the average of mAP values over different IoU thresholds (0.1:0.1:0.7). The first five experiments are trained without the SSAL or HAL losses, i.e., without any temporal dropping mechanism, which we denote as "MIL-only mode"; the remaining experiments, trained with those losses, are denoted as "MIL and Drop mode". Figure 3 shows the localization predictions of different experiments on a representative video. Our analysis shows that all the loss components are required to achieve the maximum performance.
Importance of sparsity and guide loss
Table 1 shows that both the sparsity and guide losses are important to achieve better performance. Specifically, in "MIL-only mode", adding both sparsity and guide loss provides a 4% mAP gain, and in "MIL and Drop mode", the mAP gain is 9%, suggesting that these losses are more important in "MIL and Drop mode". Note that for SSAL and HAL, the discriminativeness of a snippet is measured by the soft-attention scores that are learned by the sparsity and guide losses. Without the sparsity loss, the majority of the soft-attention scores remain close to 1, making the snippet dropping strategy ineffective. Moreover, the guide loss itself does not increase the localization performance significantly without the sparsity loss (experiment 3 and experiment 7 in Table 1); however, combined with the sparsity loss it shows the best performance improvement (experiment 5 and experiment 11 in Table 1).

Table 1: Ablation study on the effectiveness of different combinations of the loss functions (L_BCL, L_SAL, L_HAL, L_SSAL, L_sparse, L_guide) for localization performance on THUMOS14 in terms of mAP. Here, AVG mAP means the average of mAP values from IoU thresholds 0.1 to 0.7. Adding L_SSAL and L_HAL to the total loss function improves the mAP from 34.8 to 39.8.

Figure 3: Visualization of the effects of different loss functions on the final localization for a video containing the Long Jump activity. Panels: (a) GT, (b) BCL, (c) BCL+SAL, (d) BCL+SAL+sparse, (e) BCL+SAL+sparse+guide, (f) BCL+SAL+sparse+guide+SSAL+HAL. (a) is the ground-truth action location. (b) uses only the MIL loss, which predicts many false positives. After adding the sparsity and guide losses, those false positives disappear, but the full temporal boundaries are still not captured. (f) shows results from our full approach, which captures the full action boundaries.
Importance of attention losses
We observe that the attention losses can significantly improve the performance. Table 1 shows that incorporating only L_SAL achieves a 6.2% average mAP gain over the BCL-only model. From experiment 9 and experiment 10 in Table 1, we see that both HAL and SSAL individually improve the performance, and we get the best performance when we combine them. Specifically, the combination of HAL and SSAL improves the performance by 5% over the best score in "MIL-only mode". Figure 3 shows visualization examples of the effectiveness of the losses on a representative video. We can observe that the MIL-only model fails to capture several parts of a full action instance (i.e., Long Jump). Incorporating the attention losses helps to capture the action in its entirety.
Importance of dropping snippets by selective thresholding
For calculating the HAL and SSAL losses, we drop the more discriminative parts of the video and train on the less discriminative parts, assuming that focusing on less discriminative parts will help the model to learn the completeness of actions. To confirm this assumption, we create two baselines: "ours with random drop", where we randomly drop video snippets, similar to Hide-and-Seek (Singh and Lee 2017), and "ours with inverse drop", where we drop the less discriminative parts instead of the most discriminative parts. We show the performance comparison between these models in Figure 4a. Results show that randomly dropping snippets is slightly more effective than the baseline, and dropping the less discriminative parts decreases the localization performance. Our approach performs much better than randomly dropping snippets or dropping less discriminative snippets, which demonstrates the efficacy of selectively dropping the more discriminative foreground snippets.
Figure 4: (a) Ablation study on the importance of dropping snippets by selective thresholding; other approaches like random dropping or inverse selective thresholding do not work well. (b) Ablation study on the importance of SSAL and HAL (AVG mAP vs. the weight of the attention losses); a lower weight causes the model to learn only the most distinctive parts, and a higher weight gives too much focus to the less distinctive parts.
Ablation on λ_3 and λ_4

For this analysis, we set λ_3 = λ_4 = λ. In Figure 4b, we analyze the effect of λ on the performance. Note that λ = 0 denotes "MIL-only mode", which achieves an average mAP of 34.8%. Increasing the value of λ improves performance until λ reaches 0.2, after which we observe performance degradation. The reason is that a lower weight does not incorporate L_SSAL and L_HAL effectively during training. On the contrary, a higher weight places too much importance on the less discriminative parts, which might cause the model to ignore the more discriminative regions in every iteration, resulting in poor localization performance. The optimum value of 0.2 balances out both of these issues.
Performance Comparison to State-of-the-Art
Table 2 summarizes performance comparisons between our proposed HAM-Net and state-of-the-art fully-supervised and weakly-supervised TAL methods on the THUMOS14 dataset. We report mAP scores at different IoU thresholds. 'AVG' is the average mAP for IoU 0.1 to 0.7 with a step size of 0.1. With weak supervision, our proposed HAM-Net achieves state-of-the-art scores at all IoU thresholds.

Table 2: Comparison of our algorithm with other state-of-the-art methods on the THUMOS14 dataset for temporal action localization.
Supervision | Method | Feature | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | AVG
Full | R-C3D (Xu, Das, and Saenko 2017) | - | 54.5 | 51.5 | 44.8 | 35.6 | 28.9 | - | - | -
Full | SSN (Zhao et al. 2017) | - | 66.0 | 59.4 | 51.9 | 41.0 | 29.8 | - | - | -
Full | BSN (Lin et al. 2018) | - | - | - | 53.5 | 45.0 | 36.9 | 28.4 | 20.0 | -
Full | G-TAD (Xu et al. 2019) | - | - | - | 54.5 | 47.6 | 40.2 | - | - | -
Full | P-GCN (Zeng et al. 2019) | - | - | - | - | - | - | - | - | -
Weak | Hide-and-Seek (Singh and Lee 2017) | - | 36.4 | 27.8 | 19.5 | 12.7 | 6.8 | - | - | -
Weak | UntrimmedNets (Wang et al. 2017) | - | 44.4 | 37.7 | 28.2 | 21.1 | 13.7 | - | - | -
Weak | STPN (Nguyen et al. 2018) | I3D | 52.0 | 44.7 | 35.5 | 25.8 | 16.9 | 9.9 | 4.3 | 26.4
Weak | AutoLoc (Shou et al. 2018) | UNT | - | - | 35.8 | 29.0 | 21.2 | 13.4 | 5.8 | -
Weak | W-TALC (Paul, Roy, and Roy-Chowdhury 2018) | I3D | 55.2 | 49.6 | 40.1 | 31.1 | 22.8 | - | 7.6 | -
Weak | Liu et al. (Liu, Jiang, and Wang 2019) | I3D | 57.4 | 50.8 | 41.2 | 32.1 | 23.1 | 15.0 | 7.0 | 32.4
Weak | MAAN (Yuan et al. 2019) | I3D | 59.8 | 50.8 | 41.1 | 30.6 | 20.3 | 12.0 | 6.9 | 31.6
Weak | TSM (Yu et al. 2019) | I3D | - | - | 39.5 | - | 24.5 | - | 7.1 | -
Weak | CleanNet (Liu et al. 2019) | UNT | - | - | 37.0 | 30.9 | 23.9 | 13.9 | 7.1 | -
Weak | 3C-Net (Narayan et al. 2019) | I3D | 56.8 | 49.8 | 40.9 | 32.3 | 24.6 | - | 7.7 | -
Weak | Nguyen et al. (Nguyen, Ramanan, and Fowlkes 2019) | I3D | 60.4 | 56.0 | 46.6 | 37.5 | 26.8 | 17.6 | 9.0 | 36.3
Weak | WSGN (Fernando, Tan, and Bilen 2020) | I3D | 57.9 | 51.2 | 42.0 | 33.1 | 25.1 | 16.7 | 8.9 | 33.6
Weak | Islam et al. (Islam and Radke 2020) | I3D | 62.3 | - | 46.8 | - | 29.6 | - | 9.7 | -
Weak | BaS-Net (Lee, Uh, and Byun 2020) | I3D | 58.2 | 52.3 | 44.6 | 36.0 | 27.0 | 18.6 | 10.4 | 35.3
Weak | DGAM (Shi et al. 2020b) | I3D | 60.0 | 54.2 | 46.8 | 38.2 | 28.8 | 19.8 | 11.4 | 37.0
Weak | HAM-Net (Ours) | I3D | | | | | | | |
Specifically, HAM-Net achieves 2.2% more mAP than the current best score at IoU threshold 0.5. Moreover, our HAM-Net outperforms some fully-supervised TAL models, and even shows comparable results with some recent fully-supervised TAL methods.

In Table 3, we evaluate HAM-Net on the ActivityNet1.2 dataset. HAM-Net outperforms other WTAL approaches on ActivityNet1.2 across all metrics, verifying the effectiveness of our proposed HAM-Net.

Table 3: Comparison of our algorithm with other state-of-the-art methods on the ActivityNet1.2 validation set for temporal action localization. AVG means the average mAP from IoU 0.5 to 0.95 with 0.05 increments.
Supervision | Method | 0.5 | 0.75 | 0.95 | AVG
Full | SSN | 41.3 | 27.0 | 6.1 | 26.6
Weak | UntrimmedNets | 7.4 | 3.2 | 0.7 | 3.6
Weak | AutoLoc | 27.3 | 15.1 | 3.3 | 16.0
Weak | W-TALC | 37.0 | 12.7 | 1.5 | 18.0
Weak | Islam et al. | | | |
Weak | HAM-Net (Ours) | 41.0 | 24.8 | 5.3 | 25.1
Qualitative Performance
We show some representative examples in Fig. 5. For each video, the top row shows example frames, the next row represents the ground-truth localization, "Ours" is our prediction, and "Ours w/o HAL & SSAL" is our model trained without L_HAL and L_SSAL. Fig. 5 shows that our model clearly captures the full temporal extent of activities, while "Ours w/o HAL & SSAL" focuses only on the more discriminative snippets.
Figure 5: Qualitative results on THUMOS14 for (a) a High Jump video and (b) a Diving video. The horizontal axis denotes time. On the vertical axis, we sequentially plot the ground-truth detection, our detection scores, and the detection scores of our model without HAL and SSAL. SSAL and HAL help to learn the full context of an action.
Conclusion
We presented a novel framework called HAM-Net to learn temporal action localization from only video-level supervision during training. We introduced a hybrid attention mechanism including soft, semi-soft, and hard attentions to differentiate background frames from foreground ones and to capture the full temporal boundaries of the actions in the video. We performed extensive analysis to show the effectiveness of our approach, which achieves state-of-the-art performance on both the THUMOS14 and ActivityNet1.2 datasets.

Acknowledgement
This material is based upon work supported by the U.S. Department of Homeland Security under Award Number 2013-ST-061-ED0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.
References
Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; and Carlos Niebles, J. 2015. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 961–970.

Carreira, J.; and Zisserman, A. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Fernando, B.; Tan, C.; and Bilen, H. 2020. Weakly Supervised Gaussian Networks for Action Detection. In The IEEE Winter Conference on Applications of Computer Vision, 537–546.

Gu, C.; Sun, C.; Vijayanarasimhan, S.; Pantofaru, C.; Ross, D. A.; Toderici, G.; Li, Y.; Ricco, S.; Sukthankar, R.; Schmid, C.; and Malik, J. 2018. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Horn, B. K. P.; and Schunck, B. G. 1981. Determining Optical Flow. In Techniques and Applications of Image Understanding, volume 281, 319–331. International Society for Optics and Photonics.

Hou, R.; Sukthankar, R.; and Shah, M. 2017. Real-Time Temporal Action Localization in Untrimmed Videos by Sub-Action Discovery. In BMVC, volume 2, 7.

Islam, A.; Long, C.; Basharat, A.; and Hoogs, A. 2020. DOA-GAN: Dual-Order Attentive Generative Adversarial Network for Image Copy-Move Forgery Detection and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Islam, A.; and Radke, R. 2020. Weakly Supervised Temporal Action Localization Using Deep Metric Learning. In The IEEE Winter Conference on Applications of Computer Vision, 547–556.

Jiang, Y.-G.; Liu, J.; Roshan Zamir, A.; Toderici, G.; Laptev, I.; Shah, M.; and Sukthankar, R. 2014. THUMOS Challenge: Action Recognition with a Large Number of Classes. http://crcv.ucf.edu/THUMOS14/.

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, A.; Suleyman, M.; and Zisserman, A. 2017. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).

Lee, P.; Uh, Y.; and Byun, H. 2020. Background Suppression Network for Weakly-Supervised Temporal Action Localization. In AAAI.

Lin, T.; Zhao, X.; Su, H.; Wang, C.; and Yang, M. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.

Liu, D.; Jiang, T.; and Wang, Y. 2019. Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1298–1307.

Liu, Z.; Wang, L.; Zhang, Q.; Gao, Z.; Niu, Z.; Zheng, N.; and Hua, G. 2019. Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks. In Proceedings of the IEEE International Conference on Computer Vision, 3899–3908.

Narayan, S.; Cholakkal, H.; Khan, F. S.; and Shao, L. 2019. 3C-Net: Category Count and Center Loss for Weakly-Supervised Action Localization. In Proceedings of the IEEE International Conference on Computer Vision, 8679–8687.

Nguyen, P.; Liu, T.; Prasad, G.; and Han, B. 2018. Weakly Supervised Action Localization by Sparse Temporal Pooling Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6752–6761.

Nguyen, P. X.; Ramanan, D.; and Fowlkes, C. C. 2019. Weakly-Supervised Action Localization with Background Modeling. In Proceedings of the IEEE International Conference on Computer Vision, 5502–5511.

Paul, S.; Roy, S.; and Roy-Chowdhury, A. K. 2018. W-TALC: Weakly-Supervised Temporal Activity Localization and Classification. In Proceedings of the European Conference on Computer Vision (ECCV), 563–579.

Shi, B.; Dai, Q.; Mu, Y.; and Wang, J. 2020a. Weakly-Supervised Action Localization by Generative Attention Modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Shi, B.; Dai, Q.; Mu, Y.; and Wang, J. 2020b. Weakly-Supervised Action Localization by Generative Attention Modeling.

Shou, Z.; Gao, H.; Zhang, L.; Miyazawa, K.; and Chang, S.-F. 2018. AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos. In Proceedings of the European Conference on Computer Vision (ECCV), 154–171.

Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In European Conference on Computer Vision.

Simonyan, K.; and Zisserman, A. 2014. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, 568–576.

Singh, K. K.; and Lee, Y. J. 2017. Hide-and-Seek: Forcing a Network to Be Meticulous for Weakly-Supervised Object and Action Localization. In Proceedings of the IEEE International Conference on Computer Vision.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, L.; Xiong, Y.; Lin, D.; and Van Gool, L. 2017. UntrimmedNets for Weakly Supervised Action Recognition and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4325–4334.

Wedel, A.; Pock, T.; Zach, C.; Bischof, H.; and Cremers, D. 2008. An Improved Algorithm for TV-L1 Optical Flow. In Statistical and Geometrical Approaches to Visual Motion Analysis.

Xu, H.; Das, A.; and Saenko, K. 2017. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In Proceedings of the IEEE International Conference on Computer Vision, 5522–5531.

Yuan, Y.; Lyu, Y.; Shen, X.; Tsang, I. W.; and Yeung, D.-Y. 2019. Marginalized Average Attentional Network for Weakly-Supervised Learning. In International Conference on Learning Representations. URL https://openreview.net/forum?id=HkljioCcFQ.

Zeng, R.; Huang, W.; Tan, M.; Rong, Y.; Zhao, P.; Huang, J.; and Gan, C. 2019. Graph Convolutional Networks for Temporal Action Localization. In The IEEE International Conference on Computer Vision (ICCV).

Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Lin, D.; and Tang, X. 2017. Temporal Action Detection with Structured Segment Networks. In Proceedings of the IEEE International Conference on Computer Vision.
A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization (Supplementary Materials)
Ashraful Islam , Chengjiang Long , Richard J. Radke Rensselaer Polytechnic Institute JD Digits AI [email protected], [email protected], [email protected]
Abstract
The supplemental material contains additional experiments, visualizations, and ablation studies.
Experiments
Action Classification
Table 1 shows the action classification performance of our approach in comparison with other state-of-the-art methods on the THUMOS14 and ActivityNet1.2 datasets. We use classification mean average precision (mAP) for evaluation. We see that the classification performance of our approach is very competitive with the state of the art; in particular, on THUMOS14 we achieve a 7.2% mAP improvement over 3C-Net. We also achieve very competitive performance on the ActivityNet dataset. Although our approach has not been designed for the video action recognition task, its high performance in action classification reveals the robustness of our method.

Detailed Performance on ActivityNet1.2
Table 2 shows the detailed performance of our approach on the ActivityNet1.2 dataset in terms of localization mAP for different IoU thresholds.
More Ablation
Fig. 1 shows ablation studies on the hyper-parameters α, β, and the drop threshold γ on the THUMOS14 dataset. AVG mAP is the mean mAP value from IoU threshold 0.1 to 0.7 with increments of 0.1. Fig. 1a shows the performance for different weights on the sparsity loss: without the sparsity loss, the model hardly learns any localization, and as α increases, the localization performance increases as well. Fig. 1b shows the performance improvement for different weights on the guide loss, from which we empirically choose the best-performing value of β. In Fig. 1c, we show the mAP performance for different values of the dropping threshold γ. Fig. 1d and Fig. 1e show the effect of video length during training for THUMOS14 and ActivityNet respectively. Note that THUMOS14 contains denser videos with a large number of activities per video. Hence we observe that the performance increases for larger video lengths on THUMOS14, whereas ActivityNet performs best with 80-segment videos. Also note that the number of segments is chosen randomly only during training; we use all segments during evaluation.

More Qualitative Examples
We show more qualitative examples in Fig. 2. In Fig. 2a, there are several occurrences of the Pole Vault activity, and our method can capture most of them. We show some failure examples in Fig. 2b and Fig. 2c. In Fig. 2b, our model erroneously captures some segments as High Jump; in those erroneous segments, we observe that the person starts a high jump activity but stops without completing the full action. The same goes for Fig. 2c. Previous WTAL approaches have also shown similar issues as an inherent limitation of WTAL methods. Because of the weakly-supervised nature, we infer that some errors related to incomplete activities are inevitable.

Table 1: Action classification performance (classification mAP, %) of our approach compared with other methods on THUMOS14 and ActivityNet1.2.

Method | THUMOS14 | ActivityNet1.2
iDT+FV | 63.1 | 66.5
C3D | - | 74.1
TSN | 67.7 | 88.8
W-TALC | 85.6 | -
3C-Net | 86.9 | 92.4
Ours | |
Figure 1: (a) Ablation on the weight of the sparsity loss. (b) Ablation on the weight of the guide loss. (c) Ablation on the drop threshold for dropping snippets in the HAD module. (d) and (e) Ablation on the number of segments per video during training, for THUMOS14 and ActivityNet respectively.

Table 2: Comparison of our algorithm with other state-of-the-art methods on the ActivityNet1.2 validation set for temporal action localization.
Supervision | Method | 0.5 | 0.55 | 0.6 | 0.65 | 0.7 | 0.75 | 0.8 | 0.85 | 0.9 | 0.95 | AVG
Full | SSN | 41.3 | 38.8 | 35.9 | 32.9 | 30.4 | 27.0 | 22.2 | 18.2 | 13.2 | 6.1 | 26.6
Weak | UntrimmedNets | 7.4 | 6.1 | 5.2 | 4.5 | 3.9 | 3.2 | 2.5 | 1.8 | 1.2 | 0.7 | 3.6
Weak | AutoLoc | 27.3 | 24.9 | 22.5 | 19.9 | 17.5 | 15.1 | 13.0 | 10.0 | 6.8 | 3.3 | 16.0
Weak | W-TALC | 37.0 | 33.5 | 30.4 | 25.7 | 14.6 | 12.7 | 10.0 | 7.0 | 4.2 | 1.5 | 18.0
Weak | TSM | 28.3 | 26.0 | 23.6 | 21.2 | 18.9 | 17.0 | 14.0 | 11.1 | 7.5 | 3.5 | 17.1
Weak | 3C-Net | 35.4 | - | - | - | 22.9 | - | - | - | 8.5 | - | 21.1
Weak | CleanNet | 37.1 | 33.4 | 29.9 | 26.7 | 23.4 | 20.3 | 17.2 | 13.9 | 9.2 | 5.0 | 21.6
Weak | Liu et al. | 36.8 | - | - | - | - | 22.0 | - | - | - | 5.6 | 22.4
Weak | Islam et al. | 35.2 | - | - | - | 16.3 | - | - | - | - | - | -
Weak | BaS-Net | 34.5 | - | - | - | - | 22.5 | - | - | - | 4.9 | 22.2
Weak | DGAM | | | | | | | | | | |
Weak | Ours | 41.0 | 37.9 | 34.6 | 31.3 | 28.1 | 24.8 | 21.1 | 16.0 | 10.8 | 5.3 | 25.1
Figure 2: More qualitative examples on THUMOS14: (a) Pole Vault, (b) High Jump, (c) Diving. For each video we plot the ground truth, our prediction, and the prediction score.