Multi-Label Activity Recognition using Activity-specific Features and Activity Correlations
Yanyi Zhang, Xinyu Li, Ivan Marsic
Rutgers University, New Brunswick, Electrical and Computer Engineering Department; Amazon Web Services
[email protected], [email protected], [email protected]
Abstract
We introduce an approach to multi-label activity recognition by extracting independent feature descriptors for each activity. Our approach first extracts a set of independent feature snippets, focused on different spatio-temporal regions of a video, that we call "observations". We then generate independent feature descriptors for each activity, that we call "activity-specific features", by combining these observations with attention, and further make action predictions based on these activity-specific features. This structure can be trained end-to-end and plugged into any existing network structure for video classification. Our method outperformed state-of-the-art approaches on three multi-label activity recognition datasets. We also evaluated the method and achieved state-of-the-art performance on two single-activity recognition datasets to show the generalizability of our approach. Furthermore, to better understand the activity-specific features that the system generates, we visualized these activity-specific features in the Charades dataset.
Introduction
Activity recognition has been widely studied in recent years due to its great potential in real-world applications. Recent activity recognition research (Kay et al. 2017; Soomro, Zamir, and Shah 2012; Goyal et al. 2017; Kuehne et al. 2011) focused on single-activity recognition, assuming that each video contains only one activity, without considering the multi-label problem where each video may contain multiple activities (simultaneous or sequential), which has more general real-world use cases (e.g., sports activity recognition (Sozykin et al. 2018; Carbonneau et al. 2015) or daily life activity recognition (Sigurdsson et al. 2016)). Most of the recent multi-label activity recognition methods are derived from structures for single-activity recognition that generate a shared feature vector using 3D average pooling and apply sigmoid as the output activation function (Li et al. 2017; Wang et al. 2016; Carreira and Zisserman 2017; Wang et al. 2018; Feichtenhofer et al. 2019; Wu et al. 2019). Although these approaches enable the network to provide multi-label outputs, the features are not representative of multi-label activities. These methods work well on single-activity recognition under the assumption that the learned feature maps will only activate on one region where the corresponding activity occurred. The remaining regions are considered unrelated,
and have low values in the feature maps. As a result, averaging feature maps over the spatio-temporal dimension may represent single-activity features well. However, in multi-label activity videos, the feature maps may focus on multiple disconnected regions corresponding to the performed activities, and the 3D average pooling will globally merge the feature maps and make the rendered features unrepresentative.

Figure 1: System overview using an example from Charades. The system first generates k independent feature snippets ("observations") that focus on different key regions from the video (arms, blankets, and clothes). The activity-specific features are then generated by independently combining these observations. The weights of the observations that contribute to activity-specific features are represented as lines with different colors (black, red, and blue); thicker lines denote higher weights. For example, the activity-specific features for "holding a blanket" are obtained by combining information from the observation that focuses on arms and the observation that focuses on clothes. More examples and detailed explanations are given in Figure 4.

To better represent multi-label activities performed in a video, we introduce a novel mechanism that generates independent feature descriptors for different activities. We named these feature descriptors "activity-specific features". The introduced mechanism generates activity-specific features in two stages. The first-stage (Figure 1, middle) network summarizes the feature maps extracted by the backbone network (3D convolution layers) and generates a set of independent feature snippets by applying independent spatio-temporal attention for each snippet. We name these feature snippets "observations". In the second stage (Figure 1, right), the network then learns activity-specific features from different combinations of observations for different activities. In this way, each activity is represented by an independent set of feature descriptors (activity-specific features). The multi-label activity predictions can then be made based on the activity-specific features. Unlike most of the previous approaches (Li et al. 2017; Wang et al. 2016; Carreira and Zisserman 2017; Wang et al. 2018; Feichtenhofer et al. 2019; Wu et al. 2019) that generate a shared feature vector to represent multiple activities by pooling feature maps globally, our network produces specific feature descriptors for each activity, which is more representative.

In multi-label activity videos, different activities might have different duration and need to be recognized using video clips with different lengths. To address this issue, we further introduce a speed-invariant tuning method for generating activity-specific features and recognizing multi-label activities using inputs with different downsampling rates. Our experimental results show that the speed-invariant tuning method boosts the mAP (mean average precision) score on Charades (Sigurdsson et al. 2016) by around 1%.

We compared our system with current state-of-the-art methods on three multi-label activity recognition datasets: a large-scale dataset, Charades (Sigurdsson et al. 2016), for the main experiment, and two other small multi-label sport activity datasets, Volleyball (Sozykin et al. 2018) and Hockey (Carbonneau et al. 2015). We also evaluated our method on Kinetics-400 (Kay et al. 2017) and UCF-101 (Soomro, Zamir, and Shah 2012) for single-activity recognition to show the generalizability of our method.
We outperformed the current state-of-the-art methods on all three multi-label activity recognition datasets while only using RGB videos as input (Feichtenhofer 2020; Sozykin et al. 2018; Azar et al. 2019). Our experimental results demonstrate that our approach for generating activity-specific features succeeds in multi-label activity recognition. We achieved performance similar to the most recent state-of-the-art approaches (Feichtenhofer et al. 2019; Tran et al. 2019; Feichtenhofer 2020) on Kinetics-400 and UCF-101, which shows that our system, although focused on multi-label activity recognition, is able to achieve state-of-the-art performance on single-activity recognition datasets. We further visualized the activity-specific features by applying the learned attention maps on the backbone features (feature maps after the last 3D convolution layer) to represent the activity-specific feature maps. Our contributions can be summarized as:
1. A novel network structure that generates activity-specific features for multi-label activity recognition.
2. A speed-invariant tuning method that produces multi-label activity predictions using inputs with different temporal resolutions.
3. An evaluation of our method on five activity recognition datasets, including multi-label and single-activity datasets, to show the generalizability of our method, together with an ablation study for selecting the parameters.
Related Work
Activity Recognition.
Video-based activity recognition has been developing rapidly in recent years due to the success of deep learning methods for image recognition (Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015; He et al. 2016). Compared to image classification, activity recognition depends on spatio-temporal features extracted from consecutive frames instead of spatial-only features from static images. Two-stream networks apply two-branch convolution layers to extract motion features from consecutive frames as well as spatial features from static images, and fuse them for activity recognition (Simonyan and Zisserman 2014; Feichtenhofer, Pinz, and Zisserman 2016; Wang et al. 2016). Others proposed 3D-convolution-based networks for extracting spatio-temporal features from the videos instead of using manually designed optical flow for extracting motion between frames (Carreira and Zisserman 2017; Tran et al. 2015). The nonlocal neural network (Wang et al. 2018) and the long-term feature bank (LFB) (Wu et al. 2019) extended the 3D ConvNet by extracting long-range features. The SlowFast network (Feichtenhofer et al. 2019) introduced a two-pathway network for learning motion and spatial features separately from the videos. The most recent X3D network introduced an efficient video network that expands 2D networks along multiple axes (Feichtenhofer 2020). These methods work well on single-activity recognition datasets and achieve around 80% accuracy on Kinetics-400 (Kay et al. 2017).
Multi-label Activity Recognition.
Multi-label activity recognition is designed for recognizing multiple activities that are performed simultaneously or sequentially in each video. Most of the recent approaches focus on single-activity recognition by assuming that only one activity was performed in a video, and produce multi-label outputs by changing the activation function from softmax to sigmoid when applied to multi-label activity datasets (Wang et al. 2016; Carreira and Zisserman 2017; Wang et al. 2018; Feichtenhofer et al. 2019; Wu et al. 2019). These methods merge features representing different activities together and fail to generate representative features for multi-label activities, as mentioned earlier.
Attention Modules.
The attention mechanism was introduced for capturing long-range associations within sequential inputs and is commonly used in natural language processing tasks (Bahdanau, Cho, and Bengio 2014; Vaswani et al. 2017). Existing activity recognition approaches apply spatial attention to aggregate the backbone features (output of the last convolution layer) over space and temporal attention to aggregate the feature maps over time (Li et al. 2018; Meng et al. 2019; Du, Wang, and Qiao 2017; Girdhar and Ramanan 2017). The nonlocal block is an extension of these methods that generates spatio-temporal attention instead of separately using cascaded attention in space and time (Wang et al. 2018). However, these methods do not work well on multi-label activities because all the channels of the feature maps share the same attention, which causes the feature maps to fail to highlight important regions.
Bag-of-Features.
Bag-of-words methods were initially developed for document classification and further extended to bag-of-visual-features for image recognition (Csurka et al. 2004; Arandjelovic et al. 2016; Yan, Smith, and Zhang 2017; Sudhakaran and Lanz 2019; Tu et al. 2019; Girdhar et al. 2017). These methods benefit from requiring fewer parameters by encoding large images into visual words. The step that generates activity-specific features in our network is related to the idea of bag-of-features.

Figure 2: Method overview, showing the detailed dimension transformations when generating activity-specific features and providing predictions. Attentions (red, green, and brown) focus on different spatio-temporal regions of the backbone feature (F_f) for generating observations (Obs), and activity-specific features (F_A) are generated by combining observations using attnA.

Methodology
Inspired by the idea of creating "action words" from Action-VLAD and other bag-of-features methods (Girdhar et al. 2017; Sudhakaran and Lanz 2019; Tu et al. 2019), we introduce a method that generates independent feature descriptors for each activity (activity-specific features). Compared to Action-VLAD and other bag-of-features methods, our activity-specific features are end-to-end trainable, unlike their visual words that are generated using unsupervised learning methods. In addition, our activity-specific features focus on different spatio-temporal regions instead of aggregating features over time as Action-VLAD did. Given a video clip V of 32 consecutive frames, our model provides activity predictions in two steps:
1. Generating activity-specific features: we generate independent feature representations for A different activities. This step consists of two sub-steps: we first generate K spatio-temporally independent feature snippets (observations), Obs ∈ R^{K×C'}, that focus on different spatio-temporal regions of the video (Figure 2, left). We then apply attention attnA on the observations to generate feature descriptors F_A (activity-specific features) that are independent for each activity, using independent weighted combinations of observations (Figure 2, middle).
2. Generating activity predictions: we finally provide a prediction for each activity by using its corresponding activity-specific features (Figure 2, right).
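To make the two steps and the tensor shapes concrete, here is a minimal shape trace in PyTorch. The feature-map size, the backbone channel count C, and the observation dimension C' are illustrative assumptions (only K = 64 observations and A = 157 Charades activities come from the paper), and the attention weights are random stand-ins for the learned attentions defined in the following subsections.

```python
import torch

# Illustrative shape trace of the two-step pipeline (assumed dimensions).
T, W, H = 4, 7, 7          # spatio-temporal size of the backbone feature map
C, C_prime = 2048, 512     # backbone channels and observation channels
K, A = 64, 157             # number of observations and number of activities
N = T * W * H

F_f = torch.randn(C, N)                                    # backbone features
attn_O = torch.softmax(torch.randn(K, N), dim=-1)          # per-observation attention
g_alpha = torch.randn(K, C_prime, C)                       # per-observation channel projections

# Step 1a: K observations, each a C'-dim summary of one spatio-temporal region.
Obs = torch.einsum('kdc,cn,kn->kd', g_alpha, F_f, attn_O)  # (K, C')

# Step 1b: A activity-specific features, each an independent mixture of observations.
attn_A = torch.softmax(torch.randn(A, K), dim=-1)
F_A = attn_A @ Obs                                         # (A, C')

# Step 2: one binary prediction per activity from its own feature descriptor.
W_phi, b_phi = torch.randn(C_prime), torch.randn(1)
F_out = torch.sigmoid(F_A @ W_phi + b_phi)                 # (A,)
print(Obs.shape, F_A.shape, F_out.shape)
```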
Generating Activity-specific Features
Given a feature set F_f ∈ R^{C×TWH} from the backbone network (e.g., I3D (Carreira and Zisserman 2017)), the activity-specific features can be generated as:

F_A = {attnA_1 Obs, attnA_2 Obs, ..., attnA_A Obs}   (1)

where F_A ∈ R^{A×C'} denotes A independent feature descriptors for their corresponding activities (activity-specific features), C' is the channel number of F_A, Obs ∈ R^{K×C'} denotes K independent feature snippets (observations) that are extracted from the backbone features F_f, and attnA_i (i ∈ 1, 2, ..., A) are the attentions that independently combine the K observations to generate the activity-specific features for the i-th activity. We create these observations instead of directly generating attnA from the backbone features F_f to reduce redundant information. Each observation is an independent spatio-temporal feature snippet that focuses on a specific key region in a video. The Obs are generated by applying K independent spatio-temporal attentions on F_f as:

Obs_k = attnO_k [g_{αk}(F_f)]^T   (2)

Obs = {Obs_1, Obs_2, ..., Obs_K}   (3)

where Obs_k ∈ R^{C'} is the k-th observation that focuses on a specific key region of the video, and attnO_k ∈ R^{TWH} denotes the spatio-temporal attention for generating the k-th observation. The g_{αk} is the linear function that integrates channels from F_f, which is represented as:

g_{αk}(F_f) = W_{αk} F_f   (4)

where W_{αk} ∈ R^{C'×C} are the weights of the linear function g_{αk}. The activity-specific set F_A can finally be written as:

F_A = {attnA_1 {attnO_1 [g_{α1}(F_f)]^T, ..., attnO_K [g_{αK}(F_f)]^T},
       attnA_2 {attnO_1 [g_{α1}(F_f)]^T, ..., attnO_K [g_{αK}(F_f)]^T},
       ...,
       attnA_A {attnO_1 [g_{α1}(F_f)]^T, ..., attnO_K [g_{αK}(F_f)]^T}}   (5)
Generating Attentions

The attention mechanism was introduced for capturing long-term dependencies within sequential inputs and is commonly used in natural language processing systems. We apply attention for generating spatio-temporally independent observations from the backbone features (attnO) and for generating activity-specific features by combining observations (attnA). We implemented the dot-product attention method (Vaswani et al. 2017) for generating attnO as:

attnO_k = softmax([g_{βk}(F_f)_{-1}]^T g_{γk}(F_f))   (6)

where attnO_k ∈ R^{TWH} denotes the attention for the k-th observation, g_{βk} and g_{γk} are linear functions of the same form as g_{αk} in equation (4), and g_{βk}(F_f)_{-1} ∈ R^{C'×1} denotes selecting the last row of g_{βk}(F_f) ∈ R^{C'×TWH} to produce an appropriate dimension for attnO_k. The attnA was generated using a similar approach. Other attention methods (e.g., additive attention (Bahdanau, Cho, and Bengio 2014)) could be used for generating the attentions, but we selected the dot-product attention method because previous research has shown that it is efficient and works well for machine translation (Vaswani et al. 2017; Shen et al. 2017).

Applying the linear functions in equation (4) requires a large number of weights. To reduce the number of weights for g_{αk}, we used a grouped 1D convolution to simulate the linear function in equation (4) as:

g_{αk}(F_f) = conv1d(F_f, groups = n)   (7)

Using a larger number of groups (n) results in fewer parameters. We set n = 16 empirically to minimize the number of weights without affecting the performance of the model.
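The following PyTorch sketch puts equations (2)-(7) together: grouped 1-D convolutions play the role of g_α, g_β, and g_γ, and attnO is the dot-product attention of equation (6). Because the paper only states that attnA is generated "using a similar approach", the learned per-activity query vectors used here for attnA, as well as the default dimensions, are our assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActivitySpecificFeatures(nn.Module):
    """Sketch of Eqs. (2)-(7): K observations via per-observation dot-product
    attention over the backbone feature map, then A activity-specific features
    as independent attention-weighted mixtures of those observations."""

    def __init__(self, C=2048, C_prime=512, K=64, A=157, groups=16):
        super().__init__()
        # Grouped 1-D convolutions stand in for the linear maps g_alpha, g_beta,
        # and g_gamma of Eq. (7); each produces C' channels per observation.
        self.g_alpha = nn.Conv1d(C, K * C_prime, kernel_size=1, groups=groups)
        self.g_beta = nn.Conv1d(C, K * C_prime, kernel_size=1, groups=groups)
        self.g_gamma = nn.Conv1d(C, K * C_prime, kernel_size=1, groups=groups)
        # Learned per-activity queries used to build attn_A (our reading of
        # "attn_A was generated using a similar approach").
        self.activity_queries = nn.Parameter(torch.randn(A, C_prime))
        self.K, self.C_prime = K, C_prime

    def forward(self, F_f):
        # F_f: backbone features of shape (B, C, T, W, H).
        B = F_f.shape[0]
        x = F_f.flatten(2)                                           # (B, C, N)
        values = self.g_alpha(x).view(B, self.K, self.C_prime, -1)   # (B, K, C', N)
        keys = self.g_gamma(x).view(B, self.K, self.C_prime, -1)     # (B, K, C', N)
        # Query = last spatio-temporal position of g_beta, as in Eq. (6).
        query = self.g_beta(x).view(B, self.K, self.C_prime, -1)[..., -1]  # (B, K, C')

        # attn_O: softmax over the N spatio-temporal positions, one map per observation.
        attn_O = F.softmax(torch.einsum('bkc,bkcn->bkn', query, keys), dim=-1)
        Obs = torch.einsum('bkn,bkcn->bkc', attn_O, values)          # (B, K, C')

        # attn_A: each activity independently re-weights the K observations.
        attn_A = F.softmax(
            torch.einsum('ac,bkc->bak', self.activity_queries, Obs), dim=-1)
        F_A = torch.einsum('bak,bkc->bac', attn_A, Obs)              # (B, A, C')
        return F_A, attn_O, attn_A
```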
Generating Activity Predictions

The final step of the module is to predict activities using the activity-specific features as:

F_out = sigmoid(W_ϕ F_A + b_ϕ)   (8)

where F_out ∈ R^{A×1} is the output of the model, and W_ϕ and b_ϕ are the trainable weights and bias for decoding the activity-specific features F_A into binary predictions. We compared the model performance when using independent fully-connected layers for each activity and when using a shared fully-connected layer for decoding all the activities from their corresponding activity-specific features. The two methods achieved similar performance. We chose to apply a shared fully-connected layer to reduce the number of parameters.
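As a small illustration of equation (8) with the shared fully-connected decoder (the dimensions and batch size below are placeholders):

```python
import torch
import torch.nn as nn

# Sketch of Eq. (8): one fully-connected layer, shared across activities, decodes
# each activity-specific feature vector into a single binary prediction.
C_prime, A = 512, 157
classifier = nn.Linear(C_prime, 1)            # shared W_phi and b_phi

F_A = torch.randn(8, A, C_prime)              # activity-specific features (B, A, C')
logits = classifier(F_A).squeeze(-1)          # (B, A): one logit per activity
F_out = torch.sigmoid(logits)                 # multi-label probabilities

# Training uses per-activity binary cross-entropy, as stated in the paper
# (computed on the logits here for numerical stability).
targets = torch.randint(0, 2, (8, A)).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
```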
Comparison with Other Attention Methods

Attention-based modules with spatial and temporal attention have already been used for activity recognition to extract features by focusing on important spatio-temporal regions (Girdhar and Ramanan 2017; Du, Wang, and Qiao 2017; Li et al. 2018; Meng et al. 2019). The nonlocal neural network extends them by generating spatio-temporal attention instead of separately using cascaded attention in space and time (Wang et al. 2018). However, both the nonlocal and other attention-based approaches make all the channels in F_f share the same attention. These methods do not work well for multi-label activities because the feature maps from different channels of the backbone features F_f may focus on different regions that correspond to different activities. Using attention shared across channels would cause feature maps to focus on irrelevant regions that might contain information important in the feature maps of other channels. In our approach, "observations" independently focus on different key regions of the video because they are generated by integrating different subsets of the channels of F_f and applying independent spatio-temporal attention on each subset.

Implementation
Implementation Details
We implemented our model with PyTorch (Paszke et al. 2017). We used batch normalization (Ioffe and Szegedy 2015) and ReLU activation (Hahnloser and Seung 2001) for all the convolution layers. We used the binary cross-entropy loss and the SGD optimizer with a fixed initial learning rate and weight decay. Dropout (rate = 0.5) was used after the dense layer to avoid overfitting (Srivastava et al. 2014). We set the batch size to 9 and trained our model with 3 RTX 2080 Ti GPUs for 50k iterations.

We applied spatio-temporal augmentation to avoid overfitting. Spatially, we applied scale jittering in the range of [256, 320] and horizontal flipping to augment the frames (Feichtenhofer et al. 2019). Temporally, we randomly picked a starting point in the video and selected the 32 consecutive frames that followed, as sketched below. For short videos with fewer than 32 frames, we padded the videos at the end by duplicating the last frame of the video.
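A minimal sketch of this temporal sampling, assuming the frames are already decoded into a (T, C, H, W) tensor; the default downsampling rate of 4 matches the initial sampling rate used later for speed-invariant tuning:

```python
import random
import torch

def sample_clip(frames, clip_len=32, rate=4):
    """Pick a random starting point and take `clip_len` consecutive frames of the
    video downsampled by `rate`; short videos are padded by repeating the last
    frame. `frames` is assumed to be a (T, C, H, W) tensor of decoded frames."""
    offsets = torch.arange(clip_len) * rate
    last_start = frames.shape[0] - 1 - int(offsets[-1])
    start = random.randint(0, max(0, last_start))
    idx = (start + offsets).clamp(max=frames.shape[0] - 1)   # repeats the last frame
    return frames[idx]
```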
Speed-invariant Tuning

In multi-label activity videos, different activities may have different duration. Figure 3 shows an example in Charades with two activities, opening a fridge and walking through a doorway. Because our system requires the same number of frames (32) for each input clip, using the same sampling rate for video frames may cover long activities only partially, while short activities may appear in only a few frames. To ensure that activities of different duration are properly covered in 32-frame inputs, we introduced a speed-invariant tuning method. We first trained the complete model using the downsampling rate of 4 and froze the weights of all the 3D convolution layers (Algorithm 1, steps 1-2). We then finetuned the module after the 3D convolutions for I iterations, using 32-frame inputs obtained by randomly selecting a downsampling rate r among 2, 4, and 8 (Algorithm 1, steps 3-7). During the testing stage, we summed the predictions of the model generated with all three downsampling rates r ∈ {2, 4, 8} to obtain the final video-level activity prediction (Algorithm 1, step 8). The full algorithm of our speed-invariant tuning method is presented in Algorithm 1, and a code sketch is given below it. We set the initial sampling rate s to 4 as in (Wang et al. 2018). The model can then recognize activities that have different duration by aggregating the results from branches that used different downsampling rates as input.

Figure 3: A video example in Charades shows multiple activities (opening a fridge and walking through a doorway) that require inputs with different sample rates.

Algorithm 1: Speed-invariant Tuning
1. Train the entire model with the video sampling rate set to s.
2. Freeze the backbone network (3D-Conv layers) weights.
3. for iteration = 0, ..., I do
4.   Randomly select a sample rate r ∈ {s/2, s, 2s}.
5.   Fetch 32 consecutive frames from the video downsampled by r and obtain the corresponding ground truth y.
6.   Perform a gradient step on the loss between y and F_out according to equation (8).
7. end for
8. Evaluate the model on the testing set with video-level predictions generated by summing the predictions obtained with all the sample rates r ∈ {s/2, s, 2s}.
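Below is a minimal sketch of the tuning loop in Algorithm 1. The model/backbone split, the data interface, the clip sampler, and the learning rate are placeholders rather than the paper's actual training setup.

```python
import random
import torch

def speed_invariant_tuning(model, tunable_params, train_set, sample_clip,
                           iterations, s=4):
    """Minimal sketch of Algorithm 1. `model` (with a `.backbone` of 3D-conv
    layers), `tunable_params` (the weights after the backbone), `train_set`
    (pairs of video tensor and multi-label target), and `sample_clip` are
    placeholders for the paper's actual interfaces."""
    for p in model.backbone.parameters():          # step 2: freeze 3D-conv layers
        p.requires_grad = False                    # (step 1, full training, is assumed done)
    optim = torch.optim.SGD(tunable_params, lr=1e-3)   # placeholder learning rate
    bce = torch.nn.BCELoss()
    for _ in range(iterations):                    # steps 3-7: tune on mixed rates
        video, y = random.choice(train_set)
        r = random.choice([s // 2, s, 2 * s])      # downsampling rate in {2, 4, 8}
        loss = bce(model(sample_clip(video, rate=r)), y)
        optim.zero_grad(); loss.backward(); optim.step()

def video_level_prediction(model, video, sample_clip, s=4):
    # Step 8: sum the predictions obtained with all three downsampling rates.
    with torch.no_grad():
        return sum(model(sample_clip(video, rate=r)) for r in (s // 2, s, 2 * s))
```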
Experiments

We evaluated our method on three multi-label activity datasets: one large-scale dataset, Charades (Sigurdsson et al. 2016), and two small datasets, Volleyball (Sozykin et al. 2018) and Hockey (Carbonneau et al. 2015). To show that our proposed method generalizes to different activity recognition tasks, we also tested it on Kinetics-400 (Kay et al. 2017) and UCF-101 (Soomro, Zamir, and Shah 2012), two commonly used single-activity recognition datasets.
Experiments on Charades
The Charades dataset (Sigurdsson et al. 2016) contains 9848 videos with an average length of 30 seconds. The dataset includes 157 multi-label daily indoor activities. We used the officially provided train-validation split (7985/1863). We used the officially provided 24-fps RGB frames as input and the officially provided evaluation script for evaluating on the validation set. During the evaluation, we used the 30-view test following (Feichtenhofer et al. 2019).
Results Overview on Charades
Table 1: Charades evaluation using mAP (mean average precision) in percentages, calculated using the officially provided script. "Charades Ego" means using the Charades Ego dataset as supplemental training data.

method | backbone | mAP
2D CNN (Sigurdsson et al. 2016) | AlexNet | 11.2
2-stream (Sigurdsson et al. 2016) | VGG16 | 22.4
Action-VLAD (Girdhar et al. 2017) | VGG16 | 21.0
CoViAR (Wu et al. 2018) | Res2D-50 | 21.9
MultiScale TRN (Zhou et al. 2018) | Inception | 25.2
I3D (Carreira and Zisserman 2017) | Inception | 32.9
STRG (Wang and Gupta 2018) | Nonlocal-101 | 39.7
LFB (Wu et al. 2019) | Nonlocal-101 | 42.5
SlowFast (Feichtenhofer et al. 2019) | SlowFast-101 | 45.2
Multi-Grid (Wu et al. 2020) | SlowFast-50 | 38.2
X3D (Feichtenhofer 2020) | X3D | 47.2
CSN (Tran et al. 2019) (Baseline) | CSN-152 | 45.4
Ours | CSN-152 | 48.2
Ours + Charades Ego | CSN-152 | 50.3

We compared our system (Table 1) with the baseline network, CSN (Tran et al. 2019) pre-trained on IG-65M (Ghadiyaram, Tran, and Mahajan 2019), as well as with other state-of-the-art methods that work on Charades. Compared to the baseline network, our method achieved a higher mAP score on Charades. We outperformed all the other methods (Wu et al. 2019; Hussein, Gavves, and Smeulders 2019; Wang and Gupta 2018; Feichtenhofer et al. 2019) and the recent state-of-the-art approach, X3D (Feichtenhofer 2020), which was pre-trained on Kinetics-600 (Carreira et al. 2018), on Charades. This shows that our activity-specific features are representative of their corresponding activities and work better for multi-label activity recognition tasks. Because the model performance on Charades highly depends on the pre-trained backbone network, our method could be further improved if we could use the most recent X3D as the backbone (Feichtenhofer 2020).

Our method significantly outperformed another bag-of-features method, Action-VLAD, on Charades (48.2% vs. 21.0%) because our network captures spatio-temporal features independently for each activity, instead of aggregating visual words over time. In addition, Action-VLAD only works on 2D backbone networks, and so cannot benefit from the recent 3D backbone networks that work better for activity recognition.

Charades is relatively small for recognizing more than 100 multi-label activities. To further boost the performance of our model on Charades, we included the videos from the Charades Ego dataset (Sigurdsson et al. 2018) in the training set. The Charades Ego dataset (Sigurdsson et al. 2018) contains 3930 videos with the same activity list as Charades. There are no overlapping videos between the two datasets. After including Charades Ego, our model achieved 50.3% mAP on Charades.

Ablation Experiment on Charades
We next ablated our system with various hyper-parameters (group size, observation number, and sampling rate for speed-invariant inputs).
Group sizes.
Table 2a shows the system performance for different values of the group size (n) in equation (7) when generating observations (the number of observations is 64 and the downsampling rate is 4). The performance on Charades stayed at about 47% when the group size increased from 1 to 16, but dropped quickly for n = 32. A larger group size results in using a smaller subset of channels from F_f for generating observations, which requires fewer parameters but may cause a performance drop because of information loss.

Observations number.
Table 2b compares the system performance for different numbers of observations (the group size is 16 and the downsampling rate is 4). The best-performing number of observations is 64, which also requires the fewest weights. Using a larger number of observations helps cover more key parts of the videos, but the performance saturates for more than 64 observations.
Model structure ablation.
We then evaluated the system on Charades by removing each component from our network. The model without attnA is implemented by flattening the dimensions of observations and feature channels and recognizing activities using a fully-connected layer. Table 2c shows that the model without attnA for generating activity-specific features achieved similar performance as the baseline model, because the fully-connected layer applied on a flattened feature vector requires a large number of weights, which causes overfitting. The model without attnO is implemented by setting the observation number K equal to the number of activities A. We failed to get the result for this model on Charades because it caused an out-of-memory error due to the huge number of parameters. These results show that our network achieves a performance boost through the combination of its key components.

Table 2: Ablation experiments on the Charades dataset. We show the mAP scores and parameter counts when using different hyper-parameters and backbone networks, and after removing different modalities from our tri-axial attention module.

(a) Group size: performance on Charades when using different group sizes, with the observation number set to 64 and the sample rate to 4.
group size | mAP | params
16 | 47.1 | 3.2M
32 | 46.3 | 1.6M

(b) Observation number: performance on Charades when using different numbers of observations, with the group size set to 16 and the sample rate to 4.
obs num | mAP | params
64 | 47.1 | 3.2M
128 | 46.7 | 6.3M

(c) Model structure ablation: performance on Charades after removing each processing stage from the tri-axial attention.
model structure | mAP
baseline | 45.4
no attnO | -
no attnA | -
complete | 47.1

(d) Sample rate for speed-invariant tuning: performance on Charades when using different sample rates, with the observation number set to 64 and the group size to 16.

(e) Backbone network: performance on Charades when plugging our module into different baseline models.
backbone | mAP (Baseline / Ours)
Nonlocal-50 | 38.3 / -
Nonlocal-101 | 40.3 / -
CSN-152 | 45.4 / -

Sampling rates for speed-invariant tuning.
We also evaluated our speed-invariant tuning method by merging predictions obtained with different downsampling rates at the input. Table 2d shows that the speed-invariant models achieved better performance compared to the model using a single downsampling rate of 4, because the speed-invariant tuning method makes the features better represent activities of different duration. The system achieved the best performance by merging predictions based on the 2, 4, and 8 downsampling rates, and this method did not require extra parameters (Table 2d, row 3).
Backbone network.
We finally evaluated our model by plugging our method into different backbone networks. Table 2e shows that our method achieved around a 2.5% mAP increase on Charades after being plugged into each of the three backbone networks (Nonlocal-50, Nonlocal-101 (Wang et al. 2018), and CSN-152 (Tran et al. 2019)). This shows that our method generalizes well for multi-label activity recognition across different existing backbone networks.
Experiments on Hockey and Volleyball
We also ran experiments on two small datasets, Hockey and Volleyball (Ibrahim et al. 2016; Sozykin et al. 2018). The Volleyball dataset contains 55 videos with 4830 annotated video clips. The dataset includes two sets of labels: a group activity recognition task (8-class multi-class classification) and a multi-label activity recognition task (9-label multi-label classification). We evaluated our method on both of these tasks.
The experimental results on Hockey are in the supplemental material because of the page limit.
Table 3: Experimental results on Volleyball. The "s" and "bb" in the last two columns denote using the whole scene and bounding boxes of persons as supplemental information for recognizing group activities.

method | Volleyball Personal (multi-label) Acc. | Volleyball Group Acc. (s) | Volleyball Group Acc. (bb)
Hier LSTM (Ibrahim et al. 2016) | 72.7 | 63.1 | 81.9
SRNN (Biswas and Gall 2018) | 76.6 | - | 83.4
So-Sce (Bagautdinov et al. 2017) | 82.4 | 75.5 | 89.9
CRM (Azar et al. 2019) | - | 75.9 | 93.0
Act-trans (Gavrilyuk et al. 2020) | 85.9 | - | 94.4
CSN-152 baseline | 85.0 | 87.1 | -
Ours | 86.2 | 87.2 | -
Ours + speed-invariant | 86.6 | 87.6 | 95.5

Table 3 shows that our system substantially outperformed all the existing approaches on Volleyball for multi-label activities (Gavrilyuk et al. 2020). We also compared our method with the baseline model using the latest backbone network that works on activity recognition (CSN baseline in Table 3). Our system achieved a roughly 2% higher accuracy score compared with the baseline model, which shows that the activity-specific features also improve multi-label activity recognition on small sports datasets.

We further evaluated our method on Volleyball for group activity recognition. The group activity task is essentially a single-activity recognition problem: only one activity occurs during one video clip. Our method outperformed other state-of-the-art methods when using the whole scene (s) as input (Gavrilyuk et al. 2020), i.e., RGB frames without using bounding boxes around people. This shows that our method generalizes to the single-activity recognition problem as well. Previous methods (Azar et al. 2019; Gavrilyuk et al. 2020) used bounding boxes around people (bb) and their individual activities as supplemental information for group activity recognition. We tested our model with this supplemental information included, and our approach outperformed the recent state-of-the-art method (95.5 for our system vs. 94.4 for Act-trans in the last column of Table 3). Compared to the baseline network, our method was only slightly better (87.2 vs. 87.1 in the second-to-last column of Table 3). The activity-specific features do not help significantly in single-activity problems, unlike the case of multi-label activities, because the feature maps will only focus on the one region where the single activity occurred.

Figure 4: Visualizing the activity-specific features in two videos from the Charades dataset. The bounding boxes in the original frames correspond to the activated regions in the activity-specific feature maps. The activity-specific feature maps only focus on the regions where the corresponding activities are being performed and have low values if no activity is being performed at that time.

Table 4: Method evaluation on Kinetics-400 and UCF-101. The scores are top-1 accuracy in percentage.

Experiment on Kinetics-400 and UCF-101
To demonstrate that our method generalizes to single-activity recognition tasks, we evaluated it on two single-activity recognition datasets, Kinetics-400 (Kay et al. 2017) and UCF-101 (Soomro, Zamir, and Shah 2012). We fine-tuned our network with the backbone weights frozen. Table 4 shows that our approach, although specialized for multi-label activity recognition, generalized well to single-activity datasets and achieved similar results as the current state-of-the-art approaches (bottom row in Table 4). Our method slightly outperformed the baseline network (CSN baseline in Table 4) because both datasets contain only single activities. As we described earlier, activity-specific features do not help significantly for single activities.
Feature Visualization
To better understand what activity-specific features are learned, we visualized these features for the activities present in the video clips. Figure 4 shows two examples from Charades, including the activity-specific feature maps (last two rows of each example in Figure 4) and their corresponding input frames. The activity-specific feature maps were generated by applying the learned attnO and attnA on the backbone features F_f. We normalized the feature maps between 0 and 1 and plotted these maps for the activities present in the video (last two rows of each example in Figure 4). To make the visualized maps more understandable, we applied a 0.5 threshold to the activity-specific feature maps and drew bounding boxes in different colors (blue, red) for the different activities in the original frames around the regions activated in the feature maps.

Based on the visualizations in Figure 4, we can make three points. First, the activity-specific features only focus on the spatial regions of the corresponding activity when multiple activities are performed simultaneously. The visualization of video "2NXFV" (Figure 4, left) shows the activity-specific features
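As a rough sketch of how such per-activity maps could be rendered: the paper applies the learned attnO and attnA to the backbone features F_f, whereas the simplification below combines only the attention maps themselves, so the exact recombination (and the thresholding helper) is our assumption.

```python
import torch

def activity_feature_maps(attn_O, attn_A, T, W, H, threshold=0.5):
    """Render one spatio-temporal map per activity from the learned attentions:
    attn_O has shape (K, T*W*H) and attn_A has shape (A, K). Combining only the
    attention maps (rather than the full backbone features) is a simplification."""
    maps = (attn_A @ attn_O).view(-1, T, W, H)                 # (A, T, W, H)
    flat = maps.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    maps = ((flat - lo) / (hi - lo + 1e-6)).view_as(maps)      # normalize to [0, 1]
    return maps, maps > threshold                              # maps and binary masks
```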
Conclusion and Future Work

We introduced a novel network that focuses on multi-label activity recognition. The system generates spatio-temporally independent activity-specific features for each activity and outperformed previous state-of-the-art methods on three multi-label activity recognition datasets. The visualizations showed that the activity-specific features are representative of their corresponding activities. We also evaluated our method on two single-activity recognition datasets to show its generalizability. One issue remains in the speed-invariant tuning method, where we simply summed the predictions obtained with different downsampling rates for the inputs. Extending the speed-invariant method to enable the model to learn to select features from appropriate scales for different activities will be our future work.

References
Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; and Sivic, J. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5297–5307.
Azar, S. M.; Atigh, M. G.; Nickabadi, A.; and Alahi, A. 2019. Convolutional Relational Machine for Group Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7892–7901.
Bagautdinov, T.; Alahi, A.; Fleuret, F.; Fua, P.; and Savarese, S. 2017. Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4315–4324.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Biswas, S.; and Gall, J. 2018. Structural recurrent neural network (SRNN) for group activity analysis, 1625–1632. IEEE.
Carbonneau, M.-A.; Raymond, A. J.; Granger, E.; and Gagnon, G. 2015. Real-time visual play-break detection in sport events using a context descriptor, 2808–2811. IEEE.
Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; and Zisserman, A. 2018. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340.
Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; and Bray, C. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, 1–2. Prague.
Du, W.; Wang, Y.; and Qiao, Y. 2017. Recurrent spatial-temporal attention network for action recognition in videos. IEEE Transactions on Image Processing.
Feichtenhofer, C. 2020. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 203–213.
Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 6202–6211.
Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1933–1941.
Gavrilyuk, K.; Sanford, R.; Javan, M.; and Snoek, C. G. 2020. Actor-transformers for group activity recognition. arXiv preprint arXiv:2003.12737.
Ghadiyaram, D.; Tran, D.; and Mahajan, D. 2019. Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12046–12055.
Girdhar, R.; and Ramanan, D. 2017. Attentional pooling for action recognition. In Advances in Neural Information Processing Systems, 34–45.
Girdhar, R.; Ramanan, D.; Gupta, A.; Sivic, J.; and Russell, B. 2017. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 971–980.
Goyal, R.; Kahou, S. E.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. 2017. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. In ICCV, volume 1, 3.
Hahnloser, R. H.; and Seung, H. S. 2001. Permitted and forbidden sets in symmetric threshold-linear networks. In Advances in Neural Information Processing Systems, 217–223.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hussein, N.; Gavves, E.; and Smeulders, A. W. 2019. Timeception for Complex Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 254–263.
Ibrahim, M. S.; Muralidharan, S.; Deng, Z.; Vahdat, A.; and Mori, G. 2016. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1971–1980.
Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: A large video database for human motion recognition. In International Conference on Computer Vision, 2556–2563. IEEE.
Li, D.; Yao, T.; Duan, L.-Y.; Mei, T.; and Rui, Y. 2018. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia.
Li, X.; et al. 2017. arXiv preprint arXiv:1702.01638.
Lin, J.; Gan, C.; and Han, S. 2018. Temporal shift module for efficient video understanding. arXiv preprint arXiv:1811.08383.
Meng, L.; Zhao, B.; Chang, B.; Huang, G.; Sun, W.; Tung, F.; and Sigal, L. 2019. Interpretable spatio-temporal attention for video action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 0–0.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Pan, S.; and Zhang, C. 2017. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. arXiv preprint arXiv:1709.04696.
Sigurdsson, G. A.; Gupta, A.; Schmid, C.; Farhadi, A.; and Alahari, K. 2018. Actor and observer: Joint modeling of first and third-person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7396–7404.
Sigurdsson, G. A.; Varol, G.; Wang, X.; Farhadi, A.; Laptev, I.; and Gupta, A. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, 510–526. Springer.
Simonyan, K.; and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568–576.
Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Sozykin, K.; Protasov, S.; Khan, A.; Hussain, R.; and Lee, J. 2018. Multi-label class-imbalanced action recognition in hockey videos via 3D convolutional neural networks, 146–151. IEEE.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.
Sudhakaran, S.; and Lanz, O. 2019. Intelligenza Artificiale.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.
Tran, D.; Wang, H.; Torresani, L.; and Feiszli, M. 2019. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 5552–5561.
Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; and Paluri, M. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6450–6459.
Tu, Z.; Li, H.; Zhang, D.; Dauwels, J.; Li, B.; and Yuan, J. 2019. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, 20–36. Springer.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Wang, X.; and Gupta, A. 2018. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), 399–417.
Wu, C.-Y.; Feichtenhofer, C.; Fan, H.; He, K.; Krahenbuhl, P.; and Girshick, R. 2019. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 284–293.
Wu, C.-Y.; Girshick, R.; He, K.; Feichtenhofer, C.; and Krahenbuhl, P. 2020. A Multigrid Method for Efficiently Training Video Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 153–162.
Wu, C.-Y.; Zaheer, M.; Hu, H.; Manmatha, R.; Smola, A. J.; and Krähenbühl, P. 2018. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6026–6035.
Xie, S.; Sun, C.; Huang, J.; Tu, Z.; and Murphy, K. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), 305–321.
Yan, S.; Smith, J. S.; and Zhang, B. 2017. Action recognition from still images based on deep VLAD spatial pyramids. Signal Processing: Image Communication 54: 118–129.
Yang, C.; Xu, Y.; Shi, J.; Dai, B.; and Zhou, B. 2020. Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 591–600.
Zhou, B.; Andonian, A.; Oliva, A.; and Torralba, A. 2018. Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV).