ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
Yu Liu∗∗, Fan Yang, Dominique Ginhac
ImViA EA7535, Univ. Bourgogne Franche-Comté, Dijon 21078, France
∗∗ Corresponding author: e-mail: [email protected] (Yu Liu), [email protected] (Fan Yang), [email protected] (Dominique Ginhac)
ABSTRACT
Interpreting human actions requires understanding the spatial and temporal context of the scenes. State-of-the-art action detectors based on Convolutional Neural Networks (CNN) have demonstrated remarkable results by adopting two-stream or 3D CNN architectures. However, these methods typically operate in a non-real-time, offline fashion due to the system complexity required to reason spatio-temporal information. Consequently, their high computational cost is not compliant with emerging real-world scenarios such as service robots or public surveillance, where detection needs to take place at resource-limited edge devices. In this paper, we propose ACDnet, a compact action detection network targeting real-time edge computing which addresses both efficiency and accuracy. It intelligently exploits the temporal coherence between successive video frames to approximate their CNN features rather than naively extracting them. It also integrates memory feature aggregation from past video frames to enhance current detection stability, implicitly modeling long temporal cues over time. Experiments conducted on the public benchmark datasets UCF-24 and JHMDB-21 demonstrate that ACDnet, when integrated with the SSD detector, can robustly achieve detection well above real-time (75 FPS). At the same time, it retains reasonable accuracy (70.92 and 49.53 frame mAP) compared to other top-performing methods using far heavier configurations. Code will be available at https://github.com/dginhac/ACDnet.
1. Introduction
In past years, human action detection has been an active area of research driven by numerous applications: autonomous vehicles, video search engines, human-computer interaction, etc. As it aims not only to recognize actions of interest in a video but also to localize each of them, action detection poses more challenges than video classification. The task becomes even more difficult in practical applications where detection must be performed in an online setting and at real-time speed. For instance, time-critical scenarios such as autonomous driving demand instant detection so that machines can react immediately. Other use cases seeking mobile or large-scale deployment, such as service robots and distributed unmanned surveillance, require detection or scene meta-data extraction on low-end edge devices. In general, edge devices (e.g., embedded systems) have limited computational power and are only compliant with resource-efficient detection algorithms.
Following the success of Convolutional Neural Networks (CNN) in diverse computer vision tasks, modern action detectors are mainly based on CNN. In particular, fast object detectors have been widely adopted to spatially localize action instances at each frame (Singh et al. (2017), Zhao and Snoek (2019)). Naturally, effective temporal modeling plays an imperative role in identifying an action. To reason about both spatial and temporal context, Simonyan and Zisserman (2014) pioneered the two-stream CNN framework, which aggregates spatial and temporal cues from separate networks and input modalities (RGB and optical flow). Such an approach has motivated many state-of-the-art methods in the field of action recognition and detection. Alternatively, 3D CNNs (Carreira and Zisserman (2017)), which perform spatio-temporal feature learning on stacked frames, have also been increasingly explored to tackle video analysis tasks.

Despite recent advances in action detection, existing methods are inherently sub-optimal in two aspects. First, consecutive video frames exhibit high appearance similarity; extracting frame features without taking this inter-frame similarity into account introduces redundancy. Second, the increased system complexity associated with employing two-stream or 3D CNN models is not proportionally reflected in the detection accuracy. Instead, it inevitably raises the computational requirements associated with motion extraction and 3D convolution operations, prohibiting practical deployment on edge devices.

This work focuses on action detection solutions more pertinent to the criteria of realistic applications. To address the aforementioned limitations, we first exploit the temporal coherence among nearby video frames to enhance detection efficiency. This is embodied by performing feature approximation at the majority of frames in a video, mitigating the re-extraction of similar features from neighboring frames. Furthermore, we hypothesize that a less expensive framework can effectively extract meaningful temporal context. Here, we adopt a multi-frame feature aggregation module, which recursively accumulates 2D spatial features over time to encapsulate long temporal cues. Such feature aggregation implicitly models temporal variations of actions and facilitates understanding degenerated frames with limited visual cues.

To the best of our knowledge, this is the first attempt to apply feature approximation and aggregation techniques to achieve efficient action detection that can benefit resource-limited devices. To summarize, our contribution is three-fold:

• We propose an integrated detection framework, ACDnet, to address both detection efficiency and accuracy. It combines feature approximation and memory aggregation modules, leading to improvement in both aspects.

• Our generalized framework allows for smooth integration with state-of-the-art detectors. When incorporated with SSD (single shot detector), ACDnet can reason spatio-temporal context well above real-time speed, making it more appealing for resource-constrained devices.

• We conduct detailed studies in terms of accuracy, efficiency, robustness, and qualitative analysis on the public action datasets UCF-24 and JHMDB-21.
2. Related work
Recent advancements in action detection are largely led by building upon successful cases in object detection and action recognition. Here, we briefly review these relevant topics.
Object detection based on CNNs can be grouped into two families. Two-stage approaches such as Faster R-CNN by Ren et al. (2015) and R-FCN by Dai et al. (2016) first extract potential object regions from images, and then perform object classification and bounding box regression on the features corresponding to each proposed location. Such a sequential pipeline imposes a bottleneck to real-time inference. Alternatively, single-stage detectors such as YOLO proposed by Redmon and Farhadi (2017), or SSD by Liu et al. (2016), remove the intermediate region proposal step, directly achieving bounding box regression and classification in a single forward pass. Bypassing the intermediate bottleneck enables real-time detection at the cost of a minor accuracy drop.

A number of studies focus on video object detection instead of the image domain. Popular approaches such as Han et al. (2016) and Kang et al. (2017) exploit videos' temporal consistency by associating detection boxes and scores from multiple frames. Similarly, but at the feature level, Hetang et al. (2017) and Zhu et al. (2017a) aggregate multiple frame features to enhance detection accuracy. On the other hand, Zhu et al. (2017b) leverage the temporal redundancy among video frames to improve detection efficiency. Their framework propagates features from a sparse set of key frames to successive ones via motion, avoiding the re-extraction of similar object features. In a similar spirit, Liu and Zhu (2018) propagate frame-level information across frames using a recurrent-convolutional architecture.

Action recognition is typically treated as a classification task on trimmed videos (Yao et al. (2019)). In addition to spatial features, reasoning temporal information across multiple frames is also crucial. Among different temporal modeling techniques, the two-stream architecture of Simonyan and Zisserman (2014) demonstrates state-of-the-art performance. Its framework consists of two feed-forward pathways, with one CNN learning spatial features from the RGB stream and the other learning motion features from the optical flow stream. The two streams are trained and run inference independently to aggregate complementary features (Feichtenhofer et al. (2016)). Even though such a framework can exploit existing 2D CNN backbones, fine-grained optical flow is expensive to extract. Thus, flow images are typically pre-computed, which does not conform to the online workflow demanded in real-world scenarios.

Recently, 3D CNNs have been increasingly explored (Carreira and Zisserman (2017), Li et al. (2019)) along with the release of the large-scale action dataset Kinetics. They utilize 3D kernels to jointly perform spatio-temporal feature learning from stacked RGB frames, achieving comparable and even superior modeling capability to two-stream CNNs. However, these models inherently suffer from a higher number of parameters and computational cost than their 2D counterparts, making their deployment on resource-constrained devices impractical.

Efficient spatio-temporal modeling. To alleviate the high computational cost associated with flow extraction, several studies seek alternative motion representations that are easier to compute. These include feature-level displacement (Jiang et al. (2019), Sun et al. (2018b)), or simply taking the RGB difference between adjacent frames (Wang et al. (2016)). On the other hand, to reduce the complexity of 3D CNNs, decoupled architectures such as P3D (Qiu et al. (2017)) and R(2+1)D (Tran et al. (2018)) factorize 3D convolutions into separate spatial and temporal operations. The temporal shift module (TSM) of Lin et al. (2019) instead shifts part of the feature channels along the temporal dimension to exchange information among different timesteps.
Their approach has demonstrated effectiveness on edge devices such as the Jetson Nano and Galaxy Note8.

Spatio-temporal action detection simultaneously addresses action localization and classification in time and space. Leading approaches often leverage CNN object detectors as the core building block. The extension mainly consists of adopting the two-stream framework and fusing complementary detection results from the spatial and temporal streams to acquire frame-level detections, as demonstrated in Singh et al. (2017). For temporal localization, detections at each frame are then linked over time to construct action tubes (Peng and Schmid (2016)). Beyond detection at the frame level, Kalogeiton et al. (2017) and Li et al. (2020) adopt a clip-based approach, which stacks multiple frame features to capture temporal cues on top of the two-stream architecture. In this case, actions are regressed and inferred directly on action cuboids.

Inspired by the latest adoption of 3D CNNs in action recognition, more recent studies incorporate a 3D CNN as the backbone (Yang et al. (2019), Sun et al. (2018a), Girdhar et al. (2019), Gu et al. (2018), Wei et al. (2019)). In addition, various ways of fusing spatial and temporal context have also been investigated. Besides aggregating at the detection level (e.g., taking the union of detection results), others perform feature-level fusion.
3. ACDnet
Our objective is to perform detection in an online manner for every incoming frame of a video. The proposed ACDnet, which consists of the feature approximation and memory aggregation modules, is summarized in Figure 1.

Fig. 1. Illustration of the ACDnet inference pipeline. (a) At the initial frame, features are obtained from the feature extraction sub-network (N_feat). (b) For non-key frames (dense), the flow sub-network (N_flow) estimates a pair of flow field and position-wise scale map between the non-key frame and its preceding key frame. The resulting flow field is used to propagate the appearance feature, which is then refined by the scale map via element-wise multiplication. (c) At key frames (sparse), new features are extracted. They are then aggregated with those from the past key frames (memory features) via N_flow and the aggregation sub-network (N_aggr). The fused features are used for detection (N_det) and also passed along as the updated memory.
Video content varies slowly over consecutive frames. This phenomenon is even more pronounced in the corresponding CNN feature maps, which capture high-level semantics. Intuitively, the appearance shared among neighboring frames can help to propagate essential information for a given task. The practice of feature propagation in Zhu et al. (2017b) has proven successful in enhancing object detection efficiency in videos, which motivates our feature approximation module.

Within the approximation scheme, the heavier feature extraction sub-network, N_feat, only operates on a sparse set of key frames during inference. The features of successive non-key frames are obtained by spatially transforming those of their preceding key frames via two-channel flow fields. The workflow can be summarized by the following equations. Let M_{i→k} be the two-channel flow field capturing the relative motion (horizontal and vertical) from the current frame I_i to its previous key frame I_k. Feature approximation (also referred to as feature propagation) is then realized by inverse warping:

$$F_i = \mathcal{W}(F_k, M_{i \to k}) \qquad (1)$$

where F_k is the key frame feature, F_i is the newly warped feature corresponding to I_i, and $\mathcal{W}$ denotes the inverse warping operation that samples the appropriate key frame features and assigns them to the warped ones. Inverse warping ensures that every location p of the warped feature can be projected back to a point p + Δp in the key frame feature, where Δp = M_{i→k}(p). Concretely, the warping operation $\mathcal{W}$ is performed as:

$$f_i^c(p_i) = \sum_{p_k} G(p_k,\, p_i + \Delta p)\, f_k^c(p_k) \qquad (2)$$

In Equation 2, $f_i^c$ and $f_k^c$ denote the c-th channel of the features F_i and F_k, respectively, and G denotes the bilinear interpolation kernel. Every location p_i in the warped feature map undergoes this sampling scheme, independently for each feature channel c. The warping operation is much lighter than layers of convolution for feature extraction. Consequently, by applying feature approximation on a dense set of non-key frames, computation is greatly reduced.

Previous methods on action-based tasks typically acquire motion features from accurate optical flow computed by non-learning-based algorithms. However, computing flow in such a way imposes a bottleneck on real-time, online detection, since it is either time-consuming or requires pre-computed flow results. In contrast, ACDnet integrates a fast flow estimation sub-network, N_flow, to predict flow fields. In our case, optical flow only serves to spatially transform CNN features; it does not need to capture fine-grained motion details and has the same height and width as the corresponding feature map. Using such a learning-based flow estimator also allows it to be jointly trained with all other sub-networks specific to the task of action detection.

In detail, the flow sub-network takes (I_k, I_i) as input and generates a pair of motion field and position-wise scale map. Given that H, W, and C denote the height, width, and number of channels of F_k, the flow field M_{i→k} is of size H × W × 2, while the scale map is of size H × W × C, matching the dimension of F_k to be warped. After the inverse warping described by Equation 1, the warped feature F_i is refined by element-wise multiplication with the scale map. Both F_k and F_i are fed to the shared detection sub-network, N_det, to obtain the final detections. This workflow is illustrated in Figure 1 (a) and (b).

Propagating features across frames reduces the computation cost associated with feature extraction.
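To make the warping step concrete, the following is a minimal sketch of flow-guided feature approximation (Equations 1 and 2), assuming a PyTorch-style implementation; the function name, tensor layouts, and flow channel ordering are illustrative and not taken from the released code.

```python
# Minimal sketch of Eq. 1-2: approximate F_i by inverse-warping the key-frame
# feature F_k with a two-channel flow field, then refine with a scale map.
import torch
import torch.nn.functional as F

def warp_features(feat_k, flow_i_to_k, scale_map=None):
    """feat_k:      key-frame feature F_k, shape (N, C, H, W)
       flow_i_to_k: flow field M_{i->k}, shape (N, 2, H, W); for each location
                    p in F_i it gives the displacement delta_p towards F_k
       scale_map:   optional position-wise scale map, shape (N, C, H, W)"""
    n, _, h, w = feat_k.shape
    # Regular grid of target locations p_i (pixel coordinates, x then y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat_k.device)   # (2, H, W)
    # Sampling locations p_i + delta_p in the key-frame feature.
    src = grid.unsqueeze(0) + flow_i_to_k                            # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    src_x = 2.0 * src[:, 0] / max(w - 1, 1) - 1.0
    src_y = 2.0 * src[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((src_x, src_y), dim=-1)                # (N, H, W, 2)
    # Bilinear kernel G of Eq. 2, applied independently per channel.
    feat_i = F.grid_sample(feat_k, sample_grid, mode="bilinear",
                           padding_mode="border", align_corners=True)
    # Position-wise refinement by the predicted scale map (element-wise).
    if scale_map is not None:
        feat_i = feat_i * scale_map
    return feat_i
```

Only a single bilinear sampling pass over the feature map is required, which is why this step is much cheaper than re-running the feature extraction sub-network.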
However, since most features are now approximated and depend heavily on the quality of the preceding key frame features, we adopt a memory aggregation module, inspired by Hetang et al. (2017), to enhance the feature representation at key frames. Given incoming video frames, the core of memory aggregation is to reinforce the features of a target frame by recursively incorporating supportive and discriminating context from the past. This allows implicit spatio-temporal modeling without explicitly extracting motion features. In addition, when the current frame is deteriorated, an action can still be inferred from the supportive visual cues held in memory. Figure 1 (c) gives an example of a case where such memory aggregation is useful.

Memory aggregation shares the same warping operation used for feature approximation. ACDnet takes a sparse and recursive approach, aggregating memory features only at key frames, since nearby frames share similar appearances. Given two succeeding key frames I_{k'} and I_k, where I_k is the more recent one in time, memory aggregation follows Equation 3:

$$F_k^{\text{aggregated}} = w_{k'} \otimes F'_{k'} + w_k \otimes F_k \qquad (3)$$

where $F'_{k'} = \mathcal{W}(F_{k'}, M_{k \to k'})$ is the feature of I_{k'} warped to spatially align with I_k. The position-wise weights w_{k'} and w_k have the same height and width as F'_{k'} and F_k. These weights are normalized and determine the importance of the memory feature at each location p with respect to the target frame feature (w_{k'}(p) + w_k(p) = 1). w_{k'} and w_k are adaptively calculated based on the similarity between the memory and target features. We estimate this similarity by first projecting the features into an embedding space via a few convolution layers and then computing the cosine similarity between the embedded features. Finally, at the current key frame, the weighted sum of the memory and current features is fed to the detection sub-network and passed along as the new memory.

ACDnet follows a three-frame training scheme, as depicted in Figure 2. From each training mini-batch, a frame I_i and two precedent video frames (I_k and I_mem) are selected, whose features simulate the key frame and memory features, respectively. The offset between I_i and I_k is a random number from 0 to T_{k→i}, and the offset between I_k and I_mem is fixed at T_{mem→k}. The feature maps F_mem and F_k are first extracted from I_mem and I_k, respectively. Two sets of flow fields, namely the relative motion between I_k and I_mem, and between I_i and I_k, are also estimated.
The former flow is used to propagate F_mem to F_k, simulating the memory feature aggregation of Equation 3. The fused feature is then warped with the second flow (simulating feature approximation) following Equation 1, yielding the final feature map passed to N_det. Under this training mode, only the ground truth of I_i is needed to compute the losses, which are back-propagated to update all sub-networks.

The workflows of feature approximation and aggregation are generic for video-based tasks. ACDnet further employs SSD, a one-stage detector, to fulfill the objective of high-speed action detection, potentially for embedded vision systems. In particular, the SSD300 model is chosen due to its superior speed.
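For illustration, below is a hedged sketch of the adaptive memory aggregation at key frames (Equation 3), again assuming PyTorch; the embedding width and exact layer structure are placeholders rather than the precise layers used in ACDnet.

```python
# Sketch of Eq. 3: fuse the warped memory feature with the current key-frame
# feature using position-wise weights derived from cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAggregation(nn.Module):
    def __init__(self, channels, embed_channels=256):
        super().__init__()
        # Small embedding used only to measure feature similarity.
        self.embed = nn.Sequential(
            nn.Conv2d(channels, embed_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_channels, embed_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_channels, channels, 1),
        )

    def forward(self, feat_mem_warped, feat_key):
        # feat_mem_warped: memory feature already warped onto the current key
        # frame (F'_{k'}); feat_key: freshly extracted key-frame feature (F_k).
        e_mem = F.normalize(self.embed(feat_mem_warped), dim=1)
        e_key = F.normalize(self.embed(feat_key), dim=1)
        # Cosine similarity of each feature against the current key frame.
        sim_mem = (e_mem * e_key).sum(dim=1, keepdim=True)
        sim_key = (e_key * e_key).sum(dim=1, keepdim=True)
        # Position-wise weights, normalized so that w_mem(p) + w_key(p) = 1.
        w = torch.softmax(torch.cat((sim_mem, sim_key), dim=1), dim=1)
        w_mem, w_key = w[:, :1], w[:, 1:]
        # The weighted sum is both the detection feature and the new memory.
        return w_mem * feat_mem_warped + w_key * feat_key
```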
Fig. 2. Training procedure. Each mini-batch consists of three frames (I_mem, I_k, and I_i) and the ground truth of I_i.

In SSD, a set of auxiliary convolutional layers is progressively added after the base network (e.g., VGG16 in a standard SSD) to extract features at multiple scales. This creates multiple feature maps on which the detector makes predictions for objects of various sizes. Consequently, adopting the described framework in SSD requires feature approximation and memory aggregation to be handled for features at all scales.

To enable multi-level feature approximation, we duplicate N_flow's flow prediction layer into several branches. The number of branches matches that of the feature maps, and the branches' outputs are progressively resized via average pooling according to the sizes of SSD's feature maps. Each branch then reconstructs a pair of flow field and scale map in accordance with the dimension of the corresponding SSD feature (see Figure 3). To cope with multi-level feature approximation and aggregation, Equations 1 and 3 are also generalized to take place at each feature level independently. Note that the standard SSD300 applies detection at six feature scales. Nevertheless, we only use the first five of them, as the last feature map is reduced to a 1D vector by the progressive resizing, which is not feasible for feature approximation governed by 2D spatial warping.
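Putting the pieces together, the following sketch outlines the inference schedule described above for a single feature scale; the sub-network handles (net_feat, net_flow, net_aggr, net_det) and the warp_features helper are hypothetical names used only for illustration.

```python
# High-level sketch of the ACDnet inference loop: full extraction and memory
# aggregation at sparse key frames, flow-guided approximation elsewhere.
def detect_video(frames, net_feat, net_flow, net_aggr, net_det, key_interval=10):
    detections = []
    memory_feat, key_frame = None, None
    for idx, frame in enumerate(frames):
        if idx % key_interval == 0:
            # Sparse key frame: extract features and aggregate with memory.
            feat = net_feat(frame)
            if memory_feat is not None:
                flow, scale = net_flow(key_frame, frame)
                warped_mem = warp_features(memory_feat, flow, scale)
                feat = net_aggr(warped_mem, feat)
            memory_feat, key_frame = feat, frame      # updated memory
            current_feat = feat
        else:
            # Dense non-key frame: approximate features from the key frame.
            flow, scale = net_flow(key_frame, frame)
            current_feat = warp_features(memory_feat, flow, scale)
        detections.append(net_det(current_feat))
    return detections
```

In the full model this loop runs per feature scale, with one flow/scale-map branch per SSD feature level as described above.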
4. Experimental results
Dataset. Our proposed methods are evaluated on two popular action datasets: UCF-24 and JHMDB-21. The former, released by Soomro et al. (2012), is composed of 3207 sports videos of 24 action classes. Following previous work, we use 2290 of these video clips for training. The latter, collected by Jhuang et al. (2013), consists of 928 short videos divided into three splits, with 21 action categories of daily life. Each video is trimmed and has a single action instance. We report our experimental results as the average over the three splits for this dataset.

Fig. 3. Flow estimation sub-network adapted for multi-scale feature approximation and aggregation. The depicted design corresponds to the architecture of SSD300 and FlowNet.
Network architectures. ACDnet incorporates the following sub-networks: SSD300 (with a VGG16 backbone), FlowNet (Dosovitskiy et al. (2015)), and feature embedding. Feature embedding contains five branches for measuring feature similarity at five different scales. Each embedding branch has a bottleneck design of three 1 × 1 convolution layers, whose numbers of output channels at feature level l are defined relative to the channel width of that level (reduced in the first two layers and expanded to twice the level's width in the last). FlowNet is modified to also generate five sets of flow fields and position-wise scale maps, each pair being used for warping and refining the designated features. We initialize the weights of the first two branches of the flow generation layers using FlowNet's pre-trained weights. Since the last three flow outputs are spatially much smaller than that of the original FlowNet, we randomly initialize the weights of those branches.
Training. Images are resized to 300 × 300 for training and inference. Training is conducted by stochastic gradient descent. To address the data imbalance among different actions, 15 frames spanning the whole video are evenly sampled from each training video clip of UCF-24 as the training set. Video clips of JHMDB-21 are generally short (≤ 40 frames). T_{mem→k} and T_{k→i} are set to 10 during training. These chosen values correspond to the key frame interval used during inference, which is also fixed at 10 in our experiments unless otherwise specified.

We apply different hyperparameters on the two datasets. UCF-24 is trained for 100K iterations; the learning rate is initialized to 0.0005 and reduced by a factor of 0.1 after the 80K-th and 90K-th iterations. The weights of VGG16's first two convolution blocks are frozen. For JHMDB-21, due to its smaller training and testing size, we observe that detection accuracy tends to fluctuate significantly between successive epochs. Hence, we empirically train this dataset for 20K iterations with the learning rate initialized to 0.0004 and reduced by a factor of 0.5 after the 8K-th and 16K-th iterations. During its training, the first three convolution blocks of VGG16 are frozen. In addition, all layers of FlowNet up to the flow generation layers (the five branches at the end of our modified model) are also frozen to further reduce the risk of overfitting.

All sub-networks are trained jointly (and also evaluated) on an NVIDIA Quadro P6000 GPU using a training batch size of 8. For the remaining hyperparameters and data augmentation methods, we follow the same setup as the original SSD by Liu et al. (2016). The weights of VGG16 and FlowNet are pre-trained on ImageNet and the Flying Chairs dataset, respectively.
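As an illustration of the UCF-24 optimization schedule just described, a hedged PyTorch-style sketch could look as follows; the parameter name prefixes are hypothetical, and the momentum and weight decay values are assumed from the original SSD defaults.

```python
# Hedged sketch of the UCF-24 training schedule: SGD, initial LR 0.0005,
# decayed by 0.1 after the 80K-th and 90K-th iterations, with the first two
# VGG16 convolution blocks frozen. Module names ("base_net.0", "base_net.1")
# are placeholders; momentum and weight decay follow the SSD defaults.
import torch

def build_optimizer(acdnet, frozen_prefixes=("base_net.0", "base_net.1")):
    # Freeze the first two convolution blocks of the VGG16 backbone.
    for name, param in acdnet.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False
    trainable = [p for p in acdnet.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=5e-4, momentum=0.9,
                                weight_decay=5e-4)
    # LR decay at the stated iteration milestones; scheduler.step() is then
    # called once per training iteration, not per epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[80_000, 90_000], gamma=0.1)
    return optimizer, scheduler
```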
Our proposed architecture has been evaluated in terms of accuracy, efficiency, and robustness over several network configurations. The standard frame-level mean average precision (F-mAP) and frames per second (FPS) are used as evaluation metrics. Specifically, FPS is measured on the complete detection pipeline, including data loading and model inference with a batch size of 1. The Intersection-over-Union threshold is set to 0.5 throughout all experiments. For brevity, we refer to feature approximation and memory feature aggregation as FA and MA, respectively, when presenting their results.

Accuracy.
F-mAP results of different configurations are reported in Table 1. On both datasets, we observe a decrease in accuracy when only feature approximation is included. However, the accuracy drop is compensated by the addition of memory aggregation, which exceeds the accuracy of the stand-alone SSD. Figure 4 shows some examples of how the memory aggregation module benefits detection. Overall, we remark that aggregating multiple frame features over time, even in a sparse manner, improves the model's ability to discriminate among different actions more confidently.

We also examine the effect of the separate branches of position-wise scale maps designed for refining visual features. Our results indicate that such refinement mildly improves detection accuracy. The scale maps serve as implicit attention maps which reinforce the feature responses associated with moving actors (elaborated in Figure 5).

Even though similar result patterns can be seen on both datasets, the benefit of memory aggregation appears less prominent on JHMDB-21. This could result from the fact that each video clip in JHMDB-21 is much shorter (40 frames or less). As MA is performed sparsely at every 10th frame, its impact is limited to 2-3 aggregations per clip. Furthermore, we observe that motions in several JHMDB-21 clips are relatively small. In these clips, key frames far apart still appear fairly identical, limiting the additional visual cues that can be propagated.

Table 1. F-mAP results for different configurations (the stand-alone SSD versus SSD combined with the FA, Scale map, and MA modules).
UCF-24:    67.32 | 65.84 | 67.23 | 68.06
JHMDB-21:  47.90 | 46.65 | 46.69 | 49.37

Fig. 4. Examples where ACDnet (FA, MA) improves the baseline SSD. Green/red boxes correspond to correct/incorrect detections, respectively.

Fig. 5. Position-wise scale maps produced by our modified FlowNet. The scale map (bottom row) only reinforces the activation (top row) associated with the actor by up-scaling, without altering activation in other feature regions.

Efficiency is evaluated on UCF-24 by simultaneously inspecting the accuracy, run time, and number of parameters of various configurations. Here, we assume the use of scale map refinement whenever applicable. As shown in Table 2, ACDnet (SSD, FA, MA) outperforms the stand-alone SSD in both speed and accuracy. This suggests that it is relevant to handle inter-frame redundancy, and that long-range memory fusion is effective for collecting more discriminating features. Regarding the number of required parameters, the increase of ACDnet (SSD, FA) over the stand-alone SSD is due to the addition of FlowNet, which could be replaced by much lighter architectures in the future. Likewise, the increase brought by the MA module corresponds to the extra embedding layers for measuring feature similarity at various scales. In terms of run time, the speed drop with MA is incurred by the additional operations at key frames (except the first one), where flow estimation, feature extraction, similarity measurement, and aggregation all take place.

To examine how our generic architecture performs on a different detection framework, we conduct the same experiments while incorporating ACDnet with R-FCN, a state-of-the-art two-stage detector. The run-time improvement brought by feature approximation is more significant with R-FCN, as it uses a much deeper backbone for feature extraction. The number of additional parameters needed to carry out memory aggregation is also smaller for R-FCN, as it performs prediction on a single-scale feature (requiring only one branch for the embedding and flow sub-networks). Overall, when taking run time, memory consumption, and obtained accuracy into account together, our results still strongly favor the SSD-based ACDnet.

Table 2. Performance of different configurations on UCF-24.

Fig. 6. F-mAP under varied key frame intervals.
Robustness.
Concerning the robustness of our models trained with fixed durations T_{mem→k} and T_{k→i} (both set to 10), we evaluate their performance under various key frame intervals (k) during inference. Figure 6 displays the F-mAP results on both datasets, with k ranging from 2 to 20. Both models (with and without MA) show an overall steady drop in accuracy on the two datasets as k increases. This is reasonable, as the ability of the flow fields to correctly encode pixel correspondence diminishes under large motions. However, even when k is large, ACDnet with MA still retains decent accuracy, outperforming or remaining comparable with the best cases of the other configurations.

Run time is also inspected under the same setting, as shown in Figure 7. It can be observed that ACDnet (FA, MA) exceeds the speed of SSD starting around k = 8, while the FA-only model is consistently faster. Larger key frame intervals should intuitively lead to further speed gains, as a higher ratio of features is approximated. Interestingly, we observe that this pattern holds neatly only when k ≤ 10. Beyond that, the run time of the examined models begins to saturate. This phenomenon is associated with two factors. On the one hand, as the key frame interval increases, the further change in the ratio between approximated and extracted features becomes less significant. On the other hand, larger key frame intervals introduce more motion, which can compromise the quality of the approximated features. This results in an increase of low-confidence predictions, which take longer for SSD's non-maximum suppression to filter.
Fig. 7. FPS under varied key frame intervals.
We compare the complete ACDnet (with FA and MA) against state-of-the-art methods in Table 3. Since our proposed framework targets lightweight action inference for realistic deployment rather than solely obtaining superior accuracy, only top-performing works that take into account both accuracy and run time are considered for a fair comparison. With this in mind, recent research such as the works of Wei et al. (2019) and Gu et al. (2018) demonstrates impressive accuracy but is excluded from our comparison, as it utilizes heavier configurations and omits speed analysis. Alongside performances, a comprehensive summary of each method's backbone is also reported for clearer comparison. It should be noted that methods such as ACT, STEP, and MOC perform clip-based detection. In other words, they take clips of multiple RGB frames with the support of stacked flow images at once (e.g., five flow images for each RGB frame), predicting action tubelets spanning these RGB frames. In contrast, methods such as YOWO gather supportive contextual cues from multiple frames to augment the target one. These particular attributes are summarized in Table 3, column 4.

As shown in Table 3, ACDnet outperforms the others in terms of run time. This is ascribed to the feature approximation module and our overall less complex architecture. The other methods adopt either two-stream or 3D CNN architectures to capture complementary spatial and temporal features, which raises computation. In addition, the preparation of accurate flow using Brox (Brox et al. (2004)) or FlowNet2 (Ilg et al. (2017)) is particularly expensive; as a result, all methods employing a second flow stream do not take optical flow acquisition into account when measuring run time (except ROAD, which uses a fast flow estimator by Kroeger et al. (2016)). In contrast, flow generation in ACDnet is fast and can be carried out in an online setting, as it does not aim to encode fine-grained motion features.

In terms of accuracy, ACDnet retains competitive performance on UCF-24. On the other hand, its performance on JHMDB-21 is less impressive compared to the other methods. As opposed to UCF-24, whose classes of sports activities are visually more distinctive, we observe that JHMDB-21 contains more classes that share similar visual context (for example,
Sit vs. Stand, and Run vs. Walk). Figure 8 demonstrates a few falsely detected examples by our model which result in the lower F-mAP on JHMDB-21. Such ambiguous visual context is challenging even for humans to infer the correct action confidently without viewing several consecutive frames at once. As shown in column 4 of Table 3, ACDnet applies detection on far fewer frames than the other methods, which limits its ability to model detailed variations of visual cues over time. In addition, JHMDB-21 consists of short clips for which sparse memory aggregation can only take place a few times. These factors result in ACDnet's less satisfactory accuracy on JHMDB-21. This visual ambiguity could generally be mitigated by examining more frames at once, as demonstrated by all the clip-based methods.

Fig. 8. Examples of false detection in JHMDB-21. (a) Correct action: Jump. (b) Correct action: Sit. (c) Correct action: Stand; ACDnet incorrectly predicts two actions (Stand and Run).

Similarly, 3D-CNN-based methods such as YOWO also prove effective at learning spatio-temporal features when taking 16 consecutive frames. However, such an approach inevitably raises the computation time, not only for model inference but also for data loading, which is excluded from their reported speed performance. Furthermore, ACDnet achieves comparable accuracy when YOWO employs lighter 3D CNN variants, implying the necessity of deeper models to effectively reason temporal context. In conclusion, our experimental results verify ACDnet's competitive capability to efficiently infer actions with strong visual cues, but its sparse spatio-temporal modeling scheme does not capture temporal cues as effectively as the more expensive two-stream and 3D CNN approaches. On the other hand, ACDnet is compact and achieves inference speed far beyond the real-time requirement. This not only permits more seamless deployment on resource-constrained devices, but also leaves room to further adopt a clip-based framework or a lightweight 3D CNN to improve its accuracy.

Table 3. State-of-the-art comparison. *For any key frame, ACDnet (FA, MA) fuses the accumulated key frame feature from the past with the current one (considered 2 RGB implicitly). For any non-key frame, its feature is approximated based on the preceding key frame feature (considered 1 RGB).
Method | frames : det. | UCF-24 | JHMDB-21 | FPS
YOWO (ShuffleNetV2 variant) | 16 RGB : 1 | 71.4 | 55.3 | –
ACDnet (SSD, FA, MA) | 2 RGB* : 1 | 70.92 | 49.53 | 75
5. Conclusions and future work
In this paper, we present ACDnet, a compact action detection network with real-time capability. By exploiting the temporal coherence among video frames, it utilizes feature approximation on frames with similar visual appearance, which significantly improves detection efficiency. Additionally, a memory aggregation module is introduced to fuse multi-frame features, enhancing detection stability and accuracy. The combination of the two modules and the SSD detector implicitly reasons temporal context in an inexpensive manner. ACDnet demonstrates real-time detection (up to 75 FPS) on public benchmarks while retaining decent accuracy against other best performers with far less complex settings, making it more appealing for edge-device deployment in practical applications. Our future work includes further investigation of cost-effective architectures for spatio-temporal modeling and performing temporal localization. For a fully integrated and resource-efficient vision system, lightweight alternatives to the current sub-networks will be explored, and we will customize candidate solutions for embedding them on edge devices such as the NVIDIA Xavier GPU.

Acknowledgments
This work was supported by the H2020 ITN project ACHIEVE (H2020-MSCA-ITN-2017: agreement no. 765866).
References
Ali, A., Taylor, G.W., 2018. Real-time end-to-end action detection with two-stream networks, in: IEEE CRV, pp. 31–38.
Brox, T., Bruhn, A., Papenberg, N., Weickert, J., 2004. High accuracy optical flow estimation based on a theory for warping, in: ECCV, Springer. pp. 25–36.
Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model and the kinetics dataset, in: IEEE CVPR, pp. 6299–6308.
Dai, J., Li, Y., He, K., Sun, J., 2016. R-fcn: Object detection via region-based fully convolutional networks, in: NIPS, pp. 379–387.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T., 2015. Flownet: Learning optical flow with convolutional networks, in: IEEE ICCV, pp. 2758–2766.
Feichtenhofer, C., Pinz, A., Zisserman, A., 2016. Convolutional two-stream network fusion for video action recognition, in: IEEE CVPR, pp. 1933–1941.
Girdhar, R., Carreira, J., Doersch, C., Zisserman, A., 2019. Video action transformer network, in: IEEE CVPR, pp. 244–253.
Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al., 2018. Ava: A video dataset of spatio-temporally localized atomic visual actions, in: IEEE CVPR, pp. 6047–6056.
Han, W., Khorrami, P., Paine, T.L., Ramachandran, P., Babaeizadeh, M., Shi, H., Li, J., Yan, S., Huang, T.S., 2016. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465.
Hetang, C., Qin, H., Liu, S., Yan, J., 2017. Impression network for video object detection. arXiv preprint arXiv:1712.05896.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T., 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks, in: IEEE CVPR, pp. 2462–2470.
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J., 2013. Towards understanding action recognition, in: IEEE ICCV, pp. 3192–3199.
Jiang, B., Wang, M., Gan, W., Wu, W., Yan, J., 2019. Stm: Spatiotemporal and motion encoding for action recognition, in: IEEE ICCV, pp. 2000–2009.
Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C., 2017. Action tubelet detector for spatio-temporal action localization, in: IEEE ICCV, pp. 4405–4413.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X., et al., 2017. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE TCSVT 28, 2896–2907.
Köpüklü, O., Wei, X., Rigoll, G., 2019. You only watch once: A unified cnn architecture for real-time spatiotemporal action localization. arXiv preprint arXiv:1911.06644.
Kroeger, T., Timofte, R., Dai, D., Van Gool, L., 2016. Fast optical flow using dense inverse search, in: ECCV, Springer. pp. 471–488.
Li, Y., Miao, Q., Tian, K., Fan, Y., Xu, X., Ma, Z., Song, J., 2019. Large-scale gesture recognition with a fusion of rgb-d data based on optical flow and the c3d model. PRL 119, 187–194.
Li, Y., Wang, Z., Wang, L., Wu, G., 2020. Actions as moving points. arXiv preprint arXiv:2001.04608.
Lin, J., Gan, C., Han, S., 2019. Tsm: Temporal shift module for efficient video understanding, in: IEEE ICCV, pp. 7083–7093.
Liu, M., Zhu, M., 2018. Mobile video object detection with temporally-aware feature maps, in: IEEE CVPR, pp. 5686–5695.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. Ssd: Single shot multibox detector, in: ECCV, Springer. pp. 21–37.
Peng, X., Schmid, C., 2016. Multi-region two-stream r-cnn for action detection, in: ECCV, Springer. pp. 744–759.
Qiu, Z., Yao, T., Mei, T., 2017. Learning spatio-temporal representation with pseudo-3d residual networks, in: IEEE ICCV, pp. 5533–5541.
Redmon, J., Farhadi, A., 2017. Yolo9000: better, faster, stronger, in: IEEE CVPR, pp. 7263–7271.
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks, in: NIPS, pp. 91–99.
Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos, in: NIPS, pp. 568–576.
Singh, G., Saha, S., Sapienza, M., Torr, P.H., Cuzzolin, F., 2017. Online real-time multiple spatiotemporal action localisation and prediction, in: IEEE ICCV, pp. 3637–3646.
Soomro, K., Zamir, A.R., Shah, M., 2012. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Sun, C., Shrivastava, A., Vondrick, C., Murphy, K., Sukthankar, R., Schmid, C., 2018a. Actor-centric relation network, in: ECCV, pp. 318–334.
Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W., 2018b. Optical flow guided feature: A fast and robust motion representation for video action recognition, in: IEEE CVPR, pp. 1390–1399.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition, in: IEEE CVPR, pp. 6450–6459.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L., 2016. Temporal segment networks: Towards good practices for deep action recognition, in: ECCV, Springer. pp. 20–36.
Wei, J., Wang, H., Yi, Y., Li, Q., Huang, D., 2019. P3d-ctn: Pseudo-3d convolutional tube network for spatio-temporal action detection in videos, in: IEEE ICIP, pp. 300–304.
Yang, X., Yang, X., Liu, M.Y., Xiao, F., Davis, L.S., Kautz, J., 2019. Step: Spatio-temporal progressive learning for video action detection, in: IEEE CVPR, pp. 264–272.
Yao, G., Lei, T., Zhong, J., 2019. A review of convolutional-neural-network-based action recognition. PRL 118, 14–22.
Zhao, J., Snoek, C.G., 2019. Dance with flow: Two-in-one stream action detection, in: IEEE CVPR, pp. 9935–9944.
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y., 2017a. Flow-guided feature aggregation for video object detection, in: IEEE ICCV, pp. 408–417.
Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y., 2017b. Deep feature flow for video recognition, in: IEEE CVPR.