Single Run Action Detector over Video Stream -- A Privacy Preserving Approach
Anbumalar Saravanan (University of North Carolina at Charlotte), Justin Sanchez (University of North Carolina at Charlotte), Hassan Ghasemzadeh (School of EECS, Washington State University), Aurelia Macabasco-O'Connell (School of Nursing, Azusa Pacific University), Hamed Tabkhi (University of North Carolina at Charlotte)
{asaravan, jsanch19, htabkhiv}@uncc.edu, [email protected], [email protected]

Abstract
This paper takes initial strides at designing and evaluating a vision-based system for privacy-ensured activity monitoring. The proposed technology utilizes Artificial Intelligence (AI)-empowered proactive systems offering continuous monitoring, behavioral analysis, and modeling of human activities. To this end, this paper presents the Single Run Action Detector (S-RAD), a real-time privacy-preserving action detector that performs end-to-end action localization and classification. It is based on Faster-RCNN combined with temporal shift modeling and segment-based sampling to capture human actions. Results on the UCF-Sports and UR Fall datasets present accuracy comparable to State-of-the-Art approaches with significantly lower model size and computation demand, and the ability for real-time execution on an embedded edge device (e.g., Nvidia Jetson Xavier).
Introduction

In recent years, deep learning has achieved success in fields such as computer vision and natural language processing. Compared to traditional machine learning methods such as support vector machines and random forests, deep learning has a strong ability to learn from data and can make better use of datasets for feature extraction. Because of this practicality, deep learning has become increasingly popular in research.

Deep learning models usually adopt hierarchical structures to connect their layers. The output of a lower layer can be regarded as the input of a higher layer, connected through linear or non-linear functions. These models can transform low-level features of the input data into high-level abstract features. Because of this characteristic, deep learning models are stronger than shallow machine learning models in feature representation. The performance of traditional machine learning methods usually relies on user experience and handcrafted features, while deep learning approaches rely on the data.

Recent advances in video analytics and deep learning algorithms such as Convolutional Neural Networks provide the opportunity for real-time detection and analysis of human behaviors like walking, running, or sitting down, which are part of Activities of Daily Living (ADL) [Neff et al., 2020]. Cameras provide very rich information about persons and environments, and their presence is becoming more important in everyday environments like airports, train and bus stations, malls, elderly care facilities, and even streets. Therefore, reliable vision-based action detection systems are required for various applications such as healthcare assistance, crime detection, and sports monitoring. In this paper we explore two different domains (sports and healthcare) to demonstrate the comprehensive nature of our proposed action detection algorithm.

Approaches like [Cameiro et al., 2019; Alaoui et al., 2019; Duarte et al., 2018; Hou et al., 2017] use large CNN models that impose huge computation demands and thus limit their application in real-time constrained systems, in particular on embedded edge devices. Additionally, these methods have not been designed to fulfill the requirements of pervasive video systems, including privacy preservation and real-time responsiveness. Other work in this area is based on wearable sensors, using tri-axial accelerometers, ambient/fusion sensors, vibration, or audio and video to capture human posture and body shape changes. However, wearable sensors require relatively strict positioning and thus bring inconvenience, especially in healthcare settings where elderly users may simply forget to wear them.

Motivated by the need for and importance of image-based action detection systems, we introduce a novel Single Run Action Detector (S-RAD) for activity monitoring. S-RAD provides end-to-end action detection without the use of computationally heavy methods, in a single-shot manner, with the ability to run in real time on an embedded edge device. S-RAD detects and localizes complex human actions with a Faster-RCNN-like architecture [Ren et al., 2015] combined with temporal shift blocks (based on [Lin et al., 2018]) to capture the low-level and high-level video temporal context. S-RAD is a privacy-preserving approach and inherently protects Personally Identifiable Information (PII). Real-time execution on the edge avoids unnecessary transfer of video and PII to a cloud or remote computing server.
Overall, our contributions are as follows: (1) We introduce S-RAD, a single-shot action detector that localizes humans and classifies their actions. (2) We demonstrate that we can achieve accuracy comparable to State-of-the-Art approaches (on the UCF-Sports and UR Fall datasets) at much lower computation cost, and we evaluate our approach on two different domains (healthcare and sports) to prove its robustness and applicability to multiple action detection domains. (3) We additionally demonstrate the possibility of extending our network to real-time scenarios on an edge device. Code will be made publicly available on GitHub after reviews.

Related Work

Most prior research focuses on using wearable and mobile devices (e.g., smartphones, smartwatches) for activity recognition. In designing efficient activity recognition systems, researchers have extensively studied various wearable computing research questions. These research efforts have revolved around optimal placement of the wearable sensors [Atallah et al., 2011], automatic detection of the on-body location of the sensor [Saeedi et al., 2014], minimization of the sensing energy consumption [Pagan et al., 2018], and optimization of the power consumption [Mirzadeh and Ghasemzadeh, 2020]. A limitation of activity monitoring using wearable sensors and mobile devices is that these technologies are battery-powered and therefore need to be regularly charged. Failure to charge the battery results in discontinuity of the activity recognition, which in turn may lead to important behavioral events remaining undetected.
Action recognition is a long-standing research problem that has been studied for decades. Existing State-of-the-Art methods mostly focus on modeling the temporal dependencies across successive video frames [Simonyan and Zisserman, 2014; Wang et al., 2016; Tran et al., 2014]. For instance, [Wang et al., 2016] directly averaged the motion cues depicted in different temporal segments in order to capture the irregular nature of temporal information. [Simonyan and Zisserman, 2014] proposed a two-stream network, which takes RGB frames and optical flow as input respectively and fuses the detections from the two streams as the final output; this was done at several granularities of abstraction and achieved great performance. Beyond multi-stream methods, works like [Tran et al., 2014; Lu et al., 2019] explored 3D ConvNets on video streams for joint spatio-temporal feature learning. In this way, they avoid explicitly calculating optical flow, keypoints, or saliency maps. However, all of the above approaches are too large to fit on a real-time edge device. On the other hand, [Alaoui et al., 2019] uses features calculated from variations in human keypoints to classify falling versus not-falling actions, and [Cameiro et al., 2019] uses a VGG16-based multi-stream network (optical flow, RGB, pose estimation) for human action classification. These approaches only perform classification of a single human action at the scene level and will not perform well if multiple humans are present in an image, which is essential for healthcare and other public-space monitoring systems. Our proposed approach performs human detection and action classification together in a single-shot manner, where the algorithm first localizes the humans in an image and then classifies each person's action.
Spatio-temporal human action detection is a challenging computer vision problem, which involves detecting human actions in a video as well as localizing these actions both spatially and temporally. A few works on spatio-temporal action detection, such as [Kalogeiton et al., 2017], use object detectors like SSD [Liu et al., 2015] to generate spatio-temporal tubes by deploying a high-level linking algorithm on frame-level detections. Inspired by RCNN approaches, [Peng and Schmid, 2016] used Faster-RCNN [Ren et al., 2015] to detect the humans in an image by capturing the action motion cues with the help of optical flow, and classified the final human actions based on the actionness score. [Gkioxari and Malik, 2015] extracted proposals using the selective search method on RGB frames, applied the original R-CNN on per-frame RGB and optical flow data for frame-level action detections, and finally linked those detections using the Viterbi algorithm to generate action tubes. On the other hand, [Hou et al., 2017] uses a 3D CNN to generate spatio-temporal tubes with Tube-of-Interest pooling and showed good performance on action-related datasets. However, all these methods incur high processing times and computation costs, due to optical flow generation in the two-stream networks, 3D kernels in the 3D CNN works, and keypoint generation in the human-pose-based methods. As such, the aforementioned methods cannot be applied in real-time monitoring systems.
Approach
We introduce S-RAD, an agile and real-time activity monitoring system. Our approach unifies spatio-temporal feature extraction and localization into a single network, allowing it to be deployed on an edge device. This "on-the-edge" deployment eliminates the need for sending sensitive human data to privacy-invalidating cloud servers, similar to [Neff et al., 2020]. Instead, our approach can delete all video data after it is processed and store only the high-level activity analytics. Without stored images, S-RAD can focus solely on differentiating between human actions rather than identifying or describing the human.

In order to achieve this privacy-preserving edge execution, it is important to have an algorithm able to perform in a resource-constrained edge environment. Traditionally, such constraints resulted in either accuracy reduction or increased latency. The overview of S-RAD is shown in Figure 1. S-RAD takes an input sequence of N frames f_1, f_2, ..., f_N and outputs the detected bounding box and confidence score per class for each of the proposals.

Figure 1: Overview of the activity detector. Given a sequence of frames, we extract channel-shifted convolutional features from the base feature extractor to derive the activity proposals in the action proposal network. We then ROI-align the activity proposals to predict their scores and regress their coordinates.

The model consists of a base feature extractor integrated with temporal shift blocks to capture low-level spatio-temporal features. The base feature extractor is made up of the first 40 layers of the original ResNet-50 [He et al., 2015] backbone. The base feature maps are processed by the Region Proposal Network (RPN) using a sliding-window approach with handpicked anchors, generating action proposals for each frame. An RPN is a fully convolutional network that simultaneously predicts action bounds and actionness scores at each position. The RPN is trained end-to-end to localize and detect valid region action proposals (the foreground) against the background. This sliding-window proposal generation is the source of its accuracy, as opposed to SSD's [Liu et al., 2015] rigid grid-based proposal generation. Following the first stage, the original spatio-temporal base features, in conjunction with the proposals, are passed into the Region of Interest Align (ROI-Align) layer, which aligns the varying-sized action proposals into fixed 7x7 spatial-sized action proposals. The second stage of the action detector further classifies each valid action proposal into the action classes in that particular frame. The final classification layer outputs C+1 scores for each action proposal, one per action class plus one for the background. The regression layer outputs 4 x K values, where K is the number of action proposals generated in each frame.
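To make the hand-off between the two stages concrete, the short sketch below pools varying-sized per-frame proposals to a fixed 7x7 grid with torchvision's ROI-Align. The tensor shapes, the example box, and the 1/16 feature stride are illustrative assumptions rather than the exact S-RAD configuration:

```python
import torch
from torchvision.ops import roi_align

# Channel-shifted base features for an 8-frame clip (shapes assumed).
feats = torch.randn(8, 1024, 19, 25)             # (frames, C, H, W)
# One hypothetical proposal per frame, in image coordinates (x1, y1, x2, y2).
proposals = [torch.tensor([[48.0, 32.0, 210.0, 280.0]]) for _ in range(8)]

# Align every proposal to a fixed 7x7 spatial size for the second stage.
pooled = roi_align(feats, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)                              # torch.Size([8, 1024, 7, 7])
```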
Temporal shift block

Temporal shift modules (TSM) [Lin et al., 2018] are highly hardware efficient. Temporal shift blocks are inserted into the bottleneck layers of the ResNet-50 [He et al., 2015] based feature extractor to sustain the spatial information through the identity mapping, along with the temporal information through the shifted features.
Figure 2: Temporal shift block.

As shown in Figure 2, each shift block receives the C channels from the previous layer. We shift 1/8th of the channels from the past frame to the current frame, and 1/8th of the channels from the current frame to the future frame, while the remaining channels stay unshifted. The new features x̂ (channels are referred to as features) carry the information of both the past and future frames after the "shift" operation. The features are then convolved and mixed into new spatio-temporal features, and the shift block coupled to the next layer performs the same operation. Each shift block increases the temporal receptive field by 2 neighboring frames, up to N frames. For our work we choose N = 8, since features come in magnitudes of 8 in the ResNet-50 architecture [He et al., 2015].

S-RAD goes beyond action classification to action detection. This is valuable for communal areas such as mess halls, and for interactions with other humans and with objects. We chose Faster-RCNN [Ren et al., 2015] as our detection baseline due to its fine-grained detection capabilities when compared to SSD [Liu et al., 2015]. This fine-grained detection is especially applicable to the healthcare domain when dealing with wandering patients and fine-grained abnormal behaviors. Despite the complexity of such tasks, our utilization of TSM [Lin et al., 2018] enables the extraction of the necessary spatio-temporal features for human action localization and individual action classification, in a streaming real-time manner while maintaining privacy.
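The shift operation described above reduces to a few tensor copies. The following is a minimal sketch following the public TSM formulation, with the surrounding convolutions and the residual/identity path of the bottleneck omitted:

```python
import torch

def temporal_shift(x, n_frames=8, fold_div=8):
    """Shift 1/8 of the channels one frame in each temporal direction;
    the remaining channels are left unshifted.
    x: (n_frames, C, H, W) features for one clip."""
    t, c, h, w = x.size()
    assert t == n_frames
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                    # frame t receives channels from t+1 (future)
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]    # frame t receives channels from t-1 (past)
    out[:, 2 * fold:] = x[:, 2 * fold:]               # unshifted channels keep spatial identity
    return out
```

Because shifting only moves data rather than multiplying it, the operation adds essentially no FLOPs, which is the source of the hardware efficiency noted above.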
RPN Loss:

For training RPNs, we assign a binary action class label (of being an action or not, i.e., foreground vs. background) to each anchor. We assign a positive action class label to two kinds of anchors: (i) the anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Anchors whose IoU is lower than 0.3 for all ground-truth boxes are assigned a negative label. The RPN loss is defined as:

L_{rpn}(\{p_i\}, \{bb_i\}) = \frac{1}{K}\sum_{i=1}^{K} L_{cls}(p_i, p_i^*) + \frac{1}{K}\sum_{i=1}^{K} p_i^* \, L_{reg}(bb_i, bb_i^*) \qquad (1)

Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i belonging to an action class. The ground-truth label p_i^* is 1 if the anchor is positive, and 0 if the anchor is negative. The vector representing the 4 coordinates of the predicted bounding box is bb_i, and bb_i^* is the ground-truth box associated with a positive anchor. The term p_i^* L_{reg} dictates that the smooth L1 regression loss is activated only for positive anchors (p_i^* = 1) and is disabled otherwise (p_i^* = 0). L_{cls} is the log loss (cross-entropy) over two classes (action vs. no action) and is averaged over the K frames.
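A sketch of this labeling rule using torchvision's IoU helper; the 0.7/0.3 thresholds follow the usual Faster-RCNN convention, and the tensor layout is an assumption:

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Assign 1 (action), 0 (background), or -1 (ignored) to each anchor.
    anchors: (A, 4), gt_boxes: (G, 4), both as (x1, y1, x2, y2)."""
    iou = box_iou(anchors, gt_boxes)              # (A, G) overlap matrix
    best_iou, _ = iou.max(dim=1)                  # best overlap per anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[best_iou < neg_thr] = 0                # clearly background
    labels[best_iou >= pos_thr] = 1               # rule (ii): high overlap
    labels[iou.argmax(dim=0)] = 1                 # rule (i): best anchor per ground truth
    return labels
```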
RCNN Loss:

The second stage of the detector assigns the action class label to the regions of interest (the foreground proposals) from the RPN training. It involves a classification loss and a regression loss. The classification layer detects the correct action class label for the proposals from the ROI-Align layer, and the regression layer regresses the detected box against the ground truth. The RCNN loss is defined as:

L_{rcnn}(\{p_i\}, \{bb_i\}) = \frac{1}{K}\sum_{i=1}^{K} L_{cls}(p_i, p_i^*) + \frac{1}{K}\sum_{i=1}^{K} L_{reg}(bb_i, bb_i^*) \qquad (2)

where i is the index of proposals (regions of interest) with spatial dimension 7x7, and p_i is the predicted probability of the action class label, with p_i^* being the ground-truth class label. The vector representing the 4 coordinates of the predicted bounding box is bb_i, and bb_i^* is that of the ground-truth box. L_{cls} is the log loss (cross-entropy) over the action classes, L_{reg} is the smooth L1 regression loss, and both are averaged over the K frames. In training mode we set the network to output 256 proposals, and in inference mode the network outputs 300 proposals.
Total training loss:

The total loss is defined as the sum of the RPN and RCNN losses:
L_{total} = L_{rpn}(\{p_i\}, \{bb_i\}) + L_{rcnn}(\{p_i\}, \{bb_i\}) \qquad (3)
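As a shape-level illustration of Eqs. (1)-(3), the sketch below computes a per-stage loss averaged over the K frames; the tensor layouts and the handling of ignored anchors are assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def stage_loss(cls_logits, labels, box_preds, box_targets, gate_reg):
    """Per-stage loss of Eq. (1)/(2): cross-entropy plus smooth-L1,
    averaged over the K frames. cls_logits: (K, A, C); labels: (K, A);
    box_preds/box_targets: (K, A, 4). gate_reg=True applies the p*_i
    gating of the RPN regression term (positives only)."""
    k = cls_logits.size(0)
    total = cls_logits.new_zeros(())
    for f in range(k):
        valid = labels[f] >= 0                       # drop ignored anchors
        l_cls = F.cross_entropy(cls_logits[f][valid], labels[f][valid])
        pos = labels[f] > 0
        if gate_reg and not pos.any():
            l_reg = cls_logits.new_zeros(())         # no positives: term disabled
        else:
            sel = pos if gate_reg else valid
            l_reg = F.smooth_l1_loss(box_preds[f][sel], box_targets[f][sel])
        total = total + l_cls + l_reg
    return total / k

# Total training loss of Eq. (3) is simply the sum of the two stages:
# total_loss = stage_loss(rpn_logits, rpn_labels, rpn_boxes, rpn_gt, True) + \
#              stage_loss(rcnn_logits, rcnn_labels, rcnn_boxes, rcnn_gt, False)
```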
Setup

We use ResNet-50 [He et al., 2015] as the backbone of our architecture because of its network depth and residual connections, which enable feature reuse and propagation. The UCF-Sports [Soomro and Zamir, 2014] and UR Fall [Kwolek and Kepski, 2014] datasets are small and prone to overfitting, so we fine-tuned our network from Kinetics [Kay et al., 2017] pre-trained weights and froze the batch normalization layers. The training parameters for the UCF-Sports dataset are 300 training epochs, with an initial learning rate of 0.03 decayed by a factor of 0.1 every 60 epochs. We utilized gradient accumulation with a batch size of 4 and an accumulation step of 3 to fit a total batch of 12 on one V100 GPU. The training parameters for the UR Fall dataset are 80 training epochs, with an initial learning rate of 0.02 decayed by a factor of 0.1 every 20 epochs. We use the uniform temporal sampling strategy of [Wang et al., 2016] to sample 8 frames from the video, and resize the input resolution to 300x400 for the State-of-the-Art comparison. We used datasets from two different domains (sports and healthcare) to show the generic capability of our algorithm.
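The uniform temporal sampling above can be sketched as follows (TSN-style segment sampling; the train/test offset policy shown here is an assumption):

```python
import numpy as np

def sample_segment_frames(num_frames, n_segments=8, train=True):
    """Return n_segments frame indices, one per equal-length segment:
    a random offset inside each segment during training, the segment
    center at test time."""
    seg_len = num_frames / n_segments
    if train:
        offsets = np.random.uniform(0.0, seg_len, size=n_segments)
    else:
        offsets = np.full(n_segments, seg_len / 2.0)
    idx = (np.arange(n_segments) * seg_len + offsets).astype(int)
    return np.clip(idx, 0, num_frames - 1)
```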
Evaluation on UCF-Sports

The UCF-Sports dataset [Soomro and Zamir, 2014] consists of 150 videos from 10 action classes. All videos have spatio-temporal annotations in the form of frame-level bounding boxes, and we follow the same training/testing split used by [Gkioxari and Malik, 2015]. On average, there are 103 videos in the training dataset and 47 videos in the testing dataset. Videos are truncated to the action, and bounding box annotations are provided for all frames. To quantify our results, we report the mean Average Precision (mAP) at the frame level (frame mAP). Frame-level metrics allow us to compare the quality of the detections independently. We use the precision-recall AUC (area under the curve) to calculate the average precision per class, and we compute the mean of the per-class average precisions to see how well our algorithm differentiates between action classes (a minimal sketch of this AP computation follows Table 2 below). We followed the same procedure as in the PASCAL VOC detection challenge [Everingham et al., 2010] to enable an apples-to-apples comparison with the State-of-the-Art approaches in the detection task.

We first evaluate S-RAD on the widely used UCF-Sports dataset. Table 1 reports the frame-level average precision per class at an intersection-over-union threshold of 0.5. Our approach achieves a mean AP of 85.04%. While obtaining excellent performance on most of the classes, walking is the only action for which the framework fails to detect the humans (40.71% frame-AP). This is possibly due to several factors: first, the test videos for "walking" contain multiple actors in close proximity, which results in false detections due to occlusions; additionally, walking is a very slow action with fine-grained features and potentially lacks enough temporal displacement across 8 frames to be picked up by our detector, due to the sparse temporal sampling strategy. Ultimately, our approach is off by only 2% when compared to State-of-the-Art approaches that utilize multi-modal, 3-dimensional, or complex proposal architecture solutions. The State-of-the-Art comparison in terms of mean average precision (mAP) is summarized in Table 2.
Table 1: State-of-the-Art per-class frame AP comparison on UCF-Sports

Action Class    [Gkioxari and Malik]  [Weinzaepfel et al.]  [Peng and Schmid]  [Hou et al.]  S-RAD
Diving          75.79                 60.71                 96.12              84.37         -
Golf            69.29                 77.54                 80.46              90.79         87.20
Kicking         54.60                 65.26                 73.48              86.48         76.00
Lifting         99.09                 100.00                99.17              99.76         -
Riding          89.59                 99.53                 97.56              100.0         -
Run             54.89                 52.60                 82.37              83.65         -
Skate Boarding  29.80                 47.14                 57.43              68.71         67.93
Swing1          88.70                 88.87                 83.64              65.75         -
Swing2          74.50                 62.85                 98.50              99.71         -
Walk            44.70                 64.43                 75.98              87.79         40.71
Table 2: Overall frame mAP at IoU 0.5 threshold comparison on the UCF-Sports Action dataset

Approach                      mAP
[Gkioxari and Malik, 2015]    68.09
[Weinzaepfel et al., 2015]    71.89
[Peng and Schmid, 2016]       84.51
[Hou et al., 2017]            86.7
[Kalogeiton et al., 2017]     87.7
[Duarte et al., 2018]         83.9
S-RAD                         85.04
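As referenced above, the per-class frame AP reported in the tables is the area under the precision-recall curve of the confidence-ranked detections. A minimal sketch follows (the PASCAL VOC protocol uses a slightly different interpolation of the curve):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """PR-AUC for one action class. scores: detection confidences;
    is_tp: 1 where a detection matched a ground-truth box at IoU >= 0.5
    with the correct class, else 0; num_gt: number of ground-truth boxes."""
    order = np.argsort(-scores)                  # rank by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    return np.trapz(precision, recall)           # frame mAP = mean over classes
```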
The precision-recall AUC plotted in Figure 3 shows the capability of our algorithm to separate the different classes.
Figure 3: Precision-Recall curve per Action class in UCF-Sports
We also provide the confusion matrix in Figure 4 to better relate the detections to the original ground truth. The confusion matrix is calculated considering both the detection and classification tasks: the cells on the diagonal are the true positives, whose IoU with the ground-truth box is above 0.5 and whose predicted action class matches the ground-truth label, while off-diagonal cells count detections assigned to the wrong action class.

Figure 4: Confusion matrix of S-RAD on UCF-Sports
Evaluation on UR Fall

We also evaluated our framework on a healthcare dataset, the UR Fall dataset [Kwolek and Kepski, 2014], which is composed of 70 videos: (i) 30 videos of falls; and (ii) 40 videos displaying diverse activities. We used [Chen et al., 2019], pre-trained only on the person class of the COCO dataset, to obtain the bounding box annotations for the ground truth. On average, there are 56 videos in the training dataset and 14 videos in the testing dataset. For the UR Fall dataset we calculate specificity, sensitivity, and accuracy along with mAP for comparison.

(1) Sensitivity: a metric to evaluate fall detection, computed as the ratio of true positives to the total number of falls:
\text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \qquad (4)

(2) Specificity: a metric to evaluate how well our algorithm detects only "fall" and avoids misclassifying the "not fall" class:

\text{Specificity} = \frac{TN}{TN + FP} \times 100 \qquad (5)

(3) Accuracy: a metric to compute how well our algorithm differentiates between fall and non-fall videos:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (6)

A true positive (TP) means that the frame contains a fall and our algorithm detected a fall in that frame. A true negative (TN) refers to frames that do not contain a fall where our algorithm does not detect a fall. A false negative (FN) designates frames containing falls where our algorithm fails to detect the fall. Finally, a false positive (FP) indicates frames that do not contain a fall, yet our algorithm claims to detect one. For the sake of comparison with the other classification-based State-of-the-Art papers, we take the detection with the highest confidence score from the output of S-RAD and compare its class label with the ground-truth class label to calculate the above-mentioned metrics. Since our approach is based on frame-level detection, the classification task on the UR Fall dataset is also done at the frame level. We achieved a competitive score of 96.54% in mAP (detection task at the frame level). It is important to note that the other State-of-the-Art approaches on this dataset relied solely on classification, hence our comparison concentrates on the classification metrics. The results are shown in Table 3, demonstrating S-RAD's capabilities in the field of healthcare.
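Eqs. (4)-(6) translate directly into code once the frame-level counts are tallied; a trivial sketch:

```python
def fall_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy (in %) from frame-level
    fall / not-fall counts, per Eqs. (4)-(6)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```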
Figure 5: Confusion matrix of S-RAD on UR Fall dataset
The confusion matrix in Figure 5 shows the ability of S-RAD to distinguish Fall from Not Fall, with only 4 instances misclassified as Fall.
Table 3: State-of-the-Art per-frame comparison on the UR Fall dataset

Metric       [Alaoui et al.]  [Lu et al.]  [Cameiro et al.]  [Leite et al.]  S-RAD
Sensitivity  100              -            100               100             -
Specificity  95               -            98.61             98.77           -
Accuracy     97.5             99.27        98.77             98.84           -
Performance Evaluation

The S-RAD framework has the advantages of reduced inference time and fewer parameters, enabling us to perform real-time, on-the-edge activity monitoring in a privacy-aware manner. We compare our framework with others in terms of FPS (frames per second) and mAP on the UCF-Sports Action dataset in Table 4. We tested our models on one Titan V GPU (except the work of TubeCNN [Hou et al., 2017], which was reported on a Titan X). The trade-off is between accuracy and inference FPS, as well as parameters. Among the State-of-the-Art approaches, our method has the second fastest run time and can process 41 frames per second, roughly three times faster than [Hou et al., 2017] and [Peng and Schmid, 2016]. Moreover, the number of parameters of our framework is the smallest, about 28.35 M in Table 4. Although works like [Duarte et al., 2018] have better FPS, their features are too heavy to fit into a real-time edge device; additionally, our work maintains a higher mAP at a higher resolution when compared to their work. We were unable to provide performance comparisons with the State-of-the-Art approaches on the UR Fall dataset, as most of the approaches are not publicly available to run on the edge device and do not provide performance metrics of their own.
Table 4: Comparison of server-class execution on the Nvidia Titan platform

Approach                              Input     Resolution  Param (M)  FPS     mAP
Multi-stream [Peng and Schmid, 2016]  RGB+Flow  600x1067    274        11.82   84.51
CapsuleNet [Duarte et al., 2018]      RGB       112x112     103.137    78.41   83.9
TubeCNN [Hou et al., 2017]            RGB       300x400     245.87     17.391  86.7
ACT [Kalogeiton et al., 2017]         RGB+Flow  300x300     50         12      87.7
S-RAD                                 RGB       300x400     28.35      41.64   85.04
We additionally evaluated our work on an edge platform, the Nvidia Xavier, to test its performance in a resource-constrained edge environment. We compare the work of VideoCapsuleNet [Duarte et al., 2018] with our approach, and despite their initial performance advantage on the Titan V, our work is the only model capable of running on the memory-constrained edge device. S-RAD, as opposed to VideoCapsuleNet, folds temporal data into the channel dimension and as a result avoids introducing another dimension to the tensor sizes. VideoCapsuleNet not only processes 3D spatio-temporal feature maps, but also introduces another dimension of complexity in the form of capsules. On the Xavier we observed 6.0 FPS with only 5.21 W of total SoC (system-on-chip) power consumption.
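Throughput figures of this kind can be reproduced with a simple timing harness like the hypothetical sketch below (not the authors' benchmark code); on the Xavier, the SoC power reading comes from a separate utility such as tegrastats:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, clip, n_runs=50, warmup=10):
    """Average frames-per-second over n_runs forward passes.
    clip: (n_frames, C, H, W) input tensor already on the target device."""
    model.eval()
    for _ in range(warmup):                  # warm up kernels and caches
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # make GPU timing exact
    start = time.time()
    for _ in range(n_runs):
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_runs * clip.size(0) / (time.time() - start)
```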
Conclusion

This paper introduced a novel Single Run Action Detector (S-RAD) for activity monitoring. S-RAD provides end-to-end action detection without the use of computationally heavy methods, with the ability for real-time execution on embedded edge devices. S-RAD is a privacy-preserving approach and inherently protects Personally Identifiable Information (PII). Results on the UCF-Sports and UR Fall datasets presented accuracy comparable to State-of-the-Art approaches with significantly lower model size and computation demand, and the ability for real-time execution on an embedded edge device.
References

[Alaoui et al., 2019] A. Y. Alaoui, S. El Fkihi, and R. O. H. Thami. Fall detection for elderly people using the variation of key points of human skeleton. IEEE Access, 7:154786–154795, 2019.

[Atallah et al., 2011] Louis Atallah, Benny Lo, Rachel King, and Guang-Zhong Yang. Sensor positioning for activity recognition using wearable accelerometers. IEEE Transactions on Biomedical Circuits and Systems, 5(4):320–329, 2011.

[Cameiro et al., 2019] S. A. Cameiro, G. P. da Silva, G. V. Leite, R. Moreno, S. J. F. Guimarães, and H. Pedrini. Multi-stream deep convolutional network using high-level features applied to fall detection in video sequences, pages 293–298, 2019.

[Chen et al., 2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.

[Duarte et al., 2018] Kevin Duarte, Yogesh S. Rawat, and Mubarak Shah. VideoCapsuleNet: A simplified network for action detection. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 7621–7630, Red Hook, NY, USA, 2018. Curran Associates Inc.

[Everingham et al., 2010] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010.

[Gkioxari and Malik, 2015] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Hou et al., 2017] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In The IEEE International Conference on Computer Vision (ICCV), October 2017.

[Kalogeiton et al., 2017] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. CoRR, abs/1705.01861, 2017.

[Kay et al., 2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[Kwolek and Kepski, 2014] B. Kwolek and Michal Kepski. Human fall detection on embedded platform using depth maps and wireless accelerometer. Computer Methods and Programs in Biomedicine, 117(3):489–501, 2014.

[Leite et al., 2019] G. Leite, G. Silva, and H. Pedrini. Fall detection in video sequences based on a three-stream convolutional neural network, pages 191–195, 2019.

[Lin et al., 2018] Ji Lin, Chuang Gan, and Song Han. Temporal shift module for efficient video understanding. CoRR, abs/1811.08383, 2018.

[Liu et al., 2015] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.

[Lu et al., 2019] N. Lu, Y. Wu, L. Feng, and J. Song. Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data. IEEE Journal of Biomedical and Health Informatics, 23(1):314–323, 2019.

[Mirzadeh and Ghasemzadeh, 2020] Seyed Iman Mirzadeh and Hassan Ghasemzadeh. Optimal policy for deployment of machine learning models on energy-bounded systems. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), 2020.

[Neff et al., 2020] C. Neff, M. Mendieta, S. Mohan, M. Baharani, S. Rogers, and H. Tabkhi. REVAMP2T: Real-time edge video analytics for multicamera privacy-aware pedestrian tracking. IEEE Internet of Things Journal, 7(4):2591–2602, 2020.

[Pagan et al., 2018] Josue Pagan, Ramin Fallahzadeh, Mahdi Pedram, Jose L. Risco-Martin, Jose M. Moya, Jose L. Ayala, and Hassan Ghasemzadeh. Toward ultra-low-power remote health monitoring: An optimal and adaptive compressed sensing framework for activity recognition. IEEE Transactions on Mobile Computing (TMC), 18(3):658–673, 2018.

[Peng and Schmid, 2016] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision, pages 744–759. Springer, 2016.

[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

[Saeedi et al., 2014] Ramyar Saeedi, Janet Purath, Krishna Venkatasubramanian, and Hassan Ghasemzadeh. Toward seamless wearable sensing: Automatic on-body sensor localization for physical activity monitoring, pages 5385–5388. IEEE, 2014.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014.

[Soomro and Zamir, 2014] Khurram Soomro and Amir Roshan Zamir. Action recognition in realistic sports videos. 2014.

[Tran et al., 2014] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. C3D: Generic features for video analysis. CoRR, abs/1412.0767, 2014.

[Wang et al., 2016] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. CoRR, abs/1608.00859, 2016.

[Weinzaepfel et al., 2015] Philippe Weinzaepfel, Zaïd Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In The IEEE International Conference on Computer Vision (ICCV), 2015.