2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS) | 2019

Temporal Action Detection with Fused Two-Stream 3D Residual Neural Networks and Bi-Directional LSTM

 
 
 

Abstract


This work presents an architecture for localizing target events of interest within long sequences of untrimmed video. Specifically, we focus on finding the temporal boundaries of target visual actions while bypassing irrelevant instances of other actions. Both appearance and motion information are crucial for discriminating between different actions. Based on this, we propose a trainable fused two-stream 3D convolutional neural network framework integrated with a bi-directional Long Short-Term Memory sequence model (2-stream 3D CNN + LSTM). The two-stream CNN models features from both RGB and optical-flow short video clips of $\delta = 16$ frames, extracted from the long input video sequence. The framework produces a sequence of class probability scores, one per video clip. Simple low-cost mean, average, and max filters are then used to localize and classify each relevant action instance and to label the whole video. This architecture exploits (1) the two-stream CNN design, (2) the spatiotemporal processing of 3D convolutional networks for capturing spatial and motion patterns, and (3) the temporal ordering and long-range dependencies captured by the sequence model for obtaining robust classifications at each time step. We evaluate our framework on the THUMOS 15 dataset, attaining 98.9% accuracy and 35.8% mAP on the video-level classification and relevant-action detection tasks, respectively.
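To make the described pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a small 3D-conv backbone stands in for each 3D residual stream, the clip length is $\delta = 16$ frames as in the abstract, and the class count, feature sizes, and hidden sizes (NUM_CLASSES, FEAT_DIM, the LSTM width) are illustrative assumptions.

```python
# Hedged sketch of a fused two-stream 3D CNN + BiLSTM for per-clip action scoring.
# All layer sizes and NUM_CLASSES are assumptions for illustration, not the paper's values.
import torch
import torch.nn as nn

NUM_CLASSES = 21      # hypothetical: 20 action classes + background
CLIP_LEN = 16         # delta = 16 frames per clip, as stated in the abstract
FEAT_DIM = 256        # illustrative per-stream feature size

class Stream3D(nn.Module):
    """One 3D-CNN stream (RGB or optical flow); a stand-in for a 3D residual network."""
    def __init__(self, in_channels):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),          # -> (B, 128, 1, 1, 1)
        )
        self.fc = nn.Linear(128, FEAT_DIM)

    def forward(self, clips):                 # clips: (B, C, 16, H, W)
        feats = self.backbone(clips).flatten(1)
        return self.fc(feats)                 # (B, FEAT_DIM)

class TwoStream3DBiLSTM(nn.Module):
    """Fuse RGB and flow clip features, then model the clip sequence with a BiLSTM."""
    def __init__(self):
        super().__init__()
        self.rgb_stream = Stream3D(in_channels=3)   # RGB clips
        self.flow_stream = Stream3D(in_channels=2)  # x/y optical-flow clips
        self.bilstm = nn.LSTM(2 * FEAT_DIM, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, NUM_CLASSES)

    def forward(self, rgb, flow):
        # rgb: (B, T_clips, 3, 16, H, W), flow: (B, T_clips, 2, 16, H, W)
        B, T = rgb.shape[:2]
        rgb_feat = self.rgb_stream(rgb.flatten(0, 1)).view(B, T, -1)
        flow_feat = self.flow_stream(flow.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([rgb_feat, flow_feat], dim=-1)     # late fusion per clip
        seq, _ = self.bilstm(fused)                           # (B, T, 256)
        return self.classifier(seq).softmax(dim=-1)           # per-clip class probabilities

# Usage: the per-clip probability sequence can then be smoothed with simple
# mean/max filters over time and thresholded to recover temporal boundaries
# of relevant action instances and a video-level label.
model = TwoStream3DBiLSTM()
rgb = torch.randn(1, 8, 3, CLIP_LEN, 112, 112)    # 8 clips of 16 RGB frames
flow = torch.randn(1, 8, 2, CLIP_LEN, 112, 112)   # matching optical-flow clips
probs = model(rgb, flow)                           # (1, 8, NUM_CLASSES)
```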

Pages 130-140
DOI 10.1109/ICICIS46948.2019.9014707
Language English
Journal 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS)
