IEEE Transactions on Multimedia | 2019

First-Person Action Recognition With Temporal Pooling and Hilbert–Huang Transform

Abstract

This paper presents a convolutional neural network (CNN)-based approach to first-person action recognition that combines temporal pooling with the Hilbert–Huang transform (HHT). The approach first adaptively localizes temporal sub-actions, treats each channel of the extracted trajectory-pooled CNN features as a time series, and summarizes the temporal dynamics within each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT is employed to decompose the rank-pooled features, through empirical mode decomposition, into a finite and often small number of data-dependent functions called intrinsic mode functions (IMFs). Hilbert spectral analysis is then applied to each IMF component, and four salient descriptors are extracted and aggregated into the final video descriptor. Such a framework can not only precisely capture both long- and short-term tendencies, but also cope with the significant camera motion in first-person videos, yielding better accuracy. Furthermore, it works well for complex actions even with limited training samples. Simulations show that the proposed approach outperforms the main state-of-the-art methods on four publicly available first-person video datasets.
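The abstract walks through a multi-stage pipeline, so a compact sketch may help make the stages concrete. The following is a minimal NumPy/SciPy illustration under stated assumptions, not the paper's implementation: the fixed four-way sub-action split (the paper localizes sub-actions adaptively), mean pooling, the least-squares stand-in for ranking-SVM-based rank pooling, the bare-bones sifting EMD, and the generic amplitude/frequency statistics (the abstract does not name its four salient descriptors) are all simplifications made for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import hilbert


def rank_pool(seq):
    """Rank pooling: fit linear weights w so that the response w @ v_t
    grows with time over the smoothed sequence; w summarizes temporal
    evolution. Least-squares stand-in for a ranking SVM."""
    T = seq.shape[0]
    V = np.cumsum(seq, axis=0) / np.arange(1, T + 1)[:, None]  # running mean
    t = np.arange(1, T + 1, dtype=float)
    w, *_ = np.linalg.lstsq(V, t, rcond=None)
    return w


def _sift(x):
    """One sifting pass: subtract the mean of spline envelopes through
    local maxima and minima. Returns None when too few extrema remain."""
    t = np.arange(len(x))
    maxi = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    mini = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxi) < 4 or len(mini) < 4:
        return None
    upper = CubicSpline(maxi, x[maxi])(t)
    lower = CubicSpline(mini, x[mini])(t)
    return x - (upper + lower) / 2.0


def emd(x, max_imfs=4, n_sift=10):
    """Minimal empirical mode decomposition: repeatedly sift the residue
    to peel off intrinsic mode functions (IMFs)."""
    imfs, residue = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        h = residue.copy()
        for _ in range(n_sift):
            nxt = _sift(h)
            if nxt is None:
                break
            h = nxt
        if np.allclose(h, residue):  # nothing extracted; residue is monotone
            break
        imfs.append(h)
        residue = residue - h
    return imfs, residue


def hilbert_stats(imf):
    """Instantaneous amplitude/frequency from the analytic signal. Simple
    amplitude and frequency statistics stand in for the paper's four
    (unspecified) salient descriptors."""
    z = hilbert(imf)
    amp = np.abs(z)
    freq = np.diff(np.unwrap(np.angle(z))) / (2.0 * np.pi)
    return np.array([amp.mean(), amp.std(), freq.mean(), freq.std()])


# Toy pipeline on synthetic "trajectory-pooled CNN features" (T x D).
rng = np.random.default_rng(0)
feats = rng.standard_normal((120, 64))

subactions = np.array_split(feats, 4, axis=0)             # fixed split; the paper is adaptive
pooled = np.stack([s.mean(axis=0) for s in subactions])   # temporal (mean) pooling per sub-action
w = rank_pool(pooled)                                     # temporal evolution across sub-actions
imfs, _ = emd(w)                                          # HHT stage: EMD of the rank-pooled features
descriptor = (np.concatenate([hilbert_stats(m) for m in imfs])
              if imfs else hilbert_stats(w))              # final video descriptor
print(descriptor.shape)
```

In practice, a mature EMD implementation (e.g., the PyEMD package) and a proper ranking SVM would replace the toy stand-ins used here.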

Volume 21
Pages 3122-3135
DOI 10.1109/TMM.2019.2919434
Language English
Journal IEEE Transactions on Multimedia
