Proceedings of the 29th ACM International Conference on Multimedia | 2021
Video Visual Relation Detection via Iterative Inference
Abstract
The core problem of video visual relation detection (VidVRD) lies in accurately classifying the relation triplets, which comprise the classes of the subject and object entities and the predicate classes of the various relationships between them. Existing VidVRD approaches classify these three relation components in either an independent or a cascaded manner, and thus fail to fully exploit the inter-dependency among them. To utilize this inter-dependency in tackling the challenges of visual relation recognition in videos, we propose a novel iterative relation inference approach for VidVRD. We derive our model from the viewpoint of joint relation classification, which is lightweight yet effective, and propose a training approach that better learns the dependency knowledge from likely correct triplet combinations. As such, the proposed inference approach is able to gradually refine the prediction for each component based on its learnt dependency and the predictions of the other two. Our ablation studies show that this iterative relation inference empirically converges in a few steps and consistently boosts performance over the baselines. Further, we incorporate it into a newly designed VidVRD architecture, named VidVRD-II (Iterative Inference), which generalizes well across different datasets. Experiments show that VidVRD-II achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR benchmark datasets.
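To make the iterative refinement idea concrete, the following is a minimal NumPy sketch, not the paper's actual model: each of the three relation components (subject, predicate, object) starts from a unary classifier over its own features, and at every step its logits are augmented by the other two components' current soft predictions through learned pairwise dependency matrices. All weights, feature vectors, and class counts here are random, hypothetical stand-ins used only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: entity classes, predicate classes, feature dimension (all assumed).
N_ENT, N_PRED, D = 5, 4, 8

# Hypothetical visual features for one candidate relation instance.
f_subj, f_pred, f_obj = (rng.normal(size=D) for _ in range(3))

# Hypothetical unary classifier weights (feature -> class logits).
W_s = rng.normal(size=(N_ENT, D))
W_p = rng.normal(size=(N_PRED, D))
W_o = rng.normal(size=(N_ENT, D))

# Hypothetical pairwise dependency matrices: map the other two components'
# soft predictions into additional logits for this component.
A_sp = rng.normal(size=(N_ENT, N_PRED))   # predicate -> subject
A_so = rng.normal(size=(N_ENT, N_ENT))    # object   -> subject
A_ps = rng.normal(size=(N_PRED, N_ENT))   # subject  -> predicate
A_po = rng.normal(size=(N_PRED, N_ENT))   # object   -> predicate
A_os = rng.normal(size=(N_ENT, N_ENT))    # subject  -> object
A_op = rng.normal(size=(N_ENT, N_PRED))   # predicate -> object

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Initial predictions from the unary classifiers alone.
p_s = softmax(W_s @ f_subj)
p_p = softmax(W_p @ f_pred)
p_o = softmax(W_o @ f_obj)

for step in range(10):
    # Refine each component using the other two components' current predictions.
    new_s = softmax(W_s @ f_subj + A_sp @ p_p + A_so @ p_o)
    new_p = softmax(W_p @ f_pred + A_ps @ p_s + A_po @ p_o)
    new_o = softmax(W_o @ f_obj + A_os @ p_s + A_op @ p_p)
    delta = max(np.abs(new_s - p_s).max(),
                np.abs(new_p - p_p).max(),
                np.abs(new_o - p_o).max())
    p_s, p_p, p_o = new_s, new_p, new_o
    if delta < 1e-4:  # stop once the predictions stop changing
        break

# The final triplet prediction is the argmax of each refined distribution.
triplet = (int(p_s.argmax()), int(p_p.argmax()), int(p_o.argmax()))
print("predicted (subject, predicate, object) class indices:", triplet)
```

In this toy version the dependency matrices are random, so the fixed point is meaningless; in the paper they would be learned from likely correct triplet combinations, which is what lets each refinement step pull the three predictions toward mutually consistent triplets.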