ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) | 2021

Socializing the Videos: A Multimodal Approach for Social Relation Recognition

Abstract

As a crucial task for video analysis, social relation recognition for characters not only provides a semantically rich description of video content but also supports intelligent applications, e.g., video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. At the same time, the development of social media platforms has now promoted the emergence of crowdsourced comments, which may enhance the recognition task with semantic and descriptive cues. To that end, in this article, we propose a novel multimodal-based solution to deal with the character relation recognition task. Specifically, we capture the target character pairs via a search module and then design a multistream architecture for jointly embedding the visual and textual information, in which feature fusion and an attention mechanism are adapted for better integrating the multimodal inputs. Finally, supervised learning is applied to classify character relations. Experiments on real-world data sets validate that our solution outperforms several competitive baselines.

Volume 17

Pages 1 - 23

DOI 10.1145/3416493

Language English

Journal ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
