IEEE Signal Processing Letters | 2021
A Unimodal Reinforced Transformer With Time Squeeze Fusion for Multimodal Sentiment Analysis
Abstract
Multimodal sentiment analysis refers to inferring sentiment from language, acoustic, and visual sequences. Previous studies focus on analyzing aligned sequences, yet analyzing unaligned sequences is more practical in real-world scenarios. Because unaligned multimodal sequences contain long-range time dependencies and no time-alignment information is provided, exploring the time-dependent interactions within them is more challenging. To this end, we introduce time squeeze fusion, which automatically explores time-dependent interactions by modeling the unimodal and multimodal sequences from the perspective of compressing the time dimension. Moreover, prior methods tend to fuse unimodal features into a multimodal embedding, from which sentiment is inferred. However, we argue that unimodal information may be lost in this process, or the generated multimodal embedding may be redundant. To address this issue, we propose a unimodal reinforced Transformer that progressively attends to and distills unimodal information from the multimodal embedding, enabling the multimodal embedding to highlight the discriminative unimodal information. Extensive experiments suggest that our model achieves state-of-the-art performance in terms of accuracy and F1 score on the MOSEI dataset.
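The abstract gives no equations, so the following is only a minimal NumPy sketch of one plausible reading of "compressing the time dimension": each modality's variable-length sequence is pooled over time with learned attention weights, after which the fixed-size unimodal vectors can be fused. The function names (`time_squeeze`) and the attention-pooling formulation are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_squeeze(seq, w):
    """Compress a (T, d) sequence to a (d,) vector by attention pooling.

    `w` stands in for a learned projection; this is an illustrative
    interpretation of squeezing the time dimension, so sequences of
    different (unaligned) lengths map to same-size embeddings.
    """
    scores = softmax(seq @ w)   # (T,) attention weights over time steps
    return scores @ seq         # (d,) time-squeezed unimodal embedding

# Toy unaligned modalities: different sequence lengths, shared feature size.
rng = np.random.default_rng(0)
lang = time_squeeze(rng.normal(size=(50, 8)), rng.normal(size=8))
audio = time_squeeze(rng.normal(size=(120, 8)), rng.normal(size=8))
visual = time_squeeze(rng.normal(size=(90, 8)), rng.normal(size=8))

# Simple fusion of the squeezed vectors into one multimodal embedding.
fused = np.concatenate([lang, audio, visual])  # shape (24,)
```

Because the time dimension is removed before fusion, no cross-modal time alignment is required, which matches the motivation stated above for handling unaligned sequences.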