Archive | 2019

Automatic Alignment Methods for Visual and Textual Data with Narrative Content

Abstract


Visual and linguistic media have been the most common means of exchanging information, self-expression, and storytelling since ancient times. Advancing technology has enriched these media with novel forms such as digital images, digital videos, online books, and blogs, which are growing tremendously in quantity. Today, we are at a point where we encounter various manifestations of the same story, for which joint analysis becomes crucial for summarization, archiving, and automatic meta-data annotation. This raises the challenge of aligning the multiple facets of a story, which would ease comprehensive understanding and joint analysis. Traditional approaches usually require complicated pre-processing steps (e.g., shot segmentation, speech/scene/face recognition and tracking), define a similarity metric between the sequence elements, and perform the alignment with standard techniques based on dynamic programming. They therefore suffer from the limitations introduced by the pre-processing steps and from the inherent drawbacks of Markov assumptions. In this thesis, we focus on aligning multi-modal data, specifically in visual and textual form, which is a fundamental step in learning and analyzing correspondences between different manifestations of the same story. To achieve this, we build upon recent advances in deep and recurrent neural networks, which provide efficient vectorial and contextual representations of the modalities to be aligned. Our label-based method for automatic alignment of video with narrative sentences offers a highly efficient alignment technique that does not require heavy pre-processing steps, while enabling a fine level of granularity in the alignment result.
We then develop an end-to-end differentiable neural architecture that addresses the limitations of two-stage solutions by optimizing the similarity metric specifically for the alignment task, while supporting one-to-one and one-to-many matching, skipping of unmatched elements, and non-monotonic alignment. Expanding on this neural architecture, we develop a sequential spatial phrase grounding network that formulates the grounding of multiple phrases as a sequential and contextual process, allowing many-to-many matching. In a large variety of experiments, we show that neural methods for multi-modal data alignment bear potential for further research and applications by reducing the large amount of manual effort that would otherwise be needed.
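The two-stage pipeline the abstract contrasts against — first fix a similarity metric between sequence elements, then align with standard dynamic programming — can be sketched as follows. This is a minimal, hypothetical illustration (Needleman-Wunsch-style monotonic alignment with a gap penalty), not the thesis's own method; all names and the toy similarity function are assumptions.

```python
# Sketch of a classic two-stage alignment: a user-supplied similarity
# metric sim(a, b) plus a dynamic program that finds the best monotonic
# alignment, allowing unmatched elements to be skipped at a gap penalty.

def align(seq_a, seq_b, sim, gap=-1.0):
    """Return matched (i, j) index pairs aligning seq_a with seq_b."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]),  # match
                score[i - 1][j] + gap,  # skip an element of seq_a
                score[i][j - 1] + gap,  # skip an element of seq_b
            )
    # Backtrack through the table to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(seq_a[i - 1], seq_b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

# Toy example: align video-shot labels with sentence labels.
shots = ["intro", "chase", "finale"]
sents = ["intro", "dialogue", "chase", "finale"]
pairs = align(shots, sents, lambda a, b: 1.0 if a == b else -1.0)
print(pairs)  # -> [(0, 0), (1, 2), (2, 3)]
```

Note how the fixed `sim` function is decoupled from the alignment itself: this is exactly the separation that the end-to-end differentiable architecture in the thesis removes by learning the similarity jointly with the alignment, and the strictly monotonic backtracking is the restriction its non-monotonic matching lifts.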

DOI 10.3929/ethz-b-000359170
Language English
