2021 IEEE 3rd Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS) | 2021

Policy and Value Deep RL for Temporal Language-Agnostic Street Image Captioning


Abstract


Dashcams and street cameras capture a huge amount of street and traffic video data. Temporal image captioning for such video data has been approached with an encoder-decoder framework and has achieved substantial success in captioning accuracy. Most recently, policy and value deep reinforcement learning (PVRL) has emerged as outperforming other decision-making frameworks for image captioning. In this paper, we design a framework that applies PVRL to an in-house dataset of temporal images of East Asian streets, as a step towards a language-agnostic street image captioning framework capable of captioning temporal images of any street regardless of location. For language invariance, the framework includes cross-modal retrieval at the character level so that similar words in different languages but in the same word-embedding space are grouped together. Our results show that PVRL can be applied successfully to temporal video captured on the streets and can produce natural semantic captions; preliminary studies on our dataset suggest that the framework can be used in multi-language scenarios.
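
The abstract does not state how the policy and value components are combined when a caption is generated. As a rough illustration only, and not the authors' method, the sketch below shows one generic way such a decision step could look: each candidate next word is scored by a weighted sum of the policy's log-probability and a value estimate. The function name `select_next_word`, the weight `lam`, and the toy numbers are all hypothetical.

```python
import math


def select_next_word(policy_probs, value_scores, lam=0.5):
    """Return the (word, score) pair that maximizes
    lam * log(policy probability) + (1 - lam) * value estimate.

    policy_probs: dict mapping candidate word -> policy probability
    value_scores: dict mapping candidate word -> value estimate
    lam:          hypothetical trade-off weight in [0, 1]
    """
    best_word, best_score = None, -math.inf
    for word, prob in policy_probs.items():
        if prob <= 0.0:
            continue  # log is undefined for non-positive probabilities
        value = value_scores.get(word, 0.0)  # missing value counts as 0
        score = lam * math.log(prob) + (1.0 - lam) * value
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Toy inputs (made up): the policy prefers "car", the value estimate prefers "bus".
policy_probs = {"car": 0.5, "bus": 0.3, "tree": 0.2}
value_scores = {"car": 0.1, "bus": 0.9, "tree": 0.2}

word, score = select_next_word(policy_probs, value_scores, lam=0.5)
print(word)
```

With `lam=1.0` the choice depends on the policy alone, and with `lam=0.0` on the value estimate alone; intermediate settings trade local word likelihood against the estimated quality of the resulting caption.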

Pages 211-214
DOI 10.1109/ECBIOS51820.2021.9510291
Language English
Journal 2021 IEEE 3rd Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS)

Full Text