Jianfeng Dong
Zhejiang University
Publications
Featured research published by Jianfeng Dong.
acm multimedia | 2016
Jianfeng Dong; Xirong Li; Weiyu Lan; Yujia Huo; Cees G. M. Snoek
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test: our system ranks 4th in overall performance while scoring best on CIDEr-D, which measures the human-likeness of generated captions.
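The two modules lend themselves to a compact illustration. Below is a minimal PyTorch sketch of early embedding (fusing a tag embedding into the LSTM's visual input) and late reranking (mixing a sentence's language-model score with a video-sentence relevance score); all module names, layer sizes, and the interpolation weight alpha are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the two modules described above; shapes and names are assumed.
import torch
import torch.nn as nn

class EarlyEmbeddingCaptioner(nn.Module):
    """ConvNet + LSTM captioner whose visual input is enriched
    with an embedding of predicted video tags (early embedding)."""
    def __init__(self, vocab_size, tag_vocab_size,
                 feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.tag_embed = nn.Linear(tag_vocab_size, embed_dim)  # embed tag scores
        self.vis_embed = nn.Linear(feat_dim, embed_dim)        # embed CNN feature
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feat, tag_scores, captions):
        # Fuse the visual feature and the tag embedding into the LSTM's first input.
        v = self.vis_embed(cnn_feat) + self.tag_embed(tag_scores)   # (B, E)
        w = self.word_embed(captions)                               # (B, T, E)
        x = torch.cat([v.unsqueeze(1), w], dim=1)                   # prepend fused input
        h, _ = self.lstm(x)
        return self.out(h)                                          # next-word logits

def late_rerank(sentences, log_probs, relevance_scores, alpha=0.5):
    """Re-score generated sentences by mixing the language-model
    log-probability with a video-sentence relevance score (late reranking).
    alpha is an assumed interpolation weight."""
    scored = [(alpha * lp + (1 - alpha) * rel, s)
              for s, lp, rel in zip(sentences, log_probs, relevance_scores)]
    return max(scored)[1]  # sentence with the highest combined score
```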
international conference on multimedia retrieval | 2016
Xirong Li; Weiyu Lan; Jianfeng Dong; Hailong Liu
This paper extends research on automated image captioning in the dimension of language, studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate image captioning in this novel context, we present Flickr8k-CN, a bilingual extension of the popular Flickr8k set. The new multimedia dataset can be used to quantitatively assess the performance of Chinese captioning and English-Chinese machine translation. The possibility of re-using existing English data and models via machine translation is investigated. Our study suggests that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world. Data is publicly available at http://tinyurl.com/flickr8kcn
acm multimedia | 2015
Jianfeng Dong; Xirong Li; Shuai Liao; Jieping Xu; Duanqing Xu; Xiaoyong Du
How to estimate cross-media relevance between a given query and an unlabeled image is a key question in the MSR-Bing Image Retrieval Challenge. We answer the question by proposing cross-media relevance fusion, a conceptually simple framework that exploits the power of individual methods for cross-media relevance estimation. Four base cross-media relevance functions are investigated and later combined with weights optimized on the development set. With a DCG@25 of 0.5200 on the test dataset, the proposed image retrieval system secures first place in the evaluation.
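The fusion step itself is simple to sketch. Below is a minimal Python example of combining four base relevance scores with convex weights found by grid search on a development set; the base estimators, the weight grid, and the eval_fn hook are illustrative assumptions rather than the system's actual tuning procedure.

```python
# Sketch of cross-media relevance fusion: a weighted sum of base relevance
# functions, with weights tuned on a development set. All details assumed.
import itertools
import numpy as np

def fuse(base_scores, weights):
    """base_scores: (n_methods, n_pairs) relevance matrix; weights: (n_methods,)."""
    return np.asarray(weights) @ np.asarray(base_scores)

def tune_weights(dev_scores, eval_fn, step=0.1):
    """Grid-search convex weights for 4 base estimators, maximizing a
    retrieval metric (e.g. DCG@25) computed by eval_fn on the dev set."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best, best_w = -np.inf, None
    for w in itertools.product(grid, repeat=4):
        if abs(sum(w) - 1.0) > 1e-6:   # keep weights on the simplex
            continue
        metric = eval_fn(fuse(dev_scores, w))
        if metric > best:
            best, best_w = metric, w
    return best_w
```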
acm multimedia | 2017
Weiyu Lan; Xirong Li; Jianfeng Dong
Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. Only a few studies have been conducted on image captioning in a cross-lingual setting. Different from these works, which manually build a dataset for a target language, we aim to learn a cross-lingual captioning model entirely from machine-translated sentences. To overcome the lack of fluency in the translated sentences, we propose a fluency-guided learning framework. The framework comprises a module that automatically estimates the fluency of the sentences and another module that uses the estimated fluency scores to effectively train an image captioning model for the target language. As experiments on two bilingual (English-Chinese) datasets show, our approach improves both the fluency and the relevance of the generated captions in Chinese, without using any manually written sentences from the target language.
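The core of the framework, down-weighting training sentences by their estimated fluency, can be sketched briefly. The PyTorch snippet below shows one plausible form of a fluency-weighted loss; the fluency estimator is assumed to exist elsewhere, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of fluency-guided training: per-sentence cross-entropy weighted
# by an estimated fluency score of the machine-translated caption.
import torch
import torch.nn.functional as F

def fluency_weighted_loss(logits, targets, fluency, pad_idx=0):
    """logits: (B, T, V) next-word scores; targets: (B, T) word ids;
    fluency: (B,) scores in [0, 1] from a fluency-estimation module."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         ignore_index=pad_idx, reduction='none')   # (B, T)
    lengths = (targets != pad_idx).sum(dim=1).clamp(min=1)          # ignore padding
    per_sent = ce.sum(dim=1) / lengths                              # mean CE per sentence
    return (fluency * per_sent).mean()  # down-weight disfluent translations
```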
acm multimedia | 2017
Jianfeng Dong
In this paper, we summarize our work on cross-media retrieval, where the queries and the retrieved content are of different media types. We study cross-media retrieval in the context of two popular multimedia retrieval applications: image retrieval by textual queries and sentence retrieval by visual queries. For image retrieval by textual queries, we propose text2image, which converts computing cross-media relevance between images and textual queries into comparing the visual similarity among images. We also propose cross-media relevance fusion, a conceptual framework that combines multiple cross-media relevance estimators. These two techniques resulted in a winning entry in the Microsoft Image Retrieval Challenge at ACM MM 2015. For sentence retrieval by visual queries, we propose to compute cross-media relevance in a visual space exclusively. We contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. With the proposed Word2VisualVec model, we won the Video to Text Description task at TRECVID 2016.
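The Word2VisualVec idea, predicting a visual feature from text and ranking entirely in the visual space, can be sketched as follows. The layer sizes, the sentence encoding, and the ranking helper are illustrative assumptions rather than the published architecture.

```python
# Sketch of the Word2VisualVec idea: map a sentence vector into a visual
# feature space, then rank sentences by cosine similarity to a video feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Word2VisualVec(nn.Module):
    """MLP that predicts a visual feature vector from a sentence vector
    (e.g. aggregated word embeddings or a bag-of-words vector)."""
    def __init__(self, text_dim=500, hidden_dim=1000, visual_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, sent_vec):
        return self.net(sent_vec)  # predicted visual feature

def rank_sentences(model, video_feat, sent_vecs):
    """Relevance is computed exclusively in the visual space: cosine
    similarity between the video feature and the predicted features."""
    with torch.no_grad():
        pred = model(sent_vecs)                                    # (N, visual_dim)
        sims = F.cosine_similarity(pred, video_feat.unsqueeze(0))  # (N,)
    return sims.argsort(descending=True)  # best-matching sentences first
```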
arXiv: Computer Vision and Pattern Recognition | 2016
Jianfeng Dong; Xirong Li; Cees G. M. Snoek
IEEE Transactions on Multimedia | 2018
Jianfeng Dong; Xirong Li; Cees G. M. Snoek
IEEE Transactions on Multimedia | 2018
Jianfeng Dong; Xirong Li; Duanqing Xu
TRECVID Workshop | 2016
Cees G. M. Snoek; Jianfeng Dong; Xirong Li; Q. Wei; X. Wang; Weiyu Lan; Efstratios Gavves; N. Hussein; Dennis Koelma; Arnold W. M. Smeulders