
Publications


Featured research published by Junwei Liang.


International Conference on Multimedia Retrieval | 2016

Video Description Generation using Audio and Visual Cues

Qin Jin; Junwei Liang

Recent advances in image captioning have stimulated research on generating natural language descriptions for visual content, with applications such as assisting blind people. Video description generation is a more complex task than image captioning. Most work on video description generation focuses on visual information in the video; however, audio also provides rich information for describing video content. In this paper, we propose to generate video descriptions in natural sentences using both audio and visual cues, with unified deep neural networks combining convolutional and recurrent structures. Experimental results on the Microsoft Research Video Description (MSVD) corpus show that fusing audio information substantially improves video description performance.
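As a rough illustration of the kind of unified convolutional-recurrent model the abstract describes, below is a minimal PyTorch sketch of an encoder-decoder captioner that fuses audio and visual features. The early-fusion layer, feature dimensions, and vocabulary size are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of an audio-visual captioning model (illustrative; not the
# authors' exact architecture). Assumes precomputed CNN visual features and
# audio features per video; all dimensions and the vocabulary size are hypothetical.
import torch
import torch.nn as nn

class AVCaptioner(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=128, hid_dim=512, vocab_size=10000):
        super().__init__()
        # Early fusion: project concatenated audio + visual features into one space.
        self.fuse = nn.Linear(vis_dim + aud_dim, hid_dim)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, vis_feat, aud_feat, captions):
        # vis_feat: (B, vis_dim), aud_feat: (B, aud_dim), captions: (B, T) token ids
        fused = torch.tanh(self.fuse(torch.cat([vis_feat, aud_feat], dim=1)))
        h0 = fused.unsqueeze(0)              # fused video vector seeds the decoder
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)
        hidden, _ = self.decoder(emb, (h0, c0))
        return self.out(hidden)              # (B, T, vocab_size) word logits

model = AVCaptioner()
logits = model(torch.randn(4, 2048), torch.randn(4, 128),
               torch.randint(0, 10000, (4, 12)))
```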


Conference of the International Speech Communication Association | 2016

Generating Natural Video Descriptions via Multimodal Processing.

Qin Jin; Junwei Liang; Xiaozhu Lin

Generating natural language descriptions of visual content is an intriguing task with wide applications, such as assisting blind people. Recent advances in image captioning have stimulated deeper study of this task, including generating natural descriptions for videos. Most work on video description generation focuses on visual information in the video; however, audio also provides rich information for describing video content. In this paper, we propose to generate video descriptions in natural sentences via multimodal processing, that is, using both audio and visual cues in unified deep neural networks with convolutional and recurrent structures. Experimental results on the Microsoft Research Video Description (MSVD) corpus show that fusing audio information substantially improves video description performance. We also investigate the impact of the number of training images versus the number of captions on image captioning performance, and observe that when only limited training data is available, the number of distinct captions matters more than the number of distinct images. This finding will guide future work on improving the video description system by increasing the amount of training data.


International Conference on Multimedia Retrieval | 2015

Semantic Concept Annotation For User Generated Videos Using Soundtracks

Qin Jin; Junwei Liang; Xixi He; Gang Yang; Jieping Xu; Xirong Li

With the increasing use of audio sensors in user-generated content (UGC) collections, semantic concept annotation from video soundtracks has become an important research problem. In this paper, we investigate reducing the semantic gap of the traditional data-driven bag-of-audio-words annotation approach by exploiting a large amount of wild audio data and its rich user tags, from which we derive a new feature representation based on distances to semantic class models. We conduct experiments on the data collection from the HUAWEI Accurate and Fast Mobile Video Annotation Grand Challenge 2014, and also fuse the audio-only annotation system with a visual-only system. The experimental results show that our audio-only concept annotation system detects semantic concepts significantly better than random guessing. The new feature representation achieves annotation performance comparable to the bag-of-audio-words feature while providing more semantic interpretation in the output. The results also show that the audio-only system contributes significant complementary information to the visual-only concept annotation system, both boosting performance and enabling better interpretation of semantic concepts visually and acoustically.
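As a rough illustration of the semantic-class-model-distance idea, the sketch below trains one classifier per user tag on (synthetic stand-in) bag-of-audio-words histograms and then represents a new clip by its vector of class-model scores. The tag set, classifier choice, and data are hypothetical, not the paper's setup.

```python
# Sketch of a "semantic class model distance" representation (illustrative).
# Assumes bag-of-audio-words histograms are already computed; the tag set,
# classifier, and data are hypothetical stand-ins for the tagged web audio.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_web_clips, n_words = 500, 256
web_boaw = rng.random((n_web_clips, n_words))          # BoAW histograms of web audio
web_tags = {"crowd": rng.integers(0, 2, n_web_clips),  # binary user-tag labels
            "music": rng.integers(0, 2, n_web_clips),
            "speech": rng.integers(0, 2, n_web_clips)}

# One semantic class model per web tag.
class_models = {tag: LogisticRegression(max_iter=1000).fit(web_boaw, y)
                for tag, y in web_tags.items()}

def semantic_feature(boaw_hist):
    """Map a clip's BoAW histogram to a vector of class-model scores."""
    x = boaw_hist.reshape(1, -1)
    return np.array([m.decision_function(x)[0] for m in class_models.values()])

clip = rng.random(n_words)
print(semantic_feature(clip))   # low-dimensional, semantically interpretable feature
```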


Advances in Multimedia | 2014

Semantic Concept Annotation of Consumer Videos at Frame-Level Using Audio

Junwei Liang; Qin Jin; Xixi He; Gang Yang; Jieping Xu; Xirong Li

With the increasing use of audio sensors in user-generated content (UGC) collections, semantic concept annotation using audio streams has become an important research problem. Huawei initiated a grand challenge at the International Conference on Multimedia & Expo (ICME) 2014: the Huawei Accurate and Fast Mobile Video Annotation Challenge. In this paper, we present our audio-only semantic concept annotation system for the Huawei challenge. The system extracts the audio stream from the video data and low-level acoustic features from the audio stream. A bag-of-features representation is generated from the low-level features and used as input to train support vector machine (SVM) concept classifiers. The experimental results show that our audio-only concept annotation system detects semantic concepts significantly better than random guessing. It also provides important complementary information to the visual-based concept annotation system for performance boosts.
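A minimal sketch of the bag-of-features plus SVM pipeline the abstract describes is shown below, assuming per-frame acoustic features (e.g., MFCCs) have already been extracted. The codebook size, kernel, and data are illustrative stand-ins, not the challenge system's settings.

```python
# Minimal bag-of-audio-words + SVM pipeline (illustrative sketch, not the
# challenge system). Uses random data in place of real MFCC frames.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
codebook_size = 64

# 1) Learn an audio-word codebook from pooled training frames.
train_frames = rng.random((5000, 13))                  # e.g., 13-dim MFCC frames
codebook = KMeans(n_clusters=codebook_size, n_init=10).fit(train_frames)

def boaw_histogram(frames):
    """Quantize frames to audio words and return a normalized histogram."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook_size).astype(float)
    return hist / max(hist.sum(), 1.0)

# 2) Train one SVM concept classifier on video-level histograms.
videos = [rng.random((rng.integers(100, 300), 13)) for _ in range(40)]
X = np.stack([boaw_histogram(v) for v in videos])
y = rng.integers(0, 2, len(videos))                    # concept present / absent
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))                        # concept scores for 3 videos
```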


International Conference on Acoustics, Speech, and Signal Processing | 2015

Detecting semantic concepts in consumer videos using audio

Junwei Liang; Qin Jin; Xixi He; Gang Yang; Jieping Xu; Xirong Li

With the increasing use of audio sensors in user-generated content collection, detecting semantic concepts from audio streams has become an important research problem. In this paper, we present a semantic concept annotation system that uses only the soundtrack/audio of a video. We investigate three different acoustic feature representations for audio semantic concept annotation and explore fusion of the audio annotation with visual annotation systems. We test our system on the data collection from the HUAWEI Accurate and Fast Mobile Video Annotation Grand Challenge 2014. The experimental results show that our audio-only concept annotation system detects semantic concepts significantly better than random guessing, and provides significant complementary information to the visual-based concept annotation system for performance boosts. Further analysis shows that, to interpret a semantic concept both visually and acoustically, it is better to train the visual and audio concept models separately on visually driven and audio-driven ground truth.
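One simple way to realize the audio-visual fusion mentioned above is late (score-level) fusion of the two systems' per-concept scores. The fusion weight below is a hypothetical hyperparameter, not a value from the paper.

```python
# Sketch of late (score-level) fusion of audio-only and visual-only concept
# detectors; the weight and example scores are hypothetical.
import numpy as np

def late_fusion(audio_scores, visual_scores, w_audio=0.3):
    """Weighted average of per-concept scores from the two modalities."""
    return w_audio * audio_scores + (1.0 - w_audio) * visual_scores

audio_scores = np.array([0.8, 0.1, 0.4])    # e.g., scores for [music, crowd, speech]
visual_scores = np.array([0.6, 0.3, 0.2])
print(late_fusion(audio_scores, visual_scores))
```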


International Conference on Multimedia Retrieval | 2018

Multimodal Filtering of Social Media for Temporal Monitoring and Event Analysis

Po-Yao Huang; Junwei Liang; Jean-Baptiste Lamare; Alexander G. Hauptmann

Developing an efficient and effective social media monitoring system has become one of the important steps toward improved public safety. With the explosive availability of user-generated content documenting most conflicts and human rights abuses around the world, analysts and first responders increasingly find themselves overwhelmed with massive amounts of noisy data from social media. In this paper, we construct a large-scale public safety event dataset with retrospective automatic labeling for 4.2 million multimodal tweets from 7 public safety events that occurred between 2013 and 2017. We propose a new multimodal social media filtering system composed of encoding, classification, and correlation networks that jointly learn shared and complementary visual and textual information to filter the most relevant and useful items from the noisy social media influx. The proposed model achieves significant improvements over competitive baselines under both retrospective and real-time experimental protocols.
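As a rough sketch of the kind of joint text-image relevance filter described above, the PyTorch snippet below encodes each modality, concatenates the encodings, and classifies a tweet as relevant or not. It is an illustrative stand-in, not the paper's encoding, classification, and correlation networks; all dimensions are assumptions.

```python
# Minimal sketch of a multimodal (text + image) relevance filter.
# Assumes precomputed text and image features per tweet; dims are hypothetical.
import torch
import torch.nn as nn

class TweetFilter(nn.Module):
    def __init__(self, txt_dim=300, img_dim=2048, hid_dim=256):
        super().__init__()
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hid_dim), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hid_dim), nn.ReLU())
        # Classify relevance from the concatenated modality encodings.
        self.cls = nn.Linear(2 * hid_dim, 2)   # relevant vs. irrelevant

    def forward(self, txt_feat, img_feat):
        z = torch.cat([self.txt_enc(txt_feat), self.img_enc(img_feat)], dim=1)
        return self.cls(z)

model = TweetFilter()
logits = model(torch.randn(8, 300), torch.randn(8, 2048))   # one row per tweet
```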


International Conference on Multimedia Retrieval | 2017

Leveraging Multi-modal Prior Knowledge for Large-scale Concept Learning in Noisy Web Data

Junwei Liang; Lu Jiang; Deyu Meng; Alexander G. Hauptmann

Learning video concept detectors automatically from big but noisy web data, with no additional manual annotations, is a novel but challenging area for the multimedia and machine learning communities. A considerable number of videos on the web are associated with rich but noisy contextual information, such as the title and other multimodal metadata, which provides weak annotations or labels about the video content. To tackle the problem of large-scale noisy learning, we propose a novel method called Multi-modal WEbly-Labeled Learning (WELL-MM), built on a state-of-the-art machine learning algorithm inspired by the human learning process. WELL-MM introduces a novel multimodal approach to incorporate meaningful prior knowledge, called a curriculum, from the noisy web videos. We empirically study curricula constructed from the multimodal features of Internet videos and images. Comprehensive experimental results on FCVID and YFCC100M demonstrate that WELL-MM outperforms state-of-the-art methods by a statistically significant margin when learning concepts from noisy web video data. The results also verify that WELL-MM is robust to the level of noise in the video data. Notably, WELL-MM trained on sufficient noisy web labels achieves better accuracy than supervised learning methods trained on clean, manually labeled data.
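To make the self-paced, curriculum-driven selection concrete, the sketch below shows a binary self-paced weighting step in which low-loss (easy) samples are admitted first and a per-sample prior derived from multimodal metadata relaxes or tightens the threshold. The thresholds and prior are hypothetical; this is not the WELL-MM objective itself.

```python
# Sketch of a self-paced weighting step with a curriculum prior, in the spirit
# of webly-labeled learning. All values are hypothetical stand-ins.
import numpy as np

def self_paced_weights(losses, lam, curriculum_prior):
    """Binary self-paced weights: keep sample i if loss_i < lam * prior_i."""
    return (losses < lam * curriculum_prior).astype(float)

rng = np.random.default_rng(0)
losses = rng.random(10)                  # current per-sample training losses
prior = 0.5 + rng.random(10)             # >1 means metadata suggests a cleaner label
for lam in (0.3, 0.6, 1.0):              # grow lam to admit noisier samples over time
    w = self_paced_weights(losses, lam, prior)
    print(f"lambda={lam:.1f}, selected {int(w.sum())}/10 samples")
```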


International Conference on Acoustics, Speech, and Signal Processing | 2017

Temporal localization of audio events for conflict monitoring in social media

Junwei Liang; Lu Jiang; Alexander G. Hauptmann

With the explosion in the availability of user-generated videos documenting conflicts and human rights abuses around the world, analysts and researchers increasingly find themselves overwhelmed with massive amounts of video data from which to acquire and analyze useful information. In this paper, we develop a temporal localization framework for intense audio events in videos that addresses this problem. The proposed method uses Localized Self-Paced Reranking (LSPaR) to refine the localization results: LSPaR works through samples from easy to noisier ones, so it can overcome the noisiness of the initial retrieval results from user-generated videos. We show our framework's efficacy on localizing intense audio events such as gunshots, and further experiments indicate that our method generalizes to localizing other audio events in noisy videos.
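The sketch below illustrates the sliding-window style of temporal localization that such a framework builds on: frame-level event scores are aggregated over windows and the windows are ranked by confidence. The scoring input is a hypothetical stand-in for a trained gunshot detector, and the self-paced reranking of LSPaR is not reproduced here.

```python
# Sketch of sliding-window temporal localization for an intense audio event.
# Frame-level scores here are synthetic; a real system would obtain them from
# a trained detector and then rerank the candidate windows.
import numpy as np

def localize_event(scores, frame_rate=100, win=50, hop=25, threshold=0.7):
    """Return (start_sec, end_sec, score) for windows whose mean score is high."""
    detections = []
    for start in range(0, len(scores) - win + 1, hop):
        s = scores[start:start + win].mean()
        if s > threshold:
            detections.append((start / frame_rate, (start + win) / frame_rate, s))
    # Rank detections by confidence (the initial retrieval list to be reranked).
    return sorted(detections, key=lambda d: -d[2])

rng = np.random.default_rng(0)
scores = np.clip(rng.normal(0.2, 0.1, 2000), 0, 1)   # mostly background
scores[700:780] = 0.95                                # an intense event around t = 7 s
print(localize_event(scores)[:3])
```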


International Conference on Acoustics, Speech, and Signal Processing | 2017

Synchronization for multi-perspective videos in the wild

Junwei Liang; Po-Yao Huang; Jia Chen; Alexander G. Hauptmann

In the era of social media, a large number of user-generated videos are uploaded to the Internet every day, capturing events all over the world. Reconstructing the event truth from information mined in these videos has emerged as a challenging task. Temporal alignment of videos "in the wild", which capture different moments from different positions and perspectives, is the critical step. In this paper, we propose a hierarchical approach to synchronize videos. Our system uses clustered audio signatures to align video pairs; global alignment of all videos is then achieved by forming alignable video groups with self-paced learning. Experiments on the Boston Marathon dataset show that the proposed method achieves excellent precision and robustness.
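Pairwise alignment of two recordings of the same event can be illustrated by cross-correlating their audio; the sketch below estimates the relative start offset of two overlapping clips. It is a simplified stand-in for the clustered audio-signature matching in the paper, and the envelope sample rate and signals are hypothetical.

```python
# Sketch of pairwise audio-based alignment via cross-correlation of two clips'
# audio envelopes (illustrative; not the paper's audio-signature clustering).
import numpy as np

def estimate_offset(env_a, env_b, rate=1000):
    """Return the start-time offset (seconds) of clip B relative to clip A."""
    corr = np.correlate(env_a - env_a.mean(), env_b - env_b.mean(), mode="full")
    lag = np.argmax(corr) - (len(env_b) - 1)
    return lag / rate

rng = np.random.default_rng(0)
scene = rng.normal(size=1000 * 10)       # 10 s of shared event audio (1 kHz envelope)
clip_a = scene[1000 * 2: 1000 * 8]       # camera A records seconds 2 to 8
clip_b = scene[1000 * 3: 1000 * 9]       # camera B records seconds 3 to 9
print(estimate_offset(clip_a, clip_b))   # ~ +1.0: camera B started about 1 s after A
```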


International Joint Conference on Artificial Intelligence | 2016

Learning to detect concepts from webly-labeled video data

Junwei Liang; Lu Jiang; Deyu Meng; Alexander G. Hauptmann

Collaboration


An overview of Junwei Liang's collaborations.

Top Co-Authors

Lu Jiang (Carnegie Mellon University)
Qin Jin (Renmin University of China)
Jieping Xu (Renmin University of China)
Xirong Li (Renmin University of China)
Xixi He (Renmin University of China)
Jia Chen (Carnegie Mellon University)
Po-Yao Huang (Carnegie Mellon University)
Deyu Meng (Xi'an Jiaotong University)
Gang Yang (Renmin University of China)