Jiyang Gao
University of Southern California
Publications
Featured research published by Jiyang Gao.
international conference on multimedia retrieval | 2016
Jiyang Gao; Chen Sun; Ram Nevatia
Action classification in still images is an important task in computer vision. It is challenging because the appearance of an action may vary with its context (e.g., associated objects). Manual labeling of context information would be time-consuming and difficult to scale up. To address this challenge, we propose a method to automatically discover and cluster action concepts, and to learn their classifiers, from weakly supervised image-sentence corpora. It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images. Candidate action concepts are then clustered using a multi-modal representation that combines image embeddings from deep convolutional networks with text embeddings from word2vec. More than one hundred human action concept classifiers are learned from the Flickr30k dataset with no additional human effort, and promising classification results are obtained. We further apply the AdaBoost algorithm to automatically select and combine relevant action concepts given an action query. Promising results are shown on the PASCAL VOC 2012 action classification benchmark, which has zero overlap with Flickr30k.
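As a rough illustration of the discovery step, the sketch below extracts verb-object pairs as candidate concepts and clusters their fused multi-modal embeddings. This is not the authors' code: spaCy stands in for whatever parser the paper used, the image/text vectors are assumed to come from a CNN and word2vec respectively, and the cluster count k is left to the user.

```python
# Minimal sketch of verb-object concept discovery and multi-modal clustering.
import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(sentence):
    """Extract (verb, direct object) pairs as candidate action concepts."""
    doc = nlp(sentence)
    return [(tok.head.lemma_, tok.lemma_)
            for tok in doc
            if tok.dep_ == "dobj" and tok.head.pos_ == "VERB"]

def multimodal_embedding(image_vec, text_vec):
    """Fuse an image embedding (e.g., from a CNN) with a text embedding
    (e.g., from word2vec) by L2-normalizing and concatenating them."""
    i = image_vec / np.linalg.norm(image_vec)
    t = text_vec / np.linalg.norm(text_vec)
    return np.concatenate([i, t])

def cluster_concepts(embeddings, k):
    """Group candidate concepts into k action-concept clusters."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(np.stack(embeddings))
```

For example, verb_object_pairs("a man riding a horse") yields [("ride", "horse")], which would then be checked for visualness against its image before entering the clustering step.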
international conference on multimedia retrieval | 2017
Kan Chen; Rama Kovvuri; Jiyang Gao; Ram Nevatia
Given an image and a natural language query phrase, a grounding system localizes the mentioned objects in the image according to the query's specifications. State-of-the-art methods address the problem by ranking a set of proposal bounding boxes according to the query's semantics, which makes them dependent on the performance of proposal generation systems. Moreover, query phrases from the same sentence may be semantically related and can provide useful cues for grounding objects. We propose a novel Multimodal Spatial Regression with semantic Context (MSRC) system which not only predicts the location of the ground truth based on proposal bounding boxes, but also refines prediction results by penalizing similarities between different queries coming from the same sentence. The advantages of MSRC are twofold: first, it removes the performance limitation imposed by proposal generation algorithms by using a spatial regression network. Second, MSRC not only encodes the semantics of a query phrase, but also models its relation with other queries in the same sentence (i.e., context) via a context refinement network. Experiments show the MSRC system provides a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with 6.64% and 5.28% increases over the state of the art, respectively.
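To make the spatial-regression idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the feature dimension, layer sizes, and the standard box-offset parameterization are illustrative assumptions. Given a fused query-proposal feature, a small head predicts a matching score and an offset that refines the proposal box.

```python
import torch
import torch.nn as nn

class SpatialRegressionHead(nn.Module):
    """Predict a query-proposal matching score and a refining box offset."""
    def __init__(self, feat_dim=1024, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)   # how well the proposal matches the query
        self.offset = nn.Linear(hidden, 4)  # (dx, dy, dw, dh) refinement

    def forward(self, fused_feat):
        h = self.trunk(fused_feat)
        return self.score(h), self.offset(h)

def apply_offsets(boxes, offsets):
    """Shift and scale proposals (x, y, w, h) by the predicted offsets."""
    x, y, w, h = boxes.unbind(-1)
    dx, dy, dw, dh = offsets.unbind(-1)
    return torch.stack([x + dx * w, y + dy * h,
                        w * torch.exp(dw), h * torch.exp(dh)], dim=-1)
```

Because the head regresses offsets rather than only ranking boxes, the final localization is not capped by the best available proposal, which is the limitation the abstract points to.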
european conference on computer vision | 2018
Jiyang Gao; Kan Chen; Ram Nevatia
Temporal action proposal generation is an important task: akin to object proposals, temporal action proposals are intended to capture "clips", i.e., temporal intervals in videos that are likely to contain an action. Previous methods can be divided into two groups: sliding-window ranking and actionness-score grouping. Sliding windows uniformly cover all segments in a video, but their temporal boundaries are imprecise; grouping-based methods may produce more precise boundaries, but they may omit proposals when the actionness scores are of low quality. Based on the complementary characteristics of these two approaches, we propose a novel Complementary Temporal Action Proposal (CTAP) generator. Specifically, we apply a Proposal-level Actionness Trustworthiness Estimator (PATE) to the sliding-window proposals to estimate the probability that each action can be correctly detected from actionness scores; the windows with high scores are collected. The collected sliding windows and the actionness proposals are then processed by a temporal convolutional neural network for proposal ranking and boundary adjustment. CTAP outperforms state-of-the-art methods on average recall (AR) by a large margin on the THUMOS-14 and ActivityNet 1.3 datasets. We further apply CTAP as the proposal generation method in an existing action detector and show consistent, significant improvements.
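The complementary-fusion step might look roughly like the sketch below. It is an assumption-laden illustration, not the released CTAP code: the threshold, feature dimensions, and the simple score/offset head are all placeholders, and the PATE network itself is taken as given.

```python
import torch
import torch.nn as nn

def fuse_proposals(sw_proposals, pate_scores, actionness_proposals, thresh=0.5):
    """Keep the sliding windows PATE trusts, then merge them with the
    actionness-grouped proposals for joint ranking and adjustment."""
    collected = sw_proposals[pate_scores > thresh]
    return torch.cat([collected, actionness_proposals], dim=0)

class ProposalRanker(nn.Module):
    """Temporal conv net that scores proposals and adjusts their boundaries."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Linear(64, 3)  # (proposal score, start offset, end offset)

    def forward(self, feats):          # feats: (N, feat_dim, T)
        h = self.conv(feats).mean(-1)  # temporal average pooling
        return self.head(h)
```

The union keeps the precise boundaries of actionness proposals where grouping works, while the PATE-filtered windows cover actions that grouping misses, which is the complementarity the abstract describes.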
asian conference on computer vision | 2016
Jiyang Gao; Ram Nevatia
Action classification in still images has been a popular research topic in computer vision. Labeling large-scale datasets for action classification requires tremendous manual work, which is hard to scale up. Besides, the action categories in such datasets are pre-defined and the vocabularies are fixed. However, humans may describe the same action with different phrases, which makes vocabulary expansion difficult for traditional fully supervised methods. We observe that large amounts of images with sentence descriptions are readily available on the Internet. These sentence descriptions can be regarded as weak labels for the images; they contain rich information and can be used to learn flexible expressions of action categories. We propose a method to learn an Action Concept Tree (ACT) and an Action Semantic Alignment (ASA) model for classification from image-description data via a two-stage learning process. A new dataset for the task of learning actions from descriptions is built. Experimental results show that our method significantly outperforms several baseline methods.
International Journal of Multimedia Information Retrieval | 2018
Kan Chen; Rama Kovvuri; Jiyang Gao; Ram Nevatia
Given a textual description of an image, phrase grounding localizes the objects in the image referred to by the query phrases in the description. State-of-the-art methods treat phrase grounding as a ranking problem and address it by retrieving a set of proposals according to the query's semantics; they are thus limited by the performance of independent proposal generation systems and ignore useful cues from context in the description. In this paper, we propose a novel multimodal spatial regression with semantic context (MSRC) system which not only predicts the location of the ground truth based on proposal bounding boxes, but also refines prediction results by penalizing similarities between different queries coming from the same sentence. There are two advantages of MSRC: first, it sidesteps the performance upper bound imposed by independent proposal generation systems by adopting a regression mechanism. Second, MSRC not only encodes the semantics of a query phrase, but also considers its relation with context (i.e., other queries from the same sentence) via a context refinement network. Experiments show the MSRC system achieves a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with 6.64% and 5.28% increases over the state of the art, respectively.
international conference on computer vision | 2017
Jiyang Gao; Zhenheng Yang; Chen Sun; Kan Chen; Ram Nevatia
international conference on computer vision | 2017
Jiyang Gao; Chen Sun; Zhenheng Yang; Ram Nevatia
british machine vision conference | 2017
Jiyang Gao; Zhenheng Yang; Ram Nevatia
british machine vision conference | 2017
Jiyang Gao; Zhenheng Yang; Ram Nevatia
computer vision and pattern recognition | 2018
Jiyang Gao; Runzhou Ge; Kan Chen; Ram Nevatia