Guangnan Ye
Columbia University
Publication
Featured research published by Guangnan Ye.
computer vision and pattern recognition | 2012
Guangnan Ye; Dong Liu; I-Hong Jhuo; Shih-Fu Chang
In this paper, we propose a rank minimization method to fuse the predicted confidence scores of multiple models, each of which is obtained based on a certain kind of feature. Specifically, we convert each confidence score vector obtained from one model into a pairwise relationship matrix, in which each entry characterizes the comparative relationship of the scores of two test samples. Our hypothesis is that the relative score relations are consistent among component models up to certain sparse deviations, despite the large variations that may exist in the absolute values of the raw scores. We then formulate score fusion as seeking a shared rank-2 pairwise relationship matrix, based on which the original score matrix from each individual model can be decomposed into the common rank-2 matrix plus sparse deviation errors. A robust score vector is then extracted to fit the recovered low-rank score relation matrix. We cast the problem as a nuclear-norm and ℓ1-norm optimization objective and employ the Augmented Lagrange Multiplier (ALM) method for the optimization. Our method is isotonic (i.e., scale invariant) to the numeric scales of the scores originating from different models. We experimentally show that the proposed method achieves significant performance gains on various tasks including object categorization and video event detection.
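To illustrate the fusion scheme described above, the following is a minimal Python sketch, not the authors' implementation: it builds a difference-based pairwise relation matrix per model (an assumed construction) and alternates soft thresholding (sparse deviations) with singular value thresholding (shared low-rank relation matrix) as a simplified stand-in for the nuclear-norm plus ℓ1 ALM solver.

```python
import numpy as np

def pairwise_relation(scores):
    """Comparative relation matrix T with T[i, j] = s_i - s_j. This choice is an
    assumption: s 1^T - 1 s^T is skew-symmetric with rank at most 2, matching the
    shared rank-2 relation matrix sought in the paper."""
    s = np.asarray(scores, dtype=float)
    return s[:, None] - s[None, :]

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(sig - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Soft thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def fuse(score_vectors, lam=0.1, tau=0.5, n_iter=200):
    """Find a shared low-rank relation matrix A such that each model's relation
    matrix T_k is approximately A + E_k with sparse E_k, then read off a robust
    score vector. Simple alternating proximal updates stand in for the ALM solver."""
    Ts = [pairwise_relation(s) for s in score_vectors]
    A = np.mean(Ts, axis=0)
    for _ in range(n_iter):
        Es = [shrink(T - A, lam) for T in Ts]            # sparse deviations per model
        A = svt(np.mean([T - E for T, E in zip(Ts, Es)], axis=0), tau)
    return A.mean(axis=1)                                # robust scores, up to a constant shift

if __name__ == "__main__":
    m1 = [0.9, 0.2, 0.6, 0.1]        # model 1 scores
    m2 = [8.0, 3.0, 9.9, 1.0]        # model 2: different scale, one deviating entry
    print(fuse([m1, m2]))
```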
computer vision and pattern recognition | 2012
Felix X. Yu; Rongrong Ji; Ming-Hen Tsai; Guangnan Ye; Shih-Fu Chang
Attribute-based queries offer an intuitive way of image retrieval, in which users can describe the intended search targets with understandable attributes. In this paper, we develop a general and powerful framework to solve this problem by leveraging a large pool of weak attributes comprised of automatic classifier scores or other mid-level representations that can be easily acquired with little or no human labor. We extend the existing retrieval model of modeling dependency within query attributes to modeling dependency of query attributes on a large pool of weak attributes, which is more expressive and scalable. To efficiently learn such a large dependency model without overfitting, we further propose a semi-supervised graphical model that maps each multi-attribute query to a subset of weak attributes. Through extensive experiments over several attribute benchmarks, we demonstrate consistent and significant performance improvements over state-of-the-art techniques. In addition, we compile the largest multi-attribute image retrieval dataset to date, including 126 fully labeled query attributes and 6,000 weak attributes of 0.26 million images.
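A hedged sketch of the retrieval idea under strong simplifying assumptions: instead of the paper's semi-supervised graphical model, query attributes are mapped to weak attributes by simple correlation on a labeled pool, and images are ranked by the summed scores of the selected weak attributes. Array names and thresholds below are illustrative only.

```python
import numpy as np

def select_weak_attributes(query_attrs, labels, weak_labeled, top_k=20):
    """For each query attribute, keep the weak attributes whose scores correlate
    most with its labels on the labeled pool -- a crude stand-in for the paper's
    semi-supervised dependency model."""
    selected = set()
    centered = weak_labeled - weak_labeled.mean(axis=0)
    for a in query_attrs:
        y = labels[:, a] - labels[:, a].mean()
        corr = np.abs(centered.T @ y)
        selected.update(np.argsort(-corr)[:top_k].tolist())
    return sorted(selected)

def rank_images(query_attrs, labels, weak_labeled, weak_all):
    """Rank the whole collection by the summed scores of the selected weak attributes."""
    cols = select_weak_attributes(query_attrs, labels, weak_labeled)
    return np.argsort(-weak_all[:, cols].sum(axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=(200, 126))     # query-attribute labels, labeled pool
    weak_labeled = rng.normal(size=(200, 2000))      # weak-attribute scores, labeled pool
    weak_all = rng.normal(size=(1000, 2000))         # weak-attribute scores, full collection
    print(rank_images([3, 7], labels, weak_labeled, weak_all)[:10])
```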
acm multimedia | 2015
Guangnan Ye; Yitong Li; Hongliang Xu; Dong Liu; Shih-Fu Chang
Event-specific concepts are the semantic concepts specifically designed for the events of interest, which can be used as a mid-level representation of complex events in videos. Existing methods only focus on defining event-specific concepts for a small number of pre-defined events and cannot handle novel unseen events. This motivates us to build a large-scale event-specific concept library that covers as many real-world events and their concepts as possible. Specifically, we choose WikiHow, an online forum containing a large number of how-to articles on human daily life events. We perform a coarse-to-fine event discovery process and discover 500 events from WikiHow articles. We then use each event name as a query to search YouTube and discover event-specific concepts from the tags of the returned videos. After an automatic filtering process, we end up with 95,321 videos and 4,490 concepts. We train a Convolutional Neural Network (CNN) model on the 95,321 videos over the 500 events, and use the model to extract deep learning features from video content. With the learned deep learning features, we train 4,490 binary SVM classifiers as the event-specific concept library. The concepts and events are further organized in a hierarchical structure defined by WikiHow, and the resulting concept library is called EventNet. Finally, the EventNet concept library is used to generate concept-based representations of event videos. To the best of our knowledge, EventNet represents the first video event ontology that organizes events and their concepts into a semantic structure. It offers great potential for event retrieval and browsing. Extensive experiments on the zero-shot event retrieval task, where no training samples are available, show that the proposed EventNet concept library consistently and significantly outperforms the state-of-the-art (such as the 20K ImageNet concepts trained with CNN) by a large margin of up to 207%. We also show that the EventNet structure can help users find relevant concepts for novel event queries that cannot be well handled by conventional text-based semantic analysis alone. The unique two-step approach of first applying event detection models and then detecting event-specific concepts also offers great potential to improve the efficiency and accuracy of event recounting, since only a very small number of event-specific concept classifiers need to be fired after event detection.
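A minimal sketch of the concept-library step, assuming CNN features have already been extracted: one calibrated linear SVM per concept, with a video's vector of concept scores serving as its concept-based representation. This is a greatly simplified stand-in for the 4,490-classifier EventNet library, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def train_concept_library(features, concept_labels):
    """Train one binary classifier per concept on precomputed CNN video features.
    LinearSVC with Platt-style calibration approximates the binary SVMs with
    probabilistic output described in the paper."""
    classifiers = []
    for c in range(concept_labels.shape[1]):
        clf = CalibratedClassifierCV(LinearSVC(), cv=3)
        clf.fit(features, concept_labels[:, c])
        classifiers.append(clf)
    return classifiers

def concept_representation(classifiers, features):
    """Concept-based representation: one probability per concept per video."""
    return np.column_stack([c.predict_proba(features)[:, 1] for c in classifiers])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 128))                  # toy CNN features for 300 videos
    Y = (rng.random((300, 5)) < 0.3).astype(int)     # toy labels for 5 concepts
    lib = train_concept_library(X, Y)
    print(concept_representation(lib, X[:3]).shape)  # (3, 5)
```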
international conference on multimedia retrieval | 2014
Jiawei Chen; Yin Cui; Guangnan Ye; Dong Liu; Shih-Fu Chang
Analysis and detection of complex events in videos require a semantic representation of the video content. Existing video semantic representation methods typically require users to pre-define an exhaustive concept lexicon and manually annotate the presence of the concepts in each video, which is infeasible for real-world video event detection problems. In this paper, we propose an automatic semantic concept discovery scheme that exploits Internet images and their associated tags. Given a target event and its textual descriptions, we crawl a collection of images and their associated tags by performing text-based image search using the noun and verb pairs extracted from the event descriptions. The system first identifies the candidate concepts for an event by measuring whether a tag is a meaningful word and visually detectable. A concept visual model is then built for each candidate concept using an SVM classifier with probabilistic output. Finally, the concept models are applied to generate concept-based video representations. We use TRECVID Multimedia Event Detection (MED) 2013 as our video test set and crawl 400K Flickr images to automatically discover 2,000 visual concepts. We show significant performance gains of the proposed concept discovery method on different video event detection tasks, including supervised event modeling over the concept space and semantic zero-shot retrieval without training examples. Importantly, we show that the proposed method of automatic concept discovery outperforms other well-known concept library construction approaches such as Classemes and ImageNet by a large margin (228%) in zero-shot event retrieval. Finally, subjective evaluation by humans also confirms the clear superiority of the proposed method in discovering concepts for event representation.
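A hedged sketch of the candidate-concept filtering, under assumptions: tag frequency stands in for the "meaningful word" test, and cross-validated separability of tagged vs. untagged images stands in for the visual-detectability test. The thresholds are illustrative, not values from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def discover_concepts(tag_lists, features, min_count=20, min_auc=0.65):
    """Keep tags that appear often enough and are 'visually detectable', i.e. a
    quick linear classifier separates tagged from untagged images better than
    chance on the crawled image features."""
    counts = {}
    for tags in tag_lists:
        for t in set(tags):
            counts[t] = counts.get(t, 0) + 1
    concepts = []
    for tag, n in counts.items():
        if n < min_count:
            continue                                   # crude 'meaningful word' proxy
        y = np.array([int(tag in tags) for tags in tag_lists])
        auc = cross_val_score(LinearSVC(), features, y, cv=3,
                              scoring="roc_auc").mean()
        if auc >= min_auc:
            concepts.append(tag)                       # visually detectable candidate
    return concepts

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 64))                     # toy image features
    tags = [["dog"] if x[0] > 0 else ["blurry"] for x in X]
    print(discover_concepts(tags, X))
```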
computer vision and pattern recognition | 2013
Dong Liu; Kuan-Ting Lai; Guangnan Ye; Ming-Syan Chen; Shih-Fu Chang
Late fusion addresses the problem of combining the prediction scores of multiple classifiers, in which each score is predicted by a classifier trained with a specific feature. However, existing methods generally use a fixed fusion weight for all the scores of a classifier, and thus fail to optimally determine the fusion weights for individual samples. In this paper, we propose a sample-specific late fusion method to address this issue. Specifically, we cast the problem as an information propagation process that propagates the fusion weights learned on the labeled samples to individual unlabeled samples, while enforcing that positive samples have higher fusion scores than negative samples. In this process, we identify the optimal fusion weights for each sample and push positive samples to top positions in the fusion score rank list. We formulate the problem as an L∞-norm constrained optimization problem and apply the Alternating Direction Method of Multipliers for the optimization. Extensive experimental results on various visual categorization tasks show that the proposed method consistently and significantly outperforms state-of-the-art late fusion methods. To the best of our knowledge, this is the first method supporting sample-specific fusion weight learning.
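A simplified sketch of the sample-specific idea, not the paper's solver: per-sample fusion weights are propagated from labeled samples over a kNN graph and then used to fuse each sample's classifier scores. The L∞ constraint, the ranking objective, and the ADMM optimization are omitted here.

```python
import numpy as np

def knn_affinity(X, k=10, gamma=1.0):
    """Row-normalized Gaussian kNN affinity over sample features."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)
    weakest = np.argsort(-W, axis=1)[:, k:]            # drop all but the k strongest neighbors
    for i, cols in enumerate(weakest):
        W[i, cols] = 0.0
    return W / W.sum(axis=1, keepdims=True)

def sample_specific_fusion(scores, X, labeled, w_labeled, n_iter=50, alpha=0.9):
    """Propagate per-sample fusion weights from labeled samples to the rest over a
    kNN graph, then fuse each sample's scores with its own weights."""
    n, m = scores.shape                                # n samples, m feature-specific classifiers
    P = knn_affinity(X)
    W = np.full((n, m), 1.0 / m)
    W[labeled] = w_labeled
    for _ in range(n_iter):
        W = alpha * (P @ W) + (1 - alpha) * np.full((n, m), 1.0 / m)
        W[labeled] = w_labeled                         # clamp labeled samples
        W = np.clip(W, 0.0, None)
        W /= W.sum(axis=1, keepdims=True)
    return (W * scores).sum(axis=1)                    # sample-specific fused scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 16))
    scores = rng.random((100, 3))                      # scores from 3 classifiers
    labeled = np.arange(10)
    w_lab = np.tile([0.6, 0.3, 0.1], (10, 1))
    print(sample_specific_fusion(scores, X, labeled, w_lab)[:5])
```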
international conference on computer vision | 2013
Guangnan Ye; Dong Liu; Jun Wang; Shih-Fu Chang
Recently, learning-based hashing methods have become popular for indexing large-scale media data. Hashing methods map high-dimensional features to compact binary codes that are efficient to match and robust in preserving the original similarity. However, most existing hashing methods treat videos as a simple aggregation of independent frames and index each video by combining the indexes of its frames. The structure information of videos, e.g., discriminative local visual commonality and temporal consistency, is often neglected in the design of hash functions. In this paper, we propose a supervised method that exploits structure learning techniques to design efficient hash functions. The proposed video hashing method formulates a minimization problem over a structure-regularized empirical loss. In particular, the structure regularization exploits the common local visual patterns occurring in video frames that are associated with the same semantic class, and simultaneously preserves the temporal consistency over successive frames from the same video. We show that the minimization objective can be efficiently solved by an Accelerated Proximal Gradient (APG) method. Extensive experiments on two large video benchmark datasets (up to around 150K video clips with over 12 million frames) show that the proposed method significantly outperforms state-of-the-art hashing methods.
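A toy sketch of video-level hashing, with a fixed random projection standing in for the hash functions the paper learns via APG under the structure-regularized loss: frames are hashed by the sign of a linear projection and aggregated into one video code by per-bit majority vote rather than indexed as independent frames.

```python
import numpy as np

def hash_frames(frame_feats, W, b):
    """Binary frame codes: sign of a linear projection. In the paper W is learned;
    here a fixed random projection is used purely for illustration."""
    return (frame_feats @ W + b > 0).astype(np.uint8)

def video_code(frame_codes):
    """Per-bit majority vote over frames, giving one compact code per video."""
    return (frame_codes.mean(axis=0) >= 0.5).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, bits = 128, 32
    W, b = rng.normal(size=(d, bits)), rng.normal(size=bits)
    clip1 = rng.normal(size=(40, d))                    # 40 frames of one clip
    clip2 = clip1 + 0.05 * rng.normal(size=(40, d))     # a near-duplicate clip
    c1 = video_code(hash_frames(clip1, W, b))
    c2 = video_code(hash_frames(clip2, W, b))
    print(hamming(c1, c2), "bits differ out of", bits)
```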
international conference on multimedia retrieval | 2012
Guangnan Ye; I-Hong Jhuo; Dong Liu; Yu-Gang Jiang; D. T. Lee; Shih-Fu Chang
Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to construct the bi-modal words that reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations, which are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone and over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
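A concrete sketch of the bi-modal word construction, assuming a visual-audio co-occurrence matrix as the bipartite graph and spectral co-clustering as the partitioning step (the paper's exact partitioning algorithm may differ). The second function re-quantizes a video's visual and audio histograms into a bi-modal Bag-of-Words by pooling within each bi-modal word.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def bimodal_words(cooccurrence, n_words=50, seed=0):
    """Partition the visual-audio co-occurrence matrix (visual words as rows,
    audio words as columns) into joint clusters, i.e. bi-modal words."""
    model = SpectralCoclustering(n_clusters=n_words, random_state=seed)
    model.fit(cooccurrence + 1e-6)            # small offset avoids all-zero rows/columns
    return model.row_labels_, model.column_labels_

def requantize(visual_hist, audio_hist, vis_map, aud_map, n_words, pool="avg"):
    """Re-quantize one video's visual/audio BoW histograms into a bi-modal BoW by
    average (or max) pooling within each bi-modal word."""
    bow = np.zeros(n_words)
    for w in range(n_words):
        vals = np.concatenate([visual_hist[vis_map == w], audio_hist[aud_map == w]])
        if vals.size:
            bow[w] = vals.mean() if pool == "avg" else vals.max()
    return bow

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C = rng.poisson(1.0, size=(500, 200)).astype(float)   # visual x audio co-occurrence counts
    vis_map, aud_map = bimodal_words(C, n_words=20)
    vh, ah = rng.random(500), rng.random(200)             # one video's BoW histograms
    print(requantize(vh, ah, vis_map, aud_map, 20).shape)
```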
machine vision applications | 2014
I-Hong Jhuo; Guangnan Ye; Shenghua Gao; Dong Liu; Yu-Gang Jiang; D. T. Lee; Shih-Fu Chang
Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse both modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio-visual patterns. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to produce the bi-modal words that reveal the joint patterns across modalities. Different pooling strategies are then employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations. Since it is difficult to predict the suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks with different sizes) and use multiple kernel learning to combine the resulting multiple representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using individual visual and audio features alone and over existing popular multi-modal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
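The journal version adds bi-modal words at several codebook sizes combined with multiple kernel learning. The sketch below approximates that step with a fixed uniform combination of chi-square kernels, which is an assumption rather than the MKL formulation used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def combined_kernel(reps_train, reps_test=None, weights=None):
    """Combine chi-square kernels built from bi-modal BoW representations at
    several codebook sizes; uniform weights stand in for learned MKL weights."""
    if weights is None:
        weights = np.ones(len(reps_train)) / len(reps_train)
    Ks = []
    for i, Xtr in enumerate(reps_train):
        Xte = Xtr if reps_test is None else reps_test[i]
        Ks.append(weights[i] * chi2_kernel(Xte, Xtr))
    return np.sum(Ks, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # bi-modal BoW representations of 100 training videos at 3 codebook sizes
    reps = [rng.random((100, k)) for k in (50, 100, 200)]
    y = rng.integers(0, 2, size=100)
    clf = SVC(kernel="precomputed").fit(combined_kernel(reps), y)
    print(clf.predict(combined_kernel(reps, reps))[:10])
```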
acm multimedia | 2012
Dong Liu; Guangnan Ye; Ching-Ting Chen; Shuicheng Yan; Shih-Fu Chang
Analysis and recommendation of multimedia information can be greatly improved if we know the interactions among content, users, and concepts, which can be readily observed in social media networks. However, such networks contain many heterogeneous entities and relations, making it difficult to fully represent and exploit the diverse array of information. In this paper, we develop a hybrid social media network through which the heterogeneous entities and relations are seamlessly integrated and a joint inference procedure across them can be developed. The network can be used to generate personalized information recommendations in response to specific targets of interest, e.g., personalized multimedia albums, targeted advertisement, and friend/topic recommendation. In the proposed network, each node denotes an entity and the multiple edges between nodes characterize the diverse relations between the entities (e.g., friends, similar contents, related concepts, favorites, tags, etc.). Given a query from a user indicating his/her information needs, a propagation over the hybrid social media network is employed to infer the utility scores of all the entities in the network, while learning an edge selection function that activates only a sparse subset of relevant edges, such that the query information can be best propagated along the activated paths. Driven by the intuition that much redundancy exists among the diverse relations, we develop a robust optimization framework based on several sparsity principles. We show significant performance gains of the proposed method over the state of the art in multimedia retrieval and recommendation using data crawled from social media sites. To the best of our knowledge, this is the first model supporting not only aggregation but also judicious selection of heterogeneous relations in social media networks.
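A minimal sketch of the propagation step, with strong assumptions: relation types are combined with fixed weights and scores are spread by random walk with restart, whereas the paper jointly learns a sparse selection over the heterogeneous relations during propagation.

```python
import numpy as np

def propagate(relations, query, weights=None, alpha=0.85, n_iter=100):
    """Random walk with restart over a hybrid network whose edges come from several
    relation types (friendship, content similarity, tags, ...). Fixed relation
    weights replace the paper's learned sparse edge selection."""
    if weights is None:
        weights = np.ones(len(relations)) / len(relations)
    A = sum(w * R for w, R in zip(weights, relations))
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)   # row-stochastic transitions
    q = query / query.sum()
    u = q.copy()
    for _ in range(n_iter):
        u = alpha * (P.T @ u) + (1 - alpha) * q
    return u                                                  # utility scores for all entities

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    friends = (rng.random((n, n)) < 0.05).astype(float)       # user-user relation
    similar = (rng.random((n, n)) < 0.05).astype(float)       # content-content relation
    query = np.zeros(n); query[0] = 1.0                       # the querying user's node
    print(propagate([friends, similar], query)[:10])
```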
Archive | 2012
Xinnan Yu; Rongrong Ji; Ming-Hen Tsai; Guangnan Ye; Shih-Fu Chang
Searching images based on descriptions of image attributes is an intuitive process that is easily understood by humans and has recently been made feasible by a few promising works in both the computer vision and multimedia communities [4,7,9,11]. In this report, we describe experiments with image retrieval methods that utilize weak attributes [11].