Network


Latest external collaboration at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Zhongyang Huang is active.

Publication


Featured research published by Zhongyang Huang.


Computer Vision and Pattern Recognition | 2011

Contextualizing object detection and classification

Zheng Song; Qiang Chen; Zhongyang Huang; Yang Hua; Shuicheng Yan

We investigate how to iteratively and mutually boost object classification and detection performance by taking the outputs from one task as the context of the other one. While context models have been quite popular, previous works mainly concentrate on co-occurrence relationships among classes, and few of them focus on contextualization from a top-down perspective, i.e., high-level task context. In this paper, we adopt a new method for adaptive context modeling and iterative boosting. First, the contextualized support vector machine (Context-SVM) is proposed, where the context takes the role of dynamically adjusting the classification score based on the sample ambiguity, and thus a context-adaptive classifier is achieved. Then, an iterative training procedure is presented. In each step, Context-SVM, associated with the output context from one task (object classification or detection), is instantiated to boost the performance of the other task, whose augmented outputs are then further used to improve the former task by Context-SVM. The proposed solution is evaluated on the object classification and detection tasks of the PASCAL Visual Object Classes Challenge (VOC) 2007, 2010 and SUN09 data sets, and achieves state-of-the-art performance.
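
The scoring idea can be pictured with the toy sketch below; it is an illustrative reading of the abstract, not the authors' exact formulation. A base SVM score is corrected by a linear context term whose influence grows when the sample is ambiguous, i.e. when the base score is close to the decision boundary; the Gaussian ambiguity weighting and the linear context model are assumptions made here for illustration.

```python
import numpy as np

def context_adjusted_score(base_score, context, v, tau=1.0):
    """Illustrative context-adjusted score (not the paper's exact Context-SVM).

    base_score : float, raw SVM decision value for the sample
    context    : np.ndarray, context features taken from the other task
                 (e.g. detection outputs when doing classification)
    v          : np.ndarray, assumed weights for the context features
    tau        : bandwidth controlling what counts as "ambiguous"
    """
    # The correction matters most near the decision boundary (base_score ~ 0),
    # mirroring the idea that context should resolve ambiguous samples.
    ambiguity = np.exp(-(base_score ** 2) / (2.0 * tau ** 2))
    return base_score + ambiguity * float(np.dot(v, context))

# An ambiguous sample (score 0.1) is pushed by supportive context,
# while a confident sample (score 3.0) is barely affected.
ctx, v = np.array([0.8, 0.2]), np.array([1.0, -0.5])
print(context_adjusted_score(0.1, ctx, v), context_adjusted_score(3.0, ctx, v))
```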


International Conference on Computer Vision | 2011

Multi-task low-rank affinity pursuit for image segmentation

Bin Cheng; Guangcan Liu; Jingdong Wang; Zhongyang Huang; Shuicheng Yan

This paper investigates how to boost region-based image segmentation by pursuing a new solution to fuse multiple types of image features. A collaborative image segmentation framework, called multi-task low-rank affinity pursuit, is presented for this purpose. Given an image described with multiple types of features, we aim at inferring a unified affinity matrix that implicitly encodes the segmentation of the image. This is achieved by seeking the sparsity-consistent low-rank affinities from the joint decompositions of multiple feature matrices into pairs of sparse and low-rank matrices, the latter of which is expressed as the product of the image feature matrix and its corresponding image affinity matrix. The inference process is formulated as a constrained nuclear-norm and ℓ2,1-norm minimization problem, which is convex and can be solved efficiently with the Augmented Lagrange Multiplier method. Compared to previous methods, which are usually based on a single type of feature, the proposed method seamlessly integrates multiple types of features to jointly produce the affinity matrix within a single inference step, and produces more accurate and reliable segmentation results. Experiments on the MSRC dataset and the Berkeley segmentation dataset validate the superiority of using multiple features over a single feature, as well as the superiority of our method over conventional methods for feature fusion. Moreover, our method is shown to be very competitive when compared to other state-of-the-art methods.
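
Read literally, the abstract suggests decomposing each feature matrix into a low-rank self-expressive part and a sparse residual, with the residuals coupled across feature types by an ℓ2,1 norm. A formulation consistent with that description (the exact constraints, weighting and affinity aggregation below are assumptions, not quoted from the paper) is:

```latex
% X_k: feature matrix of type k, Z_k: affinity matrix, E_k: sparse residual, K feature types
\min_{\{Z_k\},\,\{E_k\}} \;\; \sum_{k=1}^{K} \lVert Z_k \rVert_{*}
  \;+\; \lambda \,\bigl\lVert [\,E_1;\, E_2;\, \ldots;\, E_K\,] \bigr\rVert_{2,1}
\qquad \text{s.t.} \quad X_k = X_k Z_k + E_k,\;\; k = 1,\ldots,K.
```

The nuclear norms keep each affinity matrix low-rank, the joint ℓ2,1 norm makes the residual supports consistent across feature types, and the convex problem can be handled with the Augmented Lagrange Multiplier method; a single fused affinity, e.g. one proportional to the sum of |Z_k| + |Z_k|ᵀ over k, would then encode the segmentation.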


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2015

Contextualizing Object Detection and Classification

Qiang Chen; Zheng Song; Jian Dong; Zhongyang Huang; Yang Hua; Shuicheng Yan

We investigate how to iteratively and mutually boost object classification and detection performance by taking the outputs from one task as the context of the other one. While context models have been quite popular, previous works mainly concentrate on co-occurrence relationships among classes, and few of them focus on contextualization from a top-down perspective, i.e., high-level task context. In this paper, we adopt a new method for adaptive context modeling and iterative boosting. First, the contextualized support vector machine (Context-SVM) is proposed, where the context takes the role of dynamically adjusting the classification score based on the sample ambiguity, and thus a context-adaptive classifier is achieved. Then, an iterative training procedure is presented. In each step, Context-SVM, associated with the output context from one task (object classification or detection), is instantiated to boost the performance of the other task, whose augmented outputs are then further used to improve the former task by Context-SVM. The proposed solution is evaluated on the object classification and detection tasks of the PASCAL Visual Object Classes Challenge (VOC) 2007, 2010 and SUN09 data sets, and achieves state-of-the-art performance.


Computer Vision and Pattern Recognition | 2013

Subcategory-Aware Object Classification

Jian Dong; Wei Xia; Qiang Chen; Jiashi Feng; Zhongyang Huang; Shuicheng Yan

In this paper, we introduce a subcategory-aware object classification framework to boost category-level object classification performance. Motivated by the observation of considerable intra-class diversity and inter-class ambiguity in many current object classification datasets, we explicitly split data into subcategories by ambiguity-guided subcategory mining. We then train an individual model for each subcategory rather than attempt to represent an object category with a monolithic model. More specifically, we build the instance affinity graph by combining both intra-class similarity and inter-class ambiguity. Visual subcategories, which correspond to the dense subgraphs, are detected by the graph shift algorithm and seamlessly integrated into the state-of-the-art detection-assisted classification framework. Finally, the responses from subcategory models are aggregated by subcategory-aware kernel regression. Extensive experiments on the PASCAL VOC 2007 and PASCAL VOC 2010 databases show the state-of-the-art performance of our framework.
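
The final aggregation step can be pictured with the toy sketch below: each subcategory model scores an image, and a learned regressor fuses those scores into one category-level score. A plain kernel ridge regressor stands in for the paper's subcategory-aware kernel regression, and the data here are synthetic; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Toy setup: 3 subcategory models, each scoring 100 training images.
subcat_scores_train = rng.normal(size=(100, 3))   # responses from the per-subcategory models
labels_train = (subcat_scores_train.max(axis=1) > 0.5).astype(float)  # synthetic category labels

# Learn how to fuse subcategory responses into a single category score.
fuser = KernelRidge(kernel="rbf", alpha=1.0)
fuser.fit(subcat_scores_train, labels_train)

# At test time, run all subcategory models, then fuse their responses.
subcat_scores_test = rng.normal(size=(10, 3))
category_scores = fuser.predict(subcat_scores_test)
print(category_scores)
```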


Computer Vision and Pattern Recognition | 2012

Hierarchical matching with side information for image classification

Qiang Chen; Zheng Song; Yang Hua; Zhongyang Huang; Shuicheng Yan

In this work, we introduce a hierarchical matching framework with so-called side information for image classification based on the bag-of-words representation. Each image is expressed as a bag of orderless pairs, each of which includes a local feature vector encoded over a visual dictionary and its corresponding side information from priors or contexts. The side information is used for hierarchical clustering of the encoded local features. Then a hierarchical matching kernel is derived as the weighted sum of the similarities over the encoded features pooled within clusters at different levels. Finally, the new kernel is integrated with popular machine learning algorithms for classification purposes. This framework is quite general and flexible; other practical and powerful algorithms can easily be designed by using this framework as a template and utilizing particular side information for hierarchical clustering of the encoded local features. To tackle the latent spatial mismatch issues in SPM, we design two exemplar algorithms based on two types of side information: an object confidence map and a visual saliency map, from object detection priors and within-image contexts respectively. Extensive experiments on the Caltech-UCSD Birds 200, Oxford Flowers 17 and 102, PASCAL VOC 2007, and PASCAL VOC 2010 databases show state-of-the-art performance from these two exemplar algorithms.
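
The kernel described above can be written compactly; the notation below (cluster sets C_l per level and per-level weights w_l) is assumed for illustration rather than taken from the paper:

```latex
% p_c(X): pooled encoding of the local features of image X that fall into cluster c,
% where the clusters C_l at level l come from hierarchically clustering the side information
K(X, Y) \;=\; \sum_{l=1}^{L} w_l \sum_{c \,\in\, \mathcal{C}_l}
      \bigl\langle\, p_c(X),\; p_c(Y) \,\bigr\rangle .
```

The kernel therefore rewards images whose encoded features agree inside the same side-information clusters, with the level weights trading off coarse against fine clusterings.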


International Conference on Computer Vision | 2013

A Deformable Mixture Parsing Model with Parselets

Jian Dong; Qiang Chen; Wei Xia; Zhongyang Huang; Shuicheng Yan

In this work, we address the problem of human parsing, namely partitioning the human body into semantic regions, by using the novel Parselet representation. Previous works often consider solving the problem of human pose estimation as the prerequisite of human parsing. We argue that these approaches cannot obtain optimal pixel-level parsing due to the inconsistent targets between these tasks. In this paper, we propose to use Parselets as the building blocks of our parsing model. Parselets are a group of parsable segments which can generally be obtained by low-level over-segmentation algorithms and bear strong semantic meaning. We then build a Deformable Mixture Parsing Model (DMPM) for human parsing to simultaneously handle the deformation and multi-modalities of Parselets. The proposed model has two unique characteristics: (1) the possible numerous modalities of Parselet ensembles are exhibited as the "And-Or" structure of sub-trees; (2) to further solve the practical problem of Parselet occlusion or absence, we directly model the visibility property at some leaf nodes. The DMPM thus directly solves the problem of human parsing by searching for the best graph configuration from a pool of Parselet hypotheses without intermediate tasks. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.


Computer Vision and Pattern Recognition | 2013

Efficient Maximum Appearance Search for Large-Scale Object Detection

Qiang Chen; Zheng Song; Rogério Schmidt Feris; Ankur Datta; Liangliang Cao; Zhongyang Huang; Shuicheng Yan

In recent years, the efficiency of large-scale object detection has arisen as an important topic due to the exponential growth in the size of benchmark object detection datasets. Most current object detection methods focus on improving the accuracy of large-scale object detection, with efficiency being an afterthought. In this paper, we present the Efficient Maximum Appearance Search (EMAS) model, which is an order of magnitude faster than existing state-of-the-art large-scale object detection approaches while maintaining comparable accuracy. Our EMAS model represents an image as an ensemble of densely sampled feature points with the proposed Pointwise Fisher Vector encoding method, so that the learnt discriminative scoring function can be applied locally. Consequently, the object detection problem is transformed into searching an image sub-area for maximum local appearance probability, thereby making EMAS an order of magnitude faster than traditional detection methods. In addition, the proposed model is also suitable for incorporating global context at a negligible extra computational cost. EMAS can also incorporate fusion of multiple features, which greatly improves its performance in detecting multiple object categories. Our experiments show that the proposed algorithm can perform detection of 1000 object classes in less than one minute per image on the ImageNet ILSVRC2012 dataset, and of 107 object classes in less than 5 seconds per image on the SUN09 dataset, using a single CPU.
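
Once every densely sampled point carries its own additive contribution to the classifier score, detection reduces to finding the rectangle with the largest summed score. The sketch below shows that search step only, over a fixed set of window sizes with an integral image; the window set and the exhaustive scan are simplifying assumptions, not the paper's exact search strategy.

```python
import numpy as np

def best_window(score_map, window_sizes):
    """Return (score, (top, left, height, width)) of the highest-scoring window.

    score_map    : 2D array of per-point classifier contributions
    window_sizes : iterable of (height, width) candidates to scan
    """
    # Integral image: every rectangular sum becomes an O(1) lookup.
    ii = np.pad(score_map, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    best = (-np.inf, None)
    H, W = score_map.shape
    for h, w in window_sizes:
        for top in range(H - h + 1):
            for left in range(W - w + 1):
                s = (ii[top + h, left + w] - ii[top, left + w]
                     - ii[top + h, left] + ii[top, left])
                if s > best[0]:
                    best = (s, (top, left, h, w))
    return best

# Toy score map with a bright 3x3 blob; the search locks onto it.
m = np.full((10, 10), -0.1)
m[4:7, 4:7] = 1.0
print(best_window(m, [(3, 3), (5, 5)]))
```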


IEEE Transactions on Multimedia | 2013

VideoPuzzle: Descriptive One-Shot Video Composition

Qiang Chen; Meng Wang; Zhongyang Huang; Yang Hua; Zheng Song; Shuicheng Yan

A large number of short, single-shot videos are created by personal camcorders every day, such as the small video clips in family albums, and thus a solution for presenting and managing these video clips is highly desired. From the perspective of professionalism and artistry, long-take/shot video, also termed one-shot video, is able to present events, persons or scenic spots in an informative manner. This paper presents a novel video composition system, “VideoPuzzle”, which generates aesthetically enhanced long-shot videos from short video clips. Our task here is to automatically compose several related single shots into a virtual long-take video with spatial and temporal consistency. We propose a novel framework to compose descriptive long-take video with content-consistent shots retrieved from a video pool. For each video, a frame-by-frame search is performed over the entire pool to find start-end content correspondences through a coarse-to-fine partial matching process. The content correspondence here is general and can refer to matched regions or objects, such as the human body and face. The content consistency of these correspondences enables us to design several shot transition schemes to seamlessly stitch one shot to another in a spatially and temporally consistent manner. The entire long-take video thus comprises several single shots with consistent contents and fluent transitions. Meanwhile, with the generated matching graph of videos, the proposed system can also provide an efficient video browsing mode. Experiments are conducted on multiple video albums and the results demonstrate the effectiveness and usefulness of the proposed scheme.
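
The clip-chaining decision can be illustrated with a drastically simplified stand-in for the paper's coarse-to-fine partial matching: here the last frame of the current clip is compared with the first frame of each candidate by a global gray-level histogram, whereas the paper matches regions and objects; the descriptor and similarity below are assumptions for illustration.

```python
import numpy as np

def frame_histogram(frame, bins=32):
    """Coarse appearance descriptor: a normalized gray-level histogram."""
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return hist / (hist.sum() + 1e-8)

def best_next_clip(current_clip, candidate_clips):
    """Pick the candidate whose first frame best matches the current clip's last frame."""
    target = frame_histogram(current_clip[-1])
    sims = [float(np.dot(target, frame_histogram(c[0]))) for c in candidate_clips]
    return int(np.argmax(sims)), sims

# Toy clips: each clip is a list of grayscale frames with values in [0, 1].
rng = np.random.default_rng(1)
clip = [rng.random((64, 64)) for _ in range(5)]
candidates = [[rng.random((64, 64)) for _ in range(5)] for _ in range(3)]
print(best_next_clip(clip, candidates))
```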


acm multimedia | 2012

Touch saliency

Mengdi Xu; Bingbing Ni; Jian Dong; Zhongyang Huang; Meng Wang; Shuicheng Yan

In this work, we propose a new concept of touch saliency, and attempt to answer the question of whether the underlying image saliency map may be implicitly derived from accumulated touch behaviors (more specifically, zoom-in and panning manipulations) when many users browse the image on smart mobile devices with small multi-touch displays. Touch saliency maps are collected for the images of the recently released NUSEF dataset, and the preliminary comparison study demonstrates that: 1) the touch saliency map is highly correlated with the human eye fixation map for the same stimuli, yet compared to the latter, touch data collection is much more flexible and requires no cooperation from users; and 2) touch saliency is also well predictable by popular saliency detection algorithms. This study opens a new research direction of multimedia analysis by harnessing human touch information on increasingly popular multi-touch smart mobile devices.
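
A rough picture of how such a map could be accumulated (the viewport model and the normalization are assumptions made here, not the paper's collection protocol): every zoom-in or panning manipulation exposes a viewport over the image, and the exposed regions of many users are summed into a heat map.

```python
import numpy as np

def touch_saliency_map(image_shape, viewports):
    """Accumulate user viewports into a touch saliency heat map.

    image_shape : (H, W) of the browsed image
    viewports   : list of (top, left, height, width) regions that users
                  zoomed into or panned to while browsing
    """
    heat = np.zeros(image_shape, dtype=float)
    for top, left, h, w in viewports:
        heat[top:top + h, left:left + w] += 1.0   # each exposure votes for its region
    if heat.max() > 0:
        heat /= heat.max()                        # normalize to [0, 1]
    return heat

# Two users zooming into overlapping regions of a 100x100 image.
saliency = touch_saliency_map((100, 100), [(20, 30, 40, 40), (30, 40, 30, 30)])
print(saliency.max(), saliency.mean())
```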


IEEE Transactions on Multimedia | 2014

Touch Saliency: Characteristics and Prediction

Bingbing Ni; Mengdi Xu; Tam V. Nguyen; Meng Wang; Congyan Lang; Zhongyang Huang; Shuicheng Yan

In this work, we propose an alternative ground truth to the eye fixation map in visual attention studies, called touch saliency. As it can be directly collected from the recorded data of users' daily browsing behavior on widely used smartphones with touch screens, touch saliency data is easy to obtain. Due to the limited screen size, smartphone users usually pan and zoom in on images, fixing the region of interest on the screen while browsing. Our studies are two-fold. First, we collect and study the characteristics of these touch-screen fixation maps (named touch saliency) through comprehensive comparisons with their counterpart, the eye-fixation maps (namely, visual saliency). The comparisons show that touch saliency is highly correlated with eye fixations for the same stimuli, which indicates its utility for data collection in visual attention studies. Based on the consistency between touch saliency and visual saliency, our second task is to propose a unified saliency prediction model for both visual and touch saliency detection. This model utilizes middle-level object category features extracted from pre-segmented image superpixels as input to the recently proposed multitask sparsity pursuit (MTSP) framework for saliency prediction. Extensive evaluations show that the proposed middle-level category features can considerably improve saliency prediction performance when taking both touch saliency and visual saliency as ground truth.

Collaboration


Dive into Zhongyang Huang's collaboration.

Top Co-Authors

Shuicheng Yan
National University of Singapore

Qiang Chen
National University of Singapore

Jian Dong
National University of Singapore

Zheng Song
National University of Singapore

Meng Wang
Hefei University of Technology

Bingbing Ni
Shanghai Jiao Tong University

Mengdi Xu
National University of Singapore