
Publication


Featured research published by Jinzhuo Wang.


ACM Multimedia | 2017

Video Imagination from a Single Image with Transformation Generation

Baoyang Chen; Wenmin Wang; Jinzhuo Wang

In this work, we focus on a challenging task: synthesizing multiple imaginary videos given a single image. The major problems are the high dimensionality of pixel space and the ambiguity of potential motions. To overcome these problems, we propose a new framework that produces imaginary videos by transformation generation. The generated transformations are applied to the original image in a novel volumetric merge network to reconstruct the frames of the imaginary video. By sampling different latent variables, our method can output different imaginary video samples. The framework is trained adversarially with unsupervised learning. For evaluation, we propose a new assessment metric, RIQA. In experiments, we test on three datasets ranging from synthetic data to natural scenes. Our framework achieves promising performance in image quality assessment, and visual inspection indicates that it can successfully generate diverse five-frame videos of acceptable perceptual quality.
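
As a rough illustration of transformation generation, the sketch below samples a latent code, generates one transformation per future frame, and warps the input image into a short imagined video. It assumes simple per-frame affine transforms and illustrative layer sizes; the paper's volumetric merge network and adversarial training are not reproduced.

```python
# Minimal sketch: sample a latent code, generate one affine transform per
# future frame, and warp the input image to produce an "imagined" video.
# Illustrative only; not the paper's exact volumetric merge network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationGenerator(nn.Module):
    def __init__(self, z_dim=16, n_frames=5):
        super().__init__()
        self.n_frames = n_frames
        # Map image features plus latent code to n_frames affine matrices (2x3 each).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32 + z_dim, n_frames * 6)

    def forward(self, image, z):
        feat = self.encoder(image)                       # (B, 32)
        theta = self.head(torch.cat([feat, z], dim=1))   # (B, n_frames*6)
        theta = theta.view(-1, self.n_frames, 2, 3)
        frames = []
        for t in range(self.n_frames):
            grid = F.affine_grid(theta[:, t], image.size(), align_corners=False)
            frames.append(F.grid_sample(image, grid, align_corners=False))
        return torch.stack(frames, dim=1)                # (B, n_frames, 3, H, W)

# Different latent samples z yield different imagined videos for the same image.
gen = TransformationGenerator()
img = torch.rand(1, 3, 64, 64)
video = gen(img, torch.randn(1, 16))
print(video.shape)  # torch.Size([1, 5, 3, 64, 64])
```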


International Conference on Image Processing | 2015

Image classification using RBM to encode local descriptors with group sparse learning

Jinzhuo Wang; Wenmin Wang; Ronggang Wang; Wen Gao

This paper proposes to employ a deep learning model to encode local descriptors for image classification. Previous works using deep architectures to obtain higher-level representations often operate at the pixel level, which limits generalization to large and complex images because of the computational burden and the difficulty of capturing intrinsic structure. Our method sidesteps this limitation by starting from local descriptors, leveraging more semantic inputs. We investigate using two layers of Restricted Boltzmann Machines (RBMs) to encode different local descriptors with a novel group sparse learning (GSL) scheme inspired by the recent success of sparse coding. In addition, unlike most existing purely unsupervised feature coding strategies, we use another RBM corresponding to semantic labels to perform supervised fine-tuning, which makes our model more suitable for the classification task. Experimental results on the Caltech-256 and Indoor-67 datasets demonstrate the effectiveness of our method.
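
For orientation, a minimal single-layer RBM trained with one-step contrastive divergence on descriptor vectors is sketched below; the group sparse learning penalty and the supervised label RBM from the paper are omitted, and all dimensions are illustrative.

```python
# Minimal RBM with one step of contrastive divergence (CD-1), sketched for
# encoding local descriptors (e.g. 128-d vectors scaled to [0, 1]).
# The paper's group sparse learning penalty and label RBM are not included.
import torch

class RBM:
    def __init__(self, n_visible=128, n_hidden=256, lr=0.01):
        self.W = torch.randn(n_visible, n_hidden) * 0.01
        self.vb = torch.zeros(n_visible)
        self.hb = torch.zeros(n_hidden)
        self.lr = lr

    def hidden_probs(self, v):
        return torch.sigmoid(v @ self.W + self.hb)

    def visible_probs(self, h):
        return torch.sigmoid(h @ self.W.t() + self.vb)

    def cd1_step(self, v0):
        # Positive phase
        h0 = self.hidden_probs(v0)
        # Negative phase: one Gibbs step
        h_sample = torch.bernoulli(h0)
        v1 = self.visible_probs(h_sample)
        h1 = self.hidden_probs(v1)
        # Parameter updates from the positive/negative statistics
        self.W += self.lr * (v0.t() @ h0 - v1.t() @ h1) / v0.size(0)
        self.vb += self.lr * (v0 - v1).mean(0)
        self.hb += self.lr * (h0 - h1).mean(0)

# Toy usage: encode a batch of descriptors after a few CD-1 updates.
descriptors = torch.rand(512, 128)
rbm = RBM()
for _ in range(10):
    rbm.cd1_step(descriptors)
codes = rbm.hidden_probs(descriptors)   # higher-level encoding of the descriptors
```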


IEEE International Conference on Multimedia Big Data | 2017

Collaborative Deep Networks for Pedestrian Detection

Hongmeng Song; Wenmin Wang; Jinzhuo Wang; Ronggang Wang

Conventional pedestrian detection methods construct models based on hand-crafted features or deep learning. They are powerful but limited by the finite capability of a single classifier. Ensemble models avoid this problem by assembling multiple classifiers using hand-designed criteria that combine information from all constituent models; however, such criteria lack theoretical support. We therefore propose a novel ensemble deep model, called collaborative deep networks, in which multiple deep networks are meaningfully combined through a fully-connected network. To maximize the abilities of these deep models, we incorporate a resampling process to prepare diverse datasets and pre-train the networks on the resampled data. Finally, a collaborative learning method is presented to train the entire model. Experimental results show that our approach improves the performance of single classifiers and outperforms state-of-the-art methods on both the Caltech and ETH datasets.
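
A minimal sketch of the ensemble idea follows: a few small classifiers are pre-trained on bootstrap-resampled subsets and merged by a fully-connected combiner trained jointly on top. Network sizes and data below are toy placeholders, not the paper's architecture or training schedule.

```python
# Minimal sketch of an ensemble merged by a fully connected combiner:
# K small classifiers are pre-trained on bootstrap resamples, then a combiner
# is trained on their concatenated outputs. All sizes and data are toy.
import torch
import torch.nn as nn

def make_classifier(in_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 2))

# Toy data: 64-d crops labeled pedestrian / background.
x, y = torch.randn(1000, 64), torch.randint(0, 2, (1000,))

members = [make_classifier() for _ in range(3)]
for net in members:
    idx = torch.randint(0, len(x), (len(x),))          # bootstrap resample
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(50):
        loss = nn.functional.cross_entropy(net(x[idx]), y[idx])
        opt.zero_grad(); loss.backward(); opt.step()

# Fully connected combiner over the members' logits, fine-tuned jointly.
combiner = nn.Sequential(nn.Linear(3 * 2, 16), nn.ReLU(), nn.Linear(16, 2))
params = list(combiner.parameters()) + [p for net in members for p in net.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
for _ in range(50):
    merged = torch.cat([net(x) for net in members], dim=1)
    loss = nn.functional.cross_entropy(combiner(merged), y)
    opt.zero_grad(); loss.backward(); opt.step()
```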


ACM Multimedia | 2017

Learning Object-Centric Transformation for Video Prediction

Xiongtao Chen; Wenmin Wang; Jinzhuo Wang; Weimian Li

Future frame prediction for video sequences is a challenging and worthwhile problem in computer vision. Existing methods often learn motion information for the entire image to predict the next frames. However, different objects in the same scene often move and deform in different ways. The human visual system pays attention to the key objects that carry crucial motion signals, rather than compressing an entire image into a static representation. Motivated by this property of human perception, we develop a novel object-centric video prediction model that dynamically learns local motion transformations for key object regions with visual attention. By iteratively applying these transformations to the original input frames, the next frame can be produced. Specifically, we design an attention module with replaceable strategies that attends to objects in video frames automatically. Our method does not require any annotated data during training. To produce sharp predictions, adversarial training is adopted. We evaluate our model on the Moving MNIST and UCF101 datasets and report competitive results compared to prior methods. The generated frames demonstrate that our model can characterize the motion of different objects and produce plausible future frames.
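
The sketch below shows one plausible form of the attention component only: a soft spatial attention over frame features that yields an object-centric summary. The transformation and frame-composition stages of the full model are not reproduced, and all names and sizes are illustrative.

```python
# Minimal sketch of a soft spatial attention module of the kind that could
# attend to object regions in a frame. The transformation generation and
# frame composition of the full model are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.features = nn.Conv2d(3, channels, 3, padding=1)
        self.score = nn.Conv2d(channels, 1, 1)          # one attention score per location

    def forward(self, frame):
        feat = F.relu(self.features(frame))             # (B, C, H, W)
        logits = self.score(feat)                       # (B, 1, H, W)
        B, _, H, W = logits.shape
        attn = torch.softmax(logits.view(B, -1), dim=1).view(B, 1, H, W)
        attended = (feat * attn).sum(dim=(2, 3))        # (B, C) object-centric summary
        return attn, attended

attn_map, summary = SpatialAttention()(torch.rand(2, 3, 64, 64))
print(attn_map.shape, summary.shape)  # (2, 1, 64, 64) (2, 32)
```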


International Conference on Image Processing | 2016

Tube ConvNets: Better exploiting motion for action recognition

Zhihao Li; Wenmin Wang; Nannan Li; Jinzhuo Wang

Motion information is a key factor for action recognition and has been pursued intensively for decades. How to effectively learn motion features in Convolutional Networks (ConvNets) remains an open issue. Prevalent ConvNets often take several full video frames as input at a time, which is a heavy burden for network training. In this paper, we introduce a novel framework called Tube ConvNets, which substitutes action tubes for full frames to reduce this burden. Tube ConvNets focus on the regions of interest (ROI) where key motions occur, and thus eliminate the distraction of irrelevant objects. Each action tube is a fraction of the spatiotemporal volume, generated using object detection and clustering techniques. We demonstrate the effectiveness of Tube ConvNets for action classification on the UCF-101 dataset, and illustrate their potential to support fine-grained localization on the UCF-Sports dataset. Source code is available at https://github.com/wangjinzhuo/tubecnn.
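
As a rough sketch of tube construction, the code below crops per-frame ROIs into a spatiotemporal tube and feeds it to a toy 3D ConvNet. The detector and clustering step that proposes the boxes is assumed to exist and is replaced here by placeholder coordinates.

```python
# Minimal sketch of building an action tube from per-frame ROIs and
# classifying it with a small 3D ConvNet. Boxes and sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_tube(frames, boxes, size=(56, 56)):
    """frames: (T, 3, H, W); boxes: list of T (x1, y1, x2, y2) ROIs."""
    crops = []
    for frame, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)           # (1, 3, h, w)
        crops.append(F.interpolate(crop, size=size, mode='bilinear',
                                   align_corners=False))
    return torch.cat(crops, dim=0).permute(1, 0, 2, 3)       # (3, T, 56, 56)

tube_net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 101))                                       # e.g. UCF-101 classes

frames = torch.rand(8, 3, 240, 320)
boxes = [(40, 30, 200, 190)] * 8                              # placeholder ROIs
tube = build_tube(frames, boxes).unsqueeze(0)                 # (1, 3, 8, 56, 56)
logits = tube_net(tube)
```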


IEEE Transactions on Multimedia | 2016

CSPS: An Adaptive Pooling Method for Image Classification

Jinzhuo Wang; Wenmin Wang; Ronggang Wang; Wen Gao

This paper proposes an adaptive approach to learn class-specific pooling shapes (CSPS) for image classification. Prevalent methods for spatial pooling are often conducted on predefined grids of images, which is ad hoc and thus lacks generalization power across different categories. In contrast, CSPS is designed in a data-driven fashion by generating many candidate shapes and selecting the optimal subset for each class. Specifically, we establish an overcomplete spatial shape set that preserves as many geometric patterns as possible. The class-specific subset is then selected by training a linear classifier with structured sparsity constraints and color distribution cues. To address the high computational cost and the risk of overfitting caused by the overcomplete scheme, the image representations over CSPS are first compressed according to dictionary sensitivity and shape importance. These representations are finally fed to SVMs for the classification task. We demonstrate that CSPS learns compact yet discriminative geometric information for different classes that carries more semantic meaning than other methods. Experimental results on four datasets demonstrate the benefits of the proposed method compared with other pooling schemes and illustrate its effectiveness on both object and scene images.
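
A minimal sketch of the shape-selection step is given below, assuming an overcomplete set of rectangles on a coarse grid and a plain L1-regularized classifier standing in for the paper's structured sparsity with color cues; all data and dimensions are synthetic.

```python
# Minimal sketch of class-specific pooling-shape selection: pool codes over an
# overcomplete set of rectangular shapes, then let an L1-regularized linear
# classifier pick a sparse subset per class. Everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def candidate_shapes(grid=4):
    """All axis-aligned rectangles on a grid x grid layout."""
    return [(r1, c1, r2, c2)
            for r1 in range(grid) for c1 in range(grid)
            for r2 in range(r1, grid) for c2 in range(c1, grid)]

def pool_over_shapes(code_map, shapes):
    """code_map: (grid, grid, K) spatially binned codes -> max pool per shape."""
    return np.concatenate([code_map[r1:r2 + 1, c1:c2 + 1].max(axis=(0, 1))
                           for (r1, c1, r2, c2) in shapes])

shapes = candidate_shapes(4)                       # 100 candidate shapes
X = np.stack([pool_over_shapes(np.random.rand(4, 4, 8), shapes) for _ in range(200)])
y = np.random.randint(0, 3, size=200)              # toy class labels

clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
# Shapes whose whole coefficient block is zero can be dropped for that class.
kept = np.abs(clf.coef_).reshape(len(clf.classes_), len(shapes), 8).sum(axis=2) > 0
print(kept.sum(axis=1), "shapes kept per class out of", len(shapes))
```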


International Conference on Multimedia and Expo | 2017

A joint model for action localization and classification in untrimmed video with visual attention

Weimian Li; Wenmin Wang; Xiongtao Chen; Jinzhuo Wang; Ge Li

In this paper, we introduce a joint model that learns to directly localize the temporal bounds of actions in untrimmed videos and to precisely classify the actions that occur. Most existing approaches scan the whole video to generate action instances, which is inefficient. Instead, inspired by human perception, our model is formulated as a recurrent neural network that observes different locations within a video over time. It produces temporal localizations after observing only a fixed number of fragments, so the amount of computation it performs is independent of the input video size. The decision policy for determining where to look next is learned with REINFORCE, which is effective in non-differentiable settings. In addition, unlike related approaches, our model runs localization and classification serially and uses a strategy for extracting appropriate features for classification. We evaluate our model on the ActivityNet dataset, where it greatly outperforms the baseline. Moreover, compared with a recent approach, our serial design brings about a 9% increase in detection performance.
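
The sketch below illustrates only the REINFORCE-trained observation policy: a GRU observes a fixed number of fragments of a toy feature sequence and a Gaussian location policy is updated from a non-differentiable reward. The feature extractor, classification head, and reward definition are stubbed placeholders.

```python
# Minimal sketch of a recurrent observation policy trained with REINFORCE.
# Fragment features, the reward, and all sizes are toy placeholders.
import torch
import torch.nn as nn

rnn = nn.GRUCell(input_size=64, hidden_size=128)
loc_head = nn.Linear(128, 1)          # mean of the next observation location in [0, 1]
video_feats = torch.randn(32, 64)     # features of 32 uniformly indexed fragments

h = torch.zeros(1, 128)
log_probs = []
loc = torch.tensor([[0.5]])
for _ in range(6):                    # fixed number of observations
    idx = int(loc.clamp(0, 1).item() * 31)
    h = rnn(video_feats[idx:idx + 1], h)
    dist = torch.distributions.Normal(torch.sigmoid(loc_head(h)), 0.1)
    loc = dist.sample()
    log_probs.append(dist.log_prob(loc))

reward = torch.tensor(1.0)            # e.g. +1 if localization/classification is judged correct
loss = -(torch.cat(log_probs).sum() * reward)   # REINFORCE policy-gradient loss
loss.backward()
```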


International Conference on Multimedia and Expo | 2015

Learning class-specific pooling shapes for image classification

Jinzhuo Wang; Wenmin Wang; Ronggang Wang; Wen Gao

Spatial pyramid (SP) representation is an extension of the bag-of-features model that embeds the spatial layout of local features by pooling feature codes over pre-defined spatial shapes. However, the uniform spatial pooling shapes used in the standard SP are chosen in an ad hoc manner without theoretical motivation, and thus lack the generalization power to adapt to the different distributions of geometric properties across image classes. In this paper, we propose a data-driven approach to adaptively learn class-specific pooling shapes (CSPS). Specifically, we first establish an over-complete set of spatial shapes providing candidates with more flexible geometric patterns. The optimal subset for each class is then selected by training a linear classifier with a structured sparsity constraint and color distribution cues. To further enhance the robustness of our model, the representations over CSPS are compressed according to shape importance and finally fed to an SVM with a multi-shape matching kernel for the classification task. Experimental results on three challenging datasets (Caltech-256, Scene-15 and Indoor-67) demonstrate the effectiveness of the proposed method on both object and scene images.
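
As a sketch of the final SVM stage, the code below builds a multi-shape matching kernel by summing a histogram-intersection similarity over shapes and feeds it to an SVM with a precomputed kernel. The choice of similarity, the shapes, and the data are illustrative assumptions rather than the paper's exact kernel.

```python
# Minimal sketch of a multi-shape matching kernel for the SVM stage: each
# pooling shape contributes a histogram-intersection similarity and the
# per-shape similarities are summed into one precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVC

def multi_shape_kernel(A, B):
    """A: (n, S, K), B: (m, S, K) per-shape representations -> (n, m) kernel."""
    K = np.zeros((A.shape[0], B.shape[0]))
    for s in range(A.shape[1]):
        # Histogram-intersection similarity for shape s, accumulated over shapes.
        K += np.minimum(A[:, None, s, :], B[None, :, s, :]).sum(axis=2)
    return K

train = np.random.rand(60, 10, 16)    # 60 images, 10 selected shapes, 16-d codes
labels = np.random.randint(0, 3, 60)
test = np.random.rand(5, 10, 16)

svm = SVC(kernel='precomputed').fit(multi_shape_kernel(train, train), labels)
pred = svm.predict(multi_shape_kernel(test, train))
```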


International Conference on Image Processing | 2015

A compact shot representation for video semantic indexing

Jinzhuo Wang; Wenmin Wang; Ronggang Wang; Wen Gao

This paper presents a compact shot representation for video semantic indexing (SIN). The proposed representation consists of visual cues from only two frames, a key frame (KF) and a difference frame (DF), both constructed with a spatial pyramid. The KF describes static information, while the generated DF captures non-static information. Each region of the DF is taken from the same location in a selected frame, namely the frame whose region differs most saliently from the key frame. We also introduce a variation of the DF to further enhance our model. Experimental results on the TRECVID SIN task demonstrate that our method obtains better accuracy than the state of the art, while requiring less storage space and computation time.
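
A minimal sketch of the difference-frame construction follows, using a flat 4x4 grid in place of the paper's spatial pyramid: for each cell, the frame whose region differs most from the key frame contributes that region to the DF. Frame data and the key-frame choice are placeholders.

```python
# Minimal sketch of building a difference frame (DF) from a shot and a key
# frame (KF): per grid cell, take the region from the frame that differs most
# from the KF. A flat grid stands in for the paper's spatial pyramid.
import numpy as np

def difference_frame(frames, key, grid=4):
    """frames: (T, H, W, 3); key: (H, W, 3); returns a DF of the same shape as key."""
    H, W = key.shape[:2]
    hs, ws = H // grid, W // grid
    df = np.zeros_like(key)
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * hs, (i + 1) * hs), slice(j * ws, (j + 1) * ws))
            diffs = np.abs(frames[(slice(None),) + sl] - key[sl]).sum(axis=(1, 2, 3))
            df[sl] = frames[int(diffs.argmax())][sl]    # most salient region wins
    return df

shot = np.random.rand(30, 64, 64, 3)     # toy shot of 30 frames
key_frame = shot[len(shot) // 2]         # e.g. the middle frame as key frame
df = difference_frame(shot, key_frame)
```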


Neural Information Processing Systems | 2016

Deep Alternative Neural Network: Exploring Contexts as Early as Possible for Action Recognition

Jinzhuo Wang; Wenmin Wang; Xiongtao Chen; Ronggang Wang; Wen Gao

Collaboration


Jinzhuo Wang's top co-authors include Nannan Li (Chinese Academy of Sciences).