Jungong Han
Lancaster University
Publication
Featured research published by Jungong Han.
IEEE Transactions on Systems, Man, and Cybernetics | 2013
Jungong Han; Ling Shao; Dong Xu; Jamie Shotton
With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use. The complementary nature of the depth and visual information provided by the Kinect sensor opens up new opportunities to solve fundamental problems in computer vision. This paper presents a comprehensive review of recent Kinect-based computer vision algorithms and applications. The reviewed approaches are classified according to the type of vision problems that can be addressed or enhanced by means of the Kinect sensor. The covered topics include preprocessing, object tracking and recognition, human activity analysis, hand gesture analysis, and indoor 3-D mapping. For each category of methods, we outline their main algorithmic contributions and summarize their advantages/differences compared to their RGB counterparts. Finally, we give an overview of the challenges in this field and future research trends. This paper is expected to serve as a tutorial and source of references for Kinect-based computer vision researchers.
IEEE Transactions on Neural Networks | 2016
Dingwen Zhang; Junwei Han; Jungong Han; Ling Shao
As an interesting and emerging topic, cosaliency detection aims at simultaneously extracting common salient objects in multiple related images. It differs from the conventional saliency detection paradigm, in which saliency is detected for each image independently, without taking advantage of the homogeneity in the data pool of multiple related images. In this paper, we propose a novel cosaliency detection approach using deep learning models. Two new concepts, called intrasaliency prior transfer and deep intersaliency mining, are introduced and explored in the proposed work. For the intrasaliency prior transfer, we build a stacked denoising autoencoder (SDAE) to learn the saliency prior knowledge from auxiliary annotated data sets and then transfer the learned knowledge to estimate the intrasaliency for each image in cosaliency data sets. For the deep intersaliency mining, we formulate it by using the deep reconstruction residual obtained in the highest hidden layer of a self-trained SDAE. The obtained deep intersaliency can extract more intrinsic and general hidden patterns to discover the homogeneity of cosalient objects in terms of some higher level concepts. Finally, the cosaliency maps are generated by weighted integration of the proposed intrasaliency prior, deep intersaliency, and traditional shallow intersaliency. Comprehensive experiments over diverse publicly available benchmark data sets demonstrate consistent performance gains of the proposed method over the state-of-the-art cosaliency detection methods.
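A minimal sketch of the final fusion step the abstract describes: the cosaliency map as a weighted combination of an intrasaliency prior, a deep intersaliency map, and a shallow intersaliency map. The weights and normalization below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def fuse_cosaliency(intra, deep_inter, shallow_inter, weights=(0.4, 0.4, 0.2)):
    """Combine three per-pixel saliency maps of the same shape, values in [0, 1]."""
    maps = [np.asarray(m, dtype=float) for m in (intra, deep_inter, shallow_inter)]
    fused = sum(w * m for w, m in zip(weights, maps))
    # Rescale to [0, 1] so the result can be thresholded or visualized directly.
    fused -= fused.min()
    if fused.max() > 0:
        fused /= fused.max()
    return fused

# Example: three random 64x64 maps standing in for the real saliency estimates.
h, w = 64, 64
cosal = fuse_cosaliency(np.random.rand(h, w), np.random.rand(h, w), np.random.rand(h, w))
print(cosal.shape, cosal.min(), cosal.max())
```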
IEEE Transactions on Systems, Man, and Cybernetics | 2017
Zijia Lin; Guiguang Ding; Jungong Han; Jianmin Wang
For efficiently retrieving nearest neighbors from large-scale multiview data, hashing methods have recently been widely investigated, as they can substantially improve query speed. In this paper, we propose an effective probability-based semantics-preserving hashing (SePH) method to tackle the problem of cross-view retrieval. Considering the semantic consistency between views, SePH generates one unified hash code for all observed views of any instance. For training, SePH first transforms the given semantic affinities of training data into a probability distribution, and aims to approximate it with another one in Hamming space, via minimizing their Kullback–Leibler divergence. Specifically, the latter probability distribution is derived from all pairwise Hamming distances between the to-be-learnt hash codes of the training data. Then, with the learnt hash codes, any kind of predictive model, such as linear ridge regression, logistic regression, or kernel logistic regression, can be learnt as a hash function in each view for projecting the corresponding view-specific features into hash codes. As for out-of-sample extension, given any unseen instance, the learnt hash functions in its observed views can predict view-specific hash codes. Then, by deriving or estimating the corresponding output probabilities with respect to the predicted view-specific hash codes, a novel probabilistic approach is further proposed to utilize them for determining a unified hash code. To evaluate the proposed SePH, we conduct extensive experiments on diverse benchmark datasets, and the experimental results demonstrate that SePH is reasonable and effective.
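A minimal sketch of the training objective described above: semantic affinities define one probability distribution P, pairwise distances between relaxed (real-valued) hash codes define another distribution Q, and the objective is their Kullback–Leibler divergence. The Gaussian kernel used to map distances to similarities is an assumed choice, not the paper's exact formulation.

```python
import numpy as np

def to_distribution(pairwise, eps=1e-12):
    """Normalize a symmetric non-negative pairwise matrix into a probability
    distribution over all (i, j) pairs, ignoring the diagonal."""
    p = np.array(pairwise, dtype=float)
    np.fill_diagonal(p, 0.0)
    return p / max(p.sum(), eps)

def seph_kl_objective(affinity, codes, eps=1e-12):
    """KL(P || Q): P from semantic affinities, Q from squared Euclidean
    distances between relaxed hash codes."""
    P = to_distribution(affinity)
    sq_dist = ((codes[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
    # Larger distance -> smaller similarity (assumed Gaussian kernel).
    Q = to_distribution(np.exp(-sq_dist))
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

# Toy example: 6 training items, 8-bit relaxed codes, label-based affinity.
labels = np.array([0, 0, 1, 1, 2, 2])
affinity = (labels[:, None] == labels[None, :]).astype(float)
codes = np.random.randn(6, 8)
print(seph_kl_objective(affinity, codes))
```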
IEEE Transactions on Image Processing | 2017
Yuchen Guo; Guiguang Ding; Li Liu; Jungong Han; Ling Shao
Sparse representation and image hashing are powerful tools for data representation and image retrieval respectively. The combinations of these two tools for scalable image retrieval, i.e., sparse hashing (SH) methods, have been proposed in recent years and the preliminary results are promising. The core of those methods is a scheme that can efficiently embed the (high-dimensional) image features into a low-dimensional Hamming space, while preserving the similarity between features. Existing SH methods mostly focus on finding better sparse representations of images in the hash space. We argue that the anchor set utilized in sparse representation is also crucial, which was unfortunately underestimated by the prior art. To this end, we propose a novel SH method that optimizes the integration of the anchors, such that the features can be better embedded and binarized, termed as Sparse Hashing with Optimized Anchor Embedding. The central idea is to push the anchors far from the axis while preserving their relative positions so as to generate similar hashcodes for neighboring features. We formulate this idea as an orthogonality constrained maximization problem and an efficient and novel optimization framework is systematically exploited. Extensive experiments on five benchmark image data sets demonstrate that our method outperforms several state-of-the-art related methods.
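A minimal sketch of the anchor-based sparse embedding and binarization stage that sparse hashing methods of this kind build on; it does not implement the paper's orthogonality-constrained anchor optimization itself, and the kernel width, sparsity level, and sign-based binarization are assumed choices.

```python
import numpy as np

def anchor_sparse_codes(X, anchors, s=3, sigma=1.0):
    """Sparse anchor embedding: Gaussian similarity to anchors, keep top-s."""
    sq = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq / (2 * sigma ** 2))
    # Zero out all but the s largest similarities per sample, then renormalize.
    idx = np.argsort(sim, axis=1)[:, :-s]
    np.put_along_axis(sim, idx, 0.0, axis=1)
    return sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)

def binarize(Z):
    """Hash codes: sign of the mean-centered embedding (+1/-1 per bit)."""
    return np.sign(Z - Z.mean(axis=0, keepdims=True))

# Toy run: 100 samples in 16-D, 12 anchors -> 12-bit codes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
anchors = X[rng.choice(100, 12, replace=False)]
codes = binarize(anchor_sparse_codes(X, anchors))
print(codes.shape)
```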
IEEE Transactions on Image Processing | 2017
Baochang Zhang; Yun Yang; Chen Chen; Linlin Yang; Jungong Han; Ling Shao
Human action recognition is an important yet challenging task. This paper presents a low-cost descriptor called 3D histograms of texture (3DHoTs) to extract discriminant features from a sequence of depth maps. 3DHoTs are derived by projecting depth frames onto three orthogonal Cartesian planes, i.e., the frontal, side, and top planes, and thus compactly characterize the salient information of a specific action, on which texture features are calculated to represent the action. Besides this fast feature descriptor, a new multi-class boosting classifier (MBC) is also proposed to efficiently exploit different kinds of features in a unified framework for action classification. Compared with existing boosting frameworks, we add a new multi-class constraint to the objective function, which helps to maintain a better margin distribution by maximizing the mean of the margin while still minimizing its variance. Experiments on the MSRAction3D, MSRGesture3D, MSRActivity3D, and UTD-MHAD data sets demonstrate that the proposed system combining 3DHoTs and MBC is superior to the state of the art.
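A minimal sketch of the projection step described above, following the common depth-motion-map recipe (projecting each depth frame onto the frontal, side, and top planes and accumulating frame-to-frame motion) rather than the exact 3DHoT definition; the bin count and depth range are assumptions.

```python
import numpy as np

def project_views(depth, bins=64, d_max=4000.0):
    """Map one depth frame (H x W, values in millimetres) to three 2-D views."""
    h, w = depth.shape
    d = np.clip((depth / d_max) * (bins - 1), 0, bins - 1).astype(int)
    front = depth.astype(float)                      # H x W: depth as intensity
    side = np.zeros((h, bins)); top = np.zeros((bins, w))
    valid = depth > 0
    side[np.where(valid)[0], d[valid]] = 1.0         # H x bins occupancy
    top[d[valid], np.where(valid)[1]] = 1.0          # bins x W occupancy
    return front, side, top

def depth_motion_maps(frames, **kw):
    """Accumulate absolute frame-to-frame differences of each projected view."""
    views = [project_views(f, **kw) for f in frames]
    return [sum(np.abs(b - a) for a, b in zip(v[:-1], v[1:]))
            for v in map(list, zip(*views))]

# Toy example: 10 random 120x160 depth frames.
frames = [np.random.randint(500, 3000, (120, 160)) for _ in range(10)]
dmm_front, dmm_side, dmm_top = depth_motion_maps(frames)
print(dmm_front.shape, dmm_side.shape, dmm_top.shape)
```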
IEEE Transactions on Image Processing | 2017
Li Liu; Zijia Lin; Ling Shao; Fumin Shen; Guiguang Ding; Jungong Han
With the dramatic development of the Internet, how to exploit large-scale retrieval techniques for multimodal web data has become one of the most popular yet challenging problems in computer vision and multimedia. Recently, hashing methods have been used for fast nearest neighbor search in large-scale data spaces, by embedding high-dimensional feature descriptors into a similarity-preserving Hamming space with a low dimension. Inspired by this, in this paper, we introduce a novel supervised cross-modality hashing framework, which can generate unified binary codes for instances represented in different modalities. Particularly, in the learning phase, each bit of a code can be sequentially learned with a discrete optimization scheme that jointly minimizes its empirical loss based on a boosting strategy. In a bitwise manner, hash functions are then learned for each modality, mapping the corresponding representations into unified hash codes. We regard this approach as cross-modality sequential discrete hashing (CSDH), which can effectively reduce the quantization errors arising in the oversimplified rounding-off step and thus lead to high-quality binary codes. In the test phase, a simple fusion scheme is utilized to generate a unified hash code for final retrieval by merging the predicted hashing results of an unseen instance from different modalities. The proposed CSDH has been systematically evaluated on three standard data sets: Wiki, MIRFlickr, and NUS-WIDE, and the results show that our method significantly outperforms the state-of-the-art multimodality hashing techniques.
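A minimal sketch of the test-phase fusion idea mentioned above: each observed modality predicts its own binary code for an unseen instance, and a bitwise vote produces the unified hash code used for retrieval. The weighted sign-voting rule is an assumed simple fusion scheme, not necessarily the one used in CSDH.

```python
import numpy as np

def fuse_codes(predicted_codes, weights=None):
    """predicted_codes: list of length-b arrays with entries in {-1, +1}."""
    codes = np.stack(predicted_codes).astype(float)          # (modalities, bits)
    if weights is None:
        weights = np.ones(len(predicted_codes))
    vote = (np.asarray(weights)[:, None] * codes).sum(axis=0)
    return np.where(vote >= 0, 1, -1)                        # break ties toward +1

# Example: image and text modalities disagree on two of eight bits.
img_code = np.array([ 1, -1,  1,  1, -1,  1, -1,  1])
txt_code = np.array([ 1, -1, -1,  1, -1,  1,  1,  1])
print(fuse_codes([img_code, txt_code], weights=[0.6, 0.4]))
```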
IEEE Transactions on Systems, Man, and Cybernetics | 2013
Ling Shao; Jungong Han; Dong Xu; Jamie Shotton
With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use as an off-the-shelf technology. This special issue is specifically dedicated to new algorithms and/or new applications based on the Kinect (or similar RGB-D) sensors. In total, we received over ninety submissions from more than twenty countries around the world. The submissions cover a wide range of areas including object and scene classification, 3-D pose estimation, visual tracking, data fusion, human action/activity recognition, 3-D reconstruction, mobile robotics, and so on. After two rounds of review by at least two (mostly three) expert reviewers for each paper, the Guest Editors have selected twelve high-quality papers to be included in this highly popular special issue. The papers that comprise this issue are briefly summarized.
IEEE Transactions on Image Processing | 2017
Yuchen Guo; Guiguang Ding; Jungong Han; Yue Gao
By transferring knowledge from the abundant labeled samples of known source classes, zero-shot learning (ZSL) makes it possible to train recognition models for novel target classes that have no labeled samples. Conventional ZSL approaches usually adopt a two-step recognition strategy, in which the test sample is projected into an intermediary space in the first step, and then the recognition is carried out by considering the similarity between the sample and target classes in the intermediary space. Due to this redundant intermediate transformation, information loss is unavoidable, thus degrading the performance of the overall system. Rather than adopting this two-step strategy, in this paper, we propose a novel one-step recognition framework that is able to perform recognition in the original feature space by using directly trained classifiers. To address the lack of labeled samples for training supervised classifiers for the target classes, we propose to transfer samples from source classes with pseudo labels assigned, in which the transferred samples are selected based on their transferability and diversity. Moreover, to account for the unreliability of pseudo labels of transferred samples, we modify the standard support vector machine formulation such that the unreliable positive samples can be recognized and suppressed in the training phase. The entire framework is fairly general with the possibility of further extensions to several common ZSL settings. Extensive experiments on four benchmark data sets demonstrate the superiority of the proposed framework, compared with the state-of-the-art approaches, in various settings.
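A minimal sketch of the sample-transfer idea: pick source-class samples whose features are closest to a target-class prototype, assign them that class as a pseudo label, and keep a reliability weight so a downstream classifier (e.g., an instance-weighted SVM) can suppress the least trustworthy positives. The cosine-similarity scoring and the function names here are hypothetical, not the paper's exact transferability and diversity criteria.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-12)

def transfer_samples(source_feats, target_prototype, k=20):
    """Return indices of the k most transferable source samples and a
    reliability weight in [0, 1] proportional to their similarity."""
    sims = cosine(source_feats, target_prototype)
    idx = np.argsort(sims)[::-1][:k]
    weights = np.clip(sims[idx] / max(sims[idx].max(), 1e-12), 0.0, 1.0)
    return idx, weights

# Toy run: 500 labeled source samples, one unseen target-class prototype
# obtained, e.g., from attribute or word-vector semantics.
rng = np.random.default_rng(1)
source_feats = rng.normal(size=(500, 128))
target_prototype = rng.normal(size=128)
idx, w = transfer_samples(source_feats, target_prototype, k=10)
print(idx[:5], np.round(w[:5], 3))
```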
Multimedia Tools and Applications | 2017
Ziyun Cai; Jungong Han; Li Liu; Ling Shao
RGB-D data has turned out to be a very useful representation of an indoor scene for solving fundamental computer vision problems. It combines the advantages of the color image, which provides appearance information about an object, with the depth image, which is immune to variations in color, illumination, rotation angle, and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high quality RGB-D data can be acquired easily. In recent years, more and more RGB-D image/video datasets dedicated to various applications have become available, which are of great importance for benchmarking the state of the art. In this paper, we systematically survey popular RGB-D datasets for different applications including object recognition, scene classification, hand gesture recognition, 3D-simultaneous localization and mapping, and pose estimation. We provide insights into the characteristics of each important dataset, and compare the popularity and the difficulty of those datasets. Overall, the main goal of this survey is to give a comprehensive description of the available RGB-D datasets and thus to guide researchers in the selection of suitable datasets for evaluating their algorithms.
Neurocomputing | 2014
Junwei Han; Liye Sun; Xintao Hu; Jungong Han; Ling Shao
Visual attention detection in static images has achieved outstanding progress in recent years, whereas much less effort has been devoted to learning visual attention in video sequences. In this paper, we propose a novel method to model spatial and temporal visual attention in videos by learning from human gaze data. The spatial visual attention mainly predicts where viewers look in each video frame, while the temporal visual attention measures which video frames are more likely to attract viewers' interest. Our underlying premise is that objects as well as their movements, instead of conventional contrast-related information, are major factors in dynamic scenes to drive visual attention. Firstly, the proposed models extract two types of bottom-up features derived from multi-scale object filter responses and spatiotemporal motion energy, respectively. Then, spatiotemporal gaze density and inter-observer gaze congruency are generated using a large collection of human-eye gaze data to form two training sets. Finally, prediction models of temporal visual attention and spatial visual attention are learned based on those two training sets and bottom-up features, respectively. Extensive evaluations on publicly available video benchmarks and applications in interestingness prediction of movie trailers demonstrate the effectiveness of the proposed work.
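A minimal sketch of turning recorded gaze points into a per-frame gaze density map, the kind of regression target the spatial-attention model described above is trained to predict; the Gaussian width and normalization are assumed parameters.

```python
import numpy as np

def gaze_density(fixations, height, width, sigma=15.0):
    """fixations: iterable of (row, col) gaze points for one frame."""
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width))
    for r, c in fixations:
        # Place an isotropic Gaussian at each recorded fixation.
        density += np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
    if density.max() > 0:
        density /= density.max()        # normalize so the map peaks at 1
    return density

# Example: three observers fixating near the centre of a 90x120 frame.
d = gaze_density([(40, 55), (45, 60), (50, 62)], height=90, width=120)
print(d.shape, round(float(d.max()), 3))
```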