Publication


Featured research published by Jingen Liu.


computer vision and pattern recognition | 2009

Recognizing realistic actions from videos “in the wild”

Jingen Liu; Jiebo Luo; Mubarak Shah

In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild”. Such unconstrained videos are abundant in personal collections as well as on the Web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.
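
The PageRank step lends itself to a compact illustration: treat each static feature as a node in a similarity graph and keep the features with the highest stationary importance. The cosine-similarity graph and power-iteration details below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def pagerank_scores(features, damping=0.85, iters=100):
    # cosine-similarity graph over the static features (assumed construction)
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = np.clip(X @ X.T, 0.0, None)        # nonnegative edge weights
    np.fill_diagonal(S, 0.0)               # no self-loops
    T = S / S.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    n = len(S)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):                 # power iteration
        r = (1.0 - damping) / n + damping * (T.T @ r)
    return r

feats = np.random.rand(200, 128)           # stand-in SIFT-like descriptors
keep = np.argsort(pagerank_scores(feats))[::-1][:100]  # top-ranked features
```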


computer vision and pattern recognition | 2008

Learning human actions via information maximization

Jingen Liu; Mubarak Shah

In this paper, we present a novel approach for automatically learning a compact and yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features called video-words by quantizing the extracted 3D interest points (cuboids) from the videos. Our proposed approach is able to automatically discover the optimal number of video-word clusters by utilizing maximization of mutual information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video words based on their appearance similarity, MMI clustering further groups the video-words that are highly correlated to some group of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters. We use a modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and the IXMAS multiview dataset. To the best of our knowledge, we are the first to try a bag-of-video-words approach on the multiview dataset. We have obtained very impressive results on both datasets.
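
As a rough sketch of the information-theoretic clustering idea, the following greedily merges video-words so that the mutual information between word clusters and action labels is preserved as far as possible. This agglomerative stand-in only approximates the paper's MMI optimization, and the quadratic search per merge is suitable only for toy-sized vocabularies.

```python
import numpy as np

def mutual_info(P):
    # I(W; A) for a joint distribution P over (video-word, action)
    pw, pa = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log(P[nz] / (pw @ pa)[nz])).sum())

def mmi_merge(counts, target_k):
    # counts: word-by-action co-occurrence matrix from the training set
    P = counts / counts.sum()
    groups = [[i] for i in range(P.shape[0])]
    while P.shape[0] > target_k:
        best_mi, best_pair = -np.inf, None
        for a in range(P.shape[0]):
            for b in range(a + 1, P.shape[0]):
                merged = np.delete(P, b, axis=0)
                merged[a] = P[a] + P[b]       # candidate merge of rows a, b
                mi = mutual_info(merged)
                if mi > best_mi:
                    best_mi, best_pair = mi, (a, b)
        a, b = best_pair
        P[a] += P[b]                          # commit the least-lossy merge
        P = np.delete(P, b, axis=0)
        groups[a] += groups.pop(b)
    return groups                             # each entry is one word cluster
```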


computer vision and pattern recognition | 2011

Recognizing human actions by attributes

Jingen Liu; Benjamin Kuipers; Silvio Savarese

In this paper we explore the idea of using high-level semantic concepts, also called attributes, to represent human actions from videos and argue that attributes enable the construction of more descriptive models for human action recognition. We propose a unified framework wherein manually specified attributes are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) coherently integrated with data-driven attributes to make the attribute set more descriptive. Data-driven attributes are automatically inferred from the training data using an information theoretic approach. Our framework is built upon a latent SVM formulation where latent variables capture the degree of importance of each attribute for each action class. We also demonstrate that our attribute-based action representation can be effectively used to design a recognition procedure for classifying novel action classes for which no training samples are available. We test our approach on several publicly available datasets and obtain promising results that quantitatively demonstrate our theoretical claims.
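
A minimal sketch of the zero-shot use of attributes, assuming independent logistic attribute detectors in place of the paper's latent SVM: novel classes are recognized by matching predicted attribute scores against manually specified class signatures.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_models(X, attr_labels):
    # one binary classifier per manually specified attribute
    return [LogisticRegression(max_iter=1000).fit(X, attr_labels[:, a])
            for a in range(attr_labels.shape[1])]

def classify_novel_action(models, x, class_signatures):
    # class_signatures: one binary attribute vector per unseen action class
    scores = np.array([m.predict_proba(x[None, :])[0, 1] for m in models])
    # nearest attribute signature wins; no training samples of these
    # classes are ever seen
    return int(np.abs(class_signatures - scores).sum(axis=1).argmin())
```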


computer vision and pattern recognition | 2008

Recognizing human actions using multiple features

Jingen Liu; Saad Ali; Mubarak Shah

In this paper, we propose a framework that fuses multiple features for improved action recognition in videos. The fusion of multiple features is important for recognizing actions, as a single feature-based representation is often not enough to capture the imaging variations (viewpoint, illumination, etc.) and attributes of individuals (size, age, gender, etc.). Hence, we use two types of features: i) a quantized vocabulary of local spatio-temporal (ST) volumes (or cuboids), and ii) a quantized vocabulary of spin-images, which aims to capture the shape deformation of the actor by considering actions as 3D objects (x, y, t). To optimally combine these features, we treat different features as nodes in a graph, where weighted edges between the nodes represent the strength of the relationship between entities. The graph is then embedded into a k-dimensional space subject to the criterion that similar nodes have Euclidean coordinates closer to each other. This is achieved by converting the constraint into a minimization problem whose solution is given by the eigenvectors of the graph Laplacian matrix. This procedure is known as Fiedler embedding. The performance of the proposed framework is tested on publicly available datasets. The results demonstrate that fusion of multiple features helps in achieving improved performance, and allows retrieval of meaningful features and videos from the embedding space.
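
Fiedler embedding itself is easy to sketch: the coordinates are the eigenvectors of the graph Laplacian associated with its smallest non-trivial eigenvalues. The unnormalized Laplacian below is one common choice; the paper may use a different normalization.

```python
import numpy as np

def fiedler_embedding(W, k=3):
    """Embed graph nodes into k dimensions via the graph Laplacian.

    W: symmetric nonnegative affinity matrix over all nodes
       (cuboid words, spin-image words, and videos together).
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                          # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # drop the trivial constant eigenvector (eigenvalue ~ 0); the next
    # one is the Fiedler vector
    return vecs[:, 1:k + 1]
```

Strongly connected nodes receive nearby coordinates, which is what allows meaningful features and videos to be retrieved with plain Euclidean distances in the embedded space.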


computer vision and pattern recognition | 2011

Cross-view action recognition via view knowledge transfer

Jingen Liu; Mubarak Shah; Benjamin Kuipers; Silvio Savarese

In this paper, we present a novel approach to recognizing human actions from different views by view knowledge transfer. An action is originally modelled as a bag of visual-words (BoVW), which is sensitive to view changes. We argue that, as opposed to visual words, there exist some higher level features which can be shared across views and enable the connection of action models for different views. To discover these features, we use a bipartite graph to model two view-dependent vocabularies, then apply bipartite graph partitioning to co-cluster two vocabularies into visual-word clusters called bilingual-words (i.e., high-level features), which can bridge the semantic gap across view-dependent vocabularies. Consequently, we can transfer a BoVW action model into a bag-of-bilingual-words (BoBW) model, which is more discriminative in the presence of view changes. We tested our approach on the IXMAS data set and obtained very promising results. Moreover, to further fuse view knowledge from multiple views, we apply a Locally Weighted Ensemble scheme to dynamically weight transferred models based on the local distribution structure around each test example. This process can further improve the average recognition rate by about 7%.
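
As a sketch of the bilingual-word construction, one can co-cluster a cross-view co-occurrence matrix with off-the-shelf spectral bipartite partitioning. sklearn's SpectralCoclustering is an assumed stand-in for the paper's partitioner, and the co-occurrence matrix here is synthetic.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# rows: view-A visual words, cols: view-B visual words; entries count how
# often the two words fire in the same (unlabelled) video pair
rng = np.random.default_rng(0)
cooc = rng.poisson(1.0, size=(400, 400)).astype(float) + 1e-6

model = SpectralCoclustering(n_clusters=30, random_state=0).fit(cooc)
to_bilingual_A = model.row_labels_      # view-A word -> bilingual word
to_bilingual_B = model.column_labels_   # view-B word -> bilingual word
# re-binning a BoVW histogram by these labels yields the BoBW model
```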


computer vision and pattern recognition | 2009

Learning semantic visual vocabularies using diffusion distance

Jingen Liu; Yang Yang; Mubarak Shah

In this paper, we propose a novel approach for learning a generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by a vector of pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic, lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary. Although the conventional k-means approach is adequate for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful, since the clustering criterion is based on appearance similarity only. Our proposed approach can effectively overcome these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike some of the supervised vocabulary construction approaches, and unsupervised methods such as pLSA and LDA, diffusion maps can capture the local intrinsic geometric relations between the midlevel feature points on the manifold. We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results.
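
The diffusion-map machinery can be sketched directly: form affinities between the PMI vectors, normalize them into a Markov matrix, and use its leading non-trivial eigenvectors, scaled by powered eigenvalues, as coordinates. Euclidean distance in these coordinates approximates diffusion distance; the Gaussian kernel and the parameter values below are assumptions.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, dim=10, t=2):
    # X: one row of PMI values per midlevel feature
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))       # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)       # Markov transition matrix
    vals, vecs = np.linalg.eig(P)              # real up to numerical noise
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # diffusion coordinates after t steps; component 0 is constant/trivial
    return (vals[1:dim + 1] ** t) * vecs[:, 1:dim + 1]
```

Running k-means on these coordinates, rather than on raw appearance, is what groups features by semantic source rather than by look alone.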


computer vision and pattern recognition | 2012

Evaluation of low-level features and their combinations for complex event detection in open source videos

Amir Tamrakar; Saad Ali; Qian Yu; Jingen Liu; Omar Javed; Ajay Divakaran; Hui Cheng; Harpreet S. Sawhney

Low-level appearance as well as spatio-temporal features, appropriately quantized and aggregated into Bag-of-Words (BoW) descriptors, have been shown to be effective in many detection and recognition tasks. However, their efficacy for complex event recognition in unconstrained videos has not been systematically evaluated. In this paper, we use the NIST TRECVID Multimedia Event Detection (MED11 [1]) open source dataset, containing annotated data for 15 high-level events, as the standardized test bed for evaluating the low-level features. This dataset contains a large number of user-generated video clips. We consider 7 different low-level features, both static and dynamic, using BoW descriptors within an SVM approach for event detection. We present performance results on the 15 MED11 events for each of the features as well as their combinations using a number of early and late fusion strategies, and discuss their strengths and limitations.
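
The two fusion regimes are easy to contrast in code. The sketch below uses RBF SVMs with posterior averaging as one simple instance of early and late fusion; it is an assumed simplification, since the paper evaluates a number of fusion strategies rather than this single pairing.

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_model(bow_list, y):
    # early fusion: concatenate per-feature BoW histograms, train one SVM
    return SVC(probability=True).fit(np.hstack(bow_list), y)

def late_fusion_scores(bow_list, y, test_list):
    # late fusion: one SVM per feature type, then average the posteriors
    per_feature = [SVC(probability=True).fit(Xtr, y).predict_proba(Xte)
                   for Xtr, Xte in zip(bow_list, test_list)]
    return np.mean(per_feature, axis=0)
```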


international conference on computer vision | 2007

Scene Modeling Using Co-Clustering

Jingen Liu; Mubarak Shah

In this paper, we propose a novel approach for scene modeling. The proposed method is able to automatically discover the intermediate semantic concepts. We utilize a Maximization of Mutual Information (MMI) co-clustering approach to discover clusters of semantic concepts, which we call intermediate concepts. Each intermediate concept corresponds to a cluster of visterms in the bag-of-visterms (BOV) paradigm for scene classification. MMI co-clustering results in fewer but more meaningful clusters. Unlike k-means, which is used to cluster image patches based on their appearances in BOV, MMI co-clustering can group the visterms which are highly correlated to some concept. Unlike probabilistic latent semantic analysis (pLSA), which can be considered one-sided soft clustering, MMI co-clustering simultaneously clusters visterms and images, so it is able to boost both clusterings. In addition, MMI co-clustering is an unsupervised method. We have extensively tested our proposed approach on two challenging datasets: the fifteen scene categories and the LSCOM dataset, and promising results are obtained.
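
A toy version of the co-clustering idea, assuming greedy alternating reassignment of visterm and image labels to maximize the mutual information of the clustered joint distribution, in the spirit of information-theoretic co-clustering; the paper's actual algorithm may differ, and this brute-force sketch is only practical at toy sizes.

```python
import numpy as np

def clustered_mi(P, rl, cl, kr, kc):
    # I(row-clusters; col-clusters) of the co-clustered joint distribution
    Q = np.zeros((kr, kc))
    for r in range(kr):
        for c in range(kc):
            Q[r, c] = P[rl == r][:, cl == c].sum()
    pr, pc = Q.sum(axis=1, keepdims=True), Q.sum(axis=0, keepdims=True)
    nz = Q > 0
    return float((Q[nz] * np.log(Q[nz] / (pr @ pc)[nz])).sum())

def mmi_cocluster(counts, kr=20, kc=15, sweeps=5, seed=0):
    P = counts / counts.sum()                  # joint p(visterm, image)
    rng = np.random.default_rng(seed)
    rl = rng.integers(kr, size=P.shape[0])     # visterm cluster labels
    cl = rng.integers(kc, size=P.shape[1])     # image cluster labels
    for _ in range(sweeps):
        for i in range(P.shape[0]):            # greedy visterm reassignment
            mis = []
            for r in range(kr):
                trial = rl.copy()
                trial[i] = r
                mis.append(clustered_mi(P, trial, cl, kr, kc))
            rl[i] = int(np.argmax(mis))
        for j in range(P.shape[1]):            # greedy image reassignment
            mis = []
            for c in range(kc):
                trial = cl.copy()
                trial[j] = c
                mis.append(clustered_mi(P, rl, trial, kr, kc))
            cl[j] = int(np.argmax(mis))
    return rl, cl                              # both sides clustered at once
```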


international conference on computer vision | 2009

Incremental action recognition using feature-tree

Kishore K. Reddy; Jingen Liu; Mubarak Shah

Action recognition methods suffer from many drawbacks in practice, which include (1) the inability to cope with incremental recognition problems; (2) the requirement of an intensive training stage to obtain good performance; (3) the inability to recognize simultaneous multiple actions; and (4) difficulty in performing recognition frame by frame. In order to overcome all these drawbacks using a single method, we propose a novel framework involving the feature-tree to index large-scale motion features using a Sphere/Rectangle-tree (SR-tree). Recognition consists of two steps: (1) recognizing the local features by non-parametric nearest neighbor (NN), and (2) using a simple voting strategy to label the action. The proposed method can also provide the localization of the action. Since our method does not require feature quantization, the feature-tree can be efficiently grown by adding features from new training examples of actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We have tested our approach on two publicly available datasets, the KTH and the IXMAS multi-view datasets, and obtained promising results.
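
A compact sketch of the feature-tree idea, with scipy's cKDTree standing in for the disk-based SR-tree. The hypothetical FeatureTree class and its rebuild-on-add are simplifications: a real SR-tree grows incrementally in place.

```python
import numpy as np
from scipy.spatial import cKDTree

class FeatureTree:
    """NN indexing plus voting; labels are small non-negative class ids."""

    def __init__(self, feats, labels):
        self.feats = np.asarray(feats, dtype=float)
        self.labels = np.asarray(labels)
        self.tree = cKDTree(self.feats)

    def add(self, feats, labels):
        # incremental training: append features from new examples or
        # categories; no vocabulary re-quantization is needed (the full
        # KD-tree rebuild here is the simplification noted above)
        self.feats = np.vstack([self.feats, feats])
        self.labels = np.concatenate([self.labels, labels])
        self.tree = cKDTree(self.feats)

    def classify(self, query_feats, k=5):
        # label every local feature by its k nearest neighbours, then vote
        _, idx = self.tree.query(np.asarray(query_feats, dtype=float), k=k)
        return int(np.bincount(self.labels[idx].ravel()).argmax())
```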

Collaboration


Dive into Jingen Liu's collaborations.

Top Co-Authors

Mubarak Shah

University of Central Florida

Yun Zhai

University of Central Florida
