Publication


Featured research published by Xinxiao Wu.


Image and Vision Computing | 2010

Discriminative human action recognition in the learned hierarchical manifold space

Lei Han; Xinxiao Wu; Wei Liang; Guangming Hou; Yunde Jia

In this paper, we propose a hierarchical discriminative approach for human action recognition. It consists of feature extraction with mutual motion pattern analysis and discriminative action modeling in the hierarchical manifold space. A Hierarchical Gaussian Process Latent Variable Model (HGPLVM) is employed to learn the hierarchical manifold space in which motion patterns are extracted. A cascade CRF is also presented to estimate the motion patterns in the corresponding manifold subspace, and a trained SVM classifier predicts the action label for the current observation. Using motion capture data, we test our method and evaluate how individual body parts affect human action recognition. Results on a test set of synthetic images are also presented to demonstrate the robustness of the approach.
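
To make the pipeline concrete, the following is a heavily simplified sketch of the three stages (per-part manifold learning, motion-pattern estimation, and a final action classifier). PCA stands in for HGPLVM and a k-means assignment stands in for the cascade CRF; all function and variable names are illustrative assumptions, not taken from the paper.

```python
# Simplified stand-ins: PCA instead of HGPLVM, k-means instead of the cascade CRF.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_hierarchical_action_model(part_features, labels, n_patterns=8, dim=3):
    """part_features: list of (n_obs, d_part) arrays, one per body part;
    labels: (n_obs,) action labels, one per observation."""
    part_models, pattern_models, codes = [], [], []
    for X in part_features:
        pca = PCA(n_components=dim).fit(X)                     # per-part manifold (stand-in)
        Z = pca.transform(X)
        km = KMeans(n_clusters=n_patterns, n_init=10).fit(Z)   # motion-pattern estimation (stand-in)
        part_models.append(pca)
        pattern_models.append(km)
        codes.append(np.eye(n_patterns)[km.predict(Z)])        # one-hot pattern code per observation
    F = np.hstack(codes)                                       # concatenate codes of all body parts
    clf = SVC(kernel="rbf").fit(F, labels)                     # final action classifier
    return part_models, pattern_models, clf
```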


European Conference on Computer Vision | 2012

View-invariant action recognition using latent kernelized structural SVM

Xinxiao Wu; Yunde Jia

This paper goes beyond recognizing human actions from a fixed view and focuses on action recognition from an arbitrary view. A novel learning algorithm, called latent kernelized structural SVM, is proposed for view-invariant action recognition; it extends the kernelized structural SVM framework to include latent variables. Because the position of the camera changes and is frequently unknown, we regard the view label of an action as a latent variable and implicitly infer it during both learning and inference. Motivated by the geometric correlation between different views and the semantic correlation between different action classes, we additionally propose a mid-level correlation feature that describes an action video by a set of decision values from the pre-learned classifiers of all action classes from all views. Each decision value captures both geometric and semantic correlations between the action video and the corresponding action class from the corresponding view. We then combine the low-level visual cue, mid-level correlation description, and high-level label information into a novel nonlinear kernel under the latent kernelized structural SVM framework. Extensive experiments on the multi-view IXMAS and MuHAVi action datasets demonstrate that our method generally achieves higher recognition accuracy than other state-of-the-art methods.
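
The mid-level correlation feature itself is easy to illustrate: a video is re-described by the decision values of pre-learned classifiers, one per (action class, view) pair. The sketch below assumes such classifiers already exist and expose a decision_function; the names are illustrative.

```python
# A video is described by the decision values of all pre-learned (class, view) classifiers.
import numpy as np

def correlation_feature(video_feature, classifiers):
    """classifiers: dict mapping (action_class, view) -> fitted binary classifier
    with a decision_function method; video_feature: 1-D low-level descriptor."""
    keys = sorted(classifiers.keys())                  # fixed ordering of (class, view) pairs
    x = video_feature.reshape(1, -1)
    return np.array([float(classifiers[k].decision_function(x)[0]) for k in keys])
```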


IEEE Transactions on Circuits and Systems for Video Technology | 2013

Action Recognition Using Multilevel Features and Latent Structural SVM

Xinxiao Wu; Dong Xu; Lixin Duan; Jiebo Luo; Yunde Jia

We first propose a new low-level visual feature, called spatio-temporal context distribution feature of interest points, to describe human actions. Each action video is expressed as a set of relative XYT coordinates between pairwise interest points in a local region. We learn a global Gaussian mixture model (GMM) (referred to as a universal background model) using the relative coordinate features from all the training videos, and then we represent each video as the normalized parameters of a video-specific GMM adapted from the global GMM. In order to capture the spatio-temporal relationships at different levels, multiple GMMs are utilized to describe the context distributions of interest points over multiscale local regions. Motivated by the observation that some actions share similar motion patterns, we additionally propose a novel mid-level class correlation feature to capture the semantic correlations between different action classes. Each input action video is represented by a set of decision values obtained from the pre-learned classifiers of all the action classes, with each decision value measuring the likelihood that the input video belongs to the corresponding action class. Moreover, human actions are often associated with some specific natural environments and also exhibit high correlation with particular scene classes. It is therefore beneficial to utilize the contextual scene information for action recognition. In this paper, we build the high-level co-occurrence relationship between action classes and scene classes to discover the mutual contextual constraints between action and scene. By treating the scene class label as a latent variable, we propose to use the latent structural SVM (LSSVM) model to jointly capture the compatibility between multilevel action features (e.g., low-level visual context distribution feature and the corresponding mid-level class correlation feature) and action classes, the compatibility between multilevel scene features (i.e., SIFT feature and the corresponding class correlation feature) and scene classes, and the contextual relationship between action classes and scene classes. Extensive experiments on UCF Sports, YouTube and UCF50 datasets demonstrate the effectiveness of the proposed multilevel features and action-scene interaction based LSSVM model for human action recognition. Moreover, our method generally achieves higher recognition accuracy than other state-of-the-art methods on these datasets.
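
The universal-background-model step can be sketched compactly: a global GMM is fit on the relative XYT coordinates pooled from all training videos, and each video is then represented by the normalized parameters of a mean-adapted, video-specific GMM. The relevance factor and the mean-only MAP adaptation below are common simplifying choices, not details taken from the paper.

```python
# Global GMM (universal background model) + per-video mean adaptation -> normalized supervector.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ubm(all_coords, n_components=64):
    """all_coords: (N, 3) relative XYT coordinates pooled over all training videos."""
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(all_coords)

def video_supervector(ubm, coords, relevance=16.0):
    """coords: (n, 3) relative coordinates of one video."""
    resp = ubm.predict_proba(coords)                           # (n, K) posteriors under the UBM
    n_k = resp.sum(axis=0)                                     # soft counts per component
    ex_k = resp.T @ coords / np.maximum(n_k[:, None], 1e-8)    # per-component empirical means
    alpha = (n_k / (n_k + relevance))[:, None]                 # data-dependent adaptation weight
    adapted = alpha * ex_k + (1.0 - alpha) * ubm.means_        # MAP-adapted means
    sv = adapted.flatten()
    return sv / (np.linalg.norm(sv) + 1e-12)                   # normalized video representation
```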


Pattern Recognition | 2010

Incremental discriminant-analysis of canonical correlations for action recognition

Xinxiao Wu; Yunde Jia; Wei Liang

Human action recognition from video sequences is a challenging problem due to the large changes of human appearance caused by partial occlusions, non-rigid deformations, and high irregularities. It is difficult to collect a large set of training samples that covers all possible variations of an action for learning the discriminative model. In this paper, we propose an online recognition method, namely incremental discriminant-analysis of canonical correlations (IDCC), in which the discriminative model is incrementally updated to capture the changes of human appearance, thereby facilitating the recognition task in changing environments. As the training sets are acquired sequentially instead of being given completely in advance, our method is able to compute a new discriminant matrix by updating the existing one with the eigenspace merging algorithm. Furthermore, we integrate our method into a graph-based semi-supervised learning method, linear neighbor propagation, to deal with limited labeled training data. Experimental results on both the Weizmann and KTH action datasets show that our method performs better than state-of-the-art methods in both accuracy and efficiency.
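
The eigenspace-merging step used by the incremental update can be sketched in isolation: two eigenspace models, each given by a mean, an orthonormal basis, eigenvalues, and a sample count, are merged without revisiting the original samples. The rank-truncation rule and numerical safeguards below are assumptions; the sketch covers only this generic merging step, not the full IDCC update.

```python
# Merge two eigenspace models (mean, basis, eigenvalues, count) without the raw samples.
import numpy as np

def merge_eigenspaces(mu1, U1, L1, n1, mu2, U2, L2, n2, keep=0.99):
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    # directions of the second model (and the mean shift) not already in span(U1)
    G = np.hstack([U2, (mu1 - mu2).reshape(-1, 1)])
    Q, _ = np.linalg.qr(G - U1 @ (U1.T @ G))
    B = np.hstack([U1, Q])                              # combined orthonormal basis
    # combined covariance expressed in the small coordinate system of B
    C1, C2, d = B.T @ U1, B.T @ U2, B.T @ (mu1 - mu2)
    S = (n1 / n) * (C1 @ np.diag(L1) @ C1.T) \
        + (n2 / n) * (C2 @ np.diag(L2) @ C2.T) \
        + (n1 * n2 / n ** 2) * np.outer(d, d)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), keep)) + 1  # keep most of the energy
    return mu, B @ vecs[:, :k], vals[:k], n
```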


International Conference on Computer Vision | 2009

Incremental discriminant-analysis of canonical correlations for action recognition

Xinxiao Wu; Wei Liang; Yunde Jia

Human action recognition is a challenging problem due to the large changes of human appearance caused by partial occlusions, non-rigid deformations, and high irregularities. It is difficult to collect a large set of training samples that covers all possible variations of an action. In this paper, we propose an online recognition method, namely Incremental Discriminant-Analysis of Canonical Correlations (IDCC), whose discriminative model is incrementally updated to capture the changes of human appearance and thereby facilitates the recognition task in changing environments. As the training sets are acquired sequentially instead of being given completely in advance, our method is able to compute a new discriminant matrix by updating the existing one with the eigenspace merging algorithm. Experimental results on both the Weizmann and KTH action datasets show that our method performs better than state-of-the-art methods in both accuracy and efficiency. Moreover, the robustness of our method is demonstrated on irregular action recognition.


IEEE Transactions on Multimedia | 2014

Video Annotation via Image Groups from the Web

Han Wang; Xinxiao Wu; Yunde Jia

Searching for desirable events in uncontrolled videos is a challenging task. Current research mainly focuses on obtaining concepts from numerous labeled videos, but it is time-consuming and labor-intensive to collect the large amount of labeled videos required for training event models under various circumstances. To alleviate this problem, we propose to leverage abundant Web images, which provide a rich source of information with many events roughly annotated and captured under various conditions. However, knowledge from the Web is noisy and diverse, and brute-force knowledge transfer from images may hurt video annotation performance. We therefore propose a novel Group-based Domain Adaptation (GDA) learning framework to transfer different groups of knowledge (source domain) queried from a Web image search engine to consumer videos (target domain). Different from traditional methods using multiple source domains of images, our method organizes the Web images according to their intrinsic semantic relationships instead of their sources. Specifically, two different types of groups (i.e., event-specific groups and concept-specific groups) are exploited to describe the event-level and concept-level semantic meanings of target-domain videos, respectively. Under this framework, we assign different weights to different image groups according to the relevance between the source groups and the target domain, with each group weight representing how much the corresponding source image group contributes to the knowledge transferred to the target videos. To make the group weights and group classifiers mutually beneficial and reciprocal, a joint optimization algorithm is presented for simultaneously learning the weights and classifiers, using two novel data-dependent regularizers. Experimental results on three challenging video datasets (i.e., CCV, Kodak, and YouTube) demonstrate the effectiveness of leveraging grouped knowledge gained from Web images for video annotation.
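
The grouped-transfer idea can be sketched in two steps: each Web image group trains its own classifier and receives a weight reflecting its relevance to the target videos, and a target video is scored by the weighted combination of group classifiers. The relevance heuristic below (a distance between mean feature embeddings) replaces the paper's joint optimization and data-dependent regularizers; all names are illustrative.

```python
# Per-group classifiers + relevance-based group weights -> weighted prediction on target videos.
import numpy as np
from sklearn.svm import LinearSVC

def group_relevance(X_group, X_target):
    """Smaller distance between mean embeddings -> larger weight (simple MMD-style heuristic)."""
    return 1.0 / (1.0 + np.linalg.norm(X_group.mean(axis=0) - X_target.mean(axis=0)))

def train_gda(groups, X_target_unlabeled):
    """groups: list of (X_images, y_images) pairs, one per Web image group (binary event labels)."""
    clfs = [LinearSVC().fit(X, y) for X, y in groups]
    w = np.array([group_relevance(X, X_target_unlabeled) for X, _ in groups])
    return clfs, w / w.sum()

def score_video(x, clfs, weights):
    x = x.reshape(1, -1)
    return sum(wi * float(clf.decision_function(x)[0]) for wi, clf in zip(weights, clfs))
```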


Pattern Recognition Letters | 2009

Tracking articulated objects by learning intrinsic structure of motion

Xinxiao Wu; Wei Liang; Yunde Jia

In this paper, we propose a novel dimensionality reduction method, temporal neighbor preserving embedding (TNPE), to learn the low-dimensional intrinsic motion manifold of articulated objects. The method simultaneously learns the embedding manifold and the mapping from an image feature space to the embedding space by preserving the local temporal relationship hidden in sequential data points. Tracking is then formulated as the problem of estimating the configuration of an articulated object from the learned central embedding representation. To solve this problem, we combine a Bayesian mixture of experts (BME) with a Gaussian mixture model (GMM) to establish a probabilistic non-linear mapping from the embedding space to the configuration space. Experimental results on articulated hand and human pose tracking show encouraging performance in terms of stability and accuracy.
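
A rough analogue of the temporal-neighbor-preserving idea is shown below: frames that are adjacent in time are encouraged to stay close in the embedding, in the spirit of locality-preserving projections with temporal neighbors. The linear projection, the window size, and the regularizer are simplifying assumptions; the paper's TNPE formulation and the BME/GMM inverse mapping are not reproduced here.

```python
# Linear embedding that keeps temporally adjacent frames close (LPP-style, temporal neighbors).
import numpy as np
from scipy.linalg import eigh

def temporal_embedding(X, dim=3, window=1):
    """X: (n_frames, d) sequential image features; returns a (d, dim) projection matrix."""
    n, d = X.shape
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if i != j:
                W[i, j] = 1.0                           # connect temporal neighbors
    D = np.diag(W.sum(axis=1))
    L = D - W                                           # graph Laplacian over the sequence
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-6 * np.eye(d)                  # small ridge for numerical stability
    vals, vecs = eigh(A, B)                             # generalized eigenproblem, ascending order
    return vecs[:, :dim]                                # smallest eigenvectors span the embedding
```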


International Journal of Computer Vision | 2016

A Hierarchical Video Description for Complex Activity Understanding

Cuiwei Liu; Xinxiao Wu; Yunde Jia

This paper addresses the challenging problem of complex human activity understanding from long videos. Towards this goal, we propose a hierarchical description of an activity video that specifies which activity occurs, what atomic actions compose it, and when these atomic actions happen in the video. In our work, each complex activity is characterized as a composition of simple motion units (called atomic actions), and different atomic actions are explained by different video segments. We develop a latent discriminative structural model to detect the complex activity and atomic actions, while simultaneously learning the temporal structure of atomic actions. A segment-annotation mapping matrix is introduced to relate video segments to their associated atomic actions, allowing different video segments to explain different atomic actions. The segment-annotation mapping matrix is treated as latent information in the model, since its ground truth is not available during either training or testing. Moreover, we present a semi-supervised learning method to automatically predict the atomic action labels of unlabeled training videos when the labeled training data is limited, which greatly alleviates the laborious and time-consuming annotation of atomic actions for training data. Experiments on three activity datasets demonstrate that our method achieves promising activity recognition results and obtains rich, hierarchical descriptions of activity videos.
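
The role of the latent segment-annotation mapping matrix can be illustrated with a minimal inference step: given per-segment compatibility scores for the atomic actions annotated in a video, each segment is assigned to the atomic action that explains it best. The greedy assignment below is only an illustration; the paper's structural model and temporal constraints are not reproduced, and all names are assumptions.

```python
# Build a binary segment-to-atomic-action mapping matrix from per-segment scores.
import numpy as np

def infer_mapping(segment_scores, annotated_actions):
    """segment_scores: (n_segments, n_atomic_actions) compatibility scores;
    annotated_actions: indices of the atomic actions annotated for this video."""
    annotated_actions = np.asarray(annotated_actions)
    M = np.zeros_like(segment_scores, dtype=int)
    for s in range(segment_scores.shape[0]):
        a = annotated_actions[np.argmax(segment_scores[s, annotated_actions])]
        M[s, a] = 1                                     # segment s explains atomic action a
    return M
```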


Science in China Series F: Information Sciences | 2014

Learning a discriminative mid-level feature for action recognition

Cuiwei Liu; MingTao Pei; Xinxiao Wu; Yu Kong; Yunde Jia

In this paper, we address the problem of recognizing human actions from videos. Most existing approaches employ low-level features (e.g., local features and global features) to represent an action video. However, algorithms based on low-level features are not robust to complex environments such as cluttered backgrounds, camera movement, and illumination change. Therefore, we propose a novel random forest learning framework to construct a discriminative and informative mid-level feature from low-level features of densely sampled 3D cuboids. Each cuboid is classified by the corresponding random forests with a novel fusion scheme, and the cuboid's posterior probabilities over all categories are normalized to generate a histogram. We then obtain our mid-level feature by concatenating the histograms of all the cuboids. Since a single low-level feature is not enough to capture the variations of human actions, multiple complementary low-level features (i.e., optical flow and histogram of gradient 3D features) are employed to describe the 3D cuboids. Moreover, the temporal context between local cuboids is exploited as another type of low-level feature. The above three low-level features (i.e., optical flow, histogram of gradient 3D features, and temporal context) are effectively fused in the proposed learning framework. Finally, the mid-level feature is employed by a random forest classifier for robust action recognition. Experiments on the Weizmann, UCF Sports, Ballet, and multi-view IXMAS datasets demonstrate that our mid-level feature learned from multiple low-level features achieves superior performance over state-of-the-art methods.
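
The construction of the mid-level feature can be sketched directly: each densely sampled 3D cuboid is scored by a random forest, its class-posterior histogram is kept, and the histograms of all cuboids are concatenated into the video descriptor. The sketch below assumes a single low-level descriptor per cuboid and a fixed sampling grid (so every video yields the same number of cuboids); the paper's multi-feature fusion scheme is omitted, and the names are illustrative.

```python
# Random-forest posteriors per cuboid, concatenated into a mid-level video descriptor.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_cuboid_forest(cuboid_features, cuboid_labels, n_trees=100):
    """cuboid_features: (N, d) low-level descriptors; cuboid_labels: action label of the source video."""
    return RandomForestClassifier(n_estimators=n_trees).fit(cuboid_features, cuboid_labels)

def midlevel_feature(forest, video_cuboids):
    """video_cuboids: (n_cuboids, d) descriptors of one video, sampled on a fixed grid."""
    post = forest.predict_proba(video_cuboids)           # (n_cuboids, n_classes) posteriors
    post = post / post.sum(axis=1, keepdims=True)        # normalize each cuboid's histogram
    return post.flatten()                                 # concatenate histograms of all cuboids
```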


IEEE International Conference on Automatic Face & Gesture Recognition | 2008

Human action recognition using discriminative models in the learned hierarchical manifold space

Lei Han; Wei Liang; Xinxiao Wu; Yunde Jia

A hierarchical learning-based approach to human action recognition is proposed in this paper. It consists of feature extraction based on hierarchical nonlinear dimensionality reduction and action modeling based on cascaded discriminative models. Human actions are inferred from human body joint motions, and human bodies are decomposed into several physiological body parts according to their inherent hierarchy (e.g., the right arm, left arm, and head all belong to the upper body). We explore the underlying hierarchical structures of the high-dimensional human pose space using a hierarchical Gaussian process latent variable model (HGPLVM) and learn a representative motion pattern set for each body part. In the hierarchical manifold space, bottom-up cascaded conditional random fields (CRFs) are used to predict the corresponding motion pattern in each manifold subspace, and the final action label is then estimated for each observation by a discriminative classifier on the current motion pattern set.

Collaboration


Dive into Xinxiao Wu's collaboration.

Top Co-Authors

Yunde Jia, Beijing Institute of Technology
Wei Liang, Beijing Institute of Technology
Cuiwei Liu, Beijing Institute of Technology
Han Wang, Beijing Institute of Technology
Hao Song, Beijing Institute of Technology
Feiwu Yu, Beijing Institute of Technology
Jingyi Hou, Beijing Institute of Technology
Yuchao Sun, Beijing Institute of Technology
Yang Feng, University of Rochester
Wennan Yu, Beijing Institute of Technology