Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Yuanlu Xu is active.

Publication


Featured research published by Yuanlu Xu.


International Conference on Computer Vision | 2013

Human Re-identification by Matching Compositional Template with Cluster Sampling

Yuanlu Xu; Liang Lin; Wei-Shi Zheng; Xiaobai Liu

This paper aims at a newly arising task in visual surveillance: re-identifying people at a distance by matching body information, given several reference examples. Most existing works solve this task by matching a reference template with the target individual, but often suffer from large human appearance variability (e.g., different poses/views, illumination) and high false positives in matching caused by conjunctions, occlusions, or surrounding clutter. Addressing these problems, we construct a simple yet expressive template from a few reference images of a certain individual, which represents the body as an articulated assembly of compositional and alternative parts, and propose an effective matching algorithm with cluster sampling. This algorithm is designed within a candidacy graph whose vertices are matching candidates (i.e., a pair of source and target body parts), and iterates between two steps until convergence: (i) it generates possible partial matches based on compatible and competitive relations among body parts, and (ii) it confirms the partial matches to generate a new matching solution, which is accepted by the Markov Chain Monte Carlo (MCMC) mechanism. In the experiments, we demonstrate the superior performance of our approach on three public databases compared to existing methods.
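
A minimal Python sketch of the MCMC accept step described above. The single-candidate toggle proposal and the toy energy below are illustrative stand-ins, not the paper's candidacy-graph proposal or its learned compatibility potentials.

    import math
    import random

    def mcmc_matching(candidates, energy, n_iters=2000, temperature=1.0):
        """Sample matching solutions, accepting proposals Metropolis-style."""
        state = set()                    # current set of accepted partial matches
        e_cur = energy(state)
        for _ in range(n_iters):
            # Step (i): propose a partial match by toggling one candidate pair.
            c = random.choice(candidates)
            proposal = state ^ {c}       # add the pair if absent, drop it if present
            e_new = energy(proposal)
            # Step (ii): accept or reject with the Metropolis probability.
            if random.random() < min(1.0, math.exp((e_cur - e_new) / temperature)):
                state, e_cur = proposal, e_new
        return state

    # Toy usage: candidates are (source_part, target_part) pairs; the energy
    # rewards matches but penalizes pairs sharing a source or target part.
    def toy_energy(state):
        srcs = [s for s, _ in state]
        tgts = [t for _, t in state]
        conflicts = (len(srcs) - len(set(srcs))) + (len(tgts) - len(set(tgts)))
        return -len(state) + 5.0 * conflicts

    pairs = [(s, t) for s in range(4) for t in range(4)]
    print(mcmc_matching(pairs, toy_energy))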


IEEE Transactions on Image Processing | 2014

Complex Background Subtraction by Pursuing Dynamic Spatio-Temporal Models

Liang Lin; Yuanlu Xu; Xiaodan Liang; Jian-Huang Lai

Although it has been widely discussed in video surveillance, background subtraction remains an open problem in complex scenarios, e.g., dynamic backgrounds, illumination variations, and indistinct foreground objects. To address these challenges, we propose an effective background subtraction method that learns and maintains an array of dynamic texture models within spatio-temporal representations. At each location of the scene, we extract a sequence of regular video bricks, i.e., video volumes spanning both the spatial and temporal domains. Background modeling is thus posed as pursuing subspaces within the video bricks while adapting to scene variations. For each sequence of video bricks, we pursue the subspace with an autoregressive moving average (ARMA) model that jointly characterizes the appearance consistency and temporal coherence of the observations. During online processing, we incrementally update the subspaces to cope with disturbances from foreground objects and scene changes. In the experiments, we validate the proposed method in several complex scenarios and show superior performance over other state-of-the-art background subtraction approaches. Empirical studies of parameter settings and a component analysis are presented as well.
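
To make the subspace idea concrete, here is a minimal Python sketch that models the video bricks at one scene location with a plain PCA subspace and flags bricks by reconstruction error. It is an assumption-laden simplification: plain PCA stands in for the paper's ARMA model, and the rank, learning rate, and dimensions are invented for illustration.

    import numpy as np

    class BrickSubspace:
        """One per scene location; models flattened video bricks."""
        def __init__(self, rank=3, lr=0.05):
            self.rank, self.lr = rank, lr
            self.mean = None
            self.basis = None            # columns span the background subspace

        def fit(self, bricks):
            """bricks: (n_samples, d) array of flattened spatio-temporal volumes."""
            self.mean = bricks.mean(axis=0)
            _, _, vt = np.linalg.svd(bricks - self.mean, full_matrices=False)
            self.basis = vt[: self.rank].T

        def residual(self, brick):
            """Reconstruction error; large when the brick leaves the subspace."""
            x = brick - self.mean
            proj = self.basis @ (self.basis.T @ x)
            return float(np.linalg.norm(x - proj))

        def update(self, brick):
            """Drift the mean toward new data (a crude stand-in for the
            incremental subspace update described above)."""
            self.mean = (1 - self.lr) * self.mean + self.lr * brick

    # Toy usage: 100 background bricks of dimension 64, then one outlier brick.
    rng = np.random.default_rng(0)
    bg = rng.normal(0.0, 1.0, (100, 64))
    model = BrickSubspace()
    model.fit(bg)
    print(model.residual(bg[0]), model.residual(bg[0] + 10.0))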


ACM Multimedia | 2014

Person Search in a Scene by Jointly Modeling People Commonness and Person Uniqueness

Yuanlu Xu; Bingpeng Ma; Rui Huang; Liang Lin

This paper presents a novel framework for a multimedia search task: searching for a person in a scene using human body appearance. Existing works mostly focus on two independent problems related to this task, i.e., people detection and person re-identification. However, a sequential combination of these two components does not solve the person search problem seamlessly, for two reasons: 1) errors in people detection are unavoidably carried into person re-identification; 2) the setting of person re-identification differs from that of person search, which is essentially a verification problem. To bridge this gap, we propose a unified framework that jointly models the commonness of people (for detection) and the uniqueness of a person (for identification). We demonstrate superior performance of our approach on public benchmarks compared with sequential combinations of state-of-the-art detection and identification algorithms.
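
A minimal sketch of the joint scoring idea: each candidate window is ranked by a weighted sum of a commonness term (person vs. background) and a uniqueness term (match to the query). Both scoring functions and the weight alpha below are hypothetical placeholders, not the paper's learned models.

    import numpy as np

    def search_person(window_feats, query_feat, commonness, uniqueness, alpha=0.5):
        """Rank candidate windows by a weighted sum of the two terms."""
        scores = [(1 - alpha) * commonness(w) + alpha * uniqueness(w, query_feat)
                  for w in window_feats]
        return int(np.argmax(scores)), scores

    # Cosine similarity stands in for the learned uniqueness model.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Toy usage: the query is a noisy copy of window 3's feature vector.
    rng = np.random.default_rng(1)
    feats = rng.normal(size=(5, 16))
    query = feats[3] + rng.normal(scale=0.1, size=16)
    best, _ = search_person(feats, query, commonness=lambda w: 1.0, uniqueness=cosine)
    print(best)   # expected: 3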


Computer Vision and Pattern Recognition | 2016

Multi-view People Tracking via Hierarchical Trajectory Composition

Yuanlu Xu; Xiaobai Liu; Yang Liu; Song-Chun Zhu

This paper presents a hierarchical composition approach for multi-view object tracking. The key idea is to adaptively exploit multiple cues in both 2D and 3D, e.g., ground occupancy consistency, appearance similarity, and motion coherence, which are mutually complementary while tracking the humans of interest over time. While online feature selection has been extensively studied in the literature, it remains unclear how to effectively schedule these cues for tracking, especially when encountering various challenges, e.g., occlusions, conjunctions, and appearance variations. To this end, we propose a hierarchical composition model and reformulate multi-view multi-object tracking as a problem of compositional structure optimization. We set up a set of composition criteria, each of which corresponds to one particular cue. The hierarchical composition process is pursued by exploiting different criteria, which impose constraints between a graph node and its children in the hierarchy. We learn the composition criteria via maximum likelihood estimation on annotated data and efficiently construct the hierarchical graph with an iterative greedy pursuit algorithm. In the experiments, we demonstrate superior performance of our approach on three public datasets, one of which is newly created by us to test various challenges in multi-view multi-object tracking.
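
The greedy pursuit can be sketched as agglomerative composition: repeatedly merge the pair of nodes whose composition criterion scores highest, until no merge passes a threshold. The proximity criterion and threshold below are toy stand-ins for the paper's learned criteria.

    from itertools import combinations

    def greedy_compose(leaves, score, threshold=0.5):
        """Repeatedly merge the best-scoring pair of nodes until no merge
        passes the composition threshold."""
        nodes = [frozenset([x]) for x in leaves]
        while len(nodes) > 1:
            u, v = max(combinations(nodes, 2), key=lambda p: score(p[0], p[1]))
            if score(u, v) < threshold:
                break                    # no candidate merge passes the criterion
            nodes.remove(u); nodes.remove(v)
            nodes.append(u | v)          # parent node in the hierarchy
        return nodes

    # Toy usage: tracklets are 1-D positions; nearby tracklets compose better.
    positions = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}

    def proximity(u, v):
        cu = sum(positions[x] for x in u) / len(u)
        cv = sum(positions[x] for x in v) / len(v)
        return 1.0 / (1.0 + abs(cu - cv))

    print(greedy_compose(positions, proximity))   # two groups: {a, b} and {c, d}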


International Conference on Image Processing | 2012

Real-time object-of-interest tracking by learning Composite Patch-based Templates

Yuanlu Xu; Hongfei Zhou; Qing Wang; Liang Lin

In this paper, we propose a patch-based object tracking algorithm that provides both strong robustness and computational efficiency. Our algorithm learns and maintains Composite Patch-based Templates (CPT) of the tracking target. Each composite template employs HOG, CS-LBP, and color histograms to represent the local statistics of edges, texture, and flatness. The CPT model is initially established by maximizing the discriminability of the composite templates given the first frame, and is automatically updated online by adding new effective composite patches and deleting old invalid ones. The target location is inferred by matching each composite template across frames. In this way, the proposed algorithm can effectively track targets with partial occlusions or significant appearance variations. Experimental results demonstrate that the proposed algorithm outperforms both the MIL and Ensemble Tracking algorithms.
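
A minimal Python sketch of matching one patch template by intensity histogram. The real CPT model combines HOG, CS-LBP, and color histograms over many patches; the patch size, stride, and bin count here are made up for illustration.

    import numpy as np

    def histogram(patch, bins=8):
        """Normalized intensity histogram of a patch."""
        h, _ = np.histogram(patch, bins=bins, range=(0, 256), density=True)
        return h

    def match_template(frame, template_hist, patch=16, stride=8, bins=8):
        """Slide over the frame; return the corner minimizing L1 histogram
        distance to the template."""
        best, best_d = None, np.inf
        H, W = frame.shape
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                h = histogram(frame[y:y + patch, x:x + patch], bins)
                d = float(np.abs(h - template_hist).sum())
                if d < best_d:
                    best, best_d = (y, x), d
        return best, best_d

    # Toy usage: a bright 16x16 target pasted into a dark frame.
    frame = np.zeros((64, 64))
    frame[24:40, 32:48] = 200.0
    template = histogram(frame[24:40, 32:48])
    print(match_template(frame, template))   # corner (24, 32), distance 0.0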


ACM Multimedia | 2017

Place-centric Visual Urban Perception with Deep Multi-instance Regression

Xiaobai Liu; Qi Chen; Lei Zhu; Yuanlu Xu; Liang Lin

This paper presents a unified framework for learning to quantify perceptual attributes (e.g., safety, attractiveness) of physical urban environments using crowd-sourced street-view photos without human annotations. The effort of this work is twofold. First, we collect a large-scale urban image dataset covering multiple major cities in the U.S., which consists of multiple street-view photos for every place. Instead of using subjective annotations as in previous works, which are neither accurate nor consistent, we collect for every place a safety score derived from government crime event records as an objective safety indicator. Second, we observe that the place-centric perception task is by nature a multi-instance regression problem, since labels are only available for places (bags), rather than images or image regions (instances). We thus introduce a deep convolutional neural network (CNN) to parameterize the instance-level scoring function, and develop an EM algorithm to alternately estimate the primary instances (images or image regions) that affect the safety scores and train the proposed network. Our method is capable of localizing interesting images and image regions for each place. We evaluate the proposed method on a newly created dataset and a public dataset. Comparative results show that our method clearly outperforms alternative perception methods and, more importantly, is capable of generating region-level safety scores that facilitate interpretation of the perception process.
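
Here is a minimal sketch of the EM alternation in code: a linear scorer stands in for the deep CNN, the E-step keeps the max-scoring instance per place (bag), and the M-step takes a least-squares gradient step so the primary-instance scores match the bag labels. All dimensions, the learning rate, and the toy data are illustrative assumptions.

    import numpy as np

    def mi_regression(bags, labels, dim, iters=100, lr=0.1):
        """EM loop: E-step picks each bag's primary instance under the current
        scorer; M-step nudges the scorer toward the bag labels."""
        w = np.zeros(dim)
        for _ in range(iters):
            primaries = np.stack([bag[np.argmax(bag @ w)] for bag in bags])
            grad = primaries.T @ (primaries @ w - labels) / len(bags)
            w -= lr * grad
        return w

    # Toy usage: each place (bag) holds 5 image features; the label comes from
    # its best instance, mimicking "one primary image drives the safety score".
    rng = np.random.default_rng(2)
    true_w = rng.normal(size=4)
    bags = [rng.normal(size=(5, 4)) for _ in range(40)]
    labels = np.array([np.max(b @ true_w) for b in bags])
    w = mi_regression(bags, labels, dim=4)
    pred = np.array([np.max(b @ w) for b in bags])
    print(np.corrcoef(pred, labels)[0, 1])   # correlation is typically high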


IEEE Transactions on Circuits and Systems for Video Technology | 2017

A Stochastic Grammar for Simultaneous Human Re-Identification and Tracking

Xiaobai Liu; Qian Xu; Yuanlu Xu; Lei Zhu; Yadong Mu

Identifying and tracking humans across camera views is a long-standing challenge, especially in complex scenarios with frequent occlusions, significant lighting changes, and other difficulties. Under such conditions, most existing appearance and geometric cues are not reliable enough to distinguish humans across camera views. To address these challenges, this paper proposes to jointly re-identify and track humans of interest and introduces a stochastic attribute grammar model that leverages complementary and discriminative human attributes. The key idea of our method is a hierarchical representation, the parse graph, which describes a subject and its movement trajectory in both the space and time domains. This results in a hierarchical compositional representation comprising trajectory entities of varying levels, including human boxes, 3D human boxes, tracklets, and trajectories. We use a set of grammar rules to decompose a graph node (e.g., a tracklet) into a set of children nodes (e.g., 3D human boxes), and augment each node with a set of attributes, including geometry (e.g., moving speed and direction), accessories (e.g., bags), and activities (e.g., walking and running). These attributes serve as valuable cues, in addition to appearance features (e.g., colors), in determining the associations of human detection boxes across cameras. In particular, the attributes of a parent node are inherited by its children nodes, resulting in consistency constraints over the feasible parse graphs. Thus, we cast cross-view human re-identification and tracking as finding the most discriminative parse graph for each subject in videos. We develop a learning method to train this attribute grammar model from weakly supervised training data. To infer the optimal parse graph and its attributes, we develop an alternative parsing method that employs both top-down and bottom-up computations to search for the optimal solution. We also explicitly reason about the occlusion status of each entity to deal with significant changes of camera viewpoint. We evaluate the proposed method on public video benchmarks and demonstrate with extensive experiments that it clearly outperforms state-of-the-art tracking methods.
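
A minimal data-structure sketch of the attribute-inheritance constraint: a parse-graph node passes its attributes down to its children, and consistency can be checked recursively. The attribute names and level labels are illustrative, not the paper's exact schema.

    from dataclasses import dataclass, field

    @dataclass
    class ParseNode:
        level: str                    # e.g., "trajectory", "tracklet", "3d_box"
        attributes: dict = field(default_factory=dict)
        children: list = field(default_factory=list)

        def add_child(self, child):
            # Children inherit parent attributes they have not set themselves,
            # which keeps the parse graph attribute-consistent by construction.
            child.attributes = {**self.attributes, **child.attributes}
            self.children.append(child)

        def consistent(self):
            """Recursively check that every descendant agrees with this node."""
            return all(
                all(c.attributes.get(k) == v for k, v in self.attributes.items())
                and c.consistent()
                for c in self.children
            )

    # Toy usage: a trajectory's attributes propagate down to its tracklets.
    traj = ParseNode("trajectory", {"accessory": "bag", "activity": "walking"})
    traj.add_child(ParseNode("tracklet"))
    traj.add_child(ParseNode("tracklet", {"activity": "walking"}))
    print(traj.consistent())   # True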


National Conference on Artificial Intelligence | 2018

Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation

Haoshu Fang; Yuanlu Xu; Wenguan Wang; Xiaobai Liu; Song-Chun Zhu


National Conference on Artificial Intelligence | 2017

Cross-View People Tracking by Scene-Centered Spatio-Temporal Parsing

Yuanlu Xu; Xiaobai Liu; Lei Qin; Song-Chun Zhu


arXiv | 2017

Learning Knowledge-guided Pose Grammar Machine for 3D Human Pose Estimation

Haoshu Fang; Yuanlu Xu; Wenguan Wang; Xiaobai Liu; Song-Chun Zhu

Collaboration


Dive into Yuanlu Xu's collaborations.

Top Co-Authors

Song-Chun Zhu

University of California

Xiaobai Liu

Huazhong University of Science and Technology

Liang Lin

Sun Yat-sen University

Wenguan Wang

Beijing Institute of Technology

Lei Zhu

University of Queensland

Tianfu Wu

University of California

Bingpeng Ma

Chinese Academy of Sciences
