Xucong Zhang
Max Planck Society
Publication
Featured research published by Xucong Zhang.
computer vision and pattern recognition | 2015
Xucong Zhang; Yusuke Sugano; Mario Fritz; Andreas Bulling
Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets have been collected under controlled laboratory conditions and methods have not been evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing ones with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks that significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.
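As a rough illustration of the multimodal network described above, the following PyTorch sketch regresses 2D gaze angles from a grayscale eye crop plus a 2D head-pose angle; the layer sizes and the 36x60 input resolution are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class MultimodalGazeCNN(nn.Module):
    """Toy multimodal CNN: image features from the eye crop are concatenated
    with the 2D head-pose angle before regressing 2D gaze angles (yaw, pitch)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 6 * 12, 500), nn.ReLU(),  # valid for 36x60 eye crops
        )
        self.head = nn.Linear(500 + 2, 2)  # append head pose, predict gaze angles

    def forward(self, eye_image, head_pose):
        x = self.fc(self.features(eye_image))
        return self.head(torch.cat([x, head_pose], dim=1))

model = MultimodalGazeCNN()
print(model(torch.randn(8, 1, 36, 60), torch.randn(8, 2)).shape)  # -> torch.Size([8, 2])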
computer vision and pattern recognition | 2013
Junjie Yan; Xucong Zhang; Zhen Lei; Shengcai Liao; Stan Z. Li
The serious performance decline with decreasing resolution is the major bottleneck for current pedestrian detection techniques. In this paper, we treat pedestrian detection at different resolutions as different but related problems, and propose a multi-task model to jointly consider their commonness and differences. The model contains resolution-aware transformations that map pedestrians at different resolutions to a common space, where a shared detector is constructed to distinguish pedestrians from background. For model learning, we present a coordinate descent procedure that iteratively learns the resolution-aware transformations and a deformable part model (DPM) based detector. In traffic scenes there are many false positives located around vehicles; we therefore further build a context model to suppress them according to the pedestrian-vehicle relationship. The context model can be learned automatically even when vehicle annotations are not available. Our method reduces the mean miss rate to 60% for pedestrians taller than 30 pixels on the Caltech Pedestrian Benchmark, which noticeably outperforms the previous state of the art (71%).
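A minimal sketch of the alternating (coordinate-descent) optimisation described above, on synthetic data: linear resolution-aware transformations map features of different dimensionality into one shared space, and a shared linear detector is trained there. All dimensions, learning rates and data below are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Two toy "resolutions" with different feature dimensions, both mapped into a
# shared 16-D space by resolution-aware transformations T[r].
dims = {"high": 64, "low": 24}
shared_dim = 16
T = {r: rng.normal(scale=0.1, size=(shared_dim, d)) for r, d in dims.items()}
w = np.zeros(shared_dim)                     # shared detector weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# fake training triples: (resolution, feature vector, label)
data = [(r, rng.normal(size=dims[r]), rng.integers(0, 2)) for r in dims for _ in range(200)]

lr = 0.05
for it in range(50):                         # coordinate descent: alternate two updates
    # (1) fix the transformations, update the shared detector w
    grad_w = np.zeros_like(w)
    for r, x, y in data:
        p = sigmoid(w @ (T[r] @ x))
        grad_w += (p - y) * (T[r] @ x)
    w -= lr * grad_w / len(data)
    # (2) fix the detector, update each resolution-aware transformation T[r]
    for r in dims:
        subset = [(x, y) for rr, x, y in data if rr == r]
        grad_T = np.zeros_like(T[r])
        for x, y in subset:
            p = sigmoid(w @ (T[r] @ x))
            grad_T += (p - y) * np.outer(w, x)
        T[r] -= lr * grad_T / len(subset)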
advanced video and signal based surveillance | 2012
Xucong Zhang; Junjie Yan; Shikun Feng; Zhen Lei; Dong Yi; Stan Z. Li
People counting is one of the key components in video surveillance applications; however, due to occlusion, illumination, color and texture variation, the problem is far from solved. Different from traditional visible-camera based systems, we construct a novel system that uses a vertical Kinect sensor for people counting, where the depth information is used to remove the effect of appearance variation. Since the head is always closer to the Kinect sensor than other parts of the body, the people counting task reduces to finding suitable local minimum regions. According to the particularity of the depth map, we propose a novel unsupervised water filling method that can find these regions with robustness, locality and scale invariance. Experimental comparisons with mean shift and random forest on two databases validate the superiority of our water filling algorithm for people counting.
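The following is a minimal sketch of the local-minimum intuition described above on an overhead depth map (the pixels closest to the sensor are heads); it is a simplified stand-in for the actual water-filling procedure, and all thresholds are made-up assumptions.

import numpy as np
from scipy import ndimage

def count_people(depth_mm, floor_mm, min_height_mm=900, head_band_mm=150, min_area=100):
    """Count people in a vertical (overhead) depth map: heads are the local
    minimum regions of the depth image, since they are closest to the sensor."""
    foreground = depth_mm < (floor_mm - min_height_mm)          # anything person-sized
    labels, n_blobs = ndimage.label(foreground)
    count = 0
    for blob_id in range(1, n_blobs + 1):
        blob = labels == blob_id
        if blob.sum() < min_area:
            continue
        top = depth_mm[blob].min()                               # closest point = top of head
        head_region = blob & (depth_mm <= top + head_band_mm)    # local minimum region
        _, n_heads = ndimage.label(head_region)                  # people may share a blob
        count += max(n_heads, 1)
    return count

# toy example: flat floor at 3000 mm with one "person" standing under the sensor
depth = np.full((120, 160), 3000.0)
depth[40:80, 60:100] = 1600.0        # body and shoulders
depth[55:65, 75:85] = 1300.0         # head, closest to the sensor
print(count_people(depth, floor_mm=3000))   # -> 1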
international conference on computer vision | 2015
Erroll Wood; Tadas Baltrušaitis; Xucong Zhang; Yusuke Sugano; Peter Robinson; Andreas Bulling
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our models' controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.
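A schematic sketch of the data-generation loop implied above: sample an eye-region model, head pose, gaze direction and illumination, render the eye region, and keep the rendering parameters as perfect labels. The parameter ranges, illumination names and the render_eye_region stub are hypothetical placeholders for the real graphics pipeline.

import random

HEAD_POSE_DEG = (-40.0, 40.0)       # assumed yaw/pitch range for posing the head scans
GAZE_DEG = (-35.0, 35.0)            # assumed gaze direction range
ILLUMINATIONS = ["indoor_soft", "outdoor_sun", "screen_glow"]   # placeholder names

def render_eye_region(model_id, head_pose, gaze, illumination):
    """Stub for the graphics renderer; the real pipeline returns a
    photo-realistic close-up eye image for these parameters."""
    return None

def synthesize(n_samples, n_models=10, seed=0):
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        model_id = rng.randrange(n_models)                           # which eye-region model
        head_pose = (rng.uniform(*HEAD_POSE_DEG), rng.uniform(*HEAD_POSE_DEG))
        gaze = (rng.uniform(*GAZE_DEG), rng.uniform(*GAZE_DEG))
        light = rng.choice(ILLUMINATIONS)
        image = render_eye_region(model_id, head_pose, gaze, light)
        # labels are exact by construction: they are the parameters we rendered with
        dataset.append({"image": image, "gaze": gaze, "head_pose": head_pose, "light": light})
    return dataset

print(len(synthesize(1000)))   # -> 1000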
computer vision and pattern recognition | 2017
Xucong Zhang; Yusuke Sugano; Mario Fritz; Andreas Bulling
Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.
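As a rough PyTorch sketch of the spatial-weights mechanism described above (with an illustrative backbone rather than the paper's actual network): a small branch of 1x1 convolutions predicts one weight per spatial location, which is multiplied back onto every feature channel before regression.

import torch
import torch.nn as nn

class SpatialWeightsGaze(nn.Module):
    """Toy full-face gaze CNN with a spatial-weights branch: a learned weight
    map is multiplied onto the feature maps so that informative facial regions
    can be emphasised and uninformative ones suppressed."""
    def __init__(self, out_dim=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.spatial_weights = nn.Sequential(   # 1x1 convs -> one weight per location
            nn.Conv2d(128, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, out_dim),            # 2D angles or a 3D gaze vector
        )

    def forward(self, face):
        feats = self.backbone(face)                  # (B, 128, H', W')
        weights = self.spatial_weights(feats)        # (B, 1, H', W')
        return self.regressor(feats * weights)       # weights broadcast over channels

model = SpatialWeightsGaze()
print(model(torch.randn(2, 3, 224, 224)).shape)      # -> torch.Size([2, 3])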
user interface software and technology | 2016
Yusuke Sugano; Xucong Zhang; Andreas Bulling
Gaze is frequently explored in public display research given its importance for monitoring and analysing audience attention. However, current gaze-enabled public display interfaces require either special-purpose eye tracking equipment or explicit personal calibration for each individual user. We present AggreGaze, a novel method for estimating spatio-temporal audience attention on public displays. Our method requires only a single off-the-shelf camera attached to the display, does not require any personal calibration, and provides visual attention estimates across the full display. We achieve this by 1) compensating for errors of state-of-the-art appearance-based gaze estimation methods through on-site training data collection, and by 2) aggregating uncalibrated and thus inaccurate gaze estimates of multiple users into joint attention estimates. We propose different visual stimuli for this compensation: a standard 9-point calibration, moving targets, text and visual stimuli embedded into the display content, as well as normal video content. Based on a two-week deployment in a public space, we demonstrate the effectiveness of our method for estimating attention maps that closely resemble ground-truth audience gaze distributions.
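A minimal sketch of the aggregation step described above: noisy on-screen gaze estimates from many viewers are binned and smoothed into a joint attention map. The bin sizes and smoothing are illustrative assumptions, and the on-site bias-compensation step is omitted.

import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(gaze_points, display_wh=(1920, 1080), bins=(64, 36), sigma=1.5):
    """Aggregate uncalibrated (x, y) gaze estimates in display pixels into a
    normalised attention heatmap of shape (bins[1], bins[0])."""
    w, h = display_wh
    hist, _, _ = np.histogram2d(
        [y for _, y in gaze_points], [x for x, _ in gaze_points],
        bins=(bins[1], bins[0]), range=[[0, h], [0, w]],
    )
    heat = gaussian_filter(hist, sigma=sigma)   # nearby estimates reinforce each other
    total = heat.sum()
    return heat / total if total > 0 else heat

rng = np.random.default_rng(0)
points = [(960 + rng.normal(0, 120), 300 + rng.normal(0, 80)) for _ in range(500)]
heat = attention_map(points)
print(heat.shape, np.unravel_index(heat.argmax(), heat.shape))  # peak near top-centre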
international conference on biometrics | 2013
Junjie Yan; Xucong Zhang; Zhen Lei; Stan Z. Li
We present an effective deformable part model for face detection in the wild. Compared with previous face detection systems, there are three main contributions. The first is an efficient method for calculating histograms of oriented gradients via pre-calculated lookup tables, which requires only read and write memory operations so that the feature pyramid can be computed in real time. The second is a Sparse Constrained Latent Bilinear Model that simultaneously learns the discriminative deformable part model and reduces the feature dimension by sparse transformations for efficient inference. The third contribution is a deformable-part-based cascade, where every stage is a deformable part in the discriminatively learned model. By integrating the three techniques, we demonstrate noticeable improvements over the previous state of the art on FDDB at real-time speed, in extensive comparisons with both academic and commercial detectors.
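A toy sketch of the lookup-table idea behind the first contribution: the orientation bin and magnitude for every possible (dx, dy) gradient pair are pre-computed once, so per-pixel HOG extraction needs only array reads and writes. The cell size, bin count and omission of block normalisation are simplifications, not the paper's implementation.

import numpy as np

N_BINS = 9
offsets = np.arange(-255, 256)
dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
angle = np.mod(np.arctan2(dy, dx), np.pi)                      # unsigned orientation
BIN_LUT = np.minimum((angle / np.pi * N_BINS).astype(int), N_BINS - 1)
MAG_LUT = np.sqrt(dx.astype(float) ** 2 + dy.astype(float) ** 2)

def hog_cells(gray, cell=8):
    """Toy HOG cell histograms computed via the lookup tables above."""
    gx = np.zeros_like(gray, dtype=int)
    gy = np.zeros_like(gray, dtype=int)
    gx[:, 1:-1] = gray[:, 2:].astype(int) - gray[:, :-2].astype(int)
    gy[1:-1, :] = gray[2:, :].astype(int) - gray[:-2, :].astype(int)
    bins = BIN_LUT[gy + 255, gx + 255]          # read-only lookups, no trig at run time
    mags = MAG_LUT[gy + 255, gx + 255]
    h, w = gray.shape
    hist = np.zeros((h // cell, w // cell, N_BINS))
    for i in range(h // cell):
        for j in range(w // cell):
            b = bins[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            m = mags[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            np.add.at(hist[i, j], b, m)
    return hist

img = (np.random.default_rng(0).random((64, 64)) * 255).astype(np.uint8)
print(hog_cells(img).shape)   # -> (8, 8, 9)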
ieee international conference on automatic face gesture recognition | 2013
Junjie Yan; Xucong Zhang; Zhen Lei; Dong Yi; Stan Z. Li
Despite the success of the last two decades, state-of-the-art face detectors still struggle with images in the wild because of large appearance variations. Instead of treating appearance variations as black boxes and leaving them to statistical learning algorithms, we propose a structural face model that represents them explicitly. Our hierarchical part-based structural face model uses part subtypes to describe appearance variations of each local part, and part deformations to capture the variations between different poses and expressions. During detection, an input candidate is first fitted by the structural model to infer the part locations and subtypes, and the confidence score is then computed from the fitted configuration to reduce the influence of structural variation. Beyond the face model, we exploit the co-occurrence of face and body to further boost detection performance. We present a method for training phrase-based body detectors, and propose a structural context model to jointly use the results of the face detector and various body detectors. Experiments on the challenging FDDB benchmark show that our method achieves state-of-the-art performance compared with other commercial and academic systems.
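A minimal sketch of the fitting step described above, on random scores: for each part, the subtype and displacement with the best appearance score minus a quadratic deformation penalty are selected, and the candidate's confidence is the sum over parts. Map sizes, anchors and weights are made up; the real model is hierarchical and discriminatively trained.

import numpy as np

rng = np.random.default_rng(0)

def fit_part(app_maps, anchor, w_def, search=4):
    """Pick the subtype and displacement maximising appearance score minus a
    quadratic deformation cost around the part's anchor position."""
    best, cfg = -np.inf, None
    for subtype, app in enumerate(app_maps):            # one score map per part subtype
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = anchor[0] + dy, anchor[1] + dx
                if not (0 <= y < app.shape[0] and 0 <= x < app.shape[1]):
                    continue
                s = app[y, x] - w_def * (dx * dx + dy * dy)
                if s > best:
                    best, cfg = s, (subtype, (y, x))
    return best, cfg

# toy face candidate: 3 parts, each with 2 subtypes and a 32x32 appearance score map
parts = [{"maps": rng.normal(size=(2, 32, 32)), "anchor": (8 + 8 * i, 16)} for i in range(3)]
score = sum(fit_part(p["maps"], p["anchor"], w_def=0.05)[0] for p in parts)
print("face confidence:", score)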
arXiv: Computer Vision and Pattern Recognition | 2016
Marc Tonsen; Xucong Zhang; Yusuke Sugano; Andreas Bulling
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people of different ethnicities and a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, and make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution and vision aids as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.
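As a sketch of how such a benchmark is commonly scored (an assumption about the exact protocol, not the paper's code): report, for each detector, the fraction of frames whose predicted pupil centre lies within N pixels of the ground truth, over a range of error thresholds.

import numpy as np

def detection_rates(pred_centres, gt_centres, thresholds=range(1, 16)):
    """Fraction of frames with pupil-centre error below each pixel threshold."""
    pred = np.asarray(pred_centres, dtype=float)
    gt = np.asarray(gt_centres, dtype=float)
    err = np.linalg.norm(pred - gt, axis=1)           # per-frame pixel error
    return {t: float((err <= t).mean()) for t in thresholds}

# toy usage with two hypothetical detectors evaluated on the same ground truth
rng = np.random.default_rng(0)
gt = rng.uniform(100, 200, size=(1000, 2))
detector_a = gt + rng.normal(0, 2.0, size=gt.shape)   # accurate
detector_b = gt + rng.normal(0, 8.0, size=gt.shape)   # less robust
print(detection_rates(detector_a, gt)[5], detection_rates(detector_b, gt)[5])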
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2018
Xucong Zhang; Yusuke Sugano; Mario Fritz; Andreas Bulling
Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22 percent (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.
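For reference, the numbers above are mean 3D angular errors; a minimal sketch of that metric follows, with a pitch/yaw-to-vector conversion that follows one common convention and is an assumption rather than the paper's exact code.

import numpy as np

def angular_error_deg(pred, gt):
    """Mean angular error in degrees between predicted and ground-truth
    3D gaze direction vectors (one vector per row)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def pitchyaw_to_vector(pitchyaw_rad):
    """Convert (pitch, yaw) angles in radians to unit gaze vectors
    (one common camera-coordinate convention)."""
    pitch, yaw = pitchyaw_rad[:, 0], pitchyaw_rad[:, 1]
    return np.stack(
        [-np.cos(pitch) * np.sin(yaw), -np.sin(pitch), -np.cos(pitch) * np.cos(yaw)],
        axis=1,
    )

pred = pitchyaw_to_vector(np.radians([[5.0, 10.0], [0.0, -3.0]]))
gt = pitchyaw_to_vector(np.radians([[3.0, 12.0], [1.0, -1.0]]))
print(angular_error_deg(pred, gt))   # small error, in degrees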