Andreas Ess
ETH Zurich
Publications
Featured research published by Andreas Ess.
Computer Vision and Image Understanding | 2008
Herbert Bay; Andreas Ess; Tinne Tuytelaars; Luc Van Gool
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
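A core reason for SURF's speed is its use of integral images, which make box-filter responses (the building blocks of its Hessian approximation) computable in constant time regardless of filter size. Below is a minimal sketch of that ingredient in Python/NumPy; it illustrates the idea and is not the authors' implementation:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns; ii[y, x] holds the sum of
    all pixels in img[:y, :x] (zero-padded first row/column)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y, x, h, w):
    """Sum over the h-by-w box with top-left corner (y, x), in O(1) via
    four lookups -- the property SURF's box filters exploit."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.random.rand(64, 64)
ii = integral_image(img)
# The four-lookup sum matches the direct pixel sum:
assert np.isclose(box_sum(ii, 10, 12, 9, 9), img[10:19, 12:21].sum())
```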
International Conference on Computer Vision | 2009
Stefano Pellegrini; Andreas Ess; Konrad Schindler; Luc Van Gool
Object tracking typically relies on a dynamic model to predict the object's location from its past trajectory. In crowded scenarios a strong dynamic model is particularly important, because more accurate predictions allow for smaller search regions, which greatly simplifies data association. Traditional dynamic models predict the location for each target solely based on its own history, without taking into account the remaining scene objects. Collisions are resolved only when they happen. Such an approach ignores important aspects of human behavior: people are driven by their future destination, take into account their environment, anticipate collisions, and adjust their trajectories at an early stage in order to avoid them. In this work, we introduce a model of dynamic social behavior, inspired by models developed for crowd simulation. The model is trained with videos recorded from bird's-eye view at busy locations, and applied as a motion model for multi-people tracking from a vehicle-mounted camera. Experiments on real sequences show that accounting for social interactions and scene knowledge improves tracking performance, especially during occlusions.
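As a rough illustration of such a social motion model, the toy update below combines attraction toward a destination with exponential repulsion from nearby pedestrians, in the style of classic crowd-simulation (social-force) models. The paper's actual model is learned from the bird's-eye-view training data rather than hand-tuned; all parameters here are illustrative assumptions:

```python
import numpy as np

def predict_step(pos, vel, goal, others, dt=0.4, tau=0.5, a=2.0, b=1.0):
    """One Euler step of a toy social-force-style motion model.
    pos, vel, goal: (2,) arrays; others: iterable of (2,) positions."""
    to_goal = goal - pos
    desired = to_goal / (np.linalg.norm(to_goal) + 1e-9)  # unit preferred velocity
    force = (desired - vel) / tau            # relax toward preferred velocity
    for o in others:                          # repulsion from nearby agents
        d = pos - o
        dist = np.linalg.norm(d) + 1e-9
        force += a * np.exp(-dist / b) * d / dist
    vel = vel + dt * force
    return pos + dt * vel, vel
```

The prediction then shrinks the search region for data association: a target headed toward a collision is expected to swerve early, not at the moment of contact.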
International Conference on Computer Vision | 2007
Andreas Ess; Bastian Leibe; Luc Van Gool
In this paper, we address the challenging problem of simultaneous pedestrian detection and ground-plane estimation from video while walking through a busy pedestrian zone. Our proposed system integrates robust stereo depth cues, ground-plane estimation, and appearance-based object detection in a principled fashion using a graphical model. Object-object occlusions lead to complex interactions in this model that make an exact solution computationally intractable. We therefore propose a novel iterative approach that first infers scene geometry using belief propagation and then resolves interactions between objects using a global optimization procedure. This approach leads to a robust solution in a few iterations, while allowing object detection to benefit from geometry estimation and vice versa. We quantitatively evaluate the performance of our proposed approach on several challenging test sequences showing strolls through busy shopping streets. Comparisons to various baseline systems show that it outperforms both a system using no scene geometry and one just relying on structure-from-motion without dense stereo.
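To make one ingredient of this coupling concrete, the sketch below fits a ground plane to 3D stereo points by least squares and gates pedestrian hypotheses by their distance to it. This is a toy stand-in for the paper's probabilistic, belief-propagation-based inference; the function names and the tolerance are illustrative assumptions:

```python
import numpy as np

def fit_ground_plane(points):
    """Least-squares plane z = a*x + b*y + c through (N, 3) points --
    a simple stand-in for the paper's ground-plane estimate."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs  # (a, b, c)

def feet_near_plane(foot_points, coeffs, tol=0.3):
    """Gate detections: keep hypotheses whose 3D foot point lies within
    tol of the plane, mirroring how geometry constrains detection."""
    a, b, c = coeffs
    residual = np.abs(foot_points[:, 2]
                      - (a * foot_points[:, 0] + b * foot_points[:, 1] + c))
    return residual < tol
```

In the actual system this runs in a loop: geometry is re-estimated from the surviving detections, and detections are re-scored against the updated geometry, so each benefits from the other.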
Computer Vision and Pattern Recognition | 2008
Andreas Ess; Bastian Leibe; Konrad Schindler; Luc Van Gool
We present a mobile vision system for multi-person tracking in busy environments. Specifically, the system integrates continuous visual odometry computation with tracking-by-detection in order to track pedestrians in spite of frequent occlusions and egomotion of the camera rig. To achieve reliable performance under real-world conditions, it has long been advocated to extract and combine as much visual information as possible. We propose a way to closely integrate the vision modules for visual odometry, pedestrian detection, depth estimation, and tracking. The integration naturally leads to several cognitive feedback loops between the modules. Among others, we propose a novel feedback connection from the object detector to visual odometry which utilizes the semantic knowledge of detection to stabilize localization. Feedback loops always carry the danger that erroneous feedback from one module is amplified and causes the entire system to become unstable. We therefore incorporate automatic failure detection and recovery, allowing the system to continue when a module becomes unreliable. The approach is experimentally evaluated on several long and difficult video sequences from busy inner-city locations. Our results show that the proposed integration makes it possible to deliver stable tracking performance in scenes of previously infeasible complexity.
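The failure-detection idea can be pictured with a small, self-contained check: if a module's output (here, a visual-odometry pose) deviates implausibly from a constant-velocity extrapolation of past outputs, it is rejected and replaced by the prediction, so that one erroneous module cannot be amplified through the feedback loops. A hedged sketch, not the paper's actual test:

```python
import numpy as np

def checked_pose(new_pose, history, max_step=1.0):
    """Toy failure detection and recovery for an odometry estimate.
    history: list of past (2,) or (3,) pose arrays; max_step is an
    assumed plausibility threshold in metres per frame."""
    if len(history) >= 2:
        predicted = 2 * history[-1] - history[-2]   # constant-velocity guess
        if np.linalg.norm(new_pose - predicted) > max_step:
            new_pose = predicted    # recover: fall back to the prediction
    history.append(new_pose)
    return new_pose
```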
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2009
Andreas Ess; Bastian Leibe; Konrad Schindler; Luc Van Gool
In this paper, we address the problem of multiperson tracking in busy pedestrian zones using a stereo rig mounted on a mobile platform. The complexity of the problem calls for an integrated solution that extracts as much visual information as possible and combines it through cognitive feedback cycles. We propose such an approach, which jointly estimates camera position, stereo depth, object detection, and tracking. The interplay between those components is represented by a graphical model. Since the model has to incorporate object-object interactions and temporal links to past frames, direct inference is intractable. We, therefore, propose a two-stage procedure: for each frame, we first solve a simplified version of the model (disregarding interactions and temporal continuity) to estimate the scene geometry and an overcomplete set of object detections. Conditioned on these results, we then address object interactions, tracking, and prediction in a second step. The approach is experimentally evaluated on several long and difficult video sequences from busy inner-city locations. Our results show that the proposed integration makes it possible to deliver robust tracking performance in scenes of realistic complexity.
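The conditioning in the second stage can be pictured with a deliberately simple data-association step: detections produced by the simplified per-frame model are matched to existing tracks, re-introducing the temporal links that stage one ignored. A toy greedy sketch (the paper's tracker is considerably more involved):

```python
import numpy as np

def associate(tracks, detections, gate=1.0):
    """Greedy nearest-neighbour association of stage-one detections to
    tracks. tracks: dict id -> (2,) position; detections: list of (2,)
    arrays; gate is an assumed distance threshold in metres."""
    assignments, used = {}, set()
    for t_id, t_pos in tracks.items():
        candidates = [(np.linalg.norm(t_pos - det), j)
                      for j, det in enumerate(detections) if j not in used]
        if candidates:
            dist, j = min(candidates)
            if dist < gate:           # accept only matches inside the gate
                assignments[t_id] = j
                used.add(j)
    return assignments
```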
European Conference on Computer Vision | 2010
Stefano Pellegrini; Andreas Ess; Luc Van Gool
We consider the problem of data association in a multiperson tracking context. In semi-crowded environments, people are still discernible as individually moving entities that undergo many interactions with other people in their immediate surroundings. Finding the correct association is therefore difficult, but higher-order social factors, such as group membership, are expected to ease the problem. However, estimating group membership is a chicken-and-egg problem: knowing pedestrian trajectories, it is rather easy to find out possible groupings in the data, but in crowded scenes, it is often difficult to estimate closely interacting trajectories without further knowledge about groups. To this end, we propose a third-order graphical model that is able to jointly estimate correct trajectories and group memberships over a short time window. A set of experiments on challenging data underlines the importance of joint reasoning for data association in crowded scenarios.
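One simple grouping cue of the kind such a model can exploit: members of a group tend to stay close together and move with similar velocity over a short window. The heuristic below is a toy stand-in for one potential in the third-order model; the thresholds are assumed values:

```python
import numpy as np

def likely_same_group(traj_a, traj_b, d_max=2.0, v_max=0.5):
    """Heuristic grouping cue over a short window.
    traj_a, traj_b: (T, 2) position arrays for two pedestrians;
    d_max (m) and v_max (m/frame) are illustrative thresholds."""
    dist = np.linalg.norm(traj_a - traj_b, axis=1).mean()
    vel_diff = np.linalg.norm(np.diff(traj_a, axis=0)
                              - np.diff(traj_b, axis=0), axis=1).mean()
    return dist < d_max and vel_diff < v_max
```

The chicken-and-egg aspect is exactly that this cue needs trajectories as input, while trajectory estimation in crowds benefits from knowing the groups; hence the joint model.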
International Conference on Robotics and Automation | 2009
Andreas Ess; Bastian Leibe; Konrad Schindler; Luc Van Gool
We address the problem of vision-based multi-person tracking in busy pedestrian zones using a stereo rig mounted on a mobile platform. Specifically, we are interested in the application of such a system for supporting path planning algorithms in the avoidance of dynamic obstacles. The complexity of the problem calls for an integrated solution, which extracts as much visual information as possible and combines it through cognitive feedback. We propose such an approach, which jointly estimates camera position, stereo depth, object detections, and trajectories based only on visual information. The interplay between these components is represented in a graphical model. For each frame, we first estimate the ground surface together with a set of object detections. Based on these results, we then address object interactions and estimate trajectories. Finally, we employ the tracking results to predict future motion for dynamic objects and fuse this information with a static occupancy map estimated from dense stereo. The approach is experimentally evaluated on several long and challenging video sequences from busy inner-city locations recorded with different mobile setups. The results show that the proposed integration makes stable tracking and motion prediction possible, and thereby enables path planning in complex and highly dynamic scenes.
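The final fusion step can be sketched compactly: predicted positions of tracked dynamic objects, rolled out with a constant-velocity model over the planning horizon, are stamped into a copy of the static occupancy grid from dense stereo. A minimal sketch under assumed units and parameters:

```python
import numpy as np

def fuse_occupancy(static_map, tracks, horizon=5, dt=0.4, res=0.2):
    """Overlay predicted dynamic occupancy on a static grid.
    static_map: 2D array of occupancy in [0, 1]; tracks: list of
    (pos, vel) pairs of (2,) arrays in metres; res is cell size (m)."""
    dynamic = static_map.copy()
    for pos, vel in tracks:
        for k in range(1, horizon + 1):           # constant-velocity rollout
            cell = np.round((pos + k * dt * vel) / res).astype(int)
            if (0 <= cell[0] < dynamic.shape[0]
                    and 0 <= cell[1] < dynamic.shape[1]):
                dynamic[cell[0], cell[1]] = 1.0   # mark predicted occupancy
    return dynamic
```

A path planner can then treat the fused grid like any static cost map while still anticipating where pedestrians will be, which is the point of the integration.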
The International Journal of Robotics Research | 2010
Andreas Ess; Konrad Schindler; Bastian Leibe; Luc Van Gool
We address the problem of vision-based navigation in busy inner-city locations, using a stereo rig mounted on a mobile platform. In this scenario semantic information becomes important: rather than modeling moving objects as arbitrary obstacles, they should be categorized and tracked in order to predict their future behavior. To this end, we combine classical geometric world mapping with object category detection and tracking. Object-category-specific detectors serve to find instances of the most important object classes (in our case pedestrians and cars). Based on these detections, multi-object tracking recovers the objects’ trajectories, thereby making it possible to predict their future locations, and to employ dynamic path planning. The approach is evaluated on challenging, realistic video sequences recorded at busy inner-city locations.
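Why categorization helps prediction can be shown with a tiny example: once an object is known to be a pedestrian or a car, a class-specific motion model (here, just a class-specific speed cap on a constant-velocity rollout) yields more plausible future locations than treating it as a generic obstacle. All parameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def predict_track(positions, category, horizon=10, dt=0.1):
    """Class-conditioned constant-velocity prediction.
    positions: (T, 2) array of recent track positions (m);
    the speed caps are assumed, per-class plausibility limits."""
    vel = (positions[-1] - positions[-2]) / dt
    max_speed = {"pedestrian": 2.0, "car": 15.0}[category]  # m/s, assumed
    speed = np.linalg.norm(vel)
    if speed > max_speed:
        vel *= max_speed / speed          # clamp implausible estimates
    return [positions[-1] + k * dt * vel for k in range(1, horizon + 1)]
```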
British Machine Vision Conference | 2009
Andreas Ess; Tobias Mueller; Helmut Grabner; Luc Van Gool
Recognizing the traffic scene in front of a car is an important asset for autonomous driving, as well as for safety systems. While GPS-based maps abound and have reached an incredible level of accuracy, they can still profit from additional, image-based information. Especially in urban scenarios, GPS reception can be shaky, or the map might not contain the latest detours due to constructions, demonstrations, etc. Furthermore, such maps are static and cannot account for other dynamic traffic agents, such as cars or pedestrians. In this paper, we therefore propose an image-based system that is able to recognize both the road type (straight, left/right curve, crossing, ...) as well as a set of often encountered objects (car, pedestrian, pedestrian crossing). The obtained information could then be fused with existing maps and either assist the driver directly (e.g., a pedestrian crossing is ahead: slow down) or help in improving object tracking (e.g., where are possible entrance points for pedestrians or cars?). Starting from a video sequence obtained from a car driving through urban areas, we employ a two-stage architecture termed Segmentation-Based Urban Traffic Scene Understanding (SUTSU) that first builds an intermediate representation of the image based on a patch-wise image classification. The patch-wise segmentation is inspired by recent work [3, 4, 5] and assigns class probabilities to every 8×8 image patch. As a feature set, we use the coefficients of the Walsh-Hadamard transform (a decomposition of the image into square waves) and, if available, additional information from the depth map. These are then used in one-versus-all training using AdaBoost for feature selection, where we choose 13 texture classes that we found to be representative of typical urban scenes. This yields a meta-representation of the scene that is more suitable for further processing (Fig. 1 (b,c)). In recent publications, such a segmentation was used for a variety of purposes, such as improvement of object detection [1, 5], analysis of occlusion boundaries, or 3D reconstruction. In this paper, we investigate the use of a segmentation for urban scene analysis. We infer another set of features from the segmentation's probability maps, analyzing repetitiveness, curvature, and rough structure. This set is then again used with one-versus-all training to infer both the type of road segment ahead, as well as the additional presence of pedestrians, cars, or pedestrian crossings. A Hidden Markov Model is used to temporally smooth the result. SUTSU is tested on two challenging sequences, spanning over 50 minutes of video of driving through Zurich. The experiments show that while a state-of-the-art scene classifier [2] can keep global classes, such as road types, apart similarly well, a manually crafted feature set based on a segmentation clearly outperforms it on object classes. Example images are shown in Fig. 2. The main contribution of this paper is the application of recent research efforts in scene categorization to vision "in the wild", driving through urban scenarios. We furthermore show the advantage of a segmentation-based approach over a global descriptor, as the intermediate representation can easily be adapted to other underlying image data (e.g., dusk, rain, ...) without having to change the high-level classifier.
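For reference, the 2-D Walsh-Hadamard coefficients of non-overlapping 8×8 patches can be computed as below. This is a minimal sketch of the feature extraction the abstract describes, using SciPy's Hadamard matrix; it is not the authors' code, and the depth-map features and AdaBoost stage are omitted:

```python
import numpy as np
from scipy.linalg import hadamard

def wht_patch_features(img, patch=8):
    """2-D Walsh-Hadamard coefficients for every non-overlapping
    patch x patch block of a grayscale image (patch must be a power
    of two). Returns an (H/patch, W/patch, patch*patch) feature map."""
    H = hadamard(patch) / np.sqrt(patch)      # orthonormal Hadamard matrix
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    feats = np.empty((rows, cols, patch * patch))
    for i in range(rows):
        for j in range(cols):
            p = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            feats[i, j] = (H @ p @ H.T).ravel()   # square-wave coefficients
    return feats
```

Each coefficient measures the patch's response to one square-wave pattern, which is why the transform works well as a cheap texture descriptor for the 13 classes.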
European Conference on Computer Vision | 2010
Dennis Mitzel; Esther Horbert; Andreas Ess; Bastian Leibe
This paper presents an integrated framework for mobile street-level tracking of multiple persons. In contrast to classic tracking-by-detection approaches, our framework employs an efficient level-set tracker in order to follow individual pedestrians over time. This low-level tracker is initialized and periodically updated by a pedestrian detector and is kept robust through a series of consistency checks. In order to cope with drift and to bridge occlusions, the resulting tracklet outputs are fed to a high-level multi-hypothesis tracker, which performs longer-term data association. This design has the advantage of simplifying short-term data association, resulting in higher-quality tracks that can be maintained even in situations where the pedestrian detector no longer yields good detections. In addition, it requires the pedestrian detector to be active only part of the time, resulting in computational savings. We quantitatively evaluate our approach on several challenging sequences and show that it achieves state-of-the-art performance.
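The division of labor can be summarized as a control loop: a cheap low-level tracker follows each target every frame, consistency checks prune drifting targets, and the detector fires only periodically or when tracking fails. The sketch below captures that flow with injected callables; detect, follow, and is_consistent are hypothetical interfaces for illustration, not the paper's API:

```python
def track_stream(frames, detect, follow, is_consistent, redetect_every=10):
    """Two-layer tracking control flow. detect(frame) -> list of targets;
    follow(frame, target) -> updated target; is_consistent(frame, target)
    -> bool. Yields per-frame tracklet states for the high-level
    multi-hypothesis tracker to associate over longer time spans."""
    targets = detect(frames[0])
    for t, frame in enumerate(frames[1:], start=1):
        targets = [follow(frame, tgt) for tgt in targets]          # cheap, every frame
        targets = [tgt for tgt in targets if is_consistent(frame, tgt)]
        if t % redetect_every == 0 or not targets:
            targets = detect(frame)        # periodic / recovery re-initialization
        yield t, targets
```

Because the expensive detector runs only on a fraction of the frames, the overall cost drops, while the consistency checks keep the level-set tracker from silently drifting between detector updates.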