Marco Pedersoli
Katholieke Universiteit Leuven
Publications
Featured research published by Marco Pedersoli.
European Conference on Computer Vision | 2014
Markus Mathias; Rodrigo Benenson; Marco Pedersoli; Luc Van Gool
Face detection is a mature problem in computer vision. While diverse high-performing face detectors have been proposed in the past, we present two surprising new top-performance results. First, we show that a properly trained vanilla DPM reaches top performance, improving over commercial and research systems. Second, we show that a detector based on rigid templates - similar in structure to the Viola & Jones detector - can reach similar top performance on this task. Importantly, we discuss issues with the existing evaluation benchmark and propose an improved procedure.
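As a purely illustrative companion to the rigid-template result above, the sketch below scores one linear template densely over a feature map with plain numpy; the feature extraction, the learned template weights, and the multi-scale search are all assumed and omitted, so this is a minimal sketch rather than the authors' detector.

import numpy as np

def score_rigid_template(feat, template):
    """Slide a learned linear template over a feature map; return a score map.

    feat:     (H, W, D) feature map (e.g. HOG-like cells), assumed given.
    template: (h, w, D) template weights, e.g. learned with a linear SVM.
    """
    H, W, D = feat.shape
    h, w, _ = template.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            scores[y, x] = np.sum(feat[y:y + h, x:x + w] * template)
    return scores

# Detections would be score-map peaks above a threshold, followed by
# non-maximum suppression across positions and scales (omitted here).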
Computer Vision and Pattern Recognition | 2011
Marco Pedersoli; Andrea Vedaldi; Jordi Gonzàlez
We present a method that can dramatically accelerate object detection with part-based models. The method is based on the observation that the cost of detection is likely to be dominated by the cost of matching each part to the image, and not by the cost of computing the optimal configuration of the parts as commonly assumed. Therefore, accelerating detection requires minimizing the number of part-to-image comparisons. To this end we propose a multiple-resolution hierarchical part-based model and a corresponding coarse-to-fine inference procedure that recursively eliminates unpromising part placements from the search space. The method yields a ten-fold speedup over the standard dynamic programming approach and is complementary to the cascade-of-parts approach of [9]. Compared to the latter, our method has no parameters to be determined empirically, which simplifies its use during the training of the model. Most importantly, the two techniques can be combined to obtain a very significant speedup, of two orders of magnitude in some cases. We evaluate our method extensively on the PASCAL VOC and INRIA datasets, demonstrating a substantial increase in detection speed with little loss of accuracy.
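The core idea of the coarse-to-fine procedure can be sketched in a few lines: score every placement at the coarsest resolution, keep only the most promising ones, and refine those at each finer level. A minimal sketch follows, assuming a stand-in score_level function and a dyadic pyramid; it is not the paper's implementation.

import numpy as np

def coarse_to_fine(score_level, num_levels, coarse_shape, keep=50):
    """score_level(l, ys, xs) -> scores of candidate placements (ys, xs)
    at pyramid level l, where level 0 is the coarsest (an assumption)."""
    H, W = coarse_shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ys, xs = ys.ravel(), xs.ravel()
    scores = score_level(0, ys, xs)            # dense pass at the coarse level
    for l in range(1, num_levels):
        top = np.argsort(scores)[-keep:]       # prune unpromising placements
        ys, xs = ys[top] * 2, xs[top] * 2      # project survivors to finer grid
        scores = scores[top] + score_level(l, ys, xs)
    best = int(np.argmax(scores))
    return (ys[best], xs[best]), scores[best]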
Computer Vision and Pattern Recognition | 2015
Hakan Bilen; Marco Pedersoli; Tinne Tuytelaars
Weakly supervised object detection is a challenging task in which the training procedure involves learning, at the same time, both the model appearance and the object location in each image. The classical approach to this problem is to consider the location of the object of interest in each image as a latent variable and to minimize the loss generated by that latent variable during learning. However, as learning appearance and localization are two interconnected tasks, the optimization is not convex and the procedure can easily get stuck in a poor local minimum, i.e. the algorithm “misses” the object in some images. In this paper, we help the optimization get close to the global minimum by enforcing a “soft” similarity between each possible location in the image and a reduced set of “exemplars”, or clusters, learned with a convex formulation on the training images. The help is effective because it comes from a different and smooth source of information that is not directly connected with the main task. Results show that our method improves a strong baseline based on convolutional neural network features by more than 4 points, without any additional features or extra computation at testing time and with only a small increase in training time due to the convex clustering.
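To make the “soft” similarity concrete, here is a minimal numpy sketch of one plausible form of such a regularizer: a smooth soft-minimum of squared distances between candidate-window features and the learned exemplar centers. The function name, shapes, and the log-sum-exp form are assumptions for illustration, not the paper's exact formulation.

import numpy as np

def soft_exemplar_penalty(window_feats, exemplars, temperature=1.0):
    """window_feats: (N, D) features of candidate windows;
    exemplars: (K, D) cluster centers from the convex clustering.
    Returns a smooth per-window penalty (soft-min of squared distances)."""
    d2 = ((window_feats[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    # log-sum-exp soft minimum: smooth everywhere, so it can gently steer
    # the non-convex localization objective instead of adding hard jumps
    return -temperature * np.log(np.exp(-d2 / temperature).sum(axis=1))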
International Conference on Computer Vision | 2015
Amir Ghodrati; Ali Diba; Marco Pedersoli; Tinne Tuytelaars; Luc Van Gool
In this paper we evaluate the quality of the activation layers of a convolutional neural network (CNN) for the generation of object proposals. We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps. Conversely, the first layers of the network can better localize the object of interest but with reduced recall. Based on this observation we design a method for proposing object locations that is based on CNN features and combines the best of both worlds. We build an inverse cascade that, going from the final to the initial convolutional layers of the CNN, selects the most promising object locations and refines their boxes in a coarse-to-fine manner. The method is efficient because i) it uses the same features extracted for detection, ii) it aggregates features using integral images, and iii) it avoids a dense evaluation of the proposals thanks to the inverse coarse-to-fine cascade. The method is also accurate: it outperforms most previously proposed object proposal approaches and, when plugged into a CNN-based detector, produces state-of-the-art detection performance.
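The inverse cascade lends itself to a compact sketch: propose boxes densely from the coarse final-layer activations, keep the best, then walk back towards the early layers, locally re-aligning each surviving box at every step. Everything below (score_box, the fixed box size, the 2x resolution ratio between layers) is an assumption for illustration, not the paper's code.

import numpy as np

def inverse_cascade(fmaps, score_box, keep=100):
    """fmaps: feature maps ordered first (fine) -> last (coarse) layer.
    score_box(fmap, box) -> objectness of box = (y0, x0, y1, x1)."""
    coarse = fmaps[-1]
    H, W = coarse.shape[:2]
    # 1) dense coarse hypotheses: every cell with a fixed box size
    boxes = [(y, x, y + 3, x + 3) for y in range(H - 3) for x in range(W - 3)]
    scores = np.array([score_box(coarse, b) for b in boxes])
    boxes = [boxes[i] for i in np.argsort(scores)[-keep:]]
    # 2) walk back towards the first layers, refining each surviving box
    for fmap in reversed(fmaps[:-1]):
        refined = []
        for (y0, x0, y1, x1) in boxes:
            y0, x0, y1, x1 = 2 * y0, 2 * x0, 2 * y1, 2 * x1  # finer grid
            cands = [(y0 + dy, x0 + dx, y1 + dy, x1 + dx)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            refined.append(max(cands, key=lambda b: score_box(fmap, b)))
        boxes = refined
    return boxes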
IEEE Transactions on Intelligent Transportation Systems | 2014
Marco Pedersoli; Jordi Gonzàlez; Xu Hu; F. Xavier Roca
Most advanced driving assistance systems already include pedestrian detection systems. Unfortunately, there is still a tradeoff between precision and real-time operation. For reliable detection, an excellent precision-recall tradeoff is needed to detect as many pedestrians as possible while, at the same time, avoiding too many false alarms; in addition, very fast computation is needed for fast reactions to dangerous situations. Recently, novel approaches based on deformable templates have been proposed, since these show a reasonable detection performance; however, they are computationally too expensive for real-time use. In this paper, we present a system for pedestrian detection based on a hierarchical multiresolution part-based model. The proposed system is able to achieve state-of-the-art detection accuracy, due to the local deformations of the parts, while exhibiting a speedup of more than one order of magnitude thanks to a fast coarse-to-fine inference technique. Moreover, our system explicitly infers the level of resolution available so that the detection of small examples is feasible at a very reduced computational cost. We conclude this contribution by presenting how a graphics processing unit-optimized implementation of our proposed system is suitable for real-time pedestrian detection in terms of both accuracy and speed.
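The "explicitly infers the level of resolution available" idea can be illustrated with a toy computation: given a candidate window's pixel height, count how many model resolution levels the image can actually support, and evaluate only those. The constants and the doubling assumption below are illustrative, not the paper's values.

import math

def usable_levels(window_height_px, base_height_px=128, num_levels=3):
    """How many resolution levels of the model can be evaluated for a
    detection window of the given pixel height (level 0 = coarsest)?"""
    if window_height_px >= base_height_px:
        return num_levels                      # the full multiresolution model
    # assume each finer level doubles the required image resolution
    missing = math.ceil(math.log2(base_height_px / window_height_px))
    return max(1, num_levels - missing)

print(usable_levels(128))  # 3: all levels usable
print(usable_levels(48))   # 1: only the coarse root is available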
European Conference on Computer Vision | 2010
Marco Pedersoli; Jordi Gonzàlez; Andrew D. Bagdanov; Juan José Villanueva
Cascading techniques are commonly used to speed up the scan of an image for object detection. However, cascades of detectors are slow to train due to the high number of detectors and corresponding thresholds to learn. Furthermore, they do not use any prior knowledge about the scene structure to decide where to focus the search. To handle these problems, we propose a new way to scan an image, in which we couple a recursive coarse-to-fine refinement with spatial constraints on the object location. To do so, we split an image into a set of uniformly distributed neighborhood regions and, for each of these, apply a local greedy search over feature resolutions. The neighborhood is defined as a scanning region that only one object can occupy. Therefore the best hypothesis is obtained as the location with maximum score, and no thresholds are needed. We present an implementation of our method using a pyramid of HOG features and evaluate it on two standard databases, the VOC2007 and INRIA datasets. Results show that the Recursive Coarse-to-Fine Localization (RCFL) achieves a 12x speed-up compared to standard sliding windows. Compared with a multiple-resolution cascade approach, our method performs slightly better in both speed and Average Precision. Furthermore, in contrast to cascading approaches, the speed-up is independent of image conditions, the number of detected objects, and clutter.
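A minimal sketch of the scanning scheme described above: the coarse grid is split into neighborhoods that can hold at most one object, and within each neighborhood a greedy search descends the feature pyramid, keeping a single best hypothesis so no threshold is needed. The score function, grid sizes, and dyadic levels are stand-ins for the HOG pyramid model, not the paper's implementation.

def rcfl(score, coarse_shape, neigh=4, levels=3):
    """score(l, y, x) -> model score at pyramid level l and cell (y, x);
    coarse_shape: grid size at the coarsest level. All stand-ins."""
    H, W = coarse_shape
    detections = []
    for ny in range(0, H, neigh):              # one neighborhood, one object
        for nx in range(0, W, neigh):
            y, x, s = ny, nx, 0.0
            for l in range(levels):
                # greedy local refinement within the neighborhood
                cands = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
                y, x = max(cands, key=lambda p: score(l, p[0], p[1]))
                s += score(l, y, x)
                if l + 1 < levels:
                    y, x = 2 * y, 2 * x        # descend to the finer resolution
            detections.append(((y, x), s))     # best hypothesis, no threshold
    return detections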
British Machine Vision Conference | 2014
Amir Ghodrati; Marco Pedersoli; Tinne Tuytelaars
Recent top-performing methods for viewpoint estimation make use of 3D information such as 3D CAD models or 3D landmarks to build a 3D representation of the class. These 3D annotations are expensive and not available for many classes. In this paper we investigate whether, and how, comparable performance can be obtained without any 3D information. We consider viewpoint estimation as a 1-vs-all classification problem on the previously detected object bounding box. In this framework we compare several features and parameter configurations and show that modern representations based on Fisher encoding and convolutional neural network features, together with a neighboring-viewpoint suppression strategy on the training data, lead to comparable or even better performance than 3D methods.
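One plausible reading of the 1-vs-all setup with neighboring-viewpoint suppression is sketched below with scikit-learn: when training the classifier for viewpoint bin k, samples from the adjacent bins are simply dropped from the negatives instead of being treated as mistakes. The features, binning, and linear SVM are assumptions for illustration.

import numpy as np
from sklearn.svm import LinearSVC

def train_viewpoint_classifiers(X, bins, num_bins):
    """X: (N, D) features of detected boxes; bins: (N,) viewpoint bin labels."""
    clfs = []
    for k in range(num_bins):
        neighbors = {(k - 1) % num_bins, k, (k + 1) % num_bins}
        # keep bin k as positives, everything outside its neighbors as negatives
        keep = np.array([b == k or b not in neighbors for b in bins])
        clfs.append(LinearSVC().fit(X[keep], (bins[keep] == k).astype(int)))
    return clfs

def predict_viewpoint(clfs, x):
    return int(np.argmax([c.decision_function(x[None])[0] for c in clfs]))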
International Conference on Computer Vision | 2017
Marco Pedersoli; Thomas Lucas; Cordelia Schmid; Jakob J. Verbeek
We propose “Areas of Attention”, a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attention-based approaches that associate image regions only with the RNN state, our method allows a direct association between caption words and image regions. During training these associations are inferred from image-level captions, akin to weakly supervised object detector training. These associations help to improve captioning by localizing the corresponding regions during testing. We also propose and compare different ways of generating attention areas: CNN activation grids, object proposals, and spatial transformer networks applied in a convolutional fashion. Spatial transformers give the best results: they allow for image-specific attention areas and can be trained jointly with the rest of the network. Our attention mechanism and spatial transformer attention areas together yield state-of-the-art results on the MSCOCO dataset.
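As a rough sketch of how two of the three pairwise interactions (region-state and region-word; the word-state term is omitted here) could drive the attention weights, consider the toy numpy version below. The bilinear parameterization and all matrix names are assumptions, not the paper's exact model.

import numpy as np

def attention_step(regions, state, word, W_rs, W_rw):
    """regions: (R, Dr) region features; state: (Ds,) RNN state;
    word: (Dw,) previous-word embedding; W_rs: (Dr, Ds); W_rw: (Dr, Dw)."""
    s = regions @ W_rs @ state + regions @ W_rw @ word   # per-region scores
    e = np.exp(s - s.max())
    alpha = e / e.sum()                                  # softmax attention
    context = alpha @ regions                            # attended image context
    return alpha, context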
International Journal of Computer Vision | 2015
Marco Pedersoli; Radu Timofte; Tinne Tuytelaars; Luc Van Gool
Deformable Part Models (DPM) are the current state of the art for object detection. Nevertheless, they seem suboptimal in the representation of deformations. Object deformations are often continuous and not confined to big parts. Therefore we propose to replace the DPM star model based on big parts with a deformation field: a grid of small parts connected by pairwise constraints, which can better handle continuous deformations. The naive application of this model to object detection would consist of a bounded sliding-window approach: for each possible location in the image, the best part configuration within a limited bound around this location is found. This is computationally very expensive. Instead, we propose a different inference procedure, where an iterative image-level search finds the best object hypothesis. We show that this approach is faster than bounded sliding windows yet produces comparable accuracy. Experiments further show that the deformation field can better approximate real object deformations and therefore, for certain classes, produces even better detection accuracy than the state-of-the-art DPM. Finally, the same approach is adapted to model-free tracking, showing improved accuracy in this case as well.
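A minimal sketch of one way to run inference in such a deformation field, using iterated conditional modes: each small part on the grid repeatedly moves to the placement that best trades off its appearance score against staying close to its four grid neighbors. The appearance stand-in, the quadratic pairwise term, and the weight lam are all assumptions, not the paper's exact procedure.

import numpy as np

def refine_field(pos, appearance, lam=1.0, iters=5):
    """pos: (Gy, Gx, 2) integer part placements on a Gy x Gx grid;
    appearance(i, j, y, x) -> score of part (i, j) placed at (y, x)."""
    Gy, Gx, _ = pos.shape
    for _ in range(iters):
        for i in range(Gy):
            for j in range(Gx):
                nbrs = [pos[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= a < Gy and 0 <= b < Gx]
                y0, x0 = pos[i, j]
                best, best_s = (y0, x0), -np.inf
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        y, x = y0 + dy, x0 + dx
                        pair = -lam * sum((y - n[0]) ** 2 + (x - n[1]) ** 2 for n in nbrs)
                        s = appearance(i, j, y, x) + pair
                        if s > best_s:
                            best, best_s = (y, x), s
                pos[i, j] = best
    return pos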
British Machine Vision Conference | 2016
Amir Ghodrati; Xu Jia; Marco Pedersoli; Tinne Tuytelaars
Learning the distribution of images in order to generate new samples is a challenging task due to the high dimensionality of the data and the highly non-linear relations involved. Nevertheless, some promising results have been reported in the literature recently, building on deep network architectures. In this work, we zoom in on a specific type of image generation: given an image and knowing the category of objects it belongs to (e.g. faces), our goal is to generate a similar and plausible image, but with some altered attributes. This is particularly challenging, as the model needs to learn to disentangle the effect of each attribute and to apply a desired attribute change to a given input image, while keeping the other attributes and the overall object appearance intact. To this end, we learn a convolutional network in which the desired attribute information is encoded and then merged with the encoded image at the feature-map level. We show promising results, both qualitatively and quantitatively, in the context of a retrieval experiment on two face datasets (MultiPie and CAS-PEAL-R1).
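The merge at feature-map level can be sketched concisely in PyTorch: broadcast the target attribute vector over the encoder's spatial grid, concatenate it channel-wise with the image encoding, and fuse with a 1x1 convolution. The module name, channel sizes, and the 1x1 fusion are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class AttributeMerge(nn.Module):
    """Tile an attribute code over the encoded image and fuse channel-wise."""
    def __init__(self, feat_ch, attr_dim, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(feat_ch + attr_dim, out_ch, kernel_size=1)

    def forward(self, feat, attr):
        # feat: (B, C, H, W) encoded image; attr: (B, A) target attributes
        B, A = attr.shape
        a = attr[:, :, None, None].expand(B, A, feat.shape[2], feat.shape[3])
        return self.fuse(torch.cat([feat, a], dim=1))

# e.g. merge a 10-dim attribute code into a 256-channel encoding:
merge = AttributeMerge(feat_ch=256, attr_dim=10, out_ch=256)
out = merge(torch.randn(2, 256, 8, 8), torch.randn(2, 10))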