Publications


Featured research published by Pascal Mettes.


International Conference on Multimedia Retrieval | 2016

The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection

Pascal Mettes; Dennis Koelma; Cees G. M. Snoek

This paper strives for video event detection using a representation learned from deep convolutional neural networks. Different from the leading approaches, which all learn from the 1,000 classes defined in the ImageNet Large Scale Visual Recognition Challenge, we investigate how to leverage the complete ImageNet hierarchy for pre-training deep networks. To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images. Experiments on the TRECVID Multimedia Event Detection 2013 and 2015 datasets show that video representations derived from the layers of a deep neural network pre-trained with our reorganized hierarchy i) improve over standard pre-training, ii) are complementary among different reorganizations, iii) maintain the benefits of fusion with other modalities, and iv) lead to state-of-the-art event detection results. The reorganized hierarchies and their derived Caffe models are publicly available at http://tinyurl.com/imagenetshuffle.
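
To make the reorganization concrete, here is a minimal Python sketch of the bottom-up step, assuming a dictionary-based class tree and an illustrative image threshold; it is a simplified reading, not the paper's exact procedure.

    # Minimal sketch of bottom-up merging: classes with too few images are
    # absorbed into their parent, so pre-training sees better-populated classes.
    # The dict-based tree and the 1,000-image threshold are assumptions.

    def merge_small_classes(children, counts, node, min_images=1000):
        """Return (image total, kept class list) for the subtree at `node`."""
        total = counts.get(node, 0)
        kept = []
        for child in children.get(node, []):
            sub_total, sub_kept = merge_small_classes(children, counts, child, min_images)
            if sub_total < min_images:
                total += sub_total      # absorb the under-populated subtree
            else:
                kept.extend(sub_kept)   # the subtree keeps its own classes
        if total >= min_images:
            kept.append(node)
        return total, kept

    # Toy hierarchy: 'animal' has two leaf classes, one too small to keep.
    children = {'animal': ['dog', 'axolotl']}
    counts = {'animal': 1300, 'dog': 1500, 'axolotl': 40}
    _, classes = merge_small_classes(children, counts, 'animal')
    print(classes)  # ['dog', 'animal'] -- 'axolotl' is merged into 'animal'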


European Conference on Computer Vision | 2016

Spot On: Action Localization from Pointly-Supervised Proposals

Pascal Mettes; Jan C. van Gemert; Cees G. M. Snoek

We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated boxes. Annotating action boxes in video is cumbersome, tedious, and error-prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at http://tinyurl.com/hollywood2tubes.
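
As a simplified illustration of the overlap between point annotations and proposals (an assumption-level reading; the paper's actual measure is more involved), a proposal can be scored by the fraction of annotated points it contains:

    # Minimal sketch: for every annotated frame, check whether the point
    # falls inside the proposal's box on that frame.

    def point_overlap(proposal, points):
        """proposal: {frame: (x0, y0, x1, y1)}, points: {frame: (x, y)}."""
        hits, total = 0, 0
        for frame, (px, py) in points.items():
            if frame not in proposal:
                continue                 # the proposal does not span this frame
            x0, y0, x1, y1 = proposal[frame]
            total += 1
            hits += int(x0 <= px <= x1 and y0 <= py <= y1)
        return hits / total if total else 0.0

    proposal = {0: (10, 10, 50, 50), 1: (12, 12, 52, 52)}
    points = {0: (30, 30), 1: (80, 80)}      # the second point misses the box
    print(point_overlap(proposal, points))   # 0.5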


International Conference on Multimedia Retrieval | 2015

Bag-of-Fragments: Selecting and Encoding Video Fragments for Event Detection and Recounting

Pascal Mettes; Jan C. van Gemert; Spencer Cappallo; Thomas Mensink; Cees G. M. Snoek

The goal of this paper is event detection and recounting using a representation of concept detector scores. Different from existing work, which encodes videos by averaging concept scores over all frames, we propose to encode videos using fragments that are discriminatively learned per event. Our bag-of-fragments splits a video into semantically coherent fragment proposals. From training video proposals we show how to select the most discriminative fragment for an event. An encoding of a video is in turn generated by matching and pooling these discriminative fragments to the fragment proposals of the video. The bag-of-fragments forms an effective encoding for event detection and is able to provide a precise temporally localized event recounting. Furthermore, we show how bag-of-fragments can be extended to deal with irrelevant concepts in the event recounting. Experiments on challenging web videos show that i) our modest number of fragment proposals gives a high sub-event recall, ii) bag-of-fragments is complementary to global averaging and provides better event detection, iii) bag-of-fragments with concept filtering yields a desirable event recounting. We conclude that fragments matter for video event detection and recounting.
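
The matching-and-pooling encoding can be sketched as follows, assuming each fragment is summarized by a vector of concept scores; the cosine similarity and max-pooling choices below are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def encode_video(proposals, fragments):
        """proposals: (n, d) concept scores of a video's fragment proposals,
        fragments: (k, d) learned discriminative fragments; returns (k,)."""
        p = proposals / np.linalg.norm(proposals, axis=1, keepdims=True)
        f = fragments / np.linalg.norm(fragments, axis=1, keepdims=True)
        sims = p @ f.T                   # cosine similarity, shape (n, k)
        return sims.max(axis=0)          # pool the best-matching proposal per fragment

    rng = np.random.default_rng(0)
    proposals = rng.random((20, 128))    # 20 proposals, 128 concept scores each
    fragments = rng.random((5, 128))     # 5 discriminative fragments
    print(encode_video(proposals, fragments).shape)  # (5,)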


European Conference on Computer Vision | 2014

Nature Conservation Drones for Automatic Localization and Counting of Animals

Jan C. van Gemert; Camiel R. Verschoor; Pascal Mettes; Kitso Epema; Lian Pin Koh; Serge A. Wich

This paper is concerned with nature conservation by automatically monitoring animal distribution and animal abundance. Typically, such conservation tasks are performed manually on foot or after an aerial recording from a manned aircraft. Such manual approaches are expensive, slow, and labor-intensive. In this paper, we investigate the combination of small unmanned aerial vehicles (UAVs or “drones”) with automatic object recognition techniques as a viable alternative to manual animal surveying. Since no controlled data is available, we record our own animal conservation dataset with a quadcopter drone. We evaluate two nature conservation tasks: (i) animal detection and (ii) animal counting, using three state-of-the-art generic object recognition methods that are particularly well-suited for on-board detection. Results show that object detection techniques for human-scale photographs do not directly translate to a drone perspective, but that light-weight automatic object detection techniques are promising for nature conservation tasks.
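
A generic counting-by-detection sketch, assuming a detector has already produced scored boxes; the suppression thresholds are illustrative and this is not the paper's specific pipeline, which evaluates three existing recognition methods.

    def iou(a, b):
        """Intersection over union of two (x0, y0, x1, y1) boxes."""
        x0, y0 = max(a[0], b[0]), max(a[1], b[1])
        x1, y1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x1 - x0) * max(0, y1 - y0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def count_animals(dets, score_thr=0.5, iou_thr=0.3):
        """dets: list of (score, box); count detections after suppression."""
        dets = sorted((d for d in dets if d[0] >= score_thr), reverse=True)
        kept = []
        for score, box in dets:
            if all(iou(box, k) < iou_thr for k in kept):
                kept.append(box)        # a new animal, not a duplicate box
        return len(kept)

    dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (50, 50, 60, 60))]
    print(count_animals(dets))          # 2: the overlapping pair counts once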


Computer Vision and Image Understanding | 2016

No spare parts

Pascal Mettes; Jan C. van Gemert; Cees G. M. Snoek

Highlights: We establish that three part types are relevant for image categorization, all naturally shared between categories when learning a part representation. We present an algorithm for part selection, part sharing, and image categorization by extending the AdaBoost optimization. We extend our joint optimization to a fusion with global image representations. We further improve over deep convolutional networks for image categorization.

This work aims for image categorization by learning a representation of discriminative parts. Different from most existing part-based methods, we argue that parts are naturally shared between image categories and should be modeled as such. We motivate our approach with a quantitative and qualitative analysis by backtracking where selected parts come from. Our analysis shows that in addition to the category parts defining the category, the parts coming from the background context and parts from other image categories improve categorization performance. Part selection should not be done separately for each category, but instead be shared and optimized over all categories. To incorporate part sharing between categories, we present an algorithm based on AdaBoost to optimize part sharing and selection, as well as fusion with the global image representation. With a single algorithm and without the need for task-specific optimization, we achieve results competitive to the state-of-the-art on object, scene, and action categories, further improving over deep convolutional neural networks and alternative part representations.
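
The sharing idea can be sketched as a single boosting round that scores candidate parts by their weighted error summed over all categories, so a part that helps several categories beats one that is discriminative for a single category only. This is an illustrative simplification, not the paper's full AdaBoost extension.

    import numpy as np

    def select_shared_part(part_scores, labels, weights):
        """part_scores: (n_images, n_parts) part detection scores,
        labels: (n_images, n_categories) in {-1, +1},
        weights: (n_images, n_categories) boosting weights."""
        # Threshold every part at its mean score: one decision stump per part.
        preds = np.where(part_scores > part_scores.mean(axis=0), 1, -1)
        errors = np.zeros(part_scores.shape[1])
        for p in range(part_scores.shape[1]):
            miss = preds[:, p:p + 1] != labels   # (n_images, n_categories)
            errors[p] = (weights * miss).sum()   # shared: summed over categories
        return int(errors.argmin())              # part that helps all categories most

    rng = np.random.default_rng(1)
    scores = rng.random((100, 30))               # 100 images, 30 candidate parts
    labels = rng.choice([-1, 1], size=(100, 4))  # 4 categories, +/-1 labels
    weights = np.full((100, 4), 1.0 / 100)       # uniform boosting weights
    print(select_shared_part(scores, labels, weights))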


Computer Vision and Image Understanding | 2017

Water detection through spatio-temporal invariant descriptors

Pascal Mettes; Robby T. Tan; Remco C. Veltkamp

Highlights: We introduce a video pre-processing step to remove background reflections and inherent water colours. We introduce a hybrid spatial and temporal descriptor for local water classification. We introduce a new dataset, the Video Water Database, for experimental evaluation and to encourage research into water detection. We show experimentally that our water detection method improves over methods from dynamic texture and material recognition.

In this work, we aim to segment and detect water in videos. Water detection is beneficial for applications such as video search, outdoor surveillance, and systems such as unmanned ground vehicles and unmanned aerial vehicles. This specific problem, however, has received less attention than general texture recognition. Here, we analyze several motion properties of water. First, we describe a video pre-processing step to increase invariance against water reflections and water colours. Second, we investigate the temporal and spatial properties of water and derive corresponding local descriptors. The descriptors are used to locally classify the presence of water, and a binary water detection mask is generated through spatio-temporal Markov Random Field regularization of the local classifications. Third, we introduce the Video Water Database, containing several hours of water and non-water videos, to validate our algorithm. Experimental evaluation on the Video Water Database and the DynTex database indicates the effectiveness of the proposed algorithm, outperforming multiple algorithms for dynamic texture recognition and material recognition.
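
A hypothetical form of the temporal part of such a descriptor, assuming that local water motion shows up as a characteristic frequency profile of patch intensity over time; the paper's actual descriptor may differ.

    import numpy as np

    def temporal_descriptor(intensity_over_time):
        """intensity_over_time: (T,) mean intensity of a local patch per frame."""
        signal = intensity_over_time - intensity_over_time.mean()  # drop colour/DC
        spectrum = np.abs(np.fft.rfft(signal))
        return spectrum / (spectrum.sum() + 1e-8)  # normalized frequency profile

    t = np.arange(100)
    noise = 0.3 * np.random.default_rng(2).standard_normal(100)
    water = np.sin(0.8 * t) + noise               # repetitive water-like motion
    print(temporal_descriptor(water).argmax())    # dominant frequency bin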


International Conference on Image Processing | 2016

Featureless: Bypassing feature extraction in action categorization

Silvia L. Pintea; Pascal Mettes; J.C. van Gemert; Arnold W. M. Smeulders

This method introduces an efficient way of learning action categories without the need for feature estimation. The approach starts from low-level values, in a similar style to the successful CNN methods. However, rather than extracting general image features, we learn to predict specific video representations from raw video data. The benefit of such an approach is that, at the same computational expense, it can predict 2D video representations as well as 3D ones based on motion. The proposed model relies on discriminative Wald-boost, which we enhance to a multiclass formulation for the purpose of learning video representations. The suitability of the proposed approach, as well as its time efficiency, is tested on the UCF11 action recognition dataset.
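
As a stand-in for the paper's multiclass Wald-boost (not reproduced here), the sketch below uses scikit-learn's gradient boosting to regress one representation dimension directly from raw values; the data and shapes are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(3)
    raw = rng.random((200, 64))            # 200 clips, 64 raw pixel values each
    target = raw @ rng.random(64)          # one dimension of a target representation

    # Boosted trees map raw values to the representation dimension directly,
    # bypassing a hand-crafted feature extraction stage.
    model = GradientBoostingRegressor(n_estimators=50).fit(raw, target)
    print(round(model.score(raw, target), 3))  # training-set fit quality (R^2)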


ACM Multimedia | 2016

Weakly-Supervised Recognition, Localization, and Explanation of Visual Entities

Pascal Mettes

To learn from visual collections, manual annotations are required. Humans, however, can no longer keep up with providing strong and time-consuming annotations for the ever-increasing wealth of visual data. As a result, approaches are required that can learn from fast and weak forms of annotation in visual data. This doctoral symposium paper summarizes my ongoing PhD dissertation on how to utilize weakly-supervised annotations to recognize, localize, and explain visual entities in images and videos. In this context, visual entities denote objects, scenes, and actions (in images), and actions and events (in videos). The summary covers four publications. For each publication, we discuss the current state-of-the-art, as well as our proposed novelties and performed experiments. The end of the summary discusses several possibilities to extend the dissertation.


International Journal of Computer Vision | 2018

Pointly-Supervised Action Localization

Pascal Mettes; Cees G. M. Snoek

This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a multiple instance learning optimization. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of spatio-temporal proposals. We outline five spatial and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (1) is as effective as traditional box-supervision at a fraction of the annotation cost, (2) is robust to sparse and noisy point annotations, (3) benefits from pseudo-points during inference, and (4) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.
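
One of the pseudo-points can be sketched as a simple center-bias cue; the specific formulation below is an assumption for illustration, not the paper's definition of its five spatial and one temporal pseudo-points.

    def center_pseudo_point(width, height):
        """Hypothetical center-bias cue: actions tend to appear near the center."""
        return (width / 2.0, height / 2.0)

    def proposal_score(proposal, width, height):
        """proposal: {frame: (x0, y0, x1, y1)}; fraction of frames covering the cue."""
        cx, cy = center_pseudo_point(width, height)
        hits = sum(x0 <= cx <= x1 and y0 <= cy <= y1
                   for x0, y0, x1, y1 in proposal.values())
        return hits / len(proposal)

    proposal = {0: (100, 50, 220, 200), 1: (110, 60, 230, 210)}
    print(proposal_score(proposal, width=320, height=240))  # 1.0: always covers center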


International Conference on Multimedia Retrieval | 2017

Music-Guided Video Summarization using Quadratic Assignments

Thomas Mensink; Thomas Jongstra; Pascal Mettes; Cees G. M. Snoek

This paper aims to automatically generate a summary of an unedited video, guided by an externally provided music track. The tempo, energy, and beats in the music determine the choices and cuts in the video summarization. To solve this challenging task, we model video summarization as a quadratic assignment problem. We assign frames to the summary, using rewards based on frame interestingness, plot coherency, audio-visual match, and cut properties. Experimentally we validate our approach on the SumMe dataset. The results show that our music-guided summaries are more appealing, and even outperform the current state-of-the-art summarization methods when evaluated on the F1 measure of precision and recall.
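
Since a quadratic assignment problem is expensive to solve exactly, the sketch below uses a greedy stand-in to convey the matching idea, reducing the paper's several reward terms to a single assumed "energy" cue per frame and per music segment.

    import numpy as np

    def assign_frames(frame_energy, music_energy):
        """Greedily pick, per music segment, the unused frame closest in energy."""
        frame_energy = np.asarray(frame_energy, dtype=float)
        used, summary = set(), []
        for slot_energy in music_energy:
            cost = np.abs(frame_energy - slot_energy)
            for u in used:
                cost[u] = np.inf         # each frame can appear only once
            best = int(cost.argmin())
            used.add(best)
            summary.append(best)
        return summary

    frames = [0.1, 0.9, 0.4, 0.6]        # per-frame motion energy (toy values)
    music = [0.85, 0.15]                 # desired energy per beat segment
    print(assign_frames(frames, music))  # [1, 0]: the energetic beat gets frame 1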

Collaboration


Dive into Pascal Mettes's collaborations.

Top Co-Authors

Jan C. van Gemert

Delft University of Technology
