Michael Gygli
ETH Zurich
Publications
Featured research published by Michael Gygli.
European Conference on Computer Vision | 2014
Michael Gygli; Helmut Grabner; Hayko Riemenschneider; Luc Van Gool
This paper proposes a novel approach and a new benchmark for video summarization. We focus on user videos, which are raw videos containing a set of interesting events. Our method starts by segmenting the video using a novel "superframe" segmentation tailored to raw videos. Then, we estimate visual interestingness per superframe using a set of low-, mid- and high-level features. Based on this scoring, we select an optimal subset of superframes to create an informative and interesting summary. The introduced benchmark comes with multiple human-created summaries, which were acquired in a controlled psychological experiment. This data paves the way to evaluate summarization methods objectively and to gain new insights into video summarization. When evaluating our method, we find that it generates high-quality results, comparable to manual, human-created summaries.
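A minimal sketch of the selection step this abstract describes, under the simplest possible formulation: superframes with interestingness scores and durations are chosen to maximize total score under a summary-length budget, solved as a 0/1 knapsack. Function names and numbers are illustrative, not taken from the paper.

```python
# Hypothetical sketch of superframe selection: pick a subset of superframes
# that maximizes total interestingness under a length budget (0/1 knapsack
# solved by dynamic programming).
def select_superframes(scores, lengths, budget):
    """scores[i]: interestingness of superframe i,
    lengths[i]: its duration in frames, budget: maximum summary length."""
    n = len(scores)
    # dp[b] = (best total score with total length <= b, chosen indices)
    dp = [(0.0, [])] * (budget + 1)
    for i in range(n):
        new_dp = dp[:]
        for b in range(lengths[i], budget + 1):
            cand = dp[b - lengths[i]][0] + scores[i]
            if cand > new_dp[b][0]:
                new_dp[b] = (cand, dp[b - lengths[i]][1] + [i])
        dp = new_dp
    return max(dp, key=lambda t: t[0])[1]

# Example: three superframes with scores and lengths, 100-frame budget.
print(select_superframes([0.9, 0.4, 0.7], [60, 30, 50], 100))  # -> [0, 1]
```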
Computer Vision and Pattern Recognition | 2015
Michael Gygli; Helmut Grabner; Luc Van Gool
We present a novel method for summarizing raw, casually captured videos. The objective is to create a short summary that still conveys the story. It should thus be both interesting and representative of the input video. Previous methods often used simplified assumptions and only optimized for one of these goals. Alternatively, they used hand-defined objectives that were optimized sequentially by making consecutive hard decisions, which limits their use to a particular setting. Instead, we introduce a new method that (i) uses a supervised approach to learn the importance of global characteristics of a summary and (ii) jointly optimizes for multiple objectives, thus creating summaries that possess multiple properties of a good summary. Experiments on two challenging and very diverse datasets demonstrate the effectiveness of our method, which outperforms or matches the current state of the art.
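A simplified illustration of the joint-objective idea, not the paper's actual learning or optimization procedure: a candidate summary is scored by a weighted combination of two global properties, interestingness and representativeness, and shots are chosen greedily. All names, weights, and data below are hypothetical.

```python
# Simplified illustration: score a candidate summary with a weighted sum of
# global objectives and pick shots greedily up to a fixed summary size.
import numpy as np

def summary_score(selected, interest, feats, weights):
    """Weighted combination of two global properties: total interestingness
    and representativeness (coverage of the shot feature space)."""
    interestingness = sum(interest[i] for i in selected)
    # Representativeness: negative mean distance of every shot to its
    # nearest selected shot (higher is better).
    d = np.linalg.norm(feats[:, None, :] - feats[None, selected, :], axis=-1)
    representativeness = -d.min(axis=1).mean()
    return weights[0] * interestingness + weights[1] * representativeness

def greedy_summary(interest, feats, weights, k):
    selected = []
    for _ in range(k):
        rest = [i for i in range(len(interest)) if i not in selected]
        best = max(rest, key=lambda i: summary_score(selected + [i],
                                                     interest, feats, weights))
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))   # per-shot features (placeholder)
interest = rng.random(20)          # per-shot interestingness (placeholder)
print(greedy_summary(interest, feats, weights=(1.0, 2.0), k=3))
```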
Computer Vision and Pattern Recognition | 2016
Michael Gygli; Yale Song; Liangliang Cao
We introduce the novel problem of automatically generating animated GIFs from video. GIFs are short looping videos with no sound, a combination of image and video that readily captures our attention. GIFs tell a story, express emotion, turn events into humorous moments, and are the new wave of photojournalism. We pose the question: can we automate the entirely manual and elaborate process of GIF creation by leveraging the plethora of user-generated GIF content? We propose a Robust Deep RankNet that, given a video, generates a ranked list of its segments according to their suitability as a GIF. We train our model to learn what visual content is often selected for GIFs by using over 100K user-generated GIFs and their corresponding video sources. We deal effectively with the noisy web data by proposing a novel adaptive Huber loss in the ranking formulation. We show that our approach is robust to outliers and picks up several patterns that are frequently present in popular animated GIFs. On our new large-scale benchmark dataset, we show the advantage of our approach over several state-of-the-art methods.
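To make the loss concrete, here is a hedged sketch of a Huber-style pairwise ranking loss in the spirit of the formulation the abstract mentions; the margin, delta, and the adaptivity details are placeholders rather than the paper's exact choices.

```python
import torch

def huber_rank_loss(s_pos, s_neg, margin=1.0, delta=1.5):
    """Robust pairwise ranking loss sketch (constants are illustrative).
    The violation u = margin - (s_pos - s_neg) is penalized quadratically
    when small and linearly when large, so outlier pairs from noisy web
    labels do not dominate the gradient."""
    u = torch.clamp(margin - (s_pos - s_neg), min=0.0)
    quad = 0.5 * u ** 2
    lin = delta * (u - 0.5 * delta)
    return torch.where(u <= delta, quad, lin).mean()

# Scores a model produced for GIF-worthy vs. non-GIF-worthy segments.
s_pos = torch.tensor([2.0, 0.3, 1.5])
s_neg = torch.tensor([0.5, 1.0, -4.0])
print(huber_rank_loss(s_pos, s_neg))
```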
Conference of the International Speech Communication Association | 2016
Naoya Takahashi; Michael Gygli; Beat Pfister; Luc Van Gool
We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate this long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables training audio event detection end-to-end. Our architecture is inspired by the success of VGGNet and uses small 3x3 convolutions, but with greater depth than previous AED methods. In order to prevent over-fitting and to take full advantage of the modeling capabilities of our network, we further propose a novel data augmentation method to introduce data variation. Experimental results show that our CNN significantly outperforms state-of-the-art methods, including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
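A rough sketch of the kind of architecture described, assuming a log-mel spectrogram input: stacks of small 3x3 convolutions followed by a clip-level classifier. Layer counts, channel widths, and the number of classes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

# VGG-style AED sketch: small 3x3 convolutions over a long log-mel
# spectrogram, pooled into a clip-level prediction.
def vgg_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

class AEDNet(nn.Module):
    def __init__(self, n_classes=28):          # class count is a placeholder
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 32), vgg_block(32, 64), vgg_block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)     # collapse time and frequency
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, 1, mel_bins, frames)
        h = self.pool(self.features(x)).flatten(1)
        return self.classifier(h)

logmel = torch.randn(4, 1, 64, 400)             # ~4 s of audio at 10 ms hops
print(AEDNet()(logmel).shape)                   # -> torch.Size([4, 28])
```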
Computer Vision and Pattern Recognition | 2013
Xavier Boix; Michael Gygli; Gemma Roig; Luc Van Gool
The representation of local image patches is crucial for the good performance and efficiency of many vision tasks. Patch descriptors have been designed to generalize towards diverse variations, depending on the application, as well as on the desired compromise between accuracy and efficiency. We present a novel formulation of patch description that serves these needs well. Sparse quantization lies at its heart. This allows for efficient encodings, leading to powerful, novel binary descriptors, as well as to the generalization of existing descriptors like SIFT or BRIEF. We demonstrate the capabilities of our formulation for both keypoint matching and image classification. Our binary descriptors achieve state-of-the-art results on two keypoint matching benchmarks, namely those by Brown and Mikolajczyk. For image classification, we propose new descriptors that perform similarly to SIFT on Caltech101 and PASCAL VOC07.
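As a toy illustration of sparse quantization, the sketch below keeps only the k largest entries of a patch feature vector and binarizes them, yielding a k-sparse binary code comparable under Hamming distance; this is a simplification of the paper's more general formulation.

```python
import numpy as np

def sparse_quantize(v, k):
    """Toy sparse quantization: keep the k largest entries of a patch
    feature vector and binarize, giving a k-sparse binary descriptor."""
    code = np.zeros_like(v, dtype=np.uint8)
    code[np.argsort(v)[-k:]] = 1
    return code

def hamming(a, b):
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
p1 = rng.random(64)                    # e.g. pooled gradient responses of a patch
p2 = p1 + 0.05 * rng.normal(size=64)   # slightly perturbed version of the patch
print(hamming(sparse_quantize(p1, 8), sparse_quantize(p2, 8)))  # small distance
```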
IEEE Transactions on Multimedia | 2018
Naoya Takahashi; Michael Gygli; Luc Van Gool
We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end to end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learned generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features and traditional audio features, such as mel frequency cepstral coefficients, typically only leads to marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
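A minimal sketch of the fusion idea, assuming clip-level features are already extracted: audio features (a stand-in for pooled AENet activations) are concatenated with visual features and fed to a simple classifier. All dimensions and data below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Late-fusion sketch: concatenate clip-level audio and visual features and
# train a simple classifier (e.g. highlight vs. non-highlight).
rng = np.random.default_rng(0)
n_clips = 200
audio_feats = rng.normal(size=(n_clips, 128))    # stand-in for AENet features
visual_feats = rng.normal(size=(n_clips, 256))   # stand-in for visual CNN features
labels = rng.integers(0, 2, size=n_clips)        # placeholder labels

fused = np.concatenate([audio_feats, visual_feats], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))
```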
Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Interacting with Maps | 2014
Julien Weissenberg; Michael Gygli; Hayko Riemenschneider; Luc Van Gool
Navigation has been greatly improved by positioning systems, but visualization still relies on maps. Yet because they only represent an abstract street network, maps are sometimes difficult to read. Conversely, tourist maps, which are enriched with landmark drawings, have been shown to be much more intuitive to understand. However, outside the very centres of cities, major landmarks are too sparse to be helpful. In this work, we present a method to automatically augment maps with the most locally prominent buildings, at multiple scales. Further, we generate a characterization that helps emphasize the special attributes of these buildings. Descriptive features are extracted from facades, analyzed, and re-ranked to match human perception. To do so, we collected a total of over 5,900 human annotations to characterize 117 facades across three different cities. Finally, the characterizations are also used to produce natural language descriptions of the facades.
ACM Multimedia | 2016
Michael Gygli; Mohammad Soleymani
Animated GIFs have regained huge popularity. They are used in instant messaging, online journalism, and social media, among other contexts. In this paper, we present an in-depth study on the interestingness of GIFs. We create and annotate a dataset with a set of affective labels, which allows us to investigate the sources of interest. We show that GIFs of pets are considered more interesting than GIFs of people. Furthermore, we study the connection of interest to other features and factors such as popularity. Finally, we build a predictive model and show that it can estimate GIF interestingness with high accuracy. Our model outperforms existing methods for GIF popularity prediction, as well as a model based on still-image interestingness, by a large margin. We envision that the insights and method developed here can be used for the automatic recognition and generation of interesting GIFs.
Computer Vision and Pattern Recognition | 2016
Anna Volokitin; Michael Gygli; Xavier Boix
Many computational models of visual attention use image features and machine learning techniques to predict eye fixation locations as saliency maps. Recently, the success of Deep Convolutional Neural Networks (DCNNs) for object recognition has opened a new avenue for computational models of visual attention, due to the tight link between visual attention and object recognition. In this paper, we show that, using features from DCNNs trained for object recognition, we can make predictions that enrich the information provided by saliency models. Namely, we can estimate the reliability of a saliency model from the raw image, which serves as a meta-saliency measure that may be used to select the best saliency algorithm for an image. Analogously, the consistency of the eye fixations among subjects, i.e., the agreement between the eye fixation locations of different subjects, can also be predicted and used by a designer to assess whether subjects reach a consensus about salient image locations.
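A hypothetical sketch of the meta-saliency idea: regress, from image features standing in for DCNN activations, how reliable a given saliency model will be on that image (e.g. its score against eye fixations). The features, targets, and regressor choice below are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVR

# Meta-saliency sketch: predict a saliency model's per-image reliability
# from image features, so the best saliency algorithm can be chosen per image.
rng = np.random.default_rng(0)
dcnn_feats = rng.normal(size=(300, 512))      # placeholder for pooled DCNN features
model_auc = rng.uniform(0.5, 0.95, size=300)  # placeholder reliability targets

reg = SVR().fit(dcnn_feats[:250], model_auc[:250])
pred = reg.predict(dcnn_feats[250:])
print(pred[:5])
```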
ACM Multimedia | 2017
Arun Balajee Vasudevan; Michael Gygli; Anna Volokitin; Luc Van Gool
Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries that are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against the previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms it on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance.
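A toy sketch of relevance scoring in a shared textual-visual embedding space, under the assumption that frame and query embeddings are already available: frames are scored by cosine similarity to the query and selected greedily while penalizing redundancy. The embeddings and trade-off below are placeholders, not the paper's learned model or objective.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def select_frames(frame_emb, query_emb, k, trade_off=0.7):
    """Toy query-relevant selection: score each frame by similarity to the
    query in the joint embedding space and greedily add frames that are
    relevant but not redundant with already selected ones."""
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(frame_emb)):
            if i in selected:
                continue
            relevance = cosine(frame_emb[i], query_emb)
            redundancy = max((cosine(frame_emb[i], frame_emb[j])
                              for j in selected), default=0.0)
            score = trade_off * relevance - (1 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
frames = rng.normal(size=(50, 300))   # frame embeddings in the joint space
query = rng.normal(size=300)          # embedded text query
print(select_frames(frames, query, k=5))
```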