Hilde Kuehne
University of Bonn
Publications
Featured research published by Hilde Kuehne.
computer vision and pattern recognition | 2014
Hilde Kuehne; Ali Bilgin Arslan; Thomas Serre
This paper describes a framework for modeling human activities as temporally structured processes. Our approach is motivated by the inherently hierarchical nature of human activities and the close correspondence between human actions and speech: we model action units using Hidden Markov Models, much like words in speech. These action units then form the building blocks for modeling complex human activities as sentences using an action grammar. To evaluate our approach, we collected a large dataset of daily cooking activities: 52 participants, each performing 10 cooking activities in multiple real-life kitchens, resulting in over 77 hours of video footage. We evaluate the HTK toolkit, a state-of-the-art speech recognition engine, in combination with multiple video feature descriptors, both for the recognition of cooking activities (e.g., making pancakes) and for the semantic parsing of videos into action units (e.g., cracking eggs). Our results demonstrate the benefits of structured temporal generative approaches over existing discriminative approaches in coping with the complexity of human daily life activities.
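The speech-recognition analogy maps naturally onto off-the-shelf HMM tooling. The sketch below is a loose illustration of that idea, not the authors' HTK pipeline: each action unit gets its own Gaussian HMM, an activity "grammar" is just an ordered list of units, and full grammar-constrained decoding is simplified to a uniform split of the test video. All unit names, feature dimensions, and data are placeholders.

```python
# Hypothetical sketch (not the authors' code): per-unit HMMs composed by a
# simple activity "grammar", loosely mirroring the speech analogy above.
import numpy as np
from hmmlearn.hmm import GaussianHMM

FEAT_DIM = 16  # stand-in for a per-frame video descriptor (e.g. dense-trajectory features)

def train_unit_hmm(clips, n_states=3):
    """Fit one HMM on all training clips of a single action unit."""
    X = np.vstack(clips)                               # (total_frames, FEAT_DIM)
    lengths = [len(c) for c in clips]                  # frames per clip
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    hmm.fit(X, lengths)
    return hmm

rng = np.random.default_rng(0)
# Toy training clips per action unit (random stand-ins for real descriptors).
units = {"crack_egg": [rng.normal(0.0, 1.0, (30, FEAT_DIM)) for _ in range(3)],
         "stir":      [rng.normal(1.0, 1.0, (40, FEAT_DIM)) for _ in range(3)]}
unit_hmms = {name: train_unit_hmm(clips) for name, clips in units.items()}

# A "grammar": each complex activity is an ordered sequence of action units.
grammars = {"make_pancakes": ["crack_egg", "stir"],
            "make_tea":      ["stir"]}

def score_activity(video, unit_sequence):
    """Score a video under one grammar by splitting it uniformly across its units and
    summing per-segment HMM log-likelihoods (a stand-in for full grammar decoding)."""
    bounds = np.linspace(0, len(video), len(unit_sequence) + 1).astype(int)
    return sum(unit_hmms[u].score(video[s:e])
               for u, s, e in zip(unit_sequence, bounds[:-1], bounds[1:]))

test_video = rng.normal(0.5, 1.0, (70, FEAT_DIM))
best = max(grammars, key=lambda name: score_activity(test_video, grammars[name]))
print("recognized activity:", best)
```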
workshop on applications of computer vision | 2016
Hilde Kuehne; Juergen Gall; Thomas Serre
We describe an end-to-end generative approach for the segmentation and recognition of human activities. In this approach, a visual representation based on reduced Fisher Vectors is combined with a structured temporal model for recognition. We show that the statistical properties of Fisher Vectors make them an especially suitable front-end for generative models such as Gaussian mixtures. The system is evaluated both for the recognition of complex activities and for their parsing into action units. Using a variety of video datasets ranging from human cooking activities to animal behaviors, our experiments demonstrate that the resulting architecture outperforms state-of-the-art approaches on larger datasets, i.e., when a sufficient amount of data is available for training structured generative models.
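To make the Fisher-Vector front-end concrete, the following sketch encodes per-frame descriptors with only the mean-gradient part of the Fisher Vector (one reading of "reduced") and fits a small Gaussian mixture per activity class on the resulting vectors. It is a simplification under assumptions of mine, not the paper's pipeline; class names and all data are synthetic.

```python
# Rough sketch: mean-gradient Fisher Vectors over a background GMM, fed into a
# Gaussian-mixture classifier per class (stand-in for the generative back-end).
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_mu(descriptors, gmm):
    """Mean-gradient Fisher Vector of local descriptors under a diag-covariance GMM."""
    T = len(descriptors)
    gamma = gmm.predict_proba(descriptors)                  # (T, K) posteriors
    sigma = np.sqrt(gmm.covariances_)                       # (K, D) per-dim std devs
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / sigma[k]     # (T, D)
        fv.append((gamma[:, [k]] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalization

rng = np.random.default_rng(1)
# Background GMM ("visual vocabulary") fit on pooled local descriptors.
pool = rng.normal(size=(500, 8))
bg_gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(pool)

def encode(n_videos, shift):
    """Encode a few toy 'videos' (random frame descriptors) as Fisher Vectors."""
    return np.array([fisher_vector_mu(rng.normal(shift, 1.0, (60, 8)), bg_gmm)
                     for _ in range(n_videos)])

# One small Gaussian mixture per activity class, fit on the encoded training videos.
class_models = {c: GaussianMixture(n_components=1, covariance_type="diag",
                                   reg_covar=1e-3, random_state=0).fit(encode(10, s))
                for c, s in {"cooking": 0.0, "grooming": 1.0}.items()}

test_fv = fisher_vector_mu(rng.normal(1.0, 1.0, (60, 8)), bg_gmm)
pred = max(class_models, key=lambda c: class_models[c].score(test_fv[None, :]))
print("predicted class:", pred)
```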
Computer Vision and Image Understanding | 2017
Hilde Kuehne; Alexander Richard; Juergen Gall
We present an approach for weakly supervised learning of human actions from video transcriptions. Our system is based on the idea that, given a sequence of input data and a transcript, i.e., a list of the actions in the order they occur in the video, it is possible to infer the actions within the video stream and to learn the related action models without the need for any frame-based annotation. Starting from the transcript information at hand, we split the given data sequences uniformly based on the number of expected actions. We then learn action models for each class by maximizing the probability that the training video sequences are generated by the action models given the sequence order defined by the transcripts. The learned model can be used to temporally segment an unseen video with or without a transcript. Additionally, the inferred segments can be used as a starting point to train high-level fully supervised models. We evaluate our approach on four distinct activity datasets, namely Hollywood Extended, MPII Cooking, Breakfast, and CRIM13. The evaluation shows that the proposed system is able to align the scripted actions with the video data, that the learned models localize and classify actions in the datasets, and that they outperform current state-of-the-art approaches for aligning transcripts with video data.
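The uniform-split initialization and the iterate-estimate-and-realign loop can be illustrated with a toy alignment procedure. The sketch below stands in for the paper's HMM-based action models with simple per-class Gaussians and a dynamic program that realigns segment boundaries under the transcript order; labels, feature dimensions, and data are invented.

```python
# Illustrative sketch (assumptions mine, not the paper's implementation):
# weakly supervised alignment of a transcript to a frame sequence.
import numpy as np

def uniform_split(n_frames, n_segments):
    """Initial segmentation: split the video evenly across the transcript entries."""
    bounds = np.linspace(0, n_frames, n_segments + 1).astype(int)
    return list(zip(bounds[:-1], bounds[1:]))

def fit_class_models(features, segments, transcript):
    """Per-class Gaussian (mean, diagonal variance) from the current segmentation."""
    models = {}
    for label in set(transcript):
        frames = np.vstack([features[s:e]
                            for (s, e), l in zip(segments, transcript) if l == label])
        models[label] = (frames.mean(0), frames.var(0) + 1e-3)
    return models

def frame_loglik(features, model):
    mu, var = model
    return -0.5 * (((features - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(1)

def realign(features, transcript, models):
    """Dynamic program over (frame, transcript position): best monotonic segmentation."""
    T, N = len(features), len(transcript)
    ll = np.stack([frame_loglik(features, models[l]) for l in transcript])    # (N, T)
    cum = np.concatenate([np.zeros((N, 1)), np.cumsum(ll, axis=1)], axis=1)   # prefix sums
    score = np.full((N + 1, T + 1), -np.inf); score[0, 0] = 0.0
    back = np.zeros((N + 1, T + 1), dtype=int)
    for n in range(1, N + 1):
        for t in range(n, T + 1):
            cand = score[n - 1, n - 1:t] + (cum[n - 1, t] - cum[n - 1, n - 1:t])
            best = int(np.argmax(cand))
            score[n, t], back[n, t] = cand[best], best + n - 1
    segments, t = [], T
    for n in range(N, 0, -1):
        s = back[n, t]; segments.append((s, t)); t = s
    return segments[::-1]

rng = np.random.default_rng(2)
transcript = ["take_bowl", "crack_egg", "stir"]
features = np.vstack([rng.normal(i, 0.5, (20 + 10 * i, 4)) for i in range(len(transcript))])

segments = uniform_split(len(features), len(transcript))
for _ in range(5):                         # iterate model estimation and realignment
    models = fit_class_models(features, segments, transcript)
    segments = realign(features, transcript, models)
print(dict(zip(transcript, segments)))
```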
computer vision and pattern recognition | 2017
Alexander Richard; Hilde Kuehne; Juergen Gall
We present an approach for weakly supervised learning of human actions. Given a set of videos and an ordered list of the occurring actions, the goal is to infer start and end frames of the related action classes within the video and to train the respective action classifiers without any need for hand-labeled frame boundaries. To address this task, we propose a combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model that allows for temporal alignment and inference over long sequences. While this system alone already generates good results, we show that the performance can be further improved by adapting the number of subactions to the characteristics of the different action classes. To this end, we adjust the number of subaction classes by iterating realignment and reestimation during training. The proposed system is evaluated on two benchmark datasets, Breakfast and Hollywood Extended, showing competitive performance on various weak learning tasks such as temporal action segmentation and action alignment.
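A minimal sketch of the discriminative subaction representation, assuming a GRU over per-frame features and a fixed number of subactions per action class (the paper adapts this number iteratively); this is my interpretation, not the released model, and all names and sizes are placeholders.

```python
# Minimal sketch: a recurrent network that scores subaction classes per frame,
# where each action class is split into a fixed number of subactions. The
# alignment/re-estimation loop would follow the same iterative scheme as above.
import torch
import torch.nn as nn

actions = ["pour_milk", "stir_milk"]
SUBACTIONS_PER_CLASS = 3                      # would be adapted per class during training
subaction_labels = [f"{a}_{i}" for a in actions for i in range(SUBACTIONS_PER_CLASS)]

class SubactionRNN(nn.Module):
    """GRU over per-frame features producing per-frame subaction log-probabilities."""
    def __init__(self, feat_dim=64, hidden=128, n_subactions=len(subaction_labels)):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_subactions)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        h, _ = self.gru(x)
        return torch.log_softmax(self.head(h), dim=-1)

model = SubactionRNN()
frames = torch.randn(1, 200, 64)              # one toy video of 200 frames
log_probs = model(frames)                     # (1, 200, len(subaction_labels))
# These per-frame scores would be combined with a coarse length/sequence model
# and realigned against the transcript, iterating both steps during training.
print(log_probs.shape)
```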
german conference on pattern recognition | 2017
Ahsan Iqbal; Alexander Richard; Hilde Kuehne; Juergen Gall
Action recognition is a fundamental problem in computer vision with many potential applications, such as video surveillance, human-computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize the actions happening within them. Historically, hand-crafted video features were used to address the task of action recognition. With the success of deep ConvNets as an image analysis method, many extensions of standard ConvNets were proposed to process variable-length video data. In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition. The approach extends ResNet, a state-of-the-art model for image classification. While the original formulation of ResNet aims at learning spatial residuals in its layers, we extend the approach by introducing recurrent connections that allow the network to learn a spatio-temporal residual. In contrast to fully recurrent networks, our temporal connections only allow a limited range of preceding frames to contribute to the output for the current frame, enabling efficient training and inference as well as limiting the temporal context to a reasonable local range around each frame. On a large-scale action recognition dataset, we show that our model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
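The recurrent residual idea can be sketched as a residual block whose residual branch also receives features from a small window of preceding frames, so the block learns a spatio-temporal rather than a purely spatial residual. The block below is a hypothetical simplification with made-up layer sizes and window handling, not the authors' architecture definition.

```python
# Hypothetical simplification of a recurrent residual block with a temporally
# local recurrent term over a limited window of preceding frames.
import torch
import torch.nn as nn

class RecurrentResidualBlock(nn.Module):
    """Residual block whose residual also mixes in features from `window` past frames."""
    def __init__(self, channels=64, window=2):
        super().__init__()
        self.window = window
        self.spatial = nn.Sequential(                      # standard ResNet-style branch
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.temporal = nn.Conv2d(channels, channels, 1)   # mixes in preceding-frame features
        self.relu = nn.ReLU(inplace=True)

    def forward(self, clip):                               # clip: (frames, C, H, W)
        outputs, history = [], []
        for x in clip:                                     # iterate over frames
            residual = self.spatial(x.unsqueeze(0))
            if history:                                    # limited temporal context only
                context = torch.stack(history[-self.window:]).mean(0)
                residual = residual + self.temporal(context)
            y = self.relu(x.unsqueeze(0) + residual)
            history.append(y.detach())                     # truncate gradients across frames
            outputs.append(y)
        return torch.cat(outputs)                          # (frames, C, H, W)

block = RecurrentResidualBlock()
clip = torch.randn(8, 64, 28, 28)                          # toy clip: 8 frames
print(block(clip).shape)
```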
arXiv: Computer Vision and Pattern Recognition | 2015
Hilde Kuehne; Thomas Serre
arXiv: Computer Vision and Pattern Recognition | 2015
Hilde Kuehne; Juergen Gall; Thomas Serre
computer vision and pattern recognition | 2018
Alexander Richard; Hilde Kuehne; Juergen Gall
computer vision and pattern recognition | 2018
Alexander Richard; Hilde Kuehne; Ahsan Iqbal; Juergen Gall
arXiv: Computer Vision and Pattern Recognition | 2017
Alexander Richard; Hilde Kuehne; Juergen Gall