Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Jan Kautz is active.

Publication


Featured research published by Jan Kautz.


computer vision and pattern recognition | 2015

Hand gesture recognition with 3D convolutional neural networks

Pavlo Molchanov; Shalini Gupta; Kihwan Kim; Jan Kautz

Touchless hand gesture recognition systems are becoming important in automotive user interfaces as they improve safety and comfort. Various computer vision algorithms have employed color and depth cameras for hand gesture recognition, but robust classification of gestures from different subjects performed under widely varying lighting conditions is still challenging. We propose an algorithm for driver hand gesture recognition from challenging depth and intensity data using 3D convolutional neural networks. Our solution combines information from multiple spatial scales for the final prediction. It also employs spatio-temporal data augmentation for more effective training and to reduce potential overfitting. Our method achieves a correct classification rate of 77.5% on the VIVA challenge dataset.
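
As a rough illustration of the multi-scale 3D CNN idea, here is a minimal sketch in PyTorch. The layer sizes, the two-scale scheme, and the input shapes are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: a small two-scale 3D CNN gesture classifier of the general
# kind described above. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScale3DCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.branch = nn.Sequential(        # shared per-scale feature extractor
            nn.Conv3d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)  # fuse both spatial scales

    def forward(self, x):                   # x: (batch, 2, frames, H, W)
        full = self.branch(x)               # full-resolution branch
        half = self.branch(F.interpolate(x, scale_factor=(1, 0.5, 0.5)))
        return self.head(torch.cat([full, half], dim=1))

model = TwoScale3DCNN(num_classes=19)       # 19 gesture classes, as in VIVA
clip = torch.randn(4, 2, 16, 64, 64)        # intensity + depth channels
logits = model(clip)
```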


international conference on computer graphics and interactive techniques | 2014

FlexISP: a flexible camera image processing framework

Felix Heide; Markus Steinberger; Yun-Ta Tsai; Mushfiqur Rouf; Dawid Pająk; Dikpal Reddy; Orazio Gallo; Jing Liu; Wolfgang Heidrich; Karen O. Egiazarian; Jan Kautz; Kari Pulli

Conventional pipelines for capturing, displaying, and storing images are usually defined as a series of cascaded modules, each responsible for addressing a particular problem. While this divide-and-conquer approach offers many benefits, it also introduces cumulative error, as each step in the pipeline only considers the output of the previous step, not the original sensor data. We propose an end-to-end system that is aware of the camera and image model and enforces natural-image priors while jointly accounting for common image processing steps like demosaicking, denoising, and deconvolution, all directly in a given output representation (e.g., YUV, DCT). Our system is flexible, and we demonstrate it on regular Bayer images as well as images from custom sensors. In all cases, we achieve large improvements in image quality and signal reconstruction compared to state-of-the-art techniques. Finally, we show that our approach handles high-resolution images very efficiently, making even mobile implementations feasible.
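
The joint formulation can be pictured as a single optimization with a data-fidelity term and image priors. Below is a minimal, hypothetical sketch of that penalty-splitting idea: the random sampling mask, the Gaussian-filter "denoiser" prior, and all constants are stand-ins for illustration, not FlexISP's actual solver or priors.

```python
# Hedged sketch: solve min_x ||Ax - y||^2 + prior(x) by alternating a
# denoising (prior) step with a closed-form data step. Everything here
# is a toy stand-in for the joint formulation described above.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
x_true = gaussian_filter(rng.random((64, 64)), 3)    # smooth ground truth
mask = rng.random((64, 64)) < 0.5                    # A: random sampling, a
                                                     # stand-in for e.g. a mosaic
y = np.where(mask, x_true + 0.01 * rng.normal(size=x_true.shape), 0.0)

x, rho = np.zeros_like(y), 0.5
for _ in range(50):
    v = gaussian_filter(x, 1.0)                      # prior step: denoise
    # data step: minimizer of sum_i m_i (x_i - y_i)^2 + rho (x_i - v_i)^2
    x = (mask * y + rho * v) / (mask + rho)
    rho *= 1.05                                      # trust the prior more over time

print("RMSE on unobserved pixels:", np.sqrt(np.mean((x - x_true)[~mask] ** 2)))
```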


computer vision and pattern recognition | 2016

Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks

Pavlo Molchanov; Xiaodong Yang; Shalini Gupta; Kihwan Kim; Stephen Tyree; Jan Kautz

Automatic detection and classification of dynamic hand gestures in real-world systems intended for human-computer interaction is challenging because: 1) there is a large diversity in how people perform gestures, making detection and classification difficult; and 2) the system must work online to avoid noticeable lag between performing a gesture and its classification. In fact, a negative lag (classification before the gesture is finished) is desirable, as feedback to the user can then be truly instantaneous. In this paper, we address these challenges with a recurrent three-dimensional convolutional neural network that performs simultaneous detection and classification of dynamic hand gestures from multimodal data. We employ connectionist temporal classification to train the network to predict class labels from in-progress gestures in unsegmented input streams. To validate our method, we introduce a new challenging multimodal dynamic hand gesture dataset captured with depth, color, and stereo-IR sensors. On this challenging dataset, our gesture recognition system achieves an accuracy of 83.8%, outperforms competing state-of-the-art algorithms, and approaches human accuracy of 88.4%. Moreover, our method achieves state-of-the-art performance on the SKIG and ChaLearn2014 benchmarks.
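
A minimal sketch of the recurrent-3D-CNN-plus-CTC recipe is below; the `R3DCNN` name, layer sizes, and shapes are illustrative assumptions, not the paper's exact network. CTC lets the model emit a label (or blank) per clip of an unsegmented stream, which is what enables classification before a gesture ends.

```python
# Hedged sketch: 3D convolutions summarize short clips, a recurrent layer
# accumulates evidence across clips, and CTC trains on unsegmented streams.
import torch
import torch.nn as nn

class R3DCNN(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(          # per-clip 3D feature extractor
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes + 1)  # +1 for the CTC blank

    def forward(self, clips):               # clips: (batch, time, 1, D, H, W)
        b, t = clips.shape[:2]
        feats = self.conv(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out).log_softmax(-1)

model = R3DCNN(num_classes=25)
clips = torch.randn(2, 8, 1, 8, 32, 32)     # 2 streams of 8 short clips each
log_probs = model(clips).transpose(0, 1)    # CTC expects (time, batch, classes)
targets = torch.tensor([[3], [7]])          # one gesture label per stream
loss = nn.CTCLoss(blank=25)(log_probs, targets,
                            torch.tensor([8, 8]), torch.tensor([1, 1]))
loss.backward()
```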


IEEE Transactions on Computational Imaging | 2017

Loss Functions for Image Restoration With Neural Networks

Hang Zhao; Orazio Gallo; Iuri Frosio; Jan Kautz

Neural networks are becoming central in several areas of computer vision and image processing, and different architectures have been proposed to solve specific problems. The impact of the loss layer of neural networks, however, has not received much attention in the context of image processing: the default and virtually only choice is ℓ2. In this paper, we bring attention to alternative choices for image restoration. In particular, we show the importance of perceptually motivated losses when the resulting image is to be evaluated by a human observer. We compare the performance of several losses, and propose a novel, differentiable error function. We show that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged.
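
The proposed direction can be illustrated with a loss that trades a structural similarity term off against ℓ1. The sketch below uses a single-scale SSIM with a uniform window and a mixing weight `alpha`; these simplifications are assumptions for brevity, not the paper's exact multi-scale formulation.

```python
# Hedged sketch: a perceptually motivated mixed loss, alpha*(1-SSIM) + (1-alpha)*L1.
# The uniform window and single-scale SSIM are simplifying assumptions.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01**2, c2=0.03**2, win=11):
    k = torch.ones(1, 1, win, win, device=x.device) / win**2  # uniform window
    mu_x, mu_y = F.conv2d(x, k), F.conv2d(y, k)
    sx = F.conv2d(x * x, k) - mu_x**2
    sy = F.conv2d(y * y, k) - mu_y**2
    sxy = F.conv2d(x * y, k) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sxy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (sx + sy + c2)
    return (num / den).mean()

def mixed_loss(pred, target, alpha=0.84):   # alpha value is an assumption here
    return alpha * (1 - ssim(pred, target)) + (1 - alpha) * F.l1_loss(pred, target)

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
target = torch.rand(1, 1, 64, 64)
mixed_loss(pred, target).backward()
```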


acm multimedia | 2016

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

Xiaodong Yang; Pavlo Molchanov; Jan Kautz

This paper presents a novel framework to combine multiple layers and modalities of deep neural networks for video classification. We first propose a multilayer strategy to simultaneously capture a variety of levels of abstraction and invariance in a network, where the convolutional and fully connected layers are effectively represented by our proposed feature aggregation methods. We further introduce a multimodal scheme that includes four highly complementary modalities to extract diverse static and dynamic cues at multiple temporal scales. In particular, for modeling the long-term temporal information, we propose a new structure, FC-RNN, to effectively transform pre-trained fully connected layers into recurrent layers. A robust boosting model is then introduced to optimize the fusion of multiple layers and modalities in a unified way. In extensive experiments, we achieve state-of-the-art results on two public benchmark datasets: UCF101 and HMDB51.
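
A minimal sketch of the FC-RNN transformation described above: a pre-trained fully connected layer becomes the input transform of a recurrent layer, with only the recurrent weights trained from scratch. Sizes and names here are illustrative assumptions.

```python
# Hedged sketch of the FC-RNN idea: h_t = f(W_fc x_t + W_rec h_{t-1}),
# where W_fc comes from a pre-trained fully connected layer.
import torch
import torch.nn as nn

class FCRNN(nn.Module):
    def __init__(self, fc: nn.Linear):
        super().__init__()
        self.fc = fc                                   # pre-trained input transform
        self.w_rec = nn.Linear(fc.out_features, fc.out_features, bias=False)

    def forward(self, x):                              # x: (batch, time, features)
        h = x.new_zeros(x.size(0), self.fc.out_features)
        outs = []
        for t in range(x.size(1)):
            h = torch.tanh(self.fc(x[:, t]) + self.w_rec(h))
            outs.append(h)
        return torch.stack(outs, dim=1)

pretrained_fc = nn.Linear(512, 256)   # stand-in for a CNN's trained fc layer
rnn = FCRNN(pretrained_fc)
y = rnn(torch.randn(4, 10, 512))      # (batch=4, time=10, features=512)
```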


Applied Optics | 2015

Slim near-eye display using pinhole aperture arrays

Kaan Akşit; Jan Kautz; David Luebke

We report a new technique for building a wide-angle, lightweight, thin-form-factor, cost-effective, easy-to-manufacture near-eye head-mounted display (HMD) for virtual reality applications. Our approach adopts an aperture mask containing an array of pinholes and a screen as a source of imagery. We demonstrate proof-of-concept HMD prototypes with a binocular field of view (FOV) of 70°×45°, or a total diagonal FOV of 83°. This FOV should increase with increasing display panel size. The optical angular resolution supported in our prototype can go down to 1.4–2.1 arcmin by adopting a display with 20–30 μm pixel pitch.
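
The quoted resolution range follows from simple geometry: one pixel subtends roughly atan(pitch / viewing distance). The back-of-the-envelope check below assumes an effective viewing distance of about 5 cm, which is not stated here; with that assumption, a 20–30 μm pitch maps to roughly 1.4–2.1 arcmin per pixel.

```python
# Hedged sanity check of the angular-resolution figures; the 5 cm
# effective viewing distance is an assumption, not a stated parameter.
import math

def pixel_angle_arcmin(pitch_m: float, distance_m: float) -> float:
    return math.degrees(math.atan(pitch_m / distance_m)) * 60

distance = 0.05  # assumed ~5 cm effective viewing distance
for pitch_um in (20, 30):
    arcmin = pixel_angle_arcmin(pitch_um * 1e-6, distance)
    print(f"{pitch_um} um pitch -> {arcmin:.1f} arcmin")  # ~1.4 and ~2.1
```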


international conference on computer vision | 2015

Robust Model-Based 3D Head Pose Estimation

Gregory P. Meyer; Shalini Gupta; Iuri Frosio; Dikpal Reddy; Jan Kautz

We introduce a method for accurate three-dimensional head pose estimation using a commodity depth camera. We perform pose estimation by registering a morphable face model to the measured depth data, using a combination of particle swarm optimization (PSO) and the iterative closest point (ICP) algorithm, which minimizes a cost function that includes a 3D registration and a 2D overlap term. The pose is estimated on the fly without requiring an explicit initialization or training phase. Our method handles large pose angles and partial occlusions by dynamically adapting to the reliable visible parts of the face. It is robust and generalizes to different depth sensors without modification. On the Biwi Kinect dataset, we achieve best-in-class performance, with average angular errors of 2.1, 2.1, and 2.4 degrees for yaw, pitch, and roll, respectively, and an average translational error of 5.9 mm, while running at 6 fps on a graphics processing unit.
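
For reference, here is a minimal point-to-point ICP step of the kind used inside such a registration loop. The PSO search and the 2D overlap term from the cost function are omitted for brevity, and the toy data is an assumption.

```python
# Hedged sketch: one rigid point-to-point ICP step (nearest-neighbor
# correspondences, then the Kabsch closed-form rotation/translation).
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    _, idx = cKDTree(dst).query(src)          # closest-point correspondences
    d = dst[idx]
    mu_s, mu_d = src.mean(0), d.mean(0)
    H = (src - mu_s).T @ (d - mu_d)           # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                            # optimal rotation (Kabsch)
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return src @ R.T + t, R, t

# Toy usage: recover a small rigid motion of a random 3D point set.
rng = np.random.default_rng(0)
dst = rng.normal(size=(200, 3))
theta = 0.1
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
src = dst @ Rz.T + np.array([0.05, -0.02, 0.01])
for _ in range(20):
    src, R, t = icp_step(src, dst)
```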


european conference on computer vision | 2018

Multimodal Unsupervised Image-to-Image Translation

Xun Huang; Ming-Yu Liu; Serge J. Belongie; Jan Kautz

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to state-of-the-art approaches further demonstrate the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at this https URL
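
The content/style recombination can be sketched in a few lines. The toy encoder and decoder below are illustrative stand-ins (the actual MUNIT networks use AdaIN-based residual decoders), meant only to show how a content code from domain A combines with a style code sampled for domain B.

```python
# Hedged sketch: split an image into a spatial content code and a global
# style code, then decode content with a style drawn from the other domain.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, style_dim=8):
        super().__init__()
        self.content = nn.Conv2d(3, 64, 4, stride=2, padding=1)   # spatial code
        self.style = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(3, style_dim))        # global code

    def forward(self, x):
        return self.content(x), self.style(x)

class Decoder(nn.Module):
    def __init__(self, style_dim=8):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, 64)   # style modulates features
        self.up = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, content, style):
        scale = self.to_scale(style)[:, :, None, None]
        return torch.tanh(self.up(content * scale))

enc_a, dec_b = Encoder(), Decoder()
x_a = torch.randn(1, 3, 64, 64)                 # image from domain A
content, _ = enc_a(x_a)
style_b = torch.randn(1, 8)                     # random style for domain B
x_ab = dec_b(content, style_b)                  # A-content rendered in B-style
```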


high performance graphics | 2015

An adaptive acceleration structure for screen-space ray tracing

Sven Widmer; Dawid Pająk; Andre Schulz; Kari Pulli; Jan Kautz; Michael Goesele; David Luebke



international conference on 3d vision | 2015

MLMD: Maximum Likelihood Mixture Decoupling for Fast and Accurate Point Cloud Registration

Benjamin Eckart; Kihwan Kim; Alejandro Troccoli; Alonzo Kelly; Jan Kautz


Collaboration


Dive into Jan Kautz's collaborations.
