Tobias Gehrig
Karlsruhe Institute of Technology
Publications
Featured research published by Tobias Gehrig.
EURASIP Journal on Advances in Signal Processing | 2006
Ulrich Klee; Tobias Gehrig; John W. McDonough
In this work, we propose an algorithm for acoustic source localization based on time delay of arrival (TDOA) estimation. In earlier work by other authors, an initial closed-form approximation was first used to estimate the true position of the speaker followed by a Kalman filtering stage to smooth the time series of estimates. In the proposed algorithm, this closed-form approximation is eliminated by employing a Kalman filter to directly update the speaker's position estimate based on the observed TDOAs. In particular, the TDOAs comprise the observation associated with an extended Kalman filter whose state corresponds to the speaker's position. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the proposed algorithm provides source localization accuracy superior to the standard spherical and linear intersection techniques. Moreover, the proposed algorithm, although relying on an iterative optimization scheme, proved efficient enough for real-time operation.
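The core of this approach fits in a few lines. Below is a minimal sketch of the EKF correction step with TDOA observations, assuming numpy, a known speed of sound, and illustrative microphone-pair geometry; it is not the authors' implementation.

```python
# Minimal sketch of an EKF correction step driven by TDOA observations.
# Microphone geometry, noise covariances, and constants are illustrative.
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def tdoa(x, mic_a, mic_b):
    """Predicted time delay of arrival for one microphone pair."""
    return (np.linalg.norm(x - mic_a) - np.linalg.norm(x - mic_b)) / C

def tdoa_jacobian(x, mic_a, mic_b):
    """Gradient of the predicted TDOA with respect to the source position."""
    u_a = (x - mic_a) / np.linalg.norm(x - mic_a)
    u_b = (x - mic_b) / np.linalg.norm(x - mic_b)
    return (u_a - u_b) / C

def ekf_update(x, P, pairs, observed, R):
    """One correction step; x is the 3-D speaker position estimate."""
    h = np.array([tdoa(x, a, b) for a, b in pairs])
    H = np.array([tdoa_jacobian(x, a, b) for a, b in pairs])
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ (observed - h), (np.eye(3) - K @ H) @ P
```

In the iterated variant the paper builds on, this correction is repeated with the Jacobian relinearized around each successive position estimate.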
Workshop on Applications of Signal Processing to Audio and Acoustics | 2005
Tobias Gehrig; Kai Nickel; Hazim Kemal Ekenel; Ulrich Klee; John W. McDonough
In prior work, we proposed using an extended Kalman filter to directly update position estimates in a speaker localization system based on time delays of arrival. We found that such a scheme provided superior tracking quality as compared with the conventional closed-form approximation methods. In this work, we enhance our audio localizer with video information. We propose an algorithm to incorporate detected face positions in different camera views into the Kalman filter without doing any explicit triangulation. This approach yields a robust source localizer that functions reliably both for segments wherein the speaker is silent, which would be detrimental for an audio-only tracker, and wherein many faces appear, which would confuse a video-only tracker. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the audio-video localizer functioned better than a localizer based solely on audio or solely on video features.
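To give a concrete sense of how face detections can enter the same filter without triangulation, here is a hedged sketch: the 3-D state is projected into one camera view with a pinhole model, and that projection serves as the observation function. The camera model and finite-difference Jacobian are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: a detected face position (u, v) in one camera view is a 2-D
# observation of the same 3-D state used for the TDOAs above, so no
# explicit triangulation is needed. Camera parameters are assumed known.
import numpy as np

def project(x, K_cam, R_cam, t_cam):
    """Pinhole projection of a 3-D point into pixel coordinates."""
    p = K_cam @ (R_cam @ x + t_cam)
    return p[:2] / p[2]

def numeric_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of the observation function at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

# A face detection then feeds the same EKF correction as the acoustic
# features, with h(x) = project(x, ...) and H = numeric_jacobian(...).
```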
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Kenichi Kumatani; Tobias Gehrig; Uwe Mayer; Emilian Stoimenov; John W. McDonough; Matthias Wölfel
In this paper, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace, K0, and Γ pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.
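Under the Gaussian assumption the quantity being minimized has a simple closed form. The sketch below spells it out for two GSC outputs in one subband; the toy GSC wiring and dimensions are assumptions for illustration.

```python
# Sketch of the Gaussian-case MMI objective for two GSC outputs in one
# subband: mutual information reduces to a function of the cross-correlation
# coefficient rho, so minimizing it drives rho toward zero.
import numpy as np

def gsc_output(X, w_q, B, w_a):
    """Subband GSC output y = (w_q - B w_a)^H x for snapshots X (frames, chans)."""
    w = w_q - B @ w_a
    return X @ w.conj()

def gaussian_mutual_information(y1, y2):
    """MI of two zero-mean complex Gaussian signals via their correlation."""
    rho = np.vdot(y1, y2) / np.sqrt(np.vdot(y1, y1).real * np.vdot(y2, y2).real)
    return -np.log(1.0 - np.abs(rho) ** 2)
```

Jointly optimizing the two active weight vectors w_a against this objective, for example by gradient descent, yields the behavior described above: the undesired source is nulled without the signal cancellation of conventional adaptive beamforming.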
Multimodal Technologies for Perception of Humans | 2008
Keni Bernardin; Tobias Gehrig; Rainer Stiefelhagen
In this paper, two multimodal systems for the tracking of multiple users in smart environments are presented. The first is a multi-view particle filter tracker using foreground and color cues as well as special upper-body detection and person-region features. The other is a wide angle overhead view person tracker relying on foreground segmentation and model-based blob tracking. Both systems are completed by a joint probabilistic data association filter-based source localizer using the input from several microphone arrays. While the first system fuses audio and visual cues at the feature level, the second one incorporates them at the decision level using state-based heuristics. The systems are designed to estimate the 3D scene locations of room occupants and are evaluated based on their precision in estimating person locations, their accuracy in recognizing person configurations, and their ability to consistently keep track of identities over time. The trackers are extensively tested and compared, for each separate modality and for the combined modalities, on the CLEAR 2007 Evaluation Database.
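As a rough illustration of the decision-level fusion used by the second system, here is a hedged sketch of a state-based blending rule; the actual heuristics are not spelled out in this abstract, so everything below is an assumed stand-in.

```python
# Illustrative decision-level fusion: each modality emits a 3-D location and
# a confidence, and a simple state-based rule blends them. This is a stand-in
# for the paper's heuristics, not a reproduction of them.
import numpy as np

def fuse(video_pos, video_conf, audio_pos, audio_conf, speech_active):
    """Blend per-modality location estimates for one tracked person."""
    if not speech_active:
        return np.asarray(video_pos)       # no usable acoustic cue
    w = audio_conf / (audio_conf + video_conf + 1e-9)
    return w * np.asarray(audio_pos) + (1.0 - w) * np.asarray(video_pos)
```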
Computer Vision and Pattern Recognition | 2011
Tobias Gehrig; Hazim Kemal Ekenel
In this paper, we present a common framework for real-time action unit detection and emotion recognition that we have developed for the emotion recognition and action unit detection sub-challenges of the FG 2011 Facial Expression Recognition and Analysis Challenge. For these tasks we employed a local appearance-based face representation approach using discrete cosine transform, which has been shown to be very effective and robust for face recognition. Using these features, we trained multiple one-versus-all support vector machine classifiers corresponding to the individual classes of the specific task. With this framework we achieve 24.2% and 7.6% absolute improvement over the overall baseline results on the emotion recognition and action unit detection sub-challenges, respectively.
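The feature pipeline is easy to sketch. Below is a minimal version of block-based DCT features feeding one-versus-all SVMs; block size, coefficient count, and the row-major coefficient order (in place of the usual zig-zag scan) are simplifying assumptions.

```python
# Sketch of the local appearance representation: split the aligned face into
# blocks, 2-D DCT each block, keep a few low-frequency coefficients (DC
# dropped), and concatenate. Parameters are illustrative choices.
import numpy as np
from scipy.fftpack import dct
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def block_dct_features(img, block=8, n_coeffs=5):
    feats = []
    for r in range(0, img.shape[0] - img.shape[0] % block, block):
        for c in range(0, img.shape[1] - img.shape[1] % block, block):
            patch = img[r:r + block, c:c + block].astype(float)
            d = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            coeffs = d.ravel()             # row-major, simplifying the zig-zag scan
            feats.extend(coeffs[1:n_coeffs + 1])   # skip the DC coefficient
    return np.asarray(feats)

clf = OneVsRestClassifier(LinearSVC())     # one SVM per emotion / action unit
```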
CLEAR | 2006
Keni Bernardin; Tobias Gehrig; Rainer Stiefelhagen
Simultaneous tracking of multiple persons in real world environments is an active research field and several approaches have been proposed, based on a variety of features and algorithms. In this work, we present two multimodal systems for tracking multiple users in a smart room environment. One is a multi-view tracker based on color histogram tracking and special person region detectors. The other is a wide angle overhead view person tracker relying on foreground segmentation and model-based tracking. Both systems are completed by a joint probabilistic data association filter-based source localization framework using input from several microphone arrays. We also briefly present two intuitive metrics to allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations, and their ability to consistently label objects over time. The trackers are extensively tested and compared, for each modality separately, and for the combined modalities, on the CLEAR 2006 Evaluation Database.
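The two metrics referred to here are, on our reading, the CLEAR MOT metrics (MOTP and MOTA); a minimal sketch of both follows, with the bookkeeping of matches and error events assumed to happen elsewhere.

```python
# Sketch of the two tracking metrics, assuming they are MOTP and MOTA:
# MOTP averages the position error over matched object-hypothesis pairs;
# MOTA folds misses, false positives, and identity mismatches into one score.
def motp(total_distance, num_matches):
    """Mean distance between matched ground-truth objects and hypotheses."""
    return total_distance / max(num_matches, 1)

def mota(misses, false_positives, mismatches, num_objects):
    """One minus the ratio of all error events to the number of objects."""
    return 1.0 - (misses + false_positives + mismatches) / max(num_objects, 1)
```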
International Conference on Computer Vision | 2011
Tobias Gehrig; Hazim Kemal Ekenel
In this work, we propose a framework for simultaneously detecting the presence of multiple facial action units using kernel partial least squares regression (KPLS). This method has the advantage of being easily extensible to learn more face-related labels, while at the same time being computationally efficient. We compare the approach to linear and non-linear support vector machines (SVM) and evaluate its performance on the extended Cohn-Kanade (CK+) dataset and the GEneva Multimodal Emotion Portrayals (GEMEP-FERA) dataset, as well as across databases. It is shown that KPLS achieves around 2% absolute improvement over the SVM-based approach in terms of the two alternative forced choice (2AFC) score when trained on CK+ and tested on CK+ and GEMEP-FERA. It achieves around 6% absolute improvement over the SVM-based approach when trained on GEMEP-FERA and tested on CK+. We also show that KPLS handles non-additive AU combinations better than SVM-based approaches trained to detect single AUs only.
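A common way to realize KPLS is to run linear PLS on a centered kernel matrix (in the style of Rosipal and Trejo); the sketch below takes that route with scikit-learn. Kernel width, component count, the 0.5 threshold, and the random stand-in data are all illustrative assumptions.

```python
# Sketch of kernel PLS for multi-label AU detection: build an RBF kernel
# matrix, center it, and fit linear PLS on it. Proper centering of the test
# kernel with training statistics is omitted for brevity.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

def center_kernel(K):
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

X_train = np.random.randn(100, 50)                       # stand-in face features
Y_train = (np.random.rand(100, 12) > 0.5).astype(float)  # 12 AU labels

K = center_kernel(rbf_kernel(X_train, gamma=1e-2))
pls = PLSRegression(n_components=10).fit(K, Y_train)

K_test = rbf_kernel(X_train[:5], X_train, gamma=1e-2)    # new samples vs. train
au_present = pls.predict(K_test) > 0.5                   # one output per AU
```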
CLEAR | 2006
Tobias Gehrig; John W. McDonough
In prior work, we developed a speaker tracking system based on an extended Kalman filter using time delays of arrival (TDOAs) as acoustic features. In particular, the TDOAs comprised the observation associated with an iterated extended Kalman filter (IEKF) whose state corresponds to the speaker position. In other work, we followed the same approach to develop a system that could use both audio and video information to track a moving lecturer. While these systems functioned well, their utility was limited to scenarios in which a single speaker was to be tracked. In this work, we seek to remove this restriction by generalizing the IEKF, first to a probabilistic data association filter, which incorporates a clutter model for rejection of spurious acoustic events, and then to a joint probabilistic data association filter (JPDAF), which maintains a separate state vector for each active speaker. In a set of experiments conducted on seminar and meeting data, we demonstrate that the JPDAF provides tracking performance superior to the IEKF.
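The step from the IEKF to the PDA filter is essentially a change in how innovations are formed. Below is a hedged sketch of that step for one track; detection probability and clutter density are illustrative constants, and the JPDAF then extends this by enumerating joint association events across all active speaker tracks.

```python
# Sketch of a probabilistic data association update for one track: every
# candidate measurement gets an association probability against the predicted
# observation, a uniform clutter term absorbs spurious acoustic events, and
# the innovation becomes a probability-weighted mixture.
import numpy as np

def pda_innovation(candidates, z_pred, S, p_detect=0.9, clutter_density=1e-3):
    S_inv = np.linalg.inv(S)
    norm = 1.0 / np.sqrt(np.linalg.det(2.0 * np.pi * S))
    likes = [p_detect * norm * np.exp(-0.5 * (z - z_pred) @ S_inv @ (z - z_pred))
             for z in candidates]
    beta_0 = (1.0 - p_detect) * clutter_density   # "all clutter" hypothesis
    total = beta_0 + sum(likes)
    betas = [l / total for l in likes]            # association probabilities
    return sum(b * (z - z_pred) for b, z in zip(betas, candidates))
```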
Proceedings of the 2013 Emotion Recognition in the Wild Challenge and Workshop | 2013
Tobias Gehrig; Hazim Kemal Ekenel
In this paper, we discuss the challenges for facial expression analysis in the wild. We studied these problems on the Emotion Recognition in the Wild Challenge 2013 [3] dataset. We performed extensive experiments on this dataset comparing different approaches for face alignment, face representation, and classification, as well as human performance. It turns out that under close-to-real conditions, especially with co-occurring speech, it is hard even for humans to assign emotion labels to clips when only taking video into account. Our experiments on automatic emotion classification achieved at best a correct classification rate of 29.81% on the test set using Gabor features and linear support vector machines, which were trained on web images. This result is 7.06% better than the official baseline, which additionally incorporates time information.
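For concreteness, here is a hedged sketch of the best-performing pipeline named above, Gabor features with a linear SVM; the filter-bank parameters and the crude mean/std pooling are assumptions, not the paper's exact configuration.

```python
# Sketch: Gabor magnitude features from an aligned face crop, pooled per
# filter, classified with a linear SVM (trained on web images in the paper).
import numpy as np
from skimage.filters import gabor
from sklearn.svm import LinearSVC

def gabor_features(face, frequencies=(0.1, 0.2, 0.3), n_orientations=4):
    feats = []
    for f in frequencies:
        for k in range(n_orientations):
            real, imag = gabor(face, frequency=f, theta=k * np.pi / n_orientations)
            mag = np.hypot(real, imag)
            feats.extend([mag.mean(), mag.std()])   # crude pooling per filter
    return np.asarray(feats)

clf = LinearSVC()
```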
IEEE Automatic Speech Recognition and Understanding Workshop | 2007
Kenichi Kumatani; Uwe Mayer; Tobias Gehrig; Emilian Stoimenov; John W. McDonough; Matthias Wölfel
In this work, we address an acoustic beamforming application where two speakers are simultaneously active. We construct one subband domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly adjust the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). In order to calculate the mutual information of the complex subband snapshots, we consider four probability density functions (pdfs), namely the Gaussian, Laplace, K0, and Γ pdfs. The latter three belong to the class of super-Gaussian density functions that are typically used in independent component analysis as opposed to conventional beamforming. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge. In the experiments, the delay-and-sum beamformer achieved a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieved 55.2% WER, which was further reduced to 52.0% with a K0 pdf, whereas the WER for data recorded with a close-talking microphone was 21.6%.
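For reference, the delay-and-sum baseline quoted at 70.4% WER amounts to phase-aligning the channels toward the source in each subband and averaging; a minimal sketch follows, with geometry and subband frequency as assumed inputs.

```python
# Sketch of a subband delay-and-sum beamformer: steer by phase-aligning each
# channel to the source position, then average across channels.
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def delay_and_sum(X, mic_positions, source_pos, freq):
    """X: (frames, channels) complex subband snapshots at one center frequency."""
    delays = np.array([np.linalg.norm(source_pos - m) for m in mic_positions]) / C
    steering = np.exp(2j * np.pi * freq * delays)   # per-channel phase alignment
    return X @ steering.conj() / len(mic_positions)
```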