Oliver Walter
University of Paderborn
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Oliver Walter.
ieee automatic speech recognition and understanding workshop | 2013
Oliver Walter; Timo Korthals; Bhiksha Raj
Discovering the linguistic structure of a language solely from spoken input asks for two steps: phonetic and lexical discovery. The first is concerned with identifying the categorical subword unit inventory and relating it to the underlying acoustics, while the second aims at discovering words as repeated patterns of subword units. The hierarchical approach presented here accounts for classification errors in the first stage by modelling the pronunciation of a word in terms of subword units probabilistically: a hidden Markov model with discrete emission probabilities, emitting the observed subword unit sequences. We describe how the system can be learned in a completely unsupervised fashion from spoken input. To improve the initialization of the training of the word pronunciations, the output of a dynamic time warping based acoustic pattern discovery system is used, as it is able to discover similar temporal sequences in the input data. This improved initialization, using only weak supervision, has led to a 40% reduction in word error rate on a digit recognition task.
ieee automatic speech recognition and understanding workshop | 2013
Jahn Heymann; Oliver Walter; Bhiksha Raj
In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice at the input rather than a single string accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. Recently a Weighted Finite State Transducer (WFST) based approach has been published which we show to suffer from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue leads to greatly improved precision and recall rates, however at the cost of increased computational complexity. It is therefore practical only for single input strings. To allow for a lattice input and thus for errors in the character/phoneme recognizer, we propose a computationally efficient suboptimal two-stage approach, which is shown to significantly improve the word segmentation performance compared to the earlier WFST approach.
workshop on positioning navigation and communication | 2013
Oliver Walter; Joerg Schmalenstroeer; Andreas Engler
In this paper we present a system for car navigation by fusing sensor data on an Android smartphone. The key idea is to use both the internal sensors of the smartphone (e.g., gyroscope) and sensor data from the car (e.g., speed information) to support navigation via GPS. To this end we employ a CAN-Bus-to-Bluetooth adapter to establish a wireless connection between the smartphone and the CAN-Bus of the car. On the smartphone a strapdown algorithm and an error-state Kalman filter are used to fuse the different sensor data streams. The experimental results show that the system is able to maintain higher positioning accuracy during GPS dropouts, thus improving the availability and reliability, compared to GPS-only solutions.
international conference on acoustics, speech, and signal processing | 2014
Jahn Heymann; Oliver Walter; Bhiksha Raj
In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJ-CAM0 dataset. We show that our approach delivers superior segmentation performance than an earlier approach found in the literature, in particular for higher-order language models.
workshop on positioning navigation and communication | 2010
Maik Bevermeier; Oliver Walter; Sven Peschke
In this paper we present a robust location estimation algorithm especially focused on the accuracy in vertical position. A loosely-coupled error state space Kalman filter, which fuses sensor data of an Inertial Measurement Unit and the output of a Global Positioning System device, is augmented by height information from an altitude measurement unit. This unit consists of a barometric altimeter whose output is fused with topographic map information by a Kalman filter to provide robust information about the current vertical user position. These data replace the less reliable vertical position information provided the GPS device. It is shown that typical barometric errors like thermal divergences and fluctuations in the pressure due to changing weather conditions can be compensated by the topographic map information and the barometric error Kalman filter. The resulting height information is shown not only to be more reliable than height information provided by GPS. It also turns out that it leads to better attitude and thus better overall localization estimation accuracy due to the coupling of spatial orientations via the Direct Cosine Matrix. Results are presented both for artificially generated and field test data, where the user is moving by car.
international conference on acoustics, speech, and signal processing | 2015
Oliver Walter; Lukas Drude
In this paper we present a source counting algorithm to determine the number of speakers in a speech mixture. In our proposed method, we model the histogram of estimated directions of arrival with a non-parametric Bayesian infinite Gaussian mixture model. As an alternative to classical model selection criteria and to avoid specifying the maximum number of mixture components in advance, a Dirichlet process prior is employed over the mixture components. This allows to automatically determine the optimal number of mixture components that most probably model the observations. We demonstrate by experiments that this model outperforms a parametric approach using a finite Gaussian mixture model with a Dirichlet distribution prior over the mixture weights.
Künstliche Intelligenz | 2015
Oliver Walter; Bassam Mokbel; Benjamin Paassen; Barbara Hammer
Besides the core learning algorithm itself, one major question in machine learning is how to best encode given training data such that the learning technology can efficiently learn based thereon and generalize to novel data. While classical approaches often rely on a hand coded data representation, the topic of autonomous representation or feature learning plays a major role in modern learning architectures. The goal of this contribution is to give an overview about different principles of autonomous feature learning, and to exemplify two principles based on two recent examples: autonomous metric learning for sequences, and autonomous learning of a deep representation for spoken language, respectively.
Speech Communication | 2018
Vladimir Despotovic; Oliver Walter
Abstract We present an experimental comparison of seven state-of-the-art machine learning algorithms for the task of semantic analysis of spoken input, with a special emphasis on applications for dysarthric speech. Dysarthria is a motor speech disorder, which is characterized by poor articulation of phonemes. In order to cater for these non-canonical phoneme realizations, we employed an unsupervised learning approach to estimate the acoustic models for speech recognition, which does not require a literal transcription of the training data. Even for the subsequent task of semantic analysis, only weak supervision is employed, whereby the training utterance is accompanied by a semantic label only, rather than a literal transcription. Results on two databases, one of them containing dysarthric speech, are presented showing that Markov logic networks and conditional random fields substantially outperform other machine learning approaches. Markov logic networks have proved to be especially robust to recognition errors, which are caused by imprecise articulation in dysarthric speech.
conference of the international speech communication association | 2014
Oliver Walter; Vladimir Despotovic; Jort F. Gemmeke; Bart Ons; Hugo Van hamme
international conference on robotics and automation | 2013
Oliver Walter; Sourish Chaudhuri; Bhiksha Raj