Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Constantin Spille is active.

Publication


Featured research published by Constantin Spille.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Using binaural processing for automatic speech recognition in multi-talker scenes

Constantin Spille; Mathias Dietz; Volker Hohmann; Bernd T. Meyer

The segregation of concurrent speakers and other sound sources is an important ability of the human auditory system but is missing in most current systems for automatic speech recognition (ASR), resulting in a large gap between human and machine performance. The present study uses a physiologically motivated model of binaural hearing to estimate the position of moving speakers in a noisy environment, combining methods from computational auditory scene analysis (CASA) and ASR. The binaural model is paired with a particle filter and a beamformer to enhance spoken sentences that are then transcribed by the ASR system. An evaluation in a clean, anechoic two-speaker condition shows that word recognition rates increase from 30.8% to 72.6%, demonstrating the potential of the CASA-based approach. In different noisy environments, improvements were also observed for SNRs of 5 dB and above, which was attributed to average tracking errors remaining consistent over a wide range of SNRs.
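
The tracking stage described in the abstract can be sketched as a bootstrap particle filter over speaker azimuth. This is a minimal illustration, not the paper's implementation: in the paper the observations come from the binaural model's localization cues, whereas here they are simulated noisy direction-of-arrival estimates, and all parameter values (noise standard deviations, particle count, motion model) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_azimuth(observations, n_particles=500, motion_std=2.0, obs_std=5.0):
    """Bootstrap particle filter tracking a speaker's azimuth (degrees)."""
    particles = rng.uniform(-90.0, 90.0, n_particles)  # azimuth hypotheses
    weights = np.full(n_particles, 1.0 / n_particles)
    track = []
    for z in observations:
        # Predict: random-walk motion model for a (possibly moving) speaker.
        particles += rng.normal(0.0, motion_std, n_particles)
        particles = np.clip(particles, -90.0, 90.0)
        # Update: weight particles by a Gaussian likelihood of the observed DOA.
        weights *= np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        weights /= weights.sum()
        track.append(float(np.sum(weights * particles)))
        # Resample when the effective sample size drops below half.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, n_particles, p=weights)
            particles = particles[idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return np.array(track)

# A speaker moving slowly from -30 deg to +30 deg, observed with noise.
true_path = np.linspace(-30.0, 30.0, 100)
noisy_doa = true_path + rng.normal(0.0, 5.0, 100)
estimate = track_azimuth(noisy_doa)
```

The smoothed track would then steer a beamformer toward the estimated speaker position frame by frame.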


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Combining Binaural and Cortical Features for Robust Speech Recognition

Constantin Spille; Birger Kollmeier; Bernd T. Meyer

The segregation of concurrent speakers and other sound sources is an important ability of the human auditory system, but is missing in most current systems for automatic speech recognition (ASR), resulting in a large gap between human and machine performance. This study combines processing related to peripheral and cortical stages of the auditory pathway: First, a physiologically motivated binaural model estimates the positions of moving speakers to enhance the desired speech signal. Second, signals are converted to spectro-temporal Gabor features, which resemble cortical speech representations and have been shown to improve ASR in noisy conditions. Spectro-temporal Gabor features improve recognition results in all acoustic conditions under consideration compared with mel-frequency cepstral coefficients. Binaural processing results in lower word error rates (WERs) in acoustic scenes with a concurrent speaker, whereas monaural processing should be preferred in the presence of a stationary masking noise. An in-depth analysis of binaural processing identifies crucial processing steps, such as the localization of sound sources and the estimation of the beamformer's noise coherence matrix, and shows how much each processing step affects recognition performance in acoustic conditions of different complexity.
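
A spectro-temporal Gabor feature, as referenced above, is obtained by filtering a log-mel spectrogram with 2-D modulation filters. The sketch below uses illustrative filter sizes and modulation frequencies, not the published filter-bank parameters:

```python
import numpy as np

def gabor_filter(omega_t, omega_f, size_t=11, size_f=11):
    """Real 2-D Gabor filter: a cosine carrier under a Hann envelope.

    omega_t / omega_f are temporal and spectral modulation frequencies
    (radians per frame / per mel channel).
    """
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    envelope = np.outer(np.hanning(size_f), np.hanning(size_t))
    carrier = np.cos(omega_f * f[:, None] + omega_t * t[None, :])
    g = envelope * carrier
    return g - g.mean()  # remove DC so constant spectrogram regions map to zero

def conv2_same(x, k):
    """FFT-based 2-D convolution cropped to x's shape (odd-sized kernels)."""
    s0, s1 = x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1
    full = np.fft.irfft2(np.fft.rfft2(x, (s0, s1)) * np.fft.rfft2(k, (s0, s1)),
                         (s0, s1))
    r0, r1 = (k.shape[0] - 1) // 2, (k.shape[1] - 1) // 2
    return full[r0:r0 + x.shape[0], r1:r1 + x.shape[1]]

# Toy log-mel spectrogram (40 channels x 200 frames) with a temporal ripple.
spec = np.tile(np.sin(0.3 * np.arange(200)), (40, 1))
bank = [gabor_filter(wt, wf) for wt in (0.1, 0.3) for wf in (0.0, 0.25)]
feats = np.stack([conv2_same(spec, g) for g in bank])
```

Each filter in the bank is tuned to one combination of temporal and spectral modulation, so the feature maps respond selectively to ripples of matching rate and scale.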


Conference of the International Speech Communication Association (INTERSPEECH) | 2016

Assessing Speech Quality in Speech-Aware Hearing Aids Based on Phoneme Posteriorgrams.

Constantin Spille; Hendrik Kayser; Hynek Hermansky; Bernd T. Meyer

Current behind-the-ear hearing aids (HAs) allow spatial filtering to enhance localized sound sources; however, they often lack processing strategies tailored to spoken language. Hence, without feedback about the speech quality achieved by the system, spatial filtering potentially remains unused in the case of a conservative enhancement strategy, or can even be detrimental to the speech intelligibility of the output signal. In this paper, we apply phoneme posteriorgrams obtained from HA signals processed with deep neural networks to measure the quality of speech representations in spatial scenes. The inverse entropy of the phoneme probabilities is proposed as a measure for evaluating whether the current hearing aid parameters are optimal for the given acoustic condition. We investigate how varying noise levels and wrong estimates of the to-be-enhanced direction affect this measure in anechoic and reverberant conditions, and show that it is highly reliable when varying each parameter. Experiments show that entropy as a function of the beam angle has a distinct minimum at the speaker's true position and its immediate vicinity. Thus, it can be used to determine the beam angle that optimizes the speech representation. Further, variations of the SNR cause a consistent offset of the entropy.
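
The proposed measure, the inverse entropy of per-frame phoneme probabilities, is straightforward to compute from a posteriorgram. A minimal sketch (the phoneme inventory size and smoothing constant here are arbitrary, not the paper's values):

```python
import numpy as np

def mean_inverse_entropy(posteriorgram, eps=1e-12):
    """Speech-quality measure from a phoneme posteriorgram (frames x phonemes).

    Per-frame entropy H = -sum(p * log p) is low when the network is confident
    about one phoneme, so the inverse entropy is high for clean, well-enhanced
    speech and low for noisy or mis-steered input.
    """
    p = np.clip(posteriorgram, eps, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)  # nats per frame
    return float(np.mean(1.0 / (entropy + eps)))

# Confident frames (one phoneme dominates) vs. maximally uncertain frames.
n_phones = 40
confident = np.full((50, n_phones), 0.002)
confident[:, 7] = 1.0 - 0.002 * (n_phones - 1)   # rows still sum to 1
uniform = np.full((50, n_phones), 1.0 / n_phones)
q_clean = mean_inverse_entropy(confident)
q_noisy = mean_inverse_entropy(uniform)
```

Sweeping the beam angle and evaluating this quantity per setting would reproduce the entropy-minimum behavior described in the abstract.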


Archive | 2013

Binaural Scene Analysis with Multidimensional Statistical Filters

Constantin Spille; Bernd T. Meyer; Mathias Dietz; Volker Hohmann

The segregation of concurrent speakers and other sound sources is an important aspect in improving the performance of audio technology, such as noise reduction and automatic speech recognition (ASR), in difficult acoustic conditions. This technology is relevant for applications such as hearing aids, mobile audio devices, robotics, hands-free audio communication and speech-based computer interfaces. Computational auditory scene analysis (CASA) techniques simulate aspects of the processing properties of the human perceptual system, using statistical signal-processing techniques to improve inferences about the causes of the audio input received by the system. This study argues that CASA is a promising approach to source separation and outlines several theoretical arguments to support this hypothesis. With a focus on computational binaural scene analysis, the principles of CASA techniques are reviewed. Furthermore, in an experimental approach, the applicability of a recent model of binaural interaction to improving ASR performance in multi-speaker conditions with spatially separated, moving speakers is explored. The binaural model provides input to a statistical inference filter that employs a priori information on possible movements of the sources in order to track the speakers' positions. The tracks are used to adapt a beamformer that selects a specific speaker, and the beamformer output is subsequently used for an ASR task. Compared to the unprocessed (i.e., mixed) data in a two-speaker condition, the word recognition rates obtained with the enhanced signals based on binaural information increased from 30.8% to 88.4%, demonstrating the potential of the proposed CASA-based approach.
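
The beamforming step that selects the tracked speaker can be illustrated with a simple frequency-domain delay-and-sum beamformer for two microphones. This is a generic sketch under far-field assumptions, not the adaptive beamformer used in the study; microphone spacing, sampling rate, and geometry are illustrative:

```python
import numpy as np

def delay_and_sum(mic_signals, azimuth_deg, mic_spacing=0.15, fs=16000, c=343.0):
    """Steer a two-microphone delay-and-sum beamformer toward azimuth_deg.

    mic_signals: array of shape (2, n_samples). The inter-microphone delay of
    a far-field source at the given azimuth is compensated in the frequency
    domain, then the time-aligned channels are averaged, which reinforces the
    target and attenuates sources from other directions.
    """
    tau = mic_spacing * np.sin(np.radians(azimuth_deg)) / c  # seconds
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Advance channel 1 by tau (shift theorem) to align it with channel 0.
    shifted = np.fft.irfft(np.fft.rfft(mic_signals[1]) *
                           np.exp(2j * np.pi * freqs * tau), n)
    return 0.5 * (mic_signals[0] + shifted)

# Target at +30 degrees: microphone 1 receives the signal delayed by tau.
fs, n = 16000, 2048
t = np.arange(n) / fs
sig = np.sin(2 * np.pi * 440 * t)
tau = 0.15 * np.sin(np.radians(30.0)) / 343.0
delayed = np.sin(2 * np.pi * 440 * (t - tau))
out = delay_and_sum(np.stack([sig, delayed]), 30.0)
```

With the steering angle taken from the tracker, the enhanced output can be fed directly to the ASR front end.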


Computer Speech & Language | 2018

Predicting speech intelligibility with deep neural networks

Constantin Spille; Stephan D. Ewert; Birger Kollmeier; Bernd T. Meyer

An accurate objective prediction of human speech intelligibility is of interest for many applications, such as the evaluation of signal-processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained with and compared to data presented in Schubotz et al. (2016), which comprise eight different additive maskers ranging from speech-shaped stationary noise to a single-talker interferer, and responses from eight normal-hearing subjects. The task for listeners and ASR is to identify noisy words from a German matrix sentence test in monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing, and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII), the multi-resolution speech envelope power spectrum model (mr-sEPSM) and others. This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript and does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude-modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
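
Deriving an SRT from recognition scores amounts to fitting a psychometric function and reading off the SNR at 50% words correct. The grid-search logistic fit below is a hypothetical illustration of that step, not the paper's fitting procedure; the grid ranges and synthetic data are assumptions:

```python
import numpy as np

def estimate_srt(snrs, word_scores,
                 midpoints=np.linspace(-15.0, 5.0, 201),
                 slopes=np.linspace(0.05, 2.0, 40)):
    """Fit a logistic psychometric function by grid search; return the SRT.

    The SRT is the SNR (dB) at which the fitted function crosses 50% words
    correct, i.e. the midpoint of the best-fitting logistic curve.
    """
    snrs = np.asarray(snrs, dtype=float)
    y = np.asarray(word_scores, dtype=float)
    best_err, best_srt = np.inf, None
    for m in midpoints:
        for s in slopes:
            pred = 1.0 / (1.0 + np.exp(-s * (snrs - m)))
            err = float(np.sum((pred - y) ** 2))
            if err < best_err:
                best_err, best_srt = err, float(m)
    return best_srt

# Synthetic word scores from a listener-like psychometric function
# with a true SRT of -6 dB SNR and a slope of 0.8 per dB.
snr = np.array([-12.0, -9.0, -6.0, -3.0, 0.0, 3.0])
score = 1.0 / (1.0 + np.exp(-0.8 * (snr - (-6.0))))
srt = estimate_srt(snr, score)
```

Applying the same fit to ASR word scores and to listener data yields the SRT comparison reported in the study.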


Conference of the International Speech Communication Association (INTERSPEECH) | 2012

Hooking up spectro-temporal filters with auditory-inspired representations for robust automatic speech recognition.

Bernd T. Meyer; Constantin Spille; Birger Kollmeier; Nelson Morgan


Conference of the International Speech Communication Association (INTERSPEECH) | 2015

Improving automatic speech recognition in spatially-aware hearing aids.

Hendrik Kayser; Constantin Spille; Daniel Marquardt; Bernd T. Meyer


Conference of the International Speech Communication Association (INTERSPEECH) | 2017

Single-Ended Prediction of Listening Effort Based on Automatic Speech Recognition.

Rainer Huber; Constantin Spille; Bernd T. Meyer


Conference of the International Speech Communication Association (INTERSPEECH) | 2014

Identifying the human-machine differences in complex binaural scenes: what can be learned from our auditory system.

Constantin Spille; Bernd T. Meyer



Collaboration


Dive into Constantin Spille's collaborations.

Top Co-Authors

Nelson Morgan

University of California


Mathias Dietz

University College London
