Bernd T. Meyer
University of Oldenburg
Publications
Featured research published by Bernd T. Meyer.
Journal of the Acoustical Society of America | 2012
Marc René Schädler; Bernd T. Meyer; Birger Kollmeier
In an attempt to increase the robustness of automatic speech recognition (ASR) systems, a feature extraction scheme is proposed that takes spectro-temporal modulation frequencies (MF) into account. This physiologically inspired approach uses a two-dimensional filter bank based on Gabor filters, which limits the redundant information between feature components, and also results in physically interpretable features. Robustness against extrinsic variation (different types of additive noise) and intrinsic variability (arising from changes in speaking rate, effort, and style) is quantified in a series of recognition experiments. The results are compared to reference ASR systems using Mel-frequency cepstral coefficients (MFCCs), MFCCs with cepstral mean subtraction (CMS) and RASTA-PLP features, respectively. Gabor features are shown to be more robust against extrinsic variation than the baseline systems without CMS, with relative improvements of 28% and 16% for two training conditions (using only clean training samples or a mixture of noisy and clean utterances, respectively). When used in a state-of-the-art system, improvements of 14% are observed when spectro-temporal features are concatenated with MFCCs, indicating the complementarity of those feature types. An analysis of the importance of specific MF shows that temporal MF up to 25 Hz and spectral MF up to 0.25 cycles/channel are beneficial for ASR.
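The 2D Gabor filter bank described above can be illustrated with a small sketch. This is a simplified stand-in rather than the published filter bank: the envelope here is a fixed Hann window instead of one tied to the modulation frequency, and the function names, filter sizes, and parameter values are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import correlate2d

def gabor_filter_2d(omega_t, omega_f, size_t=25, size_f=9):
    """Sketch of a 2D spectro-temporal Gabor filter.

    omega_t: temporal modulation frequency (radians per frame)
    omega_f: spectral modulation frequency (radians per channel)
    Returns a cosine carrier windowed by a separable 2D Hann envelope,
    with the mean removed so the filter has zero DC response.
    """
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f)                       # shape (size_f, size_t)
    carrier = np.cos(omega_t * T + omega_f * F)    # oriented sinusoid
    hann_t = 0.5 + 0.5 * np.cos(2 * np.pi * T / (size_t + 1))
    hann_f = 0.5 + 0.5 * np.cos(2 * np.pi * F / (size_f + 1))
    g = carrier * hann_t * hann_f
    return g - g.mean()                            # zero response to constants

def gabor_features(spectrogram, filters):
    """Correlate a log-mel spectrogram (channels x frames) with each
    filter and keep the center channel's response per frame."""
    feats = []
    for g in filters:
        resp = correlate2d(spectrogram, g, mode='same')
        feats.append(resp[spectrogram.shape[0] // 2])  # center channel
    return np.stack(feats)                             # (n_filters, frames)
```

Purely temporal filters correspond to omega_f = 0, purely spectral filters to omega_t = 0, and diagonal spectro-temporal filters to both nonzero.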
Speech Communication | 2011
Bernd T. Meyer; Birger Kollmeier
The effect of bio-inspired spectro-temporal processing for automatic speech recognition (ASR) is analyzed for two different tasks with focus on the robustness of spectro-temporal Gabor features in comparison to mel-frequency cepstral coefficients (MFCCs). Experiments aiming at extrinsic factors such as additive noise and changes of the transmission channel were carried out on a digit classification task (AURORA 2) for which spectro-temporal features were found to be more robust than the MFCC baseline against a wide range of noise sources. Intrinsic variations, i.e., changes in speaking rate, speaking effort and pitch, were analyzed on a phoneme recognition task with matched training and test conditions. The sensitivity of Gabor and MFCC features against various speaking styles was found to be different in a systematic way. An analysis based on phoneme confusions for both feature types suggests that spectro-temporal and purely spectral features carry complementary information. The usefulness of the combined information was demonstrated in a system using a combination of both types of features which yields a decrease in word-error rate of 16% compared to the best single-stream recognizer and 47% compared to an MFCC baseline.
Journal of the Acoustical Society of America | 2011
Bernd T. Meyer; Thomas Brand; Birger Kollmeier
The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
International Conference on Acoustics, Speech, and Signal Processing | 2012
Howard Lei; Bernd T. Meyer; Nikki Mirghafori
In this work, we investigated the performance of 2D Gabor features (known as spectro-temporal features) for speaker recognition. Gabor features have mainly been used for automatic speech recognition (ASR), where they have yielded improvements. We explored different Gabor feature implementations, along with different speaker recognition approaches, on the ROSSI [1] and NIST SRE08 databases. On the noisy ROSSI database, the Gabor features performed as well as standalone MFCC features, and score-level combination of Gabor and MFCC features resulted in an 8% relative EER improvement over standalone MFCC features. These results demonstrate the value of both spectral and temporal information for feature extraction, and the complementarity of Gabor features to MFCC features.
Journal of the Acoustical Society of America | 2010
Bernd T. Meyer; Tim Jürgens; Thorsten Wesker; Thomas Brand; Birger Kollmeier
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
International Conference on Acoustics, Speech, and Signal Processing | 2013
Shuo-Yiin Chang; Bernd T. Meyer; Nelson Morgan
Previous work has demonstrated that spectro-temporal Gabor features reduce word error rates for automatic speech recognition under noisy conditions. However, features based on mel spectra are easily corrupted in the presence of noise or channel distortion. We have exploited an algorithm for power normalized cepstral coefficients (PNCCs) to generate a more robust spectro-temporal representation. We refer to it as the power normalized spectrum (PNS), and to the corresponding output processed by Gabor filters and MLP nonlinear weighting as PNS-Gabor. We show that the proposed feature outperforms the state-of-the-art noise-robust features ETSI-AFE and PNCC on both Aurora2 and a noisy version of the Wall Street Journal (WSJ) corpus. A comparison of the individual processing steps of mel spectra and PNS shows that power bias subtraction is the most important aspect of PNS-Gabor features in providing an improvement over Mel-Gabor features. The results indicate that Gabor processing compensates for the limitation of PNCC on channels with frequency-shift characteristics. Overall, PNS-Gabor features decrease the word error rate by 32% relative to MFCC and 13% relative to PNCC on Aurora2. For noisy WSJ, they decrease the word error rate by 30.9% relative to MFCC and 24.7% relative to PNCC.
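The power-bias-subtraction step identified above as the key ingredient can be sketched roughly as follows. The published PNCC algorithm uses a medium-time, adaptive bias estimate; the quantile-based noise-floor estimate, the function name, and the parameter values here are simplifications assumed for illustration (only the power-law exponent of 1/15 follows the PNCC literature).

```python
import numpy as np

def power_normalized_spectrum(power_spec, q=0.1, power_exp=1.0 / 15.0):
    """Simplified sketch of power bias subtraction followed by
    power-law compression, loosely after the PNCC front end.

    power_spec: (channels, frames) non-negative power spectrogram.
    q: quantile used as a crude per-channel noise-floor (bias) estimate;
       the published algorithm uses a medium-time, adaptive estimate.
    """
    bias = np.quantile(power_spec, q, axis=1)[:, None]  # per-channel floor
    cleaned = np.maximum(power_spec - bias, 1e-10)      # stay positive
    return cleaned ** power_exp                         # power-law nonlinearity
```

In the PNS-Gabor pipeline, the Gabor filter bank would then be applied to this representation in place of the mel spectrogram.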
International Conference on Acoustics, Speech, and Signal Processing | 2013
Constantin Spille; Mathias Dietz; Volker Hohmann; Bernd T. Meyer
The segregation of concurrent speakers and other sound sources is an important capability of the human auditory system but is missing in most current systems for automatic speech recognition (ASR), resulting in a large gap between human and machine performance. The present study uses a physiologically motivated model of binaural hearing to estimate the position of moving speakers in a noisy environment by combining methods from Computational Auditory Scene Analysis (CASA) and ASR. The binaural model is paired with a particle filter and a beamformer to enhance spoken sentences that are transcribed by the ASR system. Results of an evaluation in a clean, anechoic two-speaker condition show that word recognition rates increase from 30.8% to 72.6%, demonstrating the potential of the CASA-based approach. In noisy environments, improvements were also observed for SNRs of 5 dB and above, which is attributed to average tracking errors that remain consistent over a wide range of SNRs.
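A bootstrap particle filter of the kind paired with the binaural model above can be sketched in minimal form. This version tracks a single speaker's azimuth from noisy direction estimates; the motion and observation models, particle count, and noise parameters are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def particle_filter_track(observations, n_particles=500, proc_std=2.0,
                          obs_std=5.0, seed=0):
    """Minimal bootstrap particle filter tracking a speaker azimuth
    (degrees) from a sequence of noisy direction estimates, standing in
    for the binaural-model front end.
    """
    rng = np.random.default_rng(seed)
    particles = rng.uniform(-90, 90, n_particles)       # azimuth hypotheses
    estimates = []
    for z in observations:
        particles += rng.normal(0, proc_std, n_particles)    # motion model
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)  # likelihood
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)      # resample
        particles = particles[idx]
        estimates.append(particles.mean())                   # point estimate
    return np.array(estimates)
```

The smoothed azimuth trajectory would then steer the beamformer toward the tracked speaker before recognition.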
International Conference on Acoustics, Speech, and Signal Processing | 2013
Feifei Xiong; Stefan Goetze; Bernd T. Meyer
A novel method for blind estimation of the reverberation time (RT60) is proposed, based on applying spectro-temporal modulation filters to time-frequency representations. 2D Gabor filters arranged in a filter bank enable an analysis of the properties of temporal, spectral, and spectro-temporal filtering for this task. The features are used as input to a multi-layer perceptron (MLP) classifier combined with a simple decision rule that attributes a specific RT60 to a given utterance and allows the reliability of the approach to be assessed for different resolutions of RT60 classification. While the filter set including temporal, spectral, and spectro-temporal filters already outperforms an MFCC baseline, the error rates are further reduced when relying on diagonal spectro-temporal filters alone. The average error rate is 1.9% for the best feature set, which corresponds to a relative reduction of 58.3% compared to the MFCC baseline for RT60s at 0.1 s resolution.
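The utterance-level decision rule on top of the frame-wise MLP posteriors might look like the following sketch. Averaging the posteriors over the utterance before taking the argmax is an assumption about the "simple decision rule", since the exact rule is not spelled out in the abstract.

```python
import numpy as np

def classify_rt60(frame_posteriors, rt60_classes):
    """Assumed utterance-level decision rule: average the per-frame
    class posteriors over the utterance and pick the RT60 class with
    the highest mean posterior.

    frame_posteriors: (frames, n_classes) MLP outputs per frame.
    rt60_classes: list of RT60 values in seconds, one per class.
    """
    mean_post = frame_posteriors.mean(axis=0)
    return rt60_classes[int(np.argmax(mean_post))]
```

With classes spaced 0.1 s apart, misclassifications under this rule correspond directly to the resolution-dependent error rates reported above.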
Carbohydrate Research | 1990
Heide Kogelberg; Bernd T. Meyer
2,3:5,6:3′,4′-Tri-O-isopropylidenelactose dimethyl acetal was converted into the 2′-sulfate derivatives 9 and 10 and the 2′,6′-disulfate derivative 2. A 2′-sulfate group was shown to induce a change of the ring conformation of the galactosyl residue to a 3,0B conformation. 2′-Sulfated lactose derivatives were synthesized which differed only in the group linked to O-6′, namely the 6′-O-tert-butyldimethylsilyl 9, the 6′-O-triphenylmethyl 10, and the 6′-sulfate 2, all of which adopt a conformation close to a 3,0B boat. The conformation of the galactosyl ring is slightly influenced by the group at O-6′, with 2 showing the most pronounced effect and 9 the smallest changes. Various NMR spectroscopic parameters, including coupling constants, NOEs, and T1 values, proved that the galactosyl residues of 9, 10, and 2 adopt a 3,0B conformation. However, other protective groups at O-2′, such as an O-acetyl group in 7, an O-benzyl group in 8, or the hydroxy group itself in 1, 4, 5, and 3, do not cause a change of the ring conformation. Using the GESA program, we found that the observed change of conformation cannot be explained by unfavorable steric interactions between the 2′-sulfate group and other parts of the molecule. However, the conformational change observed here could be attributable to electronic effects that are unique to the sulfate group.
International Conference on Acoustics, Speech, and Signal Processing | 2014
Feifei Xiong; Stefan Goetze; Bernd T. Meyer
This work analyzes the influence of reverberation on automatic speech recognition (ASR) systems and how to compensate for it, with special focus on the important acoustic parameters, i.e., the room reverberation time T60 and the clarity index C50. A multilayer perceptron (MLP) using features of a spectro-temporal filter bank as input is employed to identify acoustic conditions spanning various reverberant scenarios. The posterior probabilities of the MLP are used to design a novel selection scheme for cluster-based adaptation and for system combination via recognizer output voting error reduction (ROVER). A comparison of word error rates across different training modes shows an average relative improvement of 7.1% for the proposed system over conventional multi-style training.
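ROVER's word-level voting can be illustrated in toy form. The real ROVER first aligns the recognizer outputs into a word transition network via dynamic-programming alignment; the sketch below assumes the hypotheses are already aligned word-for-word, which is a simplification for illustration.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Toy word-level majority vote in the spirit of ROVER.

    hypotheses: list of word sequences of equal length, assumed to be
    already aligned position-by-position. For each position, the word
    produced by the most recognizers wins.
    """
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]
```

For example, combining three aligned recognizer outputs recovers the word each position agrees on by majority, even when every single recognizer makes one error.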