Publication


Featured research published by Steven Wegmann.


international conference on acoustics, speech, and signal processing | 1996

Speaker normalization on conversational telephone speech

Steven Wegmann; Don McAllaster; Jeremy Orloff; Barbara Peskin

This paper reports on a simplified system for determining vocal tract normalization. Such normalization has led to significant gains in recognition accuracy by reducing variability among speakers and allowing the pooling of training data and the construction of sharper models. But standard methods for determining the warp scale have been extremely cumbersome, generally requiring multiple recognition passes. We present a new system for warp scale selection which uses a simple generic voiced speech model to rapidly select appropriate frequency scales. The selection is sufficiently streamlined that it can be moved completely into the front-end processing. Using this system on a standard test of the Switchboard Corpus, we have achieved relative reductions in word error rates of 12% over unnormalized gender-independent models and 6% over our best unnormalized gender-dependent models.
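
A rough illustration of the warp-selection idea follows: grid-search a warp factor by scoring warped features against a single generic voiced-speech model and keep the most likely scale. The GMM parameters, the feature-domain "warp", and all names below are invented stand-ins, not the paper's front end.

```python
# Sketch: pick the warp factor whose warped features score best under a
# generic voiced-speech GMM. All parameters are random stand-ins; a real
# front end would warp the mel filterbank, not the feature vector.
import numpy as np

def diag_gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal GMM."""
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, K)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = np.log(weights) + log_norm + exponent         # (T, K)
    m = log_comp.max(axis=1, keepdims=True)                  # log-sum-exp
    return float(np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))))

def warp_features(frames, alpha):
    """Crude stand-in for recomputing features with a warped filterbank:
    linearly resample each feature vector along its dimension."""
    D = frames.shape[1]
    src = np.clip(np.arange(D) * alpha, 0, D - 1)
    return np.stack([np.interp(src, np.arange(D), f) for f in frames])

def select_warp(frames, gmm, grid=np.linspace(0.88, 1.12, 13)):
    scores = {a: diag_gmm_loglik(warp_features(frames, a), *gmm) for a in grid}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
gmm = (np.full(4, 0.25), rng.normal(size=(4, 20)), np.ones((4, 20)))  # weights, means, vars
frames = rng.normal(size=(200, 20))
print("selected warp factor:", select_warp(frames, gmm))
```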


ieee automatic speech recognition and understanding workshop | 2011

Don't multiply lightly: Quantifying problems with the acoustic model assumptions in speech recognition

Daniel Gillick; Larry Gillick; Steven Wegmann

We describe a series of experiments simulating data from the standard Hidden Markov Model (HMM) framework used for speech recognition. Starting with a set of test transcriptions, we begin by simulating every step of the generative process. In each subsequent experiment, we substitute a real component for a simulated component (real state durations rather than simulating from the transition models, for example), and compare the word error rates of the resulting data, thus quantifying the relative costs of each modeling assumption. A novel sampling process allows us to test the independence assumptions of the HMM, which appear to present far more serious problems than the other data/model mismatches.
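
The flavor of these simulations can be sketched in a few lines: sample a state sequence from the transition model, then sample one Gaussian emission per frame; replacing either stage with real data (real durations or real frames) gives the intermediate conditions described above. The HMM below is a toy stand-in.

```python
# Minimal generative simulation from a left-to-right HMM with self-loops:
# sample states from the transition model, then emissions i.i.d. given
# the state. All parameters are toy stand-ins.
import numpy as np

rng = np.random.default_rng(1)

n_states = 3
trans = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
stddev = 0.5

def simulate(n_frames, start=0):
    """Sample (states, frames) entirely from the model."""
    states, frames = [], []
    s = start
    for _ in range(n_frames):
        states.append(s)
        frames.append(rng.normal(means[s], stddev))  # i.i.d. given the state
        s = rng.choice(n_states, p=trans[s])
    return np.array(states), np.array(frames)

states, frames = simulate(50)
print(states[:20])
```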


international conference on acoustics, speech, and signal processing | 2013

The blame game in meeting room ASR: An analysis of feature versus model errors in noisy and mismatched conditions

Sree Hari Krishnan Parthasarathi; Shuo-Yiin Chang; Jordan Cohen; Nelson Morgan; Steven Wegmann

Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean and matches the condition used for training the models, there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been little work in quantitatively attributing the errors to features or to models. This paper attributes the sources of these errors under three conditions: (a) matched near-field, (b) matched far-field, and (c) a mismatched condition. We undertake a series of diagnostic analyses employing the bootstrap method to probe a meeting room ASR system. Results show that when the conditions are matched (even if they are far-field), the model errors dominate; however, in mismatched conditions the features are neither invariant nor separable, and this causes as many errors as the model does.
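
The bootstrap machinery behind such a diagnosis is simple to sketch: resample test utterances with replacement and recompute WER, yielding a distribution of error rates from which two conditions can be compared. The per-utterance error counts below are synthetic placeholders, not the paper's data.

```python
# Bootstrap confidence interval for WER over a test set: resample
# utterances with replacement and recompute the aggregate error rate.
import numpy as np

rng = np.random.default_rng(2)

# (word errors, reference words) per utterance -- synthetic placeholders
errors = rng.integers(0, 5, size=500)
ref_words = rng.integers(5, 20, size=500)

def bootstrap_wer(errors, ref_words, n_resamples=1000):
    n = len(errors)
    wers = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample utterances
        wers.append(errors[idx].sum() / ref_words[idx].sum())
    return np.percentile(wers, [2.5, 50, 97.5])

lo, med, hi = bootstrap_wer(errors, ref_words)
print(f"WER median {med:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```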


international conference on acoustics, speech, and signal processing | 2015

On the importance of modeling and robustness for deep neural network features

Shuo-Yiin Chang; Steven Wegmann

A large body of research has shown that acoustic features for speech recognition can be learned from data using neural networks with multiple hidden layers (DNNs) and that these learned features are superior to standard features (e.g., MFCCs). However, this superiority is usually demonstrated when the data used to learn the features is very similar in character to the data used to test recognition performance. An open question is how well these learned features generalize to realistic data that is different in character from their training data. The ability of a feature representation to generalize to unfamiliar data is a highly desirable form of robustness. In this paper we investigate the robustness of two DNN-based feature sets to training/test mismatch using the ICSI meeting corpus. The experiments were performed under three training/test scenarios: (1) matched near-field, (2) matched far-field, and (3) the mismatched condition, near-field training with far-field testing. The experiments leverage simulation and a novel sampling process that we have developed for diagnostic analysis within the HMM-based speech recognition framework. First, diagnostic analysis shows that a DNN-based feature representation that uses MFCC inputs (MFCC-DNN) is indeed superior to the corresponding MFCC baselines in the two matched scenarios, where the recognition errors stem from incorrect models, but the DNN-based features and MFCCs have nearly identical and poor performance in the mismatched scenario. Second, we show that a DNN-based feature representation that uses a more robust input, namely the power normalized spectrum (PNS) and Gabor filters, performs nearly as well as the MFCC-DNN features in the matched scenarios and much better than MFCCs and MFCC-DNNs in the mismatched scenario.
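
A toy version of the matched/mismatched protocol (not the paper's system or corpus) can be run in a few lines: train a small network on "near-field" data, then test on matched data and on a shifted "far-field" copy to expose the mismatch penalty.

```python
# Toy matched/mismatched probe: a small classifier trained on clean
# synthetic data, evaluated on matched and on shifted/noisier data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# synthetic "near-field" frames: two classes in 10-D
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[y == 1] += 0.5

Xtr, ytr, Xte, yte = X[:1500], y[:1500], X[1500:], y[1500:]
X_far = Xte + rng.normal(0.0, 1.0, size=Xte.shape) + 1.0  # "far-field" shift

clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
clf.fit(Xtr, ytr)
print("matched accuracy:   ", clf.score(Xte, yte))
print("mismatched accuracy:", clf.score(X_far, yte))
```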


international conference on acoustics, speech, and signal processing | 2014

Chasing the metric: Smoothing learning algorithms for keyword detection

Oriol Vinyals; Steven Wegmann

In this paper we propose to directly optimize a discrete objective function by smoothing it, showing that this is effective at enhancing the figure of merit we are interested in while keeping the overall complexity of the training procedure unaltered. We looked at the task of keyword detection under data scarcity (e.g., for languages for which we do not have enough data), and found it useful to optimize the Actual Term Weighted Value (ATWV) directly. In particular, we were able to automatically set the detection threshold while improving ATWV by more than 1% using a computationally cheap method based on a smoothed ATWV, both for single systems and for system combination. Furthermore, we studied additional features to refine keyword candidates, which were easy to optimize thanks to the same techniques and improved ATWV by an additional 1%. The advantage of our method over others is that, since we can use continuous optimization techniques, it does not impose the limit on the number of parameters that other discrete optimization techniques exhibit.
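
The smoothing trick can be sketched by replacing the hard decision 1[score >= theta] inside the term-weighted value with a sigmoid, making ATWV differentiable in the threshold (and in any parameters behind the scores). The constants follow the usual NIST definition; the detection scores below are synthetic.

```python
# Smoothed ATWV: soft decisions via a sigmoid instead of a hard
# threshold, so the metric can be optimized with continuous methods.
import numpy as np

BETA = 999.9  # NIST cost ratio in the term-weighted value definition

def smoothed_atwv(scores, is_hit, n_true, theta, temp=0.05, T=36000.0):
    """scores/is_hit: arrays over all detections of one keyword;
    n_true: true occurrence count; T: seconds of audio searched."""
    p = 1.0 / (1.0 + np.exp(-(scores - theta) / temp))  # soft decisions
    n_correct = np.sum(p * is_hit)
    n_false = np.sum(p * (1 - is_hit))
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_false / (T - n_true)
    return 1.0 - p_miss - BETA * p_fa

rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(1.0, 0.5, 30),     # true hits
                         rng.normal(-1.0, 0.5, 300)])  # false alarms
is_hit = np.concatenate([np.ones(30), np.zeros(300)])

# brute-force the threshold on the smoothed objective; gradient-based
# optimization of theta (or of upstream parameters) works the same way
grid = np.linspace(-2, 2, 201)
vals = [smoothed_atwv(scores, is_hit, n_true=40, theta=t) for t in grid]
print("best theta:", grid[int(np.argmax(vals))])
```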


international conference on acoustics, speech, and signal processing | 2012

Discriminative training for speech recognition is compensating for statistical dependence in the HMM framework

Daniel Gillick; Steven Wegmann; Larry Gillick

The parameters of the standard Hidden Markov Model framework for speech recognition are typically trained via Maximum Likelihood. However, better recognition performance is achievable with discriminative training criteria like Maximum Mutual Information or Minimum Phone Error. While it is generally accepted that these discriminative criteria are better suited to minimizing Word Error Rate, there is very little qualitative intuition for how the improvements are achieved. Through a series of “resampling” experiments, we show that discriminative training (MPE in particular) appears to be compensating for a specific incorrect assumption of the HMM: that speech frames are conditionally independent.
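
For reference, the MMI criterion mentioned above has the standard textbook form below (with acoustic scale kappa; this notation is not taken from the paper itself). MPE replaces the log-posterior with an expected phone accuracy over competing hypotheses.

```latex
% Standard MMI objective: raise the posterior of the reference
% transcription W_r given the acoustics O_r against all competitors W.
\mathcal{F}_{\mathrm{MMI}}(\lambda)
  = \sum_{r} \log
    \frac{p_{\lambda}(O_r \mid W_r)^{\kappa}\, P(W_r)}
         {\sum_{W} p_{\lambda}(O_r \mid W)^{\kappa}\, P(W)}
```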


spoken language technology workshop | 2014

Syllable based keyword search: Transducing syllable lattices to word lattices

Hang Su; James Hieronymus; Yanzhang He; Eric Fosler-Lussier; Steven Wegmann

This paper presents a weighted finite state transducer (WFST) based syllable decoding and transduction framework for keyword search (KWS). Acoustic context dependent phone models are trained from word forced alignments. Then syllable decoding is done with lattices generated using a syllable lexicon and language model (LM). To process out-of-vocabulary (OOV) keywords, pronunciations are produced using a grapheme-to-syllable (G2S) system. A syllable to word lexical transducer containing both in-vocabulary (IV) and OOV keywords is then constructed and composed with a keyword-boosted LM transducer. The composed transducer is then used to transduce syllable lattices to word lattices for final KWS. We show that our method can effectively perform KWS on both IV and OOV keywords, and yields up to 0.03 Actual Term-Weighted Value (ATWV) improvement over searching keywords directly in subword lattices. Word Error Rates (WER) and KWS results are reported for three different languages.
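
A much-simplified stand-in for the lexical transduction step is shown below: map a syllable sequence to every word sequence licensed by a syllable-to-word lexicon, via dynamic programming over segmentations. A real system composes weighted lattices (e.g., with OpenFst); the lexicon entries here are invented.

```python
# Unweighted toy version of syllable-to-word transduction: enumerate all
# segmentations of a syllable sequence that the lexicon licenses.
from functools import lru_cache

LEXICON = {
    ("kee", "word"): "keyword",
    ("sur", "ch"): "search",
    ("kee",): "key",
    ("word",): "word",
}
MAX_SYLS = max(len(k) for k in LEXICON)

def transduce(syllables):
    """Return all word sequences covering the syllable sequence."""
    @lru_cache(maxsize=None)
    def expand(i):
        if i == len(syllables):
            return [[]]
        results = []
        for j in range(i + 1, min(i + MAX_SYLS, len(syllables)) + 1):
            word = LEXICON.get(tuple(syllables[i:j]))
            if word is not None:
                results += [[word] + rest for rest in expand(j)]
        return results
    return expand(0)

print(transduce(("kee", "word", "sur", "ch")))
```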


international conference on acoustics, speech, and signal processing | 2014

Calibration and multiple system fusion for spoken term detection using linear logistic regression

J. van Hout; Luciana Ferrer; Dimitra Vergyri; N. Scheffer; Yun Lei; Vikramjit Mitra; Steven Wegmann

State-of-the-art calibration and fusion approaches for spoken term detection (STD) systems currently rely on a multi-pass approach where the scores are calibrated, then fused, and finally re-calibrated to obtain a single decision threshold across keywords. While the above techniques are theoretically correct, they rely on meta-parameter tuning and are prone to over-fitting. This study presents an efficient and effective score calibration technique for keyword detection that is based on the logistic regression calibration approach commonly used in forensic speaker identification. The technique applies seamlessly to both single systems and to system fusion, and enables optimization for specific keyword detection evaluation functions. We run experiments on a Vietnamese STD task, comparing the technique with more empirical calibration and fusion schemes, and demonstrate that we can achieve comparable or better performance in terms of the NIST ATWV metric with a more elegant solution.
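
The calibration itself amounts to fitting an affine map from raw scores (or from several systems' scores at once, which performs fusion) to log-odds by minimizing cross-entropy on a development set. A minimal sketch with synthetic scores:

```python
# Linear logistic regression calibration/fusion: learn weights and bias
# mapping raw detection scores to calibrated log-likelihood ratios.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# dev-set detection scores from two systems plus hit/false-alarm labels
s1 = np.concatenate([rng.normal(1.5, 1.0, 200), rng.normal(-1.0, 1.0, 800)])
s2 = np.concatenate([rng.normal(1.0, 1.2, 200), rng.normal(-0.5, 1.2, 800)])
labels = np.concatenate([np.ones(200), np.zeros(800)])

X = np.stack([s1, s2], axis=1)
cal = LogisticRegression().fit(X, labels)  # fusion + calibration in one step

# calibrated log-odds; a single application-dependent threshold
# (derived from keyword priors and costs) can now be applied
llr = cal.decision_function(X)
print("weights:", cal.coef_, "bias:", cal.intercept_)
```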


conference of the international speech communication association | 2016

Factor Analysis Based Speaker Verification Using ASR

Hang Su; Steven Wegmann

In this paper, we propose to improve speaker verification performance by importing better posterior statistics from acoustic models trained for Automatic Speech Recognition (ASR). This approach aims to introduce state-of-the-art ASR techniques into the speaker verification task. We compare statistics collected from several ASR systems, and show that those collected from deep neural networks (DNNs) trained with fMLLR features can effectively reduce the equal error rate (EER) by more than 30% on the NIST SRE 2010 task, compared with statistics from DNNs trained without feature transformations. We also present a derivation of factor analysis using variational Bayes inference, and illustrate implementation details of factor analysis and probabilistic linear discriminant analysis (PLDA) in the Kaldi speech recognition toolkit.
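
The sufficient statistics feeding the factor analysis are easy to sketch: zeroth- and first-order Baum-Welch statistics accumulated against frame posteriors, which in the paper come from an ASR acoustic model (a DNN over senones) rather than a GMM-UBM. Everything below is a random stand-in.

```python
# Baum-Welch sufficient statistics from per-frame posteriors gamma:
# N_c = sum_t gamma_t(c), F_c = sum_t gamma_t(c) * x_t.
import numpy as np

rng = np.random.default_rng(6)

T, D, C = 300, 40, 8          # frames, feature dim, posterior classes
frames = rng.normal(size=(T, D))
logits = rng.normal(size=(T, C))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

N = gamma.sum(axis=0)         # zeroth order: (C,)
F = gamma.T @ frames          # first order:  (C, D)

# centered first-order stats (relative to per-class means mu) feed the
# i-vector/PLDA estimation; mu is a stand-in for UBM/model means here
mu = rng.normal(size=(C, D))
F_centered = F - N[:, None] * mu
print(N.shape, F_centered.shape)
```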


international conference on acoustics, speech, and signal processing | 2016

How neural network features and depth modify statistical properties of HMM acoustic models

Suman V. Ravuri; Steven Wegmann

Tandem neural network features, especially ones trained with more than one hidden layer, have improved word recognition performance, but why these features improve automatic speech recognition systems is not completely understood. In this work, we study how neural network features cope with the mismatch between the underlying stochastic process inherent in speech, and the models we use to represent that process. We use a novel resampling framework, which re-samples test set data to match the conditional independence assumptions of the acoustic model, and measure performance as we break those assumptions. We discover that depth provides modest robustness to data/model mismatch at the state level, and compared to standard MFCC features, neural network features actually fix poor duration modeling assumptions of the HMM. The duration modeling problem is also fixed by the language model, suggesting that the dictionary and language model make very strong implicit assumptions about phone length, which may now need to be revisited.
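
The core move of the resampling framework can be sketched as follows, under the simplifying assumption of a fixed state alignment: replace each frame with another frame drawn from the pool aligned to the same state, so the resampled data satisfies the frame-independence assumption by construction. Toy data throughout.

```python
# Frame resampling within state-conditioned pools: the resampled frames
# are independent given the states, matching the HMM's assumption.
import numpy as np

rng = np.random.default_rng(7)

frames = rng.normal(size=(500, 13))
alignment = rng.integers(0, 10, size=500)  # state id per frame

pools = {s: np.flatnonzero(alignment == s) for s in np.unique(alignment)}

def resample_frames(frames, alignment):
    idx = np.array([rng.choice(pools[s]) for s in alignment])
    return frames[idx]  # same alignment, independent frames given states

resampled = resample_frames(frames, alignment)
print(resampled.shape)
```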

Collaboration


Dive into Steven Wegmann's collaboration.

Top Co-Authors

Nelson Morgan

University of California

Barbara Peskin

University of California

Daniel Gillick

University of California

Jordan Cohen

University of California

Suman V. Ravuri

International Computer Science Institute
