Publication


Featured research published by Michael L. Seltzer.


International Conference on Acoustics, Speech, and Signal Processing | 2013

An investigation of deep neural networks for noise robust speech recognition

Michael L. Seltzer; Dong Yu; Yongqiang Wang

Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.
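
The noise-aware training idea lends itself to a short sketch: the network input is augmented with an estimate of the environment so the DNN can condition on it. The following is a minimal illustration under stated assumptions, not the paper's implementation; in particular, the noise estimate here is simply the mean of the first few frames, which are assumed to contain background noise only.

```python
import numpy as np

def noise_aware_features(log_mel, n_noise_frames=10):
    """Append a per-utterance noise estimate to every frame (noise-aware
    training, sketched). log_mel is a (T, D) array of log mel filterbank
    features; the first n_noise_frames are assumed to be speech-free.
    """
    noise_est = log_mel[:n_noise_frames].mean(axis=0)        # (D,) noise estimate
    noise_tiled = np.tile(noise_est, (log_mel.shape[0], 1))  # repeat per frame
    return np.concatenate([log_mel, noise_tiled], axis=1)    # (T, 2D) DNN input
```

Because the same augmented features are used for training and decoding, the method adds no extra decoding pass, consistent with the single-pass result reported above.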


International Conference on Acoustics, Speech, and Signal Processing | 2013

Recent advances in deep learning for speech research at Microsoft

Li Deng; Jinyu Li; Jui-Ting Huang; Kaisheng Yao; Dong Yu; Frank Seide; Michael L. Seltzer; Geoffrey Zweig; Xiaodong He; Jason D. Williams; Yifan Gong; Alex Acero

Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvements to these techniques and future research directions are discussed.


Speech Communication | 2004

Reconstruction of missing features for robust speech recognition

Bhiksha Raj; Michael L. Seltzer; Richard M. Stern

Speech recognition systems perform poorly in the presence of corrupting noise. Missing feature methods attempt to compensate for the noise by removing noise-corrupted components of spectrographic representations of noisy speech and performing recognition with the remaining reliable components. Conventional classifier-compensation methods modify the recognition system to work with the incomplete representations so obtained. This constrains them to perform recognition using spectrographic features, which are known to be less effective for recognition than cepstra. In this paper we present two missing-feature algorithms that reconstruct complete spectrograms from incomplete noisy ones. Cepstral vectors can then be derived from the reconstructed spectrograms for recognition. The first algorithm uses MAP procedures to estimate corrupt components from their correlations with reliable components. The second algorithm clusters spectral vectors of clean speech. Corrupt components of noisy speech are estimated from the distribution of the cluster that the analysis frame is identified with. Experiments show that, although conventional classifier-compensation methods are superior when recognition is performed with spectrographic features, cepstra derived from the reconstructed spectrograms result in better recognition performance overall. The proposed methods are also less expensive computationally and do not require modification of the recognizer.
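
The MAP reconstruction step can be sketched as a conditional Gaussian estimate. For simplicity this assumes a single Gaussian over clean log-spectral vectors; the paper's cluster-based variant applies the same estimate with the parameters of the cluster the frame is assigned to.

```python
import numpy as np

def map_reconstruct(frame, reliable, mu, cov):
    """Estimate corrupt spectral components from reliable ones under a
    joint Gaussian model of the clean log-spectral vector (sketch).

    frame:    (D,) observed log-spectral vector
    reliable: (D,) boolean mask, True where a component is reliable
    mu, cov:  mean (D,) and covariance (D, D) of clean speech
    """
    r, c = reliable, ~reliable
    # Conditional mean: mu_c + S_cr S_rr^{-1} (x_r - mu_r)
    s_rr_inv = np.linalg.inv(cov[np.ix_(r, r)])
    estimate = mu[c] + cov[np.ix_(c, r)] @ s_rr_inv @ (frame[r] - mu[r])
    out = frame.copy()
    out[c] = estimate          # reliable components are kept as observed
    return out
```

Cepstra are then computed from the reconstructed spectrogram exactly as for clean speech, which is what allows the recognizer itself to remain unmodified.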


Speech Communication | 2004

A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition

Michael L. Seltzer; Bhiksha Raj; Richard M. Stern

Missing feature methods of noise compensation for speech recognition operate by first identifying components of a spectrographic representation of speech that are considered to be corrupt. Recognition is then performed either using only the remaining reliable components, or the corrupt components are reconstructed prior to recognition. These methods require a spectrographic mask which accurately labels the reliable and corrupt regions of the spectrogram. Depending on the missing feature method applied, these masks must either contain binary values or probabilistic values. Current mask estimation techniques rely on explicit estimation of the characteristics of the corrupting noise. The estimation process usually assumes that the noise is pseudo-stationary or varies slowly with time. This is a significant drawback since the missing feature methods themselves have no such restrictions. We present a new mask estimation technique that uses a Bayesian classifier to determine the reliability of spectrographic elements. Features used for classification were designed that make no assumptions about the corrupting noise signal, but rather exploit characteristics of the speech signal itself. Experiments were performed on speech corrupted by a variety of noises, using missing feature compensation methods which require binary masks and probabilistic masks. In all cases, the proposed Bayesian mask estimation method resulted in significantly better recognition accuracy than conventional mask estimation approaches.
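
The Bayes decision at the core of the mask estimator can be sketched as follows, assuming Gaussian class-conditional densities over a per-element feature vector; the paper's actual features are designed from characteristics of the speech signal and are not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_reliable(feats, mu_rel, cov_rel, mu_cor, cov_cor, prior_rel=0.5):
    """Posterior probability that a spectrographic element is reliable,
    via Bayes' rule with Gaussian class-conditional densities (sketch)."""
    p_rel = multivariate_normal.pdf(feats, mu_rel, cov_rel) * prior_rel
    p_cor = multivariate_normal.pdf(feats, mu_cor, cov_cor) * (1 - prior_rel)
    return p_rel / (p_rel + p_cor)
```

The returned posterior serves directly as a probabilistic mask; thresholding it at 0.5 yields the binary mask required by the other compensation methods.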


International Conference on Acoustics, Speech, and Signal Processing | 2013

Multi-task learning in deep neural networks for improved phoneme recognition

Michael L. Seltzer; Jasha Droppo

In this paper we demonstrate how to improve the performance of deep neural network (DNN) acoustic models using multi-task learning. In multi-task learning, the network is trained to perform both the primary classification task and one or more secondary tasks using a shared representation. The additional model parameters associated with the secondary tasks represent a very small increase in the number of trained parameters, and can be discarded at runtime. In this paper, we explore three natural choices for the secondary task: the phone label, the phone context, and the state context. We demonstrate that, even on a strong baseline, multi-task learning can provide a significant decrease in error rate. Using phone context, the phonetic error rate (PER) on the TIMIT core test set is reduced from 21.63% to 20.25%, surpassing the best performance in the literature for a DNN that uses a standard feed-forward network architecture.
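
A sketch of the architecture in PyTorch (a toolkit assumption; the paper predates it): the hidden stack is shared, the primary head predicts senone states, and a secondary head predicts the auxiliary label. The secondary head's parameters are the only addition, and they are discarded at runtime. The weight alpha is illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    """Shared hidden layers with a primary (senone) head used for decoding
    and a secondary (e.g. phone label) head used only during training."""

    def __init__(self, in_dim, hidden, n_states, n_secondary):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.primary = nn.Linear(hidden, n_states)       # kept at runtime
        self.secondary = nn.Linear(hidden, n_secondary)  # discarded at runtime

    def forward(self, x):
        h = self.shared(x)
        return self.primary(h), self.secondary(h)

def mtl_loss(model, x, y_state, y_secondary, alpha=0.3):
    """Combined training loss; only the primary head is used when decoding."""
    logits_p, logits_s = model(x)
    ce = nn.functional.cross_entropy
    return ce(logits_p, y_state) + alpha * ce(logits_s, y_secondary)
```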


IEEE Transactions on Speech and Audio Processing | 2004

Likelihood-maximizing beamforming for robust hands-free speech recognition

Michael L. Seltzer; Bhiksha Raj; Richard M. Stern

Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
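
A sketch of the filter-and-sum parameterization that likelihood-maximizing beamforming operates on. The distinctive part of the method, optimizing the filter taps against the recognizer's likelihood of the correct hypothesis rather than against a waveform-quality criterion, is an outer loop omitted here.

```python
import numpy as np

def filter_and_sum(channels, filters):
    """Filter each microphone channel with its FIR filter, then sum.

    channels: list of M 1-D waveforms, one per microphone
    filters:  list of M FIR tap arrays; these taps are the parameters
              that likelihood-maximizing beamforming adjusts.
    """
    return sum(np.convolve(x, h, mode="same") for x, h in zip(channels, filters))
```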


International Conference on Acoustics, Speech, and Signal Processing | 2014

Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network

Jian Xue; Jinyu Li; Dong Yu; Michael L. Seltzer; Yifan Gong

The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN by the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between the two low-rank matrices. In the second method, we adapt the full weight matrix but only store the delta matrix - the difference between the original and adapted weight matrices. We decrease the footprint of the adapted model by storing a reduced-rank version of the delta matrix via an SVD. The proposed methods were evaluated on a short message dictation task. Experimental results show that we can obtain accuracy improvements similar to those of the previously proposed Kullback-Leibler divergence (KLD) regularized method with far fewer parameters, requiring only 0.89% of the original model storage.
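
The first method's restructuring is easy to sketch with a truncated SVD: the weight matrix is replaced by two low-rank factors, and a small square matrix inserted between them is the only part updated per speaker. A minimal numpy sketch, with illustrative sizes:

```python
import numpy as np

def svd_bottleneck(W, k):
    """Factor W (m x n) into A (m x k) and B (k x n) via truncated SVD,
    with an identity-initialized k x k matrix S between them. Adaptation
    updates only S; A and B stay frozen (sketch of the first method)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # absorb singular values into the left factor
    B = Vt[:k, :]
    S = np.eye(k)          # per-speaker parameters: k*k instead of m*n
    return A, S, B         # effective layer weight: A @ S @ B

W = np.random.randn(512, 512)   # stand-in for a trained DNN layer
A, S, B = svd_bottleneck(W, k=64)
W_effective = A @ S @ B         # rank-k approximation before adaptation
```

The second method applies the same truncation to the delta matrix (adapted minus original weights), so only a low-rank difference is stored per speaker.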


International Conference on Acoustics, Speech, and Signal Processing | 2011

CrowdMOS: An approach for crowdsourcing mean opinion score studies

Flavio Protasio Ribeiro; Dinei A. F. Florêncio; Cha Zhang; Michael L. Seltzer

MOS (mean opinion score) subjective quality studies are used to evaluate many signal processing methods. Since laboratory quality studies are time consuming and expensive, researchers often run small studies with less statistical significance or use objective measures which only approximate human perception. We propose a cost-effective and convenient measure called crowdMOS, obtained by having internet users participate in a MOS-like listening study. Workers listen and rate sentences at their leisure, using their own hardware, in an environment of their choice. Since these individuals cannot be supervised, we propose methods for detecting and discarding inaccurate scores. To automate crowdMOS testing, we offer a set of freely distributable, open-source tools for Amazon Mechanical Turk, a platform designed to facilitate crowdsourcing. These tools implement the MOS testing methodology described in this paper, providing researchers with a user-friendly means of performing subjective quality evaluations without the overhead associated with laboratory studies. Finally, we demonstrate the use of crowdMOS using data from the Blizzard text-to-speech competition, showing that it delivers accurate and repeatable results.
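
One simple screening rule in the spirit of crowdMOS's score filtering; the paper's actual criteria differ, and min_corr is an illustrative threshold. Workers whose ratings correlate poorly with the per-sentence consensus are flagged for removal.

```python
import numpy as np

def screen_workers(scores, min_corr=0.25):
    """Flag workers to keep, based on correlation with the consensus.

    scores: (W, S) array of MOS ratings, workers x sentences, NaN = unrated.
    """
    consensus = np.nanmean(scores, axis=0)   # mean score per sentence
    keep = []
    for w in range(scores.shape[0]):
        rated = ~np.isnan(scores[w])
        if rated.sum() < 2:                  # too few ratings to judge
            keep.append(False)
            continue
        r = np.corrcoef(scores[w, rated], consensus[rated])[0, 1]
        keep.append(r >= min_corr)
    return np.array(keep)
```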


International Conference on Acoustics, Speech, and Signal Processing | 2017

The Microsoft 2016 conversational speech recognition system

Wayne Xiong; Jasha Droppo; Xuedong Huang; Frank Seide; Michael L. Seltzer; Andreas Stolcke; Dong Yu; Geoffrey Zweig

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward- and backward-running RNNLMs, and word posterior-based system combination, provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.


International Conference on Acoustics, Speech, and Signal Processing | 2001

Speech in Noisy Environments: robust automatic segmentation, feature extraction, and hypothesis combination

Rita Singh; Michael L. Seltzer; Bhiksha Raj; Richard M. Stern

The first evaluation for Speech in Noisy Environments (SPINE1) was conducted by the Naval Research Laboratory (NRL) in August 2000. The purpose of the evaluation was to test existing core speech recognition technologies for speech in the presence of varying types and levels of noise. In this case the noises were taken from military settings. Among the strategies used by Carnegie Mellon University's successful systems designed for this task were session-adaptive segmentation, robust mel-scale filtering for the computation of cepstra, the use of parallel front-end features and noise-compensation algorithms, and parallel hypothesis combination through word-graphs. This paper describes the motivations behind the design decisions taken for these components, supported by observations and experiments.

Collaboration


Dive into Michael L. Seltzer's collaborations.

Top Co-Authors

Richard M. Stern

Carnegie Mellon University
