Publication


Featured research published by Xuedong Huang.


Computer Speech & Language | 1992

The SPHINX-II Speech Recognition System: An Overview

Xuedong Huang; Fileno A. Alleva; Hsiao-Wuen Hon; Mei-Yuh Hwang; Ronald Rosenfeld

Speech recognizers must cope with increased task perplexity, speaker variation, and environment variation, and steady progress has been made along these three dimensions at Carnegie Mellon. In this paper, we review the SPHINX-II speech recognition system and summarize our recent efforts on improved speech recognition.


Communications of the ACM | 2004

Challenges in adopting speech recognition

Li Deng; Xuedong Huang

Although progress has been impressive, there are still several hurdles that speech recognition technology must clear before ubiquitous adoption can be realized. R&D in spontaneous and free-flowing speech style is critical to its success.


International Conference on Acoustics, Speech, and Signal Processing | 2001

High-performance robust speech recognition using stereo training data

Li Deng; Alex Acero; Li Jiang; Jasha Droppo; Xuedong Huang

We describe SPLICE (Stereo-based Piecewise Linear Compensation for Environments), a novel technique for high-performance robust speech recognition. It is an efficient noise reduction and channel distortion compensation technique that makes effective use of stereo training data. We present a new version of SPLICE using the minimum-mean-square-error decision, and describe an extension that trains clusters of hidden Markov models (HMMs) with SPLICE processing. Comprehensive results on a Wall Street Journal large-vocabulary recognition task with a wide range of noise types demonstrate the superior performance of the SPLICE technique, even over training under noisy matched conditions (a 19% word error rate reduction). The new technique also consistently outperforms spectral-subtraction noise reduction, and is currently being integrated into Microsoft MiPad, a new-generation PDA prototype.
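As a rough illustration of the SPLICE idea, the sketch below (Python with NumPy; all names are illustrative, not from the paper's implementation) partitions the noisy cepstral space with a Gaussian mixture, learns one correction vector per mixture component from stereo (clean, noisy) frame pairs, and cleans a new frame with the posterior-weighted (MMSE) sum of corrections.

    import numpy as np

    def gmm_posteriors(y, means, variances, weights):
        """Diagonal-Gaussian mixture posteriors p(s | y) for one noisy frame y."""
        log_p = (np.log(weights)
                 - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                 - 0.5 * np.sum((y - means) ** 2 / variances, axis=1))
        log_p -= log_p.max()                      # numerical stability
        p = np.exp(log_p)
        return p / p.sum()

    def train_splice_biases(clean, noisy, means, variances, weights):
        """r_s = posterior-weighted average of (x - y) over stereo frame pairs."""
        num = np.zeros_like(means)
        den = np.zeros(len(weights))
        for x, y in zip(clean, noisy):
            post = gmm_posteriors(y, means, variances, weights)
            num += post[:, None] * (x - y)
            den += post
        return num / den[:, None]

    def splice_enhance(y, biases, means, variances, weights):
        """MMSE clean-speech estimate: x_hat = y + sum_s p(s|y) * r_s."""
        post = gmm_posteriors(y, means, variances, weights)
        return y + post @ biases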


Human Language Technology | 1993

Efficient cepstral normalization for robust speech recognition

Fu-Hua Liu; Richard M. Stern; Xuedong Huang; Alejandro Acero

In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments. We describe the MFCDCN algorithm, an environment-independent extension of the efficient SDCN and FCDCN algorithms developed previously, and compare these algorithms with the very simple RASTA and cepstral mean normalization procedures, reporting their performance in the context of the 1992 DARPA CSR evaluation using secondary microphones and in the DARPA stress-test evaluation.
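Of the procedures compared, cepstral mean normalization is simple enough to state in a few lines. A minimal sketch (Python/NumPy, assuming per-utterance processing): subtracting each utterance's mean cepstral vector removes any stationary convolutional channel effect, which appears as an additive constant in the cepstral domain.

    import numpy as np

    def cepstral_mean_normalize(cepstra):
        """cepstra: (num_frames, num_coeffs) array for one utterance."""
        return cepstra - cepstra.mean(axis=0, keepdims=True)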


IEEE Transactions on Speech and Audio Processing | 1993

Shared-distribution hidden Markov models for speech recognition

Mei-Yuh Hwang; Xuedong Huang

A shared-distribution hidden Markov model (HMM) is presented for speaker-independent continuous speech recognition. The output distributions across different phonetic HMMs are shared with each other when they exhibit acoustic similarity. This sharing provides the freedom to use a larger number of Markov states for each phonetic model. Although an increase in the number of states will increase the total number of free parameters, with distribution sharing one can collapse redundant states while maintaining necessary ones. The shared-distribution model reduced the word error rate on the DARPA Resource Management task by 20% in comparison with the generalized-triphone model.
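A toy sketch of the sharing step, under the simplifying assumption of diagonal-Gaussian state outputs (the paper uses discrete output distributions, and its similarity measure differs): greedily merge the two most similar distributions until a target pool size is reached, then let HMM states index into the shared pool.

    import numpy as np

    def bhattacharyya(m1, v1, m2, v2):
        """Bhattacharyya distance between two diagonal Gaussians."""
        v = 0.5 * (v1 + v2)
        return (0.125 * np.sum((m1 - m2) ** 2 / v)
                + 0.5 * np.sum(np.log(v / np.sqrt(v1 * v2))))

    def tie_distributions(means, variances, target_size):
        """Merge the closest pairs until target_size distributions remain.
        Returns the shared pool and a map from state index to pool index."""
        pool = [(m, v, [i]) for i, (m, v) in enumerate(zip(means, variances))]
        while len(pool) > target_size:
            i, j = min(((a, b) for a in range(len(pool))
                        for b in range(a + 1, len(pool))),
                       key=lambda ab: bhattacharyya(pool[ab[0]][0], pool[ab[0]][1],
                                                    pool[ab[1]][0], pool[ab[1]][1]))
            m = 0.5 * (pool[i][0] + pool[j][0])     # crude merge: average
            v = 0.5 * (pool[i][1] + pool[j][1])
            merged = (m, v, pool[i][2] + pool[j][2])
            pool = [p for k, p in enumerate(pool) if k not in (i, j)] + [merged]
        state_to_pool = {}
        for k, (_, _, members) in enumerate(pool):
            for s in members:
                state_to_pool[s] = k
        return pool, state_to_pool

    means = np.random.randn(10, 3)      # 10 state distributions, dimension 3
    variances = np.ones((10, 3))
    pool, tying = tie_distributions(means, variances, target_size=4)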


International Conference on Acoustics, Speech, and Signal Processing | 1993

Predicting unseen triphones with senones

Mei-Yuh Hwang; Xuedong Huang; Fileno A. Alleva

In large-vocabulary speech recognition, there are always new triphones that are not covered in the training data. These unseen triphones are usually represented by corresponding diphones or context-independent monophones. It is proposed that decision-tree-based senones be used to generate the needed senonic baseforms for unseen triphones. A decision tree is built for each individual Markov state of each phone, and the leaves of the trees constitute the senone codebook. A Markov state of any triphone traverses the corresponding tree until it reaches a leaf, which determines the senone it is associated with. The DARPA 5000-word speaker-independent Wall Street Journal dictation task is used to evaluate the proposed method. The word error rate is reduced by more than 10% when unseen triphones are modeled by the decision-tree-based senones.
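A toy sketch of the lookup (the question set and tree below are invented for illustration, not taken from the paper): one binary tree per (base phone, Markov-state position), where each internal node asks a phonetic question about the triphone's context and each leaf holds a senone ID, so even a triphone never seen in training reaches a leaf.

    VOWELS = {"aa", "ae", "ih", "iy", "uw"}
    NASALS = {"m", "n", "ng"}

    class Node:
        def __init__(self, question=None, yes=None, no=None, senone=None):
            self.question, self.yes, self.no, self.senone = question, yes, no, senone

    def find_senone(tree, left_ctx, right_ctx):
        """Traverse the tree for one Markov state of a triphone until a leaf."""
        node = tree
        while node.senone is None:
            node = node.yes if node.question(left_ctx, right_ctx) else node.no
        return node.senone

    # Toy tree for, say, state 1 of phone /t/:
    tree = Node(question=lambda l, r: r in VOWELS,
                yes=Node(senone=101),
                no=Node(question=lambda l, r: l in NASALS,
                        yes=Node(senone=102),
                        no=Node(senone=103)))

    print(find_senone(tree, left_ctx="n", right_ctx="s"))   # -> 102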


International Conference on Spoken Language Processing | 1996

Whistler: a trainable text-to-speech system

Xuedong Huang; Alex Acero; Jim Adcock; Hsiao-Wuen Hon; John Goldsmith; Jingsong Liu; Mike Plumpe

We introduce Whistler, a trainable text-to-speech (TTS) system that automatically learns its model parameters from a corpus. Both prosody parameters and concatenative speech units are derived through probabilistic learning methods that have been used successfully for speech recognition. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style.
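As a rough sketch of the concatenative half of such a system (the crossfade join below is a generic stand-in; Whistler's actual units and prosody models are learned from a corpus), synthesis amounts to splicing stored waveform units together smoothly:

    import numpy as np

    def concatenate_units(units, sample_rate=16000, xfade_ms=5):
        """Join a list of waveform units with a short linear crossfade."""
        n = int(sample_rate * xfade_ms / 1000)
        out = units[0].astype(float)
        for u in units[1:]:
            u = u.astype(float)
            ramp = np.linspace(0.0, 1.0, n)
            out[-n:] = out[-n:] * (1 - ramp) + u[:n] * ramp   # crossfade join
            out = np.concatenate([out, u[n:]])
        return out

    a = np.sin(np.linspace(0, 100, 1600))   # toy "units"
    b = np.sin(np.linspace(0, 80, 1600))
    wave = concatenate_units([a, b])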


International Conference on Acoustics, Speech, and Signal Processing | 2017

The Microsoft 2016 conversational speech recognition system

Wayne Xiong; Jasha Droppo; Xuedong Huang; Frank Seide; Michael L. Seltzer; Andreas Stolcke; Dong Yu; Geoffrey Zweig

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine-learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward- and backward-running RNNLMs, and word-posterior-based system combination, provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, an improvement over previously reported results on this benchmark task.
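As an illustration of word-posterior-based combination in its simplest ROVER-style form (the real system's combination is more elaborate, and the alignment of hypotheses into slots is assumed already done), each slot's winning word is the one with the highest weighted sum of per-system posteriors:

    from collections import defaultdict

    def combine_slot(hypotheses, weights=None):
        """hypotheses: list over systems of (word, posterior) for one slot."""
        if weights is None:
            weights = [1.0] * len(hypotheses)
        score = defaultdict(float)
        for w_sys, (word, post) in zip(weights, hypotheses):
            score[word] += w_sys * post
        return max(score, key=score.get)

    slot = [("speech", 0.7), ("speech", 0.6), ("peach", 0.8)]
    print(combine_slot(slot))   # -> "speech": 1.3 beats "peach": 0.8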


International Conference on Acoustics, Speech, and Signal Processing | 2000

A unified context-free grammar and n-gram model for spoken language processing

Ye-Yi Wang; Milind Mahajan; Xuedong Huang

While context-free grammars (CFGs) remain one of the most important formalisms for interpreting natural language, word n-gram models are surprisingly powerful for domain-independent applications. We propose to unify these two formalisms for both speech recognition and spoken language understanding (SLU). To address portability, we incorporate domain-specific CFGs into a domain-independent n-gram model, which can improve the generalizability of the CFG and the specificity of the n-gram. In our experiments, the unified model significantly reduces test-set perplexity from 378 to 90 in comparison with a domain-independent word trigram. The unified model converges well when domain-specific data become available; with a limited amount of domain-specific data, perplexity is further reduced from 90 to 65. While we have demonstrated excellent portability, the full potential of our approach lies in the unified recognition and understanding that we are investigating.
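A toy sketch of the unification (helper names, spans, and scores are hypothetical): each CFG nonterminal is treated as a single token inside the n-gram, and the words it covers are scored by the CFG itself, e.g. a <TIME> nonterminal standing in for "two pm".

    def unified_logprob(words, spans, ngram_logprob, cfg_logprob):
        """words: the word sequence; spans: {(start, end): nonterminal} marking
        the substrings the CFG parses; everything else is left to the n-gram."""
        tokens, i, total = [], 0, 0.0
        while i < len(words):
            for (s, e), nt in spans.items():
                if s == i:                          # CFG covers words[s:e]
                    total += cfg_logprob(nt, words[s:e])
                    tokens.append(nt)
                    i = e
                    break
            else:
                tokens.append(words[i])
                i += 1
        total += ngram_logprob(tokens)              # n-gram over words + nonterminals
        return total

    # Toy usage with stub scorers:
    lp = unified_logprob(
        ["meeting", "at", "two", "pm"],
        {(2, 4): "<TIME>"},
        ngram_logprob=lambda toks: -1.0 * len(toks),   # stub n-gram score
        cfg_logprob=lambda nt, ws: -0.5)               # stub CFG score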


International Conference on Acoustics, Speech, and Signal Processing | 1992

Subphonetic modeling with Markov states-Senone

Mei-Yuh Hwang; Xuedong Huang

There will never be sufficient training data to model all the various acoustic-phonetic phenomena. How to capture important clues and estimate the needed parameters reliably is one of the central issues in speech recognition. Successful examples include subword models, fenones, and many other smoothing techniques. In comparison with subword models, subphonetic modeling may provide a finer level of detail. The authors propose to model subphonetic events with Markov states, treating the state in phonetic hidden Markov models as the basic subphonetic unit: the senone. Senones generalize fenones in several ways. A word model is a concatenation of senones, and senones can be shared across different word models. Senone models not only allow parameter sharing but also enable pronunciation optimization. The authors report preliminary senone modeling results, which significantly reduce the word error rate for speaker-independent continuous speech recognition.
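A toy sketch of the sharing this enables (the IDs and tables are invented for illustration): every Markov state of every phonetic HMM points into one shared senone table, so word models built from different phone sequences can still share output distributions state by state.

    # One senone ID per (phone, state-position); entries may coincide, which
    # is exactly the parameter sharing described above.
    SENONE_OF_STATE = {
        ("t", 0): 7, ("t", 1): 12, ("t", 2): 12,    # states 1 and 2 of /t/ tied
        ("uw", 0): 3, ("uw", 1): 3, ("uw", 2): 9,   # states 0 and 1 of /uw/ tied
    }
    STATES_PER_PHONE = 3

    def word_model(phones):
        """A word model is just the concatenation of its phones' senones."""
        return [SENONE_OF_STATE[(p, s)]
                for p in phones
                for s in range(STATES_PER_PHONE)]

    print(word_model(["t", "uw"]))   # "two" -> [7, 12, 12, 3, 3, 9]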
