Publications


Featured research published by Hong-Kwang Jeff Kuo.


Conference of the International Speech Communication Association | 2016

The IBM 2016 English Conversational Telephone Speech Recognition System

George Saon; Tom Sercu; Steven J. Rennie; Hong-Kwang Jeff Kuo

We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model M and hierarchical neural network LMs.
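
The score fusion mentioned above can be pictured as a frame-level log-linear combination of the individual models' state posteriors. The following is a minimal sketch under that reading; the fusion weights and tensor shapes are illustrative assumptions, not the system's actual configuration.

```python
import numpy as np

def fuse_scores(log_posteriors, weights):
    """Log-linear fusion of per-frame log posteriors from several
    acoustic models; each array has shape (T, num_states)."""
    fused = sum(w * lp for w, lp in zip(weights, log_posteriors))
    # Renormalize so every frame is again a proper log-distribution.
    fused -= np.logaddexp.reduce(fused, axis=1, keepdims=True)
    return fused

# Hypothetical usage with three models (RNN, CNN, and BLSTM outputs).
T, S = 100, 500
scores = [np.log(np.random.dirichlet(np.ones(S), size=T)) for _ in range(3)]
fused = fuse_scores(scores, weights=[0.4, 0.3, 0.3])
```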


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Maximum entropy direct models for speech recognition

Hong-Kwang Jeff Kuo; Yuqing Gao

Traditional statistical models for speech recognition have mostly been based on a Bayesian framework using generative models such as hidden Markov models (HMMs). This paper focuses on a new framework for speech recognition using maximum entropy direct modeling, where the probability of a state or word sequence given an observation sequence is computed directly from the model. In contrast to HMMs, features can be asynchronous and overlapping. This model therefore allows for the potential combination of many different types of features, which need not be statistically independent of each other. In this paper, a specific kind of direct model, the maximum entropy Markov model (MEMM), is studied. Even with conventional acoustic features, the approach already shows promising results for phone level decoding. The MEMM significantly outperforms traditional HMMs in word error rate when used as stand-alone acoustic models. Preliminary results combining the MEMM scores with HMM and language model scores show modest improvements over the best HMM speech recognizer.
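
To make the direct-modeling idea concrete: in a MEMM the per-step distributions P(s_t | s_{t-1}, o_t) are produced by a maximum entropy classifier, and decoding reduces to a Viterbi search over them. The sketch below assumes dense per-step log-probability tables and a designated start state; it illustrates the decoding recursion, not the paper's implementation.

```python
import numpy as np

def memm_viterbi(step_logprobs, start_state=0):
    """Viterbi decoding for a MEMM. step_logprobs[t][i, j] is
    log P(s_t = j | s_{t-1} = i, o_t), i.e. the t-th maximum entropy
    conditional table (here precomputed for the observed o_t)."""
    T, S = len(step_logprobs), step_logprobs[0].shape[1]
    delta = step_logprobs[0][start_state]        # best score per state
    back = np.zeros((T, S), dtype=int)           # backpointers
    for t in range(1, T):
        cand = delta[:, None] + step_logprobs[t]   # (prev, next)
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    # Trace the best state sequence backwards.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
steps = [np.log(rng.dirichlet(np.ones(4), size=4)) for _ in range(6)]
print(memm_viterbi(steps))   # e.g. [3, 1, 0, 2, 2, 1]
```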


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Advances in Arabic Speech Transcription at IBM Under the DARPA GALE Program

Hagen Soltau; George Saon; Brian Kingsbury; Hong-Kwang Jeff Kuo; Lidia Mangu; Daniel Povey; Ahmad Emami

This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 2.5 machine translation evaluation. Key advances include the use of additional training data from the Linguistic Data Consortium (LDC), use of a very large vocabulary comprising 737K words and 2.5M pronunciation variants, automatic vowelization using flat-start training, cross-adaptation between unvowelized and vowelized acoustic models, and rescoring with a neural-network language model. The resulting system achieves word error rates below 10% on Arabic broadcasts. Very large scale experiments with unsupervised training demonstrate that the utility of unsupervised data depends on the amount of supervised data available. While unsupervised training improves system performance when a limited amount (135 h) of supervised data is available, these gains disappear when a greater amount (848 h) of supervised data is used, even with a very large (7069 h) corpus of unsupervised data. We also describe a method for modeling Arabic dialects that avoids the problem of data sparseness entailed by dialect-specific acoustic models via the use of non-phonetic dialect questions in the decision trees. We show how this method can be used with a statically compiled decoding graph by partitioning the decision trees into a static component and a dynamic component, with the dynamic component being replaced by a mapping that is evaluated at run time.
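
The static/dynamic decision-tree partitioning can be pictured as follows: phonetic questions are compiled into the static decoding graph, whose leaves are virtual units, and a run-time table keyed by the dialect resolves each virtual unit to a physical acoustic state. Every identifier in this sketch is hypothetical.

```python
# A hypothetical illustration of the static/dynamic split: phonetic
# questions live in the static tree (compiled offline into the graph),
# while non-phonetic dialect questions become a run-time mapping.

def static_leaf(left_phone: str, right_phone: str) -> str:
    """Static component: a toy phonetic decision tree whose leaves
    are virtual acoustic units baked into the decoding graph."""
    if left_phone in {"b", "p", "m"}:
        return "leaf_A"
    return "leaf_B"

# Dynamic component: (virtual leaf, dialect) -> physical HMM state.
DIALECT_MAP = {
    ("leaf_A", "levantine"): 203, ("leaf_A", "msa"): 204,
    ("leaf_B", "levantine"): 517, ("leaf_B", "msa"): 518,
}

def physical_state(left: str, right: str, dialect: str) -> int:
    """Resolve the dialect-dependent state at run time."""
    return DIALECT_MAP[(static_leaf(left, right), dialect)]

print(physical_state("b", "a", "msa"))   # -> 204
```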


IEEE Automatic Speech Recognition and Understanding Workshop | 2009

Syntactic features for Arabic speech recognition

Hong-Kwang Jeff Kuo; Lidia Mangu; Ahmad Emami; Imed Zitouni; Young-Suk Lee

We report word error rate improvements with syntactic features using a neural probabilistic language model through N-best re-scoring. The syntactic features we use include exposed head words and their non-terminal labels both before and after the predicted word. Neural network LMs generalize better to unseen events by modeling words and other context features in continuous space. They are suitable for incorporating many different types of features, including syntactic features, where there is no pre-defined back-off order. We choose an N-best re-scoring framework to be able to take full advantage of the complete parse tree of the entire sentence. Using syntactic features, along with morphological features, improves the word error rate (WER) by up to 5.5% relative, from 9.4% to 8.6%, on the latest GALE evaluation test set.
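
As a rough illustration of the N-best re-scoring framework: each hypothesis's existing scores are interpolated with the score of the syntactic-feature neural LM, and the best-scoring hypothesis is selected. The interpolation scheme, the weight `lam`, and the stand-in LM below are assumptions for illustration only.

```python
def rescore_nbest(nbest, syntactic_lm, lam=0.3):
    """nbest: list of (words, am_score, lm_score), all in the log
    domain. Interpolates the baseline LM score with the syntactic
    neural-LM score and returns the best hypothesis."""
    def combined(hyp):
        words, am, lm = hyp
        return am + (1 - lam) * lm + lam * syntactic_lm(words)
    return max(nbest, key=combined)

# Hypothetical usage with a stand-in LM that just penalizes length.
best = rescore_nbest(
    [(["A", "B"], -10.0, -4.0), (["A", "B", "C"], -9.5, -5.0)],
    syntactic_lm=lambda words: -0.5 * len(words))
print(best[0])
```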


International Conference on Acoustics, Speech, and Signal Processing | 2007

Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition

Hong-Kwang Jeff Kuo; Brian Kingsbury; Geoffrey Zweig

Finite-state decoding graphs integrate the decision trees, pronunciation model and language model for speech recognition into a unified representation of the search space. We explore discriminative training of the transition weights in the decoding graph in the context of large vocabulary speech recognition. In preliminary experiments on the RT-03 English Broadcast News evaluation set, the word error rate was reduced by about 5.7% relative, from 23.0% to 21.7%. We discuss how this method is particularly applicable to low-latency and low-resource applications such as real-time closed captioning of broadcast news and interactive speech-to-speech translation.
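
One simple way to picture discriminative training of the graph's transition weights is a perceptron-style update: arcs on the reference path are rewarded and arcs on the erroneously decoded path are penalized. This is a hedged sketch of that flavor of update, with hypothetical integer arc IDs; the paper's actual training criterion may differ.

```python
from collections import Counter

def perceptron_step(weights, ref_arcs, hyp_arcs, lr=0.1):
    """One update over a single utterance: reward arcs traversed by
    the reference path, penalize arcs traversed by the decoded path."""
    ref, hyp = Counter(ref_arcs), Counter(hyp_arcs)
    for arc in set(ref) | set(hyp):
        weights[arc] = weights.get(arc, 0.0) + lr * (ref[arc] - hyp[arc])
    return weights

w = perceptron_step({}, ref_arcs=[3, 7, 7, 9], hyp_arcs=[3, 8, 9])
print(w)   # arcs shared by both paths cancel; arc 7 rises, arc 8 falls
```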


International Conference on Acoustics, Speech, and Signal Processing | 2013

Morpheme-based feature-rich language models using Deep Neural Networks for LVCSR of Egyptian Arabic

Amr El-Desoky Mousa; Hong-Kwang Jeff Kuo; Lidia Mangu; Hagen Soltau

Egyptian Arabic (EA) is a colloquial variety of Arabic. It is a low-resource, morphologically rich language that poses challenges for Large Vocabulary Continuous Speech Recognition (LVCSR). Building LMs at the morpheme level is considered a better choice for achieving higher lexical coverage and better LM probabilities. Another approach is to utilize information from additional features such as morphological tags. On the other hand, LMs based on Neural Networks (NNs) with a single hidden layer have shown superiority over conventional n-gram LMs. Recently, Deep Neural Networks (DNNs) with multiple hidden layers have achieved better performance in various tasks. In this paper, we explore the use of feature-rich DNN-LMs, where the inputs to the network are a mixture of words and morphemes along with their features. Significant Word Error Rate (WER) reductions are achieved compared to traditional word-based LMs.
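
The feature-rich input can be pictured as follows: each context position contributes the embedding of its token (a word or a morpheme) concatenated with the embedding of its morphological tag, and the concatenation over the history feeds the DNN's first hidden layer. The vocabulary sizes and dimensions below are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
E_tok = rng.normal(size=(50_000, 128))   # word/morpheme embeddings
E_tag = rng.normal(size=(64, 16))        # morphological-tag embeddings

def input_vector(context):
    """context: list of (token_id, tag_id) pairs for the history.
    Returns the concatenated input to the DNN's first hidden layer."""
    return np.concatenate([np.concatenate([E_tok[t], E_tag[g]])
                           for t, g in context])

x = input_vector([(101, 3), (2057, 12), (9, 0)])   # 3-token history
print(x.shape)   # (3 * (128 + 16),) == (432,)
```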


International Conference on Acoustics, Speech, and Signal Processing | 2010

A comparative study on system combination schemes for LVCSR

Chengyuan Ma; Hong-Kwang Jeff Kuo; Hagen Soltau; Xiaodong Cui; Upendra V. Chaudhari; Lidia Mangu; Chin-Hui Lee

We present a comparative study on combination schemes for large vocabulary continuous speech recognition that incorporate long-span class posterior probability features into conventional short-time cepstral features. System combination can improve the overall speech recognition performance when multiple systems exhibit different error patterns and multiple knowledge sources encode complementary information. A variety of combination approaches are investigated in this paper, e.g., a single-stream system with feature concatenation, a multi-stream system with model combination, lattice rescoring, and ROVER. These techniques work at different levels of an LVCSR system and have different computational costs. We compared their performance and analyzed their advantages and disadvantages on large vocabulary English broadcast news transcription tasks. Experimental results showed that model combination with an independent tree consistently outperforms ROVER, feature concatenation, and lattice rescoring. In addition, the phoneme posterior probability features do provide complementary information to the short-time cepstral features.
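
Two of the compared schemes are easy to sketch: feature concatenation appends the long-span posterior features to the cepstral features frame by frame, while model combination takes a per-frame log-linear (weighted) sum of the streams' log-likelihoods. The shapes and the stream weight below are illustrative assumptions.

```python
import numpy as np

def concat_features(cepstra, posteriors):
    """Feature concatenation: one stream with wider frames, (T, d1 + d2)."""
    return np.hstack([cepstra, posteriors])

def multistream_loglik(loglik_a, loglik_b, w=0.6):
    """Model combination: log-linear fusion of two streams' per-frame
    log-likelihoods; w is an illustrative stream weight."""
    return w * loglik_a + (1 - w) * loglik_b

T = 100
single = concat_features(np.zeros((T, 40)), np.zeros((T, 45)))   # (100, 85)
fused = multistream_loglik(np.zeros((T, 3000)), np.zeros((T, 3000)))
```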


International Conference on Acoustics, Speech, and Signal Processing | 2010

Morphological and syntactic features for Arabic speech recognition

Hong-Kwang Jeff Kuo; Lidia Mangu; Ahmad Emami; Imed Zitouni

In this paper, we study the use of morphological and syntactic context features to improve speech recognition of a morphologically rich language like Arabic. We examine a variety of syntactic features, including part-of-speech tags, shallow parse tags, and exposed head words and their non-terminal labels both before and after the word to be predicted. Neural network LMs are used to model these features since they generalize better to unseen events by modeling words and other context features in continuous space. Using morphological and syntactic features, we can improve the word error rate (WER) significantly on various test sets, including EVAL08U, the unsequestered portion of the DARPA GALE Phase 3 evaluation test set.


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Minimum Bayes risk discriminative language models for Arabic speech recognition

Hong-Kwang Jeff Kuo; Ebru Arisoy; Lidia Mangu; George Saon

In this paper we explore discriminative language modeling (DLM) on highly optimized state-of-the-art large vocabulary Arabic broadcast speech recognition systems used for the Phase 5 DARPA GALE Evaluation. In particular, we study in detail a minimum Bayes risk (MBR) criterion for DLM. MBR training outperforms perceptron training. Interestingly, we found that our DLMs generalized to mismatched conditions, such as using a different acoustic model during testing. We also examine the interesting problem of unsupervised DLM training using a Bayes risk metric as a surrogate for word error rate (WER). In some experiments, we were able to obtain about half of the gain of the supervised DLM.
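
The MBR criterion can be summarized as minimizing the expected loss of the hypothesis distribution induced by the discriminative scores over an N-best list. The sketch below computes that expected risk for one utterance; the scores and per-hypothesis loss counts are illustrative.

```python
import numpy as np

def expected_risk(scores, losses):
    """scores: per-hypothesis log scores under the current DLM.
    losses: per-hypothesis loss vs. the reference (e.g. word errors).
    Returns E[loss] under the softmax posterior, the quantity that
    MBR training drives down."""
    posterior = np.exp(scores - np.logaddexp.reduce(scores))
    return float(np.dot(posterior, losses))

print(expected_risk(np.array([2.0, 1.0, 0.5]),
                    np.array([0.0, 2.0, 3.0])))
```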


International Conference on Acoustics, Speech, and Signal Processing | 2007

The IBM Mandarin Broadcast Speech Transcription System

Stephen M. Chu; Hong-Kwang Jeff Kuo; Yi Y. Liu; Yong Qin; Qin Shi; Geoffrey Zweig

This paper describes the technical and system building advances in the automatic transcription of Mandarin broadcast speech made at IBM in the first year of the DARPA GALE program. In particular, we discuss the application of minimum phone error (MPE) discriminative training and a new topic-adaptive language modeling technique. We present results on both the RT04 evaluation data and two larger community-defined test sets designed to cover both the broadcast news and the broadcast conversation domain. It is shown that with the described advances, the new transcription system achieves a 26.3% relative reduction in character error rate over our previous best-performing system, and is competitive with published numbers on these datasets.
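
Topic-adaptive language modeling is often realized as a mixture of topic-conditioned LMs whose weights are inferred from the current show or document. The abstract does not spell out the paper's exact technique, so treat the following as a generic, hypothetical illustration of that idea.

```python
def adapted_prob(word, history, topic_lms, topic_weights):
    """Mixture LM: P(w | h) = sum_t P(t | doc) * P_t(w | h)."""
    return sum(w * lm(word, history)
               for lm, w in zip(topic_lms, topic_weights))

# Hypothetical usage with two stand-in topic-conditioned LMs.
p = adapted_prob("market", ("the", "stock"),
                 topic_lms=[lambda w, h: 0.01, lambda w, h: 0.002],
                 topic_weights=[0.7, 0.3])
```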
