Hong-Kwang Kuo
IBM
Publications
Featured research published by Hong-Kwang Kuo.
international conference on acoustics, speech, and signal processing | 2010
Brian Kingsbury; Hagen Soltau; George Saon; Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Suman V. Ravuri; Nelson Morgan; Adam Janin
This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.
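The language-model advances mentioned here (unpruned n-gram models combined with neural network language models) are typically applied by interpolating the two models at rescoring time. As a minimal sketch of that idea, not a description of IBM's actual implementation, the interpolation weight and the toy probability tables below are invented:

```python
import math

def interpolated_logprob(word, history, ngram_lm, nn_lm, lam=0.5):
    """Linearly interpolate an n-gram LM with a neural-net LM.

    ngram_lm and nn_lm are callables returning P(word | history);
    lam is the weight on the n-gram model, normally tuned on
    held-out data."""
    p = lam * ngram_lm(word, history) + (1.0 - lam) * nn_lm(word, history)
    return math.log(p)

# Toy stand-ins for the two models (probabilities are made up).
ngram_lm = lambda w, h: {"cat": 0.20, "dog": 0.10}.get(w, 0.01)
nn_lm    = lambda w, h: {"cat": 0.30, "dog": 0.05}.get(w, 0.02)

print(interpolated_logprob("cat", ("the",), ngram_lm, nn_lm))  # ~= -1.386
```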
international conference on acoustics, speech, and signal processing | 2013
Lidia Mangu; Hagen Soltau; Hong-Kwang Kuo; Brian Kingsbury; George Saon
The paper describes a state-of-the-art spoken term detection system in which significant improvements are obtained by diversifying the ASR engines used for indexing and combining the search results. First, we describe the design factors that, when varied, produce complementary STD systems and show that the performance of the combined system is 3 times better than the best individual component. Next, we describe different strategies for system combination and show that significant improvements can be achieved by normalizing the combined scores. We propose a classifier-based system combination strategy which outperforms a highly optimized baseline. The system described in this paper had the highest accuracy in the 2012 DARPA RATS evaluation.
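The score-normalization step is the part most easily pictured in code. The sketch below does a generic sum-to-one normalization per term within each system, then a CombSUM-style merge of time-overlapping hits; the tolerance, scores, and term names are invented, and this is not the paper's exact combination strategy:

```python
from collections import defaultdict

def normalize_per_term(hits):
    """Sum-to-one normalize detection scores within each term, so
    scores from different ASR systems become comparable.
    hits: list of (term, start, end, score)."""
    totals = defaultdict(float)
    for term, _, _, score in hits:
        totals[term] += score
    return [(t, s, e, sc / totals[t]) for t, s, e, sc in hits]

def combine(system_hits, tol=0.5):
    """Merge hit lists from several systems: hits for the same term
    whose start times differ by less than `tol` seconds count as the
    same detection, and their normalized scores are summed."""
    merged = []  # entries are (term, start, end, score)
    for hits in system_hits:
        for term, s, e, sc in normalize_per_term(hits):
            for i, (t2, s2, e2, sc2) in enumerate(merged):
                if t2 == term and abs(s - s2) < tol:
                    merged[i] = (t2, s2, e2, sc2 + sc)
                    break
            else:
                merged.append((term, s, e, sc))
    return sorted(merged, key=lambda h: -h[3])

sys_a = [("sesame", 1.20, 1.60, 0.9), ("sesame", 7.00, 7.40, 0.3)]
sys_b = [("sesame", 1.25, 1.65, 0.7)]
print(combine([sys_a, sys_b]))  # overlapping hit gets the summed score
```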
Theory of Computing Systems / Mathematical Systems Theory | 2006
Yuqing Gao; Bowen Zhou; Ruhi Sarikaya; Mohamed Afify; Hong-Kwang Kuo; Weizhong Zhu; Yonggang Deng; Charles Prosser; Wei Zhang; Laurent Besacier
In this paper, we describe IBM MASTOR, a speech-to-speech translation system that can translate spontaneous free-form speech in real time on both laptops and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, a lack of training data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Another challenge is designing algorithms and building models in a scalable manner so that they perform well even on hand-held computers with limited memory and CPU. We describe our approaches, experience, and success in building working free-form S2S systems that can handle two language pairs (including a low-resource language).
international conference on acoustics, speech, and signal processing | 2006
Yuqing Gao; Bowen Zhou; Liang Gu; Ruhi Sarikaya; Hong-Kwang Kuo; Antti-Veikko I. Rosti; Mohamed Afify; Weizhong Zhu
In this paper, we describe the IBM MASTOR systems, which handle spontaneous free-form speech-to-speech translation on both laptops and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, a lack of data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Importantly, the code and models must fit within the limited memory and computational resources of hand-held devices. We describe our approaches, experience, and success in building working free-form S2S systems that can handle two language pairs (including a low-resource language).
international conference on acoustics, speech, and signal processing | 2008
Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Yi Liu; Yong Qin; Qin Shi; Shi Lei Zhang; Hagai Aronowitz
This paper describes the system and algorithmic developments in the automatic transcription of Mandarin broadcast speech made at IBM in the second year of the DARPA GALE program. Technical advances over our previous system include improved acoustic models using embedded tone modeling and a new topic-adaptive language model (LM) rescoring technique based on dynamically generated LMs. We present results on three community-defined test sets designed to cover both the broadcast news and broadcast conversation domains. Our new baseline system attains a 15.4% relative reduction in character error rate compared with our previous GALE evaluation system, and the two described techniques yield a further 13.6% improvement over this baseline.
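To make the topic-adaptive rescoring idea concrete, here is a hedged sketch: a topic LM is selected from the first-pass hypothesis and interpolated with a background LM before rescoring the N-best list. The interpolation weight, toy LMs, and function names are all illustrative stand-ins, not the paper's dynamically generated LMs:

```python
def rescore_nbest(nbest, background_lm, topic_lms, lam=0.3):
    """Toy topic-adaptive rescoring: pick the topic LM that best fits
    the first-pass hypothesis, then rescore every hypothesis with an
    interpolation of background and topic LM log-scores.

    nbest: list of (hypothesis_words, acoustic_score).
    background_lm / topic_lms[t]: callables giving a log-score for a
    word sequence."""
    first_pass = nbest[0][0]
    topic = max(topic_lms, key=lambda t: topic_lms[t](first_pass))
    def lm_score(words):
        return (1 - lam) * background_lm(words) + lam * topic_lms[topic](words)
    return max(nbest, key=lambda h: h[1] + lm_score(h[0]))

# Invented toy LMs and N-best list for illustration.
bg = lambda ws: -1.0 * len(ws)
topics = {"sports": lambda ws: -0.5 * len(ws),
          "finance": lambda ws: -2.0 * len(ws)}
nbest = [(["match", "ends", "two", "one"], -10.0),
         (["match", "ends", "too", "won"], -9.5)]
print(rescore_nbest(nbest, bg, topics))
```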
international conference on acoustics, speech, and signal processing | 2014
Lidia Mangu; Brian Kingsbury; Hagen Soltau; Hong-Kwang Kuo; Michael Picheny
In this paper, we present a fast, vocabulary-independent algorithm for spoken term detection (STD) that demonstrates that a word-based index is sufficient to achieve good performance for both in-vocabulary (IV) and out-of-vocabulary (OOV) terms. Previous approaches have required that a separate index be built at the sub-word level and then expanded to allow for matching OOV terms; such a process, while accurate, is expensive in both time and memory. In the proposed architecture, a word-level confusion network (CN) based index is used for both IV and OOV search, implemented in a flexible WFST framework. Comparisons on three Babel languages (Tagalog, Pashto, and Turkish) show that CN-based indexing outperforms the lattice approach while being orders of magnitude faster and having a much smaller footprint.
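As a rough illustration of word-level CN indexing, the paper's WFST machinery is replaced below by a plain dictionary plus string similarity, so everything here is an assumption-laden toy rather than the actual system:

```python
from difflib import SequenceMatcher

def build_index(confusion_networks):
    """Build a word-level inverted index from confusion networks.
    Each CN is a list of slots; each slot is a list of
    (word, posterior, start, end) hypotheses."""
    index = {}
    for utt_id, cn in confusion_networks.items():
        for slot in cn:
            for word, post, start, end in slot:
                index.setdefault(word, []).append((utt_id, start, end, post))
    return index

def search(index, term, fuzzy_threshold=0.8):
    """IV terms hit the index directly; OOV terms fall back to fuzzy
    matching against indexed words (a crude stand-in for the paper's
    WFST-based matching)."""
    if term in index:
        return index[term]
    hits = []
    for word, postings in index.items():
        if SequenceMatcher(None, term, word).ratio() >= fuzzy_threshold:
            hits.extend(postings)
    return hits

cns = {"utt1": [[("tagalog", 0.9, 0.0, 0.6), ("dialog", 0.1, 0.0, 0.6)]]}
idx = build_index(cns)
print(search(idx, "tagalok"))  # OOV spelling still matches "tagalog"
```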
international conference on acoustics, speech, and signal processing | 2010
Stephen M. Chu; Daniel Povey; Hong-Kwang Kuo; Lidia Mangu; Shilei Zhang; Qin Shi; Yong Qin
This paper gives an up-to-date description of the IBM Mandarin broadcast transcription system developed under the DARPA GALE program. Technical advances over our previous system include a novel acoustic modeling approach using subspace Gaussian mixture models, a speaking rate adaptation method using frame rate normalization, and an effective recipe for lattice combination. We present results on three consortium-defined test sets. It is shown that with these advances, the new system attains a 9% relative reduction in character error rate compared to our previous GALE evaluation system. The reported 9.1% error rate on the phase three evaluation set represents the state of the art in Mandarin broadcast speech transcription.
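The frame rate normalization idea can be pictured as scaling the analysis frame shift inversely with the estimated speaking rate, so fast speech yields more frames per phone. The reference rate, clamping range, and function name below are invented for this sketch and are not the paper's recipe:

```python
def normalized_frame_shift(phones_per_sec, base_shift_ms=10.0,
                           reference_rate=12.0):
    """Illustrative frame-rate normalization: speech faster than the
    reference rate gets a smaller frame shift (more frames per
    phone); slower speech gets a larger one. The reference rate and
    clamping bounds are made up for this sketch."""
    shift = base_shift_ms * reference_rate / phones_per_sec
    return min(max(shift, 6.0), 15.0)  # keep the shift in a sane range

print(normalized_frame_shift(15.0))  # fast speech  -> 8.0 ms shift
print(normalized_frame_shift(9.0))   # slow speech  -> ~13.3 ms shift
```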
ieee automatic speech recognition and understanding workshop | 2013
Lidia Mangu; Hagen Soltau; Hong-Kwang Kuo; George Saon
The paper describes a state-of-the-art keyword search (KWS) system in which significant improvements are obtained by using Convolutional Neural Network acoustic models, a two-step speech segmentation approach and a simplified ASR architecture optimized for KWS. The system described in this paper had the best performance in the 2013 DARPA RATS evaluation for both Levantine and Farsi.
Computers in the Human Interaction Loop | 2014
Hagen Soltau; George Saon; Lidia Mangu; Hong-Kwang Kuo; Brian Kingsbury; Stephen M. Chu; Fadi Biadsy
In this chapter we describe techniques for building a high-performance speech recognizer for Arabic and related languages. The key insights are derived from our experience in the DARPA GALE program, a 5-year program devoted to advancing the state of the art in Arabic speech recognition and translation. The most important lesson is that general speech recognition techniques also work very well on Arabic. One example is vowelization: short vowels are often not transcribed in Arabic, Hebrew, and other Semitic languages. Semi-automatic vowelization procedures, specifically designed for the language, can improve the pronunciation lexicon; however, we can also simply ignore the problem at the lexicon level and compensate for the resulting pronunciation mismatch with discriminative training of the acoustic models. While we focus on Arabic in this chapter, we speculate that the vast majority of the issues we address here will carry over to other Semitic languages; we have tested the approaches discussed in this chapter only on Arabic, as it is the Semitic language with the most resources. Our experimental results demonstrate that such language-independent techniques can solve language-specific issues, at least to a large extent. Another example is morphology, where we show that a combination of language-independent techniques (an efficient decoder to deal with large vocabularies, and exponential language models) and language-specific techniques (a neural network language model that uses morphological and syntactic features) leads to good results. For these reasons, we describe both language-independent and language-specific techniques, along with a full-fledged LVCSR system for Arabic that makes the best use of all of them. We also demonstrate how this system can be used to bootstrap systems for related Arabic dialects and Semitic languages.
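On the morphology point, a toy decomposition conveys the flavor of morph-based language modeling: words are split into affixes and stems so the LM vocabulary stays manageable. The affix lists and the greedy splitting rule below are invented for illustration; real systems use a proper morphological analyzer rather than string matching:

```python
# Invented affix lists over Buckwalter-style transliterations; a real
# Arabic segmenter uses morphological analysis, not plain matching.
PREFIXES = ["wa", "al", "bi", "li"]
SUFFIXES = ["ha", "hm", "at"]

def segment(word):
    """Greedily split off one prefix and one suffix, marking the cut
    points with '+' so the LM can rejoin morphs after decoding."""
    morphs = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphs.append(p + "+")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "+" + s
            word = word[: -len(s)]
            break
    morphs.append(word)
    if suffix:
        morphs.append(suffix)
    return morphs

print(segment("walkitab"))  # -> ['wa+', 'lkitab'] (toy output)
```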
international conference on acoustics, speech, and signal processing | 2014
Hong-Kwang Kuo; Ellen Kislal; Lidia Mangu; Hagen Soltau; Tomas Beran
In this paper we describe progress we have made in detecting out-of-vocabulary words (OOVs) for a speech-to-speech translation system for the purpose of playing back audio to the user for clarification and correction. Our OOV detector follows a strategy of first identifying a rough location of the OOV and then merging adjacent decoded words to cover the true OOV word. We show the advantage of our OOV detection strategy and report on improvements using a real-time implementation of a new Convolutional Neural Network acoustic model. We discuss why commonly used metrics for OOV detection do not meet our needs and explore an overlap metric as well as a Jaccard metric for evaluating our ability to detect the OOVs and localize them accurately in time. We have found different metrics to be useful at different stages of development.
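The temporal Jaccard metric mentioned here has a simple closed form: intersection over union of the hypothesized and reference time spans. A minimal sketch, with invented intervals:

```python
def jaccard(hyp, ref):
    """Temporal Jaccard between a hypothesized OOV region and the
    reference interval: intersection over union of the two time
    spans. 1.0 means perfect localization; 0.0 means no overlap.
    Intervals are (start_sec, end_sec)."""
    inter = max(0.0, min(hyp[1], ref[1]) - max(hyp[0], ref[0]))
    union = (hyp[1] - hyp[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0

print(jaccard((1.0, 2.0), (1.2, 2.4)))  # 0.8 / 1.4 ~= 0.57
```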