Roger Hsiao
Carnegie Mellon University
Publications
Featured research published by Roger Hsiao.
international conference on acoustics, speech, and signal processing | 2006
Brian Mak; Tsz-Chung Lai; Roger Hsiao
We would like to revisit a simple fast adaptation technique called reference speaker weighting (RSW). RSW is similar to eigenvoice (EV) adaptation, and simply requires the model of a new speaker to lie in the span of a set of reference speaker vectors. In the original RSW, the reference speakers are computed through a hierarchical speaker clustering (HSC) algorithm using information such as gender and speaking rate. We show in this paper that RSW adaptation may be improved if the training speakers that have the highest likelihoods of the adaptation data are selected as the reference speakers; we call them the maximum-likelihood (ML) reference speakers. When RSW adaptation was evaluated on WSJ0 using 5 s of adaptation speech, the word error rate reduction was boosted from 2.54% to 9.15% by using 10 ML reference speakers instead of reference speakers determined from HSC. Moreover, when compared with EV, MAP, MLLR, and eKEV on fast adaptation, we were surprised that the algorithmically simplest RSW technique actually gives the best performance.
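RSW's core idea (the new speaker's supervector is constrained to the span of a few reference supervectors, with references picked by adaptation-data likelihood) can be sketched in a few lines of NumPy. This is a hypothetical sketch, not the paper's implementation: the function name is invented, per-speaker likelihood scores are assumed precomputed, and a least-squares fit stands in for the EM-based weight estimation.

```python
import numpy as np

def rsw_adapt(speaker_supervectors, scores, target_stats, k=10):
    """Sketch of reference speaker weighting (RSW) with ML-selected references.

    speaker_supervectors: (S, D) stacked mean supervectors of training speakers.
    scores: (S,) likelihood of the adaptation data under each training speaker
            (higher = better); the top-k speakers become the references.
    target_stats: (D,) a supervector-shaped statistic estimated from the
                  adaptation data, used here to fit the weights.
    """
    ref_idx = np.argsort(scores)[-k:]        # maximum-likelihood reference speakers
    R = speaker_supervectors[ref_idx]        # (k, D) reference supervectors
    # New speaker constrained to the span of the references: adapted = w @ R.
    w, *_ = np.linalg.lstsq(R.T, target_stats, rcond=None)
    return w @ R, ref_idx
```

In the paper the weights are estimated to maximize the likelihood of the adaptation data under the HMM; the least-squares step above only conveys the span constraint.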
international conference on acoustics, speech, and signal processing | 2006
Man-Wai Mak; Roger Hsiao; Brian Mak
One key factor that hinders the widespread deployment of speaker verification technologies is the requirement of long enrollment utterances to guarantee a low error rate during verification. To gain user acceptance of speaker verification technologies, adaptation algorithms that can enroll speakers with short utterances are highly essential. To this end, this paper applies kernel eigenspace-based MLLR (KEMLLR) to speaker enrollment and compares its performance against three state-of-the-art model adaptation techniques: maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and reference speaker weighting (RSW). The techniques were compared under the NIST 2001 SRE framework, with enrollment data varying from 2 to 32 seconds. Experimental results show that KEMLLR is most effective for short enrollment utterances (between 2 and 4 seconds) and that MAP performs better when long utterances (32 seconds) are available.
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Brian Mak; Roger Hsiao; Simon Ka-Lung Ho; James Tin-Yau Kwok
Recently, we proposed an improvement to the conventional eigenvoice (EV) speaker adaptation using kernel methods. In our novel kernel eigenvoice (KEV) speaker adaptation, speaker supervectors are mapped to a kernel-induced high-dimensional feature space, where eigenvoices are computed using kernel principal component analysis. A new speaker model is then constructed as a linear combination of the leading eigenvoices in the kernel-induced feature space. KEV adaptation was shown to outperform EV, MAP, and MLLR adaptation in a TIDIGITS task with less than 10 s of adaptation speech. Nonetheless, due to many kernel evaluations, both adaptation and subsequent recognition in KEV adaptation are considerably slower than conventional EV adaptation. In this paper, we solve the efficiency problem and eliminate all kernel evaluations involving adaptation or testing observations by finding an approximate pre-image of the implicit adapted model found by KEV adaptation in the feature space; we call our new method embedded kernel eigenvoice (eKEV) adaptation. eKEV adaptation is faster than KEV adaptation, and subsequent recognition runs as fast as normal HMM decoding. eKEV adaptation makes use of a multidimensional scaling technique so that the resulting adapted model lies in the span of a subset of carefully chosen training speakers. It is related to the reference speaker weighting (RSW) adaptation method that is based on speaker clustering. Our experimental results on Wall Street Journal show that eKEV adaptation continues to outperform EV, MAP, MLLR, and the original RSW method. However, by adopting the way we choose the subset of reference speakers for eKEV adaptation, we may also improve RSW adaptation so that it performs as well as our eKEV adaptation.
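The linear eigenvoice baseline that KEV and eKEV generalize is easy to sketch: PCA over training-speaker supervectors gives the eigenvoices, and the adapted model is the mean plus a weighted sum of the leading ones. Below is a minimal, hypothetical NumPy sketch (the kernel variants replace this PCA with kernel PCA in a feature space, and eKEV additionally finds an approximate pre-image; the direct projection here stands in for the likelihood-based weight estimation used in practice).

```python
import numpy as np

def eigenvoice_adapt(supervectors, target, m=3):
    """Linear eigenvoice (EV) adaptation sketch.

    supervectors: (S, D) training-speaker supervectors.
    target: (D,) supervector-shaped statistic from the adaptation data.
    m: number of leading eigenvoices to keep.
    """
    mu = supervectors.mean(axis=0)
    X = supervectors - mu
    # PCA via SVD: rows of Vt are the (orthonormal) eigenvoices.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    E = Vt[:m]                     # (m, D) leading eigenvoices
    w = E @ (target - mu)          # projection weights
    return mu + w @ E              # adapted supervector in the eigenvoice span
```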
ieee automatic speech recognition and understanding workshop | 2009
Hassan Al-Haj; Roger Hsiao; Ian R. Lane; Alan W. Black; Alex Waibel
Short vowels in Arabic are normally omitted in written text, which leads to ambiguity in pronunciation. This is even more pronounced for dialectal Arabic, where a single word can be pronounced quite differently based on the speaker's nationality, level of education, social class, and religion. In this paper, we focus on pronunciation modeling for Iraqi-Arabic speech. We introduce multiple pronunciations into the Iraqi speech recognition lexicon and compare the performance when weights computed via forced alignment are assigned to the different pronunciations of a word. Incorporating multiple pronunciations improved recognition accuracy compared to a single-pronunciation baseline, and introducing pronunciation weights further improved performance. Using these techniques, an absolute reduction in word error rate of 2.4% was obtained compared to the baseline system.
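A minimal sketch of how pronunciation weights might be derived from forced-alignment counts (an illustrative assumption; the paper does not spell out its exact estimator, and a real lexicon would also smooth unseen variants):

```python
from collections import Counter, defaultdict

def pronunciation_weights(alignments):
    """Estimate per-word pronunciation weights from forced-alignment choices.

    alignments: iterable of (word, pron_id) pairs, one per aligned token,
    recording which pronunciation variant the aligner chose for each
    occurrence of the word.
    Returns {word: {pron_id: weight}} with weights summing to 1 per word.
    """
    counts = defaultdict(Counter)
    for word, pron in alignments:
        counts[word][pron] += 1
    return {w: {p: c / sum(cs.values()) for p, c in cs.items()}
            for w, cs in counts.items()}
```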
north american chapter of the association for computational linguistics | 2009
Nguyen Bach; Roger Hsiao; Matthias Eck; Paisarn Charoenpornsawat; Stephan Vogel; Tanja Schultz; Ian R. Lane; Alex Waibel; Alan W. Black
In building practical two-way speech-to-speech translation systems, the end user will always wish to use the system in an environment different from the original training data. As with all speech systems, it is important to allow the system to adapt to the actual usage situations. This paper investigates how a speech-to-speech translation system can adapt day-to-day from collected data on day one to improve performance on day two. The platform is the CMU Iraqi-English portable two-way speech-to-speech system as developed under the DARPA TransTac program. We show how machine translation, speech recognition, and overall system performance can be improved on day two after adapting from day one in both a supervised and unsupervised way.
international conference on acoustics, speech, and signal processing | 2009
Roger Hsiao; Yik-Cheung Tam; Tanja Schultz
We propose a new optimization algorithm, called the Generalized Baum-Welch (GBW) algorithm, for discriminative training of hidden Markov models (HMMs). GBW is based on Lagrange relaxation of a transformed optimization problem. We show that both the Baum-Welch (BW) algorithm for ML estimation of HMM parameters and the popular Extended Baum-Welch (EBW) algorithm for discriminative training are special cases of GBW. We compare the performance of GBW and EBW on Farsi large vocabulary continuous speech recognition (LVCSR).
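For reference, the standard EBW mean update that GBW generalizes takes the well-known textbook form (this is not taken from the paper itself; here $\theta$ denotes first-order statistics of the observations, $\gamma$ occupancy counts, and $D_j$ a per-Gaussian smoothing constant):

```latex
\hat{\mu}_j = \frac{\theta_j^{\mathrm{num}} - \theta_j^{\mathrm{den}} + D_j \mu_j}
                   {\gamma_j^{\mathrm{num}} - \gamma_j^{\mathrm{den}} + D_j}
```

Dropping the denominator-lattice terms and $D_j$ recovers the plain Baum-Welch (ML) mean update, one sense in which BW and EBW sit in a common family of updates.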
conference of the international speech communication association | 2016
Roger Hsiao; Ralf Meermeier; Tim Ng; Zhongqiang Huang; Maxwell Jordan; Enoch Kan; Tanel Alumäe; Jan Silovsky; William Hartmann; Francis Keith; Omer Lang; Man-Hung Siu; Owen Kimball
To capitalize on the rapid development of Speech-to-Text (STT) technologies and the proliferation of open source machine learning toolkits, BBN has developed Sage, a new speech processing platform that integrates technologies from multiple sources, each of which has particular strengths. In this paper, we describe the design of Sage, which allows the easy interchange of STT components from different sources. We also describe our approach for fast prototyping with new machine learning toolkits, and a framework for sharing STT components across different applications. Finally, we report Sage’s state-of-the-art performance on different STT tasks.
conference of the international speech communication association | 2016
William Hartmann; Le Zhang; Kerri Barnes; Roger Hsiao; Stavros Tsakalidis; Richard M. Schwartz
System combination is a common approach to improving results for both speech transcription and keyword spotting—especially in the context of low-resourced languages where building multiple complementary models requires less computational effort. Using state-of-the-art CNN and DNN acoustic models, we analyze the performance, cost, and trade-offs of four system combination approaches: feature combination, joint decoding, hitlist combination, and a novel lattice combination method. Previous work has focused solely on accuracy comparisons. We show that joint decoding, lattice combination, and hitlist combination perform comparably, significantly better than feature combination. However, for practical systems, earlier combination reduces computational cost and storage requirements. Results are reported on four languages from the IARPA Babel dataset.
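Of the four approaches, hitlist combination is the simplest to sketch: hits from different systems for the same keyword occurrence are merged by a weighted sum of their normalized scores. This is a hypothetical sketch, not the paper's method; the keying scheme, weighting, and merge rule are assumptions, and a real system clusters overlapping time spans rather than matching exact keys.

```python
def combine_hitlists(hitlists, weights=None):
    """Merge keyword-spotting hitlists by weighted score combination.

    hitlists: list of dicts mapping (keyword, time) -> normalized score.
    weights: per-system weights; defaults to a uniform average.
    A hit found by only some systems keeps the weighted contributions
    of those systems (effectively down-weighting it).
    """
    if weights is None:
        weights = [1.0 / len(hitlists)] * len(hitlists)
    merged = {}
    for w, hits in zip(weights, hitlists):
        for key, score in hits.items():
            merged[key] = merged.get(key, 0.0) + w * score
    return merged
```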
conference of the international speech communication association | 2016
William Hartmann; Tim Ng; Roger Hsiao; Stavros Tsakalidis; Richard M. Schwartz
Low-resourced languages suffer from limited training data and resources. Data augmentation is a common approach to increasing the amount of training data. Additional data is synthesized by manipulating the original data with a variety of methods. Unlike most previous work that focuses on a single technique, we combine multiple, complementary augmentation approaches. The first stage adds noise and perturbs the speed of additional copies of the original audio. The data is further augmented in a second stage, where a novel fMLLR-based augmentation is applied to bottleneck features to further improve performance. A reduction in word error rate is demonstrated on four languages from the IARPA Babel program. We present an analysis exploring why these techniques are beneficial.
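The first-stage augmentation described above (speed perturbation plus additive noise) can be sketched with NumPy alone. This is an illustrative approximation, not the paper's pipeline: linear resampling stands in for a proper resampler, and the speed factors and SNR are assumed values.

```python
import numpy as np

def augment(audio, noise, speeds=(0.9, 1.0, 1.1), snr_db=15.0):
    """Produce speed-perturbed copies of `audio` with noise added at a
    fixed SNR. Speed change is approximated by linear resampling; the
    noise is tiled/truncated to the copy's length and scaled to snr_db.
    """
    copies = []
    for s in speeds:
        n = int(round(len(audio) / s))
        idx = np.linspace(0, len(audio) - 1, n)
        warped = np.interp(idx, np.arange(len(audio)), audio)
        seg = np.resize(noise, warped.shape)
        # Scale noise so that signal power / noise power = 10^(snr_db/10).
        gain = np.sqrt((warped ** 2).mean() /
                       ((seg ** 2).mean() * 10 ** (snr_db / 10)))
        copies.append(warped + gain * seg)
    return copies
```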
international conference on acoustics, speech, and signal processing | 2005
Roger Hsiao; Brian Mak
Recently, we have been investigating the application of kernel methods to improve the performance of eigenvoice-based adaptation methods by exploiting possible nonlinearity in their original working space. We proposed the kernel eigenvoice adaptation (KEV), and the kernel eigenspace-based MLLR adaptation (KEMLLR). In KEMLLR, speaker-dependent MLLR transformation matrices are mapped to a kernel-induced high dimensional feature space, and kernel principal component analysis (KPCA) is used to derive a set of eigenmatrices in the feature space. A new speaker is then represented by a linear combination of the leading eigenmatrices. In this paper, we further improve KEMLLR by the use of multiple regression classes and the quasi-Newton BFGS optimization algorithm.