Xin Lei
University of Washington
Publications
Featured research published by Xin Lei.
International Conference on Acoustics, Speech, and Signal Processing | 2006
Andreas Stolcke; Frantisek Grezl; Mei-Yuh Hwang; Xin Lei; Nelson Morgan; Dimitra Vergyri
Recent results with phone-posterior acoustic features estimated by multilayer perceptrons (MLPs) have shown that such features can effectively improve the accuracy of state-of-the-art large vocabulary speech recognition systems. MLP features are trained discriminatively to perform phone classification and are therefore, like acoustic models, tuned to a particular language and application domain. In this paper we investigate how portable such features are across domains and languages. We show that even without retraining, English-trained MLP features can provide a significant boost to recognition accuracy in new domains within the same language, as well as in entirely different languages such as Mandarin and Arabic. We also show the effectiveness of feature-level adaptation in porting MLP features to new domains.
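The tandem idea in this abstract — run an MLP phone classifier per frame, then use its (log-)posteriors as extra acoustic features — can be sketched in a few lines. This is an illustrative stand-in, not the paper's system: the single linear-plus-softmax layer, the toy dimensions, and the function names are all assumptions for the sketch.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phone_posteriors(frame, weights, biases):
    """Toy stand-in for the MLP's output layer: one linear score per
    phone class for this acoustic frame, then a softmax over classes."""
    scores = [sum(w * x for w, x in zip(ws, frame)) + b
              for ws, b in zip(weights, biases)]
    return softmax(scores)

def tandem_features(posteriors, eps=1e-8):
    """Log-posteriors: the usual transform applied before appending the
    MLP outputs to the standard cepstral feature vector."""
    return [math.log(p + eps) for p in posteriors]
```

Porting the features to a new language then amounts to feeding the new language's frames through the same trained network; only `tandem_features` outputs change, not the downstream feature pipeline.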
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Andreas Stolcke; Barry Y. Chen; H. Franco; Venkata Ramana Rao Gadde; Martin Graciarena; Mei-Yuh Hwang; Katrin Kirchhoff; Arindam Mandal; Nelson Morgan; Xin Lei; Tim Ng; Mari Ostendorf; M. Kemal Sönmez; Anand Venkataraman; Dimitra Vergyri; Wen Wang; Jing Zheng; Qifeng Zhu
We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
International Conference on Acoustics, Speech, and Signal Processing | 2005
Tim Ng; Mari Ostendorf; Mei-Yuh Hwang; Manhung Siu; Ivan Bulyko; Xin Lei
Lack of data is a problem in training language models for conversational speech recognition, particularly for languages other than English. Experiments in English have successfully used Web-based text collection, targeted for a conversational style, to augment small sets of transcribed speech; we look at extending these techniques to Mandarin. In addition, we investigate different techniques for topic adaptation. Experiments in recognizing Mandarin telephone conversations show that the use of filtered Web data leads to a 28% reduction in perplexity and 7% reduction in character error rate, with most of the gain due to the general filtered Web data.
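One common way to realize the "filtered Web data" step described above is to score candidate Web sentences with an in-domain language model and keep only those that look conversational. The sketch below uses an add-alpha-smoothed bigram model purely for illustration; the actual models, smoothing, and filtering criterion in the paper are not specified here and the threshold is an assumption.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram and unigram counts with sentence markers."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def perplexity(sent, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram perplexity; a simple stand-in for the
    in-domain language models used to judge Web text."""
    toks = ["<s>"] + sent + ["</s>"]
    logp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bi[(prev, cur)] + alpha) / (uni[prev] + alpha * vocab_size)
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))

def filter_web_sentences(web, uni, bi, vocab_size, threshold):
    """Keep only Web sentences that score well under the in-domain model."""
    return [s for s in web if perplexity(s, uni, bi, vocab_size) < threshold]
```

The surviving sentences are then added to the n-gram training pool, which is where the reported perplexity and character-error-rate reductions come from.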
International Conference on Acoustics, Speech, and Signal Processing | 2007
Jing Zheng; Özgür Çetin; Mei-Yuh Hwang; Xin Lei; Andreas Stolcke; Nelson Morgan
Recent developments in large vocabulary continuous speech recognition (LVCSR) have shown the effectiveness of discriminative training approaches, employing the following three representative techniques: discriminative Gaussian training using the minimum phone error (MPE) criterion; discriminatively trained features estimated by multilayer perceptrons (MLPs); and discriminative feature transforms such as feature-level MPE (fMPE). Although MLP features, MPE models, and fMPE transforms have each been shown to improve recognition accuracy, no previous work has applied all three in a single LVCSR system. This paper uses a state-of-the-art Mandarin recognition system as a platform to study the interaction of all three techniques. Experiments in the broadcast news and broadcast conversation domains show that the contribution of each technique is nonredundant, and that the full combination yields the best performance and has good domain generalization.
International Conference on Acoustics, Speech, and Signal Processing | 2007
Xin Lei; Mari Ostendorf
Standard HMM-based Mandarin speech recognition systems do not exploit the suprasegmental nature of tones, but explicit tone models can be incorporated with lattice rescoring. This work extends previous approaches to explicit tone modeling from the syllable level to the word level, incorporating a hierarchical backoff. Word-dependent tone models are trained to explicitly model the tone coarticulation within the word. For less frequent words, tonal-syllable-dependent tone models or plain tone models are used as backoff. More generally, context-dependent tone models can be used as backoff. The word-dependent tone modeling framework can be viewed as a generalization of the traditional context-independent and context-dependent tone modeling. Under this framework, different types of tone modeling strategies are compared experimentally on a Mandarin broadcast news speech recognition task, showing significant gains from the word-level tone modeling approach.
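The hierarchical backoff described in this abstract can be summarized as a model-selection rule: prefer a word-dependent tone model when the word is frequent enough, otherwise back off per syllable to a tonal-syllable model, and finally to a plain tone model. The sketch below is illustrative only; the model stores, the syllable notation (tone digit as the final character, e.g. "ni3"), and the frequency threshold are assumptions, not the paper's values.

```python
def tone_models_for_word(word, syllables, word_counts,
                         word_models, syll_models, min_count=50):
    """Hierarchical backoff for tone model selection:
    1. word-dependent tone model if the word is frequent enough;
    2. else a tonal-syllable-dependent model for each syllable;
    3. else the plain (context-independent) model for that syllable's tone."""
    if word_counts.get(word, 0) >= min_count and word in word_models:
        return [word_models[word]]
    chosen = []
    for syl in syllables:
        if syl in syll_models:
            chosen.append(syll_models[syl])
        else:
            # Back off to the plain tone model keyed by the tone digit.
            chosen.append("tone_" + syl[-1])
    return chosen
```

During lattice rescoring, the selected models score the pitch contour of each word hypothesis, so frequent words get tone-coarticulation modeling while rare words still get a usable score.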
International Symposium on Chinese Spoken Language Processing | 2004
Mei-Yuh Hwang; Xin Lei; Tim Ng; Ivan Bulyko; Mari Ostendorf; Andreas Stolcke; Wen Wang; Jing Zheng; Venkata Ramana Rao Gadde; Martin Graciarena; Man-Hung Siu; Yan Huang
Over the past decade, there has been good progress on English conversational telephone speech (CTS) recognition, built on the Switchboard and Fisher corpora. In this paper, we present our efforts on extending language-independent technologies into Mandarin CTS, as well as addressing language-dependent issues such as tone. We show the impact of each of the following factors: (a) simplified Mandarin phone set; (b) pitch features; (c) auto-retrieved Web texts for augmenting n-gram training; (d) speaker adaptive training; (e) maximum mutual information estimation; (f) decision-tree-based parameter sharing; (g) cross-word co-articulation modeling; and (h) combining MFCC and PLP decoding outputs using confusion networks. We have reduced the Chinese character error rate (CER) of the BBN-2003 development test set from 53.8% to 46.8% after (a)+(b)+(c)+(f)+(g) are combined. Further reduction in CER is anticipated after integrating all improvements.
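Item (h) above, combining MFCC and PLP decoding outputs via confusion networks, reduces to weighted posterior voting once the two networks are aligned. The sketch below assumes the bins are already aligned and represents each bin as a word-to-posterior dict; the alignment step, weights, and data layout are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def combine_confusion_networks(cn_a, cn_b, w_a=0.5, w_b=0.5):
    """Combine two aligned confusion networks (lists of {word: posterior}
    bins, e.g. from MFCC and PLP decodings) by weighted posterior voting,
    then pick the highest-scoring word in each bin."""
    combined = []
    for bin_a, bin_b in zip(cn_a, cn_b):
        votes = defaultdict(float)
        for word, p in bin_a.items():
            votes[word] += w_a * p
        for word, p in bin_b.items():
            votes[word] += w_b * p
        combined.append(max(votes, key=votes.get))
    return combined
```

A word that neither system ranks first can still win a bin if both assign it moderate posterior mass, which is why such system combination tends to reduce error rates over either decoding alone.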
International Conference on Acoustics, Speech, and Signal Processing | 2005
Xin Lei; Gang Ji; Tim Ng; Jeff A. Bilmes; Mari Ostendorf
A toneme in Mandarin Chinese is a tonal phone which consists of a base phone (main vowel) and a tone. To capture both, most recognition systems use two feature streams: the standard MFCC for the base phones, and pitch features for the tones. In this paper we propose the use of dynamic Bayesian networks for modeling the two streams in toneme recognition. We used the Graphical Model Toolkit to build and compare three different models: a standard HMM with concatenated features, and synchronous and asynchronous multi-stream systems. Stream-level model parameter tying is also exploited. The toneme recognition results show significant improvements by using the multi-stream models.
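In the synchronous two-stream case described above, the per-frame observation score is a weighted combination of each stream's log-likelihood under the current state. The diagonal-Gaussian sketch below illustrates that scoring step only; the state layout, single-Gaussian densities, and stream weights are assumptions for the sketch, not the GMTK models from the paper.

```python
import math

def gauss_logpdf(x, mean, var):
    """Diagonal-covariance Gaussian log density for one stream's features."""
    lp = 0.0
    for xi, m, v in zip(x, mean, var):
        lp += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return lp

def two_stream_loglik(mfcc, pitch, state, w_mfcc=1.0, w_pitch=1.0):
    """Synchronous two-stream observation score: the MFCC stream models
    the base phone, the pitch stream models the tone, and their
    log-likelihoods are combined with per-stream weights."""
    lp_m = gauss_logpdf(mfcc, state["mfcc_mean"], state["mfcc_var"])
    lp_p = gauss_logpdf(pitch, state["pitch_mean"], state["pitch_var"])
    return w_mfcc * lp_m + w_pitch * lp_p
```

The asynchronous variant relaxes the assumption that both streams change state at the same frame, which is what the graphical-model formulation makes easy to express.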
Conference of the International Speech Communication Association | 2006
Xin Lei; Manhung Siu; Mei-Yuh Hwang; Mari Ostendorf; Tan Lee
Conference of the International Speech Communication Association | 2006
Mei-Yuh Hwang; Xin Lei; Wen Wang; Takahiro Shinozaki
Conference of the International Speech Communication Association | 2005
Xin Lei; Mei-Yuh Hwang; Mari Ostendorf