Qifeng Zhu
International Computer Science Institute
Publications
Featured research published by Qifeng Zhu.
IEEE Signal Processing Magazine | 2005
Nelson Morgan; Qifeng Zhu; Andreas Stolcke; K. Sonmez; S. Sivadas; T. Shinozaki; Mari Ostendorf; P. Jain; Hynek Hermansky; Daniel P. W. Ellis; G. Doddington; Barry Y. Chen; O. Cetin; H. Bourlard; M. Athineos
Despite many successes, there are still significant limitations to speech recognition performance, particularly for conversational speech and for speech with significant acoustic degradations from noise or reverberation. For this reason, the authors propose methods that incorporate different (and larger) analysis windows, which are described in this article. Note in passing that they and many others have already taken advantage of processing techniques that incorporate information over long time ranges, for instance for normalization, whether by cepstral mean subtraction (B. Atal, 1974) or by relative spectral analysis (RASTA; H. Hermansky and N. Morgan, 1994). They also propose features based on speech sound class posterior probabilities, which have good properties for both classification and stream combination.
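Cepstral mean subtraction, cited above as an early long-time-range normalization, simply removes the per-coefficient mean computed over the whole utterance, cancelling stationary channel effects that appear as an additive offset in the cepstral domain. A minimal pure-Python sketch (the function name and list-of-frames layout are our own, not from the article):

```python
def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over the utterance.

    `cepstra` is a list of frames, each a list of cepstral
    coefficients.  A stationary convolutional (channel) distortion
    is additive in the cepstral domain, so removing the long-term
    mean cancels it.
    """
    n_frames = len(cepstra)
    n_coeffs = len(cepstra[0])
    means = [sum(frame[i] for frame in cepstra) / n_frames
             for i in range(n_coeffs)]
    return [[frame[i] - means[i] for i in range(n_coeffs)]
            for frame in cepstra]
```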
IEEE Transactions on Audio, Speech, and Language Processing | 2006
Andreas Stolcke; Barry Y. Chen; H. Franco; Venkata Ramana Rao Gadde; Martin Graciarena; Mei-Yuh Hwang; Katrin Kirchhoff; Arindam Mandal; Nelson Morgan; Xin Lei; Tim Ng; Mari Ostendorf; M. Kemal Sönmez; Anand Venkataraman; Dimitra Vergyri; Wen Wang; Jing Zheng; Qifeng Zhu
We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
international conference on acoustics, speech, and signal processing | 2004
Nelson Morgan; Barry Y. Chen; Qifeng Zhu; Andreas Stolcke
Temporal patterns (TRAP) and tandem MLP/HMM approaches incorporate feature streams computed from longer time intervals than the conventional short-time analysis. These methods have been used for challenging small- and medium-vocabulary recognition tasks, such as Aurora and SPINE. Conversational telephone speech recognition is a difficult large-vocabulary task, with current systems giving incorrect output for 20-40% of the words, depending on the system complexity and test set. Training and test times for this problem also tend to be relatively long, making rapid development quite difficult. In this paper we report experiments with a reduced conversational speech task that led to the adoption of a number of engineering decisions for the design of an acoustic front end. We then describe our results with this front end on a full-vocabulary conversational telephone speech task. In both cases the front end yielded significant improvements over the baseline.
international conference on machine learning | 2004
Qifeng Zhu; Barry Y. Chen; Nelson Morgan; Andreas Stolcke
Multi-Layer Perceptrons (MLPs) can be used in automatic speech recognition in many ways. A particular application of this tool over the last few years has been the Tandem approach, as described in [7] and other more recent publications. Here we discuss the characteristics of the MLP-based features used in the Tandem approach, and conclude with a report on their application to conversational speech recognition. The paper shows that MLP transformations yield variables with regular distributions, which can be further modified by taking the logarithm to make them easier to model with a Gaussian-HMM. Two or more vectors of these features can easily be combined without increasing the feature dimension. We also report recognition results showing that MLP features can significantly improve recognition performance on the NIST 2001 Hub-5 evaluation set with models trained on the Switchboard Corpus, even for complex systems incorporating MMIE training and other enhancements.
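The two properties highlighted in the abstract can be sketched in a few lines, assuming per-frame phone-posterior vectors from each MLP. The function name and the simple frame-level averaging are illustrative assumptions; the actual Tandem systems also decorrelate the log posteriors (e.g. with a KLT) before handing them to the Gaussian-HMM back end, a step omitted here:

```python
import math

def tandem_features(posterior_streams, floor=1e-10):
    """Combine per-frame phone-posterior vectors from several MLPs
    and take the log, in the spirit of the Tandem approach.

    Streams are averaged frame by frame, so combining two or more
    posterior vectors leaves the feature dimension unchanged; the
    logarithm then spreads the skewed posterior distribution into
    something closer to Gaussian for the HMM back end.
    """
    n_streams = len(posterior_streams)
    features = []
    for frames in zip(*posterior_streams):
        avg = [sum(vals) / n_streams for vals in zip(*frames)]
        features.append([math.log(max(p, floor)) for p in avg])
    return features
```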
international conference on acoustics, speech, and signal processing | 2005
Barry Y. Chen; Qifeng Zhu; Nelson Morgan
We have been reducing word error rates (WER) on conversational telephone speech (CTS) tasks by capturing long-term (~500 ms) temporal information using multilayer perceptrons (MLPs). In this paper we experiment with an MLP architecture called the tonotopic MLP (TMLP), incorporating two hidden layers. The first of these is tonotopically organized: for each critical band, there is a disjoint set of hidden units that takes that band's long-term energy trajectory as input. Thus, each of these subsets of hidden units learns to discriminate single-band energy trajectory patterns. The remaining layers are fully connected to their inputs. When used in combination with an intermediate-term (~100 ms) MLP system to augment standard PLP features, the TMLP reduces the WER on the 2001 NIST Hub-5 CTS evaluation set (Eval2001) by 8.87% relative. We show some practical advantages over our previous methods. We also report results from a series of experiments to determine the best ranges of hidden layer sizes and total parameters with respect to the number of training patterns for this task and architecture.
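The tonotopic first layer described above can be sketched in pure Python: each band's trajectory is seen only by that band's own hidden units, and the resulting activations are concatenated before the fully connected layers that follow. The function name, weight layout, and sigmoid choice are our assumptions, not details from the paper:

```python
import math

def tmlp_first_layer(band_trajectories, band_weights, band_biases):
    """Tonotopic layer of a TMLP-style network (sketch).

    `band_trajectories[b]` is the long-term energy trajectory of
    critical band b; `band_weights[b]` / `band_biases[b]` hold the
    weights of the disjoint hidden units dedicated to that band.
    Returns the concatenated hidden activations of all bands.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    hidden = []
    for traj, weights, biases in zip(band_trajectories,
                                     band_weights, band_biases):
        for w_row, b in zip(weights, biases):
            hidden.append(
                sigmoid(sum(w * x for w, x in zip(w_row, traj)) + b))
    return hidden
```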
international conference on machine learning | 2004
Barry Y. Chen; Qifeng Zhu; Nelson Morgan
The automatic transcription of conversational speech, both from telephone and in-person interactions, is still an extremely challenging task. Our efforts to recognize speech from meetings are likely to benefit from any advances we achieve with conversational telephone speech, a topic of considerable focus for our research. Towards both of these ends, we have developed, in collaboration with our colleagues at SRI and IDIAP, techniques to incorporate long-term (~500 ms) temporal information using multi-layered perceptrons (MLPs). Much of this work builds on prior achievements at the former lab of Hynek Hermansky at the Oregon Graduate Institute (OGI), where the TempoRAl Pattern (TRAP) approach was developed. The contribution here is to present experiments showing: 1) that simply widening acoustic context by using more frames of full-band speech energies as input to the MLP is suboptimal compared to a more constrained two-stage approach that first focuses on long-term temporal patterns in each critical band separately and then combines them; 2) that the best two-stage approach studied utilizes hidden activation values of MLPs trained on the log critical band energies (LCBEs) of 51 consecutive frames; and 3) that combining the best two-stage approach with conventional short-term features significantly reduces word error rates on the 2001 NIST Hub-5 conversational telephone speech (CTS) evaluation set with models trained on the Switchboard Corpus.
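As an illustration of the input representation in point 2, the following sketch extracts a 51-frame log critical band energy (LCBE) trajectory per band around a centre frame; at a typical 10 ms frame hop, 51 frames covers roughly half a second of context. The function name and data layout are our assumptions:

```python
def lcbe_trajectories(log_energies, center, half_span=25):
    """Per-band 51-frame LCBE trajectories around `center` (sketch).

    `log_energies[t][b]` is the log critical band energy of band b
    at frame t.  Each returned trajectory is fed to that band's own
    MLP in the first stage of the two-stage approach.
    """
    window = log_energies[center - half_span:center + half_span + 1]
    n_bands = len(log_energies[0])
    return [[frame[b] for frame in window] for b in range(n_bands)]
```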
conference of the international speech communication association | 2004
Qifeng Zhu; Barry Y. Chen; Nelson Morgan; Andreas Stolcke
conference of the international speech communication association | 2005
Qifeng Zhu; Andreas Stolcke; Barry Y. Chen; Nelson Morgan
conference of the international speech communication association | 2004
Barry Y. Chen; Qifeng Zhu; Nelson Morgan
Archive | 2004
Samy Bengio; Mathew Magimai Doss; Qifeng Zhu; Bertrand Mesot; Nelson Morgan