
Publications


Featured research published by Chaojun Liu.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Investigations on speaker adaptation of LSTM RNN models for speech recognition

Chaojun Liu; Yongqiang Wang; Kshitiz Kumar; Yifan Gong

Recently, Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) acoustic models have demonstrated superior performance over deep neural network (DNN) models in speech recognition and many other tasks. Although a lot of work has been reported on DNN model adaptation, very little has been done on LSTM model adaptation. In this paper we present our extensive studies of speaker adaptation of LSTM-RNN models for speech recognition. We investigated different adaptation methods combined with KL-divergence-based regularization, where and which network components to adapt, supervised versus unsupervised adaptation, and asymptotic analysis. We made a few distinct and important observations. In a large vocabulary speech recognition task, by adapting only 2.5% of the LSTM model parameters using 50 utterances per speaker, we obtained a 12.6% relative word-error-rate reduction (WERR) on the dev set and a 9.1% WERR on the evaluation set over a strong LSTM baseline model.
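The KL-divergence-based regularization mentioned above is commonly realized by interpolating the hard label with the speaker-independent (SI) model's posteriors (the abstract does not spell out its exact form, so this is a hedged numpy sketch; all function names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kld_regularized_targets(one_hot, si_posteriors, rho):
    """Interpolate the hard label with the SI model's posteriors;
    training toward these soft targets penalizes drift away from the
    SI model, i.e. KL-divergence regularization."""
    return (1.0 - rho) * one_hot + rho * si_posteriors

def kld_regularized_ce(logits, one_hot, si_posteriors, rho):
    """Cross-entropy of the adapted model against the interpolated
    targets (equivalent, up to a constant, to CE plus a KL penalty)."""
    p = softmax(logits)
    t = kld_regularized_targets(one_hot, si_posteriors, rho)
    return float(-(t * np.log(p + 1e-12)).sum())
```

With `rho = 0` this reduces to ordinary cross-entropy training; larger `rho` keeps the adapted model closer to the SI model, which matters when only ~50 utterances per speaker are available.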


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Deep neural support vector machines for speech recognition

Shi-Xiong Zhang; Chaojun Liu; Kaisheng Yao; Yifan Gong

A new type of deep neural network (DNN) is presented in this paper. Traditional DNNs use multinomial logistic regression (softmax activation) at the top layer for classification. The new DNN instead uses a support vector machine (SVM) at the top layer. Two training algorithms are proposed, at the frame and sequence levels, to learn the parameters of the SVM and DNN under a maximum-margin criterion. In frame-level training, the new model is shown to be related to the multiclass SVM with DNN features; in sequence-level training, it is related to the structured SVM with DNN features and HMM state-transition features. Its decoding process is similar to that of the DNN-HMM hybrid system, but with frame-level posterior probabilities replaced by scores from the SVM. We term the new model the deep neural support vector machine (DNSVM). We have verified its effectiveness on the TIMIT task for continuous speech recognition.
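The frame-level view, a multiclass SVM on fixed DNN-derived features, can be sketched with a Crammer-Singer style hinge loss; this is a minimal numpy illustration under that assumption, not the paper's exact training recipe:

```python
import numpy as np

def multiclass_hinge_loss(W, feats, labels, margin=1.0):
    """Crammer-Singer multiclass SVM loss on DNN-derived features:
    penalize any competitor class whose score comes within `margin`
    of the true class's score."""
    labels = np.asarray(labels)
    scores = feats @ W                                  # (N, num_classes)
    true = scores[np.arange(len(labels)), labels]       # true-class scores
    viol = scores + margin - true[:, None]              # margin violations
    viol[np.arange(len(labels)), labels] = 0.0          # ignore true class
    return np.maximum(viol, 0.0).max(axis=1).mean()
```

Minimizing this loss over both `W` and the DNN parameters (by backpropagating through `feats`) is the joint max-margin training the abstract describes.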


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Predicting speech recognition confidence using deep learning with word identity and score features

Po-Sen Huang; Kshitiz Kumar; Chaojun Liu; Yifan Gong; Li Deng

Confidence classifiers for automatic speech recognition (ASR) provide a quantitative representation for the reliability of ASR decoding. In this paper, we improve the ASR confidence measure performance for an utterance using two distinct approaches: (1) to define and incorporate additional predictors in the confidence classifier including those based on the word identity and on the aggregated words, and (2) to train the confidence classifier built on deep learning architectures including the deep neural network (DNN) and the kernel deep convex network (K-DCN). Our experiments show that adding the new predictors to our multi-layer perceptron (MLP)-based baseline classifier provides 38.6% relative reduction in the correct-reject rate as our measure of the classifier performance. Further, replacing the MLP with the DNN and K-DCN provides an additional 14.5% and 47.5% in the relative performance gain, respectively.
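Approach (1), adding word-identity and aggregated-word predictors to the classifier input, can be sketched as a simple feature-assembly step; the feature set below is illustrative only, not the paper's exact predictors:

```python
import numpy as np

def confidence_features(word_id, acoustic_score, lm_score, vocab_size,
                        utt_word_scores):
    """Assemble per-word predictors for a confidence classifier: a
    one-hot word-identity feature, decoder scores for the word, and
    utterance-level aggregates over all word scores (hypothetical
    feature set for illustration)."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    aggregates = [np.mean(utt_word_scores), np.min(utt_word_scores)]
    return np.concatenate([one_hot, [acoustic_score, lm_score], aggregates])
```

The resulting vector would then be fed to the MLP, DNN, or K-DCN classifier in place of the score-only baseline features.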


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Non-negative intermediate-layer DNN adaptation for a 10-KB speaker adaptation profile

Kshitiz Kumar; Chaojun Liu; Yifan Gong

Previously we demonstrated that speaker adaptation of acoustic models (AMs) can provide significant improvements in the accuracy of large-scale speech recognition systems. In this work we discuss numerous challenges in scaling speaker adaptation to millions of speakers, where the size of the speaker-dependent (SD) parameters is a critical concern. We formulate an intermediate-layer adaptation framework, upon which we build a non-negative adaptation that yields a very sparse set of non-negative SD parameters. We further improve this work with (a) non-negative adaptation with a small positive threshold, and (b) setting small positive weights in an already trained non-negative model to zero. We also discuss effective methods to store the non-negative SD parameters. Our methods reduce the SD parameters from 86 KB for our previous best adaptation approach to 8.8 KB, about a 90% relative reduction in the size of the SD parameters, while still retaining a 10+% relative word-error-rate (WERR) gain over the baseline speaker-independent (SI) model.
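The combination of non-negativity, small-positive thresholding, and sparse storage described in (a) and (b) can be sketched as a projection plus an index-value encoding; names and the exact threshold are illustrative assumptions:

```python
import numpy as np

def nonneg_project(delta, eps=0.0):
    """Clamp speaker-dependent (SD) offsets to be non-negative, then
    zero any value below a small positive threshold eps to increase
    sparsity (steps (a)/(b) in the abstract, sketched)."""
    delta = np.maximum(delta, 0.0)
    delta[delta < eps] = 0.0
    return delta

def to_sparse(delta):
    """Store only (index, value) pairs of the surviving SD parameters;
    sparsity is what shrinks the per-speaker profile toward ~10 KB."""
    idx = np.flatnonzero(delta)
    return idx, delta.flat[idx]
```

Applying `nonneg_project` after each gradient step (projected gradient descent) is one standard way to enforce such a constraint during adaptation.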


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Recurrent support vector machines for speech recognition

Shi-Xiong Zhang; Rui Zhao; Chaojun Liu; Jinyu Li; Yifan Gong

Recurrent Neural Networks (RNNs) using the Long Short-Term Memory (LSTM) architecture have demonstrated state-of-the-art performance on speech recognition. Most deep RNNs use the softmax activation function in the last layer for classification. This paper illustrates small but consistent advantages of replacing the softmax layer in an RNN with Support Vector Machines (SVMs). The parameters of the RNN and SVM are jointly learned using a sequence-level max-margin criterion instead of cross-entropy. The resulting model is termed the Recurrent SVM. Conventional SVMs need a predefined feature space and have no internal state to deal with arbitrarily long-term dependencies in sequences. The proposed recurrent SVM uses LSTMs to learn the feature space and to capture temporal dependencies, while using the SVM (in the last layer) for sequence classification. The model is evaluated on the Windows Phone task for large vocabulary continuous speech recognition.
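The division of labor described above, an LSTM that learns the feature space and a linear max-margin layer on top, can be sketched as follows; the single-layer setup, shapes, and names are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_features(xs, Wx, Wh, b, hidden):
    """Run a single-layer LSTM over an input sequence and return the
    final hidden state, used here as the learned feature space."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = Wx @ x + Wh @ h + b                  # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                        # cell-state update
        h = o * np.tanh(c)                       # hidden-state output
    return h

def svm_scores(h, W_svm):
    """Max-margin classification layer: class scores linear in the LSTM
    features; training would use a sequence-level hinge loss."""
    return W_svm @ h
```

Joint training means backpropagating the hinge loss through `svm_scores` into the LSTM weights, rather than treating the features as fixed.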


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Estimating confidence scores on ASR results using recurrent neural networks

Kaustubh Prakash Kalgaonkar; Chaojun Liu; Yifan Gong; Kaisheng Yao

In this paper we present a confidence estimation system using recurrent neural networks (RNNs) and compare it to a traditional multilayer perceptron (MLP) based system. The ability of RNNs to capture sequence information and improve decisions using processed history was the main motivation to explore RNNs for confidence estimation. We also explore two subtle variations of the confidence estimator: one that uses an objective computed over the entire sequence for training, and another that uses dynamic programming to decode and estimate confidence on all the words of the sequence jointly. In our experiments, we observed that for a constant false positive (FP) rate of 3% we can secure a relative reduction of 10% in the false negative (FN) rate when we replace the MLP in the confidence estimator with an RNN. We also observed that the relative gains achieved by an RNN-based confidence estimator are directly proportional to the number of words in the utterance.
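The second variant, jointly decoding a correct/incorrect label for every word via dynamic programming, can be sketched as a standard Viterbi pass over per-word scores; the two-state setup and names are illustrative assumptions:

```python
import numpy as np

def joint_confidence_viterbi(emit_logp, trans_logp):
    """Jointly label every word in an utterance (e.g. state 0 = correct,
    state 1 = incorrect) by Viterbi decoding over per-word confidence
    scores, instead of deciding each word independently.
    emit_logp: (T, S) per-word state log-scores; trans_logp: (S, S)."""
    T, S = emit_logp.shape
    dp = emit_logp[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = dp[:, None] + trans_logp          # (prev_state, cur_state)
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + emit_logp[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):                # backtrace best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The transition scores let evidence from neighboring words influence each word's label, which is the benefit of the joint decision.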


New Era for Robust Speech Recognition, Exploiting Deep Learning | 2017

Challenges in and Solutions to Deep Learning Network Acoustic Modeling in Speech Recognition Products at Microsoft

Yifan Gong; Yan Huang; Kshitiz Kumar; Jinyu Li; Chaojun Liu; Guoli Ye; Shi-Xiong Zhang; Yong Zhao; Rui Zhao

Deep learning (DL) network acoustic modeling has been widely deployed in real-world speech recognition products and services that benefit millions of users. In addition to the general modeling research that academics work on, there are special constraints and challenges that industry has to face, e.g., run-time constraints on system deployment and robustness to variations such as the acoustic environment, accents, and the lack of manual transcription. For large-scale automatic speech recognition applications, this chapter briefly describes selected developments and investigations at Microsoft to make deep learning networks more effective in a production environment, including reducing run-time cost with singular-value-decomposition-based training; improving the accuracy of small-size deep neural networks (DNNs) with teacher–student training; the use of a small number of parameters for speaker adaptation of acoustic models; improving robustness to the acoustic environment with variable-component DNN modeling; improving robustness to accent/dialect with model adaptation and accent-dependent modeling; introducing time and frequency invariance with time–frequency long short-term memory recurrent neural networks; exploring the generalization capability to unseen data with maximum-margin sequence training; the use of unsupervised data to improve speech recognition accuracy; and increasing language capability by reusing speech-training material across languages. The outcome has enabled the deployment of DL acoustic models across Microsoft server and client product lines, including Windows 10 desktops/laptops/phones, Xbox, and Skype speech-to-speech translation.
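The first run-time technique named above, singular-value-decomposition-based restructuring of a weight matrix, can be sketched in a few lines; the function name and the way the rank is chosen are illustrative:

```python
import numpy as np

def svd_compress(W, rank):
    """Factor a weight matrix W (m x n) into two thin matrices whose
    product approximates W, cutting parameters from m*n down to
    rank*(m + n); in SVD-based training the factors are then fine-tuned."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]     # (m, rank): left factor scaled by singular values
    B = Vt[:rank, :]               # (rank, n): right factor
    return A, B
```

Replacing the original layer with the two smaller layers `B` then `A` (a linear bottleneck) is what reduces run-time cost, with retraining recovering most of the lost accuracy.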


Conference of the International Speech Communication Association (INTERSPEECH) | 2013

Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence re-calibration.

Yan Huang; Dong Yu; Yifan Gong; Chaojun Liu


Conference of the International Speech Communication Association (INTERSPEECH) | 2014

A Comparative Analytic Study on the Gaussian Mixture and Context Dependent Deep Neural Network Hidden Markov Models

Yan Huang; Dong Yu; Chaojun Liu; Yifan Gong


Conference of the International Speech Communication Association (INTERSPEECH) | 2014

Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation.

Yan Huang; Dong Yu; Chaojun Liu; Yifan Gong

Collaboration


Top co-authors of Chaojun Liu:

Kshitiz Kumar (Carnegie Mellon University)
Yan Huang (University of North Texas)
Yong Zhao (University of Electronic Science and Technology of China)