Yanmin Qian
Shanghai Jiao Tong University
Publication
Featured research published by Yanmin Qian.
International Conference on Acoustics, Speech, and Signal Processing | 2012
Daniel Povey; Mirko Hannemann; Gilles Boulianne; Lukas Burget; Arnab Ghoshal; Milos Janda; Martin Karafiát; Stefan Kombrink; Petr Motlicek; Yanmin Qian; Korbinian Riedhammer; Karel Vesely; Ngoc Thang Vu
We describe a lattice generation method that is exact, i.e. it satisfies all the natural properties we would want from a lattice of alternative transcriptions of an utterance. This method does not introduce substantial overhead above one-best decoding. Our method is most directly applicable when using WFST decoders where the WFST is “fully expanded”, i.e. where the arcs correspond to HMM transitions. It outputs lattices that include HMM-state-level alignments as well as word labels. The general idea is to create a state-level lattice during decoding, and to do a special form of determinization that retains only the best-scoring path for each word sequence. This special determinization algorithm is a solution to the following problem: Given a WFST A, compute a WFST B that, for each input-symbol-sequence of A, contains just the lowest-cost path through A.
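The final sentence states the key criterion. A minimal sketch of that criterion, expressed outside the WFST formalism (the word sequences, costs, and alignments below are toy placeholders, not Kaldi data structures):

```python
# Toy sketch (not Kaldi's implementation) of the special determinization goal:
# among all lattice paths sharing the same word sequence, keep only the
# lowest-cost one, while retaining its state-level alignment.
def keep_best_per_word_sequence(paths):
    """paths: iterable of (word_sequence_tuple, cost, alignment).
    Returns one (cost, alignment) per distinct word sequence: the cheapest."""
    best = {}
    for words, cost, alignment in paths:
        if words not in best or cost < best[words][0]:
            best[words] = (cost, alignment)
    return best

# Hypothetical toy lattice: two alignments of the same words plus a rival hypothesis.
paths = [
    (("the", "cat"), 12.3, "align-A"),
    (("the", "cat"), 11.7, "align-B"),   # cheaper alignment of the same word sequence
    (("the", "cap"), 13.0, "align-C"),
]
print(keep_best_per_word_sequence(paths))
```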
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Yanmin Qian; Mengxiao Bi; Tian Tan; Kai Yu
Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Recently, very deep convolutional neural networks (CNNs) have been successfully applied to computer vision and speech recognition tasks. Based on our previous work on very deep CNNs, in this paper this architecture is further developed to improve recognition accuracy for noise-robust speech recognition. In the proposed very deep CNN architecture, we study the best configuration for the sizes of filters, pooling, and input feature maps: the filter and pooling sizes are reduced and the dimensions of the input features are extended to allow for adding more convolutional layers. Then the appropriate pooling, padding, and input feature map selection strategies are investigated and applied to the very deep CNN to make it more robust for speech recognition. In addition, an in-depth analysis of the architecture reveals key characteristics, such as compact model scale, fast convergence speed, and noise robustness. The proposed new model is evaluated on two tasks: the Aurora4 task with multiple additive noise types and channel mismatch, and the AMI meeting transcription task with significant reverberation. Experiments on both tasks show that the proposed very deep CNNs can significantly reduce the word error rate (WER) for noise-robust speech recognition. The best architecture obtains a 10.0% relative reduction over the traditional CNN on AMI, competitive with the long short-term memory recurrent neural network (LSTM-RNN) acoustic model. On Aurora4, even without feature enhancement, model adaptation, and sequence training, it achieves a WER of 8.81%, a 17.0% relative improvement over the LSTM-RNN. To our knowledge, this is the best published result on Aurora4.
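A hedged PyTorch sketch in the spirit of the architecture described: small 3x3 filters, sparing use of pooling, and several stacked convolutional layers over an extended spectral input. The layer counts, channel sizes, and input dimensions below are illustrative, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

# Illustrative "very deep" convolutional front end: small filters, little pooling,
# so that more convolutional layers fit before the spatial resolution collapses.
very_deep_cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                  # pooling used sparingly
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

# Toy input: 8 examples, 1 input feature map, 40 frequency bins, 17 context frames
# (illustrative; the paper extends the input context to accommodate more layers).
x = torch.randn(8, 1, 40, 17)
print(very_deep_cnn(x).shape)
```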
International Conference on Acoustics, Speech, and Signal Processing | 2014
Tianxing He; Yuchen Fan; Yanmin Qian; Tian Tan; Kai Yu
Although deep neural networks (DNNs) have achieved significant accuracy improvements in speech recognition, it is computationally expensive to deploy large-scale DNNs in decoding due to the huge number of parameters. Weight truncation and decomposition methods have been proposed to speed up decoding by exploiting the sparseness of DNNs. This paper summarizes different approaches to restructuring DNNs and proposes a new node-pruning approach to reshape DNNs for fast decoding. In this approach, hidden nodes of a fully trained DNN are pruned according to a certain importance function, and the reshaped DNN is retuned using back-propagation. The approach requires no code modification and can directly save computational costs during decoding. Furthermore, it is complementary to weight decomposition methods. Experiments on a Switchboard task show that, by using the proposed node-pruning approach, DNN complexity can be reduced to 37.9%. The complexity can be further reduced to 12.3% without accuracy loss when node pruning is combined with weight decomposition.
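A minimal numpy sketch of node pruning for a single hidden layer, assuming one possible importance function (the norm of each node's outgoing weights); the importance functions actually studied in the paper and the subsequent back-propagation retuning are not reproduced here.

```python
import numpy as np

def prune_hidden_layer(W_in, b, W_out, keep_ratio=0.5):
    """Prune the hidden nodes of one layer.
    W_in: (hidden, in) incoming weights; b: (hidden,) biases; W_out: (out, hidden).
    Keeps the top keep_ratio fraction of nodes by an assumed importance score."""
    importance = np.linalg.norm(W_out, axis=0)      # one score per hidden node
    n_keep = int(len(importance) * keep_ratio)
    keep = np.argsort(importance)[-n_keep:]         # indices of surviving nodes
    return W_in[keep, :], b[keep], W_out[:, keep]

# Illustrative layer sizes; after pruning, the reshaped network would be retuned.
W_in, b, W_out = np.random.randn(1024, 440), np.zeros(1024), np.random.randn(2000, 1024)
W_in2, b2, W_out2 = prune_hidden_layer(W_in, b, W_out, keep_ratio=0.5)
print(W_in2.shape, W_out2.shape)
```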
International Conference on Acoustics, Speech, and Signal Processing | 2015
Tian Tan; Yanmin Qian; Maofan Yin; Yimeng Zhuang; Kai Yu
Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, a large performance degradation remains when the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous work mainly focuses on adapting the parameters of a single DNN by regularized or selective fine-tuning, applying linear transforms to features or hidden-layer outputs, or introducing a vector representation of non-speech variability into the input. These methods all require a relatively large number of parameters to be estimated during adaptation. In contrast, this paper employs the cluster adaptive training (CAT) framework for DNN adaptation. Here, multiple DNNs are constructed to form the bases of a canonical parametric space. During adaptation, an interpolation vector, specific to a particular acoustic condition, is used to combine the multiple DNN bases into a single adapted DNN. The DNN bases can also be constructed at the layer level for more flexibility. The CAT-DNN approach was evaluated on an English Switchboard task in unsupervised adaptation mode. It achieved significant WER reductions over the unadapted DNN-HMM, 6% to 8.5% relative, with only 10 parameters.
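The adaptation step reduces to a weighted combination of basis parameters. A small numpy sketch under assumed dimensions (the basis count, layer sizes, and interpolation values below are illustrative only):

```python
import numpy as np

def adapt_layer(bases, interpolation):
    """Combine DNN basis weight matrices with a condition-specific vector.
    bases: (K, out, in) stacked basis matrices; interpolation: (K,).
    Returns the adapted weight matrix sum_k lambda_k * W_k."""
    return np.tensordot(interpolation, bases, axes=1)

K, out_dim, in_dim = 4, 2048, 440
bases = np.random.randn(K, out_dim, in_dim)
lam = np.array([0.4, 0.3, 0.2, 0.1])          # e.g. estimated per acoustic condition
W_adapted = adapt_layer(bases, lam)
print(W_adapted.shape)                         # (2048, 440)
```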
International Conference on Acoustics, Speech, and Signal Processing | 2016
Xie Chen; Xunying Liu; Yanmin Qian; Mark J. F. Gales; Philip C. Woodland
In recent years, recurrent neural network language models (RNNLMs) have become increasingly popular for a range of applications including speech recognition. However, the training of RNNLMs is computationally expensive, which limits the quantity of data, and size of network, that can be used. In order to fully exploit the power of RNNLMs, efficient training implementations are required. This paper introduces an open-source toolkit, the CUED-RNNLM toolkit, which supports efficient GPU-based training of RNNLMs. RNNLM training with a large number of word-level output targets is supported, in contrast to existing tools which use class-based output targets. Support for N-best and lattice-based rescoring of both HTK and Kaldi format lattices is included. An example of building and evaluating RNNLMs with this toolkit is presented for a Kaldi-based speech recognition system using the AMI corpus. All necessary resources including the source code, documentation and recipe are available online.
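As a generic illustration of the N-best rescoring that such a toolkit supports (this is not the CUED-RNNLM interface; the language-model functions below are hypothetical stand-ins):

```python
def rescore_nbest(nbest, rnnlm_logprob, ngram_logprob, lm_scale=12.0, interp=0.5):
    """nbest: list of (hypothesis_words, acoustic_logprob).
    Reranks hypotheses by acoustic score plus an interpolated RNNLM/n-gram LM score."""
    def combined(hyp):
        words, am = hyp
        lm = interp * rnnlm_logprob(words) + (1.0 - interp) * ngram_logprob(words)
        return am + lm_scale * lm
    return sorted(nbest, key=combined, reverse=True)

# Toy usage with dummy language models (uniform per-word scores), just to show the call.
toy = [(["hello", "world"], -120.0), (["hello", "word"], -118.5)]
uniform = lambda words: -2.0 * len(words)
print(rescore_nbest(toy, uniform, uniform)[0][0])
```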
Speech Communication | 2015
Yuan Liu; Yanmin Qian; Nanxin Chen; Tianfan Fu; Ya Zhang; Kai Yu
Recently, deep learning has been successfully used in speech recognition; however, it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches to extracting and using features from deep learning models for text-dependent speaker verification. In contrast to traditional short-term spectral features, such as MFCC or PLP, in this paper outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Networks (DNNs), speaker-discriminant DNNs, and multi-task joint-learned DNNs. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for deep-feature-based identity vectors. These approaches were evaluated on the RSR2015 corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, whether they are applied directly in the GMM-UBM system or used as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that of the GMM-UBM baseline.
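A hedged PyTorch sketch of the deep-feature idea: frames pass through a trained DNN, an intermediate hidden-layer activation is taken as the frame-level deep feature, and those features can then feed a GMM-UBM or i-vector back end. The network below is a generic stand-in, not one of the paper's four model types, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in DNN: the task-specific output layer would be trained but is discarded
# at feature-extraction time; a hidden-layer output serves as the deep feature.
dnn = nn.Sequential(
    nn.Linear(60, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),   # take the output of this hidden layer
    nn.Linear(1024, 3000),                 # output layer, unused for features
)

def deep_features(frames):
    """frames: (num_frames, 60) acoustic features -> (num_frames, 1024) deep features."""
    with torch.no_grad():
        return dnn[:4](frames)             # stop after the second hidden layer

feats = deep_features(torch.randn(300, 60))
print(feats.shape)
```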
International Conference on Acoustics, Speech, and Signal Processing | 2016
Tian Tan; Yanmin Qian; Dong Yu; Souvik Kundu; Liang Lu; Khe Chai Sim; Xiong Xiao; Yu Zhang
Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise the speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorize the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline, which uses FMLLR features.
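A minimal sketch of the speaker-aware input used in one such architecture: a fixed per-speaker vector (an i-vector, a bottleneck speaker vector, or a speaking-rate value) is appended to every acoustic frame before it enters the LSTM-RNN. Dimensions are illustrative, not those used in the paper.

```python
import numpy as np

def speaker_aware_inputs(frames, speaker_vector):
    """frames: (T, feat_dim); speaker_vector: (spk_dim,) -> (T, feat_dim + spk_dim).
    The same speaker vector is repeated for every frame of the utterance."""
    tiled = np.tile(speaker_vector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(500, 40)          # e.g. FMLLR features for one utterance
ivector = np.random.randn(100)             # per-speaker i-vector
print(speaker_aware_inputs(frames, ivector).shape)   # (500, 140)
```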
IEEE Automatic Speech Recognition and Understanding Workshop | 2011
Yanmin Qian; Ji Xu; Daniel Povey; Jia Liu
Recently there has been some interest in the question of how to build LVCSR systems when there is only a limited amount of acoustic training data in the target language, but possibly more plentiful data in other languages. In this paper we investigate approaches using MLP based features. We experiment with two approaches: One is based on Automatic Speech Attribute Transcription (ASAT), in which we train classifiers to learn articulatory features. The other approach uses only the target-language data and relies on combination of multiple MLPs trained on different subsets. After system combination we get large improvements of more than 10% relative versus a conventional baseline. These feature-level approaches may also be combined with other, model-level methods for the multilingual or low-resource scenario.
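The paper combines at the system level; purely as a simplified illustration of combining classifiers trained on different subsets, the sketch below averages frame-level log posteriors from several MLPs (a common combination rule, not necessarily the one used in the paper).

```python
import numpy as np

def combine_log_posteriors(log_posteriors):
    """log_posteriors: list of (T, num_classes) arrays, one per MLP.
    Averages them and renormalises so each frame is again a log-distribution."""
    avg = np.mean(np.stack(log_posteriors, axis=0), axis=0)
    return avg - np.logaddexp.reduce(avg, axis=1, keepdims=True)

# Toy posteriors from three MLPs over 200 frames and 40 classes.
mlps = [np.log(np.random.dirichlet(np.ones(40), size=200)) for _ in range(3)]
print(combine_log_posteriors(mlps).shape)   # (200, 40)
```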
International Conference on Acoustics, Speech, and Signal Processing | 2016
Souvik Kundu; Gautam Mantena; Yanmin Qian; Tian Tan; Marc Delcroix; Khe Chai Sim
Deep neural networks (DNNs) for acoustic modeling have been shown to provide impressive results in many state-of-the-art automatic speech recognition (ASR) applications. However, DNN performance degrades due to mismatches between training and testing conditions, and thus adaptation is necessary. In this paper, we explore the use of discriminative auxiliary input features obtained using joint acoustic factor learning for DNN adaptation. These features are derived from a bottleneck (BN) layer of a DNN and are referred to as BN vectors. To derive these BN vectors, we explore two types of joint acoustic factor learning which capture speaker and auxiliary information such as noise, phone and articulatory information of speech. We show that these BN vectors can be used for adaptation and thereby improve the performance of an ASR system. We also show that performance can be further improved by augmenting conventional i-vectors with these BN vectors. Experiments are performed on the Aurora-4, REVERB challenge and AMI databases.
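A hedged sketch of how an utterance-level BN vector might be formed and paired with an i-vector: frame-level activations from an assumed bottleneck layer of a factor-learning DNN are averaged over the utterance, and the result is concatenated with a conventional i-vector before being used as an auxiliary input to the ASR DNN. Dimensions are illustrative.

```python
import numpy as np

def bn_vector(bottleneck_activations):
    """bottleneck_activations: (T, bn_dim) -> (bn_dim,) utterance-level BN vector."""
    return bottleneck_activations.mean(axis=0)

def auxiliary_input(bn_vec, i_vec):
    """Concatenate the BN vector and i-vector into a single auxiliary feature."""
    return np.concatenate([bn_vec, i_vec])

acts = np.random.randn(300, 80)              # frame-level bottleneck outputs
aux = auxiliary_input(bn_vector(acts), np.random.randn(100))
print(aux.shape)                              # (180,)
```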
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Philip C. Woodland; Xunying Liu; Yanmin Qian; Chao Zhang; Mark J. F. Gales; Penny Karanasou; Pierre Lanchantin; Linlin Wang
We describe the development of our speech-to-text transcription systems for the 2015 Multi-Genre Broadcast (MGB) challenge. Key features of the systems are: a segmentation system based on deep neural networks (DNNs); the use of HTK 3.5 for building DNN-based hybrid and tandem acoustic models and the use of these models in a joint decoding framework; techniques for adaptation of DNN-based acoustic models, including parameterised activation function adaptation; alternative acoustic models built using Kaldi; and recurrent neural network language models (RNNLMs) and RNNLM adaptation. The same language models were used with both HTK and Kaldi acoustic models, and various combined systems were built. The final systems had the lowest error rates on the evaluation data.