Network


Latest external collaborations at the country level.

Hotspot


Research topics where Tian Tan is active.

Publications


Featured research published by Tian Tan.


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Yanmin Qian; Mengxiao Bi; Tian Tan; Kai Yu

Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Recently, very deep convolutional neural networks (CNNs) have been successfully applied to computer vision and speech recognition tasks. Based on our previous work on very deep CNNs, in this paper the architecture is developed further to improve recognition accuracy for noise-robust speech recognition. In the proposed very deep CNN architecture, we study the best configuration for the sizes of filters, pooling, and input feature maps: the filter and pooling sizes are reduced and the dimensions of the input features are extended to allow for adding more convolutional layers. Then the appropriate pooling, padding, and input feature map selection strategies are investigated and applied to the very deep CNN to make it more robust for speech recognition. In addition, an in-depth analysis of the architecture reveals key characteristics, such as compact model scale, fast convergence speed, and noise robustness. The proposed model is evaluated on two tasks: the Aurora4 task with multiple additive noise types and channel mismatch, and the AMI meeting transcription task with significant reverberation. Experiments on both tasks show that the proposed very deep CNNs can significantly reduce the word error rate (WER) for noise-robust speech recognition. The best architecture obtains a 10.0% relative reduction over the traditional CNN on AMI, competitive with the long short-term memory recurrent neural network (LSTM-RNN) acoustic model. On Aurora4, even without feature enhancement, model adaptation, and sequence training, it achieves a WER of 8.81%, a 17.0% relative improvement over the LSTM-RNN. To our knowledge, this is the best published result on Aurora4.
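
As a rough illustration of the configuration idea described above (small filters, small pooling, extended input feature maps so more convolutional layers fit), here is a minimal sketch in PyTorch; all layer counts, channel widths, and the 64x17 log-Mel input window are our own illustrative assumptions, not the paper's exact setup:

# Sketch of a "very deep" CNN acoustic model: many 3x3 conv layers,
# small 2x2 pooling, operating on log-Mel filterbank feature maps.
# All sizes below are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class VeryDeepCNN(nn.Module):
    def __init__(self, n_freq=64, n_frames=17, n_targets=4000):
        super().__init__()
        def block(c_in, c_out, pool):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                      nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool2d(kernel_size=2))  # small 2x2 pooling
            return nn.Sequential(*layers)
        self.features = nn.Sequential(
            block(1, 64, pool=True),
            block(64, 128, pool=True),
            block(128, 256, pool=True),
            block(256, 256, pool=False),
        )
        feat_dim = 256 * (n_freq // 8) * (n_frames // 8)
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 2048), nn.ReLU(),
            nn.Linear(2048, n_targets))        # senone posteriors

    def forward(self, x):                      # x: (batch, 1, n_freq, n_frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = VeryDeepCNN()
logits = model(torch.randn(8, 1, 64, 17))      # a batch of 8 windows of 17 frames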


International Conference on Acoustics, Speech, and Signal Processing | 2014

Reshaping deep neural network for fast decoding by node-pruning

Tianxing He; Yuchen Fan; Yanmin Qian; Tian Tan; Kai Yu

Although deep neural networks (DNNs) have achieved significant accuracy improvements in speech recognition, it is computationally expensive to deploy large-scale DNNs in decoding due to the huge number of parameters. Weight truncation and decomposition methods have been proposed to speed up decoding by exploiting the sparseness of DNNs. This paper summarizes different approaches to restructuring DNNs and proposes a new node-pruning approach to reshape a DNN for fast decoding. In this approach, hidden nodes of a fully trained DNN are pruned according to an importance function and the reshaped DNN is re-tuned using back-propagation. The approach requires no modification of the decoding code and directly saves computational cost during decoding. Furthermore, it is complementary to weight decomposition methods. Experiments on a Switchboard task show that, using the proposed node-pruning approach, DNN complexity can be reduced to 37.9%. The complexity can be further reduced to 12.3% without accuracy loss when node-pruning is combined with weight decomposition.
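
A rough sketch of the node-pruning step under assumed details: each hidden node is scored by an importance function (here the mean absolute activation over held-out data, one plausible choice among several), the lowest-scoring nodes are removed by slicing the adjacent weight matrices, and the smaller network would then be re-tuned with back-propagation:

# Sketch of node pruning for one hidden layer of a trained DNN.
# The importance function and pruning ratio are illustrative assumptions.
import numpy as np

def prune_hidden_layer(W_in, b, W_out, activations, keep_ratio=0.5):
    """W_in:  (n_hidden, n_in)   weights into the layer
       b:     (n_hidden,)        biases of the layer
       W_out: (n_out, n_hidden)  weights out of the layer
       activations: (n_frames, n_hidden) activations on held-out data"""
    importance = np.mean(np.abs(activations), axis=0)   # one score per node
    n_keep = int(keep_ratio * len(importance))
    keep = np.argsort(importance)[-n_keep:]             # indices of surviving nodes
    # Slice both adjacent weight matrices so the layer shrinks consistently.
    return W_in[keep, :], b[keep], W_out[:, keep]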


International Conference on Acoustics, Speech, and Signal Processing | 2015

Cluster adaptive training for deep neural network

Tian Tan; Yanmin Qian; Maofan Yin; Yimeng Zhuang; Kai Yu

Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, large performance degradation still exists if the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous works mainly focus on adapting the parameters of a single DNN by regularized or selective fine-tuning, applying linear transforms to the features or hidden-layer outputs, or introducing a vector representation of non-speech variability into the input. These methods all require a relatively large number of parameters to be estimated during adaptation. In contrast, this paper employs the cluster adaptive training (CAT) framework for DNN adaptation. Here, multiple DNNs are constructed to form the bases of a canonical parametric space. During adaptation, an interpolation vector, specific to a particular acoustic condition, is used to combine the multiple DNN bases into a single adapted DNN. The DNN bases can also be constructed at the layer level for more flexibility. The CAT-DNN approach was evaluated on an English Switchboard task in unsupervised adaptation mode. It achieved significant WER reductions over the unadapted DNN-HMM, 6% to 8.5% relative, with only 10 parameters.
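
A minimal sketch of a cluster-adaptive layer, assuming the bases are combined at the layer level as the abstract describes; the basis count K = 10 matches the "10 parameters" figure, while all other sizes are illustrative:

# Sketch of a cluster-adaptive (CAT) layer: K weight bases are combined by a
# per-condition interpolation vector lambda. Only lambda (K numbers) would be
# estimated during adaptation; shapes and K are illustrative.
import torch
import torch.nn as nn

class CATLayer(nn.Module):
    def __init__(self, n_in, n_out, n_bases=10):
        super().__init__()
        self.bases = nn.ModuleList(nn.Linear(n_in, n_out) for _ in range(n_bases))

    def forward(self, x, lam):
        # lam: (n_bases,) interpolation vector for the current acoustic condition
        outs = torch.stack([basis(x) for basis in self.bases], dim=0)
        return torch.einsum('k,kbd->bd', lam, outs)

layer = CATLayer(n_in=440, n_out=2048, n_bases=10)
lam = torch.full((10,), 0.1, requires_grad=True)   # adapted per acoustic condition
y = layer(torch.randn(4, 440), lam)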


International Conference on Acoustics, Speech, and Signal Processing | 2016

Speaker-aware training of LSTM-RNNs for acoustic modelling

Tian Tan; Yanmin Qian; Dong Yu; Souvik Kundu; Liang Lu; Khe Chai Sim; Xiong Xiao; Yu Zhang

Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorise the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline that uses FMLLR features.
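
One of several possible speaker-aware architectures, sketched under the assumption that a fixed per-speaker vector (e.g. an i-vector) is appended to every frame before the LSTM stack; dimensions and the injection point are illustrative, and the paper compares multiple variants:

# Sketch of speaker-aware training: a per-speaker vector is concatenated to
# each frame of the acoustic features before the LSTM acoustic model.
import torch
import torch.nn as nn

class SpeakerAwareLSTM(nn.Module):
    def __init__(self, feat_dim=40, spk_dim=100, hidden=512, n_targets=4000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + spk_dim, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, n_targets)

    def forward(self, feats, spk_vec):
        # feats: (batch, T, feat_dim); spk_vec: (batch, spk_dim), constant over time
        spk = spk_vec.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.lstm(torch.cat([feats, spk], dim=-1))
        return self.out(h)

model = SpeakerAwareLSTM()
logits = model(torch.randn(2, 300, 40), torch.randn(2, 100))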


International Conference on Acoustics, Speech, and Signal Processing | 2016

Joint acoustic factor learning for robust deep neural network based automatic speech recognition

Souvik Kundu; Gautam Mantena; Yanmin Qian; Tian Tan; Marc Delcroix; Khe Chai Sim

Deep neural networks (DNNs) for acoustic modeling have been shown to provide impressive results in many state-of-the-art automatic speech recognition (ASR) applications. However, DNN performance degrades due to mismatches between training and testing conditions, and thus adaptation is necessary. In this paper, we explore the use of discriminative auxiliary input features obtained using joint acoustic factor learning for DNN adaptation. These features are derived from a bottleneck (BN) layer of a DNN and are referred to as BN vectors. To derive these BN vectors, we explore two types of joint acoustic factor learning which capture speaker and auxiliary information such as noise, phone and articulatory information of speech. We show that these BN vectors can be used for adaptation and thereby improve the performance of an ASR system, and that the performance can be further improved by augmenting conventional i-vectors with these BN vectors. Experiments are performed on the Aurora-4, REVERB challenge and AMI databases.
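
A sketch of how a BN vector might be extracted, assuming an auxiliary network trained to predict speaker and environment factors through a narrow bottleneck layer; the target sets, layer sizes and utterance-level averaging are our assumptions:

# Sketch of extracting a bottleneck (BN) vector from an auxiliary factor network.
# The utterance-level BN vector would then be appended to the acoustic features
# (optionally alongside an i-vector) as an auxiliary input to the ASR DNN.
import torch
import torch.nn as nn

class FactorBottleneckNet(nn.Module):
    def __init__(self, feat_dim=40, bn_dim=50, n_speakers=500, n_envs=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, bn_dim))                 # narrow bottleneck layer
        self.speaker_head = nn.Linear(bn_dim, n_speakers)
        self.env_head = nn.Linear(bn_dim, n_envs)    # jointly learned acoustic factor

    def forward(self, frames):                       # frames: (T, feat_dim)
        bn = self.encoder(frames)
        return bn, self.speaker_head(bn), self.env_head(bn)

net = FactorBottleneckNet()
bn, spk_logits, env_logits = net(torch.randn(300, 40))
bn_vector = bn.mean(dim=0)                           # utterance-level BN vector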


International Conference on Acoustics, Speech, and Signal Processing | 2016

Integrated adaptation with multi-factor joint-learning for far-field speech recognition

Yanmin Qian; Tian Tan; Dong Yu; Yu Zhang

Although great progress has been made in automatic speech recognition (ASR), significant performance degradation still exists in distant-talking scenarios due to significantly lower signal power. In this paper, a novel adaptation framework, named integrated adaptation with multi-factor joint-learning, is proposed to improve recognition accuracy for distant speech recognition. We explore and extract speaker, phone and environment factor representations using deep neural networks (DNNs), which are integrated into the main ASR DNN to improve classification accuracy. In addition, the hidden activations in the main ASR DNN are used to improve the factor extraction, which in turn helps the ASR DNN. All the model parameters, including those in the ASR DNN and the factor extractor DNNs, are jointly optimized under the multi-task learning framework. Furthermore, unlike prior techniques, our approach requires no explicit separate stages for factor extraction and adaptation. Experiments on the AMI single distant microphone (SDM) task show that the proposed architecture can significantly reduce word error rate (WER), and additional improvement can be achieved by combining it with i-vector adaptation. Our best configuration obtained more than 15% and 10% relative WER reduction over the baselines using the SDM and close-talk data generated alignments, respectively.
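
A minimal sketch of the multi-task objective implied by the abstract: the senone cross-entropy of the main ASR DNN plus weighted factor-classification losses, with all parameters updated together. The loss weights and the specific factor heads are illustrative assumptions:

# Sketch of the joint multi-task objective used to optimize the ASR DNN and
# the factor extractor DNNs together; weights are illustrative.
import torch
import torch.nn.functional as F

def joint_loss(senone_logits, senone_tgt,
               spk_logits, spk_tgt, env_logits, env_tgt,
               w_spk=0.3, w_env=0.3):
    loss_asr = F.cross_entropy(senone_logits, senone_tgt)   # main ASR task
    loss_spk = F.cross_entropy(spk_logits, spk_tgt)         # speaker factor task
    loss_env = F.cross_entropy(env_logits, env_tgt)         # environment factor task
    return loss_asr + w_spk * loss_spk + w_env * loss_env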


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition

Yanmin Qian; Tian Tan; Dong Yu

Although great progress has been made in automatic speech recognition (ASR), significant performance degradation still exists in noisy environments. In this paper, a novel factor-aware training framework, named neural network-based multi-factor aware joint training, is proposed to improve recognition accuracy for noise-robust speech recognition. This approach is a structured model which integrates several different functional modules into one deep computational model. We explore and extract speaker, phone, and environment factor representations using deep neural networks (DNNs), which are integrated into the main ASR DNN to improve classification accuracy. In addition, the hidden activations in the main ASR DNN are used to improve factor extraction, which in turn helps the ASR DNN. All the model parameters, including those in the ASR DNN and the factor extraction DNNs, are jointly optimized under the multi-task learning framework. Unlike prior techniques for factor-aware training, our approach requires no explicit separate stages for factor extraction and adaptation. Moreover, the proposed multi-factor aware joint training can easily be combined with conventional factor-aware training that uses explicit factors, such as i-vectors, noise energy, and the T60 value, to obtain additional improvement. The proposed method is evaluated on two noise-robust tasks: the AMI single distant microphone task, in which reverberation is the main concern, and the Aurora4 task, in which multiple noise types exist. Experiments on both tasks show that the proposed model can significantly reduce the word error rate (WER). The best configuration achieved more than 15% relative reduction in WER over the baselines on these two tasks.
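
As a small illustration of combining the learned factor representations with explicit factors, here is a hypothetical helper that concatenates an i-vector, a noise-energy value and a T60 value onto the frame-level features; the function name and all dimensions are ours, not the paper's:

# Sketch of augmenting frame-level inputs with both learned factor features and
# explicit, utterance-level factors (i-vector, noise energy, T60). Illustrative only.
import torch

def augment_input(acoustic, learned_factors, ivector, noise_energy, t60):
    # acoustic: (T, feat_dim); learned_factors: (T, d); the rest are utterance-level
    T = acoustic.size(0)
    explicit = torch.cat([ivector, noise_energy, t60]).expand(T, -1)
    return torch.cat([acoustic, learned_factors, explicit], dim=1)

x = augment_input(torch.randn(300, 40), torch.randn(300, 50),
                  torch.randn(100), torch.tensor([0.3]), torch.tensor([0.5]))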


International Conference on Acoustics, Speech, and Signal Processing | 2016

An investigation into using parallel data for far-field speech recognition

Yanmin Qian; Tian Tan; Dong Yu

Far-field speech recognition is an important yet challenging task due to the low signal-to-noise ratio. In this paper, three novel deep neural network architectures are explored to improve far-field speech recognition accuracy by exploiting parallel far-field and close-talk recordings. All three architectures use multi-task learning for model optimization but focus on three different ideas: dereverberation and recognition joint-learning, close-talk and far-field model knowledge sharing, and environment-code aware training. Experiments on the AMI single distant microphone (SDM) task show that each of the proposed methods can boost accuracy individually, and additional improvement can be obtained with appropriate integration of these models. Overall, we reduced the error rate by 10% relative on the SDM set by exploiting the IHM data.
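
A sketch of the dereverberation-and-recognition joint-learning idea, assuming a shared network over far-field (SDM) features with one head regressing to the parallel close-talk (IHM) features and one head classifying senones; layer sizes and the unweighted loss sum are illustrative:

# Sketch of joint dereverberation + recognition multi-task learning on
# parallel SDM/IHM data. All sizes and the loss weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDereverbASR(nn.Module):
    def __init__(self, feat_dim=40, hidden=1024, n_targets=4000):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.dereverb_head = nn.Linear(hidden, feat_dim)   # regresses to IHM features
        self.asr_head = nn.Linear(hidden, n_targets)       # predicts senones

    def forward(self, sdm_feats):
        h = self.shared(sdm_feats)
        return self.dereverb_head(h), self.asr_head(h)

model = JointDereverbASR()
ihm_pred, senone_logits = model(torch.randn(32, 40))
loss = F.mse_loss(ihm_pred, torch.randn(32, 40)) \
       + F.cross_entropy(senone_logits, torch.randint(0, 4000, (32,)))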


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Cluster adaptive training for deep neural network based acoustic model

Tian Tan; Yanmin Qian; Kai Yu

Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, significant performance degradation is observed if the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous DNN adaptation works mainly focus on adapting the parameters of a single DNN by applying linear transformations to the features or hidden-layer outputs, or by introducing a vector representation of non-speech variability into the input. In these methods, a large number of parameters must be estimated during adaptation. In this paper, the cluster adaptive training (CAT) framework is employed for DNN adaptive training. Here, multiple weight matrices are constructed to form the bases of a canonical parametric space. During adaptation, for a new acoustic condition, an interpolation vector is estimated to combine the weight bases into a single adapted weight matrix. Since only the interpolation vector needs to be estimated during adaptation, the number of updated parameters is much smaller than in existing DNN adaptation methods. The CAT-DNN approach was evaluated on an English Switchboard task in unsupervised adaptation mode. It achieved significant WER reductions over the unadapted DNN-HMM, 7.6% to 10.6% relative, with only 10 parameters.
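
Written as a formula (the notation below paraphrases the abstract rather than the paper's exact symbols), the adapted weight matrix of a layer is an interpolation of the K cluster weight bases learned in training:

    \hat{W} = \sum_{k=1}^{K} \lambda_k W_k

where W_1, ..., W_K are the weight bases and \lambda = (\lambda_1, ..., \lambda_K) is the interpolation vector (here K = 10, matching the "10 parameters") estimated for each new acoustic condition during unsupervised adaptation.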


International Conference on Signal Processing | 2016

An investigation on deep learning with beta stabilizer

Qi Liu; Tian Tan; Kai Yu

Artificial neural networks (ANNs) have been used in many applications such as handwriting recognition and speech recognition. It is well known that the learning rate is a crucial value in the training procedure of artificial neural networks. The initial value of the learning rate can strongly affect the final result, and in practice this value is usually set manually. A parameter called the beta stabilizer has been introduced to reduce the sensitivity to the initial learning rate, but this method has only been proposed for deep neural networks (DNNs) with the sigmoid activation function. In this paper we extend the beta stabilizer to long short-term memory (LSTM) networks and investigate the effects of beta stabilizer parameters on different models, including LSTMs and DNNs with the ReLU activation function. We conclude that beta stabilizer parameters can reduce the sensitivity to the learning rate with almost the same performance on DNNs with the ReLU activation function and on LSTMs. However, the effects of the beta stabilizer on DNNs with ReLU activation and on LSTMs are smaller than its effects on DNNs with the sigmoid activation function.
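
One common formulation of a stabilizer-style layer, which may differ from the paper's exact parameterisation: a learnable per-layer scalar scales the pre-activation so each layer can adjust its effective step size, reducing sensitivity to the initial global learning rate:

# Sketch of a "stabilizer"-style layer: a learnable per-layer scalar (exp(beta))
# scales the pre-activation and is trained jointly with the weights.
# This formulation is an assumption; the paper's parameterisation may differ.
import torch
import torch.nn as nn

class StabilizedLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.linear = nn.Linear(n_in, n_out)
        self.beta = nn.Parameter(torch.zeros(1))    # stabilizer parameter

    def forward(self, x):
        return torch.relu(torch.exp(self.beta) * self.linear(x))

layer = StabilizedLinear(40, 1024)
y = layer(torch.randn(8, 40))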

Collaboration


An overview of Tian Tan's collaborations.

Top Co-Authors

Yanmin Qian, Shanghai Jiao Tong University
Kai Yu, Shanghai Jiao Tong University
Khe Chai Sim, National University of Singapore
Souvik Kundu, National University of Singapore
Maofan Yin, Shanghai Jiao Tong University
Wen Ding, Shanghai Jiao Tong University
Yimeng Zhuang, Shanghai Jiao Tong University
Yu Zhang, Massachusetts Institute of Technology
Gautam Mantena, National University of Singapore