Li-Rong Dai
University of Science and Technology of China
Publications
Featured research published by Li-Rong Dai.
IEEE Signal Processing Letters | 2014
Yong Xu; Jun Du; Li-Rong Dai; Chin-Hui Lee
This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Incorporating acoustic context was found to improve the continuity of the speech separated from background noise, without the annoying musical artifacts commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer DNN-based enhanced speech to that obtained with the conventional technique.
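A minimal PyTorch sketch of the kind of regression DNN described above: noisy log-power-spectrum frames, expanded with symmetric acoustic context, are mapped to the clean center frame under an MSE criterion. The number of bins, context width, and layer sizes are illustrative assumptions, not values from the paper.

```python
# Sketch of a regression-style enhancement DNN: noisy context-expanded frames
# in, clean center frame out, trained with mean squared error.
import torch
import torch.nn as nn

N_BINS = 257        # spectral bins per frame (assumed)
CONTEXT = 5         # frames of context on each side (assumed)
IN_DIM = N_BINS * (2 * CONTEXT + 1)

model = nn.Sequential(
    nn.Linear(IN_DIM, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, N_BINS),          # linear output layer for regression
)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(noisy_context, clean_center):
    """noisy_context: (batch, IN_DIM); clean_center: (batch, N_BINS)."""
    optimizer.zero_grad()
    loss = criterion(model(noisy_context), clean_center)
    loss.backward()
    optimizer.step()
    return loss.item()
```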
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Yong Xu; Jun Du; Li-Rong Dai; Chin-Hui Lee
In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE-based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificially synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios, without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
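A hedged NumPy sketch of two of the add-ons mentioned above, noise-aware training and global variance equalization, under assumed feature shapes of (frames, dims) log-power spectra per utterance. The noise estimator and reference statistics below are common conventions, not the paper's exact recipe.

```python
import numpy as np

def noise_aware_input(noisy, n_noise_frames=6):
    """Noise-aware training input: append a crude noise estimate (mean of the
    first few frames; the exact estimator is an assumption) to every frame."""
    noise_est = noisy[:n_noise_frames].mean(axis=0, keepdims=True)
    return np.concatenate([noisy, np.repeat(noise_est, len(noisy), axis=0)], axis=1)

def gv_equalize(enhanced, ref_gv):
    """Global variance equalization: rescale each feature dimension so its
    variance matches a reference global variance (e.g. measured on clean
    training data), counteracting the over-smoothing of the regression output."""
    mean = enhanced.mean(axis=0, keepdims=True)
    cur_gv = enhanced.var(axis=0, keepdims=True) + 1e-8
    return mean + (enhanced - mean) * np.sqrt(ref_gv / cur_gv)
```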
acm multimedia | 2007
Meng Wang; Xian-Sheng Hua; Xun Yuan; Yan Song; Li-Rong Dai
Learning-based semantic video annotation is a promising approach for enabling content-based video search. However, severe difficulties, such as the insufficiency of training data and the curse of dimensionality, are frequently encountered. This paper proposes a novel unified scheme, Optimized Multi-Graph-based Semi-Supervised Learning (OMG-SSL), to simultaneously attack these difficulties. Instead of using only a single graph, OMG-SSL integrates multiple graphs into a regularization and optimization framework to sufficiently explore their complementary nature. We then show that various crucial factors in video annotation, including multiple modalities, multiple distance metrics, and temporal consistency, in fact all correspond to different correlations among samples, and hence they can be represented by different graphs. Therefore, OMG-SSL is able to simultaneously deal with these factors within a unified framework. Experiments on the TRECVID benchmark demonstrate the effectiveness of the proposed approach.
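An illustrative NumPy sketch of the multi-graph idea: several affinity graphs (one per modality, distance metric, or temporal link) are combined with weights and used in a standard label-propagation loop. The combination and propagation shown here are the generic graph-SSL recipe, not the exact OMG-SSL optimization.

```python
import numpy as np

def propagate(graphs, weights, y_init, labeled_mask, alpha=0.9, iters=50):
    """graphs: list of (n, n) affinity matrices; y_init: (n, c) one-hot labels
    (zeros for unlabeled samples); labeled_mask: (n,) boolean array."""
    W = sum(w * g for w, g in zip(weights, graphs))   # weighted graph fusion
    d = W.sum(axis=1, keepdims=True) + 1e-12
    S = W / np.sqrt(d) / np.sqrt(d.T)                 # symmetric normalization
    F = y_init.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * y_init      # propagate labels
        F[labeled_mask] = y_init[labeled_mask]        # clamp labeled samples
    return F.argmax(axis=1)
```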
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Ling-Hui Chen; Zhen-Hua Ling; Li-Juan Liu; Li-Rong Dai
This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). The conventional joint density Gaussian mixture model (JDGMM) based spectral conversion methods perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features as well as the non-linear mapping relationship between the source and target speakers, and 2) spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we proposed the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose to use a DNN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). The proposed training method therefore takes advantage of the strong modeling ability of RBMs in modeling the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. Careful comparisons and analysis between the proposed method and several conventional methods are presented in this paper. The subjective results show that the proposed method can significantly improve performance in terms of both similarity and naturalness compared to conventional methods.
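A minimal sketch of the conversion-time mapping only: a DNN that maps a source speaker's spectral-envelope frame to the target speaker's. In the paper the layers are initialized generatively from two RBMs linked by a Bernoulli BAM; that pre-training is omitted here and replaced by default random initialization, so this is an assumption-laden stand-in, with the envelope dimension and layer widths chosen arbitrarily.

```python
import torch
import torch.nn as nn

ENV_DIM = 513   # spectral-envelope dimension per frame (assumed)

converter = nn.Sequential(
    nn.Linear(ENV_DIM, 1024), nn.Sigmoid(),   # would be initialized from the source RBM
    nn.Linear(1024, 1024), nn.Sigmoid(),      # would be initialized from the BBAM
    nn.Linear(1024, ENV_DIM),                 # would be initialized from the target RBM
)

def convert(source_envelopes):
    """source_envelopes: (frames, ENV_DIM) tensor -> converted envelopes."""
    with torch.no_grad():
        return converter(source_envelopes)
```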
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Shaofei Xue; Ossama Abdel-Hamid; Hui Jiang; Li-Rong Dai; Qingfeng Liu
Fast adaptation of deep neural networks (DNNs) is an important research topic in deep learning. In this paper, we propose a general adaptation scheme for DNNs based on discriminant condition codes, which are directly fed into various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn the connection weights from training data, as well as the corresponding adaptation methods to learn a new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either the frame-level cross-entropy or the sequence-level maximum mutual information training criterion. We propose three different ways to apply this adaptation scheme based on the so-called speaker codes: i) nonlinear feature normalization in feature space; ii) direct model adaptation of the DNN based on speaker codes; iii) joint speaker-adaptive training with speaker codes. We evaluate the proposed adaptation methods on two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition on the Switchboard task. Experimental results show that all three methods are quite effective in adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results show that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we achieve very good performance on Switchboard (12.1% WER) after speaker adaptation using the sequence training criterion, which is very close to the best performance reported on this task (T. N. Sainath et al., "Deep convolutional neural networks for LVCSR," Proc. IEEE ICASSP, 2013).
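A hedged PyTorch sketch of the speaker-code idea: a small code vector is fed into hidden layers through extra connection weights, and at adaptation time only the code is updated by back-propagation while the DNN weights stay frozen. The dimensions, number of layers, and optimizer settings are assumptions for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM, HID, N_STATES, CODE_DIM = 440, 1024, 3000, 50  # assumed sizes

class SCLayer(nn.Module):
    """Hidden layer that also receives the speaker code via new weights."""
    def __init__(self, in_dim, out_dim, code_dim):
        super().__init__()
        self.main = nn.Linear(in_dim, out_dim)
        self.code_proj = nn.Linear(code_dim, out_dim, bias=False)  # new connection weights
    def forward(self, x, code):
        return torch.sigmoid(self.main(x) + self.code_proj(code))

class SCDNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = SCLayer(FEAT_DIM, HID, CODE_DIM)
        self.l2 = SCLayer(HID, HID, CODE_DIM)
        self.out = nn.Linear(HID, N_STATES)
    def forward(self, x, code):
        return self.out(self.l2(self.l1(x, code), code))

def adapt_speaker_code(model, feats, targets, steps=50, lr=0.1):
    """Estimate a new speaker's code with the DNN weights frozen."""
    code = torch.zeros(1, CODE_DIM, requires_grad=True)
    opt = torch.optim.SGD([code], lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = ce(model(feats, code.expand(len(feats), -1)), targets)
        loss.backward()
        opt.step()
    return code.detach()
```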
international conference on acoustics, speech, and signal processing | 2014
Shaofei Xue; Ossama Abdel-Hamid; Hui Jiang; Li-Rong Dai
Recently, an effective fast speaker adaptation method using discriminative speaker codes (SCs) was proposed for hybrid DNN-HMM models in speech recognition [1]. This adaptation method relies on jointly learning a large generic adaptation neural network for all speakers as well as multiple small speaker codes using the standard back-propagation algorithm. In this paper, we propose an alternative direct adaptation in model space, where speaker codes are directly connected to the original DNN models through a set of new connection weights, which can be estimated very efficiently from all or part of the training data. As a result, the proposed method is more suitable for large-scale speech recognition tasks, since it eliminates the time-consuming training of a separate adaptation neural network. In this work, we evaluate the proposed direct SC-based adaptation method on the large-scale 320-hr Switchboard task. Experimental results show that the proposed SC-based rapid adaptation method is very effective not only for small recognition tasks but also for very large-scale tasks. For example, the proposed method yields up to 8% relative reduction in word error rate on Switchboard using only a very small number of adaptation utterances per speaker (from 10 to a few dozen). Moreover, the extra training time required for adaptation is also significantly reduced compared with the method in [1].
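A rough sketch of the training phase specific to this direct variant, reusing the SCDNN/SCLayer sketch above: the new connection weights and one code per training speaker are estimated jointly on (part of) the training data, while the original DNN weights stay fixed, so no separate adaptation network is trained. Optimizer settings and data handling are assumptions.

```python
import torch
import torch.nn as nn

def train_connection_weights(model, batches, n_speakers, code_dim=50,
                             epochs=2, lr=0.05):
    """batches: iterable of (feats, targets, speaker_id) minibatches.
    Only the code-projection weights and the per-speaker codes are updated."""
    codes = nn.Parameter(torch.zeros(n_speakers, code_dim))
    conn_params = [p for name, p in model.named_parameters() if "code_proj" in name]
    opt = torch.optim.SGD([codes] + conn_params, lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, targets, spk in batches:
            opt.zero_grad()
            code = codes[spk].expand(len(feats), -1)
            loss = ce(model(feats, code), targets)
            loss.backward()
            opt.step()
    return codes.detach()
```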
international conference on acoustics, speech, and signal processing | 2015
Tian Gao; Jun Du; Li-Rong Dai; Chin-Hui Lee
Based on the recently proposed speech pre-processing front-end with deep neural networks (DNNs), we first investigate feature mapping directly from noisy speech via DNNs for robust speech recognition. Next, we propose to jointly train a single DNN for both feature mapping and acoustic modeling. Finally, we show that the word error rate (WER) of the jointly trained system can be significantly reduced by fusing multiple DNN pre-processing systems, which implies that features obtained from different domains of the DNN-enhanced speech signals are strongly complementary. Tested on the Aurora4 noisy speech recognition task, our best system with multi-condition training achieves an average WER of 10.3%, yielding a relative reduction of 16.3% over our previous DNN pre-processing only system with a WER of 12.3%. To the best of our knowledge, this represents the best published result on the Aurora4 task without using any adaptation techniques.
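A schematic PyTorch sketch of the joint-training idea: a pre-trained feature-mapping (enhancement) DNN is stacked under an acoustic-model DNN and the whole stack is fine-tuned with the senone cross-entropy loss, so the front-end learns a mapping that directly helps recognition. All dimensions and layer counts are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

NOISY_DIM, CLEAN_DIM, HID, N_SENONES = 429, 429, 2048, 2000  # assumed sizes

feature_mapper = nn.Sequential(          # pre-trained noisy -> "clean" feature mapping
    nn.Linear(NOISY_DIM, HID), nn.Sigmoid(),
    nn.Linear(HID, CLEAN_DIM),
)
acoustic_model = nn.Sequential(          # pre-trained senone classifier
    nn.Linear(CLEAN_DIM, HID), nn.Sigmoid(),
    nn.Linear(HID, N_SENONES),
)
joint = nn.Sequential(feature_mapper, acoustic_model)   # single stack, trained end to end

optimizer = torch.optim.SGD(joint.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def joint_step(noisy_feats, senone_targets):
    optimizer.zero_grad()
    loss = criterion(joint(noisy_feats), senone_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```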
acm multimedia | 2007
Meng Wang; Tao Mei; Xun Yuan; Yan Song; Li-Rong Dai
Graph-based semi-supervised learning methods have been proven effective in tackling the difficulty of insufficient training data in many practical applications such as video annotation. These methods are all based on the assumption that the labels of similar samples are close. However, as a crucial factor of these algorithms, the estimation of pairwise similarity has not been sufficiently studied. Usually, the similarity of two samples is estimated based on the Euclidean distance between them. But we show that similarities are related not only to distances but also to the structures around the samples, and that a distance-based similarity measure may lead to high classification error rates even on several simple datasets. In this paper we propose a novel neighborhood similarity measure, which simultaneously takes into account both the distance between samples and the difference between the structures around the corresponding samples. Experiments on a synthetic dataset and the TRECVID benchmark demonstrate that the neighborhood similarity is superior to the existing distance-based similarity.
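An illustrative NumPy sketch of the general idea rather than the paper's exact formula: a similarity that decays with Euclidean distance but is also penalized when the local neighborhood statistics around the two samples differ, so points sitting in dissimilar local structures score lower.

```python
import numpy as np

def neighborhood_similarity(X, i, j, k=10, sigma=1.0):
    """X: (n, d) samples; returns a structure-aware similarity for the pair (i, j)."""
    d_ij = np.linalg.norm(X[i] - X[j])

    def local_scale(p):
        # mean distance to the k nearest neighbors (excluding the point itself)
        dists = np.linalg.norm(X - X[p], axis=1)
        return np.sort(dists)[1:k + 1].mean()

    s_i, s_j = local_scale(i), local_scale(j)
    structure_penalty = abs(s_i - s_j) / (s_i + s_j + 1e-12)
    return np.exp(-d_ij**2 / (2 * sigma**2)) * (1.0 - structure_penalty)
```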
international conference on acoustics, speech, and signal processing | 2013
Yebo Bao; Hui Jiang; Li-Rong Dai; Cong Liu
Recently, the hybrid model combining deep neural networks (DNNs) with context-dependent HMMs has achieved dramatic gains over the conventional GMM/HMM method in many speech recognition tasks. In this paper, we study how to compete with the state-of-the-art DNN/HMM method within the traditional GMM/HMM framework. Instead of using the DNN as an acoustic model, we use it as a front-end bottleneck (BN) feature extractor to decorrelate long feature vectors concatenated from several consecutive speech frames. More importantly, we propose two novel incoherent training methods to explicitly decorrelate BN features during DNN learning. The first method minimizes the coherence of the weight matrices in the DNN, while the second minimizes the correlation coefficients of the BN features calculated on each mini-batch during DNN training. Experimental results on a 70-hr Mandarin transcription task and the 309-hr Switchboard task show that traditional GMM/HMMs using BN features can yield performance comparable to DNN/HMMs. The proposed incoherent training produces a 2-3% additional gain over the baseline BN features. Finally, the discriminatively trained GMM/HMMs using incoherently trained BN features consistently surpass the state-of-the-art DNN/HMMs in all evaluated tasks.
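A hedged PyTorch sketch of the second incoherent-training variant described above: a penalty on the off-diagonal correlation coefficients of the bottleneck activations within each mini-batch is added to the usual cross-entropy loss, pushing the BN features toward decorrelation. The penalty weight and the way the bottleneck activations are obtained are assumptions.

```python
import torch
import torch.nn as nn

def correlation_penalty(bn_feats, eps=1e-8):
    """bn_feats: (batch, bn_dim). Mean squared off-diagonal correlation coefficient."""
    x = bn_feats - bn_feats.mean(dim=0, keepdim=True)
    cov = x.t() @ x / (len(x) - 1)
    std = cov.diagonal().clamp_min(eps).sqrt()
    corr = cov / (std.unsqueeze(0) * std.unsqueeze(1))
    off_diag = corr - torch.diag(corr.diagonal())
    return (off_diag ** 2).mean()

def incoherent_loss(logits, targets, bn_feats, lam=0.1):
    """Cross-entropy plus the decorrelation penalty on the BN layer activations."""
    return nn.functional.cross_entropy(logits, targets) + lam * correlation_penalty(bn_feats)
```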
international conference on acoustics, speech, and signal processing | 2009
Haizhou Li; Bin Ma; Kong-Aik Lee; Hanwu Sun; Donglai Zhu; Khe Chai Sim; Changhuai You; Rong Tong; Ismo Kärkkäinen; Chien-Lin Huang; Vladimir Pervouchine; Wu Guo; Yijie Li; Li-Rong Dai; Mohaddeseh Nosratighods; Thiruvaran Tharmarajah; Julien Epps; Eliathamby Ambikairajah; Eng Siong Chng; Tanja Schultz; Qin Jin
This paper describes the performance of the I4U speaker recognition system in the NIST 2008 Speaker Recognition Evaluation. The system consists of seven subsystems, each with different cepstral features and classifiers. We describe the I4U Primary system and report on its core test results as they were submitted, which were among the best-performing submissions. The I4U effort was led by the Institute for Infocomm Research, Singapore (IIR), with contributions from the University of Science and Technology of China (USTC), the University of New South Wales, Australia (UNSW), Nanyang Technological University, Singapore (NTU) and Carnegie Mellon University, USA (CMU).