Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Peng Shen is active.

Publication


Featured research published by Peng Shen.


Speech Communication | 2016

Combination of multiple acoustic models with unsupervised adaptation for lecture speech transcription

Peng Shen; Xugang Lu; Xinhui Hu; Naoyuki Kanda; Masahiro Saiko; Chiori Hori; Hisashi Kawai

Automatic speech recognition (ASR) systems have achieved considerable progress in real applications owing to careful architecture design combined with advanced techniques and algorithms. However, how to design a system that efficiently integrates these various techniques to obtain state-of-the-art performance is still a challenging task. In this paper, we introduce an ASR system based on ensemble model combination and adaptation with two characteristics: (1) large-scale combination of multiple ASR systems using Recognizer Output Voting Error Reduction (ROVER), and (2) multi-pass unsupervised speaker adaptation of the deep neural network acoustic models and topic adaptation of the language model. The multiple acoustic models were trained with different acoustic features and model architectures, which provided complementary and discriminative information in the ROVER process. With these multiple acoustic models, a better estimate of word confidence could be obtained from the ROVER process, which helped in selecting data for unsupervised adaptation of the previously trained acoustic models. The final recognition result was obtained using multi-pass decoding, ROVER, and adaptation. We tested the system on lecture speech with topics related to Technology, Entertainment and Design (TED) used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign, and obtained word error rates of 6.5%, 7.0%, 10.6%, and 8.4% on the 2011, 2012, 2013, and 2014 test sets, which to our knowledge are the best results for these evaluation sets.
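The core of the combination step is word-level voting over aligned system outputs. Below is a minimal Python sketch of ROVER-style confidence-weighted voting, assuming the hypotheses from the multiple systems have already been aligned into word slots (the actual NIST ROVER tool builds this alignment with dynamic programming); `slots`, `alpha`, and `rover_vote` are illustrative names, not from the paper.

```python
def rover_vote(slots, alpha=0.5):
    """slots: list of aligned word slots; each slot holds one
    (word, confidence) pair per ASR system ('' marks a deletion).
    alpha trades off vote counts against confidence scores."""
    output = []
    for slot in slots:
        n = len(slot)
        words = {}
        for word, conf in slot:
            words.setdefault(word, []).append(conf)
        # score(w) = alpha * vote fraction + (1 - alpha) * average confidence
        def score(w):
            return alpha * len(words[w]) / n + \
                   (1 - alpha) * sum(words[w]) / len(words[w])
        best = max(words, key=score)
        if best:  # skip the empty "deletion" token
            output.append(best)
    return output

# Example: three systems voting on two word slots.
slots = [
    [("speech", 0.9), ("speech", 0.8), ("peach", 0.4)],
    [("recognition", 0.7), ("", 0.2), ("recognition", 0.6)],
]
print(rover_vote(slots))  # ['speech', 'recognition']
```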


International Conference on Acoustics, Speech, and Signal Processing | 2016

Local fisher discriminant analysis for spoken language identification

Peng Shen; Xugang Lu; Lemao Liu; Hisashi Kawai

The i-vector is a state-of-the-art technique widely used in spoken language identification systems. Since i-vectors capture total variability, discriminant analysis methods such as linear discriminant analysis (LDA) and nonparametric discriminant analysis (NDA) have been introduced to find the most discriminative features while removing variability that is irrelevant to language identity. However, these methods either ignore the local structure of the data or exploit it only weakly. In this study, we introduce local Fisher discriminant analysis (LFDA) as a post-processing discriminant analysis method to extract discriminative features from i-vectors. LFDA is a full-rank method that takes the local structure of the data into account and therefore suits non-Gaussian, i.e., multimodal, data distributions. Compared with LDA and NDA, LFDA is a pair-wise local method that tightens the distribution of samples within the same class, yielding more discriminative features. Experimental results indicate that LFDA is more effective than LDA and NDA for the i-vector-based language identification task.
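For readers unfamiliar with LFDA, here is a minimal sketch under simplifying assumptions: binary k-nearest-neighbour affinities stand in for the local-scaling affinities of the original LFDA formulation, and a small ridge term is added for numerical stability. Function and variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, dim, k=7):
    """X: (n, d) i-vectors, y: (n,) language labels.
    Returns a (d, dim) projection matrix. O(n^2) loops: a sketch,
    not an efficient implementation."""
    n, d = X.shape
    Slw = np.zeros((d, d))  # local within-class scatter
    Slb = np.zeros((d, d))  # local between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        nc = len(Xc)
        # Binary k-NN affinity within the class (A_ij in {0, 1}).
        D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        knn = np.argsort(D, axis=1)[:, 1:k + 1]
        A = np.zeros((nc, nc))
        for i, idx in enumerate(knn):
            A[i, idx] = 1.0
        A = np.maximum(A, A.T)  # symmetrise
        for i in range(nc):
            for j in range(nc):
                diff = np.outer(Xc[i] - Xc[j], Xc[i] - Xc[j])
                Slw += 0.5 * (A[i, j] / nc) * diff
                Slb += 0.5 * A[i, j] * (1.0 / n - 1.0 / nc) * diff
    # Pairs from different classes contribute weight 1/n to S_lb.
    for i in range(n):
        for j in range(n):
            if y[i] != y[j]:
                Slb += 0.5 * (1.0 / n) * np.outer(X[i] - X[j], X[i] - X[j])
    # Generalized eigenproblem S_lb v = lambda S_lw v; keep top eigenvectors.
    vals, vecs = eigh(Slb, Slw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1][:dim]]
```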


Computer Speech & Language | 2017

Regularization of neural network model with distance metric learning for i-vector based spoken language identification

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

Highlights:
- Pair-wise distance metric learning is designed on the feature transform layers.
- A soft-max layer is stacked on the feature transform layers as a classifier layer.
- The coupled feature extraction and classification layers are learned with a regularized objective function.
- Experiments showed significant improvements over conventional regularization algorithms.

The i-vector representation and modeling technique has been successfully applied in spoken language identification (SLI). The advantage of the i-vector representation is that any speech utterance of variable duration can be represented as a fixed-length vector. In modeling, a discriminative transform or classifier must be applied to emphasize the variations correlated with language identity, since the i-vector representation encodes several types of acoustic variation (e.g., speaker variation, transmission channel variation, etc.). Owing to its strong nonlinear discriminative power, the neural network model has been used directly to learn the mapping function between the i-vector representation and the language identity labels. In most studies, only point-wise feature-label information is fed to the model for parameter learning, which may result in overfitting, particularly with limited training data. In this study, we propose to integrate pair-wise distance metric learning as a regularizer for model parameter optimization. In the representation space of the nonlinear transforms in the hidden layers, a distance metric learning is explicitly designed to minimize the pair-wise intra-class variation and maximize the inter-class variation. Using the pair-wise distance metric learning, the i-vectors are transformed to a new feature space in which they are much more discriminative for samples belonging to different languages and much more similar for samples belonging to the same language. We tested the algorithm on an SLI task and obtained promising results that outperformed conventional regularization methods.
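As a rough illustration of the idea, here is a minimal PyTorch sketch that adds a contrastive-style pair-wise metric loss on the last feature-transform layer to the point-wise cross-entropy objective; the layer sizes, margin, and weight `lam` are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLINet(nn.Module):
    def __init__(self, ivec_dim=400, hidden=512, n_langs=10):
        super().__init__()
        self.transform = nn.Sequential(   # feature-transform layers
            nn.Linear(ivec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_langs)  # soft-max layer

    def forward(self, x):
        h = self.transform(x)
        return self.classifier(h), h

def pairwise_metric_loss(h, y, margin=1.0):
    """Pull same-language pairs together, push different-language
    pairs apart up to a margin, in the hidden representation space."""
    d = torch.cdist(h, h)                      # pairwise Euclidean distances
    same = (y[:, None] == y[None, :]).float()
    intra = (same * d.pow(2)).sum() / same.sum().clamp(min=1)
    inter_mask = 1.0 - same
    inter = (inter_mask * F.relu(margin - d).pow(2)).sum() \
            / inter_mask.sum().clamp(min=1)
    return intra + inter

def total_loss(logits, h, y, lam=0.1):
    # Point-wise cross-entropy plus the pair-wise metric regularizer.
    return F.cross_entropy(logits, y) + lam * pairwise_metric_loss(h, y)
```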


International Symposium on Chinese Spoken Language Processing | 2016

Automatic acoustic segmentation in N-best list rescoring for lecture speech recognition

Peng Shen; Xugang Lu; Hisashi Kawai

Speech segmentation is important in automatic speech recognition (ASR) and machine translation (MT). In particular, in N-best list rescoring, generating N-best lists with as many candidates as possible from a decoding lattice requires proper utterance segmentation. In lecture speech recognition, only long audio recordings are provided, without any utterance segmentation information. In addition, the recordings contain not only speech but also other acoustic events, e.g., laughter and applause. Traditional speech segmentation algorithms for ASR focus on acoustic cues, while in MT, speech text segmentation algorithms pay more attention to linguistic cues. In this study, we propose a three-stage speech segmentation framework that integrates both acoustic and linguistic cues. We tested the framework on lecture speech recognition, and our results showed the effectiveness of the proposed segmentation algorithm.
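As a rough illustration of how acoustic and linguistic cues might be combined, the sketch below shows a silence-based first pass and a length-constrained re-splitting pass guided by a hypothetical linguistic boundary scorer; the paper's actual three stages and models may differ, and all names here are illustrative.

```python
def acoustic_segments(speech_post, min_sil=30, frame_shift=0.01):
    """Acoustic pass: cut wherever >= min_sil consecutive frames are
    classified as non-speech. speech_post: per-frame speech posteriors."""
    segs, start, sil = [], None, 0
    for t, p in enumerate(speech_post):
        if p >= 0.5:
            if start is None:
                start = t
            sil = 0
        elif start is not None:
            sil += 1
            if sil >= min_sil:
                segs.append((start * frame_shift, (t - sil + 1) * frame_shift))
                start, sil = None, 0
    if start is not None:  # flush a trailing segment
        segs.append((start * frame_shift,
                     (len(speech_post) - sil) * frame_shift))
    return segs

def split_long(seg_words, boundary_score, max_words=40):
    """Linguistic pass: recursively re-split an over-long word sequence at
    the word boundary a (hypothetical) boundary_score(words, i) model
    considers most plausible."""
    if len(seg_words) <= max_words:
        return [seg_words]
    cut = max(range(1, len(seg_words)),
              key=lambda i: boundary_score(seg_words, i))
    return split_long(seg_words[:cut], boundary_score, max_words) + \
           split_long(seg_words[cut:], boundary_score, max_words)
```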


International Symposium on Chinese Spoken Language Processing | 2016

A pseudo-task design in multi-task learning deep neural network for speaker recognition

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

The deep neural network (DNN) technique has been successfully applied in image recognition and automatic speech recognition (ASR) owing to its efficiency in learning robust and invariant features for pattern recognition. Recently, it has been applied to learn speaker-specific acoustic features for speaker recognition. However, due to the large number of parameters to be optimized in a DNN model and limited training data, the DNN easily overfits to a local solution, and the overfitted model generalizes poorly to test data. The multi-task learning framework has been shown to improve DNN generalization, but designing a secondary task for speaker recognition is not trivial. In this study, under the multi-task learning framework, besides the main task trained with speaker ID labels, a pseudo-task was designed as an auxiliary task, with labels obtained from an unsupervised learning algorithm. Speaker recognition experiments were carried out to test the proposed multi-task learning DNN model. The results showed that adding the pseudo-task to the multi-task learning improved the performance of the main task.
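A minimal sketch of the idea, assuming k-means cluster IDs of the input features serve as the unsupervised pseudo-labels and that the two tasks share the hidden layers; the clustering choice, architecture, and loss weight `lam` are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

class MultiTaskDNN(nn.Module):
    def __init__(self, in_dim, hidden, n_speakers, n_clusters):
        super().__init__()
        self.shared = nn.Sequential(      # layers shared by both tasks
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.speaker_head = nn.Linear(hidden, n_speakers)  # main task
        self.pseudo_head = nn.Linear(hidden, n_clusters)   # auxiliary task

    def forward(self, x):
        h = self.shared(x)
        return self.speaker_head(h), self.pseudo_head(h)

def train_step(model, opt, x, y_spk, y_pseudo, lam=0.3):
    """Joint loss: speaker cross-entropy plus weighted pseudo-task loss."""
    opt.zero_grad()
    spk_logits, pseudo_logits = model(x)
    loss = F.cross_entropy(spk_logits, y_spk) \
         + lam * F.cross_entropy(pseudo_logits, y_pseudo)
    loss.backward()
    opt.step()
    return loss.item()

# Pseudo-labels from unsupervised clustering of the training features, e.g.:
# y_pseudo = torch.as_tensor(KMeans(n_clusters=64).fit_predict(X.numpy()))
```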


International Symposium on Chinese Spoken Language Processing | 2016

Comparison of regularization constraints in deep neural network based speaker adaptation

Peng Shen; Xugang Lu; Hisashi Kawai

Adaptation of deep neural network (DNN) acoustic models has been proven to significantly improve automatic speech recognition (ASR) performance, but how to improve the generalization ability of the adapted model is still a challenging problem. In this study, we investigated algorithms to improve model generalization ability in a parameter regularization framework. Although some regularization algorithms have been proposed, there has been no investigation of the effects of using different regularization constraints in adaptation (e.g., parameter-space smoothness or sparsity). We investigated regularization constraints in an ℓp regularization framework, which includes ℓ1 and ℓ2 regularization as well as several constrained forms of them. We carried out the investigation on a lecture speech recognition task. It showed that most of the regularization constraints could improve performance, but with different parameter updating mechanisms. The regularization constraint that makes the adaptation update only a few model parameters proved the most effective. In addition, combining different regularization constraints yielded further improvements.
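A minimal PyTorch sketch of such regularized adaptation, penalizing the deviation of the adapted parameters from the speaker-independent (SI) model with an ℓ1 (sparsity) or ℓ2 (smoothness) term; the penalty placement and the weight `lam` are illustrative, not the paper's exact setup.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_step(model, si_model, opt, x, y, lam=1e-3, p=1):
    """One adaptation update on (x, y) with an lp deviation penalty.
    p=1 encourages updating only a few parameters;
    p=2 keeps the adapted model smoothly close to the SI model."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    for w, w_si in zip(model.parameters(), si_model.parameters()):
        loss = loss + lam * (w - w_si).abs().pow(p).sum()
    loss.backward()
    opt.step()
    return loss.item()

# Freeze a copy of the SI model before adaptation, e.g.:
# si_model = copy.deepcopy(model)
# for w in si_model.parameters():
#     w.requires_grad_(False)
```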


Conference of the International Speech Communication Association | 2016

Pair-Wise Distance Metric Learning of Neural Network Model for Spoken Language Identification

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

The i-vector representation and modeling technique has been successfully applied in spoken language identification (SLI). In modeling, a discriminative transform or classifier must be applied to emphasize variations correlated with language identity, since the i-vector representation encodes most of the acoustic variations (e.g., speaker variation, transmission channel variation, etc.). Due to the strong nonlinear discriminative power of neural network (NN) modeling (including its deep form, the DNN), the NN has been used directly to learn the mapping function between the i-vector representation and language identity labels. In most studies, only point-wise feature-label information is fed to the NN for parameter learning, which may result in model overfitting, particularly with limited training data. In this study, we propose to integrate pair-wise distance metric learning into NN parameter optimization. In the representation space of the nonlinear transforms of the hidden layers, a distance metric learning is explicitly designed to minimize the pair-wise intra-class variation and maximize the inter-class variation. With the distance metric as a constraint on the point-wise learning, the i-vectors are transformed to a new feature space in which they are much more discriminative for samples belonging to different languages and much more similar for samples belonging to the same language. We tested the algorithm on an SLI task and obtained encouraging results, with more than a 20% relative improvement in identification error rate.
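The journal version of this work is sketched after the Computer Speech & Language entry above. As a complement, here is a minimal sketch of building the intra-class and inter-class pair masks such a pair-wise loss needs within a mini-batch; names are illustrative.

```python
import torch

def pair_masks(labels):
    """labels: (B,) language IDs. Returns boolean masks over all B*B pairs,
    excluding the trivial self-pairs on the diagonal."""
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    return same & ~eye, ~same  # (intra-class pairs, inter-class pairs)

labels = torch.tensor([0, 0, 1, 2])
intra, inter = pair_masks(labels)
print(intra.sum().item(), inter.sum().item())  # 2 intra pairs, 10 inter pairs
```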


International Symposium on Chinese Spoken Language Processing | 2014

Spectral patch based sparse coding for acoustic event detection

Xugang Lu; Yu Tsao; Peng Shen; Chiori Hori

In most algorithms for acoustic event detection (AED), frame-based acoustic representations are used in acoustic modeling. Due to the lack of context information in the feature representation, large model confusions may occur during modeling. We previously proposed a feature learning and representation algorithm that explores context information from temporal-frequency patches of the signal for AED. With that algorithm, a sparse feature was extracted based on an acoustic dictionary composed of a bag of spectral patches. In our previous algorithm, the feature was obtained from the Euclidean distance between the input signal and the acoustic dictionary. In this study, we formulate the sparse feature extraction as ℓ1-regularized signal reconstruction, so that the sparsity of the representation can be controlled efficiently by varying a regularization parameter. A support vector machine (SVM) classifier was built on the extracted sparse features for AED. Our experimental results showed that the spectral-patch-based sparse representation effectively improved performance by incorporating temporal-frequency context information in modeling.
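A minimal sketch of ℓ1-regularized sparse coding over spectral patches, using scikit-learn's SparseCoder as the solver; the patch size and regularization value are illustrative, and the paper's dictionary is built from a bag of spectral patches rather than the random stand-in used here.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

patch_dim = 5 * 40          # e.g., 5 frames x 40 mel bands, flattened
n_atoms = 256

# Stand-in dictionary of spectral patches (rows must be unit-normalized).
rng = np.random.default_rng(0)
D = rng.standard_normal((n_atoms, patch_dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Solve min_a ||x - D^T a||_2^2 + alpha * ||a||_1 for each input patch;
# alpha is the regularization parameter that controls sparsity.
coder = SparseCoder(dictionary=D, transform_algorithm="lasso_lars",
                    transform_alpha=0.5)
patches = rng.standard_normal((10, patch_dim))   # 10 example patches
codes = coder.transform(patches)                 # (10, n_atoms) sparse features
print(np.mean(codes != 0))                       # fraction of active atoms
```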


Conference of the International Speech Communication Association | 2015

Sparse representation with temporal max-smoothing for acoustic event detection

Xugang Lu; Peng Shen; Yu Tsao; Chiori Hori; Hisashi Kawai


Conference of the International Speech Communication Association | 2018

Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks

Sheng Li; Xugang Lu; Ryoichi Takashima; Peng Shen; Tatsuya Kawahara; Hisashi Kawai

Collaboration


Dive into Peng Shen's collaborations.

Top Co-Authors

Xugang Lu

National Institute of Information and Communications Technology

Hisashi Kawai

National Institute of Information and Communications Technology

Yu Tsao

Center for Information Technology

Chiori Hori

National Institute of Information and Communications Technology

Lemao Liu

National Institute of Information and Communications Technology

Masahiro Saiko

National Institute of Information and Communications Technology

Naoyuki Kanda

National Institute of Information and Communications Technology
