Publication


Featured research published by Minghui Dong.


International Symposium on Chinese Spoken Language Processing | 2006

Fusion of acoustic and tokenization features for speaker recognition

Rong Tong; Bin Ma; Kong-Aik Lee; Changhuai You; Donglai Zhu; Tomi Kinnunen; Hanwu Sun; Minghui Dong; Eng Siong Chng; Haizhou Li

This paper describes our recent efforts in exploring effective discriminative features for speaker recognition. Recent research has indicated that the appropriate fusion of features is critical to improving the performance of speaker recognition systems. In this paper we describe our approaches for the NIST 2006 Speaker Recognition Evaluation. Our system integrates cepstral GMM modeling, cepstral SVM modeling, and tokenization at both the phone and frame levels. Experimental results on both the NIST 2005 SRE corpus and the NIST 2006 SRE corpus are presented. The fused system achieved an 8.14% equal error rate on the 1conv4w-1conv4w test condition of the NIST 2006 SRE.
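As a rough illustration of the score-level fusion described here, the sketch below combines per-trial scores from several hypothetical subsystems with fixed weights and computes the equal error rate by sweeping a threshold. The weights and scores are illustrative placeholders, not the values or the fusion scheme used in the paper.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights):
    """Linearly combine per-trial scores from several subsystems (e.g. GMM, SVM, tokenizer)."""
    scores = np.asarray(subsystem_scores, dtype=float)   # shape: (n_subsystems, n_trials)
    return np.asarray(weights, dtype=float) @ scores     # shape: (n_trials,)

def equal_error_rate(scores, labels):
    """EER found by sweeping a decision threshold over the fused scores.
    labels: 1 for target (same-speaker) trials, 0 for impostor trials."""
    best_gap, eer = np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy example: three subsystems scoring five trials (numbers are illustrative only).
subsystem_scores = [[2.1, -0.3, 1.8, -1.2, 0.4],
                    [1.5, -0.9, 2.2, -0.7, 0.1],
                    [0.8, -0.2, 1.1, -1.5, 0.6]]
labels = np.array([1, 0, 1, 0, 1])
fused = fuse_scores(subsystem_scores, weights=[0.5, 0.3, 0.2])
print("fused EER:", equal_error_rate(fused, labels))
```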


International Conference on Acoustics, Speech, and Signal Processing | 2015

Sparse representation for frequency warping based voice conversion

Xiaohai Tian; Zhizheng Wu; Siu Wa Lee; Nguyen Quy Hy; Eng Siong Chng; Minghui Dong

This paper presents a sparse representation framework for weighted frequency warping based voice conversion. In this method, a frame-dependent warping function and the corresponding spectral residual vector are first calculated for each source-target spectrum pair. At runtime, a source spectrum is factorised as a linear combination of a set of source spectra from the training data. The linear combination weight matrix, which is constrained to be sparse, is used to interpolate the frame-dependent warping functions and spectral residual vectors. In this way, the proposed method not only avoids the statistical averaging caused by GMMs but also preserves the high-resolution spectral details needed for high-quality converted speech. Experiments are conducted on the VOICES database. Both objective and subjective results confirm the effectiveness of the proposed method. In particular, the spectral distortion dropped from 5.55 dB for the conventional frequency warping approach to 5.0 dB for the proposed method. Compared to state-of-the-art GMM-based conversion with global variance (GV) enhancement, our method achieved a 68.5% preference score in an AB preference test.
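The core runtime step, factorising a source spectrum as a sparse combination of training exemplars and reusing the weights to interpolate warping functions and residuals, can be sketched as follows. This is a minimal illustration with random data and a simple multiplicative-update solver for non-negative sparse coding; it is not the solver or the feature configuration used in the paper.

```python
import numpy as np

def sparse_activations(x, D, lam=0.1, n_iter=200):
    """Solve x ~ D h with h >= 0 and an L1 sparsity penalty lam,
    using multiplicative updates (a simple stand-in for the paper's solver)."""
    rng = np.random.default_rng(0)
    h = rng.random(D.shape[1])
    for _ in range(n_iter):
        h *= (D.T @ x) / (D.T @ (D @ h) + lam + 1e-12)
    return h

# Illustrative dimensions: 40-dim spectra, 50 training exemplars, 20 warp anchors.
rng = np.random.default_rng(1)
D = np.abs(rng.random((40, 50)))        # dictionary of source training spectra (columns)
W = rng.random((20, 50))                # frame-dependent warping functions, one per exemplar
R = rng.standard_normal((40, 50))       # spectral residual vectors, one per exemplar

x = np.abs(rng.random(40))              # a source spectrum arriving at runtime
h = sparse_activations(x, D)
h /= h.sum() + 1e-12                    # normalise the weights before interpolation

warp = W @ h                            # interpolated warping function for this frame
residual = R @ h                        # interpolated spectral residual
print(warp.shape, residual.shape)
```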


Archive | 2010

Machine Learning Methods in the Application of Speech Emotion Recognition

Ling Cen; Minghui Dong; Haizhou Li; Zhu Liang Yu; Paul Y. Chan

Machine learning concerns the development of algorithms that allow machines to learn via inductive inference from observational data representing incomplete information about a statistical phenomenon. Classification, also referred to as pattern recognition, is an important task in machine learning, by which machines “learn” to automatically recognize complex patterns, to distinguish between exemplars based on their different patterns, and to make intelligent decisions. A pattern classification task generally consists of three modules: a data representation (feature extraction) module, a feature selection or reduction module, and a classification module. The first module aims to find invariant features that best describe the differences between classes. The second module reduces the dimensionality of the feature vectors used for classification. The classification module finds the actual mapping between patterns and labels based on the features. The objective of this chapter is to investigate machine learning methods for the automatic recognition of emotional states from human speech. It is well known that human speech conveys not only linguistic information but also paralinguistic information, that is, implicit messages such as the emotional state of the speaker. Human emotions are the mental and physiological states associated with the feelings, thoughts, and behaviors of humans. The emotional states conveyed in speech play an important role in human-human communication, as they provide important information about the speakers and their responses to the outside world. Sometimes the same sentence expressed with different emotions has different meanings. It is therefore clearly important for a computer to be able to identify the emotional state expressed by a human subject so that personalized responses can be delivered accordingly.
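As a minimal illustration of the three-module pipeline (feature representation, feature reduction, classification), the sketch below wires standardisation, PCA, and an SVM together with scikit-learn on synthetic data; the features stand in for utterance-level acoustic statistics, and the four classes are hypothetical emotion labels rather than any dataset used in the chapter.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder "utterance-level" features (e.g. pitch/energy statistics) and emotion labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
y = rng.integers(0, 4, size=200)        # four hypothetical emotion classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = Pipeline([
    ("scale", StandardScaler()),        # part of the data representation module
    ("reduce", PCA(n_components=10)),   # feature reduction module
    ("classify", SVC(kernel="rbf")),    # classification module
])
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```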


Affective Computing and Intelligent Interaction | 2015

Fundamental frequency modeling using wavelets for emotional voice conversion

Huaiping Ming; Dong-Yan Huang; Minghui Dong; Haizhou Li; Lei Xie; Shaofei Zhang

This paper presents a representation of the fundamental frequency (F0) using the continuous wavelet transform (CWT) for prosody modeling in emotion conversion. Emotional voice conversion aims at converting speech from one emotion state to another. Specifically, we use the CWT to decompose F0 into a five-scale representation that corresponds to five temporal scales. A neutral voice is converted to an emotional voice under an exemplar-based voice conversion framework, in which both the spectrum and F0 are converted simultaneously. The simulation results demonstrate that the dynamics of F0 at different temporal scales can be well captured and converted using the five-scale CWT representation. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.
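A minimal sketch of a five-scale CWT decomposition of an F0 contour is shown below, using PyWavelets with a Mexican-hat wavelet and dyadic scales chosen for illustration; the paper's exact wavelet, scales, and F0 preprocessing may differ. The contour is assumed to be continuous (interpolated through unvoiced frames) and mean-normalised.

```python
import numpy as np
import pywt

def decompose_f0(f0, scales=(2, 4, 8, 16, 32), wavelet="mexh"):
    """Decompose a (log-)F0 contour into one component per temporal scale with the CWT.
    Assumes f0 has already been interpolated over unvoiced frames and mean-normalised."""
    coeffs, _ = pywt.cwt(f0, scales=np.array(scales), wavelet=wavelet)
    return coeffs   # shape: (len(scales), len(f0))

# Toy log-F0 contour: a slow phrase curve plus a faster accent-like modulation.
t = np.linspace(0, 2, 400)
log_f0 = 0.3 * np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.sin(2 * np.pi * 4.0 * t)
log_f0 -= log_f0.mean()

components = decompose_f0(log_f0)
print(components.shape)   # (5, 400): five temporal scales, one row per scale
```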


Asia Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Emotional facial expression transfer based on temporal restricted Boltzmann machines

Shuojun Liu; Dong-Yan Huang; Weisi Lin; Minghui Dong; Haizhou Li; Ee Ping Ong

Emotional facial expression transfer involves sequence-to-sequence mappings from a neutral facial expression to an emotional facial expression, which is a well-known problem in computer graphics. In the graphics community, the methods currently considered are typically linear (e.g., methods based on blendshape mapping), and the dynamical aspects of the facial motion itself are not taken into account. This makes it difficult to retarget the facial articulations involved in speech. In this paper, we apply a model based on temporal restricted Boltzmann machines to emotional facial expression transfer. The method can encode a complex nonlinear mapping from the motion of a neutral facial expression to an emotional facial expression, capturing the facial geometry and dynamics of both the neutral and emotional states.


International Conference on Acoustics, Speech, and Signal Processing | 2016

Exemplar-based sparse representation of timbre and prosody for voice conversion

Huaiping Ming; Dong-Yan Huang; Lei Xie; Shaofei Zhang; Minghui Dong; Haizhou Li

Voice conversion (VC) aims to make speech from one speaker (the source) sound as if it were spoken by another speaker (the target) without changing the language content. Most state-of-the-art voice conversion systems focus only on timbre conversion. However, speaker identity is also characterized by source-related cues such as fundamental frequency and energy. In this work, we propose an exemplar-based sparse representation of timbre and prosody for voice conversion that does not require separate timbre and prosody conversions. The experimental results show that, in addition to the conversion of spectral features, the proper conversion of prosody features improves the quality and speaker identity of the converted speech.
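The joint treatment of timbre and prosody can be sketched by stacking spectral and prosody exemplars into one coupled source dictionary, so that a single set of sparse activations converts both feature streams through the aligned target exemplars. The sketch below uses random data and a simple multiplicative-update solver as a stand-in for the paper's actual features and optimisation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ex = 60                                   # number of paired source/target exemplars
A_spec = np.abs(rng.random((40, n_ex)))     # source spectral exemplars (columns)
A_pros = np.abs(rng.random((3, n_ex)))      # source prosody exemplars (e.g. F0, energy stats)
B_spec = np.abs(rng.random((40, n_ex)))     # aligned target spectral exemplars
B_pros = np.abs(rng.random((3, n_ex)))      # aligned target prosody exemplars

# Couple timbre and prosody by stacking them into a single source dictionary,
# so one set of activations h has to explain both feature streams.
A = np.vstack([A_spec, A_pros])
x = np.abs(rng.random(43))                  # stacked source frame: 40 spectral + 3 prosody dims

h = rng.random(n_ex)
for _ in range(200):                        # non-negative sparse coding, multiplicative updates
    h *= (A.T @ x) / (A.T @ (A @ h) + 0.1 + 1e-12)

y_spec = B_spec @ h                         # converted spectrum
y_pros = B_pros @ h                         # converted prosody features
print(y_spec.shape, y_pros.shape)
```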


Conference of the International Speech Communication Association | 2016

Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Huaiping Ming; Dong-Yan Huang; Lei Xie; Jie Wu; Minghui Dong; Haizhou Li

Emotional voice conversion aims at converting speech from one emotion state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. A continuous wavelet transform (CWT) representation of the fundamental frequency (F0) and energy contour is used for prosody modeling. Specifically, we use the CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with the DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.
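A minimal PyTorch sketch of a deep bidirectional LSTM regressor mapping source feature sequences (spectrum plus CWT-decomposed F0 and energy) to target sequences is shown below; the layer sizes and feature dimensions are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    """Deep bidirectional LSTM mapping source feature sequences to target feature sequences."""
    def __init__(self, feat_dim=40 + 5 + 10, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, feat_dim)   # 2*hidden: forward + backward states

    def forward(self, x):               # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.out(h)

# Illustrative batch: 8 utterances of 200 frames; 40 spectral + 5 F0-CWT + 10 energy-CWT dims.
model = DBLSTMConverter()
src = torch.randn(8, 200, 55)
tgt = torch.randn(8, 200, 55)
loss = nn.functional.mse_loss(model(src), tgt)
loss.backward()
print(loss.item())
```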


International Conference on Acoustics, Speech, and Signal Processing | 2012

Generalized F0 modelling with absolute and relative pitch features for singing voice synthesis

Siu Wa Lee; Shen Ting Ang; Minghui Dong; Haizhou Li

Natural pitch fluctuations are essential to human singing. To synthesize singing voices effectively, these pitch fluctuations must be generated. Previous synthesis methods classify and reproduce them individually. These fluctuations, however, are found to be dependent on one another and to vary under different contexts. This paper proposes a generalized framework for F0 modelling to learn and generate these fluctuations on a note basis. Context-dependent hidden Markov models, representing the possible fluctuations observed in particular musical contexts, are built. To capture the pitch fluctuations and the voicing transitions in human singing, we employ both absolute and relative pitch as the modelling features. Results of our experiments on pitch accuracy and the quality of synthesized singing show that the proposed framework achieves accurate pitch generation and better naturalness of the synthesized outputs.
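The absolute and relative pitch features can be illustrated roughly as follows: absolute pitch is the F0 contour on a semitone (MIDI-like) scale, and relative pitch is its deviation from the note's nominal score pitch. The exact feature definitions in the paper may differ; the contour below is synthetic.

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=440.0):
    """Convert F0 in Hz to a semitone scale relative to ref_hz (A4)."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def note_features(f0_hz, note_midi):
    """Absolute pitch and pitch relative to the note's nominal (score) pitch, per frame."""
    absolute = hz_to_semitones(f0_hz) + 69.0          # MIDI-like absolute pitch
    relative = absolute - float(note_midi)            # deviation from the written note
    return absolute, relative

# Toy note: an A4 (MIDI 69) sung with a small overshoot and a vibrato-like fluctuation.
frames = np.arange(50)
f0 = 440.0 * 2 ** ((0.5 * np.exp(-frames / 10.0) + 0.3 * np.sin(frames / 3.0)) / 12.0)
absolute, relative = note_features(f0, note_midi=69)
print(relative[:5])   # fluctuations around 0 semitones for an in-tune note
```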


International Joint Conference on Natural Language Processing | 2004

Selecting prosody parameters for unit selection based Chinese TTS

Minghui Dong; Kim-Teng Lua; Jun Xu

In the unit selection approach to text-to-speech, each unit is described by a set of parameters. However, determining which parameters effectively express the prosody of speech is a problem. In this paper, we propose an approach to determining prosody parameters for unit selection based speech synthesis. We are concerned with how prosody parameters can correctly describe tones and prosodic breaks in Chinese speech. First, we define and evaluate a set of parameters. Then, we cluster the parameters and select a representative parameter from each cluster. Finally, the parameters are evaluated in a real TTS system. Experiments show that the selected parameters help to improve speech quality.
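The cluster-then-select step can be sketched as follows: compute correlations between candidate prosody parameters, convert them to a distance, cluster the parameters hierarchically, and keep one representative per cluster. The data are synthetic, and the distance measure and representative choice are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Synthetic matrix: 500 speech units x 12 candidate prosody parameters
# (e.g. duration, mean/max/min F0, energy statistics); values are illustrative.
params = rng.standard_normal((500, 12))
params[:, 3] = params[:, 0] + 0.1 * rng.standard_normal(500)   # make some parameters redundant
params[:, 7] = params[:, 2] + 0.1 * rng.standard_normal(500)

corr = np.corrcoef(params, rowvar=False)
dist = 1.0 - np.abs(corr)                      # highly correlated parameters are "close"
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
cluster_ids = fcluster(Z, t=0.5, criterion="distance")

# Keep one representative parameter per cluster (here simply the first member).
representatives = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
print("selected parameter indices:", representatives)
```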


International Conference on Multimodal Interfaces | 2016

Audio and face video emotion recognition in the wild using deep neural networks and small datasets

Wan Ding; Mingyu Xu; Dong-Yan Huang; Weisi Lin; Minghui Dong; Xinguo Yu; Haizhou Li

This paper presents the techniques used in our contribution to the video-based sub-challenge of Emotion Recognition in the Wild 2016. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) plus neutral. Compared to earlier years' movie-based datasets, this year's test dataset introduced reality-TV videos containing more spontaneous emotion. Our proposed solution is the fusion of facial expression recognition and audio emotion recognition subsystems at the score level. For facial emotion recognition, starting from a network pre-trained on ImageNet training data, a deep convolutional neural network is fine-tuned on FER2013 training data for feature extraction. Classifiers, i.e., kernel SVM, logistic regression, and partial least squares, are studied for comparison. An optimal fusion of classifiers learned from different kernels is carried out at the score level to improve system performance. For audio emotion recognition, a deep long short-term memory recurrent neural network (LSTM-RNN) is trained directly on the challenge dataset. Experimental results show that both the subsystems individually and the system as a whole achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, which is better than the challenge baseline of 40.47%.
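A generic sketch of the fine-tuning setup is shown below: an ImageNet-pretrained backbone (ResNet-18 here as a stand-in; the paper's backbone may differ) has its classifier head replaced for the seven emotion classes and is updated with a smaller learning rate on the pretrained layers. Running it downloads the pretrained torchvision weights.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone (ResNet-18 as a stand-in) and replace the
# classifier head for 7 classes: angry, sad, happy, surprise, fear, disgust, neutral.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 7)

# Fine-tune: lower learning rate for the pretrained layers, higher for the new head.
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")], "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
], momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 RGB face crops.
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 7, (16,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```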

Collaboration


Dive into Minghui Dong's collaborations.

Top Co-Authors


Haizhou Li

National University of Singapore


Lei Xie

Northwestern Polytechnical University


Eng Siong Chng

Nanyang Technological University
