Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Dong-Yan Huang is active.

Publication


Featured research published by Dong-Yan Huang.


Affective Computing and Intelligent Interaction | 2015

Fundamental frequency modeling using wavelets for emotional voice conversion

Huaiping Ming; Dong-Yan Huang; Minghui Dong; Haizhou Li; Lei Xie; Shaofei Zhang

This paper presents a representation of fundamental frequency (F0) based on the continuous wavelet transform (CWT) for prosody modeling in emotion conversion. Emotion conversion aims at converting speech from one emotional state to another. Specifically, we use the CWT to decompose F0 into a five-scale representation that corresponds to five temporal scales. A neutral voice is converted to an emotional voice under an exemplar-based voice conversion framework, where both spectrum and F0 are converted simultaneously. The simulation results demonstrate that the dynamics of F0 at different temporal scales can be well captured and converted using the five-scale CWT representation. The converted speech signals are evaluated both objectively and subjectively, and the results confirm the effectiveness of the proposed method.
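
As a rough illustration of the five-scale decomposition, here is a minimal Python sketch assuming the PyWavelets library, a Mexican hat mother wavelet, and dyadic scales; the abstract does not specify the actual wavelet or scale choices.

```python
import numpy as np
import pywt

def decompose_f0(log_f0, num_scales=5):
    """Decompose a log-F0 contour into a few temporal scales with the
    continuous wavelet transform. The Mexican hat mother wavelet and
    the dyadic scales here are illustrative assumptions."""
    scales = 2.0 ** np.arange(1, num_scales + 1)  # slower F0 movement per row
    coeffs, _ = pywt.cwt(log_f0, scales, "mexh")
    return coeffs  # shape: (num_scales, len(log_f0))

# Toy usage: a synthetic voiced F0 contour of 400 frames.
f0 = 120 + 20 * np.sin(np.linspace(0, 4 * np.pi, 400))
print(decompose_f0(np.log(f0)).shape)  # (5, 400)
```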


Asia Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Emotional facial expression transfer based on temporal restricted Boltzmann machines

Shuojun Liu; Dong-Yan Huang; Weisi Lin; Minghui Dong; Haizhou Li; Ee Ping Ong

Emotional facial expression transfer involves a sequence-to-sequence mapping from a neutral facial expression to an emotional one, a well-known problem in computer graphics. In the graphics community, the methods currently considered are typically linear (e.g., methods based on blendshape mapping), and the dynamics of the facial motion itself are not taken into account. This makes it difficult to retarget the facial articulations involved in speech. In this paper, we apply a model based on temporal restricted Boltzmann machines to emotional facial expression transfer. The method can encode a complex nonlinear mapping from the motion of a neutral facial expression to an emotional one that captures the facial geometry and dynamics of both the neutral and emotional states.
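
The abstract does not give the model equations, but the conditional (temporal) RBM structure it describes can be sketched as below; the layer sizes, the two-frame history window, and the Gaussian visible units are all assumptions for illustration, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TemporalRBM:
    """Minimal temporal RBM sketch: hidden units see the current visible
    frame plus a window of past frames through autoregressive links."""
    def __init__(self, n_vis, n_hid, order=2, scale=0.01):
        self.W = rng.normal(0, scale, (n_vis, n_hid))           # vis-hid
        self.A = rng.normal(0, scale, (order * n_vis, n_vis))   # past -> vis
        self.B = rng.normal(0, scale, (order * n_vis, n_hid))   # past -> hid
        self.bv = np.zeros(n_vis)
        self.bh = np.zeros(n_hid)

    def hidden_probs(self, v, past):
        # p(h=1 | v, past): past frames shift the hidden biases.
        return sigmoid(v @ self.W + past @ self.B + self.bh)

    def visible_mean(self, h, past):
        # Gaussian visible units: mean-field reconstruction of a frame.
        return h @ self.W.T + past @ self.A + self.bv

# Toy usage: 30-dim facial-motion frames, history of 2 frames.
rbm = TemporalRBM(n_vis=30, n_hid=64, order=2)
past = rng.normal(size=2 * 30)
v = rng.normal(size=30)
h = (rng.random(64) < rbm.hidden_probs(v, past)).astype(float)
print(rbm.visible_mean(h, past).shape)  # (30,)
```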


International Conference on Acoustics, Speech, and Signal Processing | 2016

Exemplar-based sparse representation of timbre and prosody for voice conversion

Huaiping Ming; Dong-Yan Huang; Lei Xie; Shaofei Zhang; Minghui Dong; Haizhou Li

Voice conversion (VC) aims to make speech from one speaker (source) sound as if spoken by another speaker (target) without changing the language content. Most state-of-the-art voice conversion systems focus only on timbre conversion. However, speaker identity is also characterized by source-related cues such as fundamental frequency and energy. In this work, we propose an exemplar-based sparse representation of timbre and prosody for voice conversion that does not require separate timbre and prosody conversions. The experimental results show that, in addition to the conversion of spectral features, proper conversion of prosody features improves the quality and speaker identity of the converted speech.
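
A generic sketch of the exemplar idea, not the paper's exact optimization: activations estimated against a source dictionary are reused with a coupled target dictionary. The multiplicative NMF update and the sparsity weight below are illustrative choices.

```python
import numpy as np

def sparse_activations(X, D, n_iter=200, sparsity=0.1):
    """Estimate nonnegative activations H so that D @ H approximates X,
    using multiplicative NMF updates with an L1 sparsity penalty."""
    H = np.random.default_rng(0).random((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        H *= (D.T @ X) / (D.T @ (D @ H) + sparsity + 1e-9)
    return H

# Coupled dictionaries: column i of D_src and D_tgt come from aligned frames.
rng = np.random.default_rng(1)
D_src = np.abs(rng.normal(size=(40, 100)))  # source exemplars (timbre+prosody)
D_tgt = np.abs(rng.normal(size=(40, 100)))  # target exemplars
X_src = np.abs(rng.normal(size=(40, 25)))   # source utterance frames

H = sparse_activations(X_src, D_src)        # activations from the source side
X_conv = D_tgt @ H                          # reuse them on the target side
print(X_conv.shape)  # (40, 25)
```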


Conference of the International Speech Communication Association | 2016

Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

Huaiping Ming; Dong-Yan Huang; Lei Xie; Jie Wu; Minghui Dong; Haizhou Li

Emotional voice conversion aims at converting speech from one emotional state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long short-term memory (DBLSTM) network for emotional voice conversion. A continuous wavelet transform (CWT) representation of the fundamental frequency (F0) and the energy contour is used for prosody modeling. Specifically, we use the CWT to decompose F0 into a five-scale representation and the energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are converted simultaneously by a sequence-to-sequence conversion method with the DBLSTM model, which captures both frame-wise and long-range relationships between the source and target voices. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.
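
A minimal PyTorch sketch of a DBLSTM sequence-to-sequence mapper of the kind described; the hidden size, layer count, and feature dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    """Deep bidirectional LSTM mapping source feature sequences to
    target feature sequences, frame by frame."""
    def __init__(self, in_dim, out_dim, hidden=128, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return self.proj(h)        # (batch, frames, out_dim)

# Toy usage: 40-dim spectrum + 5 F0 scales + 10 energy scales per frame.
model = DBLSTMConverter(in_dim=40 + 5 + 10, out_dim=40 + 5 + 10)
src = torch.randn(2, 200, 55)
print(model(src).shape)  # torch.Size([2, 200, 55])
```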


Asilomar Conference on Signals, Systems and Computers | 2007

Maximum a Posteriori based Adaptive Algorithms

Dong-Yan Huang; Susanto Rahardja; Haibin Huang

It is well known that most adaptive filtering algorithms are developed based on the methods of least mean squares or least squares. Popular adaptive algorithms such as the LMS, the RLS, and their variants have been developed for different applications. In this paper, we propose to use a maximum a posteriori (MAP) probability approach to estimate the filter coefficients. We show that the RLS, the LMS, and their variants are in fact particular cases of the MAP method in which the filtering errors and the filter coefficients are modeled with different probability density functions. New adaptive algorithms can be further explored within the MAP framework.
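
For a concrete anchor, here is plain LMS, which under the paper's view corresponds to a MAP special case (a Gaussian error model with a flat coefficient prior); the filter order and step size below are illustrative.

```python
import numpy as np

def lms(x, d, order=4, mu=0.05):
    """Plain LMS adaptive filter: w tracks the coefficients that map the
    input x to the desired signal d by stochastic gradient descent."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for n in range(order - 1, len(x)):
        u = x[n - order + 1:n + 1][::-1]  # x[n], x[n-1], ..., x[n-order+1]
        e[n] = d[n] - w @ u
        w += mu * e[n] * u                # negative gradient of e[n]**2 / 2
    return w, e

# Toy system identification: recover a known 4-tap filter from noisy data.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
true_w = np.array([0.5, -0.3, 0.2, 0.1])
d = np.convolve(x, true_w)[:len(x)] + 0.01 * rng.normal(size=len(x))
w, _ = lms(x, d)
print(np.round(w, 2))  # approaches [0.5, -0.3, 0.2, 0.1]
```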


International Conference on Multimodal Interfaces | 2016

Audio and face video emotion recognition in the wild using deep neural networks and small datasets

Wan Ding; Mingyu Xu; Dong-Yan Huang; Weisi Lin; Minghui Dong; Xinguo Yu; Haizhou Li

This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2016 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Compared to earlier years' movie-based datasets, this year's test dataset introduced reality TV videos containing more spontaneous emotion. Our proposed solution is the fusion of facial expression recognition and audio emotion recognition subsystems at the score level. For facial emotion recognition, starting from a network pre-trained on ImageNet training data, a deep convolutional neural network is fine-tuned on FER2013 training data for feature extraction. Classifiers, i.e., kernel SVM, logistic regression, and partial least squares, are studied for comparison. An optimal fusion of classifiers learned from different kernels is carried out at the score level to improve system performance. For audio emotion recognition, a deep long short-term memory recurrent neural network (LSTM-RNN) is trained directly on the challenge dataset. Experimental results show that both subsystems, individually and as a whole, achieve state-of-the-art performance. The overall accuracy of the proposed approach on the challenge test dataset is 53.9%, which is better than the challenge baseline of 40.47%.
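
The score-level fusion step itself is simple to sketch. The toy example below assumes two subsystems producing per-class posteriors and illustrative fusion weights; the paper tunes its fusion on validation data.

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted score-level fusion of per-class posteriors from several
    classifiers (e.g., a face CNN+SVM subsystem and an audio LSTM)."""
    fused = sum(w * s for w, s in zip(weights, score_list))
    return fused.argmax(axis=1)

# Toy usage: 7 emotion classes, 3 video clips, two subsystems.
rng = np.random.default_rng(0)
face_scores = rng.dirichlet(np.ones(7), size=3)   # facial subsystem
audio_scores = rng.dirichlet(np.ones(7), size=3)  # audio subsystem
labels = fuse_scores([face_scores, audio_scores], weights=[0.6, 0.4])
print(labels)  # predicted class index per clip
```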


9th ISCA Speech Synthesis Workshop | 2016

An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity

Dong-Yan Huang; Lei Xie; Yvonne Siu Wa Lee; Jie Wu; Huaiping Ming; Xiaohai Tian; Shaofei Zhang; Chuang Ding; Mei Li; Quy Hy Nguyen; Minghui Dong; Haizhou Li

Voice conversion aims to modify the characteristics of one speaker's voice to make it sound as if spoken by another speaker without changing the language content. This task has attracted considerable attention, and various approaches have been proposed over the past two decades. The evaluation of voice conversion approaches, usually through time-intensive subjective listening tests, requires a huge amount of human labor. This paper proposes an automatic voice conversion evaluation strategy based on perceptual background noise distortion and speaker similarity. Experimental results show that our automatic evaluation results match the subjective listening results quite well. We further use our strategy to select the best converted samples from multiple voice conversion systems, and our submission achieves promising results in the Voice Conversion Challenge (VCC2016).
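
The abstract does not define how the two measures are combined, so the linear combination below is purely an assumed illustration of ranking candidate systems by noise distortion (lower is better) and speaker similarity (higher is better).

```python
import numpy as np

def rank_systems(noise_distortion, speaker_similarity, alpha=0.5):
    """Rank converted samples by a weighted combination of perceptual
    background-noise distortion and speaker-similarity score; alpha is
    an illustrative weight, not a calibrated one."""
    noise = np.asarray(noise_distortion, dtype=float)
    sim = np.asarray(speaker_similarity, dtype=float)
    score = alpha * sim - (1 - alpha) * noise
    return np.argsort(-score)  # indices of systems from best to worst

# Toy usage: three candidate conversion systems.
order = rank_systems(noise_distortion=[0.2, 0.5, 0.1],
                     speaker_similarity=[0.7, 0.9, 0.6])
print(order)
```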


Asia Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

Mapping frames with DNN-HMM recognizer for non-parallel voice conversion

Minghui Dong; Chenyu Yang; Yanfeng Lu; Jochen Walter Ehnes; Dong-Yan Huang; Huaiping Ming; Rong Tong; Siu Wa Lee; Haizhou Li

To convert one speaker's voice to another's, the mapping of the corresponding speech segments from the source speaker to the target speaker must be obtained first. In parallel voice conversion, the dynamic time warping (DTW) method is normally used to align the signals of the source and target voices. However, for conversion between non-parallel speech data, the DTW-based mapping method does not work. In this paper, we propose to use a DNN-HMM recognizer to recognize each frame of both the source and target speech signals. The vector of pseudo-likelihoods is then used to represent the frame. Similarity between two frames is measured by the distance between their vectors. A clustering method is used to group both source and target frames, and a frame mapping from source to target is then established based on the clustering result. The experiments show that the proposed method can generate conversion results similar to those of parallel voice conversion.
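
A simplified sketch of the mapping step, assuming scikit-learn's KMeans over pseudo-likelihood (posterior) vectors; the clustering method, cluster count, and fallback rule in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_frame_mapping(src_post, tgt_post, n_clusters=8):
    """Cluster source and target frames jointly in posterior space, then
    pair each source frame with the nearest target frame of the same
    cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.vstack([src_post, tgt_post]))
    src_lab = km.predict(src_post)
    tgt_lab = km.predict(tgt_post)
    mapping = np.empty(len(src_post), dtype=int)
    for i, (p, c) in enumerate(zip(src_post, src_lab)):
        idx = np.flatnonzero(tgt_lab == c)
        if idx.size == 0:                   # no target frame in cluster:
            idx = np.arange(len(tgt_post))  # fall back to a global search
        d = np.linalg.norm(tgt_post[idx] - p, axis=1)
        mapping[i] = idx[d.argmin()]
    return mapping  # mapping[i] = target frame paired with source frame i

# Toy usage with random "posterior" vectors over 20 senones.
rng = np.random.default_rng(0)
src = rng.dirichlet(np.ones(20), size=50)
tgt = rng.dirichlet(np.ones(20), size=60)
print(build_frame_mapping(src, tgt)[:10])
```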


International Symposium on Chinese Spoken Language Processing | 2014

Acoustic emotion recognition based on fusion of multiple feature-dependent deep Boltzmann machines

Kelvin Poon-Feng; Dong-Yan Huang; Minghui Dong; Haizhou Li

In this paper, we present a method to improve the classification recall of a deep Boltzmann machine (DBM) on the task of emotion recognition from speech. The task involves the binary classification of four emotion dimensions: arousal, expectancy, power, and valence. The method consists of dividing the features of the input data into separate sets and training each set individually using a deep Boltzmann machine. Afterwards, the results from each set are combined using simple fusion. The final fused scores are compared to scores obtained from support vector machine (SVM) classifiers and from the same DBM algorithm on the full feature set. The results show that the proposed method can improve classification performance on all four dimensions and is suitable for the classification of unbalanced datasets.
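
The per-set training plus simple fusion pipeline can be sketched with a lightweight stand-in classifier; logistic regression below substitutes for the paper's per-set DBMs, which are too heavy to reproduce here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_set_and_fuse(X_sets_train, y_train, X_sets_test):
    """Train one classifier per feature set and average their scores
    (simple fusion) for one binary emotion dimension."""
    fused = None
    for Xtr, Xte in zip(X_sets_train, X_sets_test):
        clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        p = clf.predict_proba(Xte)[:, 1]
        fused = p if fused is None else fused + p
    return fused / len(X_sets_train)  # averaged positive-class scores

# Toy usage: two feature sets for a binary arousal label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
sets_tr = [rng.normal(size=(100, 10)) + y[:, None],
           rng.normal(size=(100, 6)) + y[:, None]]
sets_te = [rng.normal(size=(20, 10)), rng.normal(size=(20, 6))]
print(np.round(train_per_set_and_fuse(sets_tr, y, sets_te), 2))
```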


International Conference on Multimodal Interfaces | 2017

Audio-visual emotion recognition using deep transfer learning and multiple temporal models

Xi Ouyang; Shigenori Kawaai; Ester Gue Hua Goh; Shengmei Shen; Wan Ding; Huaiping Ming; Dong-Yan Huang

This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear, and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenges of in-the-wild emotion recognition. Deep network transfer learning is used for feature extraction. Spatial-temporal model fusion makes full use of the complementarity of different networks. Semi-automatic reinforcement learning optimizes the fusion strategy based on dynamic external feedback given by the challenge organizers. The overall accuracy of the proposed approach on the challenge test dataset is 57.2%, which is better than the challenge baseline of 40.47%.
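
A minimal transfer-learning sketch in PyTorch/torchvision: an ImageNet-pretrained ResNet-18 (an assumed stand-in; the abstract does not name the backbone) with its head replaced for the seven emotion classes.

```python
import torch.nn as nn
from torchvision import models

def make_emotion_backbone(n_classes=7, freeze=True):
    """Transfer-learning sketch: load ImageNet-pretrained weights,
    optionally freeze the feature extractor, and attach a new
    classification head for the emotion labels."""
    net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    if freeze:                      # keep pretrained features fixed
        for p in net.parameters():
            p.requires_grad = False
    net.fc = nn.Linear(net.fc.in_features, n_classes)  # trainable head
    return net

model = make_emotion_backbone()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # only the parameters of the new head
```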

Collaboration


Dive into Dong-Yan Huang's collaboration.

Top Co-Authors

Haizhou Li, National University of Singapore
Lei Xie, Northwestern Polytechnical University
Weisi Lin, Nanyang Technological University
Shaofei Zhang, Northwestern Polytechnical University
Wan Ding, Central China Normal University
Xinguo Yu, Central China Normal University