Ling-Hui Chen
University of Science and Technology of China
Publication
Featured research published by Ling-Hui Chen.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Ling-Hui Chen; Zhen-Hua Ling; Li-Juan Liu; Li-Rong Dai
This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). Conventional joint density Gaussian mixture model (JDGMM) based spectral conversion methods perform stably and effectively. However, the speech generated by these methods suffers from severe quality degradation due to two factors: 1) the inadequacy of the JDGMM in modeling the distribution of spectral features as well as the non-linear mapping relationship between the source and target speakers, and 2) the spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we proposed the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose to use a DNN to construct a global non-linear mapping between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of the spectral envelopes of the source and target speakers respectively, using a Bernoulli BAM (BBAM). The training method therefore takes advantage of the strong ability of RBMs to model the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions used for conversion. Careful comparisons and analyses between the proposed method and several conventional methods are presented. Subjective results show that the proposed method significantly improves both similarity and naturalness compared to the conventional methods.
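As a rough illustration of the conversion step described above, the following sketch (not the authors' implementation) maps source spectral-envelope frames to target frames with a small feed-forward network in PyTorch. The layer sizes, the 513-dimensional envelopes, and the toy training data are assumptions, and the RBM/BBAM generative pre-training is only indicated in comments.

```python
import torch
import torch.nn as nn

# Minimal sketch of a DNN that maps a source speaker's spectral-envelope frame
# to the target speaker's frame. All sizes are illustrative assumptions.
class EnvelopeConversionDNN(nn.Module):
    def __init__(self, env_dim=513, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(env_dim, hidden_dim),     # would be initialized from a source-side RBM
            nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim),  # would be initialized from the BBAM weights
            nn.Sigmoid(),
            nn.Linear(hidden_dim, env_dim),     # would be initialized from a target-side RBM
        )

    def forward(self, src_env):
        return self.net(src_env)

model = EnvelopeConversionDNN()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Hypothetical time-aligned spectral-envelope frames (e.g. from DTW alignment).
src_frames = torch.rand(32, 513)
tgt_frames = torch.rand(32, 513)
loss = loss_fn(model(src_frames), tgt_frames)   # fine-tuning objective after pre-training
loss.backward()
optim.step()
```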
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Ling-Hui Chen; Tuomo Raitio; Cassia Valentini-Botinhao; Zhen-Hua Ling; Junichi Yamagishi
The speech generated by hidden Markov model (HMM)-based statistical parametric speech synthesis still sounds “muffled.” One cause of this degradation in speech quality may be the loss of fine spectral structure. In this paper, we propose to use a deep generative architecture, a generatively trained deep neural network (DNN), as a postfilter. The network models the conditional probability of the spectrum of natural speech given that of synthetic speech in order to compensate for this gap between synthetic and natural speech. The proposed probabilistic postfilter is generatively trained by cascading two restricted Boltzmann machines (RBMs) or deep belief networks (DBNs) with one bidirectional associative memory (BAM). We devised two types of DNN postfilter: one operating in the mel-cepstral domain and the other in the higher-dimensional spectral domain. We compare these two new data-driven postfilters with other types of postfilter currently used in speech synthesis: a fixed mel-cepstral-based postfilter, global variance based parameter generation, and modulation spectrum-based enhancement. Subjective evaluations using the synthetic voices of a male and a female speaker confirmed that the proposed DNN-based postfilter in the spectral domain significantly improved the segmental quality of synthetic speech compared with the conventional methods.
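A rough sketch of how such a postfilter would be applied at synthesis time is given below; `toy_postfilter` is a hypothetical stand-in for the trained DNN, not the method from the paper, and the 513-dimensional spectra are assumed.

```python
import numpy as np

# Minimal sketch of applying a trained spectral-domain postfilter at synthesis time.
# `postfilter` stands in for the trained DNN mapping synthetic -> natural spectrum.
def enhance_utterance(synthetic_spectra, postfilter):
    """synthetic_spectra: (num_frames, spectrum_dim) array from the HMM synthesizer."""
    enhanced = np.empty_like(synthetic_spectra)
    for t, frame in enumerate(synthetic_spectra):
        enhanced[t] = postfilter(frame)   # predict the natural-speech spectrum frame
    return enhanced

# Toy stand-in postfilter: identity plus a mild sharpening of spectral contrast.
def toy_postfilter(frame, beta=0.2):
    return frame + beta * (frame - frame.mean())

spectra = np.abs(np.random.randn(100, 513))
print(enhance_utterance(spectra, toy_postfilter).shape)  # (100, 513)
```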
PLOS ONE | 2012
Xiao-Dong Wang; Feng Gu; Kang He; Ling-Hui Chen; Lin Chen
Background: The extraction of linguistically relevant auditory features is critical for speech comprehension in complex auditory environments, in which the relationships between acoustic stimuli are often abstract and constant while the stimuli per se are varying. These relationships are referred to as the abstract auditory rule in speech and have been investigated for their underlying neural mechanisms at an attentive stage. However, whether there is a sensory intelligence that enables the automatic encoding of abstract auditory rules in speech at a preattentive stage has not yet been thoroughly addressed. Methodology/Principal Findings: We chose Chinese lexical tones for the current study because they help to define word meaning and hence facilitate the construction of an abstract auditory rule in a speech sound stream. We continuously presented native Chinese speakers with Chinese vowels differing in formant structure, intensity, and pitch level to construct a complex and varying auditory stream. In this stream, most of the sounds shared flat lexical tones to form an embedded abstract auditory rule. Occasionally, the rule was randomly violated by sounds with a rising or falling lexical tone. The results showed that violation of the abstract auditory rule of lexical tones evoked a robust preattentive auditory response, as revealed by whole-head electrical recordings of the mismatch negativity (MMN), even though none of the subjects acquired explicit knowledge of the rule or became aware of the violations. Conclusions/Significance: Our results demonstrate that there is an auditory sensory intelligence in the perception of Chinese lexical tones. The existence of this intelligence suggests that humans can automatically extract abstract auditory rules in speech at a preattentive stage to ensure speech communication in complex and noisy auditory environments without drawing on conscious resources.
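The oddball-style design described above can be illustrated with a small sketch that builds such a stimulus stream; the vowel set, intensity and pitch values, and the 10% deviant probability are illustrative assumptions, not the study's actual parameters.

```python
import random

# Minimal sketch of the oddball-style stimulus stream: vowels vary in formant,
# intensity, and pitch level, while the lexical tone carries the abstract rule
# (mostly flat, occasionally rising or falling). Values are illustrative only.
vowels = ["a", "i", "u"]            # varying formant structure
intensities_db = [60, 65, 70]       # varying intensity
pitch_levels_hz = [150, 200, 250]   # varying pitch level

def make_stream(n_trials=500, deviant_prob=0.1, seed=0):
    rng = random.Random(seed)
    stream = []
    for _ in range(n_trials):
        tone = "flat"
        if rng.random() < deviant_prob:
            tone = rng.choice(["rising", "falling"])   # rule violation
        stream.append({
            "vowel": rng.choice(vowels),
            "intensity_db": rng.choice(intensities_db),
            "pitch_hz": rng.choice(pitch_levels_hz),
            "tone": tone,
        })
    return stream

stream = make_stream()
print(sum(s["tone"] != "flat" for s in stream), "deviants out of", len(stream))
```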
conference of the international speech communication association | 2016
Tomoki Toda; Ling-Hui Chen; Daisuke Saito; Fernando Villavicencio; Mirjam Wester; Zhizheng Wu; Junichi Yamagishi
This paper describes the Voice Conversion Challenge 2016, devised by the authors to better understand different voice conversion (VC) techniques by comparing their performance on a common dataset. The task of the challenge was speaker conversion, i.e., transforming the voice identity of a source speaker into that of a target speaker while preserving the linguistic content. Using a common dataset consisting of 162 utterances for training and 54 utterances for evaluation from each of 5 source and 5 target speakers, 17 groups working on VC around the world developed their own VC systems for every combination of source and target speakers, i.e., 25 source-target speaker pairs, and generated voice samples converted by the developed systems. These samples were evaluated in terms of target speaker similarity and naturalness by 200 listeners in a controlled environment. This paper summarizes the design of the challenge, its results, and a future plan to share views on unsolved problems and challenges faced by current VC techniques.
international conference on acoustics, speech, and signal processing | 2014
Li-Juan Liu; Ling-Hui Chen; Zhen-Hua Ling; Li-Rong Dai
The spectral envelope is the most natural representation of the speech signal. In voice conversion, however, it is difficult to directly model the raw spectral-envelope space, which is high-dimensional and strongly correlated across dimensions, with conventional Gaussian distributions. A bidirectional associative memory (BAM) is a two-layer feedback neural network that can better model the cross-dimensional correlations in high-dimensional vectors. In this paper, we propose to reformulate BAMs as Gaussian distributions in order to model the spectral-envelope space. The parameters of the BAMs are estimated using the contrastive divergence algorithm. Likelihood evaluations show that BAMs have better modeling ability than Gaussians with diagonal covariance, and subjective tests on voice conversion indicate that the performance of the proposed method is significantly improved compared with the conventional GMM-based method.
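For illustration, a single simplified contrastive-divergence (CD-1) update for a Gaussian BAM linking source and target envelope frames might look as follows; unit-variance conditionals, the learning rate, and the 513-dimensional frames are assumptions, and this is not the paper's exact estimator.

```python
import numpy as np

# Minimal sketch of one CD-1 update for a Gaussian bidirectional associative
# memory linking a source frame x and a target frame y, assuming unit-variance
# conditionals p(y|x) = N(Wx + b_y, I) and p(x|y) = N(W^T y + b_x, I).
rng = np.random.default_rng(0)
dim_x, dim_y, lr = 513, 513, 1e-4
W = 0.01 * rng.standard_normal((dim_y, dim_x))
b_x = np.zeros(dim_x)
b_y = np.zeros(dim_y)

def cd1_update(x0, y0):
    global W, b_x, b_y
    # One Gibbs step starting from the data: sample y | x0, then x | y1.
    y1 = W @ x0 + b_y + rng.standard_normal(dim_y)
    x1 = W.T @ y1 + b_x + rng.standard_normal(dim_x)
    y1_mean = W @ x1 + b_y
    # Positive statistics from the data pair, negative from the reconstruction.
    W += lr * (np.outer(y0, x0) - np.outer(y1_mean, x1))
    b_x += lr * (x0 - x1)
    b_y += lr * (y0 - y1_mean)

x_frame = rng.standard_normal(dim_x)
y_frame = rng.standard_normal(dim_y)
cd1_update(x_frame, y_frame)
```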
international conference on acoustics, speech, and signal processing | 2011
Ling-Hui Chen; Zhen-Hua Ling; Li-Rong Dai
This paper presents a non-parallel training algorithm for voice conversion based on the feature transform Gaussian mixture model (FT-GMM), a mixture model of the joint density space of the source and target speakers with explicit feature-transform modeling. In the FT-GMM, the correlations between the distributions of the two speakers in each component of the mixture model are not modeled directly but are absorbed into these explicit feature transformations. This makes it possible to extend the model to non-parallel training by simply decomposing it into two sub-models, one for each speaker, and optimizing them separately. A frequency warping process is adopted to compensate for the performance degradation caused by the original spectral distance between the source and target speakers. Cross-gender experimental results show that the proposed method achieves performance comparable to parallel training.
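The frequency warping step mentioned above could, for example, be realized with a first-order all-pass (bilinear) warp of the spectral envelope, as in the sketch below; the warping function and the value of `alpha` are assumptions, not necessarily the paper's choice.

```python
import numpy as np

# Minimal sketch of a single-parameter frequency warping of a spectral envelope,
# standing in for the compensation step described above.
def warp_envelope(envelope, alpha=0.1):
    """envelope: magnitude spectrum sampled on a uniform frequency grid."""
    n = len(envelope)
    omega = np.linspace(0.0, np.pi, n)
    # first-order all-pass (bilinear) frequency warping function
    warped_omega = omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
    # resample the envelope onto the uniform grid after warping
    return np.interp(omega, warped_omega, envelope)

env = np.abs(np.random.randn(513))
print(warp_envelope(env).shape)  # (513,)
```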
international symposium on chinese spoken language processing | 2010
Ling-Hui Chen; Zhen-Hua Ling; Wu Guo; Li-Rong Dai
In this paper, we propose a Gaussian mixture model (GMM) based voice conversion method using explicit feature transform models. A piecewise linear transform with a stochastic bias is adopted to represent the relationship between the spectral features of the source and target speakers. These explicit transformations are integrated into the training of the GMM for the joint probability density of source and target features. The maximum likelihood parameter generation algorithm with dynamic features is used to generate the converted spectral trajectories. Our method can model the cross-dimensional correlations of the joint density GMM (JDGMM) while significantly decreasing the computational cost compared with a JDGMM with full covariance matrices. Experimental results show that the proposed method outperformed the conventional GMM-based method in cross-gender voice conversion.
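A frame-wise sketch of conversion with explicit per-component transforms is given below; it omits the dynamic-feature parameter generation step, and all names, shapes, and values are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# Minimal sketch: each mixture component m carries a linear transform (A_m, b_m),
# and the converted frame is the posterior-weighted sum of transformed source frames.
def convert_frame(x, weights, means, covs_inv, A, b):
    # responsibilities of each component for the source frame x (unnormalized Gaussians)
    resp = np.array([
        w * np.exp(-0.5 * (x - mu) @ ci @ (x - mu))
        for w, mu, ci in zip(weights, means, covs_inv)
    ])
    resp /= resp.sum()
    # posterior-weighted piecewise linear transform with bias
    return sum(r * (A_m @ x + b_m) for r, A_m, b_m in zip(resp, A, b))

dim, M = 24, 2                                   # toy feature dimension and mixture size
rng = np.random.default_rng(0)
x = rng.standard_normal(dim)
weights = np.full(M, 1.0 / M)
means = rng.standard_normal((M, dim))
covs_inv = np.stack([np.eye(dim)] * M)
A = np.stack([np.eye(dim)] * M)
b = rng.standard_normal((M, dim))
print(convert_frame(x, weights, means, covs_inv, A, b).shape)  # (24,)
```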
international conference on acoustics, speech, and signal processing | 2015
Li-Juan Liu; Ling-Hui Chen; Zhen-Hua Ling; Li-Rong Dai
This paper presents a method for voice conversion using deep neural networks (DNNs) trained with multiple source speakers. The proposed DNNs can be used in two ways for different scenarios: 1) in the absence of training data for the source speaker, the DNNs can be treated as source-speaker-independent models and perform conversion directly from an arbitrary source speaker to a certain target speaker; 2) the DNNs can also be used as initial models for further fine-tuning of source-speaker-dependent DNNs when parallel training data for both the source and target speakers are available. Experimental results show that, as source-speaker-independent models, the proposed DNNs achieve performance comparable to conventional source-speaker-dependent models. Used for initialization, the proposed method outperforms the conventional initialization method based on restricted Boltzmann machines (RBMs).
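The two usage modes might be exercised roughly as follows; the feature dimension, network shape, optimizer settings, and toy tensors standing in for parallel corpora are all assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two scenarios described above, with toy data in place
# of real parallel corpora and an illustrative feature dimension.
feat_dim = 40
model = nn.Sequential(nn.Linear(feat_dim, 512), nn.Tanh(),
                      nn.Linear(512, 512), nn.Tanh(),
                      nn.Linear(512, feat_dim))
loss_fn = nn.MSELoss()

def train(model, src, tgt, lr, steps):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(src), tgt).backward()
        opt.step()

# 1) Source-speaker-independent training on data pooled from multiple source speakers.
pooled_src, pooled_tgt = torch.rand(1000, feat_dim), torch.rand(1000, feat_dim)
train(model, pooled_src, pooled_tgt, lr=1e-2, steps=100)

# 2) Optional fine-tuning when parallel data for a specific source speaker is available.
spk_src, spk_tgt = torch.rand(100, feat_dim), torch.rand(100, feat_dim)
train(model, spk_src, spk_tgt, lr=1e-3, steps=20)
```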
international symposium on chinese spoken language processing | 2014
Li Gao; Zhen-Hua Ling; Ling-Hui Chen; Li-Rong Dai
The speech generated by hidden Markov model (HMM) based speech synthesis always sounds monotonous compared with natural recordings. An important reason is that the predicted F0 trajectories are over-smoothed, which arises from the adoption of frame-level F0 features and the averaging effect of acoustic modeling with Gaussians in the conventional F0 modeling approach. In this paper, we propose a method to improve the F0 prediction of HMM-based Mandarin speech synthesis via post-filtering. Syllable-level F0 features, e.g., length-normalized logF0 vectors or quantitative target approximation (qTA) parameters, are extracted from the F0 trajectories predicted by the conventional approach. These features are then mapped towards those of natural speech by a Gaussian bidirectional associative memory (GBAM) based transformation. Our subjective experiments indicate that the GBAM-based F0 post-filtering method using either logF0 vectors or qTA parameters can significantly improve the naturalness of synthetic speech, and that using raw logF0 vectors achieves better performance than using the derived qTA parameters.
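Extraction of the length-normalized logF0 vectors mentioned above could look roughly like the sketch below; the vector length of 10 points and the toy F0 trajectory are assumptions. In the paper, such syllable-level vectors are then mapped towards natural-speech counterparts by the GBAM.

```python
import numpy as np

# Minimal sketch: the voiced F0 trajectory inside a syllable is resampled to a
# fixed number of points, so syllables of different durations become comparable vectors.
def syllable_logf0_vector(f0_trajectory_hz, num_points=10):
    f0 = np.asarray(f0_trajectory_hz, dtype=float)
    f0 = f0[f0 > 0]                          # keep voiced frames only
    logf0 = np.log(f0)
    src_axis = np.linspace(0.0, 1.0, len(logf0))
    tgt_axis = np.linspace(0.0, 1.0, num_points)
    return np.interp(tgt_axis, src_axis, logf0)

predicted = [0, 0, 210, 215, 220, 218, 212, 0]   # toy frame-level F0 values in Hz
print(syllable_logf0_vector(predicted))
```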
conference of the international speech communication association | 2016
Ling-Hui Chen; Li-Juan Liu; Zhen-Hua Ling; Yuan Jiang; Li-Rong Dai
This paper introduces the methods we adopted to build our system for the evaluation event of the Voice Conversion Challenge (VCC) 2016. We propose to use neural network-based approaches to convert both spectral and excitation features. First, a generatively trained deep neural network (GTDNN) is adopted for spectral envelope conversion after the spectral envelopes have been pre-processed by frequency warping. Second, we propose to use a recurrent neural network (RNN) with long short-term memory (LSTM) cells for F0 trajectory conversion. In addition, we adopt a DNN for band aperiodicity conversion. Both internal tests and the formal VCC evaluation results demonstrate the effectiveness of the proposed methods.
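A minimal sketch of an LSTM-based F0 trajectory converter in the spirit of the description above is shown below; the 2-dimensional input (logF0 plus a voicing flag), layer sizes, and toy batch are assumptions rather than the system's actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch: a sequence of source-speaker logF0 values (with a voicing flag)
# is mapped to the target speaker's logF0 sequence by an LSTM.
class F0ConversionLSTM(nn.Module):
    def __init__(self, in_dim=2, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, time, in_dim)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # (batch, time) predicted target logF0

model = F0ConversionLSTM()
src_seq = torch.rand(4, 200, 2)            # toy batch: 4 utterances, 200 frames each
print(model(src_seq).shape)                # torch.Size([4, 200])
```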
Collaboration
Dive into Ling-Hui Chen's collaborations.
National Institute of Information and Communications Technology