Shiyin Kang
The Chinese University of Hong Kong
Publications
Featured research published by Shiyin Kang.
IEEE Signal Processing Magazine | 2015
Zhen-Hua Ling; Shiyin Kang; Heiga Zen; Andrew W. Senior; Mike Schuster; Xiaojun Qian; Helen M. Meng; Li Deng
Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models are limited in their ability to represent the complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in the recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goals of helping readers better understand the existing techniques and stimulating new work in the burgeoning area of deep learning for parametric speech generation.
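To make the basic setup concrete, the sketch below (in PyTorch; the article itself contains no code, and all layer sizes and feature dimensions here are assumptions) shows the frame-level regression at the heart of DNN-based parametric speech generation: a feed-forward network mapping linguistic input features to acoustic feature frames, in place of the HMM/GMM mapping.

```python
# Minimal sketch, not from the article: a feed-forward DNN that maps
# high-level linguistic features to acoustic feature frames. All sizes
# and names are illustrative assumptions.
import torch
import torch.nn as nn

LINGUISTIC_DIM = 300   # assumed size of the symbolic/linguistic input vector
ACOUSTIC_DIM = 187     # assumed size of spectrum + F0 (+ deltas) per frame

dnn = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, ACOUSTIC_DIM),   # linear output layer for regression
)

frames = torch.randn(16, LINGUISTIC_DIM)   # a batch of 16 input frames
acoustic = dnn(frames)                     # predicted acoustic features
print(acoustic.shape)                      # torch.Size([16, 187])
```

In practice such a network is trained frame by frame with a mean squared error loss against vocoder-extracted acoustic features; a vocoder then renders the predicted features as a waveform.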
international conference on acoustics, speech, and signal processing | 2013
Shiyin Kang; Xiaojun Qian; Helen M. Meng
The deep belief network (DBN) has been shown to be a good generative model in tasks such as handwritten-digit image generation. Previous work on DBNs in the speech community has mainly focused on using a generatively pre-trained DBN to initialize a discriminative model for better acoustic modeling in automatic speech recognition (ASR). To fully exploit its generative nature, we propose to model the speech parameters, including spectrum and F0, simultaneously, and to generate these parameters from the DBN for speech synthesis. Objective evaluation shows that the spectrum generated from the DBN has less distortion than that of the predominant HMM-based approach. Subjective results also confirm the advantage of the DBN-generated spectrum, and the overall quality is comparable to that of a context-independent HMM system.
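As a rough illustration of this generative use of a DBN, the sketch below implements a single restricted Boltzmann machine (the DBN's building block) and draws a speech-parameter frame by alternating Gibbs sampling; the dimensions and the untrained random weights are assumptions for illustration, not the authors' configuration.

```python
# Illustrative sketch: one RBM layer with a Gibbs-sampling "generate" step.
# A real DBN stacks several trained RBMs; weights here are random.
import torch

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = torch.randn(n_visible, n_hidden) * 0.01
        self.b_v = torch.zeros(n_visible)   # visible bias
        self.b_h = torch.zeros(n_hidden)    # hidden bias

    def sample_h(self, v):
        p = torch.sigmoid(v @ self.W + self.b_h)
        return torch.bernoulli(p)           # stochastic hidden units

    def sample_v(self, h):
        return torch.sigmoid(h @ self.W.t() + self.b_v)  # mean-field visibles

    def generate(self, n_steps=50):
        # Start from noise and run alternating Gibbs sampling.
        v = torch.rand(1, self.W.shape[0])
        for _ in range(n_steps):
            h = self.sample_h(v)
            v = self.sample_v(h)
        return v

rbm = RBM(n_visible=187, n_hidden=512)  # assumed spectrum + F0 frame size
frame = rbm.generate()                  # one sampled speech-parameter frame
```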
international conference on acoustics, speech, and signal processing | 2015
Lifa Sun; Shiyin Kang; Kun Li; Helen M. Meng
This paper investigates the use of Deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks (DBLSTM-RNNs) for voice conversion. Frame-based methods using conventional Deep Neural Networks (DNNs) do not directly model temporal correlations across speech frames, which limits the quality of the converted speech. To improve the naturalness and continuity of the speech output in voice conversion, we propose a sequence-based conversion method using DBLSTM-RNNs to model not only the frame-wise relationship between the source and the target voice, but also the long-range context dependencies in the acoustic trajectory. Experiments show that DBLSTM-RNNs outperform DNNs, with Mean Opinion Scores of 3.2 and 2.3, respectively. Moreover, DBLSTM-RNNs without dynamic features outperform DNNs with dynamic features.
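A minimal sketch of such a sequence-based converter, assuming 40-dimensional acoustic features and a 3-layer bidirectional LSTM (both assumptions; the paper's exact configuration may differ):

```python
# Illustrative sketch, not the paper's code: a deep bidirectional LSTM that
# maps a sequence of source-speaker acoustic frames to target-speaker frames,
# letting the recurrence capture long-range context across the utterance.
import torch
import torch.nn as nn

class DBLSTM(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, feat_dim)  # 2x for both directions

    def forward(self, x):            # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.out(h)           # converted frames, same length as input

model = DBLSTM()
src = torch.randn(2, 120, 40)        # two utterances of 120 frames each
converted = model(src)               # (2, 120, 40)
```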
international conference on acoustics, speech, and signal processing | 2015
Peng Liu; Quanjie Yu; Zhiyong Wu; Shiyin Kang; Helen M. Meng; Lianhong Cai
To solve the acoustic-to-articulatory inversion problem, this paper proposes a deep bidirectional long short-term memory recurrent neural network and a deep recurrent mixture density network. The articulatory parameters of the current frame may be correlated with acoustic features many frames before or after it, and a pre-designed fixed-length context window may be either insufficient or redundant for covering such correlations. The advantage of recurrent neural networks is that they can learn the appropriate context on their own, without an externally specified context window. Experimental results indicate that the recurrent models produce more accurate predictions for acoustic-to-articulatory inversion than a deep neural network with a fixed-length context window, and that the predicted articulatory trajectories are smooth. An average root mean square error of 0.816 mm on the MNGU0 test set is achieved without any post-filtering, which represents state-of-the-art inversion accuracy.
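The sketch below illustrates one way to combine a bidirectional recurrent encoder with a mixture density output head, as the abstract describes; all dimensions and the mixture count are assumed for illustration.

```python
# Sketch of a recurrent mixture density head for articulatory inversion
# (assumed dimensions; the paper's exact configuration is not given here).
# A BLSTM encodes the acoustic sequence, and for each frame an MDN head
# outputs mixture weights, means, and log-variances over articulatory targets.
import torch
import torch.nn as nn

class RecurrentMDN(nn.Module):
    def __init__(self, acoustic_dim=40, artic_dim=12, hidden=128, n_mix=4):
        super().__init__()
        self.rnn = nn.LSTM(acoustic_dim, hidden, bidirectional=True,
                           batch_first=True)
        self.n_mix, self.artic_dim = n_mix, artic_dim
        # Per mixture: 1 weight logit + artic_dim means + artic_dim log-sigmas
        self.head = nn.Linear(2 * hidden, n_mix * (1 + 2 * artic_dim))

    def forward(self, x):                      # x: (batch, time, acoustic_dim)
        h, _ = self.rnn(x)
        p = self.head(h)
        logits, mu, log_sigma = p.split(
            [self.n_mix, self.n_mix * self.artic_dim,
             self.n_mix * self.artic_dim], dim=-1)
        weights = logits.softmax(dim=-1)       # mixture weights per frame
        return weights, mu, log_sigma

model = RecurrentMDN()
weights, mu, log_sigma = model(torch.randn(1, 200, 40))
```

Training would minimize the negative log-likelihood of the observed articulatory frames under the predicted per-frame mixture; at test time one typically takes the mean of the most probable component.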
international conference on pattern recognition | 2010
Quansheng Duan; Shiyin Kang; Zhiyong Wu; Lianhong Cai; Zhiwei Shuang; Yong Qin
The performance of an HMM-based text-to-speech (TTS) system is affected by its basic modeling units and the size of its training data. This paper compares two HMM-based Mandarin TTS systems, using the syllable and the phone as basic units respectively, trained on 1000, 3000 and 5000 sentences. Corpora from two female speakers are used for evaluation. For both corpora, the syllable-based system outperforms the phone-based system when trained on 3000 and 5000 sentences.
international conference on multimedia and expo | 2016
Lifa Sun; Kun Li; Hao Wang; Shiyin Kang; Helen M. Meng
This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system. It is assumed that these PPGs represent the articulation of speech sounds in a speaker-normalized space and correspond to the spoken content in a speaker-independent manner. The proposed approach first obtains PPGs of the target speech. Then, a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) is used to model the relationships between the PPGs and the acoustic features of the target speech. To convert arbitrary source speech, we obtain its PPGs from the same SI-ASR and feed them into the trained DBLSTM to generate the converted speech. Our approach has two main advantages: 1) no parallel training data is required; 2) a trained model can be applied to any source speaker for a fixed target speaker (i.e., many-to-one conversion). Experiments show that our approach performs as well as or better than state-of-the-art systems in both speech quality and speaker similarity.
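A schematic sketch of this pipeline, with `si_asr_ppg` and `vocoder_synthesize` as hypothetical stand-ins for the SI-ASR front end and the vocoder (neither is a real API, and all dimensions are assumed):

```python
# Hypothetical end-to-end sketch of the PPG-based conversion pipeline.
import torch.nn as nn

N_PHONES = 131        # assumed PPG dimensionality (phone/senone posteriors)
ACOUSTIC_DIM = 40     # assumed target acoustic feature size

ppg_to_acoustic = nn.LSTM(N_PHONES, 128, num_layers=2,
                          bidirectional=True, batch_first=True)
proj = nn.Linear(256, ACOUSTIC_DIM)

def convert(source_wave, si_asr_ppg, vocoder_synthesize):
    """Convert any source speaker's utterance to the target voice."""
    ppg = si_asr_ppg(source_wave)       # (1, time, N_PHONES), speaker-neutral
    h, _ = ppg_to_acoustic(ppg)         # model trained on target speech only
    target_feats = proj(h)              # target-speaker acoustic features
    return vocoder_synthesize(target_feats)
```

Because the PPG space is speaker-normalized, the `ppg_to_acoustic` model needs only the target speaker's own recordings for training, which is what makes the approach parallel-data-free and many-to-one.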
conference of the international speech communication association | 2016
Lifa Sun; Hao Wang; Shiyin Kang; Kun Li; Helen M. Meng
We present a novel approach that enables a target speaker (e.g., a monolingual Chinese speaker) to speak a new language (e.g., English) based on arbitrary textual input. Our system includes an English speaker-independent automatic speech recognition (SI-ASR) engine trained on TIMIT. Given the target speaker's speech in a non-target language, we generate Phonetic PosteriorGrams (PPGs) with the SI-ASR and then train a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) to model the relationship between the PPGs and the acoustic signal. Synthesis involves feeding arbitrary text to a general TTS engine (trained on any non-target speaker), whose output is indexed by the SI-ASR as PPGs. These are used by the DBLSTM to synthesize the target language in the target speaker's voice. A main advantage of this approach is its very low training-data requirement for the target speaker, whose recordings can be in any language, compared with the reference approach of training a dedicated TTS engine on many recordings of the target speaker in the target language. For a given target speaker, our approach trained on 100 Mandarin (i.e., non-target language) utterances achieves English synthetic speech comparable (in MOS and ABX tests) to that of an HTS system trained on 1,000 English utterances.
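The synthesis path can be summarized in a few lines; all four callables below are hypothetical stand-ins for the components named in the abstract, not real APIs.

```python
# Hypothetical sketch of the synthesis path described above: text goes
# through a general-purpose English TTS, its audio is re-indexed as PPGs
# by the SI-ASR, and the target speaker's DBLSTM renders them in that
# speaker's voice.
def synthesize_in_target_voice(text, general_tts, si_asr_ppg,
                               dblstm_target, vocoder):
    intermediate_wave = general_tts(text)    # any non-target speaker's voice
    ppg = si_asr_ppg(intermediate_wave)      # speaker-normalized content
    target_feats = dblstm_target(ppg)        # trained on ~100 target utterances
    return vocoder(target_feats)             # target speaker speaking English
```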
conference of the international speech communication association | 2014
Shiyin Kang; Helen M. Meng
conference of the international speech communication association | 2013
Kun Li; Xiaojun Qian; Shiyin Kang; Helen M. Meng
symposium on languages, applications and technologies | 2015
Kun Li; Xiaojun Qian; Shiyin Kang; Pengfei Liu; Helen M. Meng