Siu Wa Lee
Agency for Science, Technology and Research
Publications
Featured research published by Siu Wa Lee.
international conference on acoustics, speech, and signal processing | 2015
Xiaohai Tian; Zhizheng Wu; Siu Wa Lee; Nguyen Quy Hy; Eng Siong Chng; Minghui Dong
This paper presents a sparse representation framework for weighted frequency warping based voice conversion. In this method, a frame-dependent warping function and the corresponding spectral residual vector are first calculated for each source-target spectrum pair. At runtime, a source spectrum is factorised as a linear combination of a set of source spectra from the training data. The linear combination weight matrix, which is constrained to be sparse, is used to interpolate the frame-dependent warping functions and spectral residual vectors. In this way, the proposed method not only avoids the statistical averaging caused by GMMs but also preserves the high-resolution spectral details needed for high-quality converted speech. Experiments were conducted on the VOICES database, and both objective and subjective results confirmed the effectiveness of the proposed method. In particular, the spectral distortion dropped from 5.55 dB for the conventional frequency warping approach to 5.0 dB for the proposed method. Compared with the state-of-the-art GMM-based conversion with global variance (GV) enhancement, our method achieved a 68.5% preference score in an AB test.
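The core of this method is reusing one sparse weight vector for several interpolations. Below is a minimal sketch of that idea with random stand-in data and an off-the-shelf sparse solver; dictionary sizes and the regularization weight are illustrative assumptions, not values from the paper.

```python
# A minimal sketch: factorise a source spectrum as a sparse non-negative
# combination of dictionary spectra, then reuse the weights to interpolate
# warping functions and residuals. All data here is random and hypothetical.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D, N = 513, 200                        # spectrum dimension, training frames
src_dict = rng.random((D, N))          # source spectra (one per column)
warp_dict = rng.random((D, N))         # per-frame warping functions
resid_dict = rng.random((D, N))        # per-frame spectral residuals

x = rng.random(D)                      # incoming source spectrum to convert

# Sparse non-negative weights over the training frames.
lasso = Lasso(alpha=0.01, positive=True, max_iter=5000)
lasso.fit(src_dict, x)
w = lasso.coef_

# The same weights interpolate the warping function and the residual.
w_sum = w.sum() if w.sum() > 0 else 1.0
warp = warp_dict @ w / w_sum
resid = resid_dict @ w / w_sum
print("active exemplars:", np.count_nonzero(w))
```

Because only a handful of weights are non-zero, the interpolated warping function is driven by a few similar training frames rather than a global statistical average.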
international symposium on chinese spoken language processing | 2014
Xiaohai Tian; Zhizheng Wu; Siu Wa Lee; Eng Siong Chng
Frequency warping (FW) based voice conversion aims to modify the frequency axis of source spectra towards that of the target. In previous work, the optimal warping function was calculated by minimizing the spectral distance between the converted and target spectra, without considering the spectral shape. Nevertheless, speaker timbre and identity depend greatly on the vocal-tract peaks and valleys of the spectrum. In this paper, we propose a method that defines the warping function by maximizing the correlation between the converted and target spectra. Unlike conventional warping methods, the correlation-based optimization is not dominated by the magnitude of the spectra; instead, both spectral peaks and valleys are considered in the optimization process, which also improves the performance of amplitude scaling. Experiments were conducted on the VOICES database, and the results show that, after amplitude scaling, our proposed method reduced the mel-spectral distortion from 5.85 dB to 5.60 dB. Subjective listening tests also confirmed the effectiveness of the proposed method.
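To make the correlation objective concrete, here is a toy sketch that grid-searches a single bilinear (all-pass) warping parameter and keeps the one maximizing the Pearson correlation between the warped source and the target spectrum. The one-parameter warp and the random spectra are illustrative simplifications, not the paper's formulation.

```python
import numpy as np

def bilinear_warp_axis(n_bins, alpha):
    """Map normalized frequencies through a first-order all-pass warp."""
    w = np.linspace(0, np.pi, n_bins)
    return w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))

def warp_spectrum(spec, alpha):
    """Resample the spectrum on the warped frequency axis."""
    w = np.linspace(0, np.pi, len(spec))
    return np.interp(w, bilinear_warp_axis(len(spec), alpha), spec)

def best_alpha(src, tgt, grid=np.linspace(-0.3, 0.3, 61)):
    # Maximize correlation rather than minimize distance, so spectral
    # peaks and valleys drive the alignment instead of raw magnitude.
    corrs = [np.corrcoef(warp_spectrum(src, a), tgt)[0, 1] for a in grid]
    return grid[int(np.argmax(corrs))]

rng = np.random.default_rng(1)
src, tgt = rng.random(257), rng.random(257)   # hypothetical log-spectra
print("alpha* =", best_alpha(src, tgt))
```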
international conference on acoustics, speech, and signal processing | 2012
Siu Wa Lee; Shen Ting Ang; Minghui Dong; Haizhou Li
Natural pitch fluctuations are essential to human singing, so generating them is necessary for effective singing voice synthesis. Previous synthesis methods classify and reproduce these fluctuations individually; the fluctuations, however, are found to be interdependent and to vary across contexts. This paper proposes a generalized framework for F0 modelling that learns and generates these fluctuations on a per-note basis. Context-dependent hidden Markov models, representing the fluctuations observed in particular musical contexts, are built. To capture both the pitch fluctuation and the voicing transitions in human singing, we employ absolute and relative pitch as the modelling features. Experiments on pitch accuracy and on the quality of synthesized singing showed that the proposed framework achieves accurate pitch generation and better naturalness in the synthesized outputs.
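The absolute/relative feature pair can be illustrated in a few lines: relative pitch is simply the frame-level pitch deviation from the written note. This is one plausible reading of the feature design, with toy values; the paper's exact feature extraction may differ.

```python
import numpy as np

def hz_to_semitone(f0_hz, ref_hz=440.0, ref_midi=69):
    """Convert frequency in Hz to a MIDI-style semitone scale."""
    return ref_midi + 12.0 * np.log2(f0_hz / ref_hz)

def note_features(f0_hz, note_midi):
    """Absolute pitch plus pitch relative to the written note, per frame."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0                  # 0 marks unvoiced frames
    abs_pitch = np.where(voiced,
                         hz_to_semitone(np.where(voiced, f0_hz, 1.0)),
                         np.nan)
    rel_pitch = abs_pitch - note_midi   # deviation from the score note
    return abs_pitch, rel_pitch, voiced

f0 = [0.0, 438.0, 442.0, 445.0, 0.0]    # toy contour around A4 (MIDI 69)
print(note_features(f0, note_midi=69))
```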
asia pacific signal and information processing association annual summit and conference | 2015
Minghui Dong; Chenyu Yang; Yanfeng Lu; Jochen Walter Ehnes; Dong-Yan Huang; Huaiping Ming; Rong Tong; Siu Wa Lee; Haizhou Li
To convert one speaker's voice to another's, a mapping of corresponding speech segments from the source speaker to the target speaker must first be obtained. In parallel voice conversion, the dynamic time warping (DTW) method is normally used to align source and target signals; for conversion between non-parallel speech data, however, DTW-based mapping does not work. In this paper, we propose to use a DNN-HMM recognizer to recognize each frame of both the source and target speech signals. The vector of pseudo-likelihoods is then used to represent the frame, and the similarity between two frames is measured by the distance between their vectors. A clustering method groups both source and target frames, and a frame mapping from source to target is established based on the clustering result. Experiments show that the proposed method generates conversion results similar to those of parallel voice conversion.
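A minimal sketch of the cluster-then-map step, with random vectors standing in for the DNN-HMM posteriorgrams; the recognizer itself, the cluster count, and the distance measure are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
n_states = 40                                       # posterior vector size (hypothetical)
src_post = rng.dirichlet(np.ones(n_states), 500)    # stand-in source posteriorgrams
tgt_post = rng.dirichlet(np.ones(n_states), 600)    # stand-in target posteriorgrams

k = 16
src_km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(src_post)
tgt_km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tgt_post)

# Map each source cluster to the target cluster with the nearest centroid,
# giving a frame-level correspondence without parallel recordings.
dist = cdist(src_km.cluster_centers_, tgt_km.cluster_centers_)
cluster_map = dist.argmin(axis=1)
frame_map = cluster_map[src_km.labels_]             # target cluster per source frame
print(frame_map[:10])
```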
Multimedia Tools and Applications | 2016
Hy Quy Nguyen; Siu Wa Lee; Xiaohai Tian; Minghui Dong; Eng Siong Chng
Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by the spectral feature as well as by various prosodic features. Most existing conversion methods focus on the spectral feature, as it directly represents timbre, while some methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature; the prosodic features include F0, intensity and duration. DNNs are well known to be useful for modelling high-dimensional features, and in this work we show that a DNN initialized by our proposed autoencoder pretraining yields good-quality conversion models. This pretraining is tailor-made for voice conversion and leverages an autoencoder to capture the generic spectral shape of the source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the segmental DNN models. Experimental results show that applying both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluation and subjective listening tests.
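The pretraining idea — reconstruct source spectra with an autoencoder, then reuse its encoder to initialize the conversion network — can be sketched as below. Layer sizes, optimizer settings, and the random stand-in data are assumptions; this is not the paper's architecture.

```python
import torch
import torch.nn as nn

D = 513  # spectral feature dimension (illustrative)

# Step 1: autoencoder trained to reconstruct source spectra, so its hidden
# layers capture the generic spectral shape of the source speaker.
class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(D, 256), nn.Tanh(),
                                 nn.Linear(256, 128), nn.Tanh())
        self.dec = nn.Sequential(nn.Linear(128, 256), nn.Tanh(),
                                 nn.Linear(256, D))
    def forward(self, x):
        return self.dec(self.enc(x))

ae = AE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
x = torch.rand(64, D)                  # stand-in batch of source spectra
for _ in range(5):                     # a few toy pretraining steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(x), x)
    loss.backward()
    opt.step()

# Step 2: the conversion DNN reuses the pretrained encoder, then fine-tunes
# on paired source/target spectra (random stand-ins here).
converter = nn.Sequential(ae.enc, nn.Linear(128, D))
y = torch.rand(64, D)
opt2 = torch.optim.Adam(converter.parameters(), lr=1e-3)
opt2.zero_grad()
loss = nn.functional.mse_loss(converter(x), y)
loss.backward()
opt2.step()
```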
asia pacific signal and information processing association annual summit and conference | 2015
Bo Fan; Siu Wa Lee; Xiaohai Tian; Lei Xie; Minghui Dong
State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by assuming a minimum-phase filter during synthesis, and speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show performance better than that of the widely used STRAIGHT vocoder. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN) baseline system on various objective evaluation metrics.
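The motivation — that discarding phase degrades reconstruction — is easy to demonstrate with a plain STFT round trip. This toy comparison (zero phase as a crude stand-in for a phase-less vocoder, a sine as stand-in speech) only illustrates the problem the paper addresses, not its actual representation.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)           # stand-in "speech" signal

f, tt, X = stft(x, fs=fs, nperseg=512)
mag, phase = np.abs(X), np.angle(X)       # model both jointly, not magnitude alone

# Reconstruction with the true phase is near-perfect ...
_, x_true = istft(mag * np.exp(1j * phase), fs=fs, nperseg=512)
# ... while discarding phase (zero phase here) degrades the waveform.
_, x_zero = istft(mag.astype(complex), fs=fs, nperseg=512)

n = min(len(x), len(x_true))
print("true-phase error:", np.max(np.abs(x[:n] - x_true[:n])))
print("zero-phase error:", np.max(np.abs(x[:n] - x_zero[:n])))
```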
international symposium on chinese spoken language processing | 2012
Siu Wa Lee; Minghui Dong; Haizhou Li
Natural pitch fluctuation is essential to singing voice. Recently, we proposed a generalized F0 modelling method that models the expected F0 fluctuation under various contexts with note HMMs. Knowing that F0 contours close to professional human singing promote perceived quality, we face two requirements: (1) accurate F0 estimation and (2) precise voiced/unvoiced decisions. In this paper, we introduce two techniques along these directions. The influence of lyric phonetics on singing F0 is considered, to capture the F0 and voicing behaviour arising from different note-lyrics combinations. The generalized F0 modelling method is further extended to the frequency domain, to study whether shape characterization in terms of sinusoids helps F0 estimation. Our experiments showed that using lyrics information leads to better F0 generation and improves the naturalness of synthesized singing. While the frequency-domain representation is viable, its performance is less competitive than that of the time-domain representation, which requires further study.
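One plausible reading of a "shape characterization in terms of sinusoids" is a truncated cosine transform of each note's F0 contour. The sketch below uses a DCT with an arbitrary cutoff on a toy vibrato-like contour; the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(3)
# Toy F0 contour for one note: vibrato-like ripple plus estimation noise.
n = 100
f0 = 69 + 0.5 * np.sin(2 * np.pi * 6 * np.arange(n) / n) \
        + 0.05 * rng.standard_normal(n)

# Frequency-domain characterization: keep only the leading DCT coefficients,
# which summarise the contour shape in terms of slow sinusoids.
coef = dct(f0, norm='ortho')
coef[8:] = 0.0                        # keep 8 shape coefficients (arbitrary)
f0_hat = idct(coef, norm='ortho')
print("RMSE (semitones):", np.sqrt(np.mean((f0 - f0_hat) ** 2)))
```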
IEEE Transactions on Audio, Speech, and Language Processing | 2017
Xiaohai Tian; Siu Wa Lee; Zhizheng Wu; Eng Siong Chng; Haizhou Li
The voice conversion task is to modify a source speaker's voice to sound like that of a target speaker. A conversion method is considered successful when the produced speech sounds natural and similar to the target speaker. This paper presents a new voice conversion framework that combines frequency warping with an exemplar-based method. Our method maintains high-resolution details during conversion by applying frequency warping directly to the high-resolution spectrum to represent the target. The warping function is generated by sparse interpolation from a dictionary of exemplar warping functions; as the generated warping function depends on only a very small set of exemplars, we avoid the statistical averaging effects inherited from Gaussian mixture models. To compensate for conversion error, we also incorporate residual exemplars into the conversion process. Both objective and subjective evaluations on the VOICES database validated the effectiveness of the proposed framework, and we observed a significant improvement in speech quality over state-of-the-art parametric methods.
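Complementing the sparse-weight sketch after the 2015 ICASSP entry above, here is the other half of such a pipeline: applying an interpolated warping function directly to a high-resolution spectrum and adding a residual. The warp shape, resolution, and data are invented for illustration, not taken from the paper.

```python
import numpy as np

def apply_warp(spec, warp_bins, resid=None):
    """Warp a high-resolution spectrum by resampling it at warped bin
    positions, then add an exemplar residual to compensate conversion error."""
    bins = np.arange(len(spec))
    warped = np.interp(bins, warp_bins, spec)  # spec evaluated on the warped axis
    return warped + resid if resid is not None else warped

n = 1025                                       # high-resolution FFT spectrum
spec = np.abs(np.random.default_rng(4).standard_normal(n))
# A toy monotonic warping function: mildly stretches the frequency axis.
warp_bins = np.linspace(0, n - 1, n) ** 1.05 / (n - 1) ** 0.05
resid = np.zeros(n)                            # exemplar residual (zeros here)
print(apply_warp(spec, warp_bins, resid)[:5])
```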
international symposium on chinese spoken language processing | 2014
Renbo Zhao; Siu Wa Lee; Dong-Yan Huang; Minghui Dong
Separating the leading voice from a music mixture remains challenging for automatic systems: competing harmonics from the accompaniment severely interfere with leading voice estimation. To extract the leading voice properly, separation algorithms based on source-filter modeling of the human voice and on non-negative matrix factorization have been introduced. This paper extends this approach with a statistical weighting scheme that ranks pitch candidates using music score information. It imposes a soft constraint on the likelihood of these pitch candidates, so the interference from the accompaniment on leading voice estimation is reduced. Our experiments showed that this soft-constrained separation with score guidance infers the leading vocal pitch accurately when the score is reliable, and remains robust to erroneous scores.
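A minimal sketch of such a soft score constraint: scale each pitch candidate's salience by a Gaussian prior centred on the score note in the semitone domain, so off-score candidates are down-weighted rather than eliminated. The prior width, candidates, and saliences are illustrative assumptions.

```python
import numpy as np

def score_weighted_salience(cand_hz, salience, score_hz, sigma_semitones=1.0):
    """Softly re-weight pitch candidates toward the score note: a Gaussian
    prior in the semitone domain scales each candidate's salience."""
    dev = 12.0 * np.log2(np.asarray(cand_hz) / score_hz)  # semitone deviation
    prior = np.exp(-0.5 * (dev / sigma_semitones) ** 2)
    return np.asarray(salience) * prior

cand_hz = [196.0, 220.0, 247.0, 392.0]      # candidates incl. an octave error
salience = [0.6, 0.9, 0.5, 0.8]
weighted = score_weighted_salience(cand_hz, salience, score_hz=220.0)
print(cand_hz[int(np.argmax(weighted))])    # score guidance picks 220 Hz
```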
international symposium on chinese spoken language processing | 2010
Chun-Man Mak; Tan Lee; Siu Wa Lee
This paper presents a study on model-based separation of monaural speech mixtures. With prior knowledge of the text content of the speech sources, we estimate the spectral envelope trajectory of each source and use the trajectories to filter the mixture signal, so that the target signal is enhanced and the interfering signal is suppressed. Accurate trajectory estimation is therefore crucial for successful separation. We propose to use nonnegative matrix factorization in the trajectory estimation process, which improves the accuracy of the estimated trajectories considerably. Performance evaluation is carried out using mixtures of two equally loud Cantonese speech sources, and the proposed method shows significant improvement over previously proposed speech separation methods.
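A sketch of the NMF-based estimation-and-filtering idea: with per-source basis spectra held fixed (random stand-ins here; in practice they would be learned from training data), multiplicative updates estimate the activation trajectories on the mixture, and Wiener-style masks then enhance the target. This follows generic NMF separation practice and is not necessarily the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
F, T, K = 257, 100, 8
W1 = rng.random((F, K)) + 1e-3     # basis spectra of source 1 (hypothetical)
W2 = rng.random((F, K)) + 1e-3     # basis spectra of source 2 (hypothetical)
V = rng.random((F, T)) + 1e-3      # mixture magnitude spectrogram (stand-in)

W = np.hstack([W1, W2])
H = rng.random((2 * K, T)) + 1e-3

# Multiplicative updates for the activations only: the bases stay fixed, so
# H traces each source's spectral envelope trajectory through the mixture.
for _ in range(100):
    H *= (W.T @ V) / (W.T @ (W @ H) + 1e-9)

# Wiener-style masks from the per-source reconstructions enhance the target
# and suppress the interference.
V1, V2 = W1 @ H[:K], W2 @ H[K:]
mask1 = V1 / (V1 + V2 + 1e-9)
target = mask1 * V                 # estimated target-source spectrogram
print(target.shape)
```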