Publication


Featured research published by Xiaohai Tian.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Sparse representation for frequency warping based voice conversion

Xiaohai Tian; Zhizheng Wu; Siu Wa Lee; Nguyen Quy Hy; Eng Siong Chng; Minghui Dong

This paper presents a sparse representation framework for weighted frequency warping based voice conversion. In this method, a frame-dependent warping function and the corresponding spectral residual vector are first calculated for each source-target spectrum pair. At run-time conversion, a source spectrum is factorised as a linear combination of a set of source spectra from the training data. The combination weights, which are constrained to be sparse, are used to interpolate the frame-dependent warping functions and spectral residual vectors. In this way, the proposed method not only avoids the statistical averaging caused by the GMM but also preserves high-resolution spectral details for high-quality converted speech. Experiments were conducted on the VOICES database, and both objective and subjective results confirmed the effectiveness of the proposed method. In particular, the spectral distortion dropped from 5.55 dB for the conventional frequency warping approach to 5.0 dB for the proposed method. Compared to the state-of-the-art GMM-based conversion with global variance (GV) enhancement, our method achieved a 68.5% preference score in an AB test.
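
A minimal sketch of the run-time step described above, assuming a dictionary of source-spectrum exemplars with paired warping functions and residual vectors (variable names and the multiplicative-update solver are illustrative, not the authors' implementation):

```python
import numpy as np

def sparse_activation(x, A, n_iter=200, sparsity=0.1, eps=1e-12):
    """Non-negative sparse weights w such that A @ w approximates x
    (multiplicative updates with an L1 penalty; one common choice)."""
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(n_iter):
        w *= (A.T @ x) / (A.T @ (A @ w) + sparsity + eps)
    return w

def convert_frame(x_src, src_dict, warp_dict, resid_dict):
    """Interpolate exemplar warping functions and residuals with the weights."""
    w = sparse_activation(x_src, src_dict)
    w /= w.sum() + 1e-12                 # normalise the interpolation weights
    warp = warp_dict @ w                 # frame-dependent warping function
    resid = resid_dict @ w               # spectral residual vector
    return warp, resid

# toy usage: 513-bin magnitude spectra, 100 exemplars
rng = np.random.default_rng(0)
src_dict = np.abs(rng.standard_normal((513, 100)))
warp, resid = convert_frame(np.abs(rng.standard_normal(513)), src_dict,
                            np.abs(rng.standard_normal((513, 100))),
                            rng.standard_normal((513, 100)))
```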


International Symposium on Chinese Spoken Language Processing | 2014

Correlation-based frequency warping for voice conversion

Xiaohai Tian; Zhizheng Wu; Siu Wa Lee; Eng Siong Chng

Frequency warping (FW) based voice conversion aims to modify the frequency axis of source spectra towards that of the target. In previous works, the optimal warping function was calculated by minimizing the spectral distance between converted and target spectra without considering the spectral shape. Nevertheless, speaker timbre and identity depend greatly on the peaks and valleys of the vocal tract spectrum. In this paper, we propose a method that defines the warping function by maximizing the correlation between the converted and target spectra. Different from conventional warping methods, the correlation-based optimization is not dominated by the magnitude of the spectra. Instead, both spectral peaks and valleys are considered in the optimization process, which also improves the performance of amplitude scaling. Experiments were conducted on the VOICES database, and the results show that after amplitude scaling our proposed method reduced the mel-spectral distortion from 5.85 dB to 5.60 dB. Subjective listening tests also confirmed the effectiveness of the proposed method.
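
The core idea can be sketched as follows: rather than minimizing a spectral distance, the warping is chosen to maximize the Pearson correlation between the warped source and target log-spectra. This simplified, single-parameter search is for illustration only and is not the paper's actual warping parameterisation:

```python
import numpy as np

def warp_spectrum(log_spec, alpha):
    """Read the source log-spectrum at linearly warped frequency positions
    (a crude one-parameter warp, for illustration only)."""
    axis = np.linspace(0.0, 1.0, len(log_spec))
    return np.interp(np.clip(axis * alpha, 0.0, 1.0), axis, log_spec)

def best_warp(src_log_spec, tgt_log_spec, alphas=np.linspace(0.8, 1.2, 41)):
    """Pick the warping factor that maximizes correlation with the target,
    rather than one that minimizes spectral distance."""
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return max(alphas, key=lambda a: corr(warp_spectrum(src_log_spec, a), tgt_log_spec))
```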


Conference of the International Speech Communication Association | 2016

An Investigation of Spoofing Speech Detection Under Additive Noise and Reverberant Conditions.

Xiaohai Tian; Zhizheng Wu; Xiong Xiao; Eng Siong Chng; Haizhou Li

Spoofing detection for automatic speaker verification (ASV), which aims to discriminate between live and artificial speech, has received increasing attention recently. However, previous studies have been conducted on clean data without significant noise, and it is still not clear whether spoofing detectors trained on clean speech generalise well under noisy conditions. In this work, we investigate spoofing detection under additive noise and reverberant conditions. In particular, we consider five different additive noise types at three signal-to-noise ratios (SNR), as well as reverberation with different reverberation times (RT). Our experimental results reveal that additive noise significantly degrades spoofing detectors trained on clean speech, whereas reverberation hurts performance much less.
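
For readers who want to reproduce noisy test conditions of this kind, mixing a noise recording into clean speech at a target SNR can be done as in the sketch below (a generic recipe, not the authors' exact setup):

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise recording into clean speech at a given SNR (in dB)."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```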


9th ISCA Speech Synthesis Workshop | 2016

An Automatic Voice Conversion Evaluation Strategy Based on Perceptual Background Noise Distortion and Speaker Similarity.

Dong-Yan Huang; Lei Xie; Yvonne Siu Wa Lee; Jie Wu; Huaiping Ming; Xiaohai Tian; Shaofei Zhang; Chuang Ding; Mei Li; Quy Hy Nguyen; Minghui Dong; Haizhou Li

Voice conversion aims to modify the characteristics of one speaker's voice to make it sound as if spoken by another speaker, without changing the language content. This task has attracted considerable attention, and various approaches have been proposed over the past two decades. The evaluation of voice conversion approaches, usually through time-intensive subjective listening tests, requires a large amount of human labor. This paper proposes an automatic voice conversion evaluation strategy based on perceptual background noise distortion and speaker similarity. Experimental results show that our automatic evaluation results match the subjective listening results quite well. We further use our strategy to select the best converted samples from multiple voice conversion systems, and our submission achieves promising results in the Voice Conversion Challenge (VCC2016).
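
As one illustration of how such scores might be combined, the sketch below ranks converted samples by a normalised mix of (inverted) background-noise distortion and speaker similarity; the weighting and normalisation are assumptions, not the paper's formula:

```python
import numpy as np

def rank_samples(noise_distortion, speaker_similarity, weight=0.5):
    """Return sample indices ordered best-first by a weighted combination of
    low background-noise distortion and high speaker similarity."""
    d = np.asarray(noise_distortion, dtype=float)
    s = np.asarray(speaker_similarity, dtype=float)
    d = (d - d.min()) / (np.ptp(d) + 1e-12)           # normalise to [0, 1]
    s = (s - s.min()) / (np.ptp(s) + 1e-12)
    score = weight * (1.0 - d) + (1.0 - weight) * s
    return np.argsort(-score)
```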


Multimedia Tools and Applications | 2016

High quality voice conversion using prosodic and high-resolution spectral features

Hy Quy Nguyen; Siu Wa Lee; Xiaohai Tian; Minghui Dong; Eng Siong Chng

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on the spectral feature, as it directly represents the timbre characteristics, while some methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks (DNN) to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution spectral feature, while the prosodic features include F0, intensity and duration. DNNs are well suited to modelling high-dimensional features, and in this work we show that a DNN initialised by our proposed autoencoder pretraining yields good-quality conversion models. This pretraining is tailor-made for voice conversion and leverages an autoencoder to capture the generic spectral shape of source speech. Additionally, our framework uses segmental DNN models to capture the evolution of the prosodic features over time. To reconstruct the converted speech, the spectral feature produced by the DNN model is combined with the three prosodic features produced by the segmental DNN models. Our experimental results show that the application of both prosodic and high-resolution spectral features leads to high-quality converted speech, as measured by objective evaluation and subjective listening tests.
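
A minimal PyTorch sketch of the autoencoder-pretraining idea described above (layer sizes, optimiser and toy data are assumptions, and the segmental prosody models are not shown): the autoencoder is first trained to reconstruct source spectra, and its encoder then initialises the source-to-target conversion DNN.

```python
import torch
import torch.nn as nn

dim = 513                                   # high-resolution spectral dimension

encoder = nn.Sequential(nn.Linear(dim, 256), nn.Tanh())
decoder = nn.Linear(256, dim)
autoencoder = nn.Sequential(encoder, decoder)

def pretrain(src_spectra, epochs=10):
    """Autoencoder pretraining on source spectra only, to capture the
    generic spectral shape before the conversion DNN is fine-tuned."""
    opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(autoencoder(src_spectra), src_spectra)
        loss.backward()
        opt.step()

def conversion_dnn():
    """Source-to-target conversion DNN initialised from the pretrained encoder."""
    return nn.Sequential(encoder, nn.Linear(256, 256), nn.Tanh(), nn.Linear(256, dim))

# toy usage: 100 source frames; fine-tune the returned model on aligned pairs
pretrain(torch.randn(100, dim))
converter = conversion_dnn()
```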


International Conference on Signal and Information Processing | 2015

Detecting synthetic speech using long term magnitude and phase information

Xiaohai Tian; Steven Du; Xiong Xiao; Haihua Xu; Eng Siong Chng; Haizhou Li

Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. Such signals pose a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice to cheat the SV system. To address this challenge, we study the detection of synthetic speech using long-term magnitude and phase information of speech. As most TTS and VC techniques rely on vocoders for speech analysis and synthesis, we focus on differentiating speech signals generated by vocoders from natural speech. The log magnitude spectrum and two phase-based features, the instantaneous frequency derivation and the modified group delay, were studied in this work. We conducted experiments on the CMU-ARCTIC database using various speech features and a neural network classifier. During training, synthetic speech detection is formulated as a two-class classification problem and the neural network is trained to differentiate synthetic speech from natural speech. During testing, the posterior scores generated by the neural network are used for the detection of synthetic speech. The synthetic speech used in training and testing is generated by different types of vocoders and VC methods. Experimental results show that long-term information of up to 0.3 s is important for synthetic speech detection. In addition, the high-dimensional log magnitude spectrum features significantly outperform the low-dimensional MFCC features, showing that it is important to retain detailed spectral information for detecting synthetic speech. Furthermore, the two phase-based features perform well and are complementary to the log magnitude spectrum features. The fusion of these features produces an equal error rate (EER) of 0.09%.
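
As one example of the phase-based features mentioned above, a standard modified group delay computation for a single windowed frame can be sketched as follows (the usual cepstral smoothing of the denominator spectrum is omitted here, and the alpha/gamma values are typical rather than the paper's):

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, n_fft=512, eps=1e-8):
    """Modified group delay of one windowed speech frame."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)              # spectrum of x[n]
    Y = np.fft.rfft(n * frame, n_fft)           # spectrum of n * x[n]
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** (2 * gamma) + eps)
    return np.sign(tau) * np.abs(tau) ** alpha
```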


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2015

A waveform representation framework for high-quality statistical parametric speech synthesis

Bo Fan; Siu Wa Lee; Xiaohai Tian; Lei Xie; Minghui Dong

State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by using a minimum-phase filter during synthesis, and speech quality suffers as a result. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the performance is better than that of the widely used STRAIGHT vocoder. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN) based baseline system on various objective evaluation metrics.
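
One simple way to keep phase alongside magnitude for joint modeling is to represent each frame's spectrum by its real and imaginary parts, as in the sketch below (an illustrative representation, not the paper's exact framework):

```python
import numpy as np

def analyse(frames, n_fft=1024):
    """Represent each frame by real and imaginary spectral parts together,
    so magnitude and phase can be modelled jointly."""
    spec = np.fft.rfft(frames, n_fft, axis=-1)
    return np.concatenate([spec.real, spec.imag], axis=-1)

def synthesise(features, n_fft=1024):
    """Invert the joint representation back to waveform frames."""
    half = features.shape[-1] // 2
    spec = features[..., :half] + 1j * features[..., half:]
    return np.fft.irfft(spec, n_fft, axis=-1)
```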


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2013

Local partial least square regression for spectral mapping in voice conversion

Xiaohai Tian; Zhizheng Wu; Eng Siong Chng

The joint density Gaussian mixture model (JD-GMM) based method has been widely used in voice conversion due to its flexible implementation. However, the statistical averaging effect in estimating the model parameters over-smooths the target spectral trajectories. Motivated by the local linear transformation method, which uses neighboring data rather than all the training data to estimate the transformation function for each feature vector, we propose a local partial least square (PLS) method that avoids both the over-smoothing problem of JD-GMM and the over-fitting problem of local linear transformation when training data are limited. We conducted experiments on the VOICES database and measured both the spectral distortion and the correlation coefficient of the spectral parameter trajectories. The experimental results show that our proposed method obtains better performance than the baseline methods.
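
A compact sketch of local PLS conversion using scikit-learn, assuming aligned source/target frame matrices X_src and Y_tgt (the neighbourhood size and number of PLS components are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.neighbors import NearestNeighbors

def local_pls_convert(X_src, Y_tgt, X_test, k=200, n_components=8):
    """Convert each test frame with a PLS model fitted only on its k nearest
    source neighbours, instead of a single global transformation."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_src)
    Y_pred = np.empty((len(X_test), Y_tgt.shape[1]))
    for i, x in enumerate(X_test):
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        pls = PLSRegression(n_components=n_components).fit(X_src[idx], Y_tgt[idx])
        Y_pred[i] = pls.predict(x.reshape(1, -1))[0]
    return Y_pred
```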


IEEE Transactions on Audio, Speech, and Language Processing | 2017

An Exemplar-Based Approach to Frequency Warping for Voice Conversion

Xiaohai Tian; Siu Wa Lee; Zhizheng Wu; Eng Siong Chng; Haizhou Li

The voice conversion task is to modify a source speaker's voice to sound like that of a target speaker. A conversion method is considered successful when the produced speech sounds natural and similar to the target speaker. This paper presents a new voice conversion framework that combines frequency warping with an exemplar-based method. Our method maintains high-resolution details during conversion by applying frequency warping directly to the high-resolution spectrum to represent the target. The warping function is generated by sparse interpolation from a dictionary of exemplar warping functions. As the generated warping function depends only on a very small set of exemplars, we avoid the statistical averaging effects inherited from Gaussian mixture models. To compensate for the conversion error, we also incorporate residual exemplars into the conversion process. Both objective and subjective evaluations on the VOICES database validated the effectiveness of the proposed voice conversion framework, and we observed a significant improvement in speech quality over state-of-the-art parametric methods.
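
The sparse interpolation of exemplar warping functions is sketched under the 2015 ICASSP paper above; the remaining step, applying an interpolated warping function plus residual to the high-resolution source spectrum, could look like this (representing the warping function as normalised target-to-source frequency positions is an assumption for illustration):

```python
import numpy as np

def apply_warp(src_spec, warp_fn, residual):
    """Warp a high-resolution source spectrum and add the residual correction.
    warp_fn[k] gives, for target bin k, the normalised source frequency to read."""
    axis = np.linspace(0.0, 1.0, len(src_spec))
    warped = np.interp(np.clip(warp_fn, 0.0, 1.0), axis, src_spec)
    return warped + residual
```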


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016

Spoofing speech detection using temporal convolutional neural network

Xiaohai Tian; Xiong Xiao; Eng Siong Chng; Haizhou Li

Spoofing speech detection aims to differentiate spoofing speech from natural speech. Frame-based features are used in most previous works; although multiple frames or dynamic features are concatenated into a super-vector to represent temporal information, the time span covered by these features is not sufficient, and most systems fail to detect non-vocoder or unit-selection based spoofing attacks. In this work, we propose a temporal convolutional neural network (CNN) based classifier for spoofing speech detection. The temporal CNN first convolves the feature trajectories with a set of filters, then extracts the maximum responses of these filters within a time window using a max-pooling layer. Due to the use of max-pooling, we can extract useful information from a long temporal span without concatenating a large number of neighbouring frames, as in a feedforward deep neural network (DNN). Five types of features are employed to assess the performance of the proposed classifier. Experimental results on the ASVspoof 2015 corpus show that the temporal CNN based classifier is effective for synthetic speech detection. Specifically, the proposed method brings a significant performance boost for unit-selection based spoofing speech detection.
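
A minimal PyTorch sketch of such a temporal CNN (filter count, kernel width and the toy input are assumptions): it convolves the feature trajectories along time and max-pools over the whole window, so a long span is summarised without concatenating neighbouring frames.

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    """1-D convolution over feature trajectories followed by max-pooling over
    time, producing one natural-vs-spoofed score per segment."""
    def __init__(self, feat_dim=60, n_filters=128, kernel=15):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, n_filters, kernel_size=kernel)
        self.out = nn.Linear(n_filters, 2)

    def forward(self, x):                      # x: (batch, feat_dim, frames)
        h = torch.relu(self.conv(x))
        h, _ = h.max(dim=-1)                   # max-pooling over the time axis
        return self.out(h)

# toy usage: 4 segments, 60-dim features, 300 frames
scores = TemporalCNN()(torch.randn(4, 60, 300))
```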

Collaboration


Dive into Xiaohai Tian's collaborations.

Top Co-Authors

Eng Siong Chng (Nanyang Technological University)
Haizhou Li (National University of Singapore)
Zhizheng Wu (University of Edinburgh)
Xiong Xiao (Nanyang Technological University)
Chunyan Miao (Nanyang Technological University)
Haihua Xu (Nanyang Technological University)
Lei Meng (Nanyang Technological University)
Nguyen Quy Hy (Nanyang Technological University)