Xiang Xie
Beijing Institute of Technology
Publications
Featured research published by Xiang Xie.
international conference on multimedia and expo | 2014
Xingyu Na; Xiang Xie; Jingming Kuang
Speech synthesizers are commonly used in human-computer interaction. In many applications, computing resources are limited while real-time synthesis is demanded. HMM-based speech synthesis can create natural voice quality with a small footprint, but current synthesizers require the concatenation of sentence-level acoustic units, which is not applicable in real-time mode. In this paper, we propose a blocked parameter generation algorithm for low-latency speech synthesis that can run in real time in resource-limited applications. Phonetic units at various time spans are used as blocks. Objective and subjective evaluations suggest that the proposed system produces promising voice quality with a low demand on computing resources.
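The abstract does not spell out the algorithm, but the core idea of block-wise generation with bounded latency can be sketched on a toy causal smoothing filter standing in for the actual parameter generation; the block size, filter, and state handling below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def smooth_full(x, taps):
    """Smooth an entire parameter trajectory with a causal moving average."""
    return np.convolve(x, np.ones(taps) / taps, mode="full")[: len(x)]

def smooth_blocked(x, taps, block):
    """Smooth the trajectory block by block, carrying over the last
    (taps - 1) samples as state so block boundaries introduce no error.
    Latency is bounded by the block size, not the utterance length."""
    out = []
    history = np.zeros(taps - 1)
    for start in range(0, len(x), block):
        padded = np.concatenate([history, x[start : start + block]])
        out.append(np.convolve(padded, np.ones(taps) / taps, mode="valid"))
        history = padded[len(padded) - (taps - 1):]
    return np.concatenate(out)
```

Because the per-block computation touches only `block + taps - 1` samples, output can start after the first block arrives, mirroring the low-latency goal of the paper.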
international conference on acoustics, speech, and signal processing | 2014
Yishan Jiao; Xiang Xie; Xingyu Na; Ming Tu
HMM-based speech synthesis systems (HTS) often generate buzzy and muffled speech. Such degradation of voice quality makes synthetic speech sound robotic rather than natural. From this perspective, we suppose that synthetic speech lies in a speaker space apart from the original, and we propose to use voice conversion to transform synthetic speech toward the original so as to improve its quality. Local linear transformation (LLT) combined with temporal decomposition (TD) is proposed as the conversion method; it not only ensures smooth spectral conversion but also avoids the over-smoothing problem. Moreover, we design a robust spectral selection and modification strategy to keep the modified spectra stable. A preference test shows that the proposed method improves the quality of HMM-based speech synthesis.
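As a rough illustration of local linear transformation (leaving out the temporal decomposition and the spectral selection strategy), one can partition source frames into local regions and fit one least-squares transform per region; the region centers and data below are assumptions for the sketch:

```python
import numpy as np

def nearest_region(x, centers):
    """Index of the closest region center for each frame."""
    return np.argmin(((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)

def fit_local_linear_transforms(src, tgt, centers):
    """One least-squares linear transform per local region of the source
    space, a simplified stand-in for LLT-based spectral conversion.
    Assumes every region contains at least one frame."""
    labels = nearest_region(src, centers)
    return [np.linalg.lstsq(src[labels == k], tgt[labels == k], rcond=None)[0]
            for k in range(len(centers))]

def convert(src, centers, transforms):
    """Apply the transform of each frame's local region."""
    labels = nearest_region(src, centers)
    return np.stack([src[i] @ transforms[labels[i]] for i in range(len(src))])
```

Fitting transforms locally rather than globally is what lets the mapping stay piecewise-smooth without averaging away spectral detail, which is the over-smoothing problem the abstract mentions.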
Eurasip Journal on Audio, Speech, and Music Processing | 2013
Jing Wang; Xuan Ji; Shenghui Zhao; Xiang Xie; Jingming Kuang
This paper presents a novel lossless compression technique based on context-based adaptive arithmetic coding, which can be used to further compress the quantized parameters in an audio codec. The key feature of the new technique is the combination of context models in the time domain and the frequency domain, called the time-frequency context model. It is applied to the lossless compression of audio coding parameters such as the quantized modified discrete cosine transform (MDCT) coefficients and the frequency band gains in the ITU-T G.719 audio codec. With the proposed adaptive arithmetic coding, a high degree of adaptation and redundancy reduction can be achieved. In addition, an efficient variable-rate algorithm is employed, designed on the basis of both the baseline entropy coding method of G.719 and the proposed adaptive arithmetic coding technique. Experiments show that the proposed technique is more efficient than conventional Huffman coding and common adaptive arithmetic coding for the lossless compression of audio coding parameters. For a set of audio samples used in the G.719 application, the proposed technique achieves an average bit-rate saving of 7.2% in the low-bit-rate coding mode while producing audio quality equal to that of the original G.719.
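The benefit of a time-frequency context can be illustrated on synthetic quantized coefficients: conditioning each symbol's probability model on its time (left) and frequency (upper) neighbours lowers the empirical entropy, which is the bit rate an adaptive arithmetic coder would approach. This sketch shows only the context-modelling principle, not the G.719 parameter streams or the coder itself:

```python
import numpy as np
from collections import Counter, defaultdict

def entropy_bits(counter):
    """Empirical entropy (bits/symbol) of a frequency table."""
    p = np.array(list(counter.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def marginal_entropy(grid):
    """Entropy of the symbols with no context (context-free coding)."""
    return entropy_bits(Counter(grid.ravel().tolist()))

def tf_conditional_entropy(grid):
    """Entropy of each symbol given its (time, frequency) neighbour pair,
    i.e. the time-frequency context model."""
    ctx = defaultdict(Counter)
    T, F = grid.shape
    for t in range(1, T):
        for f in range(1, F):
            ctx[(grid[t - 1, f], grid[t, f - 1])][grid[t, f]] += 1
    total = sum(sum(c.values()) for c in ctx.values())
    return sum(sum(c.values()) / total * entropy_bits(c) for c in ctx.values())
```

On any grid whose coefficients are correlated across time and frequency, the conditional entropy sits below the marginal one, and the gap is the redundancy the context-based coder removes.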
international symposium on chinese spoken language processing | 2008
Hui Yin; Climent Nadeu; Volker Hohmann; Xiang Xie; Jingming Kuang
We propose an acoustic feature for speech recognition based on the combination of MFCC and the fractional Fourier transform (FrFT). The transform orders for the FrFT are set adaptively according to the intra-frame pitch change rate. The method is motivated by the fact that speech is not stationary even over short periods of time; the idea is illustrated using an AM-FM speech model and spectrograms of an artificial periodic signal. Experiments were conducted on the intervocalic English consonants provided by the Interspeech 2008 Consonant Challenge and on a Mandarin connected-digit corpus. The performance of the proposed method is compared with an MFCC baseline system. Experimental results show that the proposed features achieve a slightly better recognition rate than MFCCs, presumably because they better track the dynamic characteristics of the speech harmonics.
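A discrete FrFT can be realized in several ways; one simple construction (not necessarily the one used in the paper) raises the unitary DFT matrix to a fractional power through its eigendecomposition. In an MFCC-style front end, this matrix would replace the FFT for frames whose order is chosen from the pitch change rate:

```python
import numpy as np

def dfrft_matrix(N, a):
    """Discrete fractional Fourier transform of order a, built by raising
    the unitary DFT matrix to a fractional power via eigendecomposition.
    Degenerate DFT eigenvalues make the fractional power non-unique;
    this picks the principal branch."""
    n = np.arange(N)
    F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)
    w, V = np.linalg.eig(F)
    return V @ np.diag(w ** a) @ np.linalg.inv(V)
```

By construction the family is additive in the order (order 0.3 followed by 0.7 equals order 1), and order 1 reduces to the ordinary unitary DFT, which is the property that lets the order interpolate between time and frequency domains.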
international conference on acoustics, speech, and signal processing | 2011
Duo-jia Ma; Xiang Xie; Jingming Kuang
Determining the optimal fractional Fourier transform (FrFT) order is a crucial issue for the FrFT. This paper proposes a novel algorithm for estimating the FrFT order. We use the information on pitch, harmonics, and formants in the correlogram of a Gammatone filterbank to obtain a small set of candidate transform orders, which reduces the computational complexity of the search for the optimal order. We apply this method to speech processing tasks such as Mel-frequency cepstral coefficient (MFCC) extraction and speech enhancement. The MFCC extraction experiment shows that the proposed method is superior to the traditional Fourier-transform-based method in terms of Fisher distance improvement, and the speech enhancement results also show improvements in SNR and in the Itakura-Saito distance of the LPC coefficients.
international conference on acoustics, speech, and signal processing | 2013
Jing Wang; Chundong Xu; Xiang Xie; Jingming Kuang
This paper proposes a novel multichannel audio signal compression method based on tensor decomposition. The multichannel audio tensor space is established with three factors (channel, time, and frequency) and is decomposed into a core tensor and three factor matrices using the Tucker model. Only the truncated core tensor is transmitted to the decoder, where it is multiplied by factor matrices trained before processing. The performance of the proposed method is evaluated with approximation errors, compression degree, and listening tests. The smaller the core tensor, the higher the compression degree; a very noticeable compression capability is achieved with acceptable retrieved quality. The novelty of the proposed method is that it enables both high compression capability and backward compatibility, with little audible signal distortion.
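The decomposition step can be sketched with a truncated higher-order SVD, one common way to fit the Tucker model (the paper's training procedure may differ). Only the small core would be transmitted, with the factor matrices shared in advance between encoder and decoder:

```python
import numpy as np

def unfold(T, mode):
    """Matricize a tensor along one mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Mode-n product: multiply matrix M into the given tensor mode."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def tucker_compress(T, ranks):
    """Truncated HOSVD: factor matrices from the leading left singular
    vectors of each unfolding; core = T contracted with their transposes."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = T
    for m, Um in enumerate(U):
        core = mode_dot(core, Um.T, m)
    return core, U

def tucker_reconstruct(core, U):
    """Decoder side: multiply the core back with the factor matrices."""
    T = core
    for m, Um in enumerate(U):
        T = mode_dot(T, Um, m)
    return T
```

With a (channel, time, frequency) tensor of shape (C, T, F) truncated to ranks (c, t, f), the transmitted core holds c·t·f values instead of C·T·F, which is where the compression degree comes from.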
international symposium on chinese spoken language processing | 2010
Duo-jia Ma; Xiang Xie; Jingming Kuang
This article introduces the fractional Fourier transform (FrFT) to speech enhancement, and a novel algorithm is proposed for estimating the FrFT order, the determination of which is a crucial issue for the FrFT. We use the information on pitch, harmonics, and formants in the correlogram of a Gammatone filterbank to obtain a small set of candidate transform orders, which reduces the computational complexity of the search for the optimal order. The experimental results show that the proposed method is superior to conventional spectral subtraction in terms of the SNR improvement of the enhanced speech and the Itakura-Saito distance of the LPC coefficients.
international conference on machine learning | 2017
Jin Hu; Jing Liu; Yingnan Zhang; Zhuanling Zha; Xiang Xie; Shilei Huang
This paper proposes a robust classifier for Mandarin vowels that considers articulatory manners (AMs), namely the height of the body of the tongue, the front-back position of the tongue, and the degree of lip rounding. First, the articulatory manners of each vowel are encoded into a 3-dimensional vector pattern. Then, acoustic features are extracted and mapped to the articulatory manner vector by an extreme learning machine (ELM). Finally, the vowel nearest to the predicted articulatory manner vector is chosen as the recognition result. A comparison between our method and a direct method that ignores articulatory manners shows an improvement of 7.1 percentage points, and tests on three kinds of noisy data in the Aurora-4 corpus show that it also outperforms the baseline method, with a gain of about 4 percentage points.
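A minimal version of this pipeline, ELM regression onto a 3-dimensional AM code followed by nearest-code decoding, might look as follows; the vowel codes and features used here are hypothetical placeholders, not the paper's actual encoding:

```python
import numpy as np

def train_elm(X, Y, hidden, rng):
    """Extreme learning machine: a random, fixed hidden layer plus
    closed-form least-squares output weights."""
    W = rng.standard_normal((X.shape[1], hidden))
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.lstsq(H, Y, rcond=None)[0]
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

def nearest_vowel(am_pred, codebook):
    """Decode each predicted AM vector to the vowel with the closest code."""
    names = list(codebook)
    codes = np.array([codebook[v] for v in names], dtype=float)
    idx = ((am_pred[:, None, :] - codes[None]) ** 2).sum(-1).argmin(1)
    return [names[i] for i in idx]
```

Because the output weights are solved in one least-squares step, training is fast, and decoding through the AM codebook is what injects the articulatory knowledge that the direct classifier lacks.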
China Communications | 2017
Jiyue Liu; Jing Wang; Min Liu; Xiang Xie; Jingming Kuang
With the development of multichannel audio systems, corresponding audio quality assessment techniques, especially objective prediction models, have received increasing attention. Existing methods such as PEAQ (Perceptual Evaluation of Audio Quality), recommended by the ITU, usually perform poorly when assessing multichannel audio, with little correlation to subjective scores. In this paper, a novel two-layer model based on Multiple Linear Regression (MLR) and a Neural Network (NN) is proposed. The first layer derives two indicators of multichannel audio quality, the Audio Quality Score (AQS) and the Spatial Perception Score (SPS), and the second layer outputs the overall score. The final results show that this model not only improves the correlation with subjective test scores by 30.7% and decreases the Root Mean Square Error (RMSE) by 44.6%, but also adds two new indicators, AQS and SPS, which help reflect multichannel audio quality more clearly.
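The two-layer structure can be sketched like this: ordinary least squares produces the AQS and SPS indicators from feature sets, and a small one-hidden-layer network maps the pair to an overall score. Training the second layer with plain full-batch gradient descent is an assumption here, since the abstract does not specify the training details:

```python
import numpy as np

def fit_mlr(X, y):
    """Layer 1: multiple linear regression with an intercept term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

def apply_mlr(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def train_nn(Z, y, hidden=8, lr=0.05, epochs=2000, seed=0):
    """Layer 2: one-hidden-layer regression network, full-batch GD."""
    rng = np.random.default_rng(seed)
    W1 = 0.5 * rng.standard_normal((Z.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = 0.5 * rng.standard_normal(hidden)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(Z @ W1 + b1)
        err = H @ W2 + b2 - y
        gH = np.outer(err, W2) * (1.0 - H ** 2)   # backprop through tanh
        W2 -= lr * H.T @ err / len(y)
        b2 -= lr * err.mean()
        W1 -= lr * Z.T @ gH / len(y)
        b1 -= lr * gH.mean(0)
    return W1, b1, W2, b2

def predict_nn(Z, W1, b1, W2, b2):
    return np.tanh(Z @ W1 + b1) @ W2 + b2
```

Splitting the model this way keeps the AQS and SPS values interpretable on their own while letting the nonlinear second layer capture how listeners trade the two off in the overall judgment.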
international symposium on chinese spoken language processing | 2016
Jing Wang; Yahui Shan; Shequan Jiang; Xiang Xie
This paper proposes a novel speech denoising method based on tensor filtering, in which the microphone array speech signal is represented as tensor data and processed by a tensor filtering model. The multi-microphone signal is represented in a third-order tensor space along the channel, time, and frequency dimensions. Noise is reduced by finding a lower-rank approximation of the third-order tensor with the Tucker model, and the MDL (Minimum Description Length) criterion is used to estimate the optimal tensor rank. The performance of the proposed approach is evaluated with objective indexes and a listening quality test. The experimental results indicate that the proposed approach can retrieve the target signal from noisy microphone array recordings.
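For the rank selection step, one classical option is the Wax-Kailath MDL criterion applied to the singular-value spectrum of a tensor unfolding; this is a sketch of that general idea, and the paper's exact MDL formulation for Tucker ranks may differ:

```python
import numpy as np

def mdl_rank(sv, n_obs):
    """Wax-Kailath MDL over a descending singular-value spectrum: choose
    the rank k minimising the code length of a 'k signal components plus
    isotropic noise' model of the data."""
    lam = (sv ** 2) / n_obs            # eigenvalue estimates
    p = len(lam)
    scores = []
    for k in range(p):
        tail = lam[k:]                 # eigenvalues attributed to noise
        geo = np.exp(np.mean(np.log(tail)))
        arith = np.mean(tail)
        loglik = -n_obs * (p - k) * np.log(geo / arith)
        penalty = 0.5 * k * (2 * p - k) * np.log(n_obs)
        scores.append(loglik + penalty)
    return int(np.argmin(scores))
```

Applied to the unfolding of each tensor mode in turn, this yields one rank per mode for the truncated Tucker approximation, with no hand-tuned threshold on the singular values.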