Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xiong Xiao is active.

Publication


Featured research published by Xiong Xiao.


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Normalization of the Speech Modulation Spectra for Robust Speech Recognition

Xiong Xiao; Eng Siong Chng; Haizhou Li

In this paper, we study a novel technique that normalizes the modulation spectra of speech signals for robust speech recognition. The modulation spectra of a speech signal are the power spectral density (PSD) functions of the feature trajectories generated from the signal; hence they describe the temporal structure of the features. The modulation spectra are distorted when the speech signal is corrupted by noise. We propose the temporal structure normalization (TSN) filter to reduce these noise effects by normalizing the modulation spectra to reference spectra. The TSN filter differs from other feature normalization methods, such as histogram equalization (HEQ), that only normalize the probability distributions of the speech features. Our previous work showed promising results for TSN on the small-vocabulary Aurora-2 task. In this paper, we examine the theoretical and practical issues of the TSN filter as follows. 1) We investigate the effects of noise on the speech modulation spectra and show the general characteristics of noisy speech modulation spectra; these observations help to further explain and justify the TSN filter. 2) We evaluate the TSN filter on the Aurora-4 task and demonstrate its effectiveness for a large vocabulary task. 3) We propose a segment-based implementation of the TSN filter that reduces the processing delay significantly without affecting performance. Overall, the TSN filter produces significant improvements over the baseline systems and delivers competitive results when compared to other state-of-the-art temporal filters.
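
The core operation can be illustrated with a short sketch: estimate the modulation spectrum (Welch PSD) of each feature trajectory, then filter the trajectory so its PSD is pushed toward a reference. This is a minimal sketch under assumptions, not the authors' implementation; `feats` (a frames-by-dimensions feature matrix), `ref_psd` (a reference PSD on the Welch frequency grid), and the filter length are hypothetical choices.

```python
import numpy as np
from scipy.signal import welch, firwin2, lfilter

def tsn_normalize(feats, ref_psd, numtaps=51, nperseg=128):
    # feats: (frames, dims) feature matrix; assumes the utterance has >= nperseg frames.
    # ref_psd: reference modulation spectrum on the Welch grid of length nperseg // 2 + 1.
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        traj = feats[:, d]
        f, psd = welch(traj, nperseg=nperseg)           # observed modulation spectrum of this trajectory
        gain = np.sqrt(ref_psd / np.maximum(psd, 1e-10))
        h = firwin2(numtaps, f / f[-1], gain)           # FIR filter with the PSD-equalizing magnitude response
        out[:, d] = lfilter(h, [1.0], traj)             # filter the trajectory toward the reference PSD
    return out
```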


International Conference on Acoustics, Speech, and Signal Processing | 2013

Synthetic speech detection using temporal modulation feature

Zhizheng Wu; Xiong Xiao; Eng Siong Chng; Haizhou Li

Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, anti-spoofing techniques that distinguish synthetic from human speech are necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and assume frame-by-frame independence, we propose to adopt magnitude/phase modulation features to detect synthetic speech. Modulation features derived from the magnitude/phase spectrum carry long-term temporal information of speech and may be able to detect temporal artifacts caused by the frame-by-frame processing used to synthesize the speech signal. Our synthetic speech detection results show that the modulation features provide information complementary to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and the 10.98% of MFCC features.
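
The idea of a temporal modulation feature can be illustrated with a short sketch (assumptions, not the paper's exact recipe): a second spectral analysis is applied along the time axis of a log-magnitude spectrogram, so long-term structure in each frequency band becomes the feature. The frame sizes and the number of modulation bins kept (`n_mod`) are illustrative.

```python
import numpy as np
from scipy.signal import stft

def magnitude_modulation_feature(x, fs=16000, n_mod=32):
    # x: mono waveform; assumes enough frames (> 2 * n_mod) for the truncation below.
    _, _, S = stft(x, fs=fs, nperseg=400, noverlap=240)   # 25 ms windows, 10 ms shift
    logmag = np.log(np.abs(S) + 1e-10)                    # (freq bins, frames)
    logmag -= logmag.mean(axis=1, keepdims=True)          # remove per-band mean
    mod = np.abs(np.fft.rfft(logmag, axis=1))             # modulation spectrum of each band
    return mod[:, :n_mod].reshape(-1)                     # keep low modulation bins, flatten to a vector
```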


International Conference on Acoustics, Speech, and Signal Processing | 2016

Deep beamforming networks for multi-channel speech recognition

Xiong Xiao; Shinji Watanabe; Hakan Erdogan; Liang Lu; John R. Hershey; Michael L. Seltzer; Guoguo Chen; Yu Zhang; Michael I. Mandel; Dong Yu

Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements from pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
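
A minimal PyTorch sketch of the unified-network idea described above, under simplifying assumptions (the paper's exact feature pipeline and layer sizes are not reproduced): one sub-network predicts per-frequency filter-and-sum weights from the array input, the enhanced spectrum feeds a second sub-network for senone classification, and a single cross-entropy loss back-propagates through both.

```python
import torch
import torch.nn as nn

class DeepBeamformingNet(nn.Module):
    def __init__(self, n_ch=8, n_freq=257, n_senones=2000):   # hypothetical sizes
        super().__init__()
        # predicts per-frequency, per-channel beamforming weights from array features
        self.bf_net = nn.Sequential(
            nn.Linear(n_ch * n_freq, 512), nn.ReLU(),
            nn.Linear(512, 2 * n_ch * n_freq))                 # real and imaginary parts
        # acoustic model on the enhanced log-spectrum
        self.am_net = nn.Sequential(
            nn.Linear(n_freq, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_senones))

    def forward(self, stft_ch):
        # stft_ch: complex tensor (batch, frames, channels, freq)
        b, t, c, f = stft_ch.shape
        feat = stft_ch.abs().reshape(b, t, c * f)              # array features for the BF net
        w = self.bf_net(feat).reshape(b, t, 2, c, f)
        w = torch.complex(w[:, :, 0], w[:, :, 1])              # (b, t, c, f) complex weights
        enhanced = (w.conj() * stft_ch).sum(dim=2)             # filter-and-sum beamforming
        logspec = torch.log(enhanced.abs() + 1e-6)
        return self.am_net(logspec)                            # per-frame senone logits

# One cross-entropy loss over both sub-networks, so gradients reach the beamformer:
# loss = nn.CrossEntropyLoss()(model(x).flatten(0, 1), senone_labels.flatten())
```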


IEEE Signal Processing Letters | 2007

Temporal Structure Normalization of Speech Feature for Robust Speech Recognition

Xiong Xiao; Eng Siong Chng; Haizhou Li

This letter presents a new feature normalization technique to normalize the temporal structure of speech features. The temporal structure of the features is partially represented by their power spectral density (PSD). We observed that the PSD of the features varies with the corrupting noise and the signal-to-noise ratio. To reduce the PSD variation due to noise, we propose to normalize the PSD of the features to a reference function by filtering the features. Experimental results on the AURORA-2 task show that the proposed approach, when combined with mean and variance normalization, improves speech recognition accuracy significantly; the system achieves a 69.11% relative error rate reduction over the baseline.


International Conference on Acoustics, Speech, and Signal Processing | 2015

Low-resource keyword search strategies for Tamil

Nancy F. Chen; Chongjia Ni; I-Fan Chen; Sunil Sivadas; Van Tung Pham; Haihua Xu; Xiong Xiao; Tze Siong Lau; Su Jun Leow; Boon Pang Lim; Cheung-Chi Leung; Lei Wang; Chin-Hui Lee; Alvina Goh; Eng Siong Chng; Bin Ma; Haizhou Li

We propose strategies for a state-of-the-art keyword search (KWS) system developed by the SINGA team in the context of the 2014 NIST Open Keyword Search Evaluation (OpenKWS14), using conversational Tamil provided by the IARPA Babel program. To tackle the low-resource challenges and the rich morphological nature of Tamil, we present highlights of our current KWS system, including: (1) submodular optimization data selection to maximize acoustic diversity through Gaussian component indexed N-grams; (2) keyword-aware language modeling; (3) subword modeling of morphemes and homophones.
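
As a hedged illustration of point (1): a greedy loop maximizing a monotone coverage function is the standard way to optimize a submodular data-selection objective. The sketch below assumes each utterance has been mapped to a set of Gaussian-component-index N-grams (`utt_ngrams`) and a duration in seconds (`utt_durations`); both names and the gain-per-second criterion are illustrative, not the authors' exact objective.

```python
def greedy_select(utt_ngrams, utt_durations, budget_hours):
    """Greedily pick utterances that add the most unseen n-grams per second of audio."""
    budget = budget_hours * 3600.0
    selected, covered, used = [], set(), 0.0
    remaining = set(utt_ngrams)
    while remaining and used < budget:
        best = max(remaining,
                   key=lambda u: len(utt_ngrams[u] - covered) / max(utt_durations[u], 1e-3))
        if not utt_ngrams[best] - covered:
            break                                   # no further diversity gain
        selected.append(best)
        covered |= utt_ngrams[best]
        used += utt_durations[best]
        remaining.remove(best)
    return selected
```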


International Conference on Acoustics, Speech, and Signal Processing | 2015

A learning-based approach to direction of arrival estimation in noisy and reverberant environments

Xiong Xiao; Shengkui Zhao; Xionghu Zhong; Douglas L. Jones; Eng Siong Chng; Haizhou Li

This paper presents a learning-based approach to the task of direction of arrival (DOA) estimation from microphone array input. Traditional signal processing methods such as the classic least squares (LS) method rely on strong assumptions about signal models and on accurate estimates of the time delay of arrival (TDOA). They work well only in relatively clean conditions and suffer from noise and reverberation distortions. In this paper, we propose a learning-based approach that can learn from a large amount of simulated noisy and reverberant microphone array inputs for robust DOA estimation. Specifically, we extract features from the generalised cross correlation (GCC) vectors and use a multilayer perceptron neural network to learn the nonlinear mapping from such features to the DOA. One advantage of the learning-based method is that as more training data becomes available, the DOA estimation becomes more accurate. Experimental results on simulated data show that the proposed learning-based method produces much better results than the state-of-the-art LS method. Testing results on real data recorded in meeting rooms show improved root-mean-square error (RMSE) compared to the LS method.
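
The feature extraction step can be sketched as follows: a GCC-PHAT vector around zero lag is computed for each microphone pair, and the concatenated vectors are fed to an MLP regressor trained on simulated rooms. This is a minimal illustration; the lag window, normalization, and regressor configuration are assumptions rather than the paper's settings.

```python
import numpy as np

def gcc_phat(x1, x2, n_lags=30):
    """GCC-PHAT between two channels, returned as a vector centered on zero lag."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # phase-transform weighting
    cc = np.concatenate((cc[-n_lags:], cc[:n_lags + 1]))    # lags -n_lags ... +n_lags
    return cc / (np.max(np.abs(cc)) + 1e-12)

# GCC vectors from all microphone pairs are concatenated and fed to a regressor, e.g.:
# from sklearn.neural_network import MLPRegressor
# model = MLPRegressor(hidden_layer_sizes=(512,)).fit(gcc_features, doa_degrees)
```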


2009 Oriental COCOSDA International Conference on Speech Database and Assessments | 2009

MASS: A Malay language LVCSR corpus resource

Tien-Ping Tan; Xiong Xiao; Enya Kong Tang; Eng Siong Chng; Haizhou Li

This paper presents the development of the speech, text, and pronunciation dictionary resources required to build a large vocabulary speech recognizer for the Malay language. The work is a collaboration among three universities: USM and MMU from Malaysia, and NTU from Singapore. The Malay speech corpus consists of read speech (speaker independent/dependent and accent independent/dependent) and broadcast news. To date, 90 speakers have been recorded, amounting to nearly 70 hours of read speech, and 10 hours of broadcast news from local TV stations in Malaysia have been transcribed. The text corpus consists of 700 MB of data extracted from Malaysia's local news web pages from 1998–2008, and a rule-based G2P tool was developed to generate the pronunciation dictionary.
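
To illustrate what a rule-based G2P for Malay might look like (a toy sketch only; the actual MASS rule set and phone inventory are not reproduced here), a longest-match table that applies digraph rules before single-letter rules already covers much of Malay's fairly regular orthography.

```python
# Hypothetical rule table: digraphs first, then single letters.
RULES = [("ng", "ng"), ("ny", "ny"), ("sy", "sh"), ("kh", "kh"),
         ("c", "ch"), ("a", "a"), ("e", "e"), ("i", "i"), ("o", "o"), ("u", "u"),
         ("b", "b"), ("d", "d"), ("g", "g"), ("h", "h"), ("j", "j"), ("k", "k"),
         ("l", "l"), ("m", "m"), ("n", "n"), ("p", "p"), ("r", "r"), ("s", "s"),
         ("t", "t"), ("w", "w"), ("y", "y")]

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:          # longest-first: digraphs before letters
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1                          # skip characters with no rule
    return phones

print(g2p("nyanyi"))   # ['ny', 'a', 'ny', 'i']
```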


International Conference on Acoustics, Speech, and Signal Processing | 2016

Speaker-aware training of LSTM-RNNs for acoustic modelling

Tian Tan; Yanmin Qian; Dong Yu; Souvik Kundu; Liang Lu; Khe Chai Sim; Xiong Xiao; Yu Zhang

Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise the speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors, and speaking rate. Furthermore, to factorize the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces word error rates by 6.5% relative to a very strong LSTM-RNN baseline that uses FMLLR features.
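
One speaker-aware training architecture can be sketched by appending a per-utterance speaker vector (e.g. an i-vector) to every acoustic frame before the LSTM stack. This minimal PyTorch sketch uses illustrative dimensions and is not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeakerAwareLSTM(nn.Module):
    def __init__(self, feat_dim=40, spk_dim=100, hidden=512, n_senones=2000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + spk_dim, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, n_senones)

    def forward(self, feats, spk_vec):
        # feats: (batch, frames, feat_dim); spk_vec: (batch, spk_dim), constant per utterance
        spk = spk_vec.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.lstm(torch.cat([feats, spk], dim=-1))   # speaker info at every frame
        return self.out(h)                                  # per-frame senone logits
```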


IEEE Transactions on Audio, Speech, and Language Processing | 2010

A Study on the Generalization Capability of Acoustic Models for Robust Speech Recognition

Xiong Xiao; Jinyu Li; Eng Siong Chng; Haizhou Li; Chin-Hui Lee

In this paper, we explore the generalization capability of acoustic models for improving speech recognition robustness against noise distortions. While generalization in statistical learning theory originally refers to a model's ability to generalize well on unseen testing data drawn from the same distribution as the training data, we show that good generalization capability is also desirable in mismatched cases. One way to obtain such general models is to use a margin-based model training method, e.g., soft-margin estimation (SME), which enables some tolerance to acoustic mismatches, without detailed knowledge of the distortion mechanisms, by enlarging the margins between competing models. Experimental results on the Aurora-2 and Aurora-3 connected digit string recognition tasks demonstrate that, by improving the model's generalization capability through SME training, speech recognition performance can be significantly improved in both matched and low-to-medium mismatched testing cases with no language model constraints. Recognition results show that SME indeed performs better with than without mean and variance normalization, and therefore provides a complementary benefit to conventional feature normalization techniques, such that they can be combined to further improve system performance. Although this study is focused on noisy speech recognition, we believe the proposed margin-based learning framework can be extended to deal with different types of distortions and robustness issues in other machine learning applications.
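
As a rough illustration of margin-based training (a simplified hinge-style analogue, not the SME formulation used in the paper): the training criterion penalizes samples whose correct-model score does not exceed the best competing score by at least a margin, which is what enlarges the separation between competing models and buys tolerance to mismatch.

```python
import numpy as np

def margin_loss(correct_scores, competitor_scores, margin=1.0):
    # correct_scores: (N,) log-likelihoods of the reference models
    # competitor_scores: (N, K) log-likelihoods of competing models
    best_competitor = competitor_scores.max(axis=1)
    separation = correct_scores - best_competitor
    return np.maximum(0.0, margin - separation).mean()   # zero loss once the margin is met
```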


International Conference on Acoustics, Speech, and Signal Processing | 2015

Language independent query-by-example spoken term detection using N-best phone sequences and partial matching

Haihua Xu; Peng Yang; Xiong Xiao; Lei Xie; Cheung-Chi Leung; Hongjie Chen; Jia Yu; Hang Lv; Lei Wang; Su Jun Leow; Bin Ma; Eng Siong Chng; Haizhou Li

In this paper, we propose a partial sequence matching based symbolic search (SS) method for the task of language-independent query-by-example spoken term detection. One main drawback of the conventional SS approach is the high miss rate for long queries. This is due to high variation in the symbolic representation of the query and the search audio, especially in the language-independent scenario. Successfully matching a query with its instances in the search audio becomes exponentially more difficult as the query grows longer. To reduce the miss rate, we propose a partial matching strategy in which all partial phone sequences of a query are used to search for query instances. Partial matching is also suitable for real-life applications where an exact match is usually not necessary and word prefix, suffix, and order should not affect the search result. When applied to the QUESST 2014 task, results show that partial matching of phone sequences reduces the miss rate of long queries significantly compared with the conventional full matching method. In addition, for the most challenging inexact matching queries (type 3), it also shows a clear advantage over DTW-based methods.
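
The partial matching strategy can be sketched as follows: every sufficiently long contiguous phone sub-sequence of the query is searched independently, and hits are scored relative to how much of the query they cover. `search_index` and the length-based score weighting are hypothetical stand-ins for the system's symbolic search and scoring.

```python
def partial_match_search(query_phones, search_index, min_len=3):
    """Search all partial phone sequences of a query; return hits sorted by score."""
    hits = []
    n = len(query_phones)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            sub = tuple(query_phones[i:j])
            for score, loc in search_index(sub):
                # longer partial matches are rewarded relative to the full query length
                hits.append((score * (j - i) / n, loc))
    return sorted(hits, reverse=True)
```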

Collaboration


Dive into Xiong Xiao's collaborations.

Top Co-Authors

Eng Siong Chng, Nanyang Technological University
Haizhou Li, National University of Singapore
Haihua Xu, Nanyang Technological University
Lei Xie, Northwestern Polytechnical University
Van Hai Do, Nanyang Technological University
Van Tung Pham, Nanyang Technological University