Wu Guo
University of Science and Technology of China
Publications
Featured research published by Wu Guo.
international conference on acoustics, speech, and signal processing | 2009
Haizhou Li; Bin Ma; Kong-Aik Lee; Hanwu Sun; Donglai Zhu; Khe Chai Sim; Changhuai You; Rong Tong; Ismo Kärkkäinen; Chien-Lin Huang; Vladimir Pervouchine; Wu Guo; Yijie Li; Li-Rong Dai; Mohaddeseh Nosratighods; Thiruvaran Tharmarajah; Julien Epps; Eliathamby Ambikairajah; Eng Siong Chng; Tanja Schultz; Qin Jin
This paper describes the performance of the I4U speaker recognition system in the NIST 2008 Speaker Recognition Evaluation. The system consists of seven subsystems, each with different cepstral features and classifiers. We describe the I4U primary system and report its core test results as submitted, which were among the best-performing submissions. The I4U effort was led by the Institute for Infocomm Research, Singapore (IIR), with contributions from the University of Science and Technology of China (USTC), the University of New South Wales, Australia (UNSW), Nanyang Technological University, Singapore (NTU), and Carnegie Mellon University, USA (CMU).
international symposium on chinese spoken language processing | 2010
LianWu Chen; Wu Guo; Li-Rong Dai
With the development of HMM-based parametric speech synthesis algorithms, it has become easy for impostors to generate synthetic speech with a specific speaker's characteristics, which poses a serious threat to state-of-the-art speaker verification systems. In this paper, we investigate the differences in Mel-cepstral (MCEP) coefficients between natural and synthetic speech. Experiments demonstrate that synthetic speech can be discriminated from natural speech using the higher-order MCEP coefficients.
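The detection idea, separating synthetic from natural speech by the energy carried in the higher-order cepstral coefficients, can be illustrated with a minimal sketch. The 13/25 order split, the threshold, and the toy frames below are hypothetical illustrations, not the paper's actual configuration:

```python
def higher_order_energy(mcep_frames, low_order=13):
    """Mean squared magnitude of the cepstral coefficients above `low_order`,
    averaged over all frames of an utterance."""
    total, count = 0.0, 0
    for frame in mcep_frames:
        for c in frame[low_order:]:
            total += c * c
            count += 1
    return total / count if count else 0.0

def classify(mcep_frames, threshold, low_order=13):
    """Label an utterance 'synthetic' when its higher-order cepstral energy
    falls below the threshold: HMM-generated spectra tend to be over-smoothed,
    which attenuates the fine spectral detail carried by high-order MCEPs."""
    if higher_order_energy(mcep_frames, low_order) < threshold:
        return "synthetic"
    return "natural"

# Toy frames: 25-dimensional MCEP vectors with strong vs. weak high-order detail.
natural_frames = [[0.5] * 13 + [0.2] * 12 for _ in range(10)]
synthetic_frames = [[0.5] * 13 + [0.01] * 12 for _ in range(10)]
print(classify(natural_frames, threshold=0.01))    # → natural
print(classify(synthetic_frames, threshold=0.01))  # → synthetic
```

In practice the threshold would be tuned on held-out natural and synthetic data rather than fixed by hand.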
international conference on acoustics, speech, and signal processing | 2009
Wu Guo; Yanhua Long; Yijie Li; Lei Pan; Eryu Wang; Li-Rong Dai
This paper describes the iFLY system submitted to the NIST 2008 Speaker Recognition Evaluation (SRE), which achieved excellent performance in the evaluation. Our primary system is a fusion of two subsystems, GMM-UBM and GMM-SVM. For each subsystem, two kinds of short-time acoustic features, PLP and LPCC, are adopted. We focus on three key issues in this evaluation: channel compensation, multilingual and bilingual cues, and voice activity detection. We also point out that data selection and factor analysis play key roles in improving the system.
international conference on acoustics, speech, and signal processing | 2014
Diyuan Liu; Si Wei; Wu Guo; Yebo Bao; Shifu Xiong; Li-Rong Dai
This paper proposes a lattice-based sequential discriminative training method to extract more discriminative bottleneck features. In our method, the bottleneck neural network is first trained with the cross-entropy criterion, and then only the weights of the bottleneck layer are retrained with a sequential criterion. If the outputs of the layer before the bottleneck are treated as the raw features, the new method is equivalent to a linear feature transformation algorithm. This linearity makes the optimization much easier than updating the whole neural network. Just like fMPE and RDLT, the neural network is retrained with batch-mode gradient descent, making the training easy to implement in parallel. Meanwhile, batch-mode optimization can naturally handle the indirect gradient, making the optimization more precise. Experimental results on a Mandarin transcription task and the Switchboard task have shown the effectiveness of the proposed method, with the CER decreasing from 12.2% to 11.3% and the WER from 16.1% to 15.0%, respectively.
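Because only the bottleneck weights are updated while the pre-bottleneck layers stay frozen, the retraining reduces to batch-mode gradient descent on a single linear transform. A minimal sketch, with a squared-error objective standing in for the lattice-based sequential criterion and toy data in place of real pre-bottleneck features:

```python
def retrain_bottleneck(raw_feats, targets, weights, lr=0.1, epochs=50):
    """Batch-mode gradient descent on the bottleneck weights only.
    The layers before the bottleneck are frozen, so their outputs
    (`raw_feats`) are fixed and the update is a purely linear problem.
    A squared-error loss stands in for the sequential criterion here."""
    n = len(raw_feats)
    d_in, d_out = len(raw_feats[0]), len(targets[0])
    for _ in range(epochs):
        # Accumulate the full-batch gradient before updating (batch mode).
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, t in zip(raw_feats, targets):
            y = [sum(weights[o][i] * x[i] for i in range(d_in)) for o in range(d_out)]
            for o in range(d_out):
                err = y[o] - t[o]
                for i in range(d_in):
                    grad[o][i] += err * x[i]
        for o in range(d_out):
            for i in range(d_in):
                weights[o][i] -= lr * grad[o][i] / n
    return weights

# Toy problem: recover the 1x2 linear map x -> 2*x[0] + 3*x[1].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T = [[2.0], [3.0], [5.0]]
W = retrain_bottleneck(X, T, [[0.0, 0.0]], lr=0.5, epochs=200)
```

Since the problem is linear, the full-batch gradient is exact and the descent converges to the least-squares solution, which is what makes the parallel batch-mode formulation attractive in the paper's setting.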
international symposium on chinese spoken language processing | 2010
Ling-Hui Chen; Zhen-Hua Ling; Wu Guo; Li-Rong Dai
In this paper, we propose a Gaussian mixture model (GMM) based voice conversion method using explicit feature transform models. A piecewise linear transform with a stochastic bias is adopted to represent the relationship between the spectral features of the source and target speakers. These explicit transformations are integrated into the training of the GMM for the joint probability density of source and target features. The maximum likelihood parameter generation algorithm with dynamic features is used to generate the converted spectral trajectories. Our method can model the cross-dimension correlations of the joint density GMM (JDGMM) while significantly reducing the computational cost compared with a JDGMM with full covariance matrices. Experimental results show that the proposed method outperformed the conventional GMM-based method in cross-gender voice conversion.
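Once the GMM and the per-component transforms are trained, the conversion itself is a posterior-weighted sum of linear maps. A one-dimensional sketch with hypothetical component parameters (a real system operates on spectral feature vectors with matrix transforms and the stochastic bias term):

```python
import math

def gaussian(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def convert(x, components):
    """Piecewise linear conversion: each mixture component k contributes a
    linear transform A_k * x + b_k, weighted by its posterior p(k | x).
    `components` is a list of (weight, mean, var, A, b) tuples (1-D here)."""
    likelihoods = [w * gaussian(x, m, v) for (w, m, v, A, b) in components]
    total = sum(likelihoods)
    posteriors = [l / total for l in likelihoods]
    return sum(p * (A * x + b)
               for p, (w, m, v, A, b) in zip(posteriors, components))

# Two hypothetical components: one covering the region around x=0,
# one around x=5, each with its own linear map.
comps = [(0.5, 0.0, 1.0, 1.2, 0.1), (0.5, 5.0, 1.0, 0.8, 1.0)]
```

Near a component's mean its posterior dominates, so `convert(0.0, comps)` is governed almost entirely by the first map and `convert(5.0, comps)` by the second; between the means the output interpolates smoothly, which is the "piecewise linear" behavior the paper builds on.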
signal processing systems | 2016
Liping Chen; Kong Aik Lee; Bin Ma; Wu Guo; Haizhou Li; Li-Rong Dai
The total variability model has been shown to be effective for text-independent speaker verification. It provides a tractable way to estimate the so-called i-vector, which describes the speaker and session variability rendered in a whole utterance. To capture the local session variability that an i-vector neglects, local variability models were proposed, including the Gaussian-oriented and the dimension-oriented local variability models. This paper presents a consolidated study of the total and local variability models and gives a full comparison between them under the same framework. In addition, new extensions are proposed for the existing local variability models. The comparison between the total variability model and the local variability models is carried out through experiments on the NIST SRE'08 and SRE'10 datasets. Furthermore, in the experiments, the dimension-oriented local variability models show their capability to capture session variability that is complementary to that estimated by the total variability model.
international symposium on chinese spoken language processing | 2014
Shifu Xiong; Wu Guo; Diyuan Liu
In this paper, we report our recent progress on under-resourced-language automatic speech recognition (ASR) and the subsequent spoken term detection (STD). The experiments are carried out on the National Institute of Standards and Technology (NIST) Open Keyword Search 2013 (OpenKWS13) evaluation Vietnamese corpus. Compared with a conventional ASR system, we made the following modifications to improve recognition accuracy. First, pitch features and tone modeling are applied to capture pitch and tone information, since Vietnamese is a tonal language. Second, automatic question generation for the decision tree is used for state tying, to address the lack of linguistic knowledge. Finally, we investigate the rectified linear unit (ReLU) activation function and cross-lingual pre-training in deep neural network (DNN) acoustic model training. In the STD stage, we adopt term-dependent score normalization and combine the outputs of diverse ASR systems to increase the actual term-weighted value (ATWV). After applying these methods, our best single system achieves 48.32% word accuracy, and STD system combination yields an ATWV of 0.398 on the OpenKWS13 Vietnamese development set.
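Term-dependent score normalization can take several forms; one common form in keyword search is sum-to-one normalization, where each keyword's detection scores are rescaled by that keyword's score total so that rare and frequent terms contribute comparably under the ATWV metric. A sketch under that assumption (the hit lists and the `gamma` exponent are illustrative, not the paper's exact recipe):

```python
def term_normalize(hits, gamma=1.0):
    """Term-dependent sum-to-one normalization: every detection score for a
    keyword is divided by the sum of all (gamma-exponentiated) scores for
    that same keyword, yielding a per-term relative confidence."""
    normalized = {}
    for term, scores in hits.items():
        total = sum(s ** gamma for s in scores)
        normalized[term] = [s ** gamma / total for s in scores]
    return normalized

# Hypothetical raw detection scores per keyword.
raw = {"hanoi": [3.0, 1.0], "pho": [0.2]}
norm = term_normalize(raw)
```

After normalization a single confident hit of a rare term (`"pho"` above gets 1.0) can clear a decision threshold that the same raw score would have missed, which is the behavior that improves ATWV.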
international symposium on chinese spoken language processing | 2012
Kui Wu; Yan Song; Wu Guo; Li-Rong Dai
Recently, speaker clustering approaches exploiting the intra-conversation variability in the total variability space have shown promising performance. However, variability also exists between different segments of the same speaker within a conversation, termed intra-conversation intra-speaker variability, which may scatter the distribution of the corresponding i-vector representations of short speech segments and degrade clustering performance. To address this issue, we propose a new speaker clustering approach based on an extended total variability factor analysis. In our proposed method, the intra-conversation total variability space is divided into an inter-speaker and an intra-speaker variability space. By explicitly compensating for the intra-conversation intra-speaker variability, short speech segments can be represented more accurately. To evaluate the effectiveness of the proposed method, we conduct extensive experiments on the NIST SRE 2008 summed-channel telephone dataset. The experimental results show that the proposed method clearly outperforms other state-of-the-art speaker clustering techniques in terms of clustering error rate.
international symposium on chinese spoken language processing | 2010
Chen-Yu Yang; Zhen-Hua Ling; Heng Lu; Wu Guo; Li-Rong Dai
In this paper, an automatic prosodic phrase boundary labeling method for speech synthesis databases is presented. The method has two stages: training and labeling. In the training stage, context-dependent HMMs, as commonly adopted in HMM-based parametric speech synthesis, are estimated on a training database with manual prosodic labels. In the labeling stage, the maximum likelihood criterion derived from the trained HMMs, together with an exhaustive search, is employed to find the optimal phrase boundary positions for an input sentence based on its acoustic features. The experimental results show that an F-score of 76.46% can be achieved for prosodic phrase boundary detection on our Mandarin TTS corpus, which is close to the accuracy of experienced human labelers.
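The labeling-stage search can be pictured as follows: every subset of candidate junctions between syllables is one possible boundary labeling, each labeling is scored by the trained HMMs' log-likelihood, and the highest-scoring subset wins. In this sketch a toy scoring function stands in for the real acoustic likelihood:

```python
from itertools import combinations

def best_boundaries(num_junctions, log_likelihood):
    """Exhaustive search over prosodic phrase boundary placements.
    Every subset of the inter-syllable junctions is a candidate labeling;
    the subset maximizing the (HMM-derived) log-likelihood is returned."""
    best, best_score = (), float("-inf")
    for k in range(num_junctions + 1):
        for subset in combinations(range(num_junctions), k):
            score = log_likelihood(subset)
            if score > best_score:
                best, best_score = subset, score
    return best

# Hypothetical scorer rewarding boundaries at junctions 2 and 5 and
# penalizing boundaries elsewhere (a stand-in for the real HMM likelihood).
def toy_score(subset):
    return sum(1.0 if j in (2, 5) else -1.0 for j in subset)

print(best_boundaries(7, toy_score))  # → (2, 5)
```

The search is exponential in the number of junctions, which is tolerable for single sentences; longer inputs would call for pruning or dynamic programming instead.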
international conference on acoustics, speech, and signal processing | 2011
Yanhua Long; Zhi-Jie Yan; Frank K. Soong; Li-Rong Dai; Wu Guo
This paper proposes a feature extraction method for speaker characterization that explores the relationship between two distinct components of the speech signal: harmonics, which account for the periodicity of the signal, and modulated noise, which accounts for the turbulence of the glottal airflow. The harmonic and noise parts of the speech signal are decomposed using the Harmonic plus Noise Model (HNM) approach. We estimate spectral subband energy ratios (SSERs) as speaker-characteristic features, which are expected to reflect the interaction of the vocal tract and glottal airflow of individual speakers for speaker verification. Speaker verification experiments based on a GMM-UBM system have shown the effectiveness of the SSER features, reducing the equal error rate by 27.2% when combined with conventional MFCC features.
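After the HNM decomposition (not shown here; that is the hard part), the SSER features reduce to per-subband log energy ratios between the harmonic and noise spectra. A sketch assuming the two power spectra are already available (the function name, equal-width banding, and band count are illustrative assumptions):

```python
import math

def sser_features(harmonic_spec, noise_spec, num_bands=4):
    """Spectral subband energy ratios: split the harmonic and noise power
    spectra into equal-width subbands and take the log energy ratio in each,
    reflecting how source and vocal tract interact band by band."""
    n = len(harmonic_spec)
    width = n // num_bands
    ratios = []
    for b in range(num_bands):
        h = sum(harmonic_spec[b * width:(b + 1) * width])
        v = sum(noise_spec[b * width:(b + 1) * width])
        ratios.append(math.log(h / v))
    return ratios

# Toy spectra: harmonic energy dominates the low band, noise the high band.
harm = [8.0] * 4 + [1.0] * 4
noise = [1.0] * 4 + [8.0] * 4
feats = sser_features(harm, noise, num_bands=2)
```

A speaker with a breathier voice quality would shift these ratios downward in the higher bands, which is the kind of source-related cue the SSER features are meant to capture alongside MFCCs.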