Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Mingxing Xu is active.

Publication


Featured research published by Mingxing Xu.


International Conference on Acoustics, Speech, and Signal Processing | 2007

GMM Supervector Based SVM with Spectral Features for Speech Emotion Recognition

Hao Hu; Mingxing Xu; Wei Wu

Speech emotion recognition is a challenging yet important speech technology. In this paper, the GMM supervector based SVM is applied to this task with spectral features. A GMM is trained for each emotional utterance, and the corresponding GMM supervector is used as the input feature for an SVM. Experimental results on an emotional speech database demonstrate that the GMM supervector based SVM outperforms the standard GMM on speech emotion recognition.
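
As a rough illustration of the supervector idea, the sketch below MAP-adapts the means of a toy UBM to one utterance and stacks them into a single vector. This is an illustrative simplification, not the paper's implementation: spherical unit-variance components, equal weights, and the relevance factor `tau` are all assumptions of this example.

```python
import math

def posterior(frame, means, var=1.0):
    """Responsibilities of each (spherical, equal-weight) Gaussian for one frame."""
    logps = [-sum((f - m) ** 2 for f, m in zip(frame, mu)) / (2 * var) for mu in means]
    mx = max(logps)
    ps = [math.exp(lp - mx) for lp in logps]
    s = sum(ps)
    return [p / s for p in ps]

def gmm_supervector(frames, ubm_means, tau=10.0):
    """MAP-adapt the UBM means to one utterance and stack them into a supervector."""
    k, d = len(ubm_means), len(ubm_means[0])
    n = [0.0] * k                         # soft counts per component
    fsum = [[0.0] * d for _ in range(k)]  # first-order statistics
    for x in frames:
        gam = posterior(x, ubm_means)
        for i in range(k):
            n[i] += gam[i]
            for j in range(d):
                fsum[i][j] += gam[i] * x[j]
    sv = []
    for i in range(k):
        alpha = n[i] / (n[i] + tau)       # relevance-MAP interpolation weight
        for j in range(d):
            ex = fsum[i][j] / n[i] if n[i] > 0 else ubm_means[i][j]
            sv.append(alpha * ex + (1 - alpha) * ubm_means[i][j])
    return sv
```

In the paper's setting, the resulting per-utterance supervector would then be fed to an SVM as its input feature.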


IEEE Transactions on Audio, Speech, and Language Processing | 2007

A Cohort-Based Speaker Model Synthesis for Mismatched Channels in Speaker Verification

Wei Wu; Thomas Fang Zheng; Mingxing Xu; Frank K. Soong

Mismatch between enrollment and test data is one of the top performance-degrading factors in speaker recognition applications. The mismatch is particularly severe over public telephone networks, where input speech is collected over different handsets and transmitted over different channels from one trial to the next. In this paper, a cohort-based speaker model synthesis (SMS) algorithm is proposed, designed to synthesize robust speaker models without requiring channel-specific enrollment data. The algorithm utilizes a priori knowledge of channels, extracted from speaker-specific cohort sets, to synthesize such speaker models. Cohort selection in the proposed SMS can be either speaker-specific or Gaussian-component based. Results on the China Criminal Police College (CCPC) speaker recognition corpus, which contains utterances from both landline and mobile channels, show that the new algorithm yields significant speaker verification performance improvements over HTnorm and universal background model (UBM)-based speaker model synthesis.
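
The core idea, synthesizing a model for a channel on which the speaker never enrolled by borrowing channel behavior from a cohort, can be sketched as follows. This is a deliberately simplified illustration: real SMS transforms full GMM parameters per Gaussian component, whereas this toy version only shifts mean vectors, and all names are hypothetical.

```python
def channel_offset(cohort_models, src, tgt):
    """Average per-dimension shift the cohort speakers exhibit from src to tgt channel."""
    d = len(next(iter(cohort_models.values()))[src])
    off = [0.0] * d
    for ch_models in cohort_models.values():
        for j in range(d):
            off[j] += ch_models[tgt][j] - ch_models[src][j]
    return [o / len(cohort_models) for o in off]

def synthesize_model(enrolled_means, cohort_models, src, tgt):
    """Synthesize a speaker model for an unseen channel by applying the cohort's offset."""
    off = channel_offset(cohort_models, src, tgt)
    return [m + o for m, o in zip(enrolled_means, off)]
```

Here `cohort_models` maps each cohort speaker to its per-channel mean vectors; a speaker enrolled only on `src` still gets a model usable for trials arriving over `tgt`.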


International Conference on Acoustics, Speech, and Signal Processing | 2016

Question detection from acoustic features using recurrent neural network with gated recurrent unit

Yaodong Tang; Yuchen Huang; Zhiyong Wu; Helen M. Meng; Mingxing Xu; Lianhong Cai

Question detection is important for many speech applications. Only parts of a speech utterance provide useful clues for question detection. Previous work on question detection from acoustic features in Mandarin conversation is weak at capturing such temporal context information, which can be modeled naturally by a recurrent neural network (RNN) structure. In this paper, we investigate recurrent approaches to this problem. Based on the gated recurrent unit (GRU), we build different RNN and bidirectional RNN (BRNN) models to extract efficient features at the segment and utterance levels. The particular advantage of the GRU is that it can determine a proper time scale over which to extract high-level contextual features. Experimental results show that the features extracted at that proper time scale make the classifier perform better than the baseline method with its pre-designed lexical and acoustic feature set.
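
A GRU step is compact enough to write out directly; the sketch below is a plain-Python forward pass showing the update gate, reset gate, and candidate state. It is illustrative only: the paper's models are trained networks, while the parameter dictionary `p` here is an assumption of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    hc = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * hci for zi, hi, hci in zip(z, h, hc)]

def gru_encode(xs, p, dim):
    """Run a GRU over a feature sequence; the final state is an utterance-level feature."""
    h = [0.0] * dim
    for x in xs:
        h = gru_step(x, h, p)
    return h
```

Running `gru_encode` over a frame sequence yields a fixed-length utterance-level feature of the kind fed to the final classifier.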


International Conference on Acoustics, Speech, and Signal Processing | 2016

SVR based double-scale regression for dynamic emotion prediction in music

Haishu Xianyu; Xinxing Li; Wenxiao Chen; Fanhang Meng; Jiashen Tian; Mingxing Xu; Lianhong Cai

Dynamic music emotion prediction aims to recognize the continuously varying emotion contained in music and has various applications. In recent years dynamic music emotion recognition has been widely studied, yet the internal structure of emotion in music remains unclear. In a data analysis of the database provided by the Free Music Archive (FMA), we find that emotion dynamics show different properties at different time scales. Based on this observation, we propose a new method, double-scale support vector regression (DS-SVR), to recognize music emotion dynamically. The method decouples the two scales of emotion dynamics and recognizes them separately. We apply DS-SVR to the MediaEval 2015 Emotion in Music database and achieve performance significantly better than the baseline provided by the organizers.
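
The double-scale idea, separating slow emotion drift from fast local fluctuation, can be illustrated with a simple decomposition. This is a sketch, not the paper's method: here the long scale is a plain moving average, and the per-scale SVR regressors are omitted.

```python
def moving_average(seq, win):
    """Long-scale trend: centered moving average with edge clamping."""
    n = len(seq)
    out = []
    for i in range(n):
        lo, hi = max(0, i - win // 2), min(n, i + win // 2 + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def double_scale_split(seq, win=5):
    """Decompose an emotion curve into a slow trend plus a fast residual (the two scales)."""
    trend = moving_average(seq, win)
    residual = [s - t for s, t in zip(seq, trend)]
    return trend, residual
```

By construction the two scales sum back to the original curve, so separately predicted trend and residual can simply be added to form the final prediction.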


International Conference on Acoustics, Speech, and Signal Processing | 2016

A deep bidirectional long short-term memory based multi-scale approach for music dynamic emotion prediction

Xinxing Li; Haishu Xianyu; Jiashen Tian; Wenxiao Chen; Fanhang Meng; Mingxing Xu; Lianhong Cai

Music dynamic emotion prediction is a challenging and significant task. In this paper, we adopt the dimensional valence-arousal (V-A) emotion model to represent the dynamic emotion in music. Considering the strong contextual correlation within music feature sequences and the strength of bidirectional long short-term memory (BLSTM) networks in capturing sequence information, we propose a multi-scale approach, deep BLSTM (DBLSTM) based multi-scale regression fused with an extreme learning machine (ELM), to predict the V-A values in music. Our approach achieved the best performance among all submitted results on the MediaEval 2015 Emotion in Music database. The experimental results demonstrate the effectiveness of the proposed multi-scale DBLSTM-ELM model.
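
The ELM fusion stage can be sketched in isolation: an ELM draws a random hidden layer and solves for the output weights in closed form, which is what makes it attractive for fusing the per-scale DBLSTM outputs. This is a stand-alone illustrative sketch; the random-weight range, the `ridge` term, and the hidden-layer size are assumptions of this example, not the paper's configuration.

```python
import math, random

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination (A small and well-conditioned)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

class ELM:
    """Extreme learning machine: random hidden layer, least-squares output weights."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.beta = None

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
                for row, bi in zip(self.W, self.b)]

    def fit(self, X, y, ridge=1e-6):
        H = [self._hidden(x) for x in X]
        k = len(self.W)
        # normal equations: (H^T H + ridge * I) beta = H^T y
        A = [[sum(H[r][i] * H[r][j] for r in range(len(H))) + (ridge if i == j else 0)
              for j in range(k)] for i in range(k)]
        rhs = [sum(H[r][i] * y[r] for r in range(len(H))) for i in range(k)]
        self.beta = solve(A, rhs)

    def predict(self, x):
        return sum(b * h for b, h in zip(self.beta, self._hidden(x)))
```

In a fusion setting, each input `x` would hold the per-scale regressor outputs for one time step and `y` the target V-A value.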


International Conference on Natural Language Processing | 2005

Language model adaptation based on the classification of a trigram's language style feature

Qi Liang; Thomas Fang Zheng; Mingxing Xu; Wenhu Wu

In this paper, an adaptation method for the language style of a language model is proposed, based on the differences between spoken and written language. Several interpolation methods based on trigram counts are used for adaptation. An interpolation method incorporating Katz smoothing computes weights according to the confidence score of a trigram. An adaptation method based on the classification of a trigram's style feature computes weights dynamically according to the trigram's language style tendency, and several weight generation functions are proposed. Experiments on spoken-language Chinese corpora show that these methods, especially the one considering both a trigram's confidence and its style tendency, reduce the Chinese character error rate in pinyin-to-character conversion.
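
The style-dependent weighting can be sketched as a count-ratio weight function: the more spoken-style a trigram looks, the more the spoken model is trusted. This is one hypothetical choice of weight generation function, not necessarily one of those proposed in the paper.

```python
def style_weight(trigram, spoken_counts, written_counts, floor=0.5):
    """Dynamic interpolation weight from the trigram's style tendency."""
    s = spoken_counts.get(trigram, 0)
    w = written_counts.get(trigram, 0)
    if s + w == 0:
        return floor          # unseen trigram: fall back to a neutral weight
    return s / (s + w)

def interpolated_prob(trigram, p_spoken, p_written, spoken_counts, written_counts):
    """Interpolate the two language models with a per-trigram dynamic weight."""
    lam = style_weight(trigram, spoken_counts, written_counts)
    return lam * p_spoken + (1 - lam) * p_written
```

A trigram seen mostly in the spoken corpus thus pulls the interpolated probability toward the spoken-language model.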


International Conference on Acoustics, Speech, and Signal Processing | 2017

Speaker segmentation using deep speaker vectors for fast speaker change scenarios

Renyu Wang; Mingliang Gu; Lantian Li; Mingxing Xu; Thomas Fang Zheng

A novel speaker segmentation approach based on deep neural networks is proposed and investigated. The approach uses deep speaker vectors (d-vectors) to represent speaker characteristics and to find speaker change points. The d-vector is a frame-level speaker-discriminative feature whose discriminative training process matches the goal of distinguishing a speaker change point from a single-speaker speech segment within a short time window. Following traditional metric-based segmentation, each analysis window contains two sub-windows and is shifted along the audio stream to detect speaker change points, where the speaker characteristics of each sub-window are represented by the mean of the deep speaker vectors over all its frames. Experiments in fast speaker change scenarios show that the proposed method detects speaker change points more quickly and more effectively than commonly used segmentation methods.
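
The sliding two-sub-window scheme is easy to sketch: compare the mean d-vector of the left sub-window with that of the right one, and flag a change when they are far apart. This is an illustrative sketch; the cosine distance, window size, and threshold here are assumptions, not the paper's settings.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_vector(frames):
    d = len(frames[0])
    return [sum(f[j] for f in frames) / len(frames) for j in range(d)]

def detect_changes(dvectors, win=4, threshold=0.5):
    """Slide a two-sub-window analysis window over frame-level d-vectors;
    a large distance between the sub-window means marks a speaker change."""
    points = []
    for t in range(win, len(dvectors) - win + 1):
        left = mean_vector(dvectors[t - win:t])
        right = mean_vector(dvectors[t:t + win])
        if cosine_distance(left, right) > threshold:
            points.append(t)
    return points
```

In practice, nearby detections would be merged by keeping the local distance peak.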


Conference of the International Speech Communication Association | 2016

Combining CNN and BLSTM to Extract Textual and Acoustic Features for Recognizing Stances in Mandarin Ideological Debate Competition.

Linchuan Li; Zhiyong Wu; Mingxing Xu; Helen M. Meng; Lianhong Cai

Recognizing stances in ideological debates is a relatively new and challenging problem in opinion mining. While previous work mainly focused on the text modality, in this paper we try to recognize stances from both text and acoustic modalities, where deriving more representative textual and acoustic features remains an open problem. Inspired by the promising performance of neural network models in natural language understanding and speech processing, we propose a unified framework named C-BLSTM, combining a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) for feature extraction. In C-BLSTM, the CNN extracts higher-level local features of text (n-grams) and speech (emphasis, intonation), while the BLSTM extracts bottleneck features for context-sensitive feature compression and target-related feature representation. A maximum entropy model is then used to recognize stances from the bimodal textual-acoustic bottleneck features. Experiments on four debate datasets show that C-BLSTM outperforms all challenging baseline methods; in particular, acoustic intonation and emphasis features further improve the F1-measure by 6% compared with textual features alone.
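
The CNN half of the framework, extracting n-gram-like local features and max-pooling them over time, can be sketched as follows. This is a minimal illustration with hand-set kernels; the real model learns many kernels jointly with the BLSTM, which is omitted here.

```python
def conv1d_maxpool(embeddings, kernels):
    """n-gram feature extraction: each kernel spans `width` consecutive token
    embeddings; max-pooling over time yields a fixed-length utterance feature."""
    feats = []
    for kernel in kernels:            # kernel: list of `width` weight vectors
        width = len(kernel)
        acts = []
        for i in range(len(embeddings) - width + 1):
            acts.append(sum(w * x
                            for wv, ev in zip(kernel, embeddings[i:i + width])
                            for w, x in zip(wv, ev)))
        feats.append(max(acts) if acts else 0.0)
    return feats
```

Max-pooling makes the feature length independent of the utterance length, so utterances of any duration map to a fixed-size vector for the downstream classifier.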


Conference of the International Speech Communication Association | 2016

Analysis on Gated Recurrent Unit Based Question Detection Approach.

Yaodong Tang; Zhiyong Wu; Helen M. Meng; Mingxing Xu; Lianhong Cai

Recent studies have shown that various kinds of recurrent neural networks (RNNs) are becoming powerful sequence models in speech-related applications. Our previous work on detecting questions in Mandarin speech showed that a gated recurrent unit (GRU) based RNN can achieve significantly better results. In this paper, we try to open the black box and find correlations between the inner architecture of the GRU and phonetic features of question sentences. We find that both the update gate and the reset gate in GRU blocks react when people begin to pronounce a word. Based on these reactions, experiments are conducted to show how the GRU-based question detection approach behaves with respect to three important factors: keywords or special syntactic structures of questions, final particles, and interrogative intonation. We also observe that the update gate and the reset gate do not collaborate well on our dataset. Based on these asynchronous actions of the two gates, we adapt the structure of the GRU block to our dataset and obtain a further performance improvement on the question detection task.


Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2014

Intrinsic variation robust speaker verification based on sparse representation

Yi Nie; Mingxing Xu; Haishu Xianyu

Intrinsic variation is one of the major factors that dramatically degrade the performance of speaker verification systems. In this paper, we focus on alleviating the influence of intrinsic variation using sparse representation. Because an over-complete dictionary increases the flexibility and adaptability of signal representation, we expect the redundancy of the dictionary to help capture the implicit properties of intrinsic variation within each speaker. Both an exemplar dictionary and a learned dictionary are evaluated on an intrinsic-variation corpus and compared with GMM-UBM, joint factor analysis (JFA), and i-vector systems. For dictionary learning we choose the K-SVD algorithm, a generalization of the K-means algorithm that updates dictionary atoms with singular value decomposition (SVD). The experimental results show that the two sparse representation systems consistently achieve higher accuracy than the GMM-UBM, JFA, and i-vector systems; in particular, they outperform GMM-UBM by 37.17% and 41.55%, respectively. We also find that the K-SVD based sparse representation system performs best overall, achieving an average equal error rate (EER) of 14.23%.
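
Sparse-representation classification by reconstruction residual can be sketched with a greedy matching pursuit over unit-norm atoms. This is an illustrative simplification: the paper uses K-SVD-learned dictionaries and a proper sparse solver, whereas this toy version uses plain matching pursuit over exemplar atoms.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def matching_pursuit(x, atoms, n_iter=5):
    """Greedy sparse coding: repeatedly pick the (unit-norm) atom most correlated
    with the residual and subtract its projection."""
    residual = list(x)
    coef = [0.0] * len(atoms)
    for _ in range(n_iter):
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coef[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coef, residual

def classify(x, dictionary):
    """Assign x to the speaker whose atoms reconstruct it with the smallest residual."""
    best, best_res = None, float("inf")
    for speaker, atoms in dictionary.items():
        _, residual = matching_pursuit(x, atoms)
        if norm(residual) < best_res:
            best, best_res = speaker, norm(residual)
    return best
```

The per-speaker residual comparison is what ties the sparse code back to a verification decision: a genuine trial should be reconstructed well by the claimed speaker's atoms.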

Collaboration


Dive into Mingxing Xu's collaborations.

Top Co-Authors


Helen M. Meng

The Chinese University of Hong Kong
