
Publication


Featured research published by Zixing Zhang.


IEEE Signal Processing Letters | 2014

Autoencoder-based Unsupervised Domain Adaptation for Speech Emotion Recognition

Jun Deng; Zixing Zhang; Florian Eyben; Björn W. Schuller

With the availability of speech data obtained from different devices and varied acquisition conditions, we are often faced with scenarios where the intrinsic discrepancy between the training and the test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an unsupervised domain adaptation method based on an adaptive denoising autoencoder, where prior knowledge learned from a target set is used to regularize the training on a source set. Our goal is to achieve a matched feature space representation for the target and source sets while ensuring target domain knowledge transfer. The method has been successfully evaluated on the FAU Aibo Emotion Corpus of the 2009 INTERSPEECH Emotion Challenge as target corpus and two other publicly available speech emotion corpora as source corpora. The experimental results show that our method significantly improves over the baseline performance and outperforms related feature domain adaptation methods.
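The abstract does not spell out the architecture; as a rough illustration of the general idea only (not the authors' exact formulation), the sketch below trains a small denoising autoencoder on target-domain features so that its encoder can later re-represent source-domain features in a target-informed space. Layer sizes, the noise level, and the training loop are assumptions.

```python
# Illustrative sketch only: a denoising autoencoder fitted to target-domain
# acoustic features; its encoder is then reused to map source-domain features
# into the same space before classifier training. Dimensions are assumptions.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features=384, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        # corrupt the input, then reconstruct the clean version
        noisy = x + 0.1 * torch.randn_like(x)
        return self.decoder(self.encoder(noisy))

def train_on_target(dae, target_feats, epochs=50, lr=1e-3):
    """Learn target-domain structure from unlabeled target features."""
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dae(target_feats), target_feats)
        loss.backward()
        opt.step()
    return dae
```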


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Unsupervised learning in cross-corpus acoustic emotion recognition

Zixing Zhang; Felix Weninger; Martin Wöllmer; Björn W. Schuller

One of the ever-present bottlenecks in automatic emotion recognition is data sparseness. We therefore investigate the suitability of unsupervised learning in cross-corpus acoustic emotion recognition through a large-scale study with six commonly used databases, covering acted and natural emotional speech as well as a variety of application scenarios and acoustic conditions. We show that adding unlabeled emotional speech to agglomerated multi-corpus training sets can enhance recognition performance even in a challenging cross-corpus setting; furthermore, we show that the expected gain from adding unlabeled data is, on average, approximately half of that achieved with additional manually labeled data in leave-one-corpus-out validation.


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Cooperative learning and its application to emotion recognition from speech

Zixing Zhang; Eduardo Coutinho; Jun Deng; Björn W. Schuller

In this paper, we propose a novel method for the highly efficient exploitation of unlabeled data: Cooperative Learning. Our approach combines Active Learning and Semi-Supervised Learning techniques with the aim of reducing the costly effects of human annotation. The core idea of Cooperative Learning is to share the labeling work between human and machine efficiently, in such a way that instances predicted with insufficient confidence are subject to human labeling, while those predicted with high confidence are machine-labeled. We conducted various test runs on two emotion recognition tasks with a variable number of initial supervised training instances and two different feature sets. The results show that Cooperative Learning consistently outperforms individual Active and Semi-Supervised Learning techniques in all test cases. In particular, we show that our method based on the combination of Active Learning and Co-Training reaches the same performance as a model trained on the whole training set, but using 75% fewer labeled instances. Our method therefore efficiently and robustly reduces the need for human annotation.
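As a hedged sketch of the labeling loop described above (the confidence thresholds, the oracle callback ask_human, and the SVM base learner are illustrative assumptions, not the paper's exact setup):

```python
# Sketch of cooperative learning: low-confidence instances go to a human
# annotator, high-confidence instances are machine-labeled, and both are added
# to the training pool in each round. All parameters are illustrative.
import numpy as np
from sklearn.svm import SVC

def cooperative_learning(X_lab, y_lab, X_pool, ask_human,
                         low_conf=0.6, high_conf=0.9, rounds=5):
    model = SVC(probability=True)
    for _ in range(rounds):
        if len(X_pool) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_pool)
        conf = proba.max(axis=1)
        pred = model.classes_[proba.argmax(axis=1)]

        human_idx = np.where(conf < low_conf)[0]      # active learning: ask the oracle
        machine_idx = np.where(conf >= high_conf)[0]  # semi-supervised: trust the model

        X_new = np.vstack([X_pool[human_idx], X_pool[machine_idx]])
        y_new = np.concatenate([ask_human(X_pool[human_idx]), pred[machine_idx]])
        X_lab, y_lab = np.vstack([X_lab, X_new]), np.concatenate([y_lab, y_new])
        X_pool = np.delete(X_pool, np.concatenate([human_idx, machine_idx]), axis=0)
    return model.fit(X_lab, y_lab)
```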


International Conference on Acoustics, Speech, and Signal Processing | 2012

Semi-supervised learning helps in sound event classification

Zixing Zhang; Björn W. Schuller

We investigate the suitability of semi-supervised learning for sound event classification on a large database of 17k sound clips. Seven categories are chosen based on the findsounds.com schema: animals, people, nature, vehicles, noisemakers, office, and musical instruments. Our results show that adding unlabelled sound event data to the training set, provided the classifier's confidence in its automatic labelling is sufficiently high, can significantly enhance classification performance. Furthermore, combined with optimal re-sampling of the originally labelled instances and iterative learning in a semi-supervised manner, the expected gain can reach approximately half of that achieved by using the original manually labelled data. Overall, a maximum accuracy of 71.7% can be reported for the automatic classification of sound in a large-scale archive.
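A minimal sketch of confidence-thresholded self-labelling of this kind, using scikit-learn's SelfTrainingClassifier; the SVM base learner and the 0.75 threshold are assumptions, not the settings used in the paper:

```python
# Unlabelled clips are marked with -1; the classifier iteratively adds its own
# high-confidence predictions to the training set.
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

def train_sound_event_classifier(X_labelled, y_labelled, X_unlabelled):
    X = np.vstack([X_labelled, X_unlabelled])
    y = np.concatenate([y_labelled, -np.ones(len(X_unlabelled), dtype=int)])
    clf = SelfTrainingClassifier(SVC(probability=True), threshold=0.75)
    return clf.fit(X, y)
```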


International Conference on Acoustics, Speech, and Signal Processing | 2013

Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise

Martin Wöllmer; Zixing Zhang; Felix Weninger; Björn W. Schuller; Gerhard Rigoll

The recognition of spontaneous speech in highly variable noise is known to be a challenge, especially at low signal-to-noise ratios (SNR). In this paper, we investigate the effect of applying bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks for speech feature enhancement in noisy conditions. BLSTM networks tend to prevail over conventional neural network architectures, whenever the recognition or regression task relies on an intelligent exploitation of temporal context information. We show that BLSTM networks are well-suited for mapping from noisy to clean speech features and that the obtained recognition performance gain is partly complementary to improvements via additional techniques such as speech enhancement by non-negative matrix factorization and probabilistic feature generation by Bottleneck-BLSTM networks. Compared to simple multi-condition training or feature enhancement via standard recurrent neural networks, our BLSTM-based feature enhancement approach leads to remarkable gains in word accuracy in a highly challenging task of recognizing spontaneous speech at SNR levels between -6 and 9 dB.
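A minimal sketch of a BLSTM feature-enhancement network of the kind described above, written in PyTorch; the layer sizes and feature dimension are assumptions:

```python
# Bidirectional LSTM mapping noisy feature frames to clean ones (frame-wise
# regression); training minimises the MSE against parallel clean features.
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, n_features=39, n_hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_features)  # 2x: forward + backward states

    def forward(self, noisy):          # noisy: (batch, time, n_features)
        h, _ = self.blstm(noisy)
        return self.out(h)             # enhanced feature trajectory

# Training would minimise, e.g.:
# loss = nn.functional.mse_loss(model(noisy_batch), clean_batch)
```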


Conference of the International Speech Communication Association | 2016

Facing Realism in Spontaneous Emotion Recognition from Speech: Feature Enhancement by Autoencoder with LSTM Neural Networks

Zixing Zhang; Fabien Ringeval; Jing Han; Jun Deng; Erik Marchi; Björn W. Schuller

During the last decade, speech emotion recognition technology has matured well enough to be used in some real-life scenarios. However, these scenarios require an almost silent environment so as not to compromise the performance of the system. Emotion recognition technology from speech thus needs to evolve and face more challenging conditions, such as environmental additive and convolutional noise, in order to broaden its applicability to real-life conditions. This contribution evaluates the impact of a front-end feature enhancement method based on an autoencoder with long short-term memory neural networks for robust emotion recognition from speech. Support Vector Regression is then used as a back-end for time- and value-continuous emotion prediction from the enhanced features. We perform extensive evaluations on both non-stationary additive noise and convolutional noise, on a database of spontaneous and natural emotions. Results show that the proposed method significantly outperforms a system trained on raw features, for both the arousal and valence dimensions, while suffering almost no degradation when applied to clean speech.
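For the back-end step, a hedged sketch using scikit-learn's SVR; the kernel and parameters are illustrative choices, not those reported in the paper:

```python
# Support Vector Regression predicting a continuous emotion dimension
# (e.g. arousal) from the enhanced features, one value per frame.
from sklearn.svm import SVR

def fit_continuous_emotion_regressor(enhanced_feats, emotion_values):
    """enhanced_feats: (n_frames, n_features); emotion_values: (n_frames,)"""
    return SVR(kernel="linear", C=1.0).fit(enhanced_feats, emotion_values)
```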


IEEE Signal Processing Letters | 2017

Universum Autoencoder-Based Domain Adaptation for Speech Emotion Recognition

Jun Deng; Xinzhou Xu; Zixing Zhang; Sascha Frühholz; Björn W. Schuller

One of the serious obstacles to the application of speech emotion recognition systems in real-life settings is the lack of generalization of the emotion classifiers. Recognition systems often present a dramatic drop in performance when tested on speech data obtained from different speakers, acoustic environments, linguistic content, and domain conditions. In this letter, we propose a novel unsupervised domain adaptation model, called Universum autoencoders, to improve the performance of systems evaluated in mismatched training and test conditions. To address the mismatch, our proposed model not only learns discriminative information from labeled data, but also learns to incorporate the prior knowledge from unlabeled data into the learning. Experimental results on the labeled Geneva Whispered Emotion Corpus database plus three other unlabeled databases demonstrate the effectiveness of the proposed method when compared to other domain adaptation methods.


IEEE Transactions on Affective Computing | 2014

Distributing Recognition in Computational Paralinguistics

Zixing Zhang; Eduardo Coutinho; Jun Deng; Björn W. Schuller

In this paper, we propose and evaluate a distributed system for multiple Computational Paralinguistics tasks in a client-server architecture. The client side deals with feature extraction, compression, and bit-stream formatting, while the server side performs the reverse process, plus model training, and classification. The proposed architecture favors large-scale data collection and continuous model updating, personal information protection, and transmission bandwidth optimization. In order to preliminarily investigate the feasibility and reliability of the proposed system, we focus on the trade-off between transmission bandwidth and recognition accuracy. We conduct large-scale evaluations of some key functions, namely, feature compression/decompression, model training and classification, on five common paralinguistic tasks related to emotion, intoxication, pathology, age and gender. We show that, for most tasks, with compression ratios up to 40 (bandwidth savings up to 97.5 percent), the recognition accuracies are very close to the baselines. Our results encourage future exploitation of the system proposed in this paper, and demonstrate that we are not far from the creation of robust distributed multi-task paralinguistic recognition systems which can be applied to a myriad of everyday life scenarios.
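The quoted bandwidth figure follows directly from the compression ratio:

```python
# A compression ratio of 40 shrinks the bit-stream to 1/40 of its original
# size, i.e. a saving of 1 - 1/40 = 0.975 = 97.5 percent.
def bandwidth_saving(compression_ratio: float) -> float:
    return 1.0 - 1.0 / compression_ratio

print(f"{bandwidth_saving(40):.1%}")  # -> 97.5%
```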


International Conference on Acoustics, Speech, and Signal Processing | 2013

Co-training succeeds in Computational Paralinguistics

Zixing Zhang; Jun Deng; Björn W. Schuller

Data sparsity is one of the major bottlenecks in the field of Computational Paralinguistics. Partially supervised learning approaches can help alleviate this problem without the need for cost-intensive human labelling efforts. We thus investigate the feasibility of co-training for exemplary paralinguistic speech analysis tasks spanning the time continuum: from short-term emotion to mid-term sleepiness and finally to the long-term trait of gender. By dividing the acoustic feature space into two views that are as independent and sufficient as possible, the semi-supervised learning approach of co-training selects instances with high confidence scores in each view and agglomerates them, along with their predictions, into the training set in each iteration. Our experimental results on official INTERSPEECH Computational Paralinguistics Challenge tasks effectively demonstrate co-training's superiority over the baseline formed by single-view self-training, especially for the short- and mid-term tasks of emotion and sleepiness recognition.
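A hedged sketch of such a co-training loop (the view splits, the SVM learners, and the selection sizes are illustrative assumptions, not the paper's configuration):

```python
# Two classifiers trained on disjoint feature views; each adds its most
# confidently predicted pool instances, with their predicted labels, to the
# shared training set in every iteration.
import numpy as np
from sklearn.svm import SVC

def co_training(X_lab, y_lab, X_pool, view_a, view_b, per_round=50, rounds=5):
    """view_a / view_b: index arrays splitting the acoustic feature space."""
    for _ in range(rounds):
        for view in (view_a, view_b):
            if len(X_pool) == 0:
                break
            clf = SVC(probability=True).fit(X_lab[:, view], y_lab)
            proba = clf.predict_proba(X_pool[:, view])
            top = np.argsort(proba.max(axis=1))[-per_round:]   # most confident instances
            y_new = clf.classes_[proba[top].argmax(axis=1)]
            X_lab = np.vstack([X_lab, X_pool[top]])
            y_lab = np.concatenate([y_lab, y_new])
            X_pool = np.delete(X_pool, top, axis=0)
    return SVC(probability=True).fit(X_lab, y_lab)              # final model on full view
```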


International Journal of Speech Technology | 2012

Synthesized speech for model training in cross-corpus recognition of human emotion

Björn W. Schuller; Zixing Zhang; Felix Weninger; Felix Burkhardt

Recognizing speakers in emotional conditions remains a challenging issue, since speaker states such as emotion affect the acoustic parameters used in typical speaker recognition systems. Thus, it is believed that knowledge of the current speaker emotion can improve speaker recognition in real-life conditions. Conversely, speech emotion recognition still has to overcome several barriers before it can be employed in realistic situations, as is already the case with speech and speaker recognition. One of these barriers is the lack of suitable training data, both in quantity and quality—especially data that allow recognizers to generalize across application scenarios (‘cross-corpus’ setting). In previous work, we have shown that in principle, the usage of synthesized emotional speech for model training can be beneficial for the recognition of human emotions from speech. In this study, we aim at consolidating these first results in a large-scale cross-corpus evaluation on eight of the most frequently used human emotional speech corpora, namely ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SUSAS and VAM, covering natural, induced and acted emotion as well as a variety of application scenarios and acoustic conditions. Synthesized speech is evaluated standalone as well as in joint training with human speech. Our results show that the usage of synthesized emotional speech in acoustic model training can significantly improve the recognition of arousal from human speech in the challenging cross-corpus setting.

Collaboration


Top co-authors of Zixing Zhang and their affiliations:

Jun Deng | University of Passau

Jing Han | University of Augsburg

Yue Zhang | Imperial College London