Jing Han
University of Augsburg
Publication
Featured research published by Jing Han.
conference of the international speech communication association | 2016
Zixing Zhang; Fabien Ringeval; Jing Han; Jun Deng; Erik Marchi; Björn W. Schuller
During the last decade, speech emotion recognition technology has matured well enough to be used in some real-life scenarios. However, these scenarios require an almost silent environment to not compromise the performance of the system. Emotion recognition technology from speech thus needs to evolve and face more challenging conditions, such as environmental additive and convolutional noises, in order to broaden its applicability to real-life conditions. This contribution evaluates the impact of a front-end feature enhancement method based on an autoencoder with long short-term memory neural networks, for robust emotion recognition from speech. Support Vector Regression is then used as a back-end for time- and value-continuous emotion prediction from enhanced features. We perform extensive evaluations on both non-stationary additive noise and convolutional noise, on a database of spontaneous and natural emotions. Results show that the proposed method significantly outperforms a system trained on raw features, for both arousal and valence dimensions, while having almost no degradation when applied to clean speech.
acm multimedia | 2017
Jing Han; Zixing Zhang; Maximilian Schmitt; Maja Pantic; Björn W. Schuller
Over the last decade, automatic emotion recognition has become well established. The gold standard target is thereby usually calculated based on multiple annotations from different raters. All related efforts assume that the emotional state of a human subject can be identified by a hard category or a unique value. This assumption tries to ease the human observers' subjectivity when observing patterns such as the emotional state of others. However, as the number of annotators cannot be infinite, uncertainty remains in the emotion target even if calculated from several, yet few human annotators. The common procedure to use this same emotion target in the learning process thus inevitably introduces noise in terms of an uncertain learning target. In this light, we propose a soft prediction framework to provide a more human-like and comprehensive prediction of emotion. In our novel framework, we provide an additional target to indicate the uncertainty of human perception based on the inter-rater disagreement level, in contrast to the traditional framework, which merely produces one single prediction (category or value). To exploit the dependency between the emotional state and the newly introduced perception uncertainty, we implement a multi-task learning strategy. To evaluate the feasibility and effectiveness of the proposed soft prediction framework, we perform extensive experiments on a time- and value-continuous spontaneous audiovisual emotion database including late fusion results. We show that the soft prediction framework with multi-task learning of the emotional state and its perception uncertainty significantly outperforms the individual tasks in both the arousal and valence dimensions.
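As a rough illustration of the two targets described in this abstract, the sketch below derives an emotion target and a perception-uncertainty target from several raters' traces. The abstract does not state how either target is computed, so the per-frame mean and standard deviation used here, as well as the name `soft_targets` and the toy ratings, are assumptions made for illustration only.

```python
import numpy as np

def soft_targets(annotations):
    """Derive two per-frame targets from ratings of shape (n_raters, n_frames).

    Assumption: the emotion target is the mean over raters and the
    uncertainty target is the standard deviation over raters.
    """
    ratings = np.asarray(annotations, dtype=float)
    emotion = ratings.mean(axis=0)      # conventional single-value target
    uncertainty = ratings.std(axis=0)   # inter-rater disagreement level
    return emotion, uncertainty

# Hypothetical arousal ratings: three raters, three frames.
ratings = [
    [0.2, 0.4, 0.6],
    [0.2, 0.0, 0.8],
    [0.2, 0.2, 0.4],
]
emotion, uncertainty = soft_targets(ratings)
print(emotion)
print(uncertainty)
```

In a multi-task setup, a model would be trained to predict both arrays jointly; frames on which the raters agree (such as the first frame here) receive an uncertainty close to zero.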
international conference on speech and computer | 2018
Jing Han; Maximilian Schmitt; Björn W. Schuller
In social interaction, people tend to mimic their conversational partners both when they agree and disagree. Research on this phenomenon is complex but not recent in theory, and related studies show that mimicry can enhance social relationships, increase affiliation and rapport. However, automatically recognising such a phenomenon is still in its early development. In this paper, we analyse mimicry in the speech domain and propose a novel method by using hand-crafted low-level acoustic descriptors and autoencoders (AEs). Specifically, for each conversation, two AEs are built, one for each speaker. After training, the acoustic features of one speaker are tested with the AE that is trained on the features of her counterpart. The proposed approach is evaluated on a database consisting of almost 400 subjects from 6 different cultures, recorded in-the-wild. By calculating the AE’s reconstruction errors of all speakers and analysing the errors at different times in their interactions, we show that, albeit to different degrees from culture to culture, mimicry arises in most interactions.
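The cross-speaker test described in this abstract can be sketched as follows. This is not the authors' model: a PCA-based linear autoencoder stands in for the AEs, and the synthetic features, the function names `fit_linear_ae` and `reconstruction_error`, and the choice of mean squared error are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_ae(features, n_components):
    """Fit a PCA-based linear autoencoder on (n_frames, n_features) data."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(model, features):
    """Mean squared error after encoding and decoding with the given model."""
    mean, components = model
    codes = (features - mean) @ components.T     # encode
    reconstructed = codes @ components + mean    # decode
    return float(np.mean((features - reconstructed) ** 2))

# Synthetic stand-ins for low-level acoustic descriptors (8 features per frame).
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 8))
speaker_a = rng.normal(size=(200, 2)) @ basis    # speaker A
similar_b = rng.normal(size=(200, 2)) @ basis    # partner sharing A's structure
different_b = rng.normal(size=(200, 8))          # partner with unrelated structure

# Train on speaker A, then test on the features of the counterpart.
ae_a = fit_linear_ae(speaker_a, n_components=2)
err_similar = reconstruction_error(ae_a, similar_b)
err_different = reconstruction_error(ae_a, different_b)
print(err_similar < err_different)
```

Under these assumptions, a lower reconstruction error means the counterpart's features resemble those the model was trained on. Tracking that error over successive segments of a conversation would be one way to look for convergence between speakers.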
international conference on digital health | 2018
Zhao Ren; Nicholas Cummins; Vedhas Pandit; Jing Han; Kun Qian; Björn W. Schuller
Machine learning based heart sound classification represents an efficient technology that can help reduce the burden of manual auscultation through the automatic detection of abnormal heart sounds. In this regard, we investigate the efficacy of using Convolutional Neural Networks (CNNs) pre-trained on large-scale image data for the classification of Phonocardiogram (PCG) signals by learning deep PCG representations. First, the PCG files are segmented into chunks of equal length. Then, we extract a scalogram image from each chunk using a wavelet transformation. Next, the scalogram images are fed into either a pre-trained CNN, or the same network fine-tuned on heart sound data. Deep representations are then extracted from a fully connected layer of each network and classification is achieved by a static classifier. Alternatively, the scalogram images are fed into an end-to-end CNN formed by adapting a pre-trained network via transfer learning. Key results indicate that our deep PCG representations extracted from a fine-tuned CNN perform the strongest, with 56.2% mean accuracy, on our heart sound classification task. When compared to a baseline accuracy of 46.9%, gained using conventional audio processing features and a support vector machine, this is a significant relative improvement of 19.8% (p < .001 by one-tailed z-test).
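The relative improvement quoted in this abstract follows directly from the two reported accuracies. The short check below uses only those two figures; the variable names are illustrative.

```python
baseline_acc = 46.9   # conventional audio features + support vector machine (%)
proposed_acc = 56.2   # deep PCG representations from the fine-tuned CNN (%)

absolute_gain = proposed_acc - baseline_acc
relative_gain = 100.0 * absolute_gain / baseline_acc

print(f"{absolute_gain:.1f} points absolute, {relative_gain:.1f}% relative")
# prints "9.3 points absolute, 19.8% relative"
```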
ICMI '18 Proceedings of the 20th ACM International Conference on Multimodal Interaction | 2018
Ya'nan Guo; Jing Han; Zixing Zhang; Björn W. Schuller; Yide Ma
In this paper, we mainly investigate subjects' food likability based on audio-related features as a contribution to EAT, the ICMI 2018 Eating Analysis and Tracking challenge. Specifically, we conduct 4-level Double Tree Complex Wavelet Transform decomposition of an audio signal, and obtain five sub-audio signals with frequencies ranging from low to high. For each sub-audio signal, we calculate not only traditional functional-based features but also deep learning-based features via pretrained CNNs based on the SliCQ-nonstationary Gabor transform and a cochleagram map. In addition, Bag-of-Audio-Words features extracted from the original audio signals by the openXBOW toolkit are used to enhance the model. Finally, the early fusion of all three kinds of features leads to promising results, yielding the highest UAR of 79.2% by means of a leave-one-speaker-out cross-validation, which is a 12.7% absolute gain compared with the baseline of 66.5% UAR.
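The results in this abstract are reported as UAR (unweighted average recall), which is the mean of the per-class recalls, so each class counts equally regardless of its size. A minimal sketch of the metric is shown below; the function name and the toy labels are hypothetical and are not taken from the challenge data.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with every class weighted equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy example with an imbalanced two-class problem.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
uar = unweighted_average_recall(y_true, y_pred)
print(uar)  # 0.625: recall 0.75 for class 0 and 0.5 for class 1
```

The reported gain is an absolute difference between two UAR values: 79.2 - 66.5 = 12.7 percentage points.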
Archive | 2017
Zixing Zhang; Ding Liu; Jing Han; Björn W. Schuller
international conference on acoustics, speech, and signal processing | 2018
Jing Han; Zixing Zhang; Zhao Ren; Fabien Ringeval; Björn W. Schuller
conference of the international speech communication association | 2018
Zixing Zhang; Jing Han; Kun Qian; Björn W. Schuller
conference of the international speech communication association | 2018
Jing Han; Zixing Zhang; Maximilian Schmitt; Zhao Ren; Fabien Ringeval; Björn W. Schuller
arXiv: Computation and Language | 2018
Jing Han; Zixing Zhang; Nicholas Cummins; Björn W. Schuller
