Nicholas Cummins
University of Augsburg
Publications
Featured research published by Nicholas Cummins.
ACM Multimedia | 2017
Fabien Ringeval; Björn W. Schuller; Michel F. Valstar; Jonathan Gratch; Roddy Cowie; Stefan Scherer; Sharon Mozgai; Nicholas Cummins; Maximilian Schmitt; Maja Pantic
The Audio/Visual Emotion Challenge and Workshop (AVEC 2017) "Real-life Depression, and Affect" will be the seventh competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the depression and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of the various approaches to depression and emotion recognition from real-life data. This paper presents the novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline system on the two proposed tasks: dimensional emotion recognition (time and value-continuous), and dimensional depression estimation (value-continuous).
Image and Vision Computing | 2017
Jing Han; Zixing Zhang; Nicholas Cummins; Fabien Ringeval; Björn W. Schuller
Automatic continuous affect recognition from audiovisual cues is arguably one of the most active research areas in machine learning. In addressing this regression problem, the advantages of the models, such as the global-optimisation capability of Support Vector Machine for Regression and the context-sensitive capability of memory-enhanced neural networks, have been frequently explored, but in an isolated way. Motivated to leverage the individual advantages of these techniques, this paper proposes and explores a novel framework, Strength Modelling, where two models are concatenated in a hierarchical framework. In doing this, the strength information of the first model, as represented by its predictions, is joined with the original features, and this expanded feature space is then utilised as the input by the successive model. A major advantage of Strength Modelling, besides its ability to hierarchically explore the strength of different machine learning algorithms, is that it can work together with the conventional feature- and decision-level fusion strategies for multimodal affect recognition. To highlight the effectiveness and robustness of the proposed approach, extensive experiments have been carried out on two time- and value-continuous spontaneous emotion databases (RECOLA and SEMAINE) using audio and video signals. The experimental results indicate that employing Strength Modelling can deliver a significant performance improvement for both arousal and valence in the unimodal and bimodal settings. The results further show that the proposed system is competitive with, or outperforms, other state-of-the-art approaches, while retaining a simpler implementation.
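As a rough illustration of the Strength Modelling idea described above, the sketch below chains two regressors: the first model's predictions are appended to the original features before training the second. The SVR/MLP pairing and the synthetic data are placeholder assumptions (the paper pairs SVR with memory-enhanced neural networks), not the authors' exact setup.

```python
# Minimal sketch of Strength Modelling: the first model's predictions are
# appended to the original features and passed to a second model.
# Model choices (SVR then MLP) and the synthetic data are illustrative only.
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 20)), rng.normal(size=500)
X_test, y_test = rng.normal(size=(100, 20)), rng.normal(size=100)

# Stage 1: the "strength" model (here SVR) produces predictions on both splits.
stage1 = SVR(kernel="rbf").fit(X_train, y_train)
train_strength = stage1.predict(X_train).reshape(-1, 1)
test_strength = stage1.predict(X_test).reshape(-1, 1)

# Stage 2: expand the feature space with the stage-1 predictions and train
# the successive model (an MLP standing in for a memory-enhanced network).
X_train_aug = np.hstack([X_train, train_strength])
X_test_aug = np.hstack([X_test, test_strength])
stage2 = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
stage2.fit(X_train_aug, y_train)
print("Stage-2 R^2:", stage2.score(X_test_aug, y_test))
```

The same wrapper can sit alongside feature- or decision-level fusion, since the stage-2 input is just another feature matrix.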
ACM Multimedia | 2017
Nicholas Cummins; Shahin Amiriparian; Gerhard Hagerer; Anton Batliner; Stefan Steidl; Björn W. Schuller
The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations in matched noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
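A minimal sketch of the deep spectrum idea, assuming a recent torchvision and a pre-computed 224x224 RGB spectrogram image: the image is pushed through a pre-trained VGG16 and the activations of a late fully connected layer are taken as the acoustic feature vector. The backbone and layer choice are illustrative, not necessarily those used in the paper.

```python
# Sketch of "deep spectrum" feature extraction: a spectrogram image is passed
# through a pre-trained image CNN and the activations of a late fully connected
# layer are used as the acoustic feature vector. VGG16 and the random input are
# placeholder assumptions.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()

# Keep everything up to the penultimate fully connected layer of the classifier.
feature_head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

spectrogram_image = torch.rand(1, 3, 224, 224)  # placeholder RGB spectrogram
with torch.no_grad():
    conv_out = vgg.features(spectrogram_image)
    pooled = vgg.avgpool(conv_out).flatten(1)
    deep_spectrum_features = feature_head(pooled)  # shape: (1, 4096)

print(deep_spectrum_features.shape)
```

The resulting 4096-dimensional vectors can then be fed to any static classifier in place of hand-crafted acoustic features.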
IEEE Signal Processing Magazine | 2017
Zixing Zhang; Nicholas Cummins; Björn W. Schuller
With recent advances in machine-learning techniques for automatic speech analysis (ASA), the computerized extraction of information from speech signals, there is a greater need for high-quality, diverse, and very large amounts of data. Such data could be game-changing in terms of ASA system accuracy and robustness, enabling the extraction of feature representations or the learning of model parameters immune to confounding factors, such as acoustic variations, unrelated to the task at hand. However, many current ASA data sets do not meet the desired properties. Instead, they are often recorded under less than ideal conditions, with the corresponding labels sparse or unreliable.
Artificial Intelligence in Medicine in Europe | 2017
Nicholas Cummins; Bogdan Vlasenko; Hesam Sagha; Björn W. Schuller
Depression has been consistently linked with alterations in speech motor control characterised by changes in formant dynamics. However, potential differences in the manifestation of depression between male and female speech have not been fully realised or explored. This paper considers speech-based depression classification using gender-dependent features and classifiers. Key observations presented reveal gender differences in the effect of depression on vowel-level formant features. Considering this observation, we also show that a small set of hand-crafted gender-dependent formant features can outperform acoustic-only features (from two state-of-the-art acoustic feature sets) when performing two-class (depressed and non-depressed) classification.
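The gender-dependent modelling described above can be sketched as training one classifier per gender and routing test samples by speaker gender; the features, labels, and linear SVM below are placeholder assumptions rather than the paper's formant features and classifiers.

```python
# Sketch of gender-dependent depression classification: a separate classifier
# is trained per gender and test samples are routed by speaker gender.
# Features, labels, and the SVM choice are placeholder assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 8))        # e.g. vowel-level formant statistics
labels = rng.integers(0, 2, size=200)       # 0 = non-depressed, 1 = depressed
genders = rng.choice(["f", "m"], size=200)

classifiers = {}
for g in ("f", "m"):
    idx = genders == g
    classifiers[g] = SVC(kernel="linear").fit(features[idx], labels[idx])

def predict(x, gender):
    """Route a single sample to the classifier matching the speaker's gender."""
    return classifiers[gender].predict(x.reshape(1, -1))[0]

print(predict(features[0], genders[0]))
```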
International Conference on Digital Health | 2017
Jun Deng; Nicholas Cummins; Maximilian Schmitt; Kun Qian; Fabien Ringeval; Björn W. Schuller
Machine learning paradigms based on child vocalisations show great promise as an objective marker of developmental disorders such as Autism. In conventional detection systems, hand-crafted acoustic features are usually fed into a discriminative classifier (e.g., Support Vector Machines); however, it is well known that the accuracy and robustness of such a system is limited by the size of the associated training data. This paper explores, for the first time, the use of feature representations learnt using a deep Generative Adversarial Network (GAN) for classifying children's speech affected by developmental disorders. A comparative evaluation of our proposed system with different acoustic feature sets is performed on the Child Pathological and Emotional Speech database. Key experimental results presented demonstrate that GAN-based methods exhibit competitive performance with the conventional paradigms in terms of the unweighted average recall metric.
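One possible reading of GAN-based representation learning is sketched below under assumed dimensions, data, and training settings: a small GAN is trained on acoustic feature frames, after which the discriminator's hidden layer serves as a learnt feature extractor for a downstream classifier. This is an illustrative stand-in, not the paper's exact architecture.

```python
# Sketch: train a small GAN on acoustic feature vectors, then reuse the
# discriminator's hidden layer as a learnt representation extractor.
# Sizes, data, and settings are illustrative assumptions.
import torch
import torch.nn as nn

feat_dim, noise_dim, hidden = 40, 16, 64
G = nn.Sequential(nn.Linear(noise_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
D_hidden = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
D_out = nn.Linear(hidden, 1)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(list(D_hidden.parameters()) + list(D_out.parameters()), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_frames = torch.randn(1024, feat_dim)   # placeholder acoustic feature frames

for step in range(200):
    real = real_frames[torch.randint(0, 1024, (64,))]
    fake = G(torch.randn(64, noise_dim))

    # Discriminator update: real -> 1, fake -> 0.
    d_loss = bce(D_out(D_hidden(real)), torch.ones(64, 1)) + \
             bce(D_out(D_hidden(fake.detach())), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    g_loss = bce(D_out(D_hidden(fake)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The discriminator's hidden activations now serve as learnt representations
# that can be passed to a conventional classifier.
with torch.no_grad():
    representations = D_hidden(real_frames)   # shape: (1024, 64)
print(representations.shape)
```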
Chinese Conference on Pattern Recognition | 2016
Jun Deng; Nicholas Cummins; Jing Han; Xinzhou Xu; Zhao Ren; Vedhas Pandit; Zixing Zhang; Björn W. Schuller
This paper presents the University of Passau's approaches for the Multimodal Emotion Recognition Challenge 2016. For audio signals, we exploit Bag-of-Audio-Words techniques combining Extreme Learning Machines and Hierarchical Extreme Learning Machines. For video signals, we use not only the information from the cropped face of a video frame, but also the broader contextual information from the entire frame. This information is extracted via two Convolutional Neural Networks pre-trained for face detection and object classification. Moreover, we extract facial action units, which reflect facial muscle movements and are known to be important for emotion recognition. Long Short-Term Memory Recurrent Neural Networks are deployed to exploit temporal information in the video representation. Average late fusion of the audio and video systems is applied to make predictions for multimodal emotion recognition. Experimental results on the challenge database demonstrate the effectiveness of our proposed systems when compared to the baseline.
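The average late-fusion step mentioned above amounts to averaging the per-class scores of the independently trained audio and video subsystems before taking the arg-max; a minimal sketch with placeholder score matrices:

```python
# Sketch of average late fusion: per-class scores from the audio and video
# subsystems are averaged, then the highest-scoring class is selected.
# The score arrays below are placeholders for the two subsystems' outputs.
import numpy as np

audio_scores = np.array([[0.1, 0.7, 0.2],    # one row per sample,
                         [0.5, 0.3, 0.2]])   # one column per emotion class
video_scores = np.array([[0.2, 0.5, 0.3],
                         [0.6, 0.2, 0.2]])

fused = (audio_scores + video_scores) / 2.0
predictions = fused.argmax(axis=1)
print(predictions)  # predicted class index per sample
```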
Proceedings of the 2018 Audio/Visual Emotion Challenge and Workshop - AVEC'18 | 2018
Fabien Ringeval; Adrien Michaud; Elvan Ciftçi; Hüseyin Güleç; Albert Ali Salah; Maja Pantic; Björn W. Schuller; Michel F. Valstar; Roddy Cowie; Heysem Kaya; Maximilian Schmitt; Shahin Amiriparian; Nicholas Cummins; Denis Lalanne
The Audio/Visual Emotion Challenge and Workshop (AVEC 2018) "Bipolar Disorder, and Cross-cultural Affect Recognition" is the eighth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: bipolar disorder classification, cross-cultural dimensional emotion recognition, and emotional label generation from individual ratings, respectively.
International Conference on Digital Health | 2018
Gerhard Hagerer; Nicholas Cummins; Florian Eyben; Björn W. Schuller
To build a noise-robust, online-capable laughter detector for behavioural monitoring on wearables, we incorporate context-sensitive Long Short-Term Memory Deep Neural Networks. We show our solution's improvements over a laughter detection baseline by integrating intelligent noise-robust voice activity detection (VAD) into the same model. To this end, we add extensive artificially mixed VAD data without any laughter targets to a small laughter training set. The resulting laughter detection enhancements are stable even when frames are dropped, as can happen in low-resource environments such as wearables. Thus, the outlined model generation potentially improves the detection of vocal cues when the amount of training data is small and robustness and efficiency are required.
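A minimal sketch of the kind of frame-level LSTM detector described above, in which artificially mixed VAD material carrying no laughter targets is added to the training set as extra negative frames; dimensions, data, and training settings are assumptions, not the authors' configuration.

```python
# Sketch of a frame-level laughter detector: an LSTM scores each frame, and
# VAD-only sequences (all frames labelled non-laughter) augment the training set.
# Feature dimensions, data, and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

class FrameLaughterDetector(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, frames, features)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)     # per-frame laughter logits

# Placeholder data: laughter-annotated sequences plus noisy VAD-only sequences
# whose frames are all labelled as non-laughter.
laughter_x = torch.randn(32, 100, 40)
laughter_y = torch.randint(0, 2, (32, 100)).float()
vad_x, vad_y = torch.randn(64, 100, 40), torch.zeros(64, 100)
x, y = torch.cat([laughter_x, vad_x]), torch.cat([laughter_y, vad_y])

model = FrameLaughterDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```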
International Conference on Digital Health | 2018
Zhao Ren; Nicholas Cummins; Vedhas Pandit; Jing Han; Kun Qian; Björn W. Schuller
Machine learning based heart sound classification represents an efficient technology that can help reduce the burden of manual auscultation through the automatic detection of abnormal heart sounds. In this regard, we investigate the efficacy of using pre-trained Convolutional Neural Networks (CNNs) from large-scale image data for the classification of Phonocardiogram (PCG) signals by learning deep PCG representations. First, the PCG files are segmented into chunks of equal length. Then, we extract a scalogram image from each chunk using a wavelet transformation. Next, the scalogram images are fed into either a pre-trained CNN, or the same network fine-tuned on heart sound data. Deep representations are then extracted from a fully connected layer of each network and classification is achieved by a static classifier. Alternatively, the scalogram images are fed into an end-to-end CNN formed by adapting a pre-trained network via transfer learning. Key results indicate that our deep PCG representations extracted from a fine-tuned CNN perform the strongest, at 56.2% mean accuracy, on our heart sound classification task. When compared to a baseline accuracy of 46.9%, gained using conventional audio processing features and a support vector machine, this is a significant relative improvement of 19.8% (p < .001 by a one-tailed z-test).
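The PCG pipeline described above (equal-length segmentation, wavelet scalograms, deep representations from a pre-trained image CNN) might be sketched as follows; the chunk length, Morlet wavelet, scale range, and VGG16 backbone are assumptions standing in for details not given in the abstract.

```python
# Sketch of the PCG pipeline: segment the recording into equal-length chunks,
# turn each chunk into a wavelet scalogram image, and pass it through a
# pre-trained image CNN to obtain one deep representation per chunk.
# Chunk length, wavelet, scales, and the VGG16 backbone are assumptions.
import numpy as np
import pywt
import torch
import torch.nn.functional as F
import torchvision.models as models

fs, chunk_seconds = 2000, 4
pcg = np.random.randn(fs * 20)                  # placeholder PCG recording
chunk_len = fs * chunk_seconds
chunks = [pcg[i:i + chunk_len] for i in range(0, len(pcg) - chunk_len + 1, chunk_len)]

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

deep_representations = []
for chunk in chunks:
    coeffs, _ = pywt.cwt(chunk, scales=np.arange(1, 65), wavelet="morl")
    scalogram = torch.tensor(np.abs(coeffs), dtype=torch.float32)
    image = scalogram.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # 1 x 3 x H x W
    image = F.interpolate(image, size=(224, 224), mode="bilinear")
    with torch.no_grad():
        rep = head(vgg.avgpool(vgg.features(image)).flatten(1))
    deep_representations.append(rep.squeeze(0).numpy())

X = np.stack(deep_representations)              # one 4096-d vector per chunk
print(X.shape)
```

A static classifier (for example a linear SVM) would then be trained on these chunk-level vectors, with recording-level decisions obtained by aggregating chunk predictions.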