Maximilian Schmitt
University of Passau
Publications
Featured research published by Maximilian Schmitt.
conference of the international speech communication association | 2016
Maximilian Schmitt; Fabien Ringeval; Björn W. Schuller
Recognition of natural emotion in speech is a challenging task. Different methods have been proposed to tackle it, such as acoustic feature brute-forcing or even end-to-end learning. Recently, bag-of-audio-words (BoAW) representations of acoustic low-level descriptors (LLDs) have been employed successfully in the domain of acoustic event classification and other audio recognition tasks. In this approach, feature vectors of acoustic LLDs are quantised according to a learnt codebook of audio words. Then, a histogram of the occurring ‘words’ is built. Despite their massive potential, BoAW have not been thoroughly studied in emotion recognition. Here, we propose a method using BoAW created only from mel-frequency cepstral coefficients (MFCCs). Support vector regression is then used to predict emotion continuously in time and value, such as in the dimensions arousal and valence. We compare this approach with the computation of functionals based on the MFCCs and perform extensive evaluations on the RECOLA database, which features spontaneous and natural emotions. Results show that the BoAW representation of MFCCs not only performs significantly better than functionals, but also outperforms by far most recently published deep learning approaches, including convolutional and recurrent networks.
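The quantisation and histogram steps described in this abstract can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names, the two-dimensional toy frames, and the hand-picked codebook are all assumptions, and how the codebook is learnt is deliberately left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(frame, codebook):
    """Return the index of the audio word closest to the given LLD frame."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))


def bag_of_audio_words(frames, codebook):
    """Build a normalised histogram of audio-word occurrences over all frames."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[quantise(frame, codebook)] += 1
    total = len(frames)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy data (assumed for illustration): three audio words, four 2-D frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]]
histogram = bag_of_audio_words(frames, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

The resulting fixed-length histogram is what a regressor such as support vector regression would take as input, regardless of how many frames the segment contains.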
international conference on digital health | 2017
Jun Deng; Nicholas Cummins; Maximilian Schmitt; Kun Qian; Fabien Ringeval; Björn W. Schuller
Machine learning paradigms based on child vocalisations show great promise as an objective marker of developmental disorders such as autism. In conventional detection systems, hand-crafted acoustic features are usually fed into a discriminative classifier (e.g., Support Vector Machines); however, it is well known that the accuracy and robustness of such a system are limited by the size of the associated training data. This paper explores, for the first time, the use of feature representations learnt using a deep Generative Adversarial Network (GAN) for classifying children's speech affected by developmental disorders. A comparative evaluation of our proposed system with different acoustic feature sets is performed on the Child Pathological and Emotional Speech database. Key experimental results demonstrate that GAN-based methods exhibit competitive performance with the conventional paradigms in terms of the unweighted average recall metric.
international conference on speech and computer | 2018
Jing Han; Maximilian Schmitt; Björn W. Schuller
In social interaction, people tend to mimic their conversational partners both when they agree and disagree. Research on this phenomenon is complex, although the underlying theory is not new, and related studies show that mimicry can enhance social relationships and increase affiliation and rapport. However, automatically recognising such a phenomenon is still in its early development. In this paper, we analyse mimicry in the speech domain and propose a novel method using hand-crafted low-level acoustic descriptors and autoencoders (AEs). Specifically, for each conversation, two AEs are built, one for each speaker. After training, the acoustic features of one speaker are tested with the AE that is trained on the features of her counterpart. The proposed approach is evaluated on a database consisting of almost 400 subjects from 6 different cultures, recorded in-the-wild. By calculating the AEs’ reconstruction errors for all speakers and analysing the errors at different times in their interactions, we show that, albeit to different degrees from culture to culture, mimicry arises in most interactions.
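One way to write down the cross-speaker reconstruction error mentioned in this abstract is given below. The notation is an assumption introduced here for illustration: \(\mathbf{x}^{A}_{t}\) is the acoustic feature vector of speaker A at frame \(t\), \(f_{B}\) is the autoencoder trained on speaker B, and \(T_A\) is the number of frames considered. The abstract does not state which error measure is used; the mean squared error is assumed.

```latex
e_{A \rightarrow B} = \frac{1}{T_A} \sum_{t=1}^{T_A}
  \left\lVert \mathbf{x}^{A}_{t} - f_{B}\!\left(\mathbf{x}^{A}_{t}\right) \right\rVert_2^2
```

Under this reading, an error that decreases when computed over later parts of a conversation would suggest that speaker A's features have become easier to reconstruct with speaker B's model, which is consistent with mimicry.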
international conference on computers helping people with special needs | 2018
Simone Hantke; Christian Cohrs; Maximilian Schmitt; Benjamin Tannert; Florian Lütkebohmert; Mathias Detmers; Heidi Schelhowe; Björn W. Schuller
Mental, neurological and/or physical disabilities often affect individuals’ cognitive processes, which in turn can introduce difficulties with remembering what they have learnt. Therefore, completing trivial daily tasks can be challenging, and supervision or help from others is constantly needed. In this regard, these individuals with special needs can benefit from today's advanced assistance techniques. Within this contribution, a language-driven, workplace-integrated assistance system is proposed, supporting disabled individuals in the handling of certain activities while taking into account their emotional-cognitive constitution and state. In this context, we present a set of baseline results for emotion recognition tasks and conduct machine learning experiments to benchmark the performance of an automatic emotion recognition system on the collected data. We show that this is a challenging task that can nevertheless be tackled with state-of-the-art methodologies.
SPECOM | 2018
Vedhas Pandit; Maximilian Schmitt; Nicholas Cummins; Franz Graf; Lucas Paletta; Björn W. Schuller
We evaluate, for the first time, the generalisability of in-the-wild speech-based affect tracking models using the database used in the ‘Affect Recognition’ sub-challenge of the Audio/Visual Emotion Challenge and Workshop (AVEC 2017) – namely the ‘Automatic Sentiment Analysis in the Wild (SEWA)’ and the ‘Graz Real-life Affect in the Street and Supermarket (GRAS\(^{2}\))’ corpus. The GRAS\(^{2}\) corpus is the only corpus to date featuring audiovisual recordings and time-continuous affect labels of random participants recorded surreptitiously in a public place. The SEWA database was also collected in an in-the-wild paradigm in that it also features spontaneous affect behaviours and real-life acoustic disruptions due to connectivity and hardware problems. The SEWA participants, however, were well aware of being recorded throughout, and thus the data potentially suffers from the ‘observer’s paradox’. In this paper, we evaluate how a model trained on typical data suffering from the observer’s paradox (SEWA) fares on real-life data that is relatively free from such a psychological effect (GRAS\(^{2}\)), and vice versa. Because of the drastically different recording conditions and recording equipment, the feature spaces for the two databases differ extremely. The in-the-wild nature of the real-life databases and the extreme disparity between the feature spaces are the key challenges tackled in this paper, a problem of high practical relevance. We extract bag-of-audio-words features using, for the very first time, a randomised database-independent codebook. True to our hypothesis, the Support Vector Regression model trained on GRAS\(^{2}\) had better generalisability, as this model could reasonably predict the SEWA arousal labels.
Computers in Biology and Medicine | 2018
Christoph Janott; Maximilian Schmitt; Yue Zhang; Kun Qian; Vedhas Pandit; Zixing Zhang; Clemens Heiser; Winfried Hohenhorst; Michael Herzog; Werner Hemmert; Björn W. Schuller
OBJECTIVE: Snoring can be excited in different locations within the upper airways during sleep. It was hypothesised that the excitation locations are correlated with distinct acoustic characteristics of the snoring noise. To verify this hypothesis, a database of snore sounds is developed, labelled with the location of sound excitation. METHODS: Video and audio recordings taken during drug-induced sleep endoscopy (DISE) examinations from three medical centres have been semi-automatically screened for snore events, which subsequently have been classified by ENT experts into four classes based on the VOTE classification. The resulting dataset containing 828 snore events from 219 subjects has been split into Train, Development, and Test sets. An SVM classifier has been trained using low-level descriptors (LLDs) related to energy, spectral features, mel-frequency cepstral coefficients (MFCC), formants, voicing, harmonic-to-noise ratio (HNR), spectral harmonicity, pitch, and microprosodic features. RESULTS: An unweighted average recall (UAR) of 55.8% could be achieved using the full set of LLDs including formants. The best-performing subset is the MFCC-related set of LLDs. A strong difference in performance could be observed between the permutations of train, development, and test partition, which may be caused by the relatively low number of subjects included in the smaller classes of the strongly unbalanced data set. CONCLUSION: A database of snoring sounds is presented which are classified according to their sound excitation location based on objective criteria and verifiable video material. With the database, it could be demonstrated that machine classifiers can distinguish different excitation locations of snoring sounds in the upper airway based on acoustic parameters.
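Unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally however few samples it has, which matters for strongly unbalanced data such as that described above. The sketch below is a minimal illustration: the function name and the toy labels are assumptions and are not taken from the snore database.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class in y_true has equal weight."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels (assumed): an unbalanced four-class problem.
y_true = ["V"] * 6 + ["O"] * 2 + ["T"] + ["E"]
y_pred = ["V"] * 6 + ["O", "V"] + ["V"] + ["E"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)  # 0.625 0.8
```

In this toy case plain accuracy is 0.8 because the majority class dominates, while the UAR of 0.625 reflects the missed minority classes. With four classes, chance-level UAR is 25%.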
international symposium on neural networks | 2017
Johanna Bohm; Florian Eyben; Maximilian Schmitt; Harald Kosch; Björn W. Schuller
The quality of the singing voice is an important aspect of subjective, aesthetic perception of music. In this contribution, we propose a method to automatically assess perceived singing quality. We classify monophonic vocal recordings without accompaniment into one of three classes of singing quality. Unprocessed private and non-commercial recordings from a social media website are utilised. In addition to the user ratings given on the website, we let both subjects with and without a musical background annotate the samples. Building on musicological foundations, we define and extract acoustic parameters describing the quality of the sound, musical expression and intonation of the singing. Besides features which are already established in the field of Music Information Retrieval, such as loudness and mel-frequency cepstral coefficients, we propose and employ new types of features which are specific to intonation. For automatic classification by supervised machine learning methods, models predicting the subjective ratings and the user ratings on the social media website are learnt. We perform an exhaustive evaluation of both different classifiers and combinations of features. We show that the performance of automatic classification is close to that of human evaluators. Utilising support vector machines, an accuracy of classification of 55.4 %, based on the subjective ratings, and of 84.7 %, based on the user ratings of the social media website, are achieved.
international conference on e-health networking, applications and services | 2017
Christian Kohlschein; Maximilian Schmitt; Björn W. Schuller; Sabina Jeschke; Cornelius J. Werner
Aphasia is an acquired language disorder resulting from damage to language-related networks of the brain, most often as a result of ischemic stroke or traumatic brain injury. Within the European Union, over 580000 people are affected each year. Both assessment and treatment of aphasia require the analysis of language, in particular of spontaneous speech. Factoring in therapy and diagnosis sessions, which require the presence of a speech therapist and a physician, aphasia is a resource-intensive condition: it has been estimated that in Germany alone, there are 70000 new cases of stroke-related aphasia every year, 35000 of which persist for more than six months, and all of which should receive formal diagnostic testing at some point. Having an automatic system for the detection and evaluation of aphasic speech would be of great benefit for the medical domain by immensely speeding up diagnostic processes and thus freeing up valuable resources for, e.g., therapy. As a first step towards building such a system, it is necessary to identify the vocal biomarkers which characterize aphasic speech. Furthermore, a database is needed which maps from recordings of aphasic speech to the type and severity of the disorder. In this paper, we present the vocal biomarkers and a description of the existing Aachen Aphasia database containing recordings and transcriptions of therapy sessions. We outline how the biomarkers and the database could be used to construct a recognition system which automatically maps pathological speech to aphasia type and severity.
international conference of the ieee engineering in medicine and biology society | 2017
Nicholas Cummins; Maximilian Schmitt; Shahin Amiriparian; Jarek Krajewski; Björn W. Schuller
A combination of passive, non-invasive and non-intrusive smart monitoring technologies is currently transforming healthcare. These technologies will soon be able to provide immediate health-related feedback for a range of illnesses and conditions. Such tools would be game-changing for serious public health concerns, such as seasonal cold and flu, for which early diagnosis and social isolation play a key role in reducing the spread. In this regard, this paper explores, for the first time, the automated classification of individuals with Upper Respiratory Tract Infections (URTI) using recorded speech samples. Key results presented indicate that our classifiers can achieve similar results to those seen in related health-based detection tasks, indicating the promise of using computational paralinguistic analysis for the detection of URTI-related illnesses.
GI-Jahrestagung | 2017
Maximilian Schmitt; Björn W. Schuller
The recognition of audio effects employed in recordings of electric guitar or bass has a wide range of applications in music information retrieval. It is meaningful in holistic automatic music transcription and annotation approaches for, e.g., music education, intelligent music search, or musicology. In this contribution, we investigate the relevance of a large variety of state-of-the-art acoustic features for the task of automatic guitar effect recognition. The usage of functionals, i.e., statistics such as moments and percentiles, is compared to the bag-of-audio-words approach to obtain an acoustic representation of a recording at the instance level. Our results are based on a database of more than 50 000 monophonic and polyphonic samples of electric guitars and bass guitars, processed with 10 different digital audio effects.
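Functionals in the sense used above summarise a frame-level descriptor contour of any length by a fixed set of statistics. The following sketch shows a few such statistics for a single contour. The chosen set (mean, standard deviation, quartiles), the linear-interpolation percentile, and the toy contour are assumptions for illustration, not the feature set used in the paper.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile with linear interpolation; q is given in the range [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def functionals(contour):
    """Map a variable-length descriptor contour to fixed-length statistics."""
    if not contour:
        raise ValueError("contour must not be empty")
    ordered = sorted(contour)
    return {
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "p25": percentile(ordered, 25),
        "median": percentile(ordered, 50),
        "p75": percentile(ordered, 75),
    }


# Toy contour (assumed): e.g. the loudness of five consecutive frames.
stats = functionals([3.0, 1.0, 5.0, 2.0, 4.0])
print(stats["mean"], stats["p25"], stats["median"], stats["p75"])  # 3.0 2.0 3.0 4.0
```

Applying the same set of functionals to every descriptor yields one fixed-length vector per recording, which is the instance-level representation that the abstract compares with bag-of-audio-words.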
