Bogdan Vlasenko
Otto-von-Guericke University Magdeburg
Publications
Featured research published by Bogdan Vlasenko.
IEEE Automatic Speech Recognition and Understanding Workshop | 2009
Björn W. Schuller; Bogdan Vlasenko; Florian Eyben; Gerhard Rigoll; Andreas Wendemuth
In the light of the first challenge on emotion recognition from speech, we provide the largest-to-date benchmark comparison under equal conditions on nine standard corpora in the field, using the two predominant paradigms: modeling on the frame level by means of hidden Markov models, and supra-segmental modeling by systematic feature brute-forcing. The investigated corpora are the ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SmartKom, SUSAS, and VAM databases. To provide better comparability among sets, we additionally cluster each database's emotions into binary valence and arousal discrimination tasks. Large differences are found among corpora, which mostly stem from naturalistic emotions and spontaneous speech versus more prototypical events. Further, supra-segmental modeling proves significantly beneficial on average when several classes are addressed at a time.
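To illustrate the clustering step mentioned above, the sketch below maps categorical emotion labels to binary arousal and valence classes. The mapping shown is a plausible example for EMO-DB labels, not necessarily the exact mapping used in the paper.

```python
# Illustrative sketch: collapsing categorical emotion labels into the binary
# arousal and valence tasks used to compare the nine corpora.
# The label-to-class assignment below is an assumption for EMO-DB.
EMODB_TO_BINARY = {
    # label: (arousal, valence)
    "anger":   ("high", "negative"),
    "fear":    ("high", "negative"),
    "joy":     ("high", "positive"),
    "sadness": ("low",  "negative"),
    "boredom": ("low",  "negative"),
    "disgust": ("low",  "negative"),
    "neutral": ("low",  "positive"),
}

def to_arousal(label: str) -> str:
    """Collapse a categorical emotion label to a binary arousal class."""
    return EMODB_TO_BINARY[label][0]

def to_valence(label: str) -> str:
    """Collapse a categorical emotion label to a binary valence class."""
    return EMODB_TO_BINARY[label][1]
```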
IEEE Transactions on Affective Computing | 2010
Björn W. Schuller; Bogdan Vlasenko; Florian Eyben; Martin Wöllmer; André Stuhlsatz; Andreas Wendemuth; Gerhard Rigoll
As the recognition of emotion from speech has matured to a degree where it becomes applicable in real-life settings, it is time for a realistic view on obtainable performances. Most studies tend to...
Affective Computing and Intelligent Interaction | 2007
Bogdan Vlasenko; Björn W. Schuller; Andreas Wendemuth; Gerhard Rigoll
Opposing the predominant turn-wise statistics of acoustic Low-Level Descriptors followed by static classification, we re-investigate dynamic modeling directly on the frame level in speech-based emotion recognition. This seems beneficial, as it is well known that important information exists on temporal sub-turn layers. Most promisingly, we integrate this frame-level information within a state-of-the-art large-feature-space emotion recognition engine. To investigate frame-level processing we employ a typical speaker-recognition set-up tailored for emotion classification: a GMM for classification and MFCCs plus speed and acceleration coefficients as features. We also consider the use of multiple states, i.e. an HMM. To fuse this information with turn-based modeling, output scores are added to a super-vector combined with static acoustic features. A variety of Low-Level Descriptors and functionals covering prosodic, speech quality, and articulatory aspects are thereby considered. Starting from 1.4k features, we select optimal configurations including and excluding GMM information. The final decision task is realized by use of SVM. Extensive test runs are carried out on two popular public databases, namely EMO-DB and SUSAS, to investigate acted and spontaneous data. As we face the current challenge of speaker-independent analysis, we also discuss the benefits arising from speaker normalization. The results obtained clearly emphasize the superior power of integrating diverse time levels.
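A minimal sketch of the frame-level component described above: one GMM per emotion class trained on MFCC plus delta and delta-delta frames, with a turn decided by summed log-likelihood. It uses librosa and scikit-learn; the feature settings (13 MFCCs, 16 mixtures) are illustrative assumptions, not the paper's exact configuration, and the SVM fusion step is only hinted at in a comment.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """MFCCs plus speed and acceleration coefficients, one row per frame."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                         # shape: (frames, 39)

def train_gmms(frames_per_emotion: dict, n_mix: int = 16) -> dict:
    """Train one diagonal-covariance GMM per emotion class."""
    return {emo: GaussianMixture(n_components=n_mix, covariance_type="diag",
                                 random_state=0).fit(frames)
            for emo, frames in frames_per_emotion.items()}

def classify_turn(gmms: dict, turn_frames: np.ndarray) -> str:
    """Summed frame log-likelihood per class; these per-class scores could also
    be appended to a supra-segmental feature super-vector for SVM fusion."""
    scores = {emo: gmm.score_samples(turn_frames).sum()
              for emo, gmm in gmms.items()}
    return max(scores, key=scores.get)
```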
IEEE Automatic Speech Recognition and Understanding Workshop | 2007
Björn W. Schuller; Bogdan Vlasenko; Ricardo Minguez; Gerhard Rigoll; Andreas Wendemuth
In the search for a standard unit for the recognition of emotion in speech, a whole turn, that is, the full stretch of speech by one person in a conversation, is common. Within applications such turns often seem favorable. Yet, the high effectiveness of sub-turn entities is known. In this respect a two-stage approach is investigated to provide higher temporal resolution: chunking of speech turns according to acoustic properties, and multi-instance learning for turn-mapping after individual chunk analysis. For chunking, fast pre-segmentation into emotionally quasi-stationary segments is carried out by one-pass Viterbi beam search with token passing based on MFCCs. Chunk analysis is realized by brute-force large feature space construction with subsequent subset selection, SVM classification, and speaker normalization. Extensive tests reveal differences compared to one-stage processing. Alternatively, syllables are used for chunking.
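The turn-mapping step could look roughly like the sketch below: once each chunk of a turn has been classified individually, the chunk-level decisions are mapped back to a single turn label. Majority voting and score summation are shown as two simple multi-instance strategies; they are assumptions for illustration, not necessarily the paper's exact mapping rule.

```python
from collections import Counter

def map_chunks_to_turn(chunk_labels: list) -> str:
    """Return the most frequent chunk-level label as the turn-level decision."""
    return Counter(chunk_labels).most_common(1)[0][0]

def map_chunks_by_confidence(chunk_scores: list) -> str:
    """Alternative: sum per-class scores (e.g. SVM decision values) over all
    chunks of the turn and pick the class with the highest total."""
    totals = {}
    for scores in chunk_scores:           # each item: {label: score, ...}
        for label, value in scores.items():
            totals[label] = totals.get(label, 0.0) + value
    return max(totals, key=totals.get)
```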
International Conference on Multimedia and Expo | 2008
Björn W. Schuller; Bogdan Vlasenko; Dejan Arsic; Gerhard Rigoll; Andreas Wendemuth
Recognition of emotion in speech usually uses acoustic models that ignore the spoken content. Likewise, one general model per emotion is trained independent of the phonetic structure. Given sufficient data, this approach seemingly works well enough. Yet this paper tries to answer the question of whether acoustic emotion recognition strongly depends on phonetic content, and whether models tailored to the spoken unit can lead to higher accuracies. We therefore investigate phoneme and word models by use of a large prosodic, spectral, and voice quality feature space and Support Vector Machines (SVM). The experiments also take the need for ASR into account to select appropriate unit models. Test runs on the well-known EMO-DB database under speaker independence demonstrate the superiority of word emotion models over today's common general models, provided sufficient occurrences in the training corpus.
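A hedged sketch of the unit-model idea: if the ASR output identifies a word for which a word-specific emotion model was trained on sufficient occurrences, that model is used, otherwise the general model serves as fallback. The threshold and function names are illustrative; the paper's actual selection logic may differ.

```python
MIN_TRAIN_OCCURRENCES = 20   # assumed threshold for "sufficient" training data

def predict_word_emotion(word: str, features, word_models: dict,
                         word_counts: dict, general_model):
    """Choose a word-tailored classifier when enough training data exists,
    otherwise fall back to the general (content-independent) model."""
    if word_counts.get(word, 0) >= MIN_TRAIN_OCCURRENCES and word in word_models:
        return word_models[word].predict([features])[0]
    return general_model.predict([features])[0]
```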
Computer Speech & Language | 2014
Bogdan Vlasenko; Dmytro Prylipko; Ronald Böck; Andreas Wendemuth
The role of automatic emotion recognition from speech is growing continuously because of the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis independent of phonetic transcription. Here, we are interested in a phoneme-based classification of the level of arousal in acted and spontaneous emotions. To start, we show that our previously published classification technique, which achieved high-level results in the Interspeech 2009 Emotion Challenge, cannot provide sufficiently good classification in cross-corpora evaluation (a condition close to real-life applications). To prove the robustness of our emotion classification techniques we use cross-corpora evaluation for a simplified two-class problem, namely high versus low arousal emotions, and use emotion classes on the phoneme level for classification. We build our speaker-independent emotion classifier with HMMs, using GMM-based production probabilities and MFCC features. This classifier performs equally well when using a complete phoneme set as it does with a reduced set of indicative vowels (7 out of 39 phonemes in the German SAM-PA list). Afterwards, we compare the emotion classification performance of the technique used in the Emotion Challenge with phoneme-based classification within the same experimental setup. With phoneme-level emotion classes we increase cross-corpora classification performance by about 3.15% absolute (4.69% relative) for models trained on acted emotions (EMO-DB dataset) and evaluated on spontaneous emotions (VAM dataset); under the reverse conditions (trained on VAM, tested on EMO-DB) we obtain a 15.43% absolute (23.20% relative) improvement. We show that using phoneme-level emotion classes can improve classification performance even with the comparably low speech recognition performance obtained with scant a priori knowledge about the language, implemented as a zero-gram for word-level modeling and a bi-gram for phoneme-level modeling. Finally, we compare our results with the state-of-the-art cross-corpora evaluations on the VAM database. For training our models, we use an almost 15 times smaller training set, consisting of 456 utterances (210 low and 246 high arousal emotions) instead of 6820 utterances (4685 high and 2135 low arousal emotions). We are nevertheless able to increase cross-corpora classification performance by about 2.25% absolute (3.22% relative), from UA=69.7% obtained by Zhang et al. to UA=71.95%.
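The UA figures quoted above denote unweighted average recall, i.e. the mean of per-class recalls, which is insensitive to the class imbalance between low and high arousal utterances. A small sketch of the metric using scikit-learn (this only illustrates the measure, it is not the paper's evaluation code):

```python
from sklearn.metrics import recall_score

def unweighted_average_recall(y_true, y_pred) -> float:
    """Mean of per-class recalls (UA), robust to class imbalance such as the
    210 low vs. 246 high arousal utterances in the EMO-DB training set."""
    return recall_score(y_true, y_pred, average="macro")

# Toy example with two arousal classes
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low",  "low", "low", "high"]
print(unweighted_average_recall(y_true, y_pred))  # 0.5 * (1/2 + 2/3) ≈ 0.583
```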
International Conference on Multimedia and Expo | 2011
Bogdan Vlasenko; David Philippou-Hübner; Dmytro Prylipko; Ronald Böck; Ingo Siegert; Andreas Wendemuth
Recently, automatic emotion recognition from speech has attracted growing interest within the human-machine interaction research community. Most emotion recognition methods use context-independent frame-level or turn-level analysis. In this article, we introduce context-dependent vowel-level analysis applied to emotion classification. The average first formant value extracted at the vowel level is used as a one-dimensional acoustic feature vector, and the Neyman-Pearson criterion is used for classification. Our classifier is able to detect high-arousal emotions with small error rates. Within our research we show that the smallest emotional unit should be the vowel rather than the word. We find that vowel-level analysis can be an important ingredient in developing a robust emotion classifier. Our research can also be useful for developing robust affective speech recognition methods and high-quality emotional speech synthesis systems.
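A hedged sketch of a Neyman-Pearson style detector on the one-dimensional feature described above (the average first formant per vowel): the decision threshold is chosen so that the false-alarm rate on low-arousal data does not exceed a target level, and test vowels above the threshold are flagged as high arousal. The alpha value and the "higher F1 implies higher arousal" direction are assumptions for illustration.

```python
import numpy as np

def np_threshold(f1_low_arousal: np.ndarray, alpha: float = 0.05) -> float:
    """Smallest threshold whose false-alarm rate on low-arousal vowels <= alpha,
    i.e. the (1 - alpha) quantile of the low-arousal F1 distribution."""
    return float(np.quantile(f1_low_arousal, 1.0 - alpha))

def detect_high_arousal(f1_value: float, threshold: float) -> bool:
    """Flag a vowel as high arousal if its mean F1 exceeds the threshold."""
    return f1_value > threshold
```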
International Conference on Multimedia and Expo | 2011
Ingo Siegert; Ronald Böck; Bogdan Vlasenko; David Philippou-Hübner; Andreas Wendemuth
In emotion recognition from speech, a good transcription and annotation of the given material is crucial. Moreover, the question of how to find good emotional labels for new data material is a basic issue. It is not only a question of which emotion labels to choose, but also of how well labellers can cope with the annotation methods. In this paper, we present our investigations of emotional labelling with three different methods (Basic Emotions, the Geneva Emotion Wheel and Self Assessment Manikins) and compare them in terms of emotion coverage and usability. We show that emotion labels derived from the Geneva Emotion Wheel or Self Assessment Manikins fulfill our requirements, whereas Basic Emotions are not feasible for emotion labelling of spontaneous speech.
Perception and Interactive Technologies | 2008
Bogdan Vlasenko; Björn W. Schuller; Andreas Wendemuth; Gerhard Rigoll
Acoustic modeling in today's emotion recognition engines employs general models independent of the spoken phonetic content. This seems to work well enough given sufficient instances to cover a broad variety of phonetic structures and emotions at the same time. However, data is usually sparse in the field, and the question arises whether unit-specific models such as word emotion models could outperform the typical general models. In this respect this paper tries to answer the question of how strongly acoustic emotion models depend on the textual and phonetic content. We investigate the influence at the turn and word levels by use of state-of-the-art techniques for frame and word modeling on the well-known public Berlin Emotional Speech and Speech Under Simulated and Actual Stress databases. The results clearly show that the phonetic structure strongly influences the accuracy of emotion recognition.
International Conference on Multimedia and Expo | 2009
Bogdan Vlasenko; Andreas Wendemuth
Spoken human-machine interaction supported by state-of-the-art dialog systems is becoming a standard technology, and a lot of effort has been invested in this kind of artificial communication interface. Still, spoken dialog systems (SDS) are not able to provide the user with a natural way of communication, because existing automated dialog systems do not dedicate enough attention to problems in the interaction related to affected user behavior. This paper addresses aspects of the design and implementation of user behavior models in dialog systems aimed at providing natural human-machine interaction. We discuss a viable technique for integrating speech-based emotion classification into an SDS for robust affected automatic speech recognition and an emotion-correlated dialog strategy. First, we describe existing methods for emotion recognition from speech and ASR methods adapted to affected speech. Second, we introduce an approach to achieve emotion-adaptive dialog management in human-machine interaction. A multimodal human-machine interaction system with an integrated user behavior model is created within the project “Neurobiologically Inspired, Multimodal Intention Recognition for Technical Communication Systems” (NIMITEK). Currently NIMITEK provides a technical demonstrator to study these principles in a dedicated prototypical task, namely solving the game Towers of Hanoi. In this paper, we describe the general approach NIMITEK takes to emotional man-machine interaction.
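An emotion-correlated dialog strategy could, in its simplest form, look like the sketch below: the recognized user state selects how the next system prompt is generated. This is an illustrative sketch only, not NIMITEK code; the state names and strategies are invented for illustration.

```python
def choose_dialog_strategy(emotion: str) -> str:
    """Map a recognized user emotion to a dialog strategy (illustrative)."""
    if emotion in ("anger", "frustration"):
        return "supportive"    # offer help, simplify the next instruction
    if emotion == "boredom":
        return "challenging"   # raise task difficulty or shorten prompts
    return "neutral"           # default task-oriented strategy
```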