Dino Seppi
Katholieke Universiteit Leuven
Publications
Featured research published by Dino Seppi.
International Conference on Acoustics, Speech, and Signal Processing | 2007
Björn W. Schuller; Dino Seppi; Anton Batliner; Andreas K. Maier; Stefan Steidl
As automatic emotion recognition based on speech matures, new challenges can be faced. We therefore address the major aspects in view of potential applications in the field, to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performance: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence. Three different data sets are used: the Berlin Emotional Speech Database, the Danish Emotional Speech Database, and the spontaneous AIBO Emotion Corpus. By using different feature types, such as word- or turn-based statistics, manual versus forced alignment, and optimization techniques, we show how to best cope with this demanding task and how noise addition or different microphone positions affect emotion recognition.
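The turn-based statistics mentioned above are typically obtained by applying statistical functionals to frame-wise low-level descriptors. A minimal sketch of that idea in Python; the descriptor choice and functional set here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def turn_level_functionals(llds: np.ndarray) -> np.ndarray:
    """Map a (frames x descriptors) matrix of low-level descriptors
    (e.g., pitch, energy) to one fixed-length turn-level feature vector
    by applying statistical functionals per descriptor."""
    funcs = [np.mean, np.std, np.min, np.max,
             lambda x, axis: np.percentile(x, 75, axis=axis) - np.percentile(x, 25, axis=axis)]
    return np.concatenate([f(llds, axis=0) for f in funcs])

# Example: 120 frames of 2 descriptors (pitch, energy) for one turn
turn = np.random.rand(120, 2)
features = turn_level_functionals(turn)  # shape (10,) = 5 functionals x 2 descriptors
```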
International Conference on Acoustics, Speech, and Signal Processing | 2009
Björn W. Schuller; Anton Batliner; Stefan Steidl; Dino Seppi
This paper investigates the automatic recognition of emotion from spoken words by vector space modeling vs. string kernels, which have not yet been investigated in this respect. Apart from the spoken content directly, we integrate Part-of-Speech and higher semantic tagging in our analyses. As opposed to most works in the field, we evaluate the performance with an ASR engine in the loop. Extensive experiments are run on the FAU Aibo Emotion Corpus of 4k spontaneous emotional child-robot interactions and show surprisingly low performance degradation with real ASR over transcription-based emotion recognition. In the result, bag-of-words modeling dominates all other modeling forms based on the spoken content.
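A minimal sketch of the bag-of-words idea on recognized word sequences, using scikit-learn; the toy transcriptions and labels below are invented for illustration and do not come from the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical (possibly errorful) ASR transcriptions with emotion labels
turns = ["aibo go left", "no no stop", "good boy aibo", "stop it now"]
labels = ["neutral", "angry", "positive", "angry"]

# Bag-of-words over the recognized word sequence, fed to a linear classifier
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(turns, labels)
print(model.predict(["stop aibo"]))
```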
Archive | 2011
Anton Batliner; Björn W. Schuller; Dino Seppi; Stefan Steidl; Laurence Devillers; Laurence Vidrascu; Thurid Vogt; Vered Aharonson; Noam Amir
In this chapter, we focus on the automatic recognition of emotional states using acoustic and linguistic parameters as features and classifiers as tools to predict the ‘correct’ emotional states. We first sketch history and state of the art in this field; then we describe the process of ‘corpus engineering’, i.e. the design and the recording of databases, the annotation of emotional states, and further processing such as manual or automatic segmentation. Next, we present an overview of acoustic and linguistic features that are extracted automatically or manually. In the section on classifiers, we deal with topics such as the curse of dimensionality and the sparse data problem, classifiers, and evaluation. At the end of each section, we point out important aspects that should be taken into account for the planning or the assessment of studies. The subject area of this chapter is not emotions in some narrow sense but in a wider sense encompassing emotion-related states such as moods, attitudes, or interpersonal stances as well. We do not aim at an in-depth treatise of some specific aspects or algorithms but at an overview of approaches and strategies that have been used or should be used.
Affective Computing and Intelligent Interaction | 2009
Stefan Steidl; Anton Batliner; Björn W. Schuller; Dino Seppi
We first depict the challenge of addressing all non-prototypical varieties of emotional states signalled in speech in an open-microphone setting, i.e., using all recorded data. In the remainder of the article, we illustrate promising strategies on the FAU Aibo emotion corpus, showing different degrees of classification performance for different degrees of prototypicality, and elaborating on the use of ROC curves, classification confidences, and correlation-based analyses.
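One way to make the prototypicality argument concrete is to compute ROC curves on subsets filtered by annotator agreement. A minimal sketch, assuming a per-instance agreement value as a prototypicality proxy; the data below is invented:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical per-instance data: true binary labels, classifier scores,
# and an annotator-agreement value used as a prototypicality proxy
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
scores = np.array([0.2, 0.9, 0.6, 0.3, 0.8, 0.4, 0.55, 0.7])
agreement = np.array([0.5, 1.0, 0.6, 0.9, 1.0, 0.8, 0.4, 0.9])

for threshold in (0.0, 0.7):  # all data vs. a more prototypical subset
    keep = agreement >= threshold
    fpr, tpr, _ = roc_curve(y_true[keep], scores[keep])
    print(f"min. agreement {threshold}: AUC = {auc(fpr, tpr):.2f}")
```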
International Conference on Acoustics, Speech, and Signal Processing | 2011
Kris Demuynck; Dino Seppi; Dirk Van Compernolle; Patrick Nguyen; Geoffrey Zweig
Exemplar-based recognition systems are characterized by the fact that, instead of abstracting large amounts of data into compact models, they store the observed data enriched with some annotations and infer on the fly from the data by finding those exemplars that resemble the input speech best. One advantage of exemplar-based systems is that, next to deriving what the current phone or word is, one can easily derive a wealth of meta-information concerning the chunk of audio under investigation. In this work we harvest meta-information from the set of best-matching exemplars that is thought to be relevant for recognition, such as word boundary predictions and speaker entropy. Integrating this meta-information into the recognition framework using segmental conditional random fields reduced the WER of the exemplar-based system on the WSJ Nov92 20k task from 8.2% to 7.6%. Adding the HMM score and multiple HMM phone detectors as features further reduced the error rate to 6.6%.
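Speaker entropy over the k best-matching exemplars is one example of such meta-information. A minimal sketch of how it could be computed; the speaker IDs are hypothetical, and the paper's exact feature definition may differ:

```python
import math
from collections import Counter

def speaker_entropy(exemplar_speakers):
    """Entropy of the speaker distribution among the k best-matching
    exemplars: low entropy suggests the match is dominated by one
    speaker, which can serve as a meta-feature for a later SCRF stage."""
    counts = Counter(exemplar_speakers)
    total = len(exemplar_speakers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Speaker IDs of a hypothetical 8-best exemplar list
print(speaker_entropy(["spk3", "spk3", "spk3", "spk1", "spk3", "spk7", "spk3", "spk3"]))
```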
EURASIP Journal on Advances in Signal Processing | 2011
Felix Weninger; Björn W. Schuller; Anton Batliner; Stefan Steidl; Dino Seppi
We present a comprehensive study on the effect of reverberation and background noise on the recognition of non-prototypical emotions from speech. We carry out our evaluation on a single, well-defined task based on the FAU Aibo Emotion Corpus, consisting of spontaneous children's speech, which was used in the INTERSPEECH 2009 Emotion Challenge, the first of its kind. Based on the challenge task, and relying on well-proven methodologies from the speech recognition domain, we derive test scenarios with realistic noise and reverberation conditions, including matched as well as mismatched condition training. As feature extraction based on supervised Non-negative Matrix Factorization (NMF) has been proposed in automatic speech recognition for enhanced robustness, we introduce and evaluate different kinds of NMF-based features for emotion recognition. We conclude that NMF features can significantly contribute to the robustness of state-of-the-art emotion recognition engines in practical application scenarios where different noise and reverberation conditions have to be faced.
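The general pattern behind NMF-based features is to learn a fixed set of spectral bases on training material and then describe new audio by its activations of those bases. A minimal sketch with scikit-learn's NMF; the component count, pooling step, and random spectrograms are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical magnitude spectrograms (freq_bins x frames), nonnegative
train_spec = np.abs(np.random.randn(64, 500))
test_spec = np.abs(np.random.randn(64, 120))

# Learn a fixed set of spectral bases on training material ...
nmf = NMF(n_components=20, init="nndsvda", max_iter=400)
nmf.fit(train_spec.T)  # sklearn factorizes X ~ W H with rows as samples

# ... then encode new audio as activations of those bases and pool
# them over time into one utterance-level feature vector
activations = nmf.transform(test_spec.T)   # (frames x components)
utterance_features = activations.mean(axis=0)
```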
International Conference on Acoustics, Speech, and Signal Processing | 2011
Kris Demuynck; Dino Seppi; Hugo Van hamme; Dirk Van Compernolle
In this paper we present a number of improvements that were recently made to the template-based speech recognition system developed at ESAT. Combining these improvements resulted in a decrease in word error rate from 9.6% to 8.2% on the Nov92, 20k trigram, Wall Street Journal task. The improvements are along different lines. Apart from the time warping already applied within the DTW, it was found beneficial to apply additional length compensation on the template score. The single best score was replaced by a weighted k-NN average, while maintaining natural successor information as an ensemble cost. The local geometry of the acoustic space is now taken into account by assigning a diagonal covariance matrix to each input frame. Context sensitivity of short templates is increased by taking cross-boundary scores into account when sorting the N best templates. Furthermore, boundaries on the template segmentations may be relaxed. Finally, context-dependent word templates are now being used for short words. Several other variants that were not retained in the final system are discussed as well.
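A minimal sketch of two of the ingredients named above, DTW with length compensation and a weighted k-NN template score; the distance measure and weighting scheme are illustrative choices, not the ESAT system's exact formulas:

```python
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """Plain DTW between two (frames x dims) feature sequences,
    using Euclidean local distance and simple length compensation."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def knn_score(query, templates, k=5):
    """Weighted k-NN average over the k closest template distances
    instead of relying on the single best template."""
    d = sorted(dtw(query, t) for t in templates)[:k]
    w = np.exp(-np.arange(len(d)))  # decaying weights by rank
    return float(np.dot(w, d) / w.sum())

q = np.random.rand(30, 13)  # e.g., 30 frames of 13 MFCCs
refs = [np.random.rand(np.random.randint(20, 40), 13) for _ in range(10)]
print(knn_score(q, refs))
```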
ACM Transactions on Speech and Language Processing | 2011
Martin Wöllmer; Björn W. Schuller; Anton Batliner; Stefan Steidl; Dino Seppi
In this article, we focus on keyword detection in children's speech as it is needed in voice command systems. We use the FAU Aibo Emotion Corpus, which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario, and investigate various recent keyword spotting techniques. As the principle of bidirectional Long Short-Term Memory (BLSTM) is known to be well suited for context-sensitive phoneme prediction, we incorporate a BLSTM network into a Tandem model for flexible coarticulation modeling in children's speech. Our experiments reveal that the Tandem model prevails over a triphone-based Hidden Markov Model approach.
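A minimal sketch of the BLSTM component of such a Tandem model in PyTorch: the network maps acoustic frames to framewise phoneme posteriors, which a Tandem setup then feeds as observations to an HMM-based decoder. Layer sizes and feature dimensions are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class BLSTMPhonemeNet(nn.Module):
    """Bidirectional LSTM mapping acoustic frames to framewise
    phoneme log-posteriors for use in a Tandem recognizer."""
    def __init__(self, n_features=39, n_hidden=128, n_phonemes=40):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_phonemes)

    def forward(self, x):  # x: (batch, frames, features)
        h, _ = self.blstm(x)
        return torch.log_softmax(self.out(h), dim=-1)

frames = torch.randn(1, 200, 39)        # 200 frames of MFCCs plus deltas
posteriors = BLSTMPhonemeNet()(frames)  # (1, 200, 40) log-posteriors
```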
International Workshop on Evaluation of Natural Language and Speech Tools for Italian | 2013
Francesco Cutugno; Antonio Origlia; Dino Seppi
Forced alignment, both for words and phones, is a challenging and interesting task for automatic speech processing systems, because the difficulties introduced by natural speech are many and hard to deal with. Furthermore, forced alignment approaches have been tested on Italian only in a few studies. In this task, the main goal was to evaluate the performance offered by the proposed systems on Italian and their robustness in the presence of noisy data.
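Forced alignment itself reduces to a constrained Viterbi search: the phone sequence is fixed by the transcription, and only the segment boundaries are free. A minimal sketch, assuming precomputed per-frame, per-phone log-likelihoods (the acoustic model producing them is out of scope here):

```python
import numpy as np

def forced_align(loglik: np.ndarray) -> list:
    """Viterbi forced alignment: loglik[t, s] is the log-likelihood of
    frame t under the s-th phone of the known transcription; states may
    only self-loop or advance, so the result is a monotonic segmentation.
    Assumes at least as many frames as phones."""
    T, S = loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = int(move > stay)
            score[t, s] = max(stay, move) + loglik[t, s]
    # Backtrace from the final phone at the final frame
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(s)
        s -= back[t, s]
    return path[::-1]  # phone index per frame

# Toy example: 10 frames aligned to a 3-phone transcription
print(forced_align(np.log(np.random.rand(10, 3))))
```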
Speech Communication | 2011
Björn W. Schuller; Anton Batliner; Stefan Steidl; Dino Seppi