Iulia Lefter
Delft University of Technology
Publications
Featured research published by Iulia Lefter.
Text, Speech and Dialogue | 2010
Iulia Lefter; Léon J. M. Rothkrantz; Pascal Wiggers; David A. van Leeuwen
We explore possibilities for enhancing the generality, portability and robustness of emotion recognition systems by combining databases and by fusing classifiers. In a first experiment, we investigate the performance of an emotion detection system tested on a given database when it is trained on speech from the same database, a different database, or a mix of both. We observe that performance generally drops when the test database does not match the training material, with a few exceptions. Performance also drops when a mixed corpus of acted databases is used for training and testing is carried out on real-life recordings. In a second experiment we investigate the effect of training multiple emotion detectors and fusing them into a single detection system. We observe a drop in the equal error rate (EER) from 19.0% on average for 4 individual detectors to 4.2% when they are fused using FoCal [1].
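The FoCal-style fusion described above amounts to linear logistic regression over the scores of the individual detectors. A minimal sketch of that idea with scikit-learn follows; the score arrays, label encoding and number of detectors are placeholders for illustration, not the paper's actual data or setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-utterance scores from four independently trained
# emotion detectors (columns) and binary labels (1 = emotional speech).
rng = np.random.default_rng(0)
scores_train = rng.normal(size=(200, 4))     # placeholder detector outputs
labels_train = rng.integers(0, 2, size=200)  # placeholder ground truth
scores_test = rng.normal(size=(50, 4))

# Linear logistic regression fuses the four score streams into a single
# calibrated detection score, analogous to FoCal-style score fusion.
fuser = LogisticRegression()
fuser.fit(scores_train, labels_train)
fused_scores = fuser.predict_proba(scores_test)[:, 1]
```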
Pattern Recognition Letters | 2013
Iulia Lefter; Léon J. M. Rothkrantz; Gertjan J. Burghouts
Multimodal fusion is a complex topic. For surveillance applications, audio-visual fusion is very promising given the complementary nature of the two streams. However, drawing the correct conclusion from multi-sensor data is not straightforward. In previous work we analysed a database with audio-visual recordings of unwanted behavior in trains (Lefter et al., 2012), focusing on a limited subset of the recorded data. We collected multi- and unimodal assessments by humans, who gave aggression scores on a 3-point scale. We showed that there are no trivial fusion algorithms that predict the multimodal labels from the unimodal labels, since part of the information is lost when using the unimodal streams alone. We proposed an intermediate step to discover the structure of the fusion process, based on meta-features, and found a set of five that have an impact on the fusion process. In this paper we extend the findings of (Lefter et al., 2012) to the general case using the entire database. We show that the meta-features have a positive effect on the fusion process in terms of labels. We then compare three fusion methods that encapsulate the meta-features. They are based on automatic prediction of the intermediate-level variables and of multimodal aggression from state-of-the-art low-level acoustic, linguistic and visual features. The first fusion method applies multiple classifiers to predict the intermediate-level features from the low-level features, and then predicts the multimodal label from the intermediate variables. The other two approaches are based on probabilistic graphical models, one using (Dynamic) Bayesian Networks and the other Conditional Random Fields. We find that each approach has its strengths and weaknesses in predicting specific aggression classes, and that using the meta-features yields significant improvements in all cases.
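One way to read the first, classifier-cascade variant is as two stacked predictors: one mapping low-level features to the intermediate meta-features, and a second mapping the predicted meta-features to the multimodal aggression label. A rough sketch under that reading; the feature dimensions, the binary encoding of the five meta-features, the choice of random forests and the random data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_low = rng.normal(size=(300, 40))        # placeholder low-level audio/visual features
meta = rng.integers(0, 2, size=(300, 5))  # five meta-features, assumed binary here
y_multi = rng.integers(0, 3, size=300)    # multimodal aggression on a 3-point scale

# Stage 1: one classifier per meta-feature, trained on the low-level features.
stage1 = [RandomForestClassifier(n_estimators=50).fit(X_low, meta[:, k]) for k in range(5)]

# Stage 2: predict the multimodal label from the (predicted) meta-features.
meta_pred = np.column_stack([clf.predict(X_low) for clf in stage1])
stage2 = RandomForestClassifier(n_estimators=50).fit(meta_pred, y_multi)
```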
Text, Speech and Dialogue | 2012
Iulia Lefter; Léon J. M. Rothkrantz; Gertjan J. Burghouts
By analyzing a multimodal (audio-visual) database of aggressive incidents in trains, we have observed that there are no trivial fusion algorithms that successfully predict multimodal aggression from unimodal sensor inputs. We proposed a fusion framework that contains a set of intermediate-level variables (meta-features) between the low-level sensor features and the multimodal aggression detection [1]. In this paper we predict the multimodal level of aggression and two of the meta-features: Context and Semantics. We do this based on the audio stream, from which we extract both acoustic (nonverbal) and linguistic (verbal) information. Given the spontaneous nature of speech in the database, we rely on a keyword spotting approach for the verbal information. We find six semantic groups of keywords that have a positive influence on the prediction of aggression and of the two meta-features.
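The verbal stream described above can be represented as counts of keyword-spotting hits per semantic group, used as features for the aggression and meta-feature predictors. A toy sketch of that representation; the group names and keywords below are invented placeholders, not the six groups identified in the paper.

```python
# Hypothetical semantic keyword groups; the actual six groups and their
# keywords are defined in the paper, not reproduced here.
KEYWORD_GROUPS = {
    "threat": {"kill", "hit"},
    "insult": {"idiot", "stupid"},
    "command": {"stop", "leave"},
}

def keyword_features(transcript: str) -> list[int]:
    """Count keyword-spotting hits per semantic group for one utterance."""
    tokens = transcript.lower().split()
    return [sum(tok in group for tok in tokens) for group in KEYWORD_GROUPS.values()]

print(keyword_features("Stop it you idiot or I will hit you"))  # -> [1, 1, 1]
```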
International Journal of Intelligent Defence Support Systems | 2011
Iulia Lefter; Léon J. M. Rothkrantz; David A. van Leeuwen; Pascal Wiggers
The abundance of calls to emergency lines during crises is difficult to handle for the limited number of operators. Detecting whether the caller is experiencing extreme emotions can be a way of distinguishing the more urgent calls. Beyond this, several other applications can benefit from awareness of the emotional state of the speaker. This paper describes the design of a system for selecting the calls that appear to be urgent, based on emotion detection. The system is trained on a database of spontaneous emotional speech from a call centre. Four machine learning techniques are applied, based on either prosodic or spectral features, resulting in individual detectors. As a last stage, we investigate the effect of fusing these detectors into a single detection system. We observe an improvement in the equal error rate (EER) from 19.0% on average for the four individual detectors to 4.2% when fused using linear logistic regression. All experiments are performed in a speaker-independent cross-validation framework.
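The equal error rate reported here is the operating point where the false-alarm rate equals the miss rate. A small sketch of how an EER can be computed from detection scores; the scores and labels below are synthetic stand-ins, not the call-centre data.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=500)              # 1 = urgent / emotional call (synthetic)
scores = labels + rng.normal(scale=0.8, size=500)  # synthetic detector scores

# EER: the point on the ROC curve where the false positive rate equals
# the false negative rate (1 - true positive rate).
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
print(f"EER ~ {eer:.3f}")
```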
IEEE Transactions on Affective Computing | 2016
Iulia Lefter; Gertjan J. Burghouts; Léon J. M. Rothkrantz
This paper investigates how speech and gestures convey stress, and how they can be used for automatic stress recognition. As a first step, we look into how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the intended semantic message (e.g. spoken words for speech, symbolic meaning for gestures) and stress conveyed by the modulation of either speech or gestures (e.g. intonation for speech, speed and rhythm for gestures). As a second step, we use this decomposition of stress as an approach to automatic stress prediction. The considered components provide an intermediate representation with intrinsic meaning, which helps bridge the semantic gap between the low-level sensor representation and the high-level, context-sensitive interpretation of behavior. Our experiments are run on an audiovisual dataset of service-desk interactions. The final goal is a surveillance system that notifies operators when the stress level is high and extra assistance is needed. We find that speech modulation is the best performing intermediate-level variable for automatic stress prediction. Using gestures increases the performance and is mostly beneficial when speech is lacking. The two-stage approach with intermediate variables performs better than baseline feature-level or decision-level fusion.
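The two-stage idea, first predicting intermediate variables such as speech semantics, speech modulation, gesture semantics and gesture modulation, and then predicting the overall stress level from those, can be sketched as follows. The regressors, feature dimensions and synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X_audio = rng.normal(size=(200, 20))  # placeholder acoustic features
X_video = rng.normal(size=(200, 15))  # placeholder gesture/motion features
# Assumed intermediate annotations: stress via speech semantics, speech
# modulation, gesture semantics and gesture modulation.
inter = rng.uniform(1, 5, size=(200, 4))
stress = inter.mean(axis=1) + rng.normal(scale=0.2, size=200)  # overall stress level

# Stage 1: regress each intermediate variable from its own modality.
models = [SVR().fit(X_audio, inter[:, 0]), SVR().fit(X_audio, inter[:, 1]),
          SVR().fit(X_video, inter[:, 2]), SVR().fit(X_video, inter[:, 3])]
inter_pred = np.column_stack([models[0].predict(X_audio), models[1].predict(X_audio),
                              models[2].predict(X_video), models[3].predict(X_video)])

# Stage 2: map the predicted intermediate variables to the stress level.
stage2 = LinearRegression().fit(inter_pred, stress)
```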
Journal on Multimodal User Interfaces | 2014
Iulia Lefter; Gertjan J. Burghouts; Léon J. M. Rothkrantz
Stressful situations are likely to occur at human-operated service desks, as well as at human–computer interfaces used in the public domain. Automatic surveillance can help by notifying operators when extra assistance is needed. Human communication is inherently multimodal, e.g. speech, gestures and facial expressions, so automatic surveillance systems are expected to benefit from exploiting multimodal information. This requires automatic fusion of modalities, which is still an unsolved problem. To support the development of such systems, we present and analyze audio-visual recordings of human–human interactions at a service desk. The corpus has a high degree of realism: all interactions are freely improvised by actors based on short scenarios in which only the sources of conflict were provided. The recordings can be considered a prototype for general stressful human–human interaction. They were annotated on a 5-point scale for degree of stress from the perspective of surveillance operators, and are very rich in hand gestures. We find that the more stressful the situation, the higher the proportion of speech that is accompanied by gestures. Understanding the function of gestures and their relation to speech is essential for good fusion strategies. Taking speech as the basic modality, one of our research questions was what role gestures play in addition to speech. Both speech and gestures can express emotion, in which case we say that they have an emotional function; they can also express non-emotional information, in which case we say that they have a semantic function. We learn that when speech and gestures have the same function, they are usually congruent, although intensity and clarity can vary. Most gestures in this dataset convey emotion. We identify classes of gestures in our recordings, and argue that some classes are clear indications of stressful situations.
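The reported relation between stress level and co-occurring gestures is a per-level proportion over annotated segments. A sketch of how such a statistic could be computed; the data-frame layout and values are assumptions about the corpus format, not its actual structure.

```python
import pandas as pd

# Hypothetical segment-level annotations: a 5-point stress label and
# whether the speech segment is accompanied by a hand gesture.
segments = pd.DataFrame({
    "stress": [1, 1, 2, 3, 3, 4, 4, 5, 5, 5],
    "has_gesture": [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Proportion of speech segments accompanied by gestures, per stress level.
print(segments.groupby("stress")["has_gesture"].mean())
```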
Advanced Video and Signal Based Surveillance | 2012
Iulia Lefter; Gertjan J. Burghouts; Léon J. M. Rothkrantz
We propose a new method for audio-visual sensor fusion and apply it to automatic aggression detection. While a variety of definitions of aggression exist, in this paper we regard it as any kind of behavior that has a disturbing effect on others. We have collected multi- and unimodal assessments by humans, who gave aggression scores on a 3-point scale. There are no trivial fusion algorithms that predict the multimodal labels from the unimodal labels. We therefore propose an intermediate step to discover the structure of the fusion process, based on what we call meta-features, and we find a set of five that have an impact on the fusion process. We use simple state-of-the-art low-level audio and video features to predict the level of aggression in audio and video, and we also predict the three most feasible meta-features. We show the significant positive impact of adding the meta-features when predicting the multimodal label, compared to standard fusion techniques such as feature-level and decision-level fusion.
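The two standard baselines mentioned, feature-level and decision-level fusion, can be sketched as below. The features, labels and choice of SVM classifiers are synthetic placeholders, not the paper's features or models.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_audio = rng.normal(size=(300, 20))  # placeholder audio features
X_video = rng.normal(size=(300, 30))  # placeholder video features
y = rng.integers(0, 3, size=300)      # aggression on a 3-point scale (synthetic)

# Feature-level fusion: concatenate modalities and train one classifier.
feat_fusion = SVC(probability=True).fit(np.hstack([X_audio, X_video]), y)

# Decision-level fusion: train per-modality classifiers, then average posteriors.
clf_a = SVC(probability=True).fit(X_audio, y)
clf_v = SVC(probability=True).fit(X_video, y)
fused_posterior = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
decision_fusion_labels = fused_posterior.argmax(axis=1)
```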
Text, Speech and Dialogue | 2011
Iulia Lefter; Léon J. M. Rothkrantz; Gertjan J. Burghouts; Zhenke Yang; Pascal Wiggers
Automatic detection of aggressive situations has high societal and scientific relevance. It has been argued that using data from multimodal sensors, for example video and sound, as opposed to unimodal data is bound to increase the accuracy of detection. We approach the problem of multimodal aggression detection from the viewpoint of a human observer and try to reproduce their predictions automatically. Typically, a single ground truth for all available modalities is used when training recognizers. We explore the benefits of adding an extra level of annotations, namely audio-only and video-only. We analyze these annotations and compare them to the multimodal case in order to gain more insight into how humans reason with multimodal data. We train classifiers and compare the results when using unimodal and multimodal labels as ground truth. For both the audio and the video recognizer, performance increases when the unimodal labels are used.
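The comparison described, training the same recognizer once against unimodal labels and once against multimodal labels, can be sketched as follows. All arrays, the classifier choice and the cross-validation setup are illustrative assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X_audio = rng.normal(size=(250, 25))         # placeholder acoustic features
y_audio_only = rng.integers(0, 3, size=250)  # labels given while hearing audio only
y_multimodal = rng.integers(0, 3, size=250)  # labels given with audio and video

# Same features, two different ground truths: the paper reports that the
# audio recognizer scores higher against the audio-only labels.
for name, y in [("audio-only labels", y_audio_only), ("multimodal labels", y_multimodal)]:
    acc = cross_val_score(RandomForestClassifier(n_estimators=50), X_audio, y, cv=5).mean()
    print(name, round(acc, 3))
```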
Computer Systems and Technologies | 2012
Iulia Lefter; Léon J. M. Rothkrantz; Maarten Somhorst
Many surveillance systems are currently installed in public spaces to safeguard people and property. They are constantly watched by human operators, who are easily overloaded. To support the human operators, a surveillance system model is designed that detects suspicious behaviour in a non-public area. Its task is to alert the operators to suspicious events, giving them the chance to investigate and take action. A prototype application has been implemented using state-of-the-art techniques from computer vision and artificial intelligence.
Computer Systems and Technologies | 2010
Iulia Lefter; Pascal Wiggers; Léon J. M. Rothkrantz
The paper describes the development of an online real-time system that recognizes emotions from speech. A prosodic feature set was extracted from four databases of emotional speech (three with acted emotions and one with spontaneous ones). Two models were trained using support vector machines (SVM) on merged databases, with the purpose of providing a larger range of examples to the classifier and making it more general. The system outputs probabilities over a closed set of emotions and provides a time track of the recognized emotions in the valence-arousal continuum.
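An SVM that outputs class probabilities over a closed emotion set, which can then be mapped onto the valence-arousal plane, can be sketched as below. The emotion set, the valence-arousal coordinates and the prosodic feature vectors are illustrative assumptions, not those used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["neutral", "happy", "angry", "sad"]  # assumed closed emotion set
# Assumed coordinates of each emotion in the valence-arousal plane.
VA = np.array([[0.0, 0.0], [0.8, 0.5], [-0.7, 0.8], [-0.6, -0.5]])

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 12))                # placeholder prosodic feature vectors
y = rng.integers(0, len(EMOTIONS), size=400)  # emotion class indices (synthetic)

clf = SVC(probability=True).fit(X, y)

# A probability-weighted average of the class coordinates gives a point in
# the valence-arousal continuum for each incoming stretch of speech.
probs = clf.predict_proba(X[:3])
valence_arousal = probs @ VA
print(valence_arousal)
```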