Athanasios Katsamanis
National Technical University of Athens
Publication
Featured research published by Athanasios Katsamanis.
IEEE Transactions on Affective Computing | 2012
Angeliki Metallinou; Martin Wöllmer; Athanasios Katsamanis; Florian Eyben; Björn W. Schuller; Shrikanth Narayanan
Human emotional expression tends to evolve in a structured manner in the sense that certain emotional evolution patterns, e.g., anger to anger, are more probable than others, e.g., anger to happiness. Furthermore, the perception of an emotional display can be affected by recent emotional displays. Therefore, the emotional content of past and future observations could offer relevant temporal context when classifying the emotional content of an observation. In this work, we focus on audio-visual recognition of the emotional content of improvised emotional interactions at the utterance level. We examine context-sensitive schemes for emotion recognition within a multimodal, hierarchical approach: bidirectional Long Short-Term Memory (BLSTM) neural networks, hierarchical Hidden Markov Model (HMM) classifiers, and hybrid HMM/BLSTM classifiers are considered for modeling emotion evolution within an utterance and between utterances over the course of a dialog. Overall, our experimental results indicate that incorporating long-term temporal context is beneficial for emotion recognition systems that encounter a variety of emotional manifestations. Context-sensitive approaches outperform those without context for classification tasks such as discrimination between valence levels or between clusters in the valence-activation space. The analysis of emotional transitions in our database sheds light on the flow of affective expressions, revealing potentially useful patterns.
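A minimal sketch of the BLSTM component of such an utterance-level classifier, assuming pre-extracted audio-visual feature sequences; the feature dimension, hidden size, and class set are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of an utterance-level BLSTM emotion classifier; dimensions
# and the number of emotion classes are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMEmotionClassifier(nn.Module):
    def __init__(self, feat_dim=100, hidden_dim=64, num_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level features for one utterance
        h, _ = self.blstm(x)
        # Pool over time so past and future frames both inform the label
        pooled = h.mean(dim=1)
        return self.out(pooled)

model = BLSTMEmotionClassifier()
dummy = torch.randn(8, 120, 100)   # 8 utterances, 120 frames each
logits = model(dummy)              # (8, num_classes) emotion scores
print(logits.shape)
```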
Journal of the Acoustical Society of America | 2014
Shrikanth Narayanan; Asterios Toutios; Vikram Ramanarayanan; Adam C. Lammert; Jangwon Kim; Sungbok Lee; Krishna S. Nayak; Yoon Chul Kim; Yinghua Zhu; Louis Goldstein; Dani Byrd; Erik Bresch; Athanasios Katsamanis; Michael Proctor
USC-TIMIT is an extensive database of multimodal speech production data, developed to complement existing resources available to the speech research community and with the intention of being continuously refined and augmented. The database currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English. Electromagnetic articulography data have also been collected from four of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460-sentence corpus used previously in the MOCHA-TIMIT database. In both cases the audio signal was recorded and synchronized with the articulatory data. The database and companion software are freely available to the research community.
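A hypothetical sketch of aligning an audio signal with an articulatory trajectory at a common analysis frame rate; the sampling rates, channel counts, and synthetic arrays below are illustrative assumptions, not the actual USC-TIMIT file format (in practice the data would be loaded from the database).

```python
# Toy alignment of audio and articulatory streams at a shared frame rate;
# all rates and arrays are stand-ins, not the real USC-TIMIT layout.
import numpy as np
from scipy.signal import resample

audio_rate = 16000   # assumed audio sampling rate (Hz)
ema_rate = 100       # assumed EMA sampling rate (Hz)
duration = 3.0       # seconds

audio = np.random.randn(int(audio_rate * duration))   # stand-in waveform
ema = np.random.randn(int(ema_rate * duration), 12)   # stand-in EMA channels

# Resample the articulatory channels to a 50 Hz analysis frame rate so that
# each acoustic analysis frame has a matching articulatory sample.
target_rate = 50
ema_at_frames = resample(ema, int(target_rate * duration), axis=0)
print(audio.shape, ema.shape, ema_at_frames.shape)
```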
intelligent virtual agents | 2012
David R. Traum; Priti Aggarwal; Ron Artstein; Susan Foutz; Jillian Gerten; Athanasios Katsamanis; Anton Leuski; Dan Noren; William R. Swartout
We report on our efforts to prepare Ada and Grace, virtual guides in the Museum of Science, Boston, to interact directly with museum visitors, including children. We outline the challenges in extending the exhibit to support this usage, mostly relating to the processing of speech from a broad population, especially child speech. We also present the summative evaluation, showing success in all the intended impacts of the exhibit: that children ages 7–14 will increase their awareness of, engagement in, interest in, positive attitude about, and knowledge of computer science and technology.
international conference on acoustics, speech, and signal processing | 2012
Angeliki Metallinou; Athanasios Katsamanis; Shrikanth Narayanan
Incorporating multimodal information and temporal context from speakers during an emotional dialog can contribute to improving performance of automatic emotion recognition systems. Motivated by these issues, we propose a hierarchical framework which models emotional evolution within and between emotional utterances, i.e., at the utterance and dialog level respectively. Our approach can incorporate a variety of generative or discriminative classifiers at each level and provides flexibility and extensibility in terms of multimodal fusion; facial, vocal, head and hand movement cues can be included and fused according to the modality and the emotion classification task. Our results using the multimodal, multi-speaker IEMOCAP database indicate that this framework is well-suited for cases where emotions are expressed multimodally and in context, as in many real-life situations.
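A minimal sketch of the two-level idea: an utterance-level classifier produces per-utterance emotion scores, and a dialog-level transition model smooths the label sequence with Viterbi decoding. The classifier choice and transition probabilities are illustrative assumptions, not the paper's exact setup.

```python
# Dialog-level smoothing of utterance-level emotion scores via Viterbi;
# emission scores and transition matrix are toy values for illustration.
import numpy as np

def viterbi(log_emissions, log_trans, log_prior):
    # log_emissions: (T, K) per-utterance class log-scores for one dialog
    T, K = log_emissions.shape
    delta = log_prior + log_emissions[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (K, K) previous-to-current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy dialog of 5 utterances over 3 emotion clusters.
rng = np.random.default_rng(0)
emissions = np.log(rng.dirichlet(np.ones(3), size=5))
trans = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]]))        # self-transitions favored
prior = np.log(np.ones(3) / 3)
print(viterbi(emissions, trans, prior))
```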
affective computing and intelligent interaction | 2011
Chi-Chun Lee; Athanasios Katsamanis; Matthew P. Black; Brian R. Baucom; Panayiotis G. Georgiou; Shrikanth Narayanan
Recently there has been an increase in efforts in Behavioral Signal Processing (BSP), which aims to bring quantitative analysis using signal processing techniques to the domain of observational coding. Observational coding in fields such as psychology currently relies on subjective expert coding of abstract human interaction dynamics. In this work, we use a Multiple Instance Learning (MIL) framework, a saliency-based prediction model, with a signal-driven vocal entrainment measure as the feature to predict the affective state of a spouse in problem-solving interactions. We generate 18 MIL classifiers to capture the variable-length saliency of vocal entrainment, and use a cross-validation scheme with maximum accuracy and mutual information as the metrics to select the best-performing classifier for each testing couple. This method obtains a recognition accuracy of 53.93%, a 2.14% absolute (4.13% relative) improvement over a baseline model using a Support Vector Machine. Furthermore, this MIL-based framework has potential for identifying meaningful regions of interest for further detailed analysis of married couples' interactions.
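A simple MIL-style sketch of the bag/instance setup: each couple's interaction is a "bag" of variable-length segments (instances) described by a vocal entrainment feature vector, and the bag label is driven by its most salient (highest-scoring) instance. This is a generic max-instance MIL baseline for illustration, not the paper's 18-classifier selection scheme.

```python
# Generic max-instance MIL baseline over toy bags of entrainment features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Toy data: 20 bags, each with a random number of 4-dimensional instances.
bags = [rng.normal(size=(rng.integers(3, 8), 4)) for _ in range(20)]
bag_labels = rng.integers(0, 2, size=20)   # e.g., negative/positive affect

# Train an instance-level classifier by propagating each bag's label to
# all of its instances (the standard single-instance-learning shortcut).
X = np.vstack(bags)
y = np.concatenate([[lab] * len(b) for lab, b in zip(bag_labels, bags)])
clf = LogisticRegression(max_iter=1000).fit(X, y)

def predict_bag(bag):
    # Score a bag by its most "salient" instance, i.e. the maximum posterior.
    scores = clf.predict_proba(bag)[:, 1]
    return int(scores.max() > 0.5), int(scores.argmax())  # label, salient index

print([predict_bag(b) for b in bags[:3]])
```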
affective computing and intelligent interaction | 2011
Athanasios Katsamanis; James Gibson; Matthew P. Black; Shrikanth Narayanan
Analysis of audiovisual human behavior observations is a common practice in the behavioral sciences. It is generally carried out by expert annotators who are asked to evaluate several aspects of the observations along various dimensions, which can be a tedious task. We propose that automatic classification of behavioral patterns in this context can be viewed as a multiple instance learning problem. In this paper, we analyze a corpus of married couples interacting about a problem in their relationship. We extract features from both the audio and the transcriptions and apply the Diverse Density-Support Vector Machine framework. Apart from classifying the interactions with respect to the expert annotations, this framework also allows us to estimate salient regions of the complex interaction.
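A rough sketch of the bag-embedding idea behind Diverse Density-SVM: each interaction (a bag of audio/lexical instance features) is mapped to a fixed-length vector of minimum distances to a set of prototype instances, and an SVM is trained on those vectors. Here the prototypes are sampled at random rather than learned by Diverse Density, so this is only an illustrative stand-in.

```python
# Bag-to-vector embedding via minimum distances to prototype instances,
# followed by an SVM; prototypes are random stand-ins, not DD-optimized.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
bags = [rng.normal(size=(rng.integers(4, 10), 6)) for _ in range(30)]
labels = rng.integers(0, 2, size=30)   # toy stand-ins for expert codes

# Prototype instances (assumed sampled here; DD-SVM would optimize them).
prototypes = np.vstack([b[rng.integers(len(b))] for b in bags[:10]])

def embed(bag):
    # For each prototype, the distance to the closest instance in the bag.
    d = np.linalg.norm(bag[:, None, :] - prototypes[None, :, :], axis=-1)
    return d.min(axis=0)

X = np.array([embed(b) for b in bags])
svm = SVC(kernel="rbf").fit(X, labels)
print(svm.predict(X[:5]))
```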
international conference on acoustics, speech, and signal processing | 2016
Isidoros Rodomagoulakis; Nikolaos Kardaris; Vassilis Pitsikalis; E. Mavroudi; Athanasios Katsamanis; Antigoni Tsiami; Petros Maragos
Within the context of assistive robotics, we develop an intelligent interface that provides multimodal sensory processing capabilities for human action recognition. Human action is considered in multimodal terms, with inputs such as audio from microphone arrays and visual inputs from high-definition and depth cameras. Drawing on state-of-the-art approaches from automatic speech recognition and visual action recognition, we recognize actions and commands multimodally. By fusing the unimodal information streams, we obtain the optimum multimodal hypothesis, which is to be further exploited by the active mobility assistance robot in the framework of the MOBOT EU research project. Evidence from recognition experiments shows that by integrating multiple sensors and modalities, we increase multimodal recognition performance on a newly acquired, challenging dataset of elderly people interacting with the assistive robot.
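A hedged sketch of late fusion over unimodal hypotheses: a spoken-command recognizer and a gesture recognizer each return scored hypotheses, and the multimodal decision takes the command maximizing a weighted score combination. Command names and weights are illustrative; the MOBOT system's actual fusion scheme and label set are not reproduced here.

```python
# Weighted late fusion of toy unimodal command scores.
audio_scores = {"come_closer": 0.7, "stop": 0.2, "help_me_stand": 0.1}
gesture_scores = {"come_closer": 0.4, "stop": 0.5, "help_me_stand": 0.1}
weights = {"audio": 0.6, "gesture": 0.4}   # assumed stream reliabilities

def fuse(audio, gesture, w):
    commands = set(audio) | set(gesture)
    fused = {c: w["audio"] * audio.get(c, 0.0) + w["gesture"] * gesture.get(c, 0.0)
             for c in commands}
    return max(fused, key=fused.get), fused

best, fused = fuse(audio_scores, gesture_scores, weights)
print(best, fused)
```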
european signal processing conference | 2015
Panagiotis Giannoulis; Alessio Brutti; Marco Matassoni; Alberto Abad; Athanasios Katsamanis; Miguel Matos; Gerasimos Potamianos; Petros Maragos
Domestic environments are particularly challenging for distant speech recognition: reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard speech processing algorithms. In this application scenario, a crucial task is the detection and localization of speech events generated by users within the various rooms. A specific challenge of multi-room environments is the inter-room interference that negatively affects speech activity detectors. In this paper, we present and compare different solutions for the multi-room speech activity detection task. The combination of a model-based room-independent speech activity detection module with a room-dependent inside/outside classification stage, based on specific features, provides satisfactory performance. The proposed methods are evaluated on a multi-room, multi-channel corpus, where spoken commands and other typical acoustic events occur in different rooms.
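An illustrative two-stage pipeline along the lines described above: a room-independent detector flags speech frames, and a room-dependent classifier decides whether the detected speech originated inside that room or leaked in from an adjacent one. The models and features below are generic scikit-learn stand-ins, not the paper's acoustic front-end.

```python
# Two-stage multi-room speech activity detection on toy per-frame features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_frames, n_feats = 200, 8
frames = rng.normal(size=(n_frames, n_feats))   # toy per-frame features

# Stage 1: room-independent speech/non-speech model (trained on pooled rooms).
speech_labels = rng.integers(0, 2, size=n_frames)
sad = LogisticRegression(max_iter=1000).fit(frames, speech_labels)

# Stage 2: per-room inside/outside models keyed by room id.
inside_labels = rng.integers(0, 2, size=n_frames)
inside_clf = {"kitchen": LogisticRegression(max_iter=1000).fit(frames, inside_labels)}

def detect(room, feats):
    is_speech = sad.predict(feats).astype(bool)
    inside = np.zeros(len(feats), dtype=bool)
    # Only speech frames are passed to the room-dependent inside/outside stage.
    inside[is_speech] = inside_clf[room].predict(feats[is_speech]).astype(bool)
    return inside   # True where speech is local to the room

print(detect("kitchen", frames).sum(), "frames of in-room speech")
```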
international conference on acoustics, speech, and signal processing | 2012
Theodora Chaspari; Emily Mower Provost; Athanasios Katsamanis; Shrikanth Narayanan
The quality of shared enjoyment in interactions is a key aspect related to Autism Spectrum Disorders (ASD). This paper discusses two types of enjoyment: the first refers to humorous events and is associated with one's positive affective state, and the second is used to facilitate social interactions between people. These types of shared enjoyment are objectively specified by their proximity to a voiced or an unvoiced laughter instance, respectively. The goal of this work is to study the acoustic differences of areas surrounding the two kinds of shared enjoyment instances, called "social zones", using data collected from children with autism, and their parents, interacting with an Embodied Conversational Agent (ECA). A classification task was performed to predict whether a "social zone" surrounds a voiced or an unvoiced laughter instance. Our results indicate that humorous events are more easily recognized than events acting as social facilitators and that related speech patterns vary more across children compared to other interlocutors.
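A sketch of the "social zone" setup: a fixed-length window of acoustic features around each laughter instance is summarized and classified as surrounding voiced or unvoiced laughter. The window length, features, and classifier are assumptions for illustration, not the paper's exact choices.

```python
# Classify windows ("social zones") around toy laughter onsets as voiced/unvoiced.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
frame_feats = rng.normal(size=(5000, 13))        # e.g., MFCC-like frames at 100 fps
laughter_frames = [400, 1200, 2300, 3100, 4200]  # toy laughter onsets (frame indices)
laughter_voiced = [1, 0, 1, 0, 1]                # 1 = voiced, 0 = unvoiced

def social_zone(center, half_width=200):
    lo, hi = max(0, center - half_width), min(len(frame_feats), center + half_width)
    zone = frame_feats[lo:hi]
    # Summarize the zone with frame-level means and standard deviations.
    return np.concatenate([zone.mean(axis=0), zone.std(axis=0)])

X = np.array([social_zone(c) for c in laughter_frames])
clf = SVC(kernel="linear").fit(X, laughter_voiced)
print(clf.predict(X))
```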
international conference on image processing | 2015
Petros Koutras; Athanasia Zlatintsi; Elias Iosif; Athanasios Katsamanis; Petros Maragos; Alexandros Potamianos
In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words in the movie subtitles. For the evaluation of the proposed system, we employ an elementary, non-parametric classification technique, namely KNN. Detection results are reported on the MovSum database, using objective evaluations against ground truth denoting the perceptually salient events, as well as human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving the comprehension of the semantics.
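A minimal sketch of the final classification step: fused audio-visual-text saliency features per movie segment are labeled salient or non-salient with a KNN classifier, and the summary keeps the segments predicted salient. The feature dimension and segment granularity are illustrative assumptions.

```python
# KNN labeling of toy fused saliency features and selection of summary segments.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
train_feats = rng.normal(size=(300, 10))     # fused saliency features (toy)
train_labels = rng.integers(0, 2, size=300)  # stand-in ground-truth salient events

knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels)

test_feats = rng.normal(size=(40, 10))       # segments of a new movie (toy)
salient = knn.predict(test_feats).astype(bool)
summary_segments = np.flatnonzero(salient)   # segment indices kept in the summary
print(summary_segments)
```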