Soroosh Mariooryad
University of Texas at Dallas
Publications
Featured research published by Soroosh Mariooryad.
IEEE Transactions on Affective Computing | 2013
Soroosh Mariooryad; Carlos Busso
Psycholinguistic studies on human communication have shown that during human interaction individuals tend to adapt their behaviors, mimicking the spoken style, gestures, and expressions of their conversational partners. This synchronization pattern is referred to as entrainment. This study investigates the presence of entrainment at the emotion level in cross-modality settings and its implications for multimodal emotion recognition systems. The analysis explores the relationship between acoustic features of the speaker and facial expressions of the interlocutor during dyadic interactions. The analysis shows that 72 percent of the time the speakers displayed similar emotions, indicating strong mutual influence in their expressive behaviors. We also investigate the cross-modality, cross-speaker dependence using the mutual information framework. The study reveals a strong relation between the facial and acoustic features of one subject and the emotional state of the other subject. It also shows strong dependence between heterogeneous modalities across conversational partners. These findings suggest that the expressive behaviors of one dialog partner provide complementary information for recognizing the emotional state of the other dialog partner. The analysis motivates classification experiments exploiting cross-modality, cross-speaker information. The study presents emotion recognition experiments using the IEMOCAP and SEMAINE databases. The results demonstrate the benefit of exploiting this emotional entrainment effect, showing statistically significant improvements.
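As a minimal sketch of the kind of cross-speaker dependence analysis described above, one can estimate the mutual information between aligned, per-turn emotion labels of the two dialog partners. The labels, the 0.72 agreement rate, and the discrete MI estimate below are illustrative placeholders, not the corpus statistics or the paper's exact procedure.

```python
# Minimal sketch: quantifying cross-speaker emotional dependence with mutual
# information over aligned, per-turn emotion labels (illustrative data only).
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical aligned label sequences: emotion of speaker A (from acoustics)
# and of interlocutor B (from facial features), coded as integers per turn.
emotions_a = rng.integers(0, 4, size=500)
# Inject partial agreement to mimic entrainment: B mirrors A most of the time.
emotions_b = np.where(rng.random(500) < 0.72, emotions_a, rng.integers(0, 4, size=500))

agreement = np.mean(emotions_a == emotions_b)
mi_bits = mutual_info_score(emotions_a, emotions_b) / np.log(2)

print(f"label agreement: {agreement:.2f}")
print(f"mutual information: {mi_bits:.3f} bits")
```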
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Soroosh Mariooryad; Carlos Busso
During human communication, every spoken message is intrinsically modulated within different verbal and nonverbal cues that are externalized through various aspects of speech and facial gestures. These communication channels are strongly interrelated, which suggests that generating human-like behavior requires a careful study of their relationship. Neglecting the mutual influence of different communicative channels in the modeling of natural behavior for a conversational agent may result in unrealistic behaviors that can affect the intended visual perception of the animation. This relationship exists both between audiovisual information and within different visual aspects. This paper explores the idea of using joint models to preserve the coupling not only between speech and facial expression, but also within facial gestures. As a case study, the paper focuses on building a speech-driven facial animation framework to generate natural head and eyebrow motions. We propose three dynamic Bayesian networks (DBNs), which make different assumptions about the coupling between speech, eyebrow and head motion. Synthesized animations are produced based on the MPEG-4 facial animation standard, using the audiovisual IEMOCAP database. The experimental results based on perceptual evaluations reveal that the proposed joint models (speech/eyebrow/head) outperform audiovisual models that are separately trained (speech/head and speech/eyebrow).
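The paper's dynamic Bayesian networks are not reproduced here. As a rough illustration of the joint-versus-separate modeling contrast, under the stated assumptions of Gaussian HMMs standing in for the DBNs and synthetic frame-level features, one might compare a single model over concatenated speech, eyebrow, and head features against independently trained speech/eyebrow and speech/head models:

```python
# Illustrative sketch (not the paper's DBNs): compare a joint Gaussian HMM over
# concatenated speech/eyebrow/head features with separately trained HMMs.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

rng = np.random.default_rng(0)

# Hypothetical frame-level features: 13-dim speech (e.g., MFCCs),
# 2-dim eyebrow motion, 3-dim head rotation, for a few utterances.
lengths = [200, 180, 220]
speech = rng.standard_normal((sum(lengths), 13))
eyebrow = 0.5 * speech[:, :2] + 0.1 * rng.standard_normal((sum(lengths), 2))
head = 0.3 * speech[:, :3] + 0.1 * rng.standard_normal((sum(lengths), 3))

# Joint model: one HMM captures the coupling across all channels.
joint = GaussianHMM(n_components=8, covariance_type="full", random_state=0)
joint.fit(np.hstack([speech, eyebrow, head]), lengths)

# Separate models: one HMM per visual channel, driven by speech only.
sep_eyebrow = GaussianHMM(n_components=8, covariance_type="full", random_state=0)
sep_eyebrow.fit(np.hstack([speech, eyebrow]), lengths)
sep_head = GaussianHMM(n_components=8, covariance_type="full", random_state=0)
sep_head.fit(np.hstack([speech, head]), lengths)

print("joint log-likelihood:", joint.score(np.hstack([speech, eyebrow, head]), lengths))
```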
Affective Computing and Intelligent Interaction | 2013
Soroosh Mariooryad; Carlos Busso
Defining useful emotional descriptors to characterize expressive behaviors is an important research area in affective computing. Recent studies have shown the benefits of using continuous emotional evaluations to annotate spontaneous corpora. Instead of assigning global labels per segment, this approach captures the temporal dynamic evolution of the emotions. A challenge of continuous assessments is the inherent reaction lag of the evaluators. During the annotation process, an observer needs to sense the stimulus, perceive the emotional message, and define his/her judgment, all in real time. As a result, we expect a reaction lag between the annotation and the underlying emotional content. This paper uses mutual information to quantify and compensate for this reaction lag. Classification experiments on the SEMAINE database demonstrate that the performance of emotion recognition systems improves when the evaluator reaction lag is considered. We explore annotator-dependent and annotator-independent compensation schemes.
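As a hedged sketch of the lag-compensation idea, one can shift the continuous annotation trace against an acoustic feature and keep the shift that maximizes their mutual information. The binned MI estimator, the frame-based lag search, and the synthetic signals below are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch: estimating an annotator's reaction lag by shifting the
# continuous trace and keeping the shift that maximizes mutual information
# with an acoustic feature (MI estimated here by simple discretization).
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_discretized(x, y, bins=16):
    """Plug-in MI estimate after equal-width binning of both signals."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins))
    yd = np.digitize(y, np.histogram_bin_edges(y, bins))
    return mutual_info_score(xd, yd)

def estimate_reaction_lag(feature, annotation, max_lag_frames):
    """Return the lag (in frames) that best aligns the annotation to the feature."""
    best_lag, best_mi = 0, -np.inf
    for lag in range(max_lag_frames + 1):
        # Shift the annotation back in time by `lag` frames before comparing.
        f = feature[: len(feature) - lag] if lag else feature
        a = annotation[lag:]
        mi = mi_discretized(f, a)
        if mi > best_mi:
            best_lag, best_mi = lag, mi
    return best_lag

# Hypothetical data: an arousal trace that trails the feature by ~30 frames.
rng = np.random.default_rng(0)
feature = np.cumsum(rng.standard_normal(2000))
annotation = np.roll(feature, 30) + 0.5 * rng.standard_normal(2000)
print("estimated lag:", estimate_reaction_lag(feature, annotation, 100))
```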
IEEE Transactions on Affective Computing | 2013
Carlos Busso; Soroosh Mariooryad; Angeliki Metallinou; Shrikanth Narayanan
The externalization of emotion is intrinsically speaker-dependent. A robust emotion recognition system should be able to compensate for these differences across speakers. A natural approach is to normalize the features before training the classifiers. However, the normalization scheme should not affect the acoustic differences between emotional classes. This study presents the iterative feature normalization (IFN) framework, an unsupervised front-end especially designed for emotion detection. The IFN approach aims to reduce the acoustic differences in neutral speech across speakers, while preserving the inter-emotional variability in expressive speech. This goal is achieved by iteratively detecting neutral speech for each speaker and using this subset to estimate the feature normalization parameters. Then, an affine transformation is applied to both neutral and emotional speech. This process is repeated until the results from the emotion detection system are consistent between consecutive iterations. The IFN approach is exhaustively evaluated, with different evaluation configurations, using the IEMOCAP database and a data set obtained under free, uncontrolled recording conditions. The results show that systems trained with the IFN approach achieve better performance than systems trained either without normalization or with global normalization.
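A compact sketch of the iterative scheme described above is given below, under the assumptions that the affine transform is a z-normalization and that the neutral detector is a generic binary classifier; the toy data and the stopping rule on consistent decisions are illustrative, not the exact IFN configuration.

```python
# Minimal sketch of an iterative feature normalization loop (assumptions:
# z-normalization as the affine transform, a generic neutral-vs-emotional
# classifier as the detector; not the exact IFN configuration).
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_feature_normalization(feats, detector, max_iter=10):
    """feats: (n_sentences, n_features) array for ONE speaker.
    detector: fitted classifier whose predict() returns 1 for neutral speech.
    Returns the normalized features and the final affine parameters."""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-8
    prev_labels = None
    for _ in range(max_iter):
        normalized = (feats - mu) / sigma          # affine transform on ALL data
        labels = detector.predict(normalized)      # detect neutral sentences
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break                                  # decisions consistent -> stop
        neutral = normalized[labels == 1]
        if len(neutral) == 0:
            break
        # Re-estimate the normalization from the detected neutral subset only.
        mu = mu + sigma * neutral.mean(axis=0)
        sigma = sigma * (neutral.std(axis=0) + 1e-8)
        prev_labels = labels
    return (feats - mu) / sigma, (mu, sigma)

# Toy usage: a detector trained on synthetic normalized data from other speakers.
rng = np.random.default_rng(0)
train_x = rng.standard_normal((400, 5))
train_y = (train_x[:, 0] < 0.5).astype(int)          # 1 = "neutral" (synthetic rule)
detector = LogisticRegression().fit(train_x, train_y)

speaker_feats = 3.0 + 2.0 * rng.standard_normal((100, 5))  # unnormalized speaker data
normed, (mu, sigma) = iterative_feature_normalization(speaker_feats, detector)
print(normed.mean(axis=0).round(2))
```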
Speech Communication | 2014
Soroosh Mariooryad; Carlos Busso
Affect recognition is a crucial requirement for future human machine interfaces to effectively respond to the nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, the human voice conveys a mixture of information including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful considerations to compensate for the effect of these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists of building phoneme-level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights into the dependency between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively.
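A minimal sketch of whitening-based normalization over phoneme-conditioned trajectories follows; the fixed-length flattened trajectories, the two-phoneme toy inventory, and the eigendecomposition-based whitening matrix are assumptions for illustration and do not reproduce the paper's trajectory models.

```python
# Minimal sketch of per-phoneme trajectory whitening to reduce lexical
# variability (fixed-length trajectories, per-phoneme statistics).
import numpy as np

def fit_whitening(trajectories):
    """trajectories: (n_samples, dim) flattened fixed-length trajectories
    for ONE phoneme. Returns the mean and whitening matrix W = C^{-1/2}."""
    mu = trajectories.mean(axis=0)
    cov = np.cov(trajectories - mu, rowvar=False) + 1e-6 * np.eye(trajectories.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mu, w

def whiten(trajectory, mu, w):
    """Remove the phoneme-dependent mean/covariance structure from samples."""
    return (trajectory - mu) @ w

# Toy usage: two hypothetical phonemes with different trajectory statistics.
rng = np.random.default_rng(0)
data = {"AA": rng.standard_normal((200, 12)) * 2.0 + 1.0,
        "IY": rng.standard_normal((200, 12)) * 0.5 - 1.0}
stats = {ph: fit_whitening(x) for ph, x in data.items()}
whitened = {ph: whiten(x, *stats[ph]) for ph, x in data.items()}
for ph, x in whitened.items():
    # After whitening, each phoneme's data is roughly zero-mean with unit spread.
    print(ph, x.mean().round(2), np.cov(x, rowvar=False).trace().round(2))
```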
IEEE International Conference on Automatic Face and Gesture Recognition | 2013
Soroosh Mariooryad; Carlos Busso
Along with emotions, modulation of the lexical content is an integral aspect of spontaneously produced facial expressions. Hence, the verbal content introduces an undesired variability for solving the facial emotion recognition problem, especially in continuous frame-by-frame analysis during spontaneous human interactions. This study proposes feature- and model-level compensation approaches to address this problem. The feature-level compensation scheme builds upon a trajectory-based modeling of facial features and the whitening transformation of the trajectories. The approach aims to normalize the lexicon-dependent patterns observed in the trajectories. The model-level compensation approach builds viseme-dependent emotional classifiers to incorporate the lexical variability. The emotion recognition experiments on the IEMOCAP corpus validate the effectiveness of the proposed techniques at both the viseme and utterance levels. The accuracies of viseme-level and utterance-level emotion recognition increase by 2.73% (5.9% relative) and 5.82% (11% relative), respectively, over a lexicon-independent baseline. These performances represent statistically significant improvements.
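The model-level compensation idea of one emotion classifier per viseme can be sketched as below; the viseme inventory, the SVM classifiers, the feature dimensionality, and the majority-vote fusion to the utterance level are all illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of model-level compensation: one emotion classifier per
# viseme, with utterance-level decisions by majority vote over viseme segments.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

class VisemeDependentEmotionRecognizer:
    def __init__(self, visemes):
        self.models = {v: SVC() for v in visemes}

    def fit(self, segments):
        """segments: list of (viseme, features, emotion_label) per viseme segment."""
        for v, model in self.models.items():
            xs = np.array([f for vis, f, _ in segments if vis == v])
            ys = np.array([y for vis, _, y in segments if vis == v])
            model.fit(xs, ys)

    def predict_utterance(self, segments):
        """segments: list of (viseme, features) for one utterance."""
        votes = [self.models[v].predict(f.reshape(1, -1))[0] for v, f in segments]
        return Counter(votes).most_common(1)[0][0]

# Toy usage with synthetic viseme segments and two emotion labels.
rng = np.random.default_rng(0)
visemes = ["A", "E", "O"]
train = [(rng.choice(visemes), rng.standard_normal(6), rng.choice(["ang", "hap"]))
         for _ in range(300)]
rec = VisemeDependentEmotionRecognizer(visemes)
rec.fit(train)
print(rec.predict_utterance([(v, rng.standard_normal(6)) for v in "AEO"]))
```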
International Conference on Acoustics, Speech, and Signal Processing | 2014
Soroosh Mariooryad; Anitha Kannan; Dilek Hakkani-Tür; Elizabeth Shriberg
Recent studies have shown the importance of using online videos along with textual material in educational instruction, especially for better content retention and improved concept understanding. A key question is how to select videos to maximize student engagement, particularly when there are multiple possible videos on the same topic. While there are many aspects that drive student engagement, in this paper we focus on presenter speaking styles in the video. We use crowd-sourcing to explore speaking style dimensions in online educational videos, and identify six broad dimensions: liveliness, speaking rate, pleasantness, clarity, formality and confidence. We then propose techniques based solely on acoustic features for automatically identifying a subset of the dimensions. Finally, we perform video re-ranking experiments to learn how users apply their speaking style preferences to augment textbook material. Our findings also indicate how certain dimensions are correlated with perceptions of general pleasantness of the voice.
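A minimal sketch of the re-ranking idea, not the paper's system, is to predict a single style score (say, liveliness) from utterance-level acoustic statistics with a regressor and sort candidate videos by that score; the feature set, the ridge regressor, and the synthetic ratings below are illustrative assumptions.

```python
# Illustrative sketch: rank candidate videos on a topic by a speaking-style
# score (e.g., "liveliness") predicted from simple acoustic statistics.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical training data: per-video acoustic summaries (pitch mean/range,
# energy variance, pause rate, ...) with crowd-sourced liveliness ratings.
train_x = rng.standard_normal((120, 5))
train_y = train_x @ np.array([0.8, 0.5, 0.3, -0.2, 0.1]) + 0.2 * rng.standard_normal(120)
style_model = Ridge(alpha=1.0).fit(train_x, train_y)

# Re-rank a set of candidate videos covering the same topic.
candidates = {f"video_{i}": rng.standard_normal(5) for i in range(6)}
scores = {name: style_model.predict(x.reshape(1, -1))[0] for name, x in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 2))
```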
International Conference on Image Processing | 2012
Soroosh Mariooryad; Carlos Busso
An effective human computer interaction system should be equipped with mechanisms to recognize and respond to the affective state of the user. However, the spoken message conveys different communicative aspects such as the verbal content, emotional state and idiosyncrasy of the speaker. Each of these aspects introduces variability that will affect the performance of an emotion recognition system. If the models used to capture the expressive behaviors are constrained by the lexical content and speaker identity, it is expected that the observed uncertainty in the channel will decrease, improving the accuracy of the system. Motivated by these observations, this study aims to quantify and localize the speaker, lexical and emotional variabilities observed in the face during human interaction. A metric inspired by mutual information theory is proposed to quantify the dependency of facial features on these factors. This metric uses the trace of the covariance matrix of facial motion trajectories to measure the uncertainty. The experimental results confirm the strong influence of the lexical information in the lower part of the face. For this facial region, the results demonstrate the benefit of constraining the emotional model on the lexical content. The ultimate goal of this research is to utilize this information to constrain the emotional models on the underlying lexical units to improve the accuracy of emotion recognition systems.
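A minimal sketch of such a trace-of-covariance uncertainty measure compares the spread of facial-feature trajectories with and without conditioning on a factor such as the lexical unit; the synthetic trajectories and the relative-reduction formula below are assumptions for illustration, not the paper's exact metric.

```python
# Minimal sketch of a trace-of-covariance "uncertainty reduction" measure:
# compare trajectory spread with and without conditioning on a factor.
import numpy as np

def trace_cov(x):
    """Trace of the covariance matrix, used as a scalar uncertainty measure."""
    return np.trace(np.cov(x, rowvar=False))

def uncertainty_reduction(trajectories, factor_labels):
    """Relative drop in uncertainty when the factor is known."""
    total = trace_cov(trajectories)
    conditioned = np.mean([trace_cov(trajectories[factor_labels == f])
                           for f in np.unique(factor_labels)])
    return 1.0 - conditioned / total

# Toy example: lower-face trajectories whose mean shifts with the "phoneme".
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
trajectories = rng.standard_normal((1000, 10)) + 2.0 * labels[:, None]
print(f"uncertainty reduction given the lexical factor: "
      f"{uncertainty_reduction(trajectories, labels):.2f}")
```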
IEEE Transactions on Affective Computing | 2016
Soroosh Mariooryad; Carlos Busso
During spontaneous conversations, the articulation process as well as the internal emotional states influence the facial configurations. Inferring the conveyed emotions from the information presented in facial expressions requires decoupling the linguistic and affective messages in the face. Normalizing and compensating for the underlying lexical content have shown improvement in recognizing facial expressions. However, this requires the transcription and phoneme alignment information, which is not available in a broad range of applications. This study uses the asymmetric bilinear factorization model to perform the decoupling of linguistic and affective information when they are not given. The emotion recognition evaluations on the IEMOCAP database show the capability of the proposed approach to separate these factors in facial expressions, yielding statistically significant performance improvements. The achieved improvement is similar to the case when the ground truth phonetic transcription is known. Similarly, experiments on the SEMAINE database using image-based features demonstrate the effectiveness of the proposed technique in practical scenarios.
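A hedged sketch of an asymmetric bilinear (style/content) factorization fitted with an SVD is shown below, in the spirit of decoupling the affective "style" from the lexical "content" of facial features; the dimensions, the emotion/viseme roles, and the SVD-based fit are illustrative and do not reproduce the paper's exact formulation.

```python
# Hedged sketch of an asymmetric bilinear (style/content) factorization via SVD.
import numpy as np

def fit_asymmetric_bilinear(mean_obs, n_factors):
    """mean_obs: array (n_styles, n_contents, dim) of mean feature vectors
    per (style, content) pair. Returns style-specific maps A (one dim x J
    matrix per style) and shared content vectors B (J x n_contents)."""
    s, c, d = mean_obs.shape
    # Stack styles along the row axis: (s*d) x c observation matrix.
    y = mean_obs.transpose(0, 2, 1).reshape(s * d, c)
    u, sv, vt = np.linalg.svd(y, full_matrices=False)
    a = (u[:, :n_factors] * sv[:n_factors]).reshape(s, d, n_factors)
    b = vt[:n_factors]                     # content factors, shared across styles
    return a, b

# Toy data with genuine bilinear structure: 4 emotions x 8 visemes x 20 dims.
rng = np.random.default_rng(0)
a_true = rng.standard_normal((4, 20, 3))
b_true = rng.standard_normal((3, 8))
obs = np.einsum("sdj,jc->scd", a_true, b_true) + 0.01 * rng.standard_normal((4, 8, 20))
a, b = fit_asymmetric_bilinear(obs, n_factors=3)

# A low reconstruction error indicates the emotion/viseme factors were recovered.
print(np.linalg.norm(a[2] @ b[:, 5] - obs[2, 5]).round(3))
```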
IEEE Transactions on Affective Computing | 2017
Soroosh Mariooryad; Carlos Busso
Many pattern recognition problems involve characterizing samples with continuous labels instead of discrete categories. While regression models are suitable for these learning tasks, the labels are often discretized into binary classes to formulate the problem as a conventional classification task (e.g., classes with low versus high values). This methodology brings intrinsic limitations on the classification performance. The continuous labels are typically normally distributed, with many samples close to the boundary threshold, resulting in poor classification rates. Previous studies only use the discretized labels to train binary classifiers, neglecting the original, continuous labels. This study demonstrates that, even in binary classification problems, exploiting the original labels before splitting the classes can lead to better classification performance. This work proposes an optimal classifier based on the Bayesian maximum a posteriori (MAP) criterion for these problems, which effectively utilizes the real-valued labels. We derive the theoretical average performance of this classifier, which can be considered the expected upper bound performance for the task. Experimental evaluations on synthetic and real data sets show the improvement achieved by the proposed classifier, in contrast to conventional classifiers trained with binary labels. These evaluations clearly demonstrate the optimality of the proposed classifier and the precision of the expected upper bound obtained by our derivation.
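The idea of exploiting the continuous labels can be sketched, under assumptions, as modeling p(y | x) with a probabilistic regressor and assigning the class whose side of the threshold holds more predictive mass, versus a classifier trained only on the binarized labels; the BayesianRidge model, the Gaussian predictive assumption, and the synthetic data are illustrative choices, not the paper's exact MAP derivation.

```python
# Hedged sketch: threshold the predictive distribution of the continuous label
# rather than training only on binarized labels.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import BayesianRidge, LogisticRegression

rng = np.random.default_rng(0)
n, threshold = 2000, 0.0

# Synthetic data: normally distributed continuous labels, noisy features.
y_cont = rng.standard_normal(n)
x = y_cont[:, None] + 0.8 * rng.standard_normal((n, 1))
y_bin = (y_cont > threshold).astype(int)
tr, te = slice(0, 1000), slice(1000, None)

# Regress the continuous label, then threshold its predictive distribution.
reg = BayesianRidge().fit(x[tr], y_cont[tr])
mean, std = reg.predict(x[te], return_std=True)
pred_reg = (1.0 - norm.cdf(threshold, loc=mean, scale=std) > 0.5).astype(int)

# Baseline: binary classifier trained directly on the discretized labels.
clf = LogisticRegression().fit(x[tr], y_bin[tr])
pred_clf = clf.predict(x[te])

print("continuous-label model accuracy:", np.mean(pred_reg == y_bin[te]))
print("binary-label baseline accuracy: ", np.mean(pred_clf == y_bin[te]))
```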