Emily Mower
University of Southern California
Publications
Featured research published by Emily Mower.
Speech Communication | 2007
Michael Grimm; Kristian Kroschel; Emily Mower; Shrikanth Narayanan
Emotion primitive descriptions are an important alternative to classical emotion categories for describing a human's affective expressions. We build a multi-dimensional emotion space composed of the emotion primitives of valence, activation, and dominance. In this study, an image-based, text-free evaluation system is presented that provides intuitive assessment of these emotion primitives and yields high inter-evaluator agreement. An automatic system for estimating the emotion primitives is introduced. We use a fuzzy logic estimator and a rule base derived from acoustic features in speech such as pitch, energy, speaking rate, and spectral characteristics. The approach is tested on two databases. The first database consists of 680 sentences from 3 speakers containing acted emotions in the categories happy, angry, neutral, and sad. The second database contains more than 1000 utterances of 47 speakers with authentic emotion expressions recorded from a television talk show. The estimation results are compared to the human evaluation as a reference and are moderately to highly correlated (0.42 < r < 0.85). Different scenarios are tested: acted vs. authentic emotions, speaker-dependent vs. speaker-independent emotion estimation, and gender-dependent vs. gender-independent emotion estimation. Finally, continuous-valued estimates of the emotion primitives are mapped into the given emotion categories using a k-nearest neighbor classifier. An overall recognition rate of up to 83.5% is accomplished. The errors of the direct emotion estimation are compared to the confusion matrices of the classification from primitives. As a conclusion to this continuous-valued emotion primitives framework, speaker-dependent modeling of emotion expression is proposed, since the emotion primitives are particularly suited for capturing dynamics and intrinsic variations in emotion expression.
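As a rough illustration of the final mapping step described above (continuous valence/activation/dominance estimates to categorical labels via k-nearest neighbors), the following Python sketch uses scikit-learn; the VAD values and training labels are invented placeholders, not the paper's data or implementation.

```python
# Minimal sketch (not the authors' implementation): mapping continuous
# valence/activation/dominance (VAD) estimates to categorical emotion labels
# with a k-nearest-neighbor classifier, as described in the abstract.
# Feature values and labels below are illustrative placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical VAD estimates in [-1, 1] for a handful of utterances.
vad_train = np.array([
    [ 0.8,  0.7,  0.6],   # happy
    [-0.7,  0.8,  0.7],   # angry
    [ 0.0,  0.0,  0.0],   # neutral
    [-0.6, -0.5, -0.4],   # sad
])
labels_train = ["happy", "angry", "neutral", "sad"]

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(vad_train, labels_train)

# A new continuous-valued estimate produced by the primitive estimator.
vad_test = np.array([[0.6, 0.5, 0.4]])
print(knn.predict(vad_test))   # -> ['happy']
```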
Speech Communication | 2011
Chi-Chun Lee; Emily Mower; Carlos Busso; Sungbok Lee; Shrikanth Narayanan
Automated emotion state tracking is a crucial element in the computational study of human communication behaviors. It is important to design robust and reliable emotion recognition systems that are suitable for real-world applications, both to enhance analytical abilities supporting human decision making and to design human-machine interfaces that facilitate efficient communication. We introduce a hierarchical computational structure to recognize emotions. The proposed structure maps an input speech utterance into one of multiple emotion classes through subsequent layers of binary classifications. The key idea is that the levels in the tree are designed to solve the easiest classification tasks first, allowing us to mitigate error propagation. We evaluated the classification framework on two different emotional databases using acoustic features: the AIBO database and the USC IEMOCAP database. On the AIBO database, we obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of average unweighted recall on the evaluation data set improves by 3.37% absolute (8.82% relative) over a Support Vector Machine baseline model. On the USC IEMOCAP database, we obtain an absolute improvement of 7.44% (14.58% relative) over a baseline Support Vector Machine model. The results demonstrate that the presented hierarchical approach is effective for classifying emotional utterances in multiple database contexts.
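The hierarchical idea can be sketched as a cascade of binary classifiers, with an assumed "easy" split resolved first. The sketch below is not the paper's configuration: the split order, features, and data are placeholders, and scikit-learn SVMs merely stand in for the actual classifiers.

```python
# Hedged sketch of the hierarchical structure described above: an utterance is
# routed through layers of binary classifiers. The split order, features, and
# data here are invented placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # stand-in acoustic features
y = rng.choice(["angry", "happy", "neutral", "sad"], size=200)

# Layer 1: high vs. low activation (assumed here to be the "easy" split).
clf_activation = SVC().fit(X, np.isin(y, ["angry", "happy"]))
# Layer 2a: angry vs. happy, trained only on high-activation utterances.
hi = np.isin(y, ["angry", "happy"])
clf_hi = SVC().fit(X[hi], y[hi])
# Layer 2b: neutral vs. sad, trained only on low-activation utterances.
lo = ~hi
clf_lo = SVC().fit(X[lo], y[lo])

def classify(x):
    """Route one utterance through the binary cascade."""
    x = x.reshape(1, -1)
    if clf_activation.predict(x)[0]:
        return clf_hi.predict(x)[0]
    return clf_lo.predict(x)[0]

print(classify(X[0]))
```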
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Emily Mower; Maja J. Matarić; Shrikanth Narayanan
Automatic recognition of emotion is becoming an increasingly important component in the design process for affect-sensitive human-machine interaction (HMI) systems. Well-designed emotion recognition systems have the potential to augment HMI systems by providing additional user state details and by informing the design of emotionally relevant and emotionally targeted synthetic behavior. This paper describes an emotion classification paradigm based on emotion profiles (EPs). This paradigm is an approach to interpret the emotional content of naturalistic human expression by providing multiple probabilistic class labels, rather than a single hard label. EPs provide an assessment of the emotion content of an utterance in terms of a set of simple categorical emotions: anger, happiness, neutrality, and sadness. This method can accurately capture the general emotional label (attaining an accuracy of 68.2% in our experiment on the IEMOCAP data) in addition to identifying underlying emotional properties of highly emotionally ambiguous utterances. This capability is beneficial when dealing with naturalistic human emotional expressions, which are often not well described by a single semantic label.
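A minimal sketch of what an emotion profile might look like in code, assuming a generic probabilistic classifier (logistic regression here, not necessarily the paper's classifiers) over stand-in features: the profile is the vector of per-category confidences rather than a single hard label.

```python
# Illustrative sketch of an emotion profile (EP): each utterance is represented
# by a degree of support for each of the four categories instead of one hard
# label. Features and labels are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 12))                       # stand-in features
y = rng.choice(["anger", "happiness", "neutrality", "sadness"], size=400)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The emotion profile for a new utterance: one confidence per category.
profile = clf.predict_proba(X[:1])[0]
for emotion, p in zip(clf.classes_, profile):
    print(f"{emotion}: {p:.2f}")

# The hard label is just the profile's argmax, but the full profile is kept
# to describe emotionally ambiguous utterances.
print("hard label:", clf.classes_[np.argmax(profile)])
```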
affective computing and intelligent interaction | 2009
Emily Mower; Angeliki Metallinou; Chi-Chun Lee; Abe Kazemzadeh; Carlos Busso; Sungbok Lee; Shrikanth Narayanan
Emotion expression is a complex process involving dependencies based on time, speaker, context, mood, personality, and culture. Emotion classification algorithms designed for real-world application must be able to interpret the emotional content of an utterance or dialog given the modulations resulting from these and other dependencies. Algorithmic development often rests on the assumption that the input emotions are uniformly recognized by a pool of evaluators. However, this style of consistent prototypical emotion expression often does not exist outside of a laboratory environment. This paper presents methods for interpreting the emotional content of non-prototypical utterances. These methods include modeling across multiple time-scales and modeling interaction dynamics between interlocutors. This paper recommends classifying emotions based on emotional profiles, or soft-labels, of emotion expression rather than relying on just raw acoustic features or categorical hard labels. Emotion expression is both interactive and dynamic. Consequently, to accurately recognize emotional content, these aspects must be incorporated during algorithmic design to improve classification performance.
IEEE Transactions on Multimedia | 2009
Emily Mower; Maja J. Matarić; Shrikanth Narayanan
Computer simulated avatars and humanoid robots have an increasingly prominent place in today's world. Acceptance of these synthetic characters depends on their ability to properly and recognizably convey basic emotion states to a user population. This study presents an analysis of the interaction between emotional audio (human voice) and video (simple animation) cues. The emotional relevance of the channels is analyzed with respect to their effect on human perception and through the study of the extracted audio-visual features that contribute most prominently to human perception. As a result of the unequal level of expressivity across the two channels, the audio was shown to bias the perception of the evaluators. However, even in the presence of a strong audio bias, the video data were shown to affect human perception. The feature sets extracted from emotionally matched audio-visual displays contained both audio and video features, while feature sets resulting from emotionally mismatched audio-visual displays contained only audio information. This result indicates that observers integrate natural audio cues and synthetic video cues only when the information expressed is congruent. It is therefore important to properly design the presentation of audio-visual cues, as incorrect design may cause observers to ignore the information conveyed in one of the channels.
international conference on multimedia and expo | 2010
Dongrui Wu; Thomas D. Parsons; Emily Mower; Shrikanth Narayanan
Speech processing is an important element of affective computing. Most research in this direction has focused on classifying emotions into a small number of categories. However, numerical representations of emotions in a multi-dimensional space can be more appropriate to reflect the gradient nature of emotion expressions, and can be more convenient in the sense of dealing with a small set of emotion primitives. This paper presents three approaches (robust regression, support vector regression, and locally linear reconstruction) for emotion primitives estimation in 3D space (valence/activation/dominance), and two approaches (average fusion and locally weighted fusion) to fuse the three elementary estimators for better overall recognition accuracy. The three elementary estimators are diverse and complementary because they cover both linear and nonlinear models, and both global and local models. These five approaches are compared with the state-of-the-art estimator on the same spontaneously elicited emotion dataset. Our results show that all of our three elementary estimators are suitable for speech emotion estimation. Moreover, it is possible to boost the estimation performance by fusing them properly since they appear to leverage complementary speech features.
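A hedged sketch of the estimator-fusion idea for a single primitive (e.g., valence): scikit-learn's HuberRegressor stands in for robust regression, SVR is support vector regression, and a k-nearest-neighbors regressor is substituted for locally linear reconstruction (which scikit-learn does not provide); average fusion is the simpler of the two fusion schemes. The data are synthetic placeholders.

```python
# Minimal sketch of fusing diverse regressors for one emotion primitive.
# Substitutions and data are assumptions, not the paper's implementation.
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))                              # stand-in acoustic features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=300)    # stand-in valence values

estimators = [HuberRegressor().fit(X, y),                  # robust (linear, global)
              SVR().fit(X, y),                             # nonlinear, global
              KNeighborsRegressor(n_neighbors=5).fit(X, y)]  # local stand-in

X_new = rng.normal(size=(5, 8))
predictions = np.stack([est.predict(X_new) for est in estimators])

# Average fusion: combine the complementary estimators by simple averaging.
fused = predictions.mean(axis=0)
print(fused)
```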
international conference on multimedia and expo | 2011
Emily Mower; Matthew P. Black; Elisa Flores; Marian E. Williams; Shrikanth Narayanan
Increasingly, multimodal human-computer interactive tools are leveraged in both autism research and therapies. Embodied conversational agents (ECAs) are employed to facilitate the collection of socio-emotional interactive data from children with autism. In this paper we present an overview of the Rachel system developed at the University of Southern California. The Rachel ECA is designed to elicit and analyze complex, structured, and naturalistic interactions and to encourage affective and social behavior. The pilot studies suggest that this tool can be used to effectively elicit social conversational behavior. This paper presents a description of the multimodal human-computer interaction system and an overview of the collected data. Future work includes utilizing signal processing techniques to provide a quantitative description of the interaction patterns.
international conference on acoustics, speech, and signal processing | 2011
Emily Mower; Shrikanth Narayanan
The goal of emotion classification is to estimate an emotion label, given representative data and discriminative features. Humans are very good at deriving high-level representations of emotion state and integrating this information over time to arrive at a final judgment. However, currently, most emotion classification algorithms do not use this technique. This paper presents a hierarchical static-dynamic emotion classification framework that estimates high-level emotional judgments and locally integrates this information over time to arrive at a final estimate of the affective label. The results suggest that this framework for emotion classification leads to more accurate results than either purely static or purely dynamic strategies.
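One way to picture the static-dynamic combination, under assumptions not taken from the paper: a static classifier produces per-segment emotion posteriors, which are then integrated locally over time (here with a simple sliding-window average) to yield the utterance-level label.

```python
# Hedged sketch: static per-segment posteriors followed by local temporal
# integration. Data, features, and window size are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
emotions = ["angry", "happy", "neutral", "sad"]
X_train = rng.normal(size=(500, 6))                  # stand-in segment features
y_train = rng.choice(emotions, size=500)

static_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One utterance = a sequence of segments; classify each segment statically.
segments = rng.normal(size=(20, 6))
posteriors = static_clf.predict_proba(segments)      # shape (n_segments, 4)

# Local integration: average posteriors over a sliding window, then take the
# label with the highest integrated score at the end of the utterance.
window = 5
smoothed = np.stack([posteriors[max(0, t - window):t + 1].mean(axis=0)
                     for t in range(len(posteriors))])
final_label = static_clf.classes_[np.argmax(smoothed[-1])]
print(final_label)
```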
robot and human interactive communication | 2007
Emily Mower; David J. Feil-Seifer; Maja J. Matarić; Shrikanth Narayanan
Achieving and maintaining user engagement is a key goal of human-robot interaction. This paper presents a method for determining user engagement state from physiological data (including galvanic skin response and skin temperature). In the reported study, physiological data were measured while participants played a wire puzzle game moderated by either a simulated or embodied robot, both with varying personalities. The resulting physiological data were segmented and classified based on position within the trial using the K-Nearest Neighbors algorithm. We found it was possible to estimate the user's engagement state for trials of variable length with an accuracy of 84.73%. In future experiments, this ability would allow assistive robot moderators to estimate the user's likelihood of ending an interaction at any given point during the interaction. This knowledge could then be used to adapt the behavior of the robot in an attempt to re-engage the user.
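An illustrative sketch of such a K-Nearest Neighbors engagement classifier, with invented physiological features (per-segment galvanic skin response and skin temperature statistics), labels, and k; it is not the study's actual feature set or configuration.

```python
# Illustrative sketch of classifying engagement state from physiological
# features with k-nearest neighbors. All values are invented placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# Each row: [mean GSR, GSR slope, mean skin temperature] for one segment.
X = rng.normal(size=(150, 3))
y = rng.choice(["engaged", "disengaged"], size=150)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(rng.normal(size=(1, 3))))          # predicted engagement state
```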
international conference on acoustics, speech, and signal processing | 2008
Emily Mower; Sungbok Lee; Maja J. Matarić; Shrikanth Narayanan
Audio-visual emotion expression by synthetic agents is widely employed in research, industrial, and commercial applications. However, the mechanism through which people judge the multimodal emotional display of these agents is not yet well understood. This study is an attempt to provide a better understanding of the interaction between video and audio channels through the use of a continuous dimensional evaluation framework of valence, activation, and dominance. The results indicate that the congruent audio-visual presentation contains information allowing users to differentiate between happy and angry emotional expressions to a greater degree than either of the two channels individually. Interestingly, however, sad and neutral emotions, which exhibit a lesser degree of activation, show more confusion when presented using both channels. Furthermore, when faced with a conflicting emotional presentation, users predominantly attended to the vocal channel. It is speculated that this is most likely due to the limited level of facial emotion expression inherent in the current animated face. The results also indicate that there is no clear integration of the audio and visual channels in emotion perception analogous to the McGurk effect in speech perception. The final judgments were biased toward the modality with stronger expressive power.