
Publications


Featured research published by Emily Mower Provost.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Deep learning for robust feature generation in audiovisual emotion recognition

Yelin Kim; Honglak Lee; Emily Mower Provost

Automatic emotion recognition systems predict high-level affective content from low-level human-centered signal cues. These systems have seen great improvements in classification accuracy, due in part to advances in feature selection methods. However, many of these feature selection methods capture only linear relationships between features or alternatively require the use of labeled data. In this paper we focus on deep learning techniques, which can overcome these limitations by explicitly capturing complex non-linear feature interactions in multimodal data. We propose and evaluate a suite of Deep Belief Network models, and demonstrate that these models show improvement in emotion classification performance over baselines that do not employ deep learning. This suggests that the learned high-order non-linear relationships are effective for emotion recognition.
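To make the deep-learning idea concrete, here is a minimal sketch (not the authors' architecture or training recipe) of DBN-style unsupervised feature learning: stacked restricted Boltzmann machines feed a simple emotion classifier, using scikit-learn's BernoulliRBM on randomly generated placeholder "audiovisual" features.

```python
# Minimal sketch of DBN-style feature learning for emotion classification.
# Placeholder random features stand in for real audiovisual data.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 120))      # hypothetical utterance-level features
y = rng.integers(0, 4, size=500)     # 4 hypothetical emotion classes

model = Pipeline([
    ("scale", MinMaxScaler()),       # RBMs expect inputs in [0, 1]
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)                      # greedy layer-wise pretraining, then classifier
print("training accuracy:", model.score(X, y))
```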


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2013

Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks

Duc Le; Emily Mower Provost

Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.
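As a rough illustration of the hybrid HMM/deep-network idea, the sketch below implements only the decoding half: a Viterbi pass over per-frame emotion log-posteriors (stand-ins for DBN outputs) with a hand-set transition matrix. All values are placeholders, not the paper's models.

```python
# Toy Viterbi decoding over per-frame emotion log-posteriors, the HMM side
# of a hybrid classifier. Posteriors and transitions are placeholders.
import numpy as np

def viterbi(log_post, log_trans, log_prior):
    """log_post: (T, K) frame log-posteriors; log_trans: (K, K); log_prior: (K,)."""
    T, K = log_post.shape
    delta = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_prior + log_post[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (K, K): from state i to state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_post[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(1)
K, T = 4, 50                                          # 4 emotion states, 50 frames
log_post = np.log(rng.dirichlet(np.ones(K), size=T))  # stand-in for network outputs
trans = np.full((K, K), 0.05) + np.eye(K) * 0.85      # "sticky" emotion transitions
log_trans = np.log(trans / trans.sum(axis=1, keepdims=True))
print(viterbi(log_post, log_trans, np.log(np.full(K, 1.0 / K))))
```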


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014

Ecologically valid long-term mood monitoring of individuals with bipolar disorder using speech

Zahi N. Karam; Emily Mower Provost; Satinder P. Singh; Jennifer Montgomery; Christopher Archer; Gloria J. Harrington; Melvin G. McInnis

Speech patterns are modulated by the emotional and neurophysiological state of the speaker. There exists a growing body of work that computationally examines this modulation in patients suffering from depression, autism, and post-traumatic stress disorder. However, the majority of the work in this area focuses on the analysis of structured speech collected in controlled environments. Here we expand on the existing literature by examining bipolar disorder (BP). BP is characterized by mood transitions, varying from a healthy euthymic state to states characterized by mania or depression. The speech patterns associated with these mood states provide a unique opportunity to study the modulations characteristic of mood variation. We describe methodology to collect unstructured speech continuously and unobtrusively via the recording of day-to-day cellular phone conversations. Our pilot investigation suggests that manic and depressive mood states can be recognized from this speech data, providing new insight into the feasibility of unobtrusive, unstructured, and continuous speech-based wellness monitoring for individuals with BP.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Emotion classification via utterance-level dynamics: A pattern-based approach to characterizing affective expressions

Yelin Kim; Emily Mower Provost

Human emotion changes continuously and sequentially. This results in dynamics intrinsic to affective communication. One of the goals of automatic emotion recognition research is to computationally represent and analyze these dynamic patterns. In this work, we focus on the global utterance-level dynamics. We are motivated by the hypothesis that global dynamics have emotion-specific variations that can be used to differentiate between emotion classes. Consequently, classification systems that focus on these patterns will be able to make accurate emotional assessments. We quantitatively represent emotion flow within an utterance by estimating short-time affective characteristics. We compare time-series estimates of these characteristics using Dynamic Time Warping, a time-series similarity measure. We demonstrate that this similarity can effectively recognize the affective label of the utterance. The similarity-based pattern modeling outperforms both a feature-based baseline and static modeling. It also provides insight into typical high-level patterns of emotion. We visualize these dynamic patterns and the similarities between the patterns to gain insight into the nature of emotion expression.
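A minimal sketch of the general approach, assuming synthetic one-dimensional affective trajectories in place of the paper's short-time estimates: compare trajectories with Dynamic Time Warping and label an utterance by its nearest training example.

```python
# Minimal DTW + 1-nearest-neighbor classification of utterance-level
# affective trajectories. Trajectories here are synthetic 1-D curves.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(query, train_series, train_labels):
    dists = [dtw_distance(query, s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

# Hypothetical "rising" vs. "falling" activation patterns for two emotion classes.
rng = np.random.default_rng(2)
train = [np.linspace(0, 1, 40) + rng.normal(0, 0.05, 40) for _ in range(5)] + \
        [np.linspace(1, 0, 40) + rng.normal(0, 0.05, 40) for _ in range(5)]
labels = ["happy"] * 5 + ["sad"] * 5
query = np.linspace(0, 1, 55) + rng.normal(0, 0.05, 55)   # different length is fine
print(classify(query, train, labels))
```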


IEEE Transactions on Affective Computing | 2017

MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception

Carlos Busso; Srinivas Parthasarathy; Alec Burmania; Mohammed Abdelwahab; Najmeh Sadoughi; Emily Mower Provost

We present the MSP-IMPROV corpus, a multimodal emotional database, where the goal is to have control over lexical content and emotion while also promoting naturalness in the recordings. Studies on emotion perception often require stimuli with fixed lexical content, but that convey different emotions. These stimuli can also serve as an instrument to understand how emotion modulates speech at the phoneme level, in a manner that controls for coarticulation. Such audiovisual data are not easily available from natural recordings. A common solution is to record actors reading sentences that portray different emotions, which may not produce natural behaviors. We propose an alternative approach in which we define hypothetical scenarios for each sentence that are carefully designed to elicit a particular emotion. Two actors improvise these emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions. We describe the context in which this corpus was recorded, the key features of the corpus, the areas in which this corpus can be useful, and the emotional content of the recordings. The paper also provides the performance for speech and facial emotion classifiers. The analysis brings novel classification evaluations where we study the performance in terms of inter-evaluator agreement and naturalness perception, leveraging the large size of the audiovisual database.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Identifying salient sub-utterance emotion dynamics using flexible units and estimates of affective flow

Emily Mower Provost

Emotion recognition is the process of identifying the affective characteristics of an utterance given either static or dynamic descriptions of its signal content. This requires the use of units, windows over which the emotion variation is quantified. However, the appropriate time scale for these units is still an open question. Traditionally, emotion recognition systems have relied upon units of fixed length, whose variation is then modeled over time. This paper takes the view that emotion is expressed over units of variable length. In this paper, variable-length units are introduced and used to capture the local dynamics of emotion at the sub-utterance scale. The results demonstrate that subsets of these local dynamics are salient with respect to emotion class. These salient units provide insight into the natural variation in emotional speech and can be used in a classification framework to achieve performance comparable to the state-of-the-art. This hints at the existence of building blocks that may underlie natural human emotional communication.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Mood state prediction from speech of varying acoustic quality for individuals with bipolar disorder

John Gideon; Emily Mower Provost; Melvin G. McInnis

Speech contains patterns that can be altered by the mood of an individual. There is an increasing focus on automated and distributed methods to collect and monitor speech from large groups of patients suffering from mental health disorders. However, as the scope of these collections increases, the variability in the data also increases. This variability is due in part to the range in the quality of the devices, which in turn affects the quality of the recorded data, negatively impacting the accuracy of automatic assessment. It is necessary to mitigate variability effects in order to expand the impact of these technologies. This paper explores speech collected from phone recordings for analysis of mood in individuals with bipolar disorder. Two different phones with varying amounts of clipping, loudness, and noise are employed. We describe methodologies for use during preprocessing, feature extraction, and data modeling to correct these differences and make the devices more comparable. The results demonstrate that these pipeline modifications result in statistically significantly higher performance, which highlights the potential of distributed mental health systems.
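One simplified reading of the preprocessing step is per-recording channel normalization; the sketch below estimates a clipping rate and applies RMS loudness normalization to a synthetic waveform. It illustrates the general idea only, with illustrative thresholds, not the paper's pipeline.

```python
# Sketch of simple per-recording normalization to reduce device variability:
# estimate the fraction of clipped samples and rescale to a target RMS level.
# The waveform is synthetic; thresholds and targets are illustrative choices.
import numpy as np

def clipping_rate(x, threshold=0.99):
    """Fraction of samples at or above the clipping threshold (full scale = 1.0)."""
    return float(np.mean(np.abs(x) >= threshold))

def rms_normalize(x, target_rms=0.1):
    """Rescale the signal so its root-mean-square level matches target_rms."""
    rms = np.sqrt(np.mean(x ** 2))
    return x if rms == 0 else x * (target_rms / rms)

rng = np.random.default_rng(3)
sr = 8000
t = np.arange(sr) / sr
speechlike = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.05 * rng.normal(size=sr)
clipped = np.clip(speechlike * 2.0, -1.0, 1.0)   # simulate a hot, clipping phone channel

print("clipping rate:", clipping_rate(clipped))
normalized = rms_normalize(clipped)
print("rms after normalization:", float(np.sqrt(np.mean(normalized ** 2))))
```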


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Cross-corpus acoustic emotion recognition from singing and speaking: A multi-task learning approach

Biqiao Zhang; Emily Mower Provost; Georg Essl

Emotion is expressed over both speech and song. Previous works have found that although spoken and sung emotion recognition are different tasks, they are related. Classifiers that explicitly utilize this relatedness can achieve better performance than classifiers that do not. Further, research in speech emotion recognition has demonstrated that emotion is more accurately modeled when gender is taken into account. However, it is not yet clear how domain (speech or song) and gender can be jointly leveraged in emotion recognition systems nor how systems leveraging this information can perform in cross-corpus settings. In this paper, we explore a multi-task emotion recognition framework and compare the performance across different classification models and output selection/fusion methods using cross-corpus evaluation. Our results show that classification accuracy is highest when information is shared only between closely related tasks and when the outputs of disparate models are fused.
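As an illustration of the multi-task setup (not the paper's models), the sketch below shares an encoder across hypothetical (domain, gender) tasks and gives each task its own output head, using PyTorch with random placeholder data and dimensions.

```python
# Sketch of a multi-task emotion classifier: a shared encoder with one output
# head per hypothetical (domain, gender) task, e.g. spoken/female, sung/male.
# Data, dimensions, and task names are placeholders, not the paper's setup.
import torch
import torch.nn as nn

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, n_features=88, n_hidden=64, n_classes=4,
                 tasks=("speech_f", "speech_m", "song_f", "song_m")):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(n_hidden, n_classes) for t in tasks})

    def forward(self, x, task):
        return self.heads[task](self.shared(x))   # shared encoding, task-specific head

model = MultiTaskEmotionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step per task on random data.
for task in model.heads:
    x = torch.randn(32, 88)
    y = torch.randint(0, 4, (32,))
    loss = loss_fn(model(x, task), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(task, float(loss))
```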


ACM Multimedia | 2014

Say Cheese vs. Smile: Reducing Speech-Related Variability for Facial Emotion Recognition

Yelin Kim; Emily Mower Provost

Facial movement is modulated both by emotion and speech articulation. Facial emotion recognition systems aim to discriminate between emotions, while reducing the speech-related variability in facial cues. This aim is often achieved using two key features: (1) phoneme segmentation: facial cues are temporally divided into units with a single phoneme and (2) phoneme-specific classification: systems learn patterns associated with groups of visually similar phonemes (visemes), e.g. P, B, and M. In this work, we empirically compare the effects of different temporal segmentation and classification schemes for facial emotion recognition. We propose an unsupervised segmentation method that does not necessitate costly phonetic transcripts. We show that the proposed method bridges the accuracy gap between a traditional sliding window method and phoneme segmentation, achieving a statistically significant performance gain. We also demonstrate that the segments derived from the proposed unsupervised and phoneme segmentation strategies are similar to each other. This paper provides new insight into unsupervised facial motion segmentation and the impact of speech variability on emotion classification.
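A much cruder stand-in for the unsupervised segmentation method: compute a frame-level novelty score on the facial-feature stream and place segment boundaries at its peaks. The features, window size, and threshold below are all illustrative assumptions.

```python
# Crude unsupervised temporal segmentation of a facial-feature stream:
# compute a frame-level novelty score (distance between the means of the
# windows before and after each frame) and pick its peaks as boundaries.
import numpy as np
from scipy.signal import find_peaks

def novelty_score(feats, win=10):
    """feats: (T, D) feature matrix. Returns a (T,) change score per frame."""
    T = feats.shape[0]
    score = np.zeros(T)
    for t in range(win, T - win):
        left = feats[t - win:t].mean(axis=0)
        right = feats[t:t + win].mean(axis=0)
        score[t] = np.linalg.norm(right - left)
    return score

# Synthetic stream: three segments with different mean facial-feature values.
rng = np.random.default_rng(4)
stream = np.vstack([
    rng.normal(0.0, 0.1, size=(60, 6)),
    rng.normal(1.0, 0.1, size=(40, 6)),
    rng.normal(-0.5, 0.1, size=(50, 6)),
])
score = novelty_score(stream)
boundaries, _ = find_peaks(score, height=1.0, distance=10)
print(boundaries)        # expected boundaries near frames 60 and 100
```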


IEEE International Conference on Automatic Face and Gesture Recognition (FG) | 2015

Modeling transition patterns between events for temporal human action segmentation and classification

Yelin Kim; Jixu Chen; Ming-Ching Chang; Xin Wang; Emily Mower Provost; Siwei Lyu

We propose a temporal segmentation and classification method that accounts for transition patterns between events of interest. We apply this method to automatically detect salient human action events from videos. A discriminative classifier (e.g., Support Vector Machine) is used to recognize human action events and an efficient dynamic programming algorithm is used to jointly determine the starting and ending temporal segments of recognized human actions. The key difference from previous work is that we introduce the modeling of two kinds of event transition information, namely event transition segments, which capture the occurrence patterns between two consecutive events of interest, and event transition probabilities, which model the transition probability between the two events. Experimental results show that our approach significantly improves the segmentation and recognition performance for the two datasets we tested, in which distinctive transition patterns between events exist.
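The sketch below shows the general flavor of such joint segmentation in a highly simplified form: dynamic programming over candidate segments, where each segment is scored by (fake) per-frame classifier outputs and consecutive segments pay a log transition cost. It is an illustrative toy, not the paper's algorithm.

```python
# Simplified joint segmentation/classification by dynamic programming:
# frames get per-class log-scores (a stand-in for a classifier), and
# consecutive segments pay a log transition cost between event labels.
import numpy as np

def segment_dp(log_scores, log_trans):
    """log_scores: (T, K) per-frame log-scores; log_trans: (K, K).
    Returns a list of (start, end_exclusive, label) segments."""
    T, K = log_scores.shape
    cum = np.vstack([np.zeros(K), np.cumsum(log_scores, axis=0)])  # prefix sums
    best = np.full((T + 1, K), -np.inf)
    back = {}                                   # (t, k) -> (prev boundary, prev label)
    best[0, :] = 0.0
    for t in range(1, T + 1):
        for k in range(K):
            for s in range(t):
                seg = cum[t, k] - cum[s, k]     # score of segment [s, t) with label k
                for j in range(K):
                    trans = 0.0 if s == 0 else log_trans[j, k]
                    cand = best[s, j] + trans + seg
                    if cand > best[t, k]:
                        best[t, k] = cand
                        back[(t, k)] = (s, j)
    t, k = T, int(best[T].argmax())             # backtrack from the best final label
    segments = []
    while t > 0:
        s, j = back[(t, k)]
        segments.append((s, t, k))
        t, k = s, j
    return segments[::-1]

rng = np.random.default_rng(5)
T, K = 30, 3
true_labels = np.repeat([0, 2, 1], [10, 12, 8])
log_scores = np.full((T, K), -2.0) + rng.normal(0, 0.2, (T, K))
log_scores[np.arange(T), true_labels] = -0.2    # true label scores higher per frame
log_trans = np.log(np.full((K, K), 1.0 / K))    # uniform transition probabilities
print(segment_dp(log_scores, log_trans))        # expected roughly [(0,10,0), (10,22,2), (22,30,1)]
```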

Collaboration


Dive into Emily Mower Provost's collaborations.

Top Co-Authors

Duc Le (University of Michigan)
Yelin Kim (University of Michigan)
John Gideon (University of Michigan)
Georg Essl (University of Michigan)
Shrikanth Narayanan (University of Southern California)
Keli Licata (University of Michigan)