Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yelin Kim is active.

Publication


Featured research published by Yelin Kim.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Deep learning for robust feature generation in audiovisual emotion recognition

Yelin Kim; Honglak Lee; Emily Mower Provost

Automatic emotion recognition systems predict high-level affective content from low-level human-centered signal cues. These systems have seen great improvements in classification accuracy, due in part to advances in feature selection methods. However, many of these feature selection methods capture only linear relationships between features or alternatively require the use of labeled data. In this paper we focus on deep learning techniques, which can overcome these limitations by explicitly capturing complex non-linear feature interactions in multimodal data. We propose and evaluate a suite of Deep Belief Network models, and demonstrate that these models show improvement in emotion classification performance over baselines that do not employ deep learning. This suggests that the learned high-order non-linear relationships are effective for emotion recognition.
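To give a concrete, if simplified, picture of the kind of model described above, the sketch below stacks two Bernoulli restricted Boltzmann machines (the building blocks of a Deep Belief Network) over concatenated audio-visual features with scikit-learn and then trains a simple classifier on the learned representation. The feature dimensions, hyperparameters, and final classifier are placeholder assumptions, not the architecture evaluated in the paper.

```python
# Illustrative sketch (not the paper's exact architecture): stack two
# Bernoulli RBMs over concatenated audio-visual features, then train a
# simple classifier on the learned representation. Feature dimensions,
# hyperparameters, and the final classifier are assumptions.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_audio = rng.rand(200, 40)      # placeholder acoustic features per utterance
X_video = rng.rand(200, 30)      # placeholder facial features per utterance
y = rng.randint(0, 4, size=200)  # four emotion classes (e.g., angry, happy, neutral, sad)

X = np.hstack([X_audio, X_video])  # early fusion of the two modalities

model = Pipeline([
    ("scale", MinMaxScaler()),   # RBMs expect inputs in [0, 1]
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```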


International Conference on Acoustics, Speech, and Signal Processing | 2013

Emotion classification via utterance-level dynamics: A pattern-based approach to characterizing affective expressions

Yelin Kim; Emily Mower Provost

Human emotion changes continuously and sequentially. This results in dynamics intrinsic to affective communication. One of the goals of automatic emotion recognition research is to computationally represent and analyze these dynamic patterns. In this work, we focus on the global utterance-level dynamics. We are motivated by the hypothesis that global dynamics have emotion-specific variations that can be used to differentiate between emotion classes. Consequently, classification systems that focus on these patterns will be able to make accurate emotional assessments. We quantitatively represent emotion flow within an utterance by estimating short-time affective characteristics. We compare time-series estimates of these characteristics using Dynamic Time Warping, a time-series similarity measure. We demonstrate that this similarity can effectively recognize the affective label of the utterance. The similarity-based pattern modeling outperforms both a feature-based baseline and static modeling. It also provides insight into typical high-level patterns of emotion. We visualize these dynamic patterns and the similarities between the patterns to gain insight into the nature of emotion expression.
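As an illustration of the similarity measure underlying this approach, the following is a minimal dynamic time warping implementation applied to two synthetic short-time affective contours. The contours, and the use of a plain scalar distance, are placeholder assumptions rather than the paper's exact feature representation.

```python
# Minimal dynamic time warping (DTW) distance between two time series of
# short-time affective estimates (e.g., frame-level activation scores).
# The contours below are synthetic placeholders.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

contour_a = np.array([0.1, 0.3, 0.7, 0.9, 0.8])        # e.g., rising activation
contour_b = np.array([0.2, 0.2, 0.4, 0.8, 0.9, 0.85])
print("DTW distance:", dtw_distance(contour_a, contour_b))
```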


ACM Multimedia | 2014

Say Cheese vs. Smile: Reducing Speech-Related Variability for Facial Emotion Recognition

Yelin Kim; Emily Mower Provost

Facial movement is modulated both by emotion and speech articulation. Facial emotion recognition systems aim to discriminate between emotions, while reducing the speech-related variability in facial cues. This aim is often achieved using two key features: (1) phoneme segmentation: facial cues are temporally divided into units with a single phoneme and (2) phoneme-specific classification: systems learn patterns associated with groups of visually similar phonemes (visemes), e.g. P, B, and M. In this work, we empirically compare the effects of different temporal segmentation and classification schemes for facial emotion recognition. We propose an unsupervised segmentation method that does not necessitate costly phonetic transcripts. We show that the proposed method bridges the accuracy gap between a traditional sliding window method and phoneme segmentation, achieving a statistically significant performance gain. We also demonstrate that the segments derived from the proposed unsupervised and phoneme segmentation strategies are similar to each other. This paper provides new insight into unsupervised facial motion segmentation and the impact of speech variability on emotion classification.
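For intuition only, the sketch below shows one generic way to segment facial-motion features without phonetic transcripts: cluster frames and place segment boundaries wherever the cluster label changes. This is a stand-in illustration with synthetic features, not the segmentation method proposed in the paper.

```python
# Illustrative unsupervised temporal segmentation of facial-motion features:
# cluster frames and place boundaries where the cluster label changes.
# A generic stand-in for transcript-free segmentation; features are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(1)
frames = np.vstack([
    rng.normal(0.0, 0.1, size=(40, 6)),   # placeholder facial-landmark features
    rng.normal(1.0, 0.1, size=(30, 6)),
    rng.normal(0.3, 0.1, size=(50, 6)),
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(frames)
boundaries = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
print("segment boundaries (frame indices):", boundaries)
```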


IEEE International Conference on Automatic Face and Gesture Recognition | 2015

Modeling transition patterns between events for temporal human action segmentation and classification

Yelin Kim; Jixu Chen; Ming-Ching Chang; Xin Wang; Emily Mower Provost; Siwei Lyu

We propose a temporal segmentation and classification method that accounts for transition patterns between events of interest. We apply this method to automatically detect salient human action events from videos. A discriminative classifier (e.g., Support Vector Machine) is used to recognize human action events and an efficient dynamic programming algorithm is used to jointly determine the starting and ending temporal segments of recognized human actions. The key difference from previous work is that we introduce the modeling of two kinds of event transition information, namely event transition segments, which capture the occurrence patterns between two consecutive events of interest, and event transition probabilities, which model the transition probability between the two events. Experimental results show that our approach significantly improves the segmentation and recognition performance for the two datasets we tested, in which distinctive transition patterns between events exist.
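The joint segmentation idea can be illustrated with a simplified Viterbi-style dynamic program that combines per-frame classifier scores with an event-transition probability matrix. The scores and transition matrix below are invented placeholders, and the sketch omits the event-transition segments modeled in the paper.

```python
# Simplified Viterbi-style dynamic program: given per-frame classifier scores
# for each event label and an event-transition probability matrix, jointly
# choose the most likely label sequence (and hence segment boundaries).
# A sketch of the general idea, not the paper's exact algorithm.
import numpy as np

def decode(frame_scores: np.ndarray, log_trans: np.ndarray) -> np.ndarray:
    """frame_scores: (T, K) log-scores; log_trans: (K, K) log transition probabilities."""
    T, K = frame_scores.shape
    best = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    best[0] = frame_scores[0]
    for t in range(1, T):
        cand = best[t - 1][:, None] + log_trans   # rows: previous label, cols: current label
        back[t] = np.argmax(cand, axis=0)
        best[t] = frame_scores[t] + np.max(cand, axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(best[-1]))
    for t in range(T - 2, -1, -1):                # backtrack the best path
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.RandomState(0)
scores = np.log(rng.dirichlet(np.ones(3), size=20))   # 20 frames, 3 event labels (placeholder)
trans = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.1, 0.8]]))           # "sticky" transitions favor staying in an event
print("frame labels:", decode(scores, trans))
```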


International Conference on Multimodal Interfaces | 2016

Emotion spotting: discovering regions of evidence in audio-visual emotion expressions

Yelin Kim; Emily Mower Provost

Research has demonstrated that humans require different amounts of information, over time, to accurately perceive emotion expressions. This varies as a function of emotion classes. For example, recognition of happiness requires a longer stimulus than recognition of anger. However, previous automatic emotion recognition systems have often overlooked these differences. In this work, we propose a data-driven framework to explore patterns (timings and durations) of emotion evidence, specific to individual emotion classes. Further, we demonstrate that these patterns vary as a function of which modality (lower face, upper face, or speech) is examined, and consistent patterns emerge across different folds of experiments. We also show similar patterns across emotional corpora (IEMOCAP and MSP-IMPROV). In addition, we show that our proposed method, which uses only a portion of the data (59% for the IEMOCAP), achieves comparable accuracy to a system that uses all of the data within each utterance. Our method has a higher accuracy when compared to a baseline method that randomly chooses a portion of the data. We show that the performance gain of the method is mostly from prototypical emotion expressions (defined as expressions with rater consensus). The innovation in this study comes from its understanding of how multimodal cues reveal emotion over time.
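A toy version of the spotting idea, assuming synthetic frame-level scores: slide a fixed-length window over the per-frame evidence for a target emotion and keep the span with the highest mean score. The window length and scores are placeholders, not the data-driven timings and durations learned in the paper.

```python
# Illustrative "spotting" of the strongest region of evidence for a target
# emotion: slide a fixed-length window over frame-level classifier scores and
# keep the span with the highest mean score. Scores are synthetic placeholders.
import numpy as np

def spot_region(scores: np.ndarray, win: int) -> tuple[int, int]:
    means = np.convolve(scores, np.ones(win) / win, mode="valid")  # windowed mean evidence
    start = int(np.argmax(means))
    return start, start + win

rng = np.random.RandomState(2)
frame_scores = rng.rand(100)        # per-frame evidence for, e.g., "anger"
frame_scores[30:55] += 0.8          # a burst of strong evidence mid-utterance
start, end = spot_region(frame_scores, win=20)
print(f"most informative region: frames {start}-{end}")
```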


ACM Multimedia | 2015

Emotion Recognition During Speech Using Dynamics of Multiple Regions of the Face

Yelin Kim; Emily Mower Provost

The need for human-centered, affective multimedia interfaces has motivated research in automatic emotion recognition. In this article, we focus on facial emotion recognition. Specifically, we target a domain in which speakers produce emotional facial expressions while speaking. The main challenge of this domain is the presence of modulations due to both emotion and speech. For example, an individual's mouth movement may be similar when he smiles and when he pronounces the phoneme /IY/, as in “cheese”. The result of this confusion is a decrease in performance of facial emotion recognition systems. In our previous work, we investigated the joint effects of emotion and speech on facial movement. We found that it is critical to employ proper temporal segmentation and to leverage knowledge of spoken content to improve classification performance. In the current work, we investigate the temporal characteristics of specific regions of the face, such as the forehead, eyebrow, cheek, and mouth. We present a methodology that uses the temporal patterns of specific regions of the face in the context of a facial emotion recognition system. We test our proposed approaches on two emotion datasets, the IEMOCAP and SAVEE datasets. Our results demonstrate that the combination of emotion recognition systems based on different facial regions improves overall accuracy compared to systems that do not leverage different characteristics of individual regions.
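A minimal sketch of the region-combination idea, assuming synthetic features and a simple probability-averaging fusion rule (not necessarily the paper's): train one classifier per facial region and average their class probabilities.

```python
# Sketch of late fusion across region-specific facial classifiers: one
# classifier per region (here, synthetic "upper face" and "mouth" features),
# with class probabilities averaged at test time. Dimensions and the fusion
# rule are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(3)
y = rng.randint(0, 4, size=300)                   # four emotion classes
X_upper = rng.rand(300, 10) + y[:, None] * 0.05   # upper-face features (synthetic)
X_mouth = rng.rand(300, 12) + y[:, None] * 0.05   # mouth-region features (synthetic)

clf_upper = LogisticRegression(max_iter=1000).fit(X_upper[:200], y[:200])
clf_mouth = LogisticRegression(max_iter=1000).fit(X_mouth[:200], y[:200])

proba = (clf_upper.predict_proba(X_upper[200:]) +
         clf_mouth.predict_proba(X_mouth[200:])) / 2.0
pred = proba.argmax(axis=1)
print("fused accuracy:", (pred == y[200:]).mean())
```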


Affective Computing and Intelligent Interaction | 2015

Leveraging inter-rater agreement for audio-visual emotion recognition

Yelin Kim; Emily Mower Provost

Human expressions are often ambiguous and unclear, resulting in disagreement or confusion among different human evaluators. In this paper, we investigate how audiovisual emotion recognition systems can leverage prototypicality, the level of agreement or confusion among human evaluators. We propose the use of a weighted Support Vector Machine to explicitly model the relationship between the prototypicality of training instances and evaluated emotion from the IEMOCAP corpus. We choose the weights of prototypical and non-prototypical instances based on the maximal accuracy of each speaker. We then provide per-speaker analysis to understand specific speech characteristics associated with the information gain of emotion given prototypicality information. Our experimental results show that neutrality, one of the most challenging emotions to recognize, has the highest performance gain from prototypicality information, compared to the other emotion classes: Angry, Happy, and Sad. We also show that the proposed method improves the overall multi-class classification accuracy significantly over traditional methods that do not leverage prototypicality.
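The weighting mechanism can be sketched with scikit-learn's per-sample weights: prototypical (high-agreement) training instances receive a different weight than non-prototypical ones. The data and the specific weight values below are placeholders; the paper selects weights per speaker.

```python
# Minimal weighted-SVM sketch: weight prototypical training instances (high
# rater agreement) differently from non-prototypical ones via sample_weight.
# Data and weight values are illustrative placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(4)
X = rng.rand(200, 20)                 # utterance-level features (synthetic)
y = rng.randint(0, 4, size=200)       # angry / happy / neutral / sad
prototypical = rng.rand(200) > 0.5    # True if raters agreed on the label

weights = np.where(prototypical, 1.0, 0.5)   # down-weight ambiguous instances
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y, sample_weight=weights)
print("train accuracy:", clf.score(X, y))
```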


IEEE Access | 2018

Towards Emotionally Aware AI Smart Classroom: Current Issues and Directions for Engineering and Education

Yelin Kim; Tolga Soyata; Reza Feyzi Behnagh

The future smart classrooms we envision will significantly enhance the learning experience and enable seamless communication among students and teachers through real-time sensing and machine intelligence. Existing developments in engineering have brought the state of the art to an inflection point where these technologies can be utilized as components of a smart classroom. In this paper, we propose a smart classroom system that consists of these components. Our proposed system is capable of making real-time suggestions to an in-class presenter to improve the quality and memorability of their presentation by allowing the presenter to make real-time adjustments and corrections to their non-verbal behavior, such as hand gestures, facial expressions, and body language. We base our suggested system components on existing research in affect sensing, deep learning-based emotion recognition, and real-time mobile-cloud computing. We provide a comprehensive study of these technologies and determine the computational requirements of a system that incorporates them. Based on these requirements, we provide a feasibility study of the system. Although state-of-the-art research on most of the components we propose is advanced enough to realize the system, the main challenges lie in: 1) the integration of these technologies into a holistic system design; 2) their algorithmic adaptation to allow real-time execution; and 3) the quantification of valid educational variables for use in the algorithms. We discuss current issues and provide future directions for the engineering and education disciplines to deploy the proposed system.


IEEE Transactions on Affective Computing | 2017

ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition

Yelin Kim; Emily Mower Provost

Emotion is an essential part of human interaction. Automatic emotion recognition can greatly benefit human-centered interactive technology, since extracted emotion can be used to understand and respond to user needs. However, real-world emotion recognition faces a central challenge when a user is speaking: facial movements due to speech are often confused with facial movements related to emotion. Recent studies have found that the use of phonetic information can reduce speech-related variability in the lower face region. However, methods to differentiate upper face movements due to emotion and due to speech have been underexplored. This gap leads us to the proposal of the Informed Segmentation and Labeling Approach (ISLA). ISLA uses speech signals that alter the dynamics of the lower and upper face regions. We demonstrate how pitch can be used to improve estimates of emotion from the upper face, and how this estimate can be combined with emotion estimates from the lower face and speech in a multimodal classification system. Our emotion classification results on the IEMOCAP and SAVEE datasets show that ISLA improves overall classification performance. We also demonstrate how emotion estimates from different modalities correlate with each other, providing insights into the differences between posed and spontaneous expressions.
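As a rough illustration of pitch-informed processing (with a synthetic pitch contour and synthetic features, not the ISLA pipeline itself): split an utterance into voiced and unvoiced frames using the pitch contour, then pool upper-face features over each part so that downstream classifiers can treat the two conditions differently.

```python
# Rough sketch of pitch-informed processing for the upper face: split an
# utterance into voiced and unvoiced frames using a pitch contour, and pool
# upper-face features separately over each part. Contour and features are
# synthetic placeholders; this is not the ISLA algorithm itself.
import numpy as np

rng = np.random.RandomState(5)
pitch = np.where(rng.rand(120) > 0.4, rng.uniform(80, 250, 120), 0.0)  # Hz; 0 marks unvoiced frames
upper_face = rng.rand(120, 8)                                          # per-frame upper-face features

voiced = pitch > 0
descriptor = np.concatenate([upper_face[voiced].mean(axis=0),    # pooled over voiced frames
                             upper_face[~voiced].mean(axis=0)])  # pooled over unvoiced frames
print("pitch-informed upper-face descriptor shape:", descriptor.shape)
```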


International Conference on Multimodal Interfaces | 2016

Wild wild emotion: a multimodal ensemble approach

John Gideon; Biqiao Zhang; Zakaria Aldeneh; Yelin Kim; Soheil Khorram; Duc Le; Emily Mower Provost

Automatic emotion recognition from audio-visual data is a topic that has been broadly explored using data captured in the laboratory. However, these data are not necessarily representative of how emotion is manifested in the real-world. In this paper, we describe our system for the 2016 Emotion Recognition in the Wild challenge. We use the Acted Facial Expressions in the Wild database 6.0 (AFEW 6.0), which contains short clips of popular TV shows and movies and has more variability in the data compared to laboratory recordings. We explore a set of features that incorporate information from facial expressions and speech, in addition to cues from the background music and overall scene. In particular, we propose the use of a feature set composed of dimensional emotion estimates trained from outside acoustic corpora. We design sets of multiclass and pairwise (one-versus-one) classifiers and fuse the resulting systems. Our fusion increases the performance from a baseline of 38.81% to 43.86% and from 40.47% to 46.88%, for validation and test sets, respectively. While the video features perform better than audio features alone, a combination of the two modalities achieves the greatest performance, with gains of 4.4% and 1.4%, with and without information gain, respectively. Because of the flexible design of the fusion, it is easily adaptable to other multimodal learning problems.
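A small sketch of fusing a multiclass system with a pairwise (one-versus-one) system trained on the same features. The features, the per-sample score normalization, and the two base classifiers are assumptions made for illustration; the actual challenge entry fuses many more feature sets and classifiers.

```python
# Sketch of fusing a multiclass system with a one-versus-one system: train
# both on the same features, normalize each system's scores per sample, and
# sum them. Features and the normalization scheme are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

rng = np.random.RandomState(6)
y = rng.randint(0, 7, size=400)             # seven emotion classes, as in AFEW
X = rng.rand(400, 50) + y[:, None] * 0.03   # synthetic audio-visual features
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

multi = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
pairwise = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X_tr, y_tr)

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize each sample's class scores to [0, 1]."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / (hi - lo + 1e-9)

fused = normalize(multi.predict_proba(X_te)) + normalize(pairwise.decision_function(X_te))
pred = fused.argmax(axis=1)
print("fused accuracy:", (pred == y_te).mean())
```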

Collaboration


Dive into Yelin Kim's collaborations.

Top Co-Authors

Duc Le (University of Michigan)
Honglak Lee (University of Michigan)
John Gideon (University of Michigan)