Che-Wei Huang
University of Southern California
Publication
Featured research published by Che-Wei Huang.
PeerJ | 2016
Bo Xiao; Che-Wei Huang; Zac E. Imel; David C. Atkins; Panayiotis G. Georgiou; Shrikanth Narayanan
Scaling up psychotherapy services, such as addiction counseling, is a critical societal need. One challenge is ensuring the quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, plus an automatic speech recognizer and a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a lattice rescoring method to characterize high vs. low empathic language. We estimated session-level empathy codes using utterance-level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert-annotated empathy codes and machine-derived estimates, and an accuracy of 81% in classifying high vs. low empathy, compared to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. These results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.
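As an illustration of the session-level estimation step, the following Python sketch aggregates per-utterance language-model evidence into a session-level empathy decision. The log-likelihood values, the utterance_llr/session_empathy_score helpers, and the zero decision threshold are hypothetical placeholders, not the paper's implementation.

from typing import List

def utterance_llr(ll_high: float, ll_low: float) -> float:
    """Log-likelihood ratio of an utterance under the high- vs. low-empathy language model."""
    return ll_high - ll_low

def session_empathy_score(utterance_scores: List[float]) -> float:
    """Average the per-utterance evidence over the whole session."""
    return sum(utterance_scores) / max(len(utterance_scores), 1)

def classify_session(score: float, threshold: float = 0.0) -> str:
    """Binary high/low empathy decision from the session-level score."""
    return "high" if score > threshold else "low"

# Three therapist utterances with made-up language-model log-likelihoods
scores = [utterance_llr(-42.1, -44.0), utterance_llr(-38.5, -37.9), utterance_llr(-51.2, -53.3)]
print(classify_session(session_empathy_score(scores)))  # -> "high"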
conference of the international speech communication association | 2016
Che-Wei Huang; Shrikanth Narayanan
Recently, attention mechanism based deep learning has gained much popularity in speech recognition and natural language processing due to its flexibility at the decoding phase. Through the attention mechanism, the relevant encoding context vectors contribute the majority of the decoding context, while the effect of the irrelevant ones is minimized. Inspired by this idea, this work proposes a speech emotion recognition system that actively selects sub-utterance representations to compose a more discriminative utterance representation. Compared to a baseline model with uniform attention (i.e., no attention at all), the attention-based model improves the weighted accuracy on the emotion classification task by an absolute 1.46% (from 57.87% to 59.33%). Moreover, the selection distribution leads to a better understanding of the sub-utterance structure of an emotional utterance.
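A minimal sketch of the attention-pooling idea, assuming fixed-dimensional sub-utterance embeddings are already extracted; the attention query vector, dimensions, and random inputs are illustrative, not the paper's architecture.

import numpy as np

def attention_pool(segments: np.ndarray, w: np.ndarray) -> np.ndarray:
    """segments: (T, d) sub-utterance representations; w: (d,) attention query.
    Returns a (d,) utterance representation as an attention-weighted sum."""
    scores = segments @ w                      # relevance score per segment
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                     # softmax attention weights
    return alphas @ segments                   # weighted combination of segments

# Uniform attention (the "no attention" baseline) is the special case where
# every segment receives weight 1/T, i.e. simple mean pooling.
rng = np.random.default_rng(0)
segs = rng.normal(size=(20, 64))               # 20 segments, 64-dim each
utt = attention_pool(segs, rng.normal(size=64))
print(utt.shape)                               # (64,)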
international workshop on machine learning for signal processing | 2016
Che-Wei Huang; Shrikanth Narayanan
We propose a rate-distortion based deep neural network (DNN) training algorithm that uses a smooth matrix functional on the manifold of positive semi-definite matrices as a non-parametric entropy estimator. The optimization objective includes not only a measure of performance at the output layer but also a measure of information distortion between consecutive layers, so that each layer produces a concise representation of its input. An experiment on speech emotion recognition shows that a DNN trained with this method achieves performance comparable to an encoder-decoder system.
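A hedged sketch of one way to realize such an objective: a von Neumann-style entropy of the trace-normalized Gram matrix of layer activations, with a penalty on entropy change between consecutive layers. The specific functional, the absolute-difference distortion measure, and the weight beta are assumptions for illustration, not the paper's formulation.

import numpy as np

def matrix_entropy(activations: np.ndarray, eps: float = 1e-12) -> float:
    """Von Neumann-style entropy of the trace-normalized Gram matrix
    built from a batch of layer activations of shape (batch, dim)."""
    gram = activations @ activations.T                 # positive semi-definite
    gram = gram / (np.trace(gram) + eps)               # unit trace
    eigvals = np.clip(np.linalg.eigvalsh(gram), eps, None)
    return float(-(eigvals * np.log(eigvals)).sum())

def rate_distortion_penalty(layers: list, beta: float = 0.1) -> float:
    """Sum of entropy differences between consecutive layer representations."""
    return beta * sum(abs(matrix_entropy(a) - matrix_entropy(b))
                      for a, b in zip(layers[:-1], layers[1:]))

# total_loss = task_loss + rate_distortion_penalty([h1, h2, ...])
rng = np.random.default_rng(0)
h1, h2 = rng.normal(size=(32, 128)), rng.normal(size=(32, 64))
print(rate_distortion_penalty([h1, h2]))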
multimedia signal processing | 2016
Naveen Kumar; Tanaya Guha; Che-Wei Huang; Colin Vaz; Shrikanth Narayanan
The majority of computational work on emotion in music concentrates on developing machine learning methodologies to build new, more accurate prediction systems, and usually relies on generic acoustic features. Comparatively less effort has been put into developing and analyzing features that are particularly suited to the task. The contribution of this paper is twofold. First, the paper proposes two features that efficiently capture emotion-related properties in music: compressibility and sparse spectral components. These features are designed to capture the overall affective characteristics of music (global features). We demonstrate that they predict emotional dimensions (arousal and valence) with high accuracy compared to generic audio features. Second, we investigate the relationship between the proposed features and the dynamic variation in emotion ratings. To this end, we propose a novel Haar transform-based technique to predict dynamic emotion ratings using only global features.
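One plausible (assumed) instantiation of a compressibility feature, measuring how well the quantized waveform compresses under a generic lossless coder; the 16-bit quantization and the choice of zlib are illustrative, not the paper's exact definition.

import zlib
import numpy as np

def compressibility(signal: np.ndarray) -> float:
    """Ratio of compressed to raw size for a 16-bit quantized signal;
    lower values indicate a more redundant, more compressible signal."""
    pcm = np.asarray(signal * 32767, dtype=np.int16).tobytes()
    return len(zlib.compress(pcm, level=9)) / len(pcm)

# A periodic tone compresses far better than white noise (ratio close to 1).
t = np.linspace(0, 1, 16000, endpoint=False)
print(compressibility(np.sin(2 * np.pi * 400 * t)))                     # small ratio
print(compressibility(np.random.default_rng(0).uniform(-1, 1, 16000)))  # near 1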
international conference on multimodal interfaces | 2015
Tanaya Guha; Che-Wei Huang; Naveen Kumar; Yan Zhu; Shrikanth Narayanan
The goal of this paper is to enable an objective understanding of gender portrayals in popular films and media through multimodal content analysis. An automated system for analyzing gender representation in terms of screen presence and speaking time is developed. First, we process the video and the audio content independently to estimate the gender distribution of screen presence at the shot level and of speech at the utterance level. A measure of the movie's excitement or intensity is computed from audiovisual features for every scene. This measure is used as a weighting function to combine the gender-based screen/speaking time information at the shot/utterance level into a gender representation measure for the entire movie. Detailed results and analyses are presented for seventeen full-length Hollywood movies.
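An illustrative sketch (assumed form, not the authors' exact formula) of the weighting step: shot-level screen time for each gender is combined using a per-shot excitement weight to yield a movie-level presence measure. All numbers and the helper name are hypothetical.

from typing import List, Tuple

def weighted_gender_presence(shots: List[Tuple[float, float, float]]) -> float:
    """Each shot is (female_screen_time, male_screen_time, excitement_weight).
    Returns the excitement-weighted fraction of female screen presence."""
    num = sum(f * w for f, _, w in shots)
    den = sum((f + m) * w for f, m, w in shots)
    return num / den if den else 0.0

# Three hypothetical shots: (female seconds, male seconds, excitement weight)
print(weighted_gender_presence([(4.0, 1.0, 0.9), (0.5, 6.0, 0.4), (2.0, 2.0, 0.7)]))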
multimedia signal processing | 2016
Che-Wei Huang; Shrikanth Narayanan
In this work, we studied the problem of fall detection using signals from tri-axial wearable sensors. In particular, we focused on comparing methods that combine signals from multiple tri-axial accelerometers attached to different body parts in order to recognize human activities. To improve the detection rate while maintaining a low false alarm rate, previous studies developed detection algorithms by cascading base algorithms and experimented on each sensor's data separately. Rather than combining base algorithms, we explored the combination of multiple data sources. Based on the hypothesis that these sensor signals provide complementary information about human physical activities, we benchmarked feature-level and kernel-level fusion to learn a kernel that incorporates multiple sensors into the support vector classifier. The results show that, under the same false alarm rate constraint, the detection rate improves when using signals from multiple sensors compared to a baseline with no fusion.
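A minimal sketch of kernel-level fusion under stated assumptions: one RBF kernel per accelerometer, summed with equal weights into a precomputed SVM kernel. The sensor names, feature dimensions, gamma value, and random data are placeholders, not the study's setup.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def fused_kernel(sensor_feats: list, gamma: float = 0.1) -> np.ndarray:
    """sensor_feats: list of (n_samples, n_features) arrays, one per sensor.
    Returns the unweighted sum of per-sensor RBF kernels."""
    return sum(rbf_kernel(X, gamma=gamma) for X in sensor_feats)

rng = np.random.default_rng(0)
waist, wrist = rng.normal(size=(100, 20)), rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)                # toy fall vs. non-fall labels

K = fused_kernel([waist, wrist])
clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))                          # training accuracy on toy data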
international conference on multimedia and expo | 2017
Che-Wei Huang; Shrikanth Narayanan
conference of the international speech communication association | 2014
Che-Wei Huang; Bo Xiao; Panayiotis G. Georgiou; Shrikanth Narayanan
conference of the international speech communication association | 2018
Che-Wei Huang; Shrikanth Narayanan
international conference on acoustics, speech, and signal processing | 2018
Che-Wei Huang; Shrikanth Narayanan