Publication


Featured research published by Wen-Li Wei.


IEEE Transactions on Multimedia | 2012

Error Weighted Semi-Coupled Hidden Markov Model for Audio-Visual Emotion Recognition

Jen-Chun Lin; Chung-Hsien Wu; Wen-Li Wei

This paper presents an approach to the automatic recognition of human emotions from audio-visual bimodal signals using an error weighted semi-coupled hidden Markov model (EWSC-HMM). The proposed approach combines an SC-HMM with a state-based bimodal alignment strategy and a Bayesian classifier weighting scheme to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. The state-based bimodal alignment strategy in SC-HMM is proposed to align the temporal relation between audio and visual streams. The Bayesian classifier weighting scheme is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs in order to obtain the emotion recognition output. For performance evaluation, two databases are considered: the MHMC posed database and the SEMAINE naturalistic database. Experimental results show that the proposed approach not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides satisfactory results for naturalistic expressions.
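As a rough illustration of the error-weighted fusion idea only (not the authors' implementation), the sketch below combines per-emotion log-likelihoods from several SC-HMM classifiers, weighting each audio-visual feature-pair classifier by a reliability estimate derived from its held-out error rate; the function name, scores, and error rates are hypothetical.

```python
import numpy as np

def error_weighted_fusion(log_likelihoods, error_rates):
    """Fuse emotion log-likelihoods from several bimodal classifiers.

    log_likelihoods: (n_classifiers, n_emotions) log-likelihoods, one row
        per SC-HMM classifier (one classifier per audio-visual feature pair).
    error_rates: (n_classifiers,) error rate of each classifier estimated
        on held-out data.
    Returns the index of the emotion with the highest fused score.
    """
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    error_rates = np.asarray(error_rates, dtype=float)

    # Weight each classifier by its estimated reliability (1 - error rate),
    # normalised so that the weights sum to one.
    weights = 1.0 - error_rates
    weights /= weights.sum()

    # A weighted sum of log-likelihoods acts as a product-of-experts style
    # combination of the individual classifiers.
    fused = weights @ log_likelihoods
    return int(np.argmax(fused))

# Hypothetical example: three feature-pair classifiers, four emotions
# (e.g., happy, neutral, angry, sad).
scores = [[-10.2, -12.0, -15.1, -13.3],
          [-11.5, -10.9, -14.0, -12.8],
          [-9.8, -13.2, -14.7, -12.1]]
errors = [0.25, 0.30, 0.20]
print(error_weighted_fusion(scores, errors))  # prints 0 (the first emotion)
```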


IEEE Transactions on Multimedia | 2013

Two-Level Hierarchical Alignment for Semi-Coupled HMM-Based Audiovisual Emotion Recognition With Temporal Course

Chung-Hsien Wu; Jen-Chun Lin; Wen-Li Wei

A complete emotional expression typically follows a complex temporal course in face-to-face natural conversation. To model this temporal course for the audio and visual signal streams, a bimodal hidden Markov model (HMM)-based emotion recognition scheme is constructed from sub-emotional states, defined to represent the temporal phases of onset, apex, and offset. A two-level hierarchical alignment mechanism is proposed to align the relationship within and between the temporal phases in the audio and visual HMM sequences at the model and state levels in a proposed semi-coupled hidden Markov model (SC-HMM). Furthermore, by integrating a sub-emotion language model, which considers the temporal transitions between sub-emotional states, the proposed two-level hierarchical alignment-based SC-HMM (2H-SC-HMM) constrains the allowable temporal structures when determining the optimal emotional state. Experimental results show that the proposed approach yields satisfactory results on both the posed MHMC and the naturalistic SEMAINE databases, and that modeling the complex temporal structure improves emotion recognition performance, especially on the naturalistic database (i.e., natural conversation). The experiments also confirm that the proposed 2H-SC-HMM achieves acceptable performance with sparse training data or under noisy conditions.
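To make the role of the sub-emotion language model concrete, here is a minimal Viterbi-style sketch (not the paper's 2H-SC-HMM itself): per-frame phase scores, assumed to come from some upstream audio-visual model, are combined with a transition matrix over the onset/apex/offset phases that constrains the allowable temporal structure. All names and numbers are illustrative.

```python
import numpy as np

PHASES = ["onset", "apex", "offset"]

def best_phase_path(frame_log_scores, log_trans, log_init):
    """Viterbi decoding of a temporal-phase sequence.

    frame_log_scores: (T, 3) log-scores of each phase per frame, assumed to
        come from an upstream audio-visual model.
    log_trans: (3, 3) log-probabilities of phase transitions (the
        "sub-emotion language model" constraint).
    log_init: (3,) log-probabilities of the initial phase.
    """
    T, S = frame_log_scores.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + frame_log_scores[0]
    for t in range(1, T):
        for s in range(S):
            cand = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(cand))
            delta[t, s] = cand[back[t, s]] + frame_log_scores[t, s]
    # Trace back the best path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [PHASES[s] for s in reversed(path)]

# Illustrative transition model: onset -> apex -> offset, with self-loops.
trans = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]]) + 1e-12)
init = np.log(np.array([0.9, 0.1, 1e-6]))
scores = np.log(np.random.default_rng(0).dirichlet([1, 1, 1], size=8))
print(best_phase_path(scores, trans, init))
```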


IEEE Transactions on Multimedia | 2013

Speaking Effect Removal on Emotion Recognition From Facial Expressions Based on Eigenface Conversion

Chung-Hsien Wu; Wen-Li Wei; Jen-Chun Lin; Wei-Yu Lee

The speaking effect is a crucial issue that may dramatically degrade the performance of emotion recognition from facial expressions. To address this problem, an eigenface conversion-based approach is proposed to remove the speaking effect from facial expressions and thereby improve emotion recognition accuracy. In the proposed approach, a context-dependent linear conversion function, modeled by a statistical Gaussian Mixture Model (GMM), is constructed from parallel data of speaking and non-speaking emotional facial expressions. To model the speaking effect in more detail, the conversion functions are categorized using a decision tree that considers the visual temporal context of the Articulatory Attribute (AA) classes of the corresponding input speech segments. The Arousal-Valence (A-V) emotion plane is commonly used to define emotion classes dimensionally; to verify the quadrant of emotional expression identified from the reconstructed facial feature points, an expression template representing the feature points of the non-speaking facial expressions is constructed for each quadrant. With the verified quadrant, a regression scheme is then employed to estimate the A-V values of the facial expression as a precise point on the A-V emotion plane. Experimental results show that the proposed method outperforms current approaches and that removing the speaking effect from facial expressions improves the performance of emotion recognition.
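For intuition, here is a minimal sketch of GMM-based conversion between parallel feature streams, assuming scikit-learn and a single context-independent conversion function (the paper additionally categorizes the conversion functions with an AA-class decision tree, which is not reproduced here); the dimensionalities and toy data are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x_speaking, y_nonspeaking, n_components=4):
    """Fit a joint GMM on parallel eigenface coefficients.

    x_speaking, y_nonspeaking: (N, D) parallel feature matrices from
        speaking and non-speaking facial expressions of the same emotion.
    """
    z = np.hstack([x_speaking, y_nonspeaking])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=0).fit(z)

def convert(gmm, x, dim):
    """Minimum mean-square-error conversion E[y | x] under the joint GMM."""
    means_x = gmm.means_[:, :dim]
    means_y = gmm.means_[:, dim:]
    cov_xx = gmm.covariances_[:, :dim, :dim]
    cov_yx = gmm.covariances_[:, dim:, :dim]

    # Posterior P(m | x) computed from the marginal GMM over x.
    lik = np.array([gmm.weights_[m] *
                    multivariate_normal.pdf(x, means_x[m], cov_xx[m])
                    for m in range(gmm.n_components)])
    post = lik / lik.sum()

    # Per-component linear regression, blended by the posteriors.
    y_hat = np.zeros(dim)
    for m in range(gmm.n_components):
        reg = cov_yx[m] @ np.linalg.solve(cov_xx[m], x - means_x[m])
        y_hat += post[m] * (means_y[m] + reg)
    return y_hat

# Toy parallel data (hypothetical 6-dimensional eigenface coefficients).
rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 6))
ys = 0.8 * xs + rng.normal(scale=0.1, size=(200, 6))
gmm = fit_joint_gmm(xs, ys)
print(convert(gmm, xs[0], dim=6))
```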


International Conference on Acoustics, Speech, and Signal Processing | 2016

DEMV-matchmaker: Emotional temporal course representation and deep similarity matching for automatic music video generation

Jen-Chun Lin; Wen-Li Wei; Hsin-Min Wang

This paper presents a deep similarity matching-based emotion-oriented music video (MV) generation system, called DEMV-matchmaker, which utilizes an emotion-oriented deep similarity matching (EDSM) metric as a bridge to connect music and video. Specifically, we adopt an emotional temporal course model (ETCM) to respectively learn the relationship between music and its emotional temporal phase sequence and the relationship between video and its emotional temporal phase sequence from an emotion-annotated MV corpus. An emotional temporal structure preserved histogram (ETPH) representation is proposed to keep the recognized emotional temporal phase sequence information for EDSM metric construction. A deep neural network (DNN) is then applied to learn an EDSM metric based on the ETPHs for the given positive (official) and negative (artificial) MV examples. For MV generation, the EDSM metric is applied to measure the similarity between ETPHs of video and music. The results of objective and subjective experiments demonstrate that DEMV-matchmaker performs well and can generate appealing music videos that can enhance the viewing and listening experience.
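As a hedged sketch of the deep similarity matching step only (PyTorch, with a hypothetical ETPH dimensionality, network size, and toy positive/negative pairs; the real ETPH representation and training setup differ), a small network can be trained to score how well a video ETPH matches a music ETPH:

```python
import torch
import torch.nn as nn

# Hypothetical ETPH dimensionality (histogram over emotional temporal
# phases); the representation used in the paper differs.
ETPH_DIM = 32

class SimilarityNet(nn.Module):
    """Scores how well a (video ETPH, music ETPH) pair matches."""
    def __init__(self, dim=ETPH_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))  # one logit: higher means a better match

    def forward(self, video_etph, music_etph):
        return self.net(torch.cat([video_etph, music_etph], dim=-1)).squeeze(-1)

model = SimilarityNet()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy training pairs: official MVs as positives, shuffled pairings as negatives.
video = torch.rand(64, ETPH_DIM)
music = torch.rand(64, ETPH_DIM)
labels = torch.cat([torch.ones(32), torch.zeros(32)])

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(video, music), labels)
    loss.backward()
    opt.step()

# At generation time, rank candidate music clips for one video by the score.
scores = model(video[:1].expand(64, -1), music)
print(int(scores.argmax()))
```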


ACM Multimedia | 2015

EMV-matchmaker: Emotional Temporal Course Modeling and Matching for Automatic Music Video Generation

Jen-Chun Lin; Wen-Li Wei; Hsin-Min Wang

This paper presents a novel content-based emotion-oriented music video (MV) generation system, called EMV-matchmaker, which utilizes the emotional temporal phase sequence of the multimedia content as a bridge to connect music and video. Specifically, we adopt an emotional temporal course model (ETCM) to respectively learn the relationship between music and its emotional temporal phase sequence and the relationship between video and its emotional temporal phase sequence from an emotion-annotated MV corpus. Then, given a video clip (or a music clip), the visual (or acoustic) ETCM is applied to predict its emotional temporal phase sequence in a valence-arousal (VA) emotional space from the corresponding low-level visual (or acoustic) features. For MV generation, string matching is applied to measure the similarity between the emotional temporal phase sequences of video and music. The results of objective and subjective experiments demonstrate that EMV-matchmaker performs well and can generate appealing music videos that can enhance the viewing and listening experience.
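The string matching step can be illustrated with a plain edit-distance comparison of the predicted emotional temporal phase sequences; this is a simplified, assumed formulation rather than the exact matching scheme used in the paper, and the sequence labels below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phase sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def similarity(video_seq, music_seq):
    """Normalised similarity in [0, 1]; 1 means identical sequences."""
    dist = edit_distance(video_seq, music_seq)
    return 1.0 - dist / max(len(video_seq), len(music_seq), 1)

# Hypothetical phase sequences predicted by the visual and acoustic ETCMs,
# e.g. labels combining a valence-arousal quadrant and a temporal phase.
video_seq = ["Q1-onset", "Q1-apex", "Q1-offset", "Q3-onset"]
music_a   = ["Q1-onset", "Q1-apex", "Q1-offset", "Q3-onset"]
music_b   = ["Q2-onset", "Q2-apex", "Q4-onset",  "Q4-apex"]
print(similarity(video_seq, music_a), similarity(video_seq, music_b))
```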


International Conference on Acoustics, Speech, and Signal Processing | 2017

Deep-net fusion to classify shots in concert videos

Wen-Li Wei; Jen-Chun Lin; Tyng-Luh Liu; Yi-Hsuan Yang; Hsin-Min Wang; Hsiao-Rong Tyan; Hong-Yuan Mark Liao

Varying the type of shot is a fundamental element in the language of film, commonly used by directors of visual storytelling to convey emotion, ideas, and artistry. To classify such shot types from images, we present a new framework that addresses two key issues. We first focus on learning more effective features by fusing the layer-wise outputs extracted from a deep convolutional neural network (CNN) pre-trained on a large-scale dataset for object recognition. We then introduce a probabilistic fusion model, termed the error weighted deep cross-correlation model (EW-Deep-CCM), to boost the classification accuracy. Specifically, the deep neural network-based cross-correlation model (Deep-CCM) is constructed not only to model the extracted CNN feature hierarchies independently but also to relate the statistical dependencies of paired features from different layers. A Bayesian error weighting scheme for classifier combination is then adopted to exploit the contributions of the individual Deep-CCM classifiers and enhance the accuracy of shot classification. We provide extensive experimental results on a dataset of live concert videos to demonstrate the advantage of the proposed EW-Deep-CCM over existing popular fusion approaches. Video demos can be found at https://sites.google.com/site/ewdeepccm2/demo.
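A minimal sketch of the layer-wise feature fusion idea (not the EW-Deep-CCM model itself, and using a small stand-in CNN instead of a network pre-trained for object recognition): forward hooks collect the output of each activation layer, and global average pooling plus concatenation yields the fused representation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pre-trained CNN; the paper uses a network pre-trained
# for object recognition, which is assumed here but not reproduced.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))

# Capture the output of every ReLU layer with forward hooks.
layer_outputs = []
def save_output(module, inputs, output):
    # Global average pooling turns each feature map into a fixed-size vector.
    layer_outputs.append(output.mean(dim=(2, 3)))

for layer in cnn:
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(save_output)

def layerwise_features(images):
    """Return the concatenation of pooled features from every ReLU layer."""
    layer_outputs.clear()
    with torch.no_grad():
        cnn(images)
    return torch.cat(layer_outputs, dim=1)

feats = layerwise_features(torch.rand(4, 3, 64, 64))
print(feats.shape)  # torch.Size([4, 112]), i.e. 16 + 32 + 64 channels
```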


ACM Multimedia | 2017

Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing

Jen-Chun Lin; Wen-Li Wei; James Ching-Yu Yang; Hsin-Min Wang; Hong-Yuan Mark Liao

An automated process that can suggest a soundtrack for a user-generated video (UGV) and turn the UGV into a music-compliant, professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. Given a long UGV, it is first divided into a sequence of short fixed-length (e.g., 2-second) segments, and a multi-task deep neural network (MDNN) is applied to predict the pseudo acoustic (music) features (also called the pseudo song) from the visual (video) features of each video segment. In this way, the distance between any pair of video and music segments of the same length can be computed in the music feature space. Second, the sequence of pseudo acoustic (music) features of the UGV and the sequence of acoustic (music) features of each music track in the music collection are temporally aligned by the dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric. Third, for each music track, the video editing module selects and concatenates segments of the UGV based on the target and concatenation costs given by a pseudo-song-based deep concatenation cost (PDCC) metric, according to the DTW alignment, to generate a music-compliant professional-like video. Finally, all the generated MVs are ranked, and the best MV is recommended to the user. The MDNN for pseudo song prediction and the PDSM and PDCC metrics are trained on an annotated official music video (OMV) corpus. The results of objective and subjective experiments demonstrate that the proposed system performs well and can generate appealing MVs with better viewing and listening experiences.
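To illustrate the alignment step, below is a minimal dynamic time warping sketch using a plain Euclidean frame distance in place of the learned PDSM metric; the feature dimensions and track collection are hypothetical, and the PDCC-based video editing step is not reproduced.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a: (T1, D) pseudo-song (music-like) features predicted from the UGV.
    b: (T2, D) acoustic features of a candidate music track.
    The frame distance here is Euclidean; the paper uses a learned deep
    similarity metric (PDSM) instead.
    """
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[T1, T2]

# Rank hypothetical music tracks against the pseudo song of a UGV.
rng = np.random.default_rng(0)
pseudo_song = rng.normal(size=(30, 20))         # 30 segments, 20-dim features
tracks = [rng.normal(size=(40, 20)) for _ in range(5)]
ranking = sorted(range(len(tracks)),
                 key=lambda k: dtw_distance(pseudo_song, tracks[k]))
print(ranking)  # best-matching track first
```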


ACM Multimedia | 2016

Automatic Music Video Generation Based on Emotion-Oriented Pseudo Song Prediction and Matching

Jen-Chun Lin; Wen-Li Wei; Hsin-Min Wang

The main difficulty in automatic music video (MV) generation lies in how to match two different media (i.e., video and music). This paper proposes a novel content-based MV generation system based on emotion-oriented pseudo song prediction and matching. We use a multi-task deep neural network (MDNN) to jointly learn the relationship among music, video, and emotion from an emotion-annotated MV corpus. Given a queried video, the MDNN is applied to predict the acoustic (music) features from the visual (video) features, i.e., the pseudo song corresponding to the video. Then, the pseudo acoustic (music) features are matched with the acoustic (music) features of each music track in the music collection according to a pseudo-song-based deep similarity matching (PDSM) metric given by another deep neural network (DNN) trained on the acoustic and pseudo acoustic features of the positive (official), less-positive (artificial), and negative (artificial) MV examples. The results of objective and subjective experiments demonstrate that the proposed pseudo-song-based framework performs well and can generate appealing MVs with better viewing and listening experiences.
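A hedged PyTorch sketch of the multi-task idea (hypothetical feature dimensionalities and toy data; cosine similarity stands in for the learned PDSM metric): a shared trunk feeds one head that regresses the pseudo song and another that classifies emotion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VISUAL_DIM, MUSIC_DIM, N_EMOTIONS = 128, 64, 4  # hypothetical sizes

class MDNN(nn.Module):
    """Multi-task network: shared trunk, one head per task."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(VISUAL_DIM, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.music_head = nn.Linear(128, MUSIC_DIM)     # pseudo-song regression
        self.emotion_head = nn.Linear(128, N_EMOTIONS)  # emotion classification

    def forward(self, visual):
        h = self.trunk(visual)
        return self.music_head(h), self.emotion_head(h)

model = MDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy emotion-annotated MV corpus: visual features, paired music features,
# and an emotion label per clip.
visual = torch.rand(256, VISUAL_DIM)
music = torch.rand(256, MUSIC_DIM)
emotion = torch.randint(0, N_EMOTIONS, (256,))

for _ in range(50):
    opt.zero_grad()
    music_pred, emo_logits = model(visual)
    loss = F.mse_loss(music_pred, music) + F.cross_entropy(emo_logits, emotion)
    loss.backward()
    opt.step()

# Match a queried video against a music collection via its pseudo song
# (cosine similarity stands in for the learned PDSM metric here).
with torch.no_grad():
    pseudo_song, _ = model(torch.rand(1, VISUAL_DIM))
    collection = torch.rand(100, MUSIC_DIM)
    scores = F.cosine_similarity(pseudo_song.expand_as(collection), collection)
print(int(scores.argmax()))
```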


Affective Computing and Intelligent Interaction | 2011

Semi-coupled hidden Markov model with state-based alignment strategy for audio-visual emotion recognition

Jen-Chun Lin; Chung-Hsien Wu; Wen-Li Wei

This paper presents an approach to bi-modal emotion recognition based on a semi-coupled hidden Markov model (SC-HMM). A simplified state-based bi-modal alignment strategy in the SC-HMM is proposed to align the temporal relation of states between the audio and visual streams. Based on this strategy, the proposed SC-HMM can alleviate the problem of data sparseness and achieve better statistical dependency between the states of the audio and visual HMMs in most real-world scenarios. For performance evaluation, audio-visual signals with four emotional states (happy, neutral, angry, and sad) were collected. Each of the seven invited subjects was asked to utter 30 types of sentences twice to generate emotional speech and facial expressions for each emotion. Experimental results show that the proposed bi-modal approach outperforms other fusion-based bi-modal emotion recognition methods.


Affective Computing and Intelligent Interaction | 2011

A regression approach to affective rating of Chinese words from ANEW

Wen-Li Wei; Chung-Hsien Wu; Jen-Chun Lin

Collaboration


Dive into Wen-Li Wei's collaborations.

Top Co-Authors

Chung-Hsien Wu (National Cheng Kung University)
Hsiao-Rong Tyan (Chung Yuan Christian University)
Han Li (National Cheng Kung University)
James Ching-Yu Yang (National Chiao Tung University)
Wei-Yu Lee (National Cheng Kung University)