Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Mengyi Liu is active.

Publication


Featured research published by Mengyi Liu.


Computer Vision and Pattern Recognition | 2014

Learning Expressionlets on Spatio-temporal Manifold for Dynamic Facial Expression Recognition

Mengyi Liu; Shiguang Shan; Ruiping Wang; Xilin Chen

Facial expression is a temporally dynamic event which can be decomposed into a set of muscle motions occurring in different facial regions over various time intervals. For dynamic expression recognition, two key issues, temporal alignment and semantics-aware dynamic representation, must be taken into account. In this paper, we attempt to solve both problems via manifold modeling of videos based on a novel mid-level representation, i.e., the expressionlet. Specifically, our method contains three key components: 1) each expression video clip is modeled as a spatio-temporal manifold (STM) formed by dense low-level features; 2) a Universal Manifold Model (UMM) is learned over all low-level features and represented as a set of local ST modes to statistically unify all the STMs; and 3) the local modes on each STM are instantiated by fitting to the UMM, and the corresponding expressionlet is constructed by modeling the variations within each local ST mode. With the above strategy, expression videos are naturally aligned both spatially and temporally. To enhance the discriminative power, the expressionlet-based STM representation is further processed with discriminant embedding. Our method is evaluated on four public expression databases, CK+, MMI, Oulu-CASIA, and AFEW. In all cases, our method achieves results better than the known state of the art.
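
A minimal sketch of the UMM/expressionlet idea, assuming dense local descriptors are already extracted per video: a GMM over the pooled features stands in for the universal model, and the weighted covariance of each video's features under each mode stands in for one expressionlet. All names, the GMM stand-in, and the covariance encoding are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def learn_umm(descriptors_per_video, n_modes=64, seed=0):
    """Learn a 'universal manifold model' as a GMM over all low-level features.
    descriptors_per_video: list of (n_i, d) arrays, one per video clip."""
    pooled = np.vstack(descriptors_per_video)
    return GaussianMixture(n_components=n_modes, covariance_type='diag',
                           random_state=seed).fit(pooled)

def expressionlets(descriptors, umm):
    """Instantiate local modes on one video's STM by soft-assigning its features
    to the UMM modes, then model the variation (covariance) within each mode."""
    resp = umm.predict_proba(descriptors)             # (n, K) responsibilities
    feats = []
    for k in range(resp.shape[1]):
        w = resp[:, k:k + 1]
        mu = (w * descriptors).sum(0) / (w.sum() + 1e-8)
        centered = descriptors - mu
        cov = (w * centered).T @ centered / (w.sum() + 1e-8)
        feats.append(cov[np.triu_indices_from(cov)])  # vectorize upper triangle
    return np.concatenate(feats)                      # expressionlet descriptor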


International Conference on Multimodal Interfaces | 2014

Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild

Mengyi Liu; Ruiping Wang; Shaoxin Li; Shiguang Shan; Zhiwu Huang; Xilin Chen

In this paper, we present the method behind our submission to the Emotion Recognition in the Wild Challenge (EmotiW 2014). The challenge is to automatically classify the emotions acted by human subjects in video clips recorded in real-world environments. In our method, each video clip is represented by three types of image set models (i.e., linear subspace, covariance matrix, and Gaussian distribution), which can all be viewed as points residing on Riemannian manifolds. Different Riemannian kernels are then employed on these set models for similarity/distance measurement. For classification, three types of classifiers, i.e., kernel SVM, logistic regression, and partial least squares, are investigated and compared. Finally, an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level to further boost performance. We perform an extensive evaluation on the challenge data (including the validation set and the blind test set) and evaluate the effects of different strategies in our pipeline. The final recognition accuracy reached 50.4% on the test set, a significant gain of 16.7% over the challenge baseline of 33.7%.
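
A rough sketch of one of the set models: the snippet below builds a covariance-matrix representation per clip and a log-Euclidean RBF kernel that a precomputed-kernel SVM can consume. The frame features, the regularization constant, and the specific kernel choice are illustrative assumptions rather than the paper's exact configuration.

import numpy as np
from scipy.linalg import logm
from sklearn.svm import SVC

def cov_model(frames):
    """Image-set model: covariance of per-frame feature vectors, frames: (n, d)."""
    X = frames - frames.mean(0, keepdims=True)
    return X.T @ X / (len(frames) - 1) + 1e-3 * np.eye(frames.shape[1])

def log_euclidean_kernel(covs_a, covs_b, gamma=1e-2):
    """RBF kernel on matrix logarithms, one common Riemannian kernel for SPD points."""
    la = np.stack([logm(C).real.ravel() for C in covs_a])
    lb = np.stack([logm(C).real.ravel() for C in covs_b])
    d2 = ((la[:, None, :] - lb[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Usage sketch: clf = SVC(kernel='precomputed').fit(K_train, y_train), where
# K_train = log_euclidean_kernel(train_covs, train_covs).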


IEEE International Conference on Automatic Face and Gesture Recognition | 2013

AU-aware Deep Networks for facial expression recognition

Mengyi Liu; Shaoxin Li; Shiguang Shan; Xilin Chen

In this paper, we propose to construct a deep architecture, AU-aware Deep Networks (AUDN), for facial expression recognition by elaborately exploiting the prior knowledge that the appearance variations caused by expressions can be decomposed into a batch of local facial Action Units (AUs). The proposed AUDN is composed of three sequential modules: the first module consists of two layers, i.e., a convolution layer and a max-pooling layer, which aim to generate an over-complete representation encoding all expression-specific appearance variations over all possible locations; in the second module, an AU-aware receptive field layer is designed to search subsets of the over-complete representation, each of which aims to best simulate a combination of AUs; in the last module, multilayer Restricted Boltzmann Machines (RBMs) are exploited to learn hierarchical features, which are then concatenated for final expression recognition. Experiments on three expression databases, CK+, MMI, and SFEW, demonstrate the effectiveness of AUDN in both lab-controlled and wild environments. All our results are better than, or at least competitive with, the best known results.
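
A compressed PyTorch sketch of the three-module layout. The layer sizes, the fixed contiguous slices standing in for the learned AU-aware receptive fields, and the small MLP sub-networks replacing the multilayer RBMs are all assumptions made for illustration.

import torch
import torch.nn as nn

class AUDNSketch(nn.Module):
    """Simplified AU-aware network: conv + max-pooling yields an over-complete
    representation; fixed slices approximate AU-aware receptive fields; an MLP
    per slice stands in for the RBM stack of the paper."""
    def __init__(self, n_groups=8, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(4))                      # over-complete local responses
        flat = 32 * 16 * 16                       # assumes 64x64 grayscale faces
        self.group_size = flat // n_groups
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(self.group_size, 128), nn.ReLU(),
                          nn.Linear(128, 64), nn.ReLU())
            for _ in range(n_groups))
        self.classifier = nn.Linear(64 * n_groups, n_classes)

    def forward(self, x):                         # x: (batch, 1, 64, 64)
        h = self.features(x).flatten(1)
        parts = [net(h[:, i * self.group_size:(i + 1) * self.group_size])
                 for i, net in enumerate(self.subnets)]
        return self.classifier(torch.cat(parts, dim=1))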


Asian Conference on Computer Vision | 2014

Deeply Learning Deformable Facial Action Parts Model for Dynamic Expression Analysis

Mengyi Liu; Shaoxin Li; Shiguang Shan; Ruiping Wang; Xilin Chen

Expressions are facial activities invoked by sets of muscle motions, which give rise to large appearance variations, mainly around facial parts. Therefore, for vision-based expression analysis, localizing the action parts and encoding them effectively become two essential yet challenging problems. To address them jointly, in this paper we propose to adapt 3D Convolutional Neural Networks (3D CNNs) with deformable action parts constraints. Specifically, we incorporate a deformable parts learning component into the 3D CNN framework, which can detect specific facial action parts under structured spatial constraints and obtain a discriminative part-based representation simultaneously. The proposed method is evaluated on two posed expression datasets, CK+ and MMI, and a spontaneous dataset, FERA. We show that, besides achieving state-of-the-art expression recognition accuracy, our method also enjoys the intuitive appeal that the part detection map desirably encodes the mid-level semantics of different facial action parts.
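
The joint part-detection idea can be sketched roughly as follows in PyTorch. The layer configuration, the number of parts, the random anchor locations, and the quadratic deformation penalty are illustrative assumptions, not the paper's exact deformable-parts formulation.

import torch
import torch.nn as nn

class DeformablePartsSketch(nn.Module):
    """3D conv layers produce K part score maps per clip; each part's response is
    the max over space-time of its map minus a quadratic penalty for drifting
    away from an assumed anchor location."""
    def __init__(self, n_parts=12, n_classes=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, n_parts, kernel_size=3, padding=1))  # part score maps
        self.classifier = nn.Linear(n_parts, n_classes)
        # hypothetical anchors: one (y, x) per part on the pooled 32x32 grid
        self.register_buffer('anchors', torch.rand(n_parts, 2) * 32)
        self.deform_cost = 0.01

    def forward(self, clips):                     # clips: (B, 1, T, 64, 64)
        maps = self.backbone(clips)               # (B, K, T, 32, 32)
        B, K, T, H, W = maps.shape
        ys = torch.arange(H, device=maps.device).view(1, 1, 1, H, 1)
        xs = torch.arange(W, device=maps.device).view(1, 1, 1, 1, W)
        dy = ys - self.anchors[:, 0].view(1, K, 1, 1, 1)
        dx = xs - self.anchors[:, 1].view(1, K, 1, 1, 1)
        penalized = maps - self.deform_cost * (dy ** 2 + dx ** 2)
        part_scores = penalized.flatten(2).max(dim=2).values   # (B, K)
        return self.classifier(part_scores)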


Neurocomputing | 2015

AU-inspired Deep Networks for Facial Expression Feature Learning

Mengyi Liu; Shaoxin Li; Shiguang Shan; Xilin Chen

Most existing approaches to facial expression recognition use off-the-shelf feature extraction methods for classification. In this paper, aiming to learn features specifically suited to expression representation, we propose to construct a deep architecture, AU-inspired Deep Networks (AUDN), inspired by the psychological theory that expressions can be decomposed into multiple facial Action Units (AUs). To fully exploit this insight while avoiding explicit AU detection, we propose to automatically learn: (1) informative local appearance variations; (2) an optimal way to combine local variations; and (3) high-level representations for final expression recognition. Accordingly, the proposed AUDN is composed of three sequential modules. First, we build a convolutional layer and a max-pooling layer to learn the Micro-Action-Pattern (MAP) representation, which explicitly depicts local appearance variations caused by facial expressions. Second, feature grouping is applied to simulate larger receptive fields by combining correlated MAPs adaptively, aiming to generate more abstract mid-level semantics. Finally, a multi-layer learning process is employed in each receptive field to construct group-wise sub-networks for higher-level representations. Experiments on three expression databases, CK+, MMI, and SFEW, demonstrate that, by simply applying linear classifiers on the learned features, our method achieves state-of-the-art results on all the databases, which validates the effectiveness of AUDN in both lab-controlled and wild environments.
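
The feature-grouping step can be mimicked with correlation clustering over MAP responses. The distance definition, the linkage method, and the group count below are assumptions chosen only to illustrate the idea of merging correlated MAPs into larger receptive fields.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_maps_by_correlation(map_responses, n_groups=8):
    """Cluster Micro-Action-Pattern responses whose activations are correlated
    across training samples, so each cluster acts as one larger receptive field.
    map_responses: (n_samples, n_maps) activation matrix."""
    corr = np.corrcoef(map_responses, rowvar=False)         # (n_maps, n_maps)
    dist = 1.0 - np.abs(corr)                               # correlated -> close
    np.fill_diagonal(dist, 0.0)
    Z = linkage(dist[np.triu_indices_from(dist, k=1)], method='average')
    labels = fcluster(Z, t=n_groups, criterion='maxclust')  # group id per MAP
    return [np.where(labels == g)[0] for g in np.unique(labels)]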


International Conference on Multimodal Interfaces | 2013

Partial least squares regression on Grassmannian manifold for emotion recognition

Mengyi Liu; Ruiping Wang; Zhiwu Huang; Shiguang Shan; Xilin Chen

In this paper, we propose a method for video-based human emotion recognition. For each video clip, all frames are represented as an image set, which is modeled as a linear subspace embedded in a Grassmannian manifold. After feature extraction, Class-specific One-to-Rest Partial Least Squares (PLS) is learned on video and audio data respectively to distinguish each class from the other, easily confused, ones. Finally, an optimal fusion of classifiers learned from both modalities (video and audio) is conducted at the decision level. Our method is evaluated on the Emotion Recognition In The Wild Challenge (EmotiW 2013). Experimental results on both the validation set and the blind test set are presented for comparison. The final accuracy achieved on the test set exceeds the baseline by 26%.
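
A small sketch of this pipeline, assuming per-frame descriptors are already available: each clip becomes a linear subspace via SVD, subspaces are compared with the projection kernel, and one PLS regressor per class implements a one-to-rest scheme. The subspace dimension, kernel, and hyper-parameters are illustrative assumptions.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def subspace(frames, dim=10):
    """Grassmannian point for one clip: top left-singular vectors of its (d, n)
    frame-feature matrix (each column is one frame descriptor)."""
    U, _, _ = np.linalg.svd(frames, full_matrices=False)
    return U[:, :dim]

def projection_kernel(subs_a, subs_b):
    """Projection kernel between linear subspaces: ||Ua^T Ub||_F^2."""
    return np.array([[np.linalg.norm(Ua.T @ Ub, 'fro') ** 2 for Ub in subs_b]
                     for Ua in subs_a])

def one_vs_rest_pls(K_train, y_train, n_components=3):
    """One-to-rest PLS: one regressor per class on kernel features."""
    models = {}
    for c in np.unique(y_train):
        target = np.where(y_train == c, 1.0, -1.0)
        models[c] = PLSRegression(n_components=n_components).fit(K_train, target)
    return models

def predict(models, K_test):
    scores = np.column_stack([models[c].predict(K_test).ravel() for c in models])
    classes = list(models)
    return np.array([classes[i] for i in scores.argmax(1)])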


International Conference on Computer Vision | 2015

Exploiting Feature Hierarchies with Convolutional Neural Networks for Cultural Event Recognition

Mengyi Liu; Xin Liu; Yan Li; Xilin Chen; Alexander G. Hauptmann; Shiguang Shan

Cultural events are typical events closely tied to history and nationality, and they play an important role in passing cultural heritage across generations. However, automatically recognizing cultural events remains a great challenge, since it depends on understanding complex image contents such as people, objects, and scene context. It is therefore natural to associate this task with other high-level vision problems, e.g., object detection, recognition, and scene understanding. In this paper, we address the problem by combining object/scene content mining and strong image representation via CNNs in a single framework. Specifically, for object/scene content mining, we employ selective search to extract a batch of bottom-up region proposals, which serve as key object/scene candidates in each event image; for representation via CNNs, we investigate two state-of-the-art deep architectures, VGGNet and GoogLeNet, and adapt them to our task by performing domain-specific (i.e., event) fine-tuning on both the global image and the hierarchical region proposals. These two models complementarily exploit feature hierarchies spatially, simultaneously capturing the global context and local evidence within the image. In our final submission to the ChaLearn LAP Challenge at ICCV 2015, nine kinds of features extracted from five different deep models were exploited and followed by two kinds of classifiers for decision-level fusion. Our method achieves the best performance, mAP = 0.854, among all participants in the cultural event recognition track.
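
A stripped-down sketch of combining global and region-level CNN features. The boxes are assumed to come from an external proposal step (the paper uses selective search), the untrained VGG-16 backbone is only a placeholder for the fine-tuned VGGNet/GoogLeNet models, and the pooling and concatenation scheme is an assumption.

import torch
import torchvision.transforms.functional as TF
from torchvision.models import vgg16

model = vgg16().features.eval()               # placeholder backbone, no fine-tuning

def image_and_region_features(image, boxes, size=224):
    """Global + region-level CNN features, globally average-pooled and stacked.
    image: (3, H, W) float tensor; boxes: list of (x1, y1, x2, y2) ints."""
    crops = [image] + [image[:, y1:y2, x1:x2] for (x1, y1, x2, y2) in boxes]
    crops = [TF.resize(c, [size, size]) for c in crops]
    with torch.no_grad():
        feats = model(torch.stack(crops))     # (1 + R, 512, 7, 7)
    feats = feats.mean(dim=(2, 3))            # global average pooling
    global_feat, region_feats = feats[0], feats[1:]
    return torch.cat([global_feat, region_feats.mean(0)])  # (1024,) descriptor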


IEEE Transactions on Image Processing | 2016

Learning Expressionlets via Universal Manifold Model for Dynamic Facial Expression Recognition

Mengyi Liu; Shiguang Shan; Ruiping Wang; Xilin Chen

Facial expression is a temporally dynamic event which can be decomposed into a set of muscle motions occurring in different facial regions over various time intervals. For dynamic expression recognition, two key issues, temporal alignment and semantics-aware dynamic representation, must be taken into account. In this paper, we attempt to solve both problems via manifold modeling of videos based on a novel mid-level representation, i.e., expressionlet. Specifically, our method contains three key stages: 1) each expression video clip is characterized as a spatial-temporal manifold (STM) formed by dense low-level features; 2) a universal manifold model (UMM) is learned over all low-level features and represented as a set of local modes to statistically unify all the STMs; and 3) the local modes on each STM can be instantiated by fitting to the UMM, and the corresponding expressionlet is constructed by modeling the variations in each local mode. With the above strategy, expression videos are naturally aligned both spatially and temporally. To enhance the discriminative power, the expressionlet-based STM representation is further processed with discriminant embedding. Our method is evaluated on four public expression databases, CK+, MMI, Oulu-CASIA, and FERA. In all cases, our method outperforms the known state of the art by a large margin.
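
Since the expressionlet construction is sketched under the conference version above, here is only the final discriminant-embedding step, approximated with plain LDA. The paper learns its own embedding, so this is an assumed stand-in for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discriminant_embedding(expressionlet_feats, labels, test_feats):
    """Project expressionlet-based video descriptors into a class-discriminative
    subspace (LDA used here as a stand-in for the learned embedding)."""
    lda = LinearDiscriminantAnalysis(
        n_components=min(len(set(labels)) - 1, expressionlet_feats.shape[1]))
    train_emb = lda.fit_transform(expressionlet_feats, labels)
    return train_emb, lda.transform(test_feats)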


Journal on Multimodal User Interfaces | 2016

Video modeling and learning on Riemannian manifold for emotion recognition in the wild

Mengyi Liu; Ruiping Wang; Shaoxin Li; Zhiwu Huang; Shiguang Shan; Xilin Chen

In this paper, we present the method behind our submission to the Emotion Recognition in the Wild challenge (EmotiW). The challenge is to automatically classify the emotions acted by human subjects in video clips recorded in real-world environments. In our method, each video clip is represented by three types of image set models (i.e., linear subspace, covariance matrix, and Gaussian distribution), which can all be viewed as points residing on Riemannian manifolds. Different Riemannian kernels are then employed on these set models for similarity/distance measurement. For classification, three types of classifiers, i.e., kernel SVM, logistic regression, and partial least squares, are investigated and compared. Finally, an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level to further boost performance. We perform extensive evaluations on the EmotiW 2014 challenge data (including the validation set and the blind test set) and evaluate the effects of different components in our pipeline. Our method achieves the best performance reported so far on this data. To further evaluate its generalization ability, we also perform experiments on the EmotiW 2013 data and on two well-known lab-controlled databases, CK+ and MMI. The results show that the proposed framework significantly outperforms state-of-the-art methods.
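
The decision-level fusion step can be illustrated as a weighted combination of per-classifier score matrices, with weights picked on the validation set. The exhaustive grid search and weight granularity below are assumptions, not the paper's exact fusion procedure.

import numpy as np
from itertools import product

def fuse_decisions(score_list, weights):
    """Weighted decision-level fusion: each entry of score_list is an
    (n_samples, n_classes) score matrix from one kernel/modality classifier."""
    fused = sum(w * s for w, s in zip(weights, score_list))
    return fused.argmax(axis=1)

def search_fusion_weights(score_list, y_val, grid=np.linspace(0, 1, 11)):
    """Pick fusion weights on the validation set by exhaustive grid search
    (feasible only for a handful of classifiers)."""
    best_w, best_acc = None, -1.0
    for w in product(grid, repeat=len(score_list)):
        if sum(w) == 0:
            continue
        acc = (fuse_decisions(score_list, w) == y_val).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc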


Asian Conference on Computer Vision | 2012

Enhancing expression recognition in the wild with unlabeled reference data

Mengyi Liu; Shaoxin Li; Shiguang Shan; Xilin Chen

Collaboration


Dive into Mengyi Liu's collaborations.

Top Co-Authors

Shiguang Shan (Chinese Academy of Sciences)
Xilin Chen (Chinese Academy of Sciences)
Ruiping Wang (Chinese Academy of Sciences)
Shaoxin Li (Chinese Academy of Sciences)
Zhiwu Huang (Chinese Academy of Sciences)
Xin Liu (Chinese Academy of Sciences)
Yan Li (Chinese Academy of Sciences)