Lejun Yu
Beijing Normal University
Publication
Featured research published by Lejun Yu.
international conference on multimodal interfaces | 2015
Bo Sun; Liandong Li; Guoyan Zhou; Xuewen Wu; Jun He; Lejun Yu; Dongxue Li; Qinglan Wei
In this paper, we describe our work in the third Emotion Recognition in the Wild (EmotiW 2015) Challenge. For each video clip, we extract MSDF, LBP-TOP, HOG, LPQ-TOP and acoustic features to recognize the emotions of film characters. For static facial expression recognition based on video frames, we extract MSDF, DCNN and RCNN features. We train linear SVM classifiers for these features on the AFEW and SFEW datasets, and we propose a novel fusion network to combine all the extracted features at the decision level. Our final results are 51.02% on the AFEW test set and 51.08% on the SFEW test set, which are much better than the baseline recognition rates of 39.33% and 39.13%.
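As a rough illustration of this kind of pipeline, the sketch below trains one linear SVM per feature type and fuses the per-class decision scores with a simple weighted sum; the feature matrices, dimensions and equal fusion weights are placeholder assumptions, not the challenge system itself.

# Minimal sketch: per-feature linear SVMs with decision-level score fusion.
# Feature matrices, labels, and fusion weights are placeholder assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_train, n_test, n_classes = 200, 50, 7          # 7 emotion classes (illustrative)
feature_dims = {"MSDF": 128, "LBP-TOP": 96, "HOG": 64}

X_train = {k: rng.normal(size=(n_train, d)) for k, d in feature_dims.items()}
X_test = {k: rng.normal(size=(n_test, d)) for k, d in feature_dims.items()}
y_train = rng.integers(0, n_classes, size=n_train)

# One linear SVM per feature type.
clfs = {k: LinearSVC(C=1.0, max_iter=5000).fit(X_train[k], y_train)
        for k in feature_dims}

# Decision-level fusion: weighted sum of per-classifier decision scores.
weights = {k: 1.0 / len(feature_dims) for k in feature_dims}
fused = sum(weights[k] * clfs[k].decision_function(X_test[k]) for k in feature_dims)
y_pred = fused.argmax(axis=1)
print(y_pred[:10])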
international conference on multimodal interfaces | 2016
Bo Sun; Qinglan Wei; Liandong Li; Qihua Xu; Jun He; Lejun Yu
In this paper, we describe our work in the fourth Emotion Recognition in the Wild (EmotiW 2016) Challenge. For the video-based emotion recognition sub-challenge, we extract acoustic, LBP-TOP, Dense SIFT and CNN-LSTM features to recognize the emotions of film characters. For the group-level emotion recognition sub-challenge, we use LSTM and GEM models. We train linear SVM classifiers for these features on the AFEW 6.0 and HAPPEI datasets, and use the fusion network we proposed to combine all the extracted features at the decision level. Our final results are 51.54% accuracy on the AFEW test set and 0.836 RMSE on the HAPPEI test set.
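A minimal sketch of what a learned decision-level fusion layer could look like is given below; the number of feature streams, the class count and the single linear layer are illustrative assumptions rather than the fusion network described in the paper.

# Rough sketch of a learned decision-level fusion layer: per-feature classifier
# scores are concatenated and mapped to final class scores. Dimensions and the
# single-linear-layer design are assumptions for illustration only.
import torch
import torch.nn as nn

n_features, n_classes = 4, 7          # e.g. acoustic, LBP-TOP, Dense SIFT, CNN-LSTM

class FusionNet(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        # Input: concatenated per-classifier class scores.
        self.fc = nn.Linear(n_features * n_classes, n_classes)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_features, n_classes)
        return self.fc(scores.flatten(start_dim=1))

fusion = FusionNet(n_features, n_classes)
dummy_scores = torch.randn(8, n_features, n_classes)   # a batch of 8 clips
print(fusion(dummy_scores).shape)                      # -> torch.Size([8, 7])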
acm multimedia | 2016
Bo Sun; Siming Cao; Liandong Li; Jun He; Lejun Yu
This paper presents our work in the Emotion Sub-Challenge of the 6th Audio/Visual Emotion Challenge and Workshop (AVEC 2016), whose goal is to explore the use of audio, visual and physiological signals to continuously predict the values of the emotion dimensions (arousal and valence). Since visual features are very important in emotion recognition, we try a variety of handcrafted and deep visual features. For each video clip, besides the baseline features, we extract multi-scale Dense SIFT features (MSDF) and several types of convolutional neural network (CNN) features to recognize the expression phases of the current frame. We train linear Support Vector Regression (SVR) models for each kind of feature on the RECOLA dataset. Multimodal fusion of these modalities is then performed with a multiple linear regression model. The final Concordance Correlation Coefficients (CCC) are 0.824 for arousal and 0.718 for valence on the development set, and 0.683 for arousal and 0.642 for valence on the test set.
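The evaluation metric used here, the Concordance Correlation Coefficient, can be computed as in the following sketch; the gold labels and predictions are random placeholders.

# Sketch of the Concordance Correlation Coefficient (CCC) between predicted and
# gold continuous labels. Data below are random placeholders.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance Correlation Coefficient."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

rng = np.random.default_rng(0)
gold = rng.normal(size=1000)                      # e.g. frame-level arousal labels
pred = gold + rng.normal(scale=0.5, size=1000)    # noisy predictions
print(round(ccc(gold, pred), 3))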
international conference on multimodal interfaces | 2017
Qinglan Wei; Yijia Zhao; Qihua Xu; Liandong Li; Jun He; Lejun Yu; Bo Sun
In this paper, we target the group-level emotion recognition sub-challenge of the fifth Emotion Recognition in the Wild (EmotiW 2017) Challenge, which is based on the Group Affect Database 2.0 containing images of groups of people in a wide variety of social events. We use SeetaFace to detect and align the faces in the group images and extract two kinds of face-level visual features: VGGFace-LSTM and DCNN-LSTM. As group-image features, we propose using Pyramid Histogram of Oriented Gradients (PHOG), CENTRIST, DCNN and VGG features. For test group images in which faces are detected, the final emotion is estimated by fusing the group-image features and the face-level visual features; for test group images in which no faces can be detected, only the group-image features are fused for the final prediction. We achieve 79.78% accuracy on the Group Affect Database 2.0 test set, which is much higher than the corresponding baseline result of 53.62%.
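The test-time branching described above can be sketched roughly as follows; the face detector, the two prediction branches and the equal fusion weight are stand-ins, not the system's actual components.

# Illustrative sketch of the two-branch test-time logic: combine face-level and
# group-image predictions when faces are detected, otherwise fall back to the
# group-image branch alone. The predictors and weights are placeholder assumptions.
import numpy as np

N_CLASSES = 3  # positive / neutral / negative group affect

def predict_group_emotion(image, detect_faces, face_branch, group_branch, w=0.5):
    """Return class probabilities for one group image."""
    group_probs = group_branch(image)            # e.g. PHOG/CENTRIST/CNN-based scores
    faces = detect_faces(image)                  # e.g. SeetaFace detections
    if len(faces) == 0:
        return group_probs
    face_probs = np.mean([face_branch(f) for f in faces], axis=0)
    return w * group_probs + (1.0 - w) * face_probs

# Dummy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
dummy = lambda *_: rng.dirichlet(np.ones(N_CLASSES))
print(predict_group_emotion(object(), lambda img: ["face1", "face2"], dummy, dummy))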
acm multimedia | 2017
Bo Sun; Yinghui Zhang; Jun He; Lejun Yu; Qihua Xu; Dongliang Li; Zhaoying Wang
Audio/visual and mood disorder cues have recently been explored to assist psychologists and psychiatrists in depression diagnosis. In this paper, we propose a random forest method with a Selected-Text feature derived from an analysis of the interview transcripts at different depression levels. The selected text covers sleep quality, PTSD/depression diagnosis, successive treatment, personal preferences and feelings. Experiments are carried out on the Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) database [6]. Compared with results obtained with audio-based, video-based or multi-feature cascade decision-level fusion features, the Selected-Text-based method obtains very promising results on the development and test sets. The root mean square error (RMSE) reaches 4.7 and the mean absolute error (MAE) reaches 3.9, which are better than the baseline results of 7.05 and 5.66.
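A minimal sketch of a random-forest regressor scored with RMSE and MAE, in the spirit of this setup, is shown below; the features, labels and hyperparameters are synthetic assumptions.

# Minimal sketch of a random-forest regressor on selected-text features, scored
# with RMSE and MAE as in the depression-severity task. Features are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X_train, X_dev = rng.normal(size=(120, 20)), rng.normal(size=(40, 20))
y_train = rng.uniform(0, 24, size=120)        # e.g. PHQ-8 style severity scores
y_dev = rng.uniform(0, 24, size=40)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
pred = model.predict(X_dev)
rmse = np.sqrt(mean_squared_error(y_dev, pred))
mae = mean_absolute_error(y_dev, pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")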
Proceedings of SPIE | 2016
Bo Sun; Qinglan Wei; Jun He; Lejun Yu; Xiaoming Zhu
In the field of pedagogy and educational psychology, emotions are treated as very important factors that are closely associated with cognitive processes. Hence, it is meaningful for teachers to analyze students' emotions in classrooms, adjusting their teaching activities and improving students' individual development accordingly. To provide a benchmark for different expression recognition algorithms, a large collection of training and test data in classroom environments has become an acute need. In this paper, we present a multimodal spontaneous database collected in a real learning environment. To collect the data, students watched seven kinds of teaching videos while being filmed by a camera. Trained coders assigned one of five learning-expression labels to each image sequence extracted from the captured videos. The subset consists of 554 multimodal spontaneous expression image sequences (22,160 frames) recorded in real classrooms. The database has four main advantages. 1) Because it was recorded in real classroom environments, the subjects' distance from the camera and the lighting vary considerably between image sequences. 2) All the data are natural spontaneous responses to teaching videos. 3) The multimodal database also contains nonverbal behavior, including eye movement, head posture and gestures, from which a student's affective state during the courses can be inferred. 4) The video sequences contain different kinds of temporal activation patterns. In addition, we have demonstrated the high reliability of the image-sequence labels using Cronbach's alpha.
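The reliability check mentioned at the end can be reproduced in outline as below; the number of coders and the simulated ratings are illustrative assumptions.

# Sketch of a Cronbach's alpha reliability check over a sequences-by-coders
# rating matrix. The ratings below are random placeholders.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: (n_sequences, n_coders); coders play the role of 'items'."""
    n_coders = ratings.shape[1]
    coder_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return n_coders / (n_coders - 1) * (1.0 - coder_vars / total_var)

rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(554, 1))               # one "true" label per sequence
ratings = base + rng.integers(-1, 2, size=(554, 5))    # 5 coders with small disagreement
print(round(cronbach_alpha(ratings.astype(float)), 3))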
ieee international conference on automatic face gesture recognition | 2017
Jun He; Dongliang Li; Bin Yang; Siming Cao; Bo Sun; Lejun Yu
This paper presents our work in the FG 2017 Facial Expression Recognition and Analysis Challenge (FERA 2017), in which we participate in the AU occurrence sub-challenge. Our AU occurrence recognition is based on deep learning, and we design convolutional neural network (CNN) models for two tasks: facial view recognition and AU occurrence recognition. For facial view recognition, our model achieves 97.7% accuracy on the validation set over the nine facial views. For AU occurrence recognition, we use both visual features and the temporal information of the dataset: CNN models extract deep visual features, and a BLSTM-RNN then learns high-level features in the time domain. When training, we divide the dataset into nine parts based on the nine facial views and train each model for a specific view. When recognizing AUs, we first recognize the facial view and then choose the corresponding model for AU occurrence recognition. Our method shows good performance: the F1 score on the test data is 0.507 and the accuracy is 0.735.
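The view-routing step can be sketched as follows; the view classifier, the per-view AU models, the AU count and the threshold are placeholder callables and assumptions, not the trained networks from the paper.

# Rough sketch of the view-routing idea: a view classifier picks one of nine
# facial views, and the matching view-specific AU model produces AU probabilities.
import numpy as np

N_VIEWS, N_AUS = 9, 10

def recognize_aus(frame_seq, view_classifier, au_models, threshold=0.5):
    """Return a binary AU-occurrence vector for one frame sequence."""
    view = int(view_classifier(frame_seq))        # CNN view recognition
    au_probs = au_models[view](frame_seq)         # view-specific CNN + BLSTM-RNN
    return (au_probs >= threshold).astype(int)

# Dummy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
dummy_models = [lambda seq: rng.random(N_AUS) for _ in range(N_VIEWS)]
print(recognize_aus(object(), lambda seq: rng.integers(0, N_VIEWS), dummy_models))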
Signal Processing-image Communication | 2017
Qinglan Wei; Bo Sun; Jun He; Lejun Yu
In college classrooms, large quantities of digital-media data showing students' affective behaviors are continuously captured by cameras on a daily basis. To provide a benchmark for affect recognition using these big data collections, in this paper we propose the first large-scale spontaneous and multimodal student affect database. All videos in our database were selected from daily big-data recordings. The recruited subjects extracted one-person image sequences of their own affective behaviors and then annotated them under standard rules set beforehand. Ultimately, we have collected 2117 image sequences covering 11 types of students' affective behaviors in a variety of classes. The Beijing Normal University Large-scale Spontaneous Visual Expression Database version 2.0 (BNU-LSVED2.0) is an extension of our previous BNU-LSVED1.0 and has a number of new characteristics. The nonverbal behaviors and emotions in the new version are more spontaneous, since all image sequences come from videos recorded in actual classes rather than from behaviors stimulated by induction videos. Moreover, it includes a greater variety of affective behaviors from which students' learning status during classes can be inferred, including facial expressions, eye movements, head postures, body movements and gestures. In addition, instead of providing only categorical emotion labels, the new version also provides affective-behavior labels and multi-dimensional Pleasure-Arousal-Dominance (PAD) labels for the image sequences. Both the detailed subjective descriptions and the statistical analyses of the self-annotation results demonstrate the reliability and effectiveness of the multi-dimensional labels in the database.
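As a small illustration of how the multi-dimensional PAD annotations could be summarized statistically, consider the sketch below; the column names and values are invented placeholders, not the actual database schema.

# Illustrative summary of Pleasure-Arousal-Dominance (PAD) annotations grouped
# by affective-behavior label. All values here are invented placeholders.
import pandas as pd

annotations = pd.DataFrame({
    "behavior":  ["listening", "listening", "yawning", "yawning"],
    "pleasure":  [0.4, 0.5, -0.2, -0.3],
    "arousal":   [0.1, 0.2, -0.4, -0.5],
    "dominance": [0.0, 0.1, -0.1, -0.2],
})

# Mean and standard deviation of each PAD dimension per behavior class.
summary = annotations.groupby("behavior").agg(["mean", "std"])
print(summary)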
Journal of Electronic Imaging | 2017
Bo Sun; Siming Cao; Jun He; Lejun Yu; Liandong Li
Constrained by physiology, the temporal factors associated with human behavior, whether facial movement or body gesture, are described by four phases: neutral, onset, apex, and offset. Although these phases may benefit related recognition tasks, it is not easy to detect such temporal segments accurately. We present an automatic temporal segment detection framework that uses bilateral long short-term memory recurrent neural networks (BLSTM-RNN) to learn high-level temporal-spatial features, synthesizing local and global temporal-spatial information more efficiently. The framework is evaluated in detail on the Face and Body (FABO) database. The comparison shows that the proposed framework outperforms state-of-the-art methods for temporal segment detection.
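A minimal sketch of a bidirectional LSTM sequence labeller for the four phases is given below; the feature dimension, hidden size and single-layer design are assumptions, not the exact BLSTM-RNN architecture used in the paper.

# Minimal sketch of a bidirectional LSTM labelling each frame with one of the
# four temporal phases (neutral, onset, apex, offset). Sizes are illustrative.
import torch
import torch.nn as nn

N_PHASES, FEAT_DIM, HIDDEN = 4, 64, 128

class TemporalSegmentBLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * HIDDEN, N_PHASES)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, FEAT_DIM) -> per-frame phase logits (batch, time, 4)
        out, _ = self.blstm(x)
        return self.classifier(out)

model = TemporalSegmentBLSTM()
frames = torch.randn(2, 150, FEAT_DIM)        # two clips of 150 frames each
print(model(frames).shape)                    # -> torch.Size([2, 150, 4])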
chinese conference on pattern recognition | 2016
Bo Sun; Qihua Xu; Jun He; Lejun Yu; Liandong Li; Qinglan Wei
In this paper, we explore a multi-feature classification framework for the Multimodal Emotion Recognition Challenge, which is part of the Chinese Conference on Pattern Recognition (CCPR 2016). The task of the challenge is to recognize one of eight facial emotions in short video segments extracted from Chinese films, TV plays and talk shows. In our framework, both traditional methods and deep convolutional neural network (DCNN) methods are used to extract various features. Different classifiers are trained on the different features to predict video emotion labels, and a decision-level fusion method is then used to aggregate these predictions. According to the results on the competition database, our method is effective for Chinese facial emotion recognition.
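One simple form of decision-level aggregation, majority voting over the per-classifier labels, is sketched below; the actual fusion rule in the challenge system may differ, and the predictions here are synthetic.

# Sketch of decision-level aggregation by majority vote over the labels predicted
# by the per-feature classifiers. All predictions below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_clips, n_classifiers, n_classes = 100, 5, 8     # 8 facial emotions
per_clf_labels = rng.integers(0, n_classes, size=(n_classifiers, n_clips))

def majority_vote(labels: np.ndarray, n_classes: int) -> np.ndarray:
    """labels: (n_classifiers, n_samples) -> fused label per sample."""
    fused = np.empty(labels.shape[1], dtype=int)
    for i in range(labels.shape[1]):
        fused[i] = np.bincount(labels[:, i], minlength=n_classes).argmax()
    return fused

print(majority_vote(per_clf_labels, n_classes)[:10])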