Developing emotion recognition for video conference software to support people with autism

Marc Franzen
Faculty E, Ravensburg-Weingarten University

Michael Stephan Gresser
Faculty E, Ravensburg-Weingarten University

Tobias Müller
Faculty E, Ravensburg-Weingarten University

Prof. Dr. Sebastian Mauser
Faculty E, Ravensburg-Weingarten University
Abstract—We develop emotion recognition software to be used with video conference software by autistic individuals who are unable to recognize emotions properly. It extracts an image from the video stream, detects the emotion in it with the help of a neural network and displays the prediction to the user. The network is trained on facial landmark features. The software is fully modular to support adaption to different video conference software, programming languages and implementations.
Index Terms—emotion recognition, communication aids, computer science, artificial intelligence
I. INTRODUCTION

Autism is a spectrum condition, so its effects on a person's abilities vary widely. Some individuals have repetitive and stereotypical interests, which makes them prefer recurring and monotonous work. Others are affected in their social skills, such as the ability to interact and communicate well with other people. [1][2] Oftentimes, they lack the ability to detect emotions in the faces of their conversation partners. [3] This makes it difficult for them to gauge the course of an interview, and therefore we want to support them with our software.
The software recognizes the emotion of a communication partner during a video conference and displays the result to the autistic individual. As a result, the software may compensate for a vital part of the social skillset of an autistic individual.

II. SOLUTIONS
In terms of emotion recognition of a facial image, there are already different commercial and non-commercial solutions with different accuracies available.
A. Commercial solution
In the commercial space, Affectiva is a company that develops emotion measurement technology and offers it as a service for other companies to use in their products. Their core product is the AFFDEX algorithm [4], which is used in Affectiva's AFFDEX SDK, available for purchase on their website [5]. It is mainly meant for market research, but it is also used in other environments, such as the automotive industry, to monitor the emotions and reactions of a driver. [6]
AFFDEX has already found its way into a context similar to this project. With the help of AR glasses, children and adults can be assisted in learning "crucial social and cognitive skills", with a special focus on emotion recognition. [7] This also supports our motivation, since many people with autism have problems understanding emotions.
Also, AFFDEX has a high accuracy in detecting "key emotions", more precisely in the "high 90th percentile". [8]
As the AFFDEX SDK is a commercial solution, it is not easily accessible and has a price point of 25,000 USD [5]. Therefore, this SDK is not the ideal solution for this research project and other solutions must be considered.
B. Existing solutions
As an overview, we consult the paper [9], which is summarized in the following paragraphs. It distinguishes between three different steps: Image Acquisition, Feature Extraction, Classification.

a) Image Acquisition: This step contains the acquisition of images from various sources, including "a database, a live video stream or other sources, in 2D or 3D". [9] The data source can either be static, using still images, or dynamic, using image sequences. Afterwards, pre-processing might be applied to the data. This could be de-noising, scaling and cropping to optimize the data for the next step.

b) Feature Extraction: The extraction of features from facial data is an essential step, as they describe the "physical phenomena" [9] on which the detection of facial expressions is based. The better the selection of features and their representation of the face, the more robust the subsequent recognition is. The available methods can be grouped into appearance features, geometric features, a hybrid approach using both, and a template-based approach.

c) Classification: In the final classification step, the detected facial expression is assigned to a predefined expression. The available classifiers can typically be split into parametric and non-parametric machine-learning methods, where either a predetermined function is given and its parameters are learned, or the mapping function is learned without predetermining the form of a function. Non-machine-learning methods might be feasible but are not discussed in this overview paper.
C. Examples

a) Linear Directional Patterns: A solution referenced in [9] uses Linear Directional Patterns (LDP), representing the changes in the image as a histogram, which results in stable facial features. These features are then fed into a Support Vector Machine (SVM) that maps them to their corresponding emotions. The authors state that they reach an accuracy between 80% and 99%, depending on the used parameters and test scenarios.

b) Active Appearance Model: Another existing project uses an Active Appearance Model (AAM) to gather features which are then used to detect emotions in an image. [10] The AAM takes a statistical model of a face, representing its shape, and tries to match this model to the image that is currently processed. More precisely, the model consists of a set of connected points (landmark coordinates) which are iteratively deformed until they fit onto the current image. This process can be improved by training the model with images and their respective landmark coordinates. [11]
In that project, the extracted features are used to calculate a mean parameter vector for each of seven emotions. The Euclidean distance between these mean parameter vectors and the face parameters of the current image is then used for classification. The accuracy of this approach is around 90% for the emotions fear, joy, disgust and neutral, but around 60-80% for surprise, anger and sadness. [10]
D. Proposed solution

Our hypothesis is that the direct matching from faces to emotions, as well as manual feature engineering as in the above solutions (e.g. with the Euclidean distance), can lead to a less accurate detection than is theoretically possible. For similar tasks, neural networks are often considered because of their ability to learn relevant correlations in the data on their own, and they could perform better because of additionally learned features. Because of this, we propose the use of a neural network as our classifier to match faces to emotions, because it can learn features in the face that a human would not consider relevant.
To circumvent the problem of matching raw image data to emotions, we take the common approach of using stable features, similar to the AAM in [10], as the input to our neural network, which then matches these to the corresponding emotions. This enables a more diverse prediction, without the need to distinguish between e.g. genders, skin colors and ages.

III. METHOD

TABLE I: MODULES
Module       Description
main         Starts the other modules and waits for them to exit.
input        Gathers the image in some way and sends the image data as .jpg over the socket.
model        Receives the image data and constructs a standardized OpenCV image object.
controller   Receives the OpenCV image object, performs the emotion recognition and sends the conclusion as a string.
view         Receives the string and displays it on the screen.
Our solution is divided into four different kinds of modules (see Tab. I). Each module runs in a separate process and uses ZeroMQ for communication. ZeroMQ is a programming-language-independent library, which enables us to open sockets on the localhost of a machine and then use TCP connections to exchange data [12]. We implement this as a "Request-Reply" pattern, where one module requests data by sending "ready" over the socket, whereupon the other module sends the data in the previously defined format. If a module stops, it sends "done" and quits. As soon as another module receives "done" instead of "ready", it itself sends the "done" message to its adjacent module and quits.
By implementing this pattern, we achieve exchangeability: we can replace any module with one written in any programming language which supports OpenCV and ZeroMQ.
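The following sketch illustrates this handshake between two adjacent modules. It is a minimal example under our own assumptions: the port number, the message framing and the stub functions acquire_data and process are placeholders rather than the project's actual implementation.

import zmq

def acquire_data() -> bytes:
    return b"frame-bytes"                     # placeholder for real image data

def process(data: bytes) -> None:
    print(f"received {len(data)} bytes")      # placeholder for real processing

def producer(port: int = 5555) -> None:
    # Replies with data whenever the downstream module announces "ready".
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(f"tcp://127.0.0.1:{port}")
    while True:
        msg = sock.recv()                     # wait for "ready" or "done"
        if msg == b"done":                    # downstream module has quit
            sock.send(b"done")
            break
        sock.send(acquire_data())
    sock.close()

def consumer(port: int = 5555, frames: int = 100) -> None:
    # Requests data with "ready" and tells the producer to stop with "done".
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(f"tcp://127.0.0.1:{port}")
    for _ in range(frames):
        sock.send(b"ready")
        process(sock.recv())
    sock.send(b"done")                        # shut down the adjacent module
    sock.recv()
    sock.close()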
A. Module: input
We decide to use the static approach described in II-B for better performance. For this, the input module takes a screenshot using the mss library [13], converts it to the .jpg format and sends it as a byte stream to the model.
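A possible sketch of this module is shown below, assuming the request-reply handshake from above; the socket address, the choice of the primary monitor and the use of OpenCV for JPEG encoding are illustrative assumptions.

import cv2
import mss
import numpy as np
import zmq

def run_input(address: str = "tcp://127.0.0.1:5556") -> None:
    # Screenshot-based input: grab the primary monitor, encode it as JPEG and
    # send the bytes to the model module whenever it announces "ready".
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(address)
    with mss.mss() as sct:
        while True:
            if sock.recv() == b"done":        # model module has quit
                sock.send(b"done")
                break
            shot = sct.grab(sct.monitors[1])  # full primary monitor
            frame = cv2.cvtColor(np.array(shot), cv2.COLOR_BGRA2BGR)
            ok, jpg = cv2.imencode(".jpg", frame)
            sock.send(jpg.tobytes() if ok else b"")
    sock.close()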
B. Module: model
The model expects any image type as a byte stream and creates an OpenCV image object from it. To improve performance, we scale this image down and thus reduce the amount of data for further processing. The image is then serialized and forwarded to the controller.
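The core of this module might look like the following sketch; the scale factor and the use of pickle for serialization are assumptions made for illustration.

import pickle
import cv2
import numpy as np

def to_serialized_image(jpg_bytes: bytes, scale: float = 0.5) -> bytes:
    # Decode the received bytes into a standardized OpenCV image, downscale it
    # to reduce the amount of data, and serialize it for the controller.
    buf = np.frombuffer(jpg_bytes, dtype=np.uint8)
    img = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    small = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return pickle.dumps(small)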
C. Module: controller
The controller module expects a serialized OpenCV image from the model. It then de-serializes it and starts processing it in two steps, as shown in figure 1 and outlined in the following paragraphs.

Fig. 1. Steps for detecting emotions: facial landmark recognition, calculation of relative coordinates, and classification by a DNN or CNN into an emotion label such as "happiness". Photo of face from dataset [14]
1) Facial landmark detection: For facial landmark detection we use the technique from [15] that is included in the dlib library [16] and adapt the implementation from [17], which uses the pre-trained detector inside the library. We choose this method from the summary [18] because it incorporates face detection and facial landmark detection in a single, open-source solution while providing real-time capabilities.
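A sketch of this step with dlib's pre-trained 68-point predictor is given below; the file name of the predictor model and the handling of multiple faces are assumptions, and the model file has to be obtained separately.

from typing import Optional

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_for(image: np.ndarray) -> Optional[np.ndarray]:
    # Returns a (68, 2) array of landmark coordinates, or None if no face is found.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])                  # use the first detected face
    return np.array([(p.x, p.y) for p in shape.parts()])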
2) Emotion detection: From the gathered facial landmark points, we use artificial intelligence as a mapping function between the facial landmark features and the given emotion. The neural network learns the correlation between certain points of a face and the corresponding emotion while ignoring irrelevant points. As an example, humans would judge that the position of the tip of the nose does not correspond as strongly to emotions as the corners of the mouth do.
We use the Dataset "facial_expressions" [14]from this challenge [19]. This dataset has 13718 images,each labeled manually by the submitter with one of 8 facialexpressions: anger, contempt, disgust, fear, happiness, sadness,surprise and neutral. We choose this dataset because it containsnatural images of celebrities, as these reflect real emotionsbetter than datasets crafted with actors. For better trainingwe remove the data submitted by "jhamski" and "628" asthese images vary in resolution and color space and aresometimes distorted. 456 images were not included into thedataset because the chosen facial landmark recognition couldnot detect a face in them. We also move random pictures ofevery category from the training dataset to a validation set tocompare this solution against others (see Section IV). Theseimages are not used to train the network.The exact composition of the full possible training andvalidation sets is shown in table II.Our final dataset consists of 12309 gray-scale images, eachwith a resolution of 350x350 pixels. The composition isunevenly distributed and thus not suitable for training neuralnetworks. This is verified through initial tests, where all classeswith small amounts of images are ignored in any predictions.However, we determine, that the emotions contempt, dis-gust, fear and sadness are not important in a business videocall. In most cases, happiness and neutral indicate if a con- ference is going well. Therefore anger and surprise are lessimportant.So, for the scope of this project we decide to leave themout for now and focus on happiness and neutral as facialexpressions. If this works well, it should be expandable toother emotions with larger datasets.This leaves us with 11228 images for training and 382 forvalidation.
TABLE II: NUMBER OF IMAGES FOR EACH LABEL

Emotion     Training   Validation   Importance
Anger       169        45           medium
Contempt    9          -            low
Disgust     12         -            low
Fear        11         -            low
Happiness   4961       191          high
Neutral     6267       191          high
Sadness     116        -            low
Surprise    288        49           medium
b) Training: For training we use two common types of neural networks and some variation in the fed data, which results in three tested solutions in total. We use TensorFlow [20] with Keras [21] as the framework to create all neural networks.
Fully-Connected Neural Network:
The first tested network is a dense neural network (DNN) forming a multi-layer perceptron (MLP). It has the following architecture:
• Input layer with X- and Y-coordinates for each of the 68 detected facial landmarks (136 inputs)
• Dense hidden layer with 1024 neurons, followed by a dropout layer with rate 0.5
• Dense hidden layer with 512 neurons, followed by a dropout layer with rate 0.5
• Dense hidden layer with 256 neurons, followed by a dropout layer with rate 0.5
• Dense output layer with a node for each of the 2 possible emotions
As this model hits a plateau between 75% and 80% accuracy on the training and validation datasets, we suspect this could be caused by the changing absolute coordinates between images. So as a second attempt we add additional features and convert all absolute coordinates to relative ones.
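The following Keras sketch corresponds to this architecture; the loss function and the exact learning rate are placeholders, since only the optimizer and activation functions are fixed by the description above.

from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_inputs: int = 136, n_classes: int = 2) -> keras.Model:
    # MLP on 68 landmark coordinates (x and y), as described above.
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # placeholder rate
                  loss="categorical_crossentropy",                      # assumed loss
                  metrics=["accuracy"])
    return model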
Fully-Connected Neural Network with modified features:
As the most relevant changes happen relative to the center of the respective portion of the face, the center point of each portion is calculated. This center is added as an additional feature and all points of the portion are added relative to it. We decide to split the face into 4 parts: mouth, nose, left eye, right eye. The eyebrows are given relative to the center of the corresponding eye, because we think that the movement of an eyebrow relative to the eye (raising the eyebrows) is the most significant change in an emotion.
Additionally, the positions of the landmarks depend on the shape of a face. As an accommodation, we add the width, height and center point of the face outline as features for the neural network. With these, the network can potentially predict the emotion independently of the geometry of the face.
This modified dataset uses the same DNN architecture, with the only difference being the input layer, to accommodate the additional features. This yields slightly better results on both datasets, as later discussed in IV.
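A possible implementation of this feature construction is sketched below. The landmark index ranges follow the common 68-point dlib convention, and the left/right naming as well as the exact grouping are our own assumptions.

import numpy as np

REGIONS = {                       # 68-point dlib index ranges (illustrative grouping)
    "right_eye": list(range(36, 42)) + list(range(17, 22)),  # eye plus its eyebrow
    "left_eye":  list(range(42, 48)) + list(range(22, 27)),
    "nose":      list(range(27, 36)),
    "mouth":     list(range(48, 68)),
}

def relative_features(landmarks: np.ndarray) -> np.ndarray:
    # Builds the modified feature vector from a (68, 2) array of absolute coordinates.
    features = []
    for indices in REGIONS.values():
        pts = landmarks[indices]
        center = pts.mean(axis=0)
        features.extend(center)                  # center point as an additional feature
        features.extend((pts - center).ravel())  # points relative to that center
    outline = landmarks[:17]                     # face outline (jaw line)
    width, height = np.ptp(outline, axis=0)      # overall face geometry
    features.extend([width, height, *outline.mean(axis=0)])
    return np.asarray(features, dtype=np.float32)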
Convolutional Neural Network:
Convolutional Neural Networks (CNN) are usually used for image recognition, as they can make use of the spatial position of features. [22] We think this concept might be applied here, since the position of each facial landmark can correspond to certain emotions.
For the dataset to be used with a CNN, all points of each image were projected into a 350x350 matrix, initially filled with zeros, where each added '1' corresponds to a facial landmark position.
The CNN has the following architecture:
• Input layer with a 350x350 matrix with a depth of 1 for the single channel, representing whether or not a facial landmark lies at the given point
• 2D convolutional layer
• Maximum 2D-pooling layer with a size of 2x2
• Dropout layer with a rate of 0.25 to combat overfitting
• Flattening layer to prepare the data for the fully connected layer
• Dense output layer with a node for each of the 2 possible emotions
To combat overfitting, in addition to the dropout layer, we use data augmentation: in this case it is easy to flip each matrix horizontally to get twice as many data rows as before.
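The sketch below illustrates both the projection of landmarks into the binary matrix and the CNN itself; the number of filters and the kernel size of the convolutional layer are placeholders, as the exact values are not reproduced here.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SIZE = 350

def landmark_map(landmarks: np.ndarray) -> np.ndarray:
    # Project a (68, 2) array of pixel coordinates into a SIZE x SIZE binary matrix.
    grid = np.zeros((SIZE, SIZE, 1), dtype=np.float32)
    for x, y in landmarks.astype(int):
        grid[min(y, SIZE - 1), min(x, SIZE - 1), 0] = 1.0
    return grid

def augment(batch: np.ndarray) -> np.ndarray:
    # Horizontal flip doubles the number of training samples.
    return np.concatenate([batch, batch[:, :, ::-1, :]], axis=0)

def build_cnn(n_classes: int = 2) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(SIZE, SIZE, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),   # placeholder filter count / kernel
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model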
Common parameters:
All neural networks were trained using the Adam optimizer [23] with a fixed learning rate. The activation function was always the Rectified Linear Unit (ReLU), except for the output layer, which used softmax. For the DNNs, all coordinates were mapped from the full width and height range (350 px each) to values between 0 and 1.

Model architecture creation:
As a basis to improve on, we use the model architectures provided as examples by the Keras framework [24]. With the multilayer perceptron and the convolutional neural network as starting points for our further development, we adapt them to our data format and reach the final presented networks by training, testing and altering the models. The base remains conceptually the same, with minor modifications such as adding or removing layers and adjusting parameters like the neuron count or the dropout rate.

c) Use: With the trained model, we run inference within our solution. The trained model is simply loaded using the Keras framework and inference is called using the methods provided by the Keras Model class. The result of this inference is then sent to the view module as a string naming the emotion.
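A minimal sketch of this step is shown below; the model file name and the label order are assumptions.

import numpy as np
from tensorflow import keras

EMOTIONS = ["happiness", "neutral"]                 # assumed label order from training
model = keras.models.load_model("emotion_dnn.h5")   # hypothetical model file

def predict_emotion(features: np.ndarray) -> str:
    # Run inference on one feature vector and return the emotion as a string.
    probs = model.predict(features[np.newaxis, :], verbose=0)[0]
    return EMOTIONS[int(np.argmax(probs))]          # this string is sent to the view module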
D. Module: view
The view accepts any kind of string and displays it in a window on the screen. In future work, the user interface designed in this bachelor's thesis [1] could be implemented here.
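One possible minimal implementation is sketched below; Tkinter, the socket address and the polling interval are our own choices for illustration.

import tkinter as tk
import zmq

def run_view(address: str = "tcp://127.0.0.1:5558") -> None:
    # Request emotion strings from the controller and show the latest one in a window.
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(address)
    root = tk.Tk()
    root.title("Detected emotion")
    label = tk.Label(root, text="...", font=("Arial", 24))
    label.pack(padx=20, pady=20)

    def poll() -> None:
        sock.send(b"ready")
        msg = sock.recv()
        if msg == b"done":                     # controller has quit
            root.destroy()
            return
        label.config(text=msg.decode("utf-8"))
        root.after(200, poll)                  # poll again after 200 ms

    root.after(0, poll)
    root.mainloop()
    sock.close()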
IV. RESULTS
To evaluate the software, we define the following office scenario: the conversation partner has a webcam with a resolution of 720p at 30 fps and the user has a PC with a recent (as of 2019, 9th generation) Intel Core i5 processor.
To compare our solution with a representative of the commercial sector, we selected the AFFDEX algorithm. It is implemented in the demo program "AffdexMe" [25]. We install it on the same machine we use for our own solution. Instead of the webcam input we use a stream of the desktop to simulate how AffdexMe would perform with the same input.
A. Performance
As AffdexMe does not provide interfaces for time measurement, we use a camera with a high framerate to count the milliseconds between the image becoming visible to AffdexMe and the program detecting the face and emotion. In our solution, we measure the time between the image reaching the controller module and the detected emotion leaving the controller module for the different types of networks. All tests are executed on the same machine with an Intel Core i5 5500U and 8 GB of RAM.
The performance of each software varies greatly with the input resolution. For AffdexMe the input images have a resolution of 525 x 525 pixels, while our software takes a screenshot with a resolution of 1366 x 768 pixels. This has to be taken into consideration when evaluating the results in table III. A sketch of how the in-process timing can be instrumented is given after the table.
TABLE III: PERFORMANCE

              Average    Standard Deviation
DNN           737.1 ms   35.3 ms
Modified DNN  727.4 ms   25.5 ms
CNN           767.8 ms   40.7 ms
AffdexMe      367.8 ms   116.9 ms
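The following sketch illustrates one way to take these in-process measurements inside the controller; the wrapper function and its naming are hypothetical, not part of the actual module code.

import statistics
import time

durations_ms = []

def timed_recognition(image, recognize) -> str:
    # Wrap one controller pass: `recognize` maps an image to an emotion string.
    start = time.perf_counter()
    emotion = recognize(image)
    durations_ms.append((time.perf_counter() - start) * 1000.0)
    return emotion

def report() -> None:
    # Average and standard deviation over all measured passes, as in Table III.
    print(f"average: {statistics.mean(durations_ms):.1f} ms, "
          f"std dev: {statistics.stdev(durations_ms):.1f} ms")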
B. Precision
We compare our three solutions with each other and with the solution from Affectiva [4]. We separated random images from the dataset for validation, based on our rating of the importance of each emotion for a business video conference.

Fig. 2. DNN: Accuracy on training and validation data over 5000 epochs (training and validation curves for both the plain and the modified DNN)
a) Metrics: We use two metrics to evaluate the results. For one, we collect the certainty that AffdexMe and our solutions report for the emotion corresponding to the labeled emotion of the dataset and calculate the average per emotion. As Affectiva does not have a "neutral" emotion, we use the strongest detected emotion and calculate the reciprocal value. This is shown in table IV.
TABLE IV: CERTAINTY

               Happiness   Neutral   Total
Images tested  191         191       382
DNN            69.48%      75.04%    72.26%
Modified DNN   72.39%      76.79%    74.59%
CNN            55.27%      62.46%    58.87%
Affectiva      52.36%      26.54%    39.45%
As the other metric, we calculate the accuracy of the predictions of each of our solutions and of AffdexMe. This is achieved by measuring the percentage of correctly classified emotions on our validation dataset. These results are shown in table V.
TABLE V: ACCURACY

               Happiness   Neutral   Total
Images tested  191         191       382
DNN            74.35%      86.91%    80.63%
Modified DNN   78.01%      86.39%    82.20%
CNN            58.64%      75.39%    67.02%
Affectiva      59.26%      52.15%    55.70%
b) History of accuracy while training: The DNNs are trained over 5000 epochs. The accuracy on both the training and validation datasets over time is shown in figure 2.

Fig. 3. CNN: Accuracy on training and validation data over 500 epochs
Both DNN versions improve steadily on the training dataset, while the accuracy on the validation dataset varies more. The training accuracy increases faster with the modified features, but both reach an equal growth towards the end. With modified features the network performs better overall, but only by a small margin. We suspect this is because the relative features show slightly more relevant changes in their values, and it might also be caused by the additional width and height features.
The training begins to plateau at about 3000 epochs; after that the accuracy only increases slowly. The models might not be fully fitted yet, but they are close to the point where the accuracy no longer increases or even decreases. As the modified features train faster and give better overall results, this might be the preferred option when developing these methods further.
Like the DNN, the CNN continues improving over the whole 500 trained epochs and also does not yet reach a point where any accuracy starts to decrease. From this we can argue that with more training time these results can be improved further. As training the CNN takes 72 to 73 seconds per epoch, it is stopped at 500 epochs, which is equivalent to 10 hours of training time, because of the time constraints of this project. We suspect it would quickly get better on the training dataset but would stay in the same range on the validation dataset, as it already plateaus early on and increases only by a small margin towards the end.
As shown in figure 3, the CNN is already more accurate on the training dataset after one sixth of the epochs, but at the end of its training it is less accurate on the validation data than the DNN at its end. To perform similarly to or better than the DNNs on the validation set, we think the CNN's architecture would need some modification and fine-tuning.
c) Discussion: As the DNN model is smaller in file size and faster for inference on a single image, it is the preferred type for our solution for now, based on these results. As already mentioned, we choose the DNN with relative feature points, as it is slightly better. The time for calculating relative coordinates from the absolute ones is negligible, as the few needed array operations lie within the margin of error when measuring runtime.
Although the results show that we do not reach accuracies on the validation dataset in the higher 90th percentile, like the solutions shown in II-C, we think that our solution has the benefit of using natural emotions for training. Many of the existing solutions use images of acted emotions, while our solution is trained on non-acted, everyday images of famous people. To support this claim, we conducted several real-world tests with our solution under different conditions. They show a subjectively higher accuracy than on our validation dataset.
Additionally, our solution cannot easily be compared with many other solutions, as we do not use the same datasets for validation. With different images, our accuracy could improve significantly in comparison with the other solutions.
Finally, our solutions are more accurate on our chosen dataset than AffdexMe, but also slightly slower. As already mentioned, this is due to the different resolutions. Also, based on our experience using the AFFDEX algorithm, we think that it benefits from a stream of moving images; however, we test it with static images in order to be able to compare the results to our solution. Furthermore, our solution has the benefit compared to AFFDEX that we train only on the relevant emotions, while their solution is influenced by additional emotions, which can result in more false predictions.

V. FUTURE IMPROVEMENTS
a) Multi-modal information: Because our social interaction is not based on visual feedback alone, we propose the idea of combining several sources of human reactions that result from the current emotional state. This might improve the overall precision of and confidence in the recognition system. In the case of video conference software, voice data might be taken into consideration; this was already evaluated in more detail in [26] and [27]. Additionally, chat messages can reflect the current emotional state and might also be used as another source of information, as discussed in [28]. These sources might be used as a starting point for further investigation.
b) Adaption to video conference software: Because this project is built with modularity in mind, it is easy to provide adaptions for various video conference software and different user interfaces like the one presented in [1].
c) Larger dataset: Our dataset consists of 12309 labeled images. As the labels are distributed very unequally, future work can concentrate on gathering a larger and more evenly distributed dataset to improve the detection of the less important emotions and to help the classifier distinguish between similar emotions like anger and disgust.

VI. CONCLUSION
We conclude that this software is a big step forward for the effective employment of individuals with autism. Although our solution is limited to two distinct emotions, we think it can already help such individuals in the everyday task of video conferences in an office environment. And, as stated before, with more data and further learning it can be improved to include more emotions.
Before deploying such software, it might be a good approach to discuss how it affects the emotional climate in the office, as well as what the effects of a wrong classification could be if decisions are based on these judgements. Overall, it is a helpful piece of software that could already be used in its current state once all other hurdles concerning privacy, fairness and the possible impacts of misclassifications are overcome.
REFERENCES

[1] S. S. Bobek, User Experience Design für Anwendungen mit Künstlicher Intelligenz, Doggenriedstraße, 88250 Weingarten, Germany, 2019.
[2] Auticon. (2019). "Autismus | auticon erklärt wie wir das Autismus-Spektrum sehen," [Online]. Available: https://auticon.de/autismus/ (visited on Aug. 31, 2019).
[3] M. Uljarevic and A. Hamilton, "Recognition of emotions in autism: A formal meta-analysis," Journal of Autism and Developmental Disorders, vol. 43, no. 7, pp. 1517–1526, 2013.
[4] D. McDuff, A. Mahmoud, M. Mavadati, M. Amr, J. Turcot, and R. el Kaliouby, "AFFDEX SDK: A cross-platform real-time multi-face expression recognition toolkit," in Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA '16, San Jose, California, USA: ACM, 2016, pp. 3723–3726, ISBN: 978-1-4503-4082-3. DOI: 10.1145/2851581.2890247. [Online]. Available: http://doi.acm.org/10.1145/2851581.2890247.
[5] Affectiva. (2019). "Pricing – Affectiva Developer Portal," [Online]. Available: https://developer.affectiva.com/pricing/ (visited on Aug. 31, 2019).
[6] D. McDuff, R. El Kaliouby, J. F. Cohn, and R. W. Picard, "Predicting ad liking and purchase intent: Large-scale analysis of facial responses to ads," IEEE Transactions on Affective Computing, IEEE, 2013, pp. 157–161.
[10] M. S. Ratliff and E. Patterson, "Emotion recognition using facial expressions with active appearance models," in Proc. of HRI, Citeseer, 2008.
[11] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in European Conference on Computer Vision, Springer, 1998, pp. 484–498.
[12] The ZeroMQ authors. (2019). "ZeroMQ | Get started," [Online]. Available: https://zeromq.org/get-started/ (visited on Sep. 3, 2019).
[13] T2. (2019). "mss · PyPI," [Online]. Available: https://pypi.org/project/mss/ (visited on Oct. 4, 2019).
[14] B. L. Y. Rowe. (2019). "GitHub - muxspace/facial_expressions: A set of images for classifying facial expressions," [Online]. Available: https://github.com/muxspace/facial_expressions (visited on Sep. 1, 2019).
[15] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] Keras Documentation. (2019). "Guide to the Sequential model – Examples," [Online]. Available: https://keras.io/getting-started/sequential-model-guide/.
[26] Z.-J. Chuang and C.-H. Wu, "Multi-modal emotion recognition from speech and text," International Journal of Computational Linguistics & Chinese Language Processing, vol. 9, no. 2, August 2004: Special Issue on New Trends of Speech and Language Processing, pp. 45–62.
[27] L. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multi-modal information," in Proceedings of the IEEE International Conference on Information, Communications and Signal Processing, vol. 1, Oct. 1997, pp. 397–401, ISBN: 0-7803-3676-3. DOI: 10.1109/ICICS.1997.647126.
[28] C.-H. Wu, Z.-J. Chuang, and Y.-C. Lin, "Emotion recognition from text using semantic labels and separable mixture models."