A Deep Learning Approach for Multimodal Deception Detection
Gangeshwar Krishnamurthy, Navonil Majumder, Soujanya Poria, Erik Cambria
Gangeshwar Krishnamurthy: A*STAR Artificial Intelligence Initiative (A*AI), Institute of High Performance Computing, Singapore ([email protected])
Navonil Majumder: Centro de Investigación en Computación, IPN, Mexico ([email protected])
Soujanya Poria: Temasek Laboratories, Nanyang Technological University, Singapore ([email protected])
Erik Cambria: School of Computer Science and Engineering, Nanyang Technological University, Singapore ([email protected])
Abstract.
Automatic deception detection is an important task that has gained momentum in computational linguistics due to its potential applications. In this paper, we propose a simple yet tough-to-beat multi-modal neural model for deception detection. By combining features from different modalities such as video, audio, and text, along with Micro-Expression features, we show that detecting deception in real-life videos can be more accurate. Experimental results on a dataset of real-life deception videos show that our model outperforms existing techniques for deception detection with an accuracy of 96.14% and an ROC-AUC of 0.9799.
1 Introduction

We face deceptive behavior in our day-to-day life. People lie to escape from situations that seem unfavorable to them. Some lies are innocuous, but others may have severe ramifications for society. Reports suggest that the ability of humans to detect deception without special aids is only 54% [1]. A study by DePaulo et al. [2] found that deception without any particular motivation or intention exhibited almost no detectable cues; however, cues were significantly more pronounced when lies were about transgressions.

With the rise in the number of criminal cases filed every year in the US, it is ethically and morally important to accuse only the guilty defendant and free the innocent. Since the judgment for any case is mostly based on the hearings and evidence from the stakeholders (accused, witnesses, etc.), the judgment is most likely to go wrong if the stakeholders do not speak the truth. It is, hence, important to detect deceptive behavior accurately in order to uphold law and order.

Social media can be characterized as a virtual world where people interact with each other without the human feel and touch. It is easy to conceal one's identity and/or pretend to be someone else on social media. Cyberbullying is increasingly becoming a common problem among teenagers nowadays [3]. It includes spreading rumors about a person, threats, and sexual harassment. Cyberbullying adversely affects the victim and leads to a variety of emotional responses such as lowered self-esteem, increased suicidal thoughts, anger, and depression [4]. Teenagers fall prey to these attacks due to their inability to comprehend the chicanery and pretentious behavior of the attacker.

Another area where deception detection is of paramount importance is the increasing number of false stories, a.k.a. Fake News, on the Internet. Recent reports suggest that the outcome of the U.S. Presidential Elections was influenced by the rise of online fake news. Propagandists use arguments that, while sometimes convincing, are not necessarily valid. Social media platforms, such as Facebook and Twitter, have become the propellers of this political propaganda. Countries around the world, such as France [5], are employing methods to prevent the spread of fake news during their elections. Though these measures might help, there is a pressing need for the computational linguistics community to devise efficient methods to fight Fake News, given that humans are poor at detecting deception.

This paper is organized as follows. In Section 2, we discuss past work in deception detection; Section 3 describes our approach to deception detection; Section 4 explains our experimental setup; in Sections 5 and 6, we discuss our results and drawbacks, respectively; finally, we conclude with future work in Section 7.

2 Related Work

Past research in the detection of deception can be broadly classified as verbal and non-verbal. In verbal deception detection, the features are based on the linguistic characteristics, such as n-grams and sentence count statistics [6], of the statement by the subject under consideration. More complex features, such as psycholinguistic features based on the Linguistic Inquiry and Word Count (LIWC) lexicon [7], have also been explored by [6] and shown to be helpful in detecting deceptive behavior. Yancheva and Rudzicz studied the relation between the syntactic complexity of text and deceptive behavior [8].
In non-verbal deception detection, physiological measures have been the main source of signals for detecting deceptive behavior. Polygraph tests measure physiological features such as heart rate, respiration rate, and skin temperature of the subject under investigation. But these tests are not reliable and are often misleading, as indicated by [9, 10], since judgments made by humans are often biased. Facial expressions and hand gestures have been found to be very helpful in detecting deceptive nature. Ekman [11] defined micro-expressions as short involuntary expressions, which could potentially indicate deceptive behavior. Caso et al. [12] identified particular hand gestures as important for identifying the act of deception. Cohen et al. [13] found that fewer iconic hand gestures were a sign of a deceptive narration.

Previous research focused on detecting deceitful behavior under constrained environments, which may not be applicable to real-life surroundings. Recently, the focus has shifted towards experiments in real-life scenarios. Towards this, Pérez-Rosas et al. [14] introduced a new multi-modal deception dataset containing real-life videos of courtroom trials. They demonstrated the use of features from different modalities and the importance of each modality in detecting deception. They also evaluated the performance of humans in deception detection and compared it with their machine learning models. Wu et al. [15] developed methods that leverage multi-modal features for detecting deception. Their method relies heavily on feature engineering, along with manual cropping and annotation of videos for feature extraction.

In this paper, we describe our attempt to use neural models that use features from multiple modalities for detecting deception. We believe our work is the first attempt at using neural networks for deceit classification. We show that, with the right features and simple models, we can detect deceptive nature in real-life trial videos more accurately.
3 Our Approach

The first stage is to extract unimodal features from each video. We extract textual, audio, and visual features as described below.
Visual Feature Extraction
For extracting visual features from the videos, we use a 3D-CNN [16]. 3D-CNNs have achieved state-of-the-art results in object classification on tridimensional data [16]. A 3D-CNN not only extracts features from each image frame, but also extracts spatiotemporal features [17] from the whole video, which helps in identifying facial expressions such as smile, fear, or stress.

The input to the 3D-CNN is a video v of dimension (c, f, h, w), where c is the number of channels and f, h, w are the number of frames, the height, and the width of each frame, respectively. A 3D convolutional filter f_l of dimension (f_m, c, f_d, f_h, f_w) is applied, where f_m is the number of feature maps, c the number of channels, f_d the number of frames (also called the depth of the filter), f_h the height of the filter, and f_w the width of the filter. Sliding across the input video v, this filter produces an output conv_out of dimension (f_m, f − f_d + 1, h − f_h + 1, w − f_w + 1). Max pooling with window size (m_p, m_p, m_p) is applied to conv_out. Subsequently, this output is fed to a dense layer of size d_f followed by a softmax. The activations of this dense layer are used as the visual feature representation of the input video v.

In our experiments, we consider only RGB channel images, hence c = 3. We use 32 feature maps and 3D filters of size f_d = f_h = f_w = 5; hence the dimension of the filter f_l is 32 × 3 × 5 × 5 × 5. The window size m_p of the max pooling layer is 3. Finally, we obtain a feature vector v_f of dimension 300 for an input video v.
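To make the dimensions concrete, here is a minimal PyTorch sketch of this extractor. The frame count, frame resolution, and the ReLU non-linearities are our assumptions, as the text does not specify them.

```python
# A minimal sketch of the 3D-CNN visual feature extractor described above
# (32 feature maps, 5x5x5 filters, max-pool window 3, 300-d dense layer).
import torch
import torch.nn as nn

class Visual3DCNN(nn.Module):
    def __init__(self, n_frames=80, height=64, width=64, d_f=300):
        super().__init__()
        # RGB input (c = 3), 32 feature maps, 5x5x5 filter over (f, h, w)
        self.conv = nn.Conv3d(in_channels=3, out_channels=32, kernel_size=5)
        self.pool = nn.MaxPool3d(kernel_size=3)  # window m_p = 3
        # flattened size after conv + pool, for the assumed input size
        flat = 32 * ((n_frames - 4) // 3) * ((height - 4) // 3) * ((width - 4) // 3)
        self.fc = nn.Linear(flat, d_f)  # activations used as the video feature v_f

    def forward(self, video):  # video: (batch, 3, f, h, w)
        x = self.pool(torch.relu(self.conv(video)))
        return torch.relu(self.fc(x.flatten(start_dim=1)))  # (batch, 300)

# e.g. v_f = Visual3DCNN()(torch.randn(2, 3, 80, 64, 64))  # -> (2, 300)
```

The softmax classification layer used during training is omitted here, since only the dense-layer activations are kept as features.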
Textual Feature Extraction

We use a Convolutional Neural Network (CNN) [18, 19] to extract features from the transcript of a video v. First, we use a pre-trained Word2Vec [20] model to extract the vector representation of every word in the transcript. These vectors are concatenated and fed as the input to the CNN. We use a simple CNN with one convolutional layer and a max-pooling layer to obtain our sentence representation. In our experiments, filters of sizes 3, 5, and 8 with 20 feature maps each are used. A window size of 2 is employed for max-pooling over these feature maps. Subsequently, a fully-connected layer with 300 neurons is used, with the rectified linear unit (ReLU) [21] as the activation function. The activations of this fully-connected layer are used as the textual feature representation of the input video v. Finally, we obtain a feature vector t_f of dimension 300 for an input transcript t.
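A minimal PyTorch sketch of this transcript CNN follows. The fixed 100-token sequence length is our assumption, since the text does not state how transcripts are padded or truncated.

```python
# A sketch of the transcript CNN: filter sizes 3/5/8 with 20 feature maps
# each, max-pooling window 2, and a 300-d fully-connected ReLU layer.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, seq_len=100, emb_dim=300, d_f=300):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, 20, kernel_size=k) for k in (3, 5, 8)]
        )
        self.pool = nn.MaxPool1d(kernel_size=2)  # window size 2
        flat = sum(20 * ((seq_len - k + 1) // 2) for k in (3, 5, 8))
        self.fc = nn.Linear(flat, d_f)

    def forward(self, emb):            # emb: (batch, seq_len, emb_dim) Word2Vec
        x = emb.transpose(1, 2)        # -> (batch, emb_dim, seq_len)
        pooled = [self.pool(torch.relu(conv(x))) for conv in self.convs]
        z = torch.cat([p.flatten(start_dim=1) for p in pooled], dim=1)
        return torch.relu(self.fc(z))  # textual feature t_f: (batch, 300)

# e.g. t_f = TextCNN()(torch.randn(2, 100, 300))  # -> (2, 300)
```

In the non-static setting mentioned later, the Word2Vec embeddings would simply be registered as trainable parameters instead of fixed inputs.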
Audio Feature Extraction

openSMILE [22] is an open-source toolkit for extracting high-dimensional features from an audio file. In this work, we use openSMILE to extract features from the input audio. Before extracting the features, we make sure that there are no unnecessary signals in the audio that would affect the quality of the extracted features: the background noise is removed using the SoX (Sound eXchange) [23] audio processing tool, and Z-standardization is used to perform voice normalization. The denoised input audio is then fed to the openSMILE tool to extract high-dimensional features, which are functionals of low-level descriptor (LLD) contours. Specifically, we use the IS13-ComParE openSMILE configuration, which extracts features of dimension 6373 for every input audio a. After these features are extracted, a simple fully-connected neural network is trained to reduce the dimension to 300. Finally, we obtain a feature vector a_f of dimension 300 for an input audio a.
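As a sketch of this pipeline, assuming an already denoised WAV file: the paper uses the openSMILE IS13-ComParE configuration, while the opensmile Python package used below exposes the closely related ComParE_2016 functionals, which likewise yield 6373 features per file. The file path and the exact reduction network are illustrative.

```python
# Extract ComParE functionals with the opensmile package, then reduce the
# 6373-d vector to the 300-d audio feature a_f with a fully-connected layer.
import opensmile
import torch
import torch.nn as nn

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
feats = smile.process_file("subject_clip.wav")       # hypothetical file path
x = torch.tensor(feats.values, dtype=torch.float32)  # shape (1, 6373)

reduce = nn.Sequential(nn.Linear(6373, 300), nn.ReLU())
a_f = reduce(x)                                      # shape (1, 300)
```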
Micro-Expression Features

Pérez-Rosas et al. [14] manually annotated facial expressions and used binary features derived from the ground-truth annotations to predict deceptive behavior. Facial micro-expressions are also considered to play an important role in detecting deceptive behavior. The data provided by [14] contains 39 facial micro-expressions such as frowning, smiling, eyebrow raising, etc. These are binary features, taken as a feature vector m_f of dimension 39.

Fig. 1. Architecture of model MLP_C.
Classification Model

We use a simple Multi-Layer Perceptron (MLP) with a hidden layer of size 1024 followed by a linear output layer. We use the rectified linear unit (ReLU) activation function [21] for non-linearity at the hidden layer. A dropout [24] with keep probability p = 0.5 is applied to the hidden layer for regularization. Figures 1 and 2 show the architectures of our models, MLP_C and MLP_{H+C}.
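A minimal PyTorch sketch of this classifier follows; the input dimension d_in depends on the fusion scheme described below (939 for concatenation, 339 for Hadamard + concatenation).

```python
# The MLP classifier: hidden size 1024, ReLU, dropout, linear output layer.
import torch.nn as nn

def make_mlp(d_in: int, n_classes: int = 2) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(d_in, 1024),
        nn.ReLU(),
        nn.Dropout(p=0.5),           # keep probability 0.5
        nn.Linear(1024, n_classes),  # linear output; softmax applied in the loss
    )
```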
Unimodal Models
We perform the evaluation on individual modalities using the same architecture as the multi-modal model. The only difference is that the input is either v_f, a_f, t_f, or m_f; hence no data fusion is performed. We name this model MLP_U.

Multimodal Fusion

We fuse the features from the individual modalities to map them into a joint space. To achieve this, we try different data fusion techniques:
Concatenation
In this method, the features from all the modalities are simply concatenated into a single feature vector. Thus, the extracted features t_f, a_f, v_f, and m_f are concatenated to form the representation z_f = [t_f; a_f; v_f; m_f] of dimension d_in = 939. We call this model configuration MLP_C; it is shown in Figure 1.
Hadamard + Concatenation

In this method, the audio, visual, and textual features are fused using the Hadamard product, and the Micro-Expression features are then concatenated with this product. Thus, we have z_f = [t_f ⊙ a_f ⊙ v_f; m_f] of dimension d_in = 339, where A ⊙ B denotes element-wise multiplication (the Hadamard product) of A and B. We call this model configuration MLP_{H+C}; it is shown in Figure 2.

Fig. 2. Architecture of model MLP_{H+C}.

Training

The training objective is to minimize the cross-entropy between the model's output and the true labels. We train the models with back-propagation using the Stochastic Gradient Descent optimizer. The loss function is:

$$J = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log(\hat{y}_{i,j}) \qquad (1)$$

where N is the number of samples and C is the number of categories (in our case, C = 2); y_i is the one-hot ground-truth vector of the i-th sample, and \hat{y}_{i,j} is its predicted probability of belonging to class j.
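The sketch below illustrates both fusion schemes and one training step under the cross-entropy objective of Eq. (1). The batch tensors and the learning rate are placeholders, not values from the paper.

```python
# Fuse per-modality features and take one SGD step on the cross-entropy loss.
import torch
import torch.nn as nn

t_f = torch.randn(8, 300)        # textual features
a_f = torch.randn(8, 300)        # audio features
v_f = torch.randn(8, 300)        # visual features
m_f = torch.randn(8, 39)         # micro-expression features (binary in practice)
y   = torch.randint(0, 2, (8,))  # 0 = truthful, 1 = deceptive

z_concat   = torch.cat([t_f, a_f, v_f, m_f], dim=1)    # MLP_C input, 939-d
z_hadamard = torch.cat([t_f * a_f * v_f, m_f], dim=1)  # MLP_{H+C} input, 339-d

model = nn.Sequential(nn.Linear(339, 1024), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(1024, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)     # lr is our choice

loss = nn.CrossEntropyLoss()(model(z_hadamard), y)     # Eq. (1)
opt.zero_grad()
loss.backward()
opt.step()
```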
4 Experimental Setup

For evaluating our deception detection model, we use the real-life deception detection dataset of [14]. This dataset contains 121 video clips of courtroom trials. Out of these 121 videos, 61 are of deceptive nature while the remaining 60 are truthful.

The dataset contains multiple videos from one subject. In order to avoid bleeding of personalities between the train and test sets, we perform 10-fold cross-validation over subjects instead of videos, as suggested by Wu et al. [15]. This ensures that videos of the same subject are not in both the training and the test set (see the cross-validation sketch after Table 1).

Wu et al. [15] made use of various classifiers such as Logistic Regression (LR), Linear SVM (L-SVM), and Kernel SVM (K-SVM). They report the ROC-AUC values obtained by the classifiers for different combinations of modalities. We compare the AUC obtained by our models against Linear SVM (L-SVM) and Logistic Regression (LR) only. Pérez-Rosas et al. [14] use Decision Trees (DT) and Random Forests (RF) as their classifiers; we compare the accuracies of our models against DT and RF.

Table 1. Comparison of AUC

Features               | L-SVM [15] | LR [15] | MLP_U  | MLP_C  | MLP_{H+C}
Random                 | -          | -       | 0.4577 | 0.4788 | 0.4989
Audio                  | 0.7694     | -       | -      | -      | -
Visual                 | -          | -       | -      | -      | -
Textual (Static)       | -          | -       | -      | -      | -
Textual (Non-static)   | -          | -       | -      | -      | -
Micro-Expression       | -          | -       | -      | -      | -
All Features (Static)  | -          | -       | -      | 0.9538 | -
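The subject-level split described in Section 4 can be implemented with scikit-learn's GroupKFold, as in this sketch; the feature matrix, labels, and subject ids below are placeholders.

```python
# Subject-level 10-fold cross-validation: no subject's videos appear in both
# the train and the test fold of any split.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.randn(121, 939)              # fused features, one row per video
y = np.random.randint(0, 2, 121)           # 1 = deceptive, 0 = truthful
subjects = np.random.randint(0, 56, 121)   # subject id per video (placeholder)

for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=subjects):
    # GroupKFold guarantees disjoint subject sets across train and test
    assert not set(subjects[train_idx]) & set(subjects[test_idx])
    # ... train on X[train_idx], y[train_idx]; evaluate on X[test_idx] ...
```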
5 Results
Tables 1 and 2 present the performance of the MLP and its variants along with the state-of-the-art models. During feature extraction from text, we train the TextCNN model with two different settings: one keeping the word vector representations static, and one optimizing the vectors along with the training (non-static). In our results, we report the performance of the model with these two textual features separately. Additionally, we also report the results of our models for feature vectors initialized with random numbers.

Table 1 shows that our model, MLP_{H+C}, obtains an AUC of 0.9799, outperforming all other competitive baselines by a huge margin.
Table 2. Comparison of the accuracies of our models with the baselines

Features               | DT [14] | RF [14] | MLP_U  | MLP_C  | MLP_{H+C}
Random                 | -       | -       | 43.90% | 45.32% | 48.51%
Audio                  | -       | -       | -      | -      | -
Visual                 | -       | -       | -      | -      | -
Textual (Static)       | -       | -       | -      | -      | -
Textual (Non-static)   | -       | -       | -      | -      | -
Micro-Expression       | -       | -       | -      | -      | -
All Features (Static)  | -       | -       | -      | 95.24% | -
Table 2 compares the performance of our models with the Decision Tree and Random Forest models of [14]. Our model, MLP_{H+C}, again outperforms the other baselines by achieving an accuracy of 96.14%. We can also infer that visual and textual features play a major role in the performance of our models, followed by Micro-Expressions and audio. This conforms with the findings of [14] that facial display features and unigrams contribute the most to detecting deception.

As we can see, our approach outperforms the baselines by a huge margin. Our neural models are simple and straightforward; hence the results show that the right feature extraction methods can unveil significant signals that are useful for detecting deceptive nature.

6 Drawbacks

Though our models outperform the previous state-of-the-art models, we acknowledge the drawbacks of our approach as follows:

- Our models still rely on a small dataset with only 121 videos. Due to this, our models are prone to over-fitting if not carefully trained with proper regularization.
- Also, due to the limited scenarios in the dataset, the model may not show the same performance in out-of-domain scenarios.
7 Conclusion and Future Work

In this paper, we presented a system to detect deceptive nature in videos. Surprisingly, our model performed well even with only the 121 videos provided in the dataset, which is generally not a feasible number of data points for neural models. As a result, we conclude that there exists a certain pattern in the videos that provides highly important signals for such precise classification. We performed various other evaluations, not presented in this paper, to confirm the performance of our model. From these experiments, we observed that visual and textual features predominantly contributed to accurate predictions, followed by Micro-Expression features. Empirically, we observed that our model MLP_{H+C} converged faster in comparison with MLP_C.

While our system performs well on the dataset of [14], we cannot claim the same performance for larger datasets covering a larger number of environments. Hence, creating a large multi-modal dataset with a large number of subjects under various environmental conditions is part of our future work. This would pave the way to building more robust and efficient learning systems for deception detection. Another interesting path to explore is detecting deception in social dyadic conversational settings.

References
1. Bond Jr., C.F., DePaulo, B.M.: Accuracy of deception judgments. Personality and Social Psychology Review (2006) 214–234
2. DePaulo, B.M., Lindsay, J.J., Malone, B.E., Muhlenbruck, L., Charlton, K., Cooper, H.: Cues to deception. Psychological Bulletin (2003) 74
3. Smith, P.K., Mahdavi, J., Carvalho, M., Fisher, S., Russell, S., Tippett, N.: Cyberbullying: its nature and impact in secondary school pupils. Journal of Child Psychology and Psychiatry (2008) 376–385
4. Hinduja, S., Patchin, J.W.: Bullying beyond the schoolyard: Preventing and responding to cyberbullying. Corwin Press (2014)
5. Baptiste Su, J.: France to impose restrictions on Facebook, Twitter in fight against fake news during elections (2018) [Online; posted 09-January-2018]
6. Mihalcea, R., Pulman, S.G.: Linguistic ethnography: Identifying dominant word classes in text. In: CICLing, Springer (2009) 594–602
7. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates (2001)
8. Yancheva, M., Rudzicz, F.: Automatic detection of deception in child-produced speech using syntactic complexity features. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics (2013) 944–953
9. Vrij, A.: Detecting lies and deceit: The psychology of lying and implications for professional practice. Wiley (2000)
10. Gannon, T.A., Beech, A.R., Ward, T.: Risk assessment and the polygraph. In: The use of the polygraph in assessing, treating and supervising sex offenders: A practitioner's guide (2009) 129–154
11. Ekman, P.: Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition). W.W. Norton & Company (2009)
12. Caso, L., Maricchiolo, F., Bonaiuto, M., Vrij, A., Mann, S.: The impact of deception and suspicion on different hand movements. Journal of Nonverbal Behavior (2006) 1–19
13. Cohen, D., Beattie, G., Shovelton, H.: Nonverbal indicators of deception: How iconic gestures reveal thoughts that cannot be suppressed. Semiotica (2010) 133–174
14. Pérez-Rosas, V., Abouelenien, M., Mihalcea, R., Burzo, M.: Deception detection using real-life trial data. In: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, ACM (2015) 59–66
15. Wu, Z., Singh, B., Davis, L.S., Subrahmanian, V.: Deception detection in videos. arXiv preprint arXiv:1712.04415 (2017)
16. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013) 221–231
17. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (2015) 4489–4497
18. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014)
19. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014)
20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013) 3111–3119
21. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010) 807–814
22. Eyben, F., Weninger, F., Gross, F., Schuller, B.: Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International Conference on Multimedia. MM '13, New York, NY, USA, ACM (2013) 835–838
23. Norskog, L.: Sound eXchange. http://sox.sourceforge.net/ (1991)
24. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014) 1929–1958