Unsupervised Audio-Visual Subspace Alignment for High-Stakes Deception Detection
Leena Mathur and Maja J. Matarić
Department of Computer Science, University of Southern California, Los Angeles, CA
ABSTRACT
Automated systems that detect deception in high-stakes situations can enhance societal well-being across medical, social work, and legal domains. Existing models for detecting high-stakes deception in videos have been supervised, but labeled datasets to train models can rarely be collected for most real-world applications. To address this problem, we propose the first multimodal unsupervised transfer learning approach that detects real-world, high-stakes deception in videos without using high-stakes labels. Our subspace-alignment (SA) approach adapts audio-visual representations of deception in lab-controlled low-stakes scenarios to detect deception in real-world, high-stakes situations. Our best unsupervised SA models outperform models without SA, outperform human ability, and perform comparably to a number of existing supervised models. Our research demonstrates the potential for introducing subspace-based transfer learning to model high-stakes deception and other social behaviors in real-world contexts with a scarcity of labeled behavioral data.
Index Terms — transfer learning, deception detection
1. INTRODUCTION
Advances in human-centered signal processing and multimodal machine learning are enabling the development of automated systems that can detect human social behaviors, including deception [1]. Deception involves the intentional communication of false or misleading information [2] and has been categorized as occurring in either high-stakes or low-stakes social situations [3]. Deceivers in high-stakes contexts face substantial consequences if their deception is discovered, in contrast to deceivers in low-stakes contexts. Healthcare providers, social workers, and legal groups have fostered an interest in detecting high-stakes deception for applications that enhance societal well-being (e.g., helping therapists recognize whether clients are masking negative emotions, helping judges assess courtroom testimonies of children coerced to lie) [4, 5]. Human deception detection ability has been found to be close to chance level [6], motivating the development of computational approaches that can help humans in this challenging task.

A fundamental challenge in modeling high-stakes deception is the scarcity of labeled high-stakes deception data, due to the difficulty of collecting large amounts of real-world data with verifiable ground truth [3]. Lab-controlled experiments to simulate realistic high-stakes scenarios are not ethical, because they would require the use of threats to impose substantial consequences on deceivers. Therefore, most researchers have collected data of participants communicating truthfully and deceptively in lab-controlled, low-stakes situations (e.g., mock crime scenarios) to study behavioral cues that could be indicative of high-stakes deception [7, 3].

Unsupervised domain adaptation (UDA) leverages knowledge from labeled source domains to perform tasks in related, unlabeled target domains [8, 9]. We addressed the data scarcity problem of high-stakes deception by proposing a novel UDA approach, based on subspace alignment (SA) [9], to detect high-stakes deception without using any high-stakes labels. Grounded in psychology research regarding the generalizability of deceptive cues across contexts [2], we hypothesized that audio-visual representations of low-stakes deception in lab-controlled situations can be leveraged by SA to detect high-stakes deception in real-world situations.

We experimented with unimodal and multimodal audio-visual SA models to contribute effective modalities, fusion approaches, and behavioral signals for unsupervised high-stakes deception detection. Our best unsupervised SA models (75% AUC, 74% accuracy) outperform models without SA, outperform human ability, and perform comparably to a number of existing supervised models [10, 11, 12, 13]. Our research demonstrates the potential for introducing unsupervised subspace-based transfer learning approaches to address the scarcity of labeled data when modeling high-stakes deception and other social behaviors in real-world situations.
2. BACKGROUND
Existing approaches for detecting high-stakes deception in videos have leveraged supervised machine learning models that exploit discriminative patterns in human visual, verbal, vocal, and physiological cues to distinguish deceptive and truthful communication [5]. To address the scarcity of labeled high-stakes deception data for training models [14], prior research has focused on developing models that are robust to small numbers of samples [14, 15]. While supervised transfer learning models have been developed to detect deception [16, 17], to the best of our knowledge, no existing research has introduced unsupervised models to address the data scarcity problem of high-stakes deception detection.

Psychology studies have found that humans can rely on audio-visual behavioral cues (e.g., facial expressions, voice pitch) to detect deception across different people and contexts in lab-controlled experiments that attempt to simulate high-stakes situations (e.g., mock crime scenarios) [2, 6]. These findings motivated our decision to leverage representations of audio-visual cues during low-stakes deception to detect high-stakes deception across different people and contexts. We note that subspace-based transfer learning has been successfully used to detect emotion [18, 19], but has not been previously developed to detect social behaviors. To the best of our knowledge, our unsupervised subspace-based transfer learning approach is the first to detect a social behavior, leveraging audio-visual cues to detect high-stakes deception.
3. DATASETS
This section describes the video datasets of real-world high-stakes deception and lab-controlled low-stakes deception used in this work.
For high-stakes deception, we used a publicly-available video dataset of people in 121 real-world courtroom trial situations (∼28 sec per video); this dataset is the current benchmark for high-stakes deception detection in videos [11]. Each video was labeled as truthful or deceptive per police testimony and trial information; we used these labels as ground truth. Per criteria used by prior research with this dataset [20, 21], we identified a usable subset of 108 videos (53 truthful videos, 55 deceptive videos; 47 speakers of diverse race and gender). All videos were collected in unconstrained situations with a variety of illuminations, camera angles, and face occlusions.

For low-stakes deception, we used the UR Lying Dataset [7], the only publicly-available video dataset that had participants voluntarily communicating truthfully and deceptively in lab-controlled game scenarios (∼23 min per video). Participants chose to respond either truthfully or deceptively to questions about images (e.g., a picture of a flower). They won $10 if they were believed while communicating truthfully and $20 if they were believed while communicating deceptively. We used 107 videos (44 truthful videos, 63 deceptive videos; 29 speakers of diverse race and gender). All videos were collected in situations with consistent illuminations and camera angles, as well as minimal face occlusions.
4. METHODOLOGY
This section describes our approach (Fig. 1) for detecting high-stakes deception through unsupervised audio-visual SA.
We extracted the same set of 89 audio-visual behavioral cues from speakers in both datasets. The OpenSMILE [22] toolkit was used to extract 58 audio features from the eGeMAPS [23] and MFCC feature sets, capturing spectral, cepstral, prosodic, and voice quality information at each audio frame. The eGeMAPS and MFCC feature sets in OpenSMILE overlap on MFCC features, and we did not include any duplicates. The OpenFace [24] toolkit was used to extract 31 visual features capturing facial action units (FAU), head pose, and eye gaze information from speakers at each visual frame.

The 89 audio-visual features were extracted frame-by-frame from videos. To prepare these features for a binary video classification task, we represented each as a fixed-length vector of time-series attributes during variable-length videos, similar to prior methods [10, 21]. For each feature, the TsFresh toolkit [25] computed 12 attributes: mean, standard deviation, aggregate autocorrelation across different time lags, and changes in feature values across quantiles. These functions are documented in the EfficientFCParameters class of TsFresh. A fixed-length feature vector of 1068 audio-visual time-series features (89 × 12) was computed to represent each video.

We formulated high-stakes deception detection as an unsupervised subspace alignment (SA) transfer learning problem [9]. Given a set S of Z labeled low-stakes deception samples (source domain) and a set T of Y unlabeled high-stakes deception samples (target domain), the modeling goal is to train a classifier on S that predicts deceptive and truthful labels of T. Both S and T are in M-dimensional feature spaces and are drawn according to different marginal distributions. PCA computes the first D principal components of each domain to create D-dimensional subspace embeddings C_S (source components) and C_T (target components). The optimal linear transformation matrix φ is computed by minimizing the difference between C_S and C_T with the following equation:

\[
\varphi = \operatorname*{argmin}_{\varphi} \lVert C_S \varphi - C_T \rVert_F
        = \operatorname*{argmin}_{\varphi} \lVert C_S^{\top} C_S \varphi - C_S^{\top} C_T \rVert_F
        = \operatorname*{argmin}_{\varphi} \lVert \varphi - C_S^{\top} C_T \rVert_F
        = C_S^{\top} C_T, \tag{1}
\]

where ‖·‖_F denotes the Frobenius norm. This matrix φ is used to transform low-stakes deception embeddings C_S to align with high-stakes deception. Classifiers are then trained on these aligned low-stakes embeddings and tested on high-stakes embeddings C_T to predict deceptive or truthful labels for high-stakes samples. Our approach is visualized in Fig. 1; additional details on unsupervised SA are in [9].

Fig. 1: We propose unsupervised, audio-visual subspace alignment (SA) for detecting high-stakes deception without using any high-stakes labels. M-dimensional subspaces of low-stakes deception (blue) are aligned with high-stakes deception (orange) through transition matrices φ and used in our D-dimensional SA approach (green) to predict high-stakes deception.
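As an illustration, the following is a minimal sketch of the subspace-alignment step in Eq. (1), using scikit-learn PCA and NumPy. The function name `subspace_align` and the variable names are our own assumptions for exposition, not code from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA


def subspace_align(X_src, X_tgt, d):
    """Project source (low-stakes) and target (high-stakes) samples into
    d-dimensional PCA subspaces and align the source subspace via Eq. (1)."""
    pca_src = PCA(n_components=d).fit(X_src)   # source subspace
    pca_tgt = PCA(n_components=d).fit(X_tgt)   # target subspace
    C_s = pca_src.components_.T                # source basis C_S, shape (M, d)
    C_t = pca_tgt.components_.T                # target basis C_T, shape (M, d)
    phi = C_s.T @ C_t                          # closed-form alignment matrix, Eq. (1)
    Z_src = pca_src.transform(X_src) @ phi     # aligned source embeddings
    Z_tgt = pca_tgt.transform(X_tgt)           # target embeddings
    return Z_src, Z_tgt
```

A classifier trained on `Z_src` can then be applied directly to `Z_tgt`, since both sets of embeddings now lie in comparable coordinates.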
We experimented with 9 different unimodal and multimodal unsupervised SA approaches, described below, to identify effective modalities, feature sets, and modeling approaches.

Unimodal audio SA classifiers were trained on the set of all audio features and separately on the MFCC and eGeMAPS feature sets. Unimodal visual SA classifiers were trained on the set of all visual features and separately on the FAU, eye gaze, and head pose feature sets. Multimodal SA classifiers were trained with two approaches: (1) early-fusion and (2) late-fusion. Early-fusion SA classifiers were trained on concatenated feature vectors of all audio-visual features. For late-fusion, separate unimodal SA classifiers were trained on audio and visual features, and a majority vote of their predicted class probabilities determined the final prediction. The classifier for all SA experiments was K Nearest Neighbors (KNN), implemented with scikit-learn [26]. Similar to KNN hyper-parameter tuning for SA in [9], all SA experiments were conducted with 3-fold cross-validation in the source subspace to identify optimal values of the KNN nearest-neighbors hyper-parameter k in the discrete range [1, 30]. An optimal subspace dimension D was determined in the discrete range [1, 10], since smaller values of D avoid computationally expensive eigendecomposition.

For each unimodal and multimodal SA model, we implemented baseline KNN classifiers trained on low-stakes deception data and tested on high-stakes deception data, without alignment, in order to evaluate the effectiveness of SA. For fair comparison, each baseline KNN was implemented with the same hyper-parameter k as the optimal k that was automatically computed during 3-fold cross-validation in the corresponding SA model. To compare our model performance to human deception detection ability, which is at chance level [6], we defined the human performance baseline as a classifier that would achieve 51% accuracy (always predicting deceptive for 55 deceptive videos out of 108 videos).

Aligned with previous research [21, 20, 10], the following metrics were computed to evaluate classifiers: (1) ACC, classification accuracy across the videos; (2) AUC, the probability of the classifier ranking a randomly chosen deceptive sample higher than a randomly chosen truthful one; and (3) F1-score, the weighted average of precision and recall.
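To make this setup concrete, here is a minimal sketch, under our own assumptions, of how one SA model and its no-alignment baseline could be trained and evaluated. It reuses the `subspace_align` sketch above; `X_lo`, `y_lo`, `X_hi`, and `y_hi` are hypothetical placeholders for the 1068-dimensional feature vectors and binary labels (1 = deceptive), with high-stakes labels used only for evaluation.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score


def evaluate_sa_model(X_lo, y_lo, X_hi, y_hi, d):
    """Train KNN on subspace-aligned low-stakes features and evaluate on
    high-stakes features; no high-stakes labels are used for training."""
    Z_lo, Z_hi = subspace_align(X_lo, X_hi, d)

    # 3-fold cross-validation in the source subspace to pick k in [1, 30]
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, 31))}, cv=3)
    search.fit(Z_lo, y_lo)
    k = search.best_params_["n_neighbors"]

    # SA model: predictions for the unlabeled high-stakes target domain
    preds = search.best_estimator_.predict(Z_hi)
    probs = search.best_estimator_.predict_proba(Z_hi)[:, 1]

    # Baseline without SA: same k, raw 1068-dim feature space, no alignment
    baseline = KNeighborsClassifier(n_neighbors=k).fit(X_lo, y_lo)
    baseline_preds = baseline.predict(X_hi)

    return {"ACC": accuracy_score(y_hi, preds),
            "AUC": roc_auc_score(y_hi, probs),
            "F1": f1_score(y_hi, preds, average="weighted"),
            "baseline_ACC": accuracy_score(y_hi, baseline_preds)}
```

The subspace dimension D could be selected by sweeping `d` over [1, 10] in the same manner; the exact selection procedure shown here is an assumption consistent with, but not copied from, the paper's description.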
5. RESULTS AND DISCUSSION
Modeling results from classification experiments are presented in Table 1 and visualized in Fig. 2. All unimodal and multimodal unsupervised SA models substantially outperformed the baseline models without SA and the human performance chance level, demonstrating the effectiveness of unsupervised SA for modeling high-stakes deceptive behavior without any high-stakes labels. Significance values of differences in model performance were computed with McNemar's test (α = 0.05) with continuity correction [27].

Table 1: Classification results (ACC, AUC, F1) for each unimodal audio, unimodal visual, and multimodal audio-visual SA model and its corresponding baseline without SA; asterisks mark SA models whose performance differs significantly from their baselines. The audio SA model trained on all audio features (MFCC + eGeMAPS) achieved 0.63 ACC, 0.65 AUC, and 0.69 F1.

The unimodal visual SA model, trained on all visual features, had the highest performance (hyper-parameters D = 6 and k = 3). This model achieved an ACC of 74%, AUC of 75%, and F1-score of 73%, and significantly outperformed the visual baseline model without SA. Among the visual feature sets, SA models trained on eye gaze features outperformed SA models trained on the facial action unit and head pose features. These results suggest that representations of visual behavior, in particular eye gaze, demonstrate more transfer potential than audio and audio-visual representations in subspace-based transfer learning approaches for detecting deception across different people in different social contexts.

Our unsupervised visual SA model performed comparably to existing fully-supervised, automated approaches that used the same dataset (ACC 75% [11], 77% [13], 79% [12]; AUC 70% [10]). While some prior fully-supervised, automated approaches [21, 15, 20] outperformed our unsupervised SA models, our findings support the potential for introducing unsupervised SA to address the data scarcity problem of modeling high-stakes deception. Our results support our hypothesis that audio-visual representations of low-stakes deception in lab-controlled situations can be leveraged by SA to detect high-stakes deception in real-world situations.
Fig. 2: Classification accuracies of each unimodal and multimodal SA model (blue) compared to baseline models (orange) and the chance-level human performance (dashed line).

To analyze audio-visual behavioral signals that are similar in distribution across low-stakes and high-stakes deception contexts, we conducted a two-tailed, independent-samples Welch's t-test between each feature's low-stakes and high-stakes distributions, without assuming equal variances [28]. The following 10 behavioral signals exhibited the most similarity in distribution: jitter, left eye gaze, pitch, yaw of head pose, roll of head pose, right eye gaze, the ratio of the first two harmonics of the fundamental frequency, and MFCC coefficients 0, 7, and 8. The similar distributions of these features across different groups of people and different social contexts demonstrate their potential for use as transferable cues in models for cross-situational deception detection.
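A minimal sketch of this per-feature similarity analysis, assuming hypothetical feature matrices `F_lo` and `F_hi` (rows = videos, columns = the 1068 time-series attributes) and a matching list of attribute names:

```python
import numpy as np
from scipy.stats import ttest_ind


def most_similar_features(F_lo, F_hi, feature_names, top_n=10):
    """Welch's t-test (two-tailed, unequal variances) per feature column;
    larger p-values indicate distributions the test could not distinguish
    across low-stakes and high-stakes contexts."""
    _, p_values = ttest_ind(F_lo, F_hi, axis=0, equal_var=False)
    order = np.argsort(p_values)[::-1]          # most similar (largest p) first
    return [(feature_names[i], p_values[i]) for i in order[:top_n]]
```

Ranking by p-value in this way is one reasonable reading of the analysis described above; the paper does not specify the exact ranking code.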
6. CONCLUSION
This paper proposes the first unsupervised transfer learning approach for detecting real-world, high-stakes deception in videos without using high-stakes labels. Our subspace-alignment models addressed the data scarcity problem of modeling high-stakes deception by adapting audio-visual representations of deception in lab-controlled low-stakes scenarios to detect deception in real-world, high-stakes situations. We contribute effective modalities, modeling approaches, and audio-visual cues for high-stakes deception detection. Our research demonstrates the potential for introducing subspace-based transfer learning approaches to model high-stakes deception and other social behaviors in real-world contexts, when faced with a scarcity of labeled behavioral data.
7. ACKNOWLEDGEMENTS
We thank the Rochester HCI Group for sharing their low-stakes deception dataset. This research was supported by the USC Provost's Undergraduate Research Fellowship.

8. REFERENCES

[1] T. Baltrusaitis, C. Ahuja, and L.P. Morency, "Multimodal machine learning: A survey and taxonomy," IEEE Trans. PAMI, vol. 41, no. 2, pp. 423–443, Feb. 2019.
[2] M. Frank and P. Ekman, "The ability to detect deceit generalizes across different types of high-stake lies," J. Pers. Soc. Psychol., vol. 72, pp. 1429–1439, 1997.
[3] S. Porter and L. ten Brinke, "The truth about lies: What works in detecting high-stakes deception?," Legal and Criminological Psych., vol. 15, no. 1, pp. 57–75, 2010.
[4] V. Ardulov, Z. Durante, S. Williams, T. Lyon, and S. Narayanan, "Identifying truthful language in child interviews," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[5] M. Burzo et al., Multimodal Deception Detection, pp. 419–453, Association for Computing Machinery, 2018.
[6] C.F. Bond and B.M. DePaulo, "Accuracy of deception judgments," Personality and Social Psych. Rev., vol. 10, no. 3, pp. 214–234, 2006.
[7] T. Sen et al., "Automated dyadic data recorder (ADDR) framework and analysis of facial cues in deceptive communication," ACM Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 4, Jan. 2018.
[8] S.J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[9] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, "Unsupervised visual domain adaptation using subspace alignment," in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2960–2967.
[10] R. Rill-García et al., "High-level features for multimodal deception detection in videos," 2019, pp. 1565–1573.
[11] V. Pérez-Rosas et al., "Deception detection using real-life trial data," in ACM International Conference on Multimodal Interaction (ICMI), 2015, pp. 59–66.
[12] M. Jaiswal, S. Tabibu, and R. Bajpai, "The truth and nothing but the truth: Multimodal analysis for deception detection," 2016, pp. 938–943.
[13] D. Avola et al., "Automatic deception detection in RGB videos using facial action units," in ACM International Conference on Distributed Smart Cameras, 2019.
[14] H. Karimi, J. Tang, and Y. Li, "Toward end-to-end deception detection in videos," in IEEE International Conference on Big Data, 2018, pp. 1278–1283.
[15] M. Ding, A. Zhao, Z. Lu, T. Xiang, and J. Wen, "Face-focused cross-stream network for deception detection in videos," in IEEE CVPR, 2019, pp. 7794–7803.
[16] K. Hasan et al., "Facial expression based imagination index and a transfer learning approach to detect deception," in International Conference on Affective Computing and Intelligent Interaction (ACII), 2019.
[17] Q. Luo, R. Gupta, and S. Narayanan, "Transfer learning between concepts for human behavior modeling: An application to sincerity and deception prediction," in Interspeech, 2017, pp. 1462–1466.
[18] Z. Yang, B. Gong, and S. Narayanan, "Weighted geodesic flow kernel for interpersonal mutual influence modeling and emotion recognition in dyadic interactions," in International Conference on Affective Computing and Intelligent Interaction (ACII), 2017.
[19] P. Song and W. Zheng, "Feature selection based transfer subspace learning for speech emotion recognition," IEEE Trans. Affect. Comput., vol. 11, no. 3, pp. 373–382, 2020.
[20] Z. Wu et al., "Deception detection in videos," in Proceedings of AAAI, 2018, pp. 1695–1702.
[21] L. Mathur and M.J. Matarić, "Introducing representations of facial affect in automated multimodal deception detection," in ACM International Conference on Multimodal Interaction (ICMI), 2020, pp. 305–314.
[22] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in ACM International Conference on Multimedia, 2013, pp. 835–838.
[23] F. Eyben et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Trans. Affect. Comput., vol. 7, no. 2, pp. 190–202, 2016.
[24] T. Baltrusaitis, A. Zadeh, Y.C. Lim, and L. Morency, "OpenFace 2.0: Facial behavior analysis toolkit," in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2018, pp. 59–66.
[25] M. Christ et al., "Time series feature extraction on basis of scalable hypothesis tests," Neurocomputing, 2018.
[26] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," JMLR, vol. 12, pp. 2825–2830, 2011.
[27] W. Dupont and W. Plummer, "Power and sample size calculations: A review and computer program," Controlled Clinical Trials, vol. 11, pp. 116–128, 1990.
[28] M. Delacre, D. Lakens, and C. Leys, "Why psychologists should by default use Welch's t-test instead of Student's t-test," International Review of Social Psychology, vol. 30, no. 1, 2017.