Inverted Vocal Tract Variables and Facial Action Units to Quantify Neuromotor Coordination in Schizophrenia
Yashish Maduwantha H.P.E.R.S, Chris Kitchen, Deanna L. Kelly, Carol Espy-Wilson
University of Maryland College Park; University of Maryland School of Medicine
[email protected], [email protected], [email protected], [email protected]
Abstract
This study investigates speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g., hallucinations and delusions), using a time-delay embedded correlation analysis. We show that schizophrenia subjects who are markedly ill and exhibit strong positive symptoms produce more complex coordination patterns in facial and speech gestures than healthy subjects. This observation contrasts with what previous studies have shown in Major Depressive Disorder (MDD), where subjects with MDD show a simpler coordination pattern relative to healthy controls or to subjects in remission. This difference is not surprising, given that MDD is necessarily accompanied by psychomotor slowing (i.e., negative symptoms), which affects speech, ideation and motility. With respect to speech, psychomotor slowing results in slowed speech with more and longer pauses than occur in speech from the same speaker in remission or from a healthy subject. Time-delay embedded correlation analysis is used to quantify these differences in the coordination patterns of speech articulation. The current study is based on 17 Facial Action Units (FAUs) extracted from video data and 6 Vocal Tract Variables (TVs) obtained from simultaneously recorded audio data. The TVs are extracted using a speech inversion system based on articulatory phonology that maps the acoustic signal to vocal tract variables. The high-level time-delay embedded correlation features computed from the TVs and FAUs are used to train a stacking ensemble classifier that fuses the audio and video modalities. The results show a promising distinction between healthy subjects and schizophrenia subjects with strong positive symptoms in terms of neuromotor coordination in speech.
Keywords:
Schizophrenia, Positive symptoms, Facial Action Units, Vocal Tract Variables, Neuromotor coordination
1. Introduction
Schizophrenia is a chronic mental disorder with heterogeneous presentations that affects around 60 million (1%) of the world's adult population (Kuperberg 2010). Symptoms of schizophrenia are broadly categorized as positive, which are pathological functions not present in healthy individuals (e.g., hallucinations and delusions); negative, which involve the loss of functions or abilities (e.g., apathy, lack of pleasure, blunted affect and poor thinking); and cognitive (deficits in attention, memory and executive functioning) (Andreasen and Olsen 1982; Demily and Franck 2008). Previous studies have found that individuals suffering from major depressive disorder (MDD) undergo neurophysiological changes that often alter motor control and thus affect the mechanisms controlling speech production and facial expression. Clinically, these changes are associated with psychomotor slowing, a condition of slowed neuromotor output that causes slowed speech, decreased movement and impaired cognitive function (Buyukdura, McClintock, and Croarkin 2011). Previous studies have shown promising results in identifying the severity of depression using coordination features based on the correlation structure of the movements of various articulators (Espy-Wilson et al. 2019). This motivated us to investigate how neuromotor coordination is altered in schizophrenia patients who are markedly ill and exhibit strong positive symptoms, by analyzing facial activity and speech gestures.

Previous studies in MDD have used vocal tract variables extracted from audio data (Seneviratne et al. n.d.) and facial action units extracted from video data (Williamson, Young, et al. 2019) as low-level features to classify subjects with MDD from healthy controls. Time-delay embedded correlation (TDEC) analysis has shown promising results in assessing neuromotor coordination in MDD, and normalized eigenspectra derived from the low-level features have been used to develop those classifiers (Williamson, Young, et al. 2019; Seneviratne et al. n.d.; Williamson, Quatieri, et al. 2014). In this study, we extend these experiments to assess neuromotor coordination in the speech of subjects with strong positive symptoms of schizophrenia. We also show that fusing the audio and video modalities into a multi-modal system yields better classification metrics.

In Section 2, we explain the dataset, the estimation of the FAUs and TVs, the computation of the coordination features, and the details of the classification experiments. Section 3 describes our results in terms of eigenspectra plots and classification outcomes. Interpretation of the results and planned future studies are described in Section 4.
2. Methods
2.1. Dataset

A database recently collected for a collaborative observational study conducted by the University of Maryland School of Medicine and the University of Maryland College Park was used for this study (Kelly et al. 2020). The database contains video and audio recordings of free-response assessments administered in an interview format. Data were collected from 23 schizophrenia patients, 18 patients with MDD and 20 healthy controls. All of the schizophrenia and MDD patients were clinically diagnosed. Every subject participated in four interview sessions over a period of six weeks. Each interview session is 10-45 minutes long, and in every session each subject is assessed with standard depression severity measures and global psychopathology measures, both by a clinician and by self-report. For this study, we used the clinician assessments based on the 18-item Brief Psychiatric Rating Scale (BPRS), where we selected subjects based on the total BPRS score, the subscores for psychosis (BPRS items 11, 12, 4 and 15) and activation (BPRS items 6, 7 and 17), and the Hamilton Rating Scale for Depression (HAMD). Table 1 lists the details of the dataset, and Table 2 presents the subset of data used for our study. The 6 schizophrenia subjects were selected such that they are markedly ill (BPRS total >= 45); the MDD subjects are markedly depressed (HAMD >= 20) but are not schizophrenic (BPRS < 32). For this preliminary study we only used data from a single session of each patient's visits.

Table 1: Details on the UMCP-UMB dataset

Longitudinal        5 weeks
Number of Subjects  31 Male, 30 Female
Demography          26 African American, 28 Caucasian, 5 Asian
Assessment          HDRS, MADRS, BPRS, CAPE-42
Recording Type      Video and Audio
Session Length      10-50 mins
Table 2: Details on the subset of data used for the study

                     SZ      HC      MDD
Number of Subjects   6       6       3
BPRS score range     >= 45

2.2. Vocal Tract Variables

We used a speech inversion system (Sivaraman et al. 2016; Sivaraman 2017) developed based on Articulatory Phonology (AP) (Browman and Goldstein 1992) that maps the acoustic signal to vocal tract variables (TVs). The TVs define the kinematic state of each constrictor by its corresponding constriction degree and location coordinates (see Table 3 and Figure 1 for more details). The speech inversion system samples the TVs at a 100 Hz sampling rate.

Table 3: List of TVs and constrictors

Constrictor    Vocal Tract Variables (TVs)
Lip            Lip Aperture (LA), Lip Protrusion (LP)
Tongue Tip     Tongue tip constriction degree (TTCD), Tongue tip constriction location (TTCL)
Tongue Body    Tongue body constriction degree (TBCD), Tongue body constriction location (TBCL)
Velum          Velum (VEL)
Glottis        Glottis (GLO)

[Figure 1: Visual representation of the vocal tract variables at five distinct constriction organs (taken from Saltzman and Munhall (1989)), along with a listing of constrictors and their vocal tract variables. See Table 3 for the TV labels.]

2.3. Facial Action Units

Video-based Facial Action Units (FAUs) provide a formalized method for identifying changes in facial expressions. We used the OpenFace 2.0 Facial Behavior Analysis toolkit (Baltrusaitis et al. 2018) to extract seventeen FAUs (FAUs 1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26 and 45, as defined in the FACS coding system (Prince, Martin, and Messinger 2015)) from the videos of the subjects recorded during the interviews. The FAU features were sampled at a rate of 28 frames per second. We only analyzed those portions of the video in which the subject was talking: the features computed by the toolkit for the entire video were segmented based on timestamps extracted from manually transcribed transcripts of the audio and the relevant speaker ID for the subject.

2.4. Correlation structure features

Coordination among the seventeen FAUs and among the six TVs (LA, LP, TTCD, TTCL, TBCD and TBCL) was estimated using correlation structure features. These features are estimated by computing a channel-delay correlation matrix using time-delay embedding at a fixed delay scale (Espy-Wilson et al. 2019; Williamson, Young, et al. 2019). For the FAUs, a delay scale of 3 samples was chosen, corresponding to 3/28 ≈ 107 ms; for the TVs, a delay scale of 7 samples was chosen, corresponding to 7/100 = 70 ms. With 17 channels and 15 time delays per channel, each FAU correlation matrix has a dimensionality of 255 x 255; with 6 channels and 15 time delays per channel, each TV correlation matrix is 90 x 90 dimensional. After speaker diarization, only those segments of the subject's speech longer than 5 seconds were used to calculate the correlation features.

From the correlation matrix R_i calculated for each sample i, the eigenspectrum is computed. The eigenspectrum generated from the FAUs is a 255-dimensional vector, rank-ordered in descending order of eigenvalue magnitude from index j = 1, ..., 255; the eigenspectrum generated from the TVs is a 90-dimensional vector, rank-ordered from index j = 1, ..., 90. The eigenspectrum can be considered a high-level feature designed to characterize properties of coordination and timing from the low-level features (Williamson, Young, et al. 2019): it captures the within-channel and cross-channel distributional properties of the multivariate FAU and TV time series. The magnitude of an eigenvalue represents the average correlation in the direction of the corresponding eigenvector, so the significant eigenvalues indicate the number of independent dimensions needed to represent speech belonging to different groups. A few significant eigenvalues therefore imply a simpler articulatory coordination pattern, whereas a large number of significant eigenvalues corresponds to more complex articulatory coordination.
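To make this construction concrete, the following minimal sketch (ours, not the authors' released code; Python with NumPy is an assumption, and the function name and the sum-to-one normalization are illustrative stand-ins for the normalized eigenspectrum equations of Espy-Wilson et al. (2019)) computes the channel-delay correlation matrix and its rank-ordered eigenspectrum for one talking segment, using the delay settings stated above:

```python
import numpy as np

def tdec_eigenspectrum(signals, delay_scale, num_delays=15):
    """Time-delay embedded correlation (TDEC) eigenspectrum.

    signals:     (num_channels, num_samples) array, e.g. the 6 TVs
                 sampled at 100 Hz or the 17 FAUs at 28 fps.
    delay_scale: spacing between successive delays, in samples
                 (7 for TVs, i.e. 70 ms; 3 for FAUs, i.e. ~107 ms).
    num_delays:  delayed copies per channel (15 here), giving a
                 square matrix of side num_channels * num_delays:
                 6 * 15 = 90 for TVs, 17 * 15 = 255 for FAUs.
    """
    num_channels, num_samples = signals.shape
    # Number of samples common to all time-shifted copies.
    span = num_samples - delay_scale * (num_delays - 1)
    # Stack the time-shifted copies of every channel row by row.
    embedded = np.vstack([
        signals[c, d * delay_scale : d * delay_scale + span]
        for c in range(num_channels)
        for d in range(num_delays)
    ])
    corr = np.corrcoef(embedded)  # channel-delay correlation matrix R_i
    # Rank-order the (real) eigenvalues in descending magnitude and
    # normalize them to sum to one.
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    return corr, eigvals / eigvals.sum()

# Example: a 10 s TV segment (6 channels at 100 Hz) yields the
# 90 x 90 matrix and 90-dimensional eigenspectrum described above.
tvs = np.random.randn(6, 1000)
R, spectrum = tdec_eigenspectrum(tvs, delay_scale=7)
```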
2.5. Classification experiments

From Table 2, all 6 schizophrenia subjects and all 6 healthy controls were chosen to train a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel. The classifier was trained on the coordination features computed over the FAUs and TVs to classify a given subject as a schizophrenia subject or a healthy control. Eigenvalues averaged over multiple index ranges of the normalized eigenspectrum (the equations used for calculating the normalized eigenspectra are from Espy-Wilson et al. (2019)) were used as the input features to the classifier. The calculated features are standardized across all instances before model training and testing. We first trained individual SVM models on the eigenspectra features of the TVs and the FAUs, and then trained a fused model combining the TV and FAU features using a stacking ensemble model (Wolpert 1992). The SVM models were trained and evaluated in a leave-one-subject-out cross-validation fashion with a total of 12 folds, and the average accuracy and F1 scores were computed across all folds.
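As a rough sketch of this experimental setup, the scikit-learn pipeline below trains one RBF SVM per modality, fuses them with a stacking ensemble, and evaluates with leave-one-subject-out cross-validation. The band-averaging helper, the hyperparameters, and all variable names are our illustrative assumptions rather than the authors' exact code; the toy placeholder data merely makes the sketch self-contained:

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

def band_averages(spectrum, ranges):
    """Average a normalized eigenspectrum over fractional index ranges,
    e.g. [(0.00, 0.02), (0.96, 1.00)] for the FAU model."""
    n = len(spectrum)
    return np.array([
        spectrum[int(lo * n): max(int(hi * n), int(lo * n) + 1)].mean()
        for lo, hi in ranges
    ])

# Toy placeholders: in practice each row is band_averages() applied to
# one subject's eigenspectra; y: 1 = schizophrenia, 0 = healthy;
# groups: subject IDs (12 subjects -> 12 leave-one-subject-out folds).
rng = np.random.default_rng(0)
X_tv = rng.normal(size=(12, 2))
X_fau = rng.normal(size=(12, 2))
y = np.array([1] * 6 + [0] * 6)
groups = np.arange(12)

X = np.hstack([X_tv, X_fau])
n_tv = X_tv.shape[1]

# Each base SVM sees only its own modality's columns of X.
svm_tv = make_pipeline(FunctionTransformer(lambda Z: Z[:, :n_tv]),
                       StandardScaler(), SVC(kernel="rbf"))
svm_fau = make_pipeline(FunctionTransformer(lambda Z: Z[:, n_tv:]),
                        StandardScaler(), SVC(kernel="rbf"))
fused = StackingClassifier(estimators=[("tv", svm_tv), ("fau", svm_fau)])

pred = cross_val_predict(fused, X, y, groups=groups, cv=LeaveOneGroupOut())
print(accuracy_score(y, pred),
      f1_score(y, pred, pos_label=1),   # F1(S)
      f1_score(y, pred, pos_label=0))   # F1(H)
```

In this reading of the fusion, the stacking meta-learner (scikit-learn's default logistic regression) learns how to weight the per-modality SVM scores, which is one plausible rendering of the multi-modal model described above.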
3. Results

In the first experiment, we compared 3 subjects from each of the schizophrenia, MDD and healthy groups by calculating the eigenspectra from the TVs and FAUs and their corresponding difference plots. Figure 2 shows the averaged eigenspectra plot and the corresponding difference plot obtained from the TVs. The eigenvalues are plotted on a logarithmic scale, and the plot is zoomed in at the low and high rank indices to show where the curves lie with respect to each other. The difference curves for schizophrenia and MDD in the difference plot are calculated relative to the healthy group.

[Figure 2: (a) Averaged eigenspectra from TVs; (b) corresponding difference plot.]

Figure 3 shows the eigenspectra and difference plots obtained from both the TVs and the FAUs for the classification experiment. It also confirms the agreement between the coordination patterns seen in the TVs and FAUs for the schizophrenia subjects.

[Figure 3: Averaged eigenspectra for TVs and FAUs (left) and corresponding difference plots (right) for the classification experiments.]

Table 4 shows the average accuracies and F1 scores obtained from the classification experiments described in Section 2.5. The highest accuracy of 68.19% (F1 scores of 70.12 for the schizophrenia group and 65.23 for the healthy group) was achieved by the fused model, a promising improvement with respect to the individual modalities.

Table 4: Classification results

Method        Index range           Accuracy   F1(S)/F1(H)
FAU           [0-0.02], [0.96-1]    65.63 %    67.89/61.37
TV            [0-0.03], [0.95-1]    61.68 %    63.45/59.21
Multi-modal   -                     68.19 %    70.12/65.23

4. Discussion

Figure 2 shows that the low rank eigenvalues are larger for the MDD subjects relative to the schizophrenia patients and the healthy controls, and that this trend is reversed towards the high rank eigenvalues. This pattern is a key observation associated with depression severity (Williamson, Quatieri, et al. 2014; Williamson, Young, et al. 2019; Espy-Wilson et al. 2019). The magnitude of the high rank eigenvalues indicates the dimensionality of the time-delay embedded feature space. Thus, larger values of the high rank eigenvalues can be associated with greater complexity of articulatory coordination (Espy-Wilson et al. 2019). We can therefore conclude that the schizophrenia subjects with strong positive symptoms have a higher articulatory coordination complexity than the healthy controls and the MDD patients, and that the MDD patients have a simpler articulatory coordination pattern relative to the healthy controls and the schizophrenia patients. These results are likely due to the negative symptoms of depression, which result in psychomotor slowing (i.e., simpler coordination), and the strong positive symptoms of the schizophrenia patients, such as activation, which result in motor hyperactivity (i.e., more complex coordination). We see this effect in the eigenvalues computed from both the FAUs and the TVs.

The classification results indicate a notable discrimination between the coordination features of schizophrenia subjects with strong positive symptoms and those of healthy subjects. From this preliminary study, we observe that the facial gestures were more effective than the TVs in the classification experiments. This could be because they include a wider range of facial muscle movements, not limited to those around the speech articulators. Previous studies have shown that some FAUs are significant in understanding depression severity (Girard et al. 2013). Following that line, attention-based deep learning models could be developed to select the most discriminative set of FAUs, which could further improve the performance of the classification models. Finally, as Seneviratne et al. (n.d.) have shown, the performance of the TV-based classification models in detecting subjects with severe depression can be improved by adding glottal TVs to the constriction degree and location TVs. We will investigate the use of these glottal TVs, as well as the velar TV, to obtain a full representation of the speech gestures and their coordination.

It should also be noted that Tron et al. (2016) found a strong correlation between negative symptoms of schizophrenia (e.g., blunted affect) and various facial dynamics. Further, other studies (Trémeau et al. 2005) have compared schizophrenia patients and depressed patients with healthy controls using facial expressiveness in terms of negative symptoms; Trémeau et al. (2005) observed similar deficits in both the depressed and the schizophrenia subjects. Our study, which focused on differentiating subjects with strong positive symptoms based on coordination features, presents the first evidence that the positive symptoms of schizophrenia can be characterized by a complex articulatory coordination pattern of the speech and facial gestures.

In future work, we plan to validate these preliminary findings using a larger dataset. We are also working on developing a multi-modal Convolutional Neural Network (CNN) based deep learning model in which the correlation matrices are fed directly to perform classification.

5. Acknowledgements

This work was supported by a UMCP-UMB AI + Medicine for High Impact (AIM-HI) Challenge Award. We would like to thank our AIM-HI group for valuable discussions and for providing the transcripts of the clinical interviews.
6. References

Andreasen, Nancy C. and Scott Olsen (1982). "Negative v Positive Schizophrenia: Definition and Validation". In: Archives of General Psychiatry. DOI: 10.1001/archpsyc.1982.04290070025006. eprint: https://jamanetwork.com/journals/jamapsychiatry/articlepdf/492832/archpsyc_39_7_006.pdf.

Baltrusaitis, T., A. Zadeh, Y. C. Lim, and L. Morency (2018). "OpenFace 2.0: Facial Behavior Analysis Toolkit". In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59-66.

Browman, Catherine P. and Louis Goldstein (1992). "Articulatory Phonology: An Overview". In: Phonetica 49, pp. 155-180.

Buyukdura, J. S., S. M. McClintock, and P. E. Croarkin (2011). "Psychomotor retardation in depression: biological underpinnings, measurement, and treatment". In: Prog. Neuropsychopharmacol. Biol. Psychiatry.

Demily, Caroline and Nicolas Franck (2008). "Cognitive remediation: a promising tool for the treatment of schizophrenia". In: Expert Review of Neurotherapeutics.

Espy-Wilson, Carol, Adam C. Lammert, Nadee Seneviratne, and Thomas F. Quatieri (2019). "Assessing Neuromotor Coordination in Depression Using Inverted Vocal Tract Variables". In: Proc. Interspeech 2019, pp. 1448-1452. DOI: 10.21437/Interspeech.2019-1815.

Girard, Jeffrey, Jeffrey Cohn, Mohammad Mahoor, Seyedmohammad Mavadati, and Dean Rosenwald (2013). "Social Risk and Depression: Evidence from Manual and Automatic Facial Expression Analysis". In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1-8.

Kelly, Deanna L., Max Spaderna, Vedrana Hodzic, Suraj Nair, Christopher Kitchen, Anne E. Werkheiser, Megan M. Powell, Fang Liu, Glen Coppersmith, Shuo Chen, and Philip Resnik (2020). "Blinded Clinical Ratings of Social Media Data are Correlated with In-Person Clinical Ratings in Participants Diagnosed with Either Depression, Schizophrenia, or Healthy Controls". In: Psychiatry Research. DOI: 10.1016/j.psychres.2020.113496.

Kuperberg, G. R. (2010). "Language in schizophrenia Part 1: an Introduction". In: Lang Linguist Compass.

Saltzman, Elliot L. and Kevin G. Munhall (1989). "A Dynamical Approach to Gestural Patterning in Speech Production". In: Ecological Psychology 1, pp. 333-382. DOI: 10.1207/s15326969eco0104_2.

Seneviratne, Nadee, James R. Williamson, Adam C. Lammert, Thomas F. Quatieri, and Carol Espy-Wilson (n.d.). "Extended Study on the Use of Vocal Tract Variables to Quantify Neuromotor Coordination in Depression". In: Submitted to Interspeech 2020.

Sivaraman, Ganesh (2017). "Articulatory representations to address acoustic variability in speech". PhD thesis. University of Maryland College Park.

Sivaraman, Ganesh, Vikramjit Mitra, Hosung Nam, Mark K. Tiede, and Carol Y. Espy-Wilson (2016). "Vocal Tract Length Normalization for Speaker Independent Acoustic-to-Articulatory Speech Inversion". In: Proceedings of Interspeech, pp. 455-459.

Trémeau, Fabien, Dolores Malaspina, Fabrice Duval, Humberto Corrêa, Michaela Hager-Budny, Laura Coin-Bariou, Jean-Paul Macher, and Jack Gorman (2005). "Facial Expressiveness in Patients With Schizophrenia Compared to Depressed Patients and Nonpatient Comparison Subjects". In: The American Journal of Psychiatry.

Tron, Talia, Abraham Peled, Alexander Grinsphoon, and Daphna Weinshall (2016). "Automated Facial Expressions Analysis in Schizophrenia: A Continuous Dynamic Approach". In: Pervasive Computing Paradigms for Mental Health. Ed. by Silvia Serino, Aleksandar Matic, Dimitris Giakoumis, Guillaume Lopez, and Pietro Cipresso. Cham: Springer International Publishing, pp. 72-81.

Williamson, James R., Thomas F. Quatieri, Brian S. Helfer, Gregory Ciccarelli, and Daryush D. Mehta (2014). "Vocal and Facial Biomarkers of Depression Based on Motor Incoordination and Timing". In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge (AVEC '14). Orlando, Florida, USA: Association for Computing Machinery, pp. 65-72. DOI: 10.1145/2661806.2661809.

Williamson, James R., Diana Young, Andrew A. Nierenberg, James Niemi, Brian S. Helfer, and Thomas F. Quatieri (2019). "Tracking depression severity from audio and video based on speech articulatory coordination". In: Computer Speech & Language 55, pp. 40-56. DOI: 10.1016/j.csl.2018.08.004.

Wolpert, David H. (1992). "Stacked generalization". In: Neural Networks. DOI: 10.1016/S0893-6080(05)80023-1.