Automated Evaluation Of Psychotherapy Skills Using Speech And Language Technologies
Nikolaos Flemotomos, Victor R. Martinez, Zhuohao Chen, Karan Singla, Victor Ardulov, Raghuveer Peri, Derek D. Caperton, James Gibson, Michael J. Tanana, Panayiotis Georgiou, Jake Van Epps, Sarah P. Lord, Tad Hirsch, Zac E. Imel, David C. Atkins, Shrikanth Narayanan
Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, USA; Department of Computer Science, University of Southern California, Los Angeles, CA, USA; Department of Educational Psychology, University of Utah, Salt Lake City, UT, USA; Behavioral Signal Technologies Inc., Los Angeles, CA, USA; College of Social Work, University of Utah, Salt Lake City, UT, USA; University Counseling Center, University of Utah, Salt Lake City, UT, USA; Department of Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA, USA; Department of Art + Design, Northeastern University, Boston, MA, USA

Author Note: James Gibson performed this work while at the University of Southern California. David C. Atkins, Shrikanth Narayanan, Zac E. Imel, and Panayiotis Georgiou conceived the idea of the described system. James Gibson developed a first prototype of the processing pipeline. Nikolaos Flemotomos, Victor R. Martinez, Zhuohao Chen, Karan Singla, Victor Ardulov, and Raghuveer Peri worked on training, adapting, and evaluating the various individual modules and the end-to-end system. Michael J. Tanana helped testing the system in real-world settings, which led to several revisions and improvements. Jake Van Epps played a crucial role on data collection. Derek D. Caperton and Sarah P. Lord led the efforts in human behavioral coding. Tad Hirsch led the efforts of designing the graphical interface. Nikolaos Flemotomos reviewed the literature and drafted the original manuscript. All authors provided critical input, discussion, and revision of the manuscript. Michael J. Tanana, Tad Hirsch, Zac E. Imel, David C. Atkins, and Shrikanth Narayanan are co-founders with equity stake in a technology company, Lyssn.io, focused on tools to support training, supervision, and quality assurance of psychotherapy and counseling. Sarah P. Lord also holds an equity stake in Lyssn.io. Shrikanth Narayanan is Chief Scientist and co-founder with equity stake in Behavioral Signal Technologies, a company focused on creating technologies for emotional and behavioral machine intelligence. James Gibson is a Senior Machine Learning Engineer in Behavioral Signal Technologies. The remaining authors report no conflicts of interest. Correspondence concerning this article should be addressed to Nikolaos Flemotomos, Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089. Contact: fl[email protected]
With the growing prevalence of psychological interventions, it is vital to have measures which rate the effectiveness of psychological care, in order to assist in training, supervision, and quality assurance of services. Traditionally, quality assessment is addressed by human raters who evaluate recorded sessions along specific dimensions, often codified through constructs relevant to the approach and domain. This is, however, a cost-prohibitive and time-consuming method which leads to poor feasibility and limited use in real-world settings. To facilitate this process, we have developed an automated competency rating tool able to process the raw recorded audio of a session, analyzing who spoke when, what they said, and how the health professional used language to provide therapy. Focusing on a use case of a specific type of psychotherapy called Motivational Interviewing, our system gives comprehensive feedback to the therapist, including information about the dynamics of the session (e.g., therapist's vs. client's talking time), low-level psychological language descriptors (e.g., type of questions asked), as well as other high-level behavioral constructs (e.g., the extent to which the therapist understands the clients' perspective). We describe our platform and its performance, using a dataset of more than 5,000 recordings drawn from its deployment in a real-world clinical setting used to assist training of new therapists. We are confident that a widespread use of automated psychotherapy rating tools in the near future will augment experts' capabilities by providing an avenue for more effective training and skill improvement and will eventually lead to more positive clinical outcomes.

Keywords: quality assessment, psychotherapy, motivational interviewing, machine learning, speech processing, MISC
Need for Psychotherapy Quality Assessment Tools
Recent epidemiological research suggests that developing some kind of mental disorder is the norm, rather than the exception, estimating that the lifetime prevalence of diagnosable mental disorders (i.e., the proportion of the population that, at some point in their life, have experienced or will experience a mental disorder) is around 50% (Kessler et al., 2005) or even more (Schaefer et al., 2017). According to data from 2018, an estimated 47.6 million adults in the United States had some mental illness, while about 1 in 7 adults received mental health services (Substance Abuse and Mental Health Services Administration, 2019).

Psychotherapy is a commonly used process in which mental health disorders are treated through communication between an individual and a trained mental health professional. Even though the positive effects of psychotherapy have been well studied and documented (Lambert & Bergin, 2002; Perry, Banon, & Ianni, 1999; Weisz, Weiss, Han, Granger, & Morton, 1995), there is definitely room for improvement in terms of the quality of services provided. In particular, a substantial number of patients report negative outcomes, with signs of mental health deterioration after the end of therapy (Curran et al., 2019; Klatte, Strauss, Flückiger, & Rosendahl, 2018). It has been argued that, apart from patient characteristics (Lambert & Bergin, 2002), therapist factors (Saxon, Barkham, Foster, & Parry, 2017) also play a significant and clinically important role in contributing to negative outcomes. This has direct implications for more rigorous training and supervision (Lambert & Ogles, 1997), quality improvement, and skill development. A critical factor that can lead to increased performance and thus ensure high quality of services is the provision of accurate feedback to the practitioner (Hattie & Timperley, 2007). This can take various forms; both client progress monitoring (Lambert, Whipple, & Kleinstäuber, 2018) and performance-based feedback (Schwalbe, Oh, & Zweben, 2014) have been reported to reduce therapeutic skill erosion and to contribute to improved clinical outcomes. The timing of the feedback is of utmost importance as well, since it has been shown that immediate feedback is more effective than delayed (Kulik & Kulik, 1988).

In psychotherapy practice, however, providing regular and immediate performance evaluation is almost impossible most of the time. Behavioral coding—the process of listening to audio recordings and/or reading session transcripts in order to observe therapists' behaviors and assess interviewing and interpersonal skills (Bakeman & Quera, 2012)—is both time-consuming and cost-prohibitive when applied in real-world settings. It has been reported (Moyers, Martin, Manuel, Hendrickson, & Miller, 2005) that, after intensive training and supervision that lasts on average 3 months, a proficient coder would need up to two hours to code just a 20min-long session of Motivational Interviewing (MI), a specific type of psychotherapy which is also the focus of the current study. The labor-intensive nature of coding means that the vast majority of psychotherapy sessions are not evaluated. As a result, many providers get inadequate feedback on their therapy skills after their initial training (Miller, Sorensen, Selzer, & Brigham, 2006) and behavioral coding is mainly applied for research purposes with limited outreach to community settings (Proctor et al., 2011). At the same time, the barriers imposed by manual coding usually lead to research studies with relatively small sample sizes (Magill et al., 2014), limiting the research progress in the field. It is, thus, made apparent that being able to evaluate a therapy session and provide feedback to the practitioner at a low cost and in a timely manner would both boost psychotherapy research and scale up quality assessment to real-world use. In the current work, we investigate whether it is feasible to analyze a therapy session recording in a fully automatic way and provide rich feedback to the therapist within short time.

Behavioral Coding for Motivational Interviewing
Motivational Interviewing (MI) (Miller & Rollnick, 2012), often used for treating addiction and other conditions, is a client-centered intervention that aims at helping clients make behavioral changes through resolution of ambivalence. It is a psychotherapy with evidence supporting that specific skills are correlated with the clinical outcome (Gaume, Gmel, Faouzi, & Daeppen, 2009) and also that those skills cannot be maintained without ongoing feedback (Schwalbe et al., 2014). Thus, great effort from MI researchers has been devoted to developing instruments to evaluate the therapist fidelity to MI techniques.

The gold standard for monitoring clinician fidelity to treatment is behavioral observation and coding (Bakeman & Quera, 2012). During that process, trained coders assign specific labels or numeric values to the psychotherapy session, which are expected to provide important therapy-related details (e.g., "how many open questions were posed by the therapist?" or "did the counselor accept and respect the client's ideas?") and essentially reflect particular therapeutic skills. While a variety of coding schemes has been proposed (Madson & Campbell, 2006), in this study we are focusing on a widely used research tool, the Motivational Interviewing Skill Code (MISC 2.5; Houck, Moyers, Miller, Glynn, and Hallgren (2010)), which was specifically developed for use with recorded MI sessions (Madson & Campbell, 2006). MISC defines behavior codes both for the counselor and the patient, but for the automated system reported in this paper we focus on counselor behaviors.
Psychotherapy Evaluation in the Digital Era
Psychotherapy sessions are interventions primarily based on spoken language, which means that the information capturing the session quality is encoded in the speech signal and the language patterns of the interaction. Thus, with the rapid technological advancements in the fields of Speech and Natural Language Processing (NLP) over the last few years (e.g., Devlin, Chang, Lee, and Toutanova (2019); Xiong et al. (2017)), and despite many open challenges specific to the healthcare domain (Quiroz et al., 2019), it is not surprising to see trends in applying computational techniques to automatically analyze and evaluate therapy sessions.

Such efforts span a wide range of psychotherapeutic approaches including couples therapy (Black et al., 2013), MI (Xiao, Can, et al., 2016) and cognitive behavioral therapy (Flemotomos, Martinez, et al., 2018), used to treat a variety of conditions such as addiction (Xiao, Can, et al., 2016) and post-traumatic stress disorder (Shiner et al., 2012). Both text-based (Imel, Steyvers, & Atkins, 2015; Xiao, Can, Georgiou, Atkins, & Narayanan, 2012) and audio-based (Black et al., 2013; Xiao et al., 2014) behavioral descriptors have been explored in the literature and have been used either unimodally or in combination with each other (Singla et al., 2018). In this study we focus on behavior code prediction from textual data. Most research studies focused on text-based behavioral coding have relied on written text excerpts (Barahona et al., 2018) or used manually-derived transcriptions of the therapy session (Can, Atkins, & Narayanan, 2015; Gibson et al., 2019; F.-T. Lee, Hull, Levine, Ray, & McKeown, 2019). However, a fully automated evaluation system for deployment in real-world settings requires a speech processing pipeline that can analyze the audio recording and provide a reliable speaker-segmented transcript of what was spoken by whom. This is a necessary condition before such an approach is introduced into clinical settings since, otherwise, it may eliminate the burden of manual behavioral coding, but it introduces the burden of manual transcription. Transcription errors introduced by Automatic Speech Recognition (ASR) algorithms may have a significant effect on the performance of NLP-based models (Malik, Barange, Saunier, & Pauchet, 2018), so demonstrating the practical feasibility of a fully automated pipeline is an important task.

An end-to-end system is presented by Xiao, Imel, Georgiou, Atkins, and Narayanan (2015) and Xiao, Huang, et al. (2016), where the authors report a case study of automatically predicting the empathy expressed by the provider. A similar platform, focused on couples therapy, is presented by Georgiou, Black, Lammert, Baucom, and Narayanan (2011). Even employing an ASR module with relatively high error rate, those systems were reported to provide competitive prediction performance (Georgiou et al., 2011). The scope of the particular studies, though, was limited only to session-level codes, while the evaluation sessions were selected from the two extremes of the coding scale. Thus, for each code the problem was formulated as a binary classification task trying to identify therapy sessions where a particular code (or its absence) is represented more prominently (e.g., identify 'low' vs. 'high' empathy).

Current Study
In the current paper we demonstrate and analyze a platform (Figure 1) able to process a raw recording of a psychotherapy session and provide, within short time, performance-based feedback according to therapeutic skills and behaviors expressed both at the utterance and at the session level. We focus on dyadic psychotherapy interactions (i.e., one therapist and one client) and the quality assessment is based on the counselor-related codes of the MISC protocol (Houck et al., 2010). The behavioral codes are predicted by NLP algorithms that analyze the linguistic information captured in the automatically derived transcriptions of the session.

The overall architecture is illustrated in Figure 1a. After both parties have formally consented, the therapist begins recording the session. The digital recording is directly sent to the processing pipeline and appropriate acoustic features are extracted from the raw speech signal. The rich audio transcription component of the system (Figure 1b) consists of five main steps: (a) Voice Activity Detection (VAD), where speech segments are detected over silence or background noise, (b) speaker diarization, where the speech segments are clustered into same-speaker groups (e.g., speaker A, speaker B of a dyad), (c) Automatic Speech Recognition (ASR), where the audio speech signal of each speaker-homogeneous segment is transcribed to words, (d) Speaker Role Recognition (SRR), where each speaker group is assigned their role: in our case study, 'therapist' or 'client', and (e) utterance segmentation, where the speaker turns are parsed into utterances, which are the basic units of behavioral coding. The generated transcription is used to estimate a variety of behavior codes both at the utterance and at the session level, which reflect target constructs related to therapist behaviors and skills.

Table 1
Therapist-related session-level codes, as defined by MISC 2.5. Each code is scored on a 5-point Likert scale.

name              a high score means that the counselor...
acceptance        consistently communicates acceptance and respect to the client
empathy           makes an effort to accurately understand the client's perspective
direction         is focused on a specific target behavior
autonomy support  does not attempt to control the client's behavior or choices
collaboration     interacts with their clients as partners and avoids an authoritarian attitude
evocation         tries to "draw out" client's own desire for changing

Table 2
Therapist-related utterance-level codes, as defined by MISC 2.5. Most of the examples are drawn from the MISC manual (Houck et al., 2010). Many of the code assignments depend on the client's previous utterance (C).

abbr.  name                           example
ADP    Advise with Permission         Would it be all right if I suggested something?
ADW    Advise w/o Permission          I recommend that you attend 90 meetings in 90 days.
AF     Affirm                         Thank you for coming today.
CO     Confront                       (C: I don't feel like I can do this.) Sure you can.
DI     Direct                         Get out there and find a job.
EC     Emphasize Control              It is totally up to you whether you quit or cut down.
FA     Facilitate                     Uh huh. (keep-going acknowledgment)
FI     Filler                         Nice weather today!
GI     Giving Information             Your blood pressure was elevated [...] this morning.
QUO    Open Question                  Tell me about your family.
QUC    Closed Question                How often did you go to that bar?
RCP    Raise Concern with Permission  Could I tell you what concerns me about your plan?
RCW    Raise Concern w/o Permission   That doesn't seem like the safest plan.
RES    Simple Reflection              (C: The court sent me here.) That's why you're here.
REC    Complex Reflection             (C: The court sent me here.) This wasn't your choice to be here.
RF     Reframe                        (C: [...] something else comes up [...]) You have clear priorities.
SU     Support                        I'm sorry you feel this way.
ST     Structure                      Now I'd like to switch gears and talk about exercise.
WA     Warn                           Not showing up for court will send you back to jail.
NC     No Code                        You know, I... (meaning is not clear)

The behavioral analysis of the counselor is summarized into a comprehensive feedback report provided through an interactive web-based platform (Hirsch et al., 2018; Imel et al., 2019). Through the platform, the user is able to review the raw MISC predictions of the system (e.g., empathy score and utterances labeled as reflections), several theory-driven functionals of those (e.g., ratio of questions to reflections), session statistics (e.g., ratio of client's to therapist's talking time), as well as the entire speaker-segmented transcription, accompanied by the corresponding audio recording. Additionally, the user is given the option to take notes and make comments linked to specific timestamps or utterances. That way, the platform can be used directly by the provider as a self-assessment method or by a supervisor as a supportive tool that helps them deliver more effective and engaging training.
Figure 1
(a) Overview of the system used to assess the quality of a psychotherapy session and provide feedback to the therapist. Once the audio is recorded, it is automatically transcribed to find who spoke when and what they said. If the transcription meets certain quality criteria, this textual information is used to predict utterance-level and session-level behavior codes which are summarized into an interactive feedback report. Otherwise, an error message is displayed to the user.
(b) Rich transcription module. The dyadic interaction is transcribed through a pipeline that extracts the linguistic information encoded in the speech signal and assigns each speaker turn to either the therapist or the client. The pipeline comprises Voice Activity Detection ("Is someone speaking?"), Speaker Diarization ("Who is speaking?"), Automatic Speech Recognition ("What are they saying?"), Speaker Role Recognition ("Is it the therapist or the client?"), and Utterance Segmentation ("How is the text parsed into utterances?").

Since the system was designed with real-world deployment in mind, it was important to incorporate specific confidence metrics which reflect the quality of the automatic transcription. Employing quality safeguards helps us both identify potential computational errors, and determine whether the input was an actual therapy session or not (e.g., whether the therapist pushed the recording button by mistake). If certain quality thresholds are not met, then the final report is not generated and feedback is not provided for the specific session. Instead, an error message is displayed to the counselor. For example, in a scenario where speaker segmentation fails because the recording is too noisy or the two speakers have very similar acoustic characteristics, the system would not know which utterances correspond to the provider and which correspond to the client; as a result, the subsequent prediction algorithms would fail to accurately capture counselor-related behaviors. Being able to avoid such scenarios is of crucial importance for a system used in clinical settings.
As illustrated in Figure 1, we have chosen a pipelined implementation of the system, as opposed to a more convoluted architecture, potentially able to predict behavioral codes directly from the speech waveform. That way, we are able to provide a feedback report containing much richer information than merely the behavior codes or statistics of those. In particular, the user has access to the entire transcript and can understand how particular behaviors are linked to the linguistic content of the corresponding utterances. This design increases the interpretability and, as a result, the trust of the clinical provider to the system. Additionally, we are able to extract and provide information critical for the quality assessment of the therapy session, not directly related to behavior codes, such as the client's speaking time. Finally, the quality assurance of the generated transcription is based on certain quality safeguards (described later in the paper) corresponding to specific sub-modules of the pipeline, such as the VAD and the diarization. So, if a potential error is detected at an early stage of the pipeline (e.g., VAD), the entire processing can be halted, thus avoiding wasting computational resources.
Materials and Methods

Datasets
The design of the system presented in this work is based on datasets drawn from a variety of sources. We have combined large speech and language corpora both from the psychotherapy domain and from other fields (meetings, telephone conversations, etc.). That way, we wanted to ensure high in-domain accuracy when analyzing psychotherapy data, but also robustness across various recording conditions. In order to use and evaluate the system in real-world clinical settings, we have additionally collected and analyzed a set of more than 5,000 recordings of therapy sessions between a provider and a patient at a University Counseling Center (UCC). The details of the various datasets are presented in the following sections.
Out-of-Domain Corpora
Audio Sources.
The acoustic modeling performed in this work was mainly based on a large collection of speech corpora, widely used by the research community for a variety of speech processing tasks. Specifically, we used the Fisher English (Cieri, Miller, & Walker, 2004), ICSI Meeting Speech (Janin et al., 2003), WSJ (Paul & Baker, 1992), and 1997 HUB4 (Graff, Wu, MacIntyre, & Liberman, 1997) corpora, available through the Linguistic Data Consortium (LDC), as well as Librispeech (Panayotov, Chen, Povey, & Khudanpur, 2015), TED-LIUM (Rousseau, Deléglise, & Esteve, 2014), and AMI (Carletta et al., 2005). This combined speech dataset consists of more than 2,000h of audio and contains recordings from a variety of scenarios, including business meetings, broadcast news, telephone conversations, and audiobooks/articles.

Text Sources.
The aforementioned datasets are accompanied by manually-derived transcriptions which can be used for language modeling tasks. In our case, since we need to capture linguistic patterns specific to the psychotherapy domain, the main reason we need some out-of-domain text corpus is to build a background model that guarantees a large enough vocabulary and minimizes the unseen words during evaluation. To that end, we use the transcriptions of the Fisher English corpus, featuring a vocabulary of 58.6K words and totaling more than 21M tokens.
Psychotherapy-Related Corpora
Audio Sources.
In order to train and adapt our machine learning models, used both for the transcription component of the system and for the behavior coding predictions, we also used several psychotherapy-focused corpora. In particular, we used a collection of 337 MI sessions (for which audio, transcription and manual coding information were available) from six independent clinical trials (ARC, ESPSB, ESP21, iCHAMP, HMCBI, CTT). In more detail, ARC (9 sessions) (Tollison et al., 2008), ESPSB (38 sessions) (C. M. Lee et al., 2014) and ESP21 (19 sessions) (Neighbors et al., 2012) feature brief alcohol interventions. CTT (194 sessions) (Baer et al., 2009) also consists of alcohol interventions, but using standardized patients (i.e., actors portraying patients). Finally, iCHAMP (7 sessions) (C. M. Lee et al., 2013) addresses marijuana addiction and HMCBI (70 sessions) (Krupski et al., 2012) addresses poly-drug abuse. We refer to the combined dataset as the TOPICS-CTT corpus and we have split it into train (TOPICS-CTT train; 242 sessions) and test (TOPICS-CTT test; 95 sessions) sets.

The mean duration of the sessions is about 29 min. For the HMCBI portion of the corpus, the exact number of different clients is not known. However, under the assumption that it is highly improbable for the same client to visit different therapists in the same study, and having the necessary metadata available for the rest of the corpus, we make the train/test split in a way that we are highly confident there is no overlap between speakers. This is important since we want to make sure that our models capture universal behavior-specific patterns during training and not speaker-specific linguistic information.

Text Sources.
The transcripts of the aforementioned MI sessions were enhanced by data provided by the Counseling and Psychotherapy Transcripts Series (CPTS), available from the Alexander Street Press (alexanderstreet.com) via library subscription. This included transcripts from a variety of therapy interventions totaling about 300K utterances and 6.5M words. For this corpus, no audio or behavioral coding are available, and the data were hence used only for language-based modeling tasks.

Table 3
Number of sessions in the 6 clinical trials composing the TOPICS-CTT corpus. The client IDs are not known for the HMCBI data.

            ARC  ESPSB  ESP21  iCHAMP  HMCBI  CTT
sessions      9     38     19       7     70  194
University Counseling Center Data Collection
Through a collaboration with the university-based counseling center of a large western university, we gathered a corpus of real-world psychotherapy sessions to evaluate the proposed system. Therapy treatment was provided by a combination of licensed staff as well as trainees pursuing clinical degrees. Topics discussed span a wide range of concerns common among students, including depression, anxiety, substance use, and relationship concerns. All the participants (both patients and therapists) had formally consented to their sessions being recorded. Study procedures were approved by the institutional review board of the University of Utah. Each session was recorded by two microphones hung from the ceiling of the clinic offices, one omni-directional and one directed to where the therapist generally sits.

Data reported in this article were collected between September, 2017 and March, 2020, for a total of 5,097 recordings. Every time a session is recorded, it is automatically sent to the audio processing pipeline, and a performance-based feedback report is generated. We note that some of those recordings were not actually valid therapy sessions (e.g., the therapist pushed the recording button by mistake); however we do have relevant safeguards for such cases, as described later in the article. Eventually, 4,268 sessions were successfully processed, with a mean duration of about 49 min. A subset of 188 of those sessions was additionally transcribed and behaviorally coded by human raters in two coding trials, with some differences in the procedure between the two. For the first coding trial (96 sessions), the transcriptions were stripped of punctuation and coders were asked to parse the session into utterances. During the second trial (92 sessions), the human transcriber was asked to insert punctuation, which was used to assist parsing. Additionally, for the second batch of transcriptions, stacked behavioral codes (more than one code per utterance) were allowed in case one of the codes is question (QUC or QUO). Because of those differences in the coding approach, we are reporting results independently for the two trials; in particular, we have split the first trial into train (UCC train; 50 sessions), development (UCC dev; 26 sessions), and test (UCC test1; 20 sessions) sets, while we refer to the second trial as the UCC test2 set and we only use it for evaluation. That way, we are able to monitor the robustness of the system through time, without continuously adapting to new data. For similar reasons as in the case of the TOPICS-CTT corpus, the split for the first trial was done in a way so that there is no speaker overlap between the different sets.

Each of the 188 sessions was coded by at least one of three coders. Among those, 14 sessions (from the first trial) were coded by two or three coders, so that we can have a measure of inter-rater reliability (IRR). To that end, we estimated Krippendorff's alpha (α) (Krippendorff, 2018) for each code, a statistic which is generalizable to different types of variables and flexible with missing observations (Hallgren, 2012). Since sessions were parsed into utterances by the human raters, the unit of coding is not fixed, so we got an estimate for the utterance-level codes at the session level by using the per session occurrences of each label. For the IRR analysis, we treated the occurrences of the utterance-level codes as ratio variables and the values of the session-level codes as ordinal variables. The results for all the codes are given in Table 4.
For the session-level codes, the 'within one' reliability is also provided, since it is recommended that only a distance between the raters' different scores greater than one point in the Likert scale should be considered disagreement (Schmidt, Andersen, Nielsen, & Moyers, 2019).

Data pre-processing
The manually transcribed UCC sessions do not contain any timing information, which means that we needed to align the provided audio with text. That way, we were able to get estimates of the "ground truth" information required to evaluate some of the modules of our system, such as VAD and diarization. We did so by using the Gentle forced aligner (github.com/lowerquality/gentle), an open-source, Kaldi-based (Povey et al., 2011) tool, in order to align at the word level. However, we should note that this inevitably introduces some error to the evaluation process, since, on average, 9.4% of the words per session could not be successfully aligned.

Table 4
Krippendorff's alpha (α) to estimate inter-rater reliability for the utterance-level (ratio measurement level) and the session-level (ordinal measurement level) codes in UCC data. For the utterance-level codes, we get an estimate through their per-session occurrences. For the session-level codes, the 'within one' agreement is also provided, demonstrating whether the distance between the raters' different scores was at most one point in the Likert scale. * denotes that the particular code was not used (count = 0) by at least 2 coders for at least half of the analyzed sessions. RCP was never used by any coder.

code  IRR (α)     code  IRR (α)   code  IRR (α)   code  IRR (α)
ADP   0.542*      EC    0.558     QUC   0.897     RF     0.093*
ADW   0.422       FA    0.868     RCP   –*        SU     0.345
AF    0.123       FI    0.784     RCW   0.000*    ST     0.434
CO    0.497*      GI    0.861     RES   0.268     WA    -0.054*
DI    0.590       QUO   0.945     REC   0.478

code           IRR (α)  IRR 'within 1' (α)
acceptance     0.468    0.747
empathy        0.532    1.000
direction      0.593    0.795
aut. support   0.464    0.743
collaboration  0.287    0.472
evocation      0.410    0.626

Audio Feature Extraction
For all the modules of the speech pipeline (VAD, diarization, ASR), the acoustic representation is based on the widely used Mel Frequency Cepstrum Coefficients (MFCCs). For the UCC data, the channels from the two recording microphones are combined through acoustic beamforming (Anguera, Wooters, & Hernando, 2007), using the open-source BeamformIt tool (github.com/xanguera/BeamformIt).
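To make the front-end concrete, here is a minimal sketch of MFCC extraction, assuming the librosa library; the 16 kHz sampling rate and the 25 ms window / 10 ms hop framing are illustrative defaults, not the system's documented configuration.

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    """Load a (possibly beamformed) session recording and compute MFCCs."""
    y, sr = librosa.load(wav_path, sr=16000)      # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400,        # 25 ms window
                                hop_length=160)   # 10 ms hop
    return mfcc.T                                 # shape: (frames, coefficients)
```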
Automatic Rich Transcription

Before proceeding to the automatic behavioral coding, we need to transcribe the raw audio recording, in order to get information about the content, the speakers, and the utterance boundaries. This is not just a pre-processing step allowing us to apply NLP algorithms, but it also provides invaluable information which will be later incorporated in the final feedback report (e.g., talking time of each speaker). The rich transcription pipeline we propose is illustrated in Figure 1b. In the following sections, we describe the various sub-modules of the system. Further technical details are provided in Appendix A.
Table 5
Mapping between the MISC-defined behavior codes and the grouped target labels, together with the occurrences of each group in the training and development UCC sets. The inter-rater reliability for the grouped labels is also given in terms of the Krippendorff's alpha (α) value.

group  MISC codes                      count (UCC train)  count (UCC dev)  IRR (α)
FA     FA                              5581               2500             0.868
GI     GI, FI                          3797               1695             0.898
QUC    QUC                             1911               693              0.897
QUO    QUO                             1116               405              0.946
REC    REC, RF                         2212               1041             0.479
RES    RES                             609                155              0.268
MIN    ADP, ADW, CO, DI, RCW, RCP, WA  479                163              0.606
MIA    AF, EC, SU                      428                238              0.363
ST     ST                              542                135              0.434

Voice Activity Detection
The first step of the transcription pipeline is to extract the voiced segments of the input audio session. The rest of the session is considered to be silence, music, background noise, etc., and is not taken into account for the subsequent steps. To that end, a two-layer feed-forward neural network is used, giving a frame-level probability. This is a pre-trained model, initially developed as part of the SAIL lab efforts for the Robust Automatic Transcription of Speech (RATS) program (Thomas, Saon, Van Segbroeck, & Narayanan, 2015). The model was trained to reliably detect speech activity in highly noisy acoustic scenarios, with most of the noise types included during training being military noises like machine gun, helicopter, etc. Hence, in order to make the model better suited to our task, the original model was adapted using the UCC dev data. Optimization of the various parameters was done with respect to the unweighted average recall. The frame-level outputs are smoothed via a median filter and converted to longer speech segments which are passed to the diarization sub-system. During this process, if the silence between any two contiguous voiced segments is less than 0.5sec, the corresponding segments are merged together.
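As an illustration of the post-processing just described, the following sketch converts frame-level speech probabilities into merged voiced segments; the 0.5 sec merging gap comes from the text, while the decision threshold and median-filter kernel size are assumed values.

```python
from scipy.signal import medfilt

def frames_to_segments(speech_prob, frame_shift=0.01, threshold=0.5,
                       kernel=11, max_gap=0.5):
    """Smooth frame-level speech probabilities with a median filter and
    return (start, end) voiced segments, merging gaps shorter than
    `max_gap` seconds of silence."""
    smooth = medfilt(speech_prob, kernel_size=kernel)
    voiced = smooth > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_shift              # segment opens
        elif not v and start is not None:
            segments.append([start, i * frame_shift])
            start = None
    if start is not None:
        segments.append([start, len(voiced) * frame_shift])
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < max_gap:
            merged[-1][1] = seg[1]               # close a short silence gap
        else:
            merged.append(seg)
    return merged
```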
Speaker Diarization

Speaker diarization answers the question "who spoke when" and it traditionally consists of two steps. First, the speech signal is partitioned into segments where a single speaker is present. Then, those speaker-homogeneous segments are clustered into same-speaker groups. For this work we follow the x-vector/PLDA paradigm, an approach known to achieve state-of-the-art performance for speaker recognition and diarization (Sell et al., 2018; Snyder, Garcia-Romero, Sell, Povey, & Khudanpur, 2018). In particular, each voiced segment, as predicted by VAD, is partitioned uniformly into subsegments and for each subsegment a fixed-dimensional speaker embedding (x-vector) is extracted. Once the x-vectors have been extracted, an affinity matrix is constructed with the pairwise distances between the subsegments. The similarity metric used is based on the Probabilistic Linear Discriminant Analysis (PLDA) framework (Ioffe, 2006; Prince & Elder, 2007), within which each data point is considered to be the output of a model which incorporates both within-individual and between-individual variation. The subsegments are finally clustered together according to Hierarchical Agglomerative Clustering (HAC). The assumption here is that each session has exactly two speakers (i.e., therapist vs. client), so we continue the HAC procedure until two clusters have been constructed. As a post-processing step after diarization, adjacent speech segments assigned to the same speaker are concatenated together into a single speaker turn, allowing a maximum of 1sec in-turn silence.
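The final clustering step can be sketched as follows, assuming a precomputed symmetric matrix of pairwise PLDA similarity scores over the x-vector subsegments; scikit-learn is used here purely for illustration, since the paper does not state its tooling for this step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_two_speakers(plda_similarity):
    """Cluster x-vector subsegments into two speaker groups via HAC."""
    sim = np.asarray(plda_similarity, dtype=float)
    # Turn similarities into non-negative distances; self-distance set to 0.
    dist = np.max(sim) - sim
    np.fill_diagonal(dist, 0.0)
    hac = AgglomerativeClustering(
        n_clusters=2,              # two speakers assumed per session
        metric="precomputed",      # named `affinity=` in scikit-learn < 1.2
        linkage="average",
    )
    return hac.fit_predict(dist)   # 0/1 speaker label per subsegment
```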
Automatic Speech Recognition

After we get the speaker-homogeneous segments from the diarization module, we need to extract the linguistic content captured within each segment, since this will be the information supplied to the subsequent text-based algorithms. ASR depends on two components: the Acoustic Model (AM), which calculates the likelihood of acoustic observations given a sequence of words, and the Language Model (LM), which calculates the likelihood of a word sequence by describing the distribution of typical language usage. In order to train the AM, we build a Time-Delay Neural Network (TDNN) with subsampling (Peddinti, Povey, & Khudanpur, 2015), an architecture which has been successfully applied in conversational speech achieving remarkable performance (Peddinti, Chen, et al., 2015). To increase robustness against different speakers, the input features are augmented by i-vectors (Saon, Soltau, Nahamoo, & Picheny, 2013). The AM was trained on the combined out-of-domain speech corpora described earlier, recorded under different conditions and reflecting a variety of acoustic environments. The ASR AM was built and trained using the Kaldi speech recognition toolkit (Povey et al., 2011).

In order to build the LM, we independently train two 3-gram models using the SRILM toolkit (Stolcke, 2002). One is trained with in-domain psychotherapy data from the CPTS transcribed sessions. This is interpolated with a large background model, in order to minimize the unseen words during the system deployment. The background LM is trained with the Fisher English corpus, which features conversational telephone data.
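The LM interpolation amounts to a per-word linear mixture of the two 3-gram models. A toy sketch of the resulting scoring is given below; the interpolation weight is an assumed value, since in practice it would be tuned on held-out data.

```python
import math

def interpolated_logprob(probs_in_domain, probs_background, lam=0.8):
    """Log-probability of a word sequence under the interpolated LM.
    Each input list holds p(w_i | history) under the respective 3-gram
    model; `lam` (assumed here) weights the in-domain CPTS model
    against the Fisher background model."""
    return sum(math.log(lam * p_in + (1.0 - lam) * p_bg)
               for p_in, p_bg in zip(probs_in_domain, probs_background))
```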
Speaker Role Recognition

After diarization has been performed, we have the entire set of utterances clustered into two groups; however, there is not a natural correspondence between the cluster labels and the actual speaker roles (i.e., therapist and client). For our purposes, Speaker Role Recognition (SRR) is exactly the task of finding the mapping between the two. Even though different speaker roles follow distinct patterns across various modalities (e.g., audio, language, structure), the linguistic stream of information is often the most useful for the task in hand (Flemotomos, Papadopoulos, Gibson, & Narayanan, 2018). So, in this work we are focusing on this modality, provided by the ASR output.

Let us denote the two clusters which have been identified by diarization as S1 and S2, each one containing the utterances assigned to one of the two different speakers. We know a priori that one of those speakers is the therapist (T) and one is the client (C). In order to do the role matching, two trained LMs, one for the therapist (LM_T) and one for the client (LM_C), are used. We then estimate the perplexities of S1 and S2 with respect to the two LMs and we assign to each S_i the role that yields the minimum perplexity. In case one role minimizes the perplexity for both speakers, we first assign the speaker for whom we are most confident. The confidence metric is based on the absolute distance between the two estimated perplexities (Flemotomos, Martinez, et al., 2018). The required LMs are 3-gram models trained with the SRILM toolkit (Stolcke, 2002), using the TOPICS-CTT train and CPTS corpora.
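The role-matching logic itself is simple enough to sketch directly from the description above; the perplexity values are abstracted away (in practice they would come from the SRILM-trained models).

```python
def assign_roles(ppl_t1, ppl_c1, ppl_t2, ppl_c2):
    """Map diarized clusters S1/S2 to 'therapist'/'client' by LM perplexity.
    ppl_t1 is the perplexity of S1 under the therapist LM, ppl_c1 under
    the client LM, and likewise ppl_t2/ppl_c2 for S2."""
    pref1 = "therapist" if ppl_t1 < ppl_c1 else "client"
    pref2 = "therapist" if ppl_t2 < ppl_c2 else "client"
    if pref1 != pref2:
        return pref1, pref2
    # Both clusters prefer the same role: trust the more confident one,
    # confidence being the absolute perplexity difference.
    conf1, conf2 = abs(ppl_t1 - ppl_c1), abs(ppl_t2 - ppl_c2)
    other = "client" if pref1 == "therapist" else "therapist"
    return (pref1, other) if conf1 >= conf2 else (other, pref2)
```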
Utterance Segmentation

The output of the ASR and SRR modules is at the segment level, with the segments defined by the VAD and diarization algorithms. However, silence and speaker changes are not always the right cues to help us distinguish between utterances, which are the basic units of behavioral coding. The presence of multiple utterances per speaker turn is a challenge we often face when dealing with conversational interactions. Especially in the psychotherapy domain, it has been shown that the right utterance-level segmentation can significantly improve the performance of automatic behavior code prediction (Chen et al., 2020).

Thus, we have included an utterance segmentation module at the end of the automatic transcription, before employing the subsequent NLP algorithms. In particular, we merge together all the adjacent segments belonging to the same speaker in order to form speaker-homogeneous talk-turns, and we then segment each turn using the DeepSegment tool (github.com/notAI-tech/deepsegment). DeepSegment has been designed to perform sentence boundary detection with ASR outputs specifically in mind, where punctuation is not readily available. In this framework, sentence segmentation is viewed as a sequence labeling problem, where each word is tagged as being either at the beginning of a sentence (utterance), or anywhere else. DeepSegment addresses the problem employing a Bidirectional Long Short-Term Memory (BiLSTM) network with a Conditional Random Field (CRF) inference layer (Ma & Hovy, 2016).
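Basic usage of DeepSegment follows its public documentation; the snippet below is a sketch, and the exact API may vary across versions of the tool.

```python
from deepsegment import DeepSegment

# Load the pre-trained English segmentation model.
segmenter = DeepSegment("en")

# A speaker turn without punctuation, as produced by ASR.
turn = "i dont like my job i will quit it sounds like you have already decided"
utterances = segmenter.segment(turn)   # list of detected utterances
```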
Quality Assurance
The goal of the current study is to provide accurate and reliable feedback to the counselor in a real-world environment. Thus, it is essential that we ensure we do not produce feedback reports which are problematic, either because of bad audio quality, or because of errors during computation. We have identified that most of the errors are produced during the first steps of the processing pipeline and are propagated to the subsequent steps. Thus, we have incorporated simple quality safeguards, able to catch errors associated with the audio recording, the VAD, or the diarization. Specifically, before any further processing, the following conditions need to be met (a code sketch of these checks follows the list):

1. The duration of the entire recording has to be between 60sec and 5h. Given that a typical therapy session in our study is about 50min-long, a session outside this range indicates either that the provider pushed the recording button by mistake, or that they forgot to stop recording.

2. At least 25% of the session has to be flagged as voiced, according to the VAD output. During a typical conversational interaction, there are pauses of silence, which is expected; a much lower voiced percentage, however, indicates a problem with the recording or with the VAD output.

Two additional conditions are imposed at later stages of the pipeline; in particular, the fourth one requires that a minimum amount of speaking time is attributed to each of the two speakers.
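A minimal sketch of these safeguards as a single check: the duration and voiced-percentage thresholds are the ones stated above, whereas the 5% per-speaker speaking-time floor is a hypothetical stand-in for the unspecified threshold of the fourth condition.

```python
def passes_quality_checks(duration_sec, voiced_fraction, talk_time_by_speaker,
                          min_talk_fraction=0.05):
    """Return True if a recording may proceed through the pipeline."""
    # Condition 1: plausible session duration (60 sec to 5 h).
    if not (60 <= duration_sec <= 5 * 3600):
        return False
    # Condition 2: at least 25% of the recording flagged as voiced.
    if voiced_fraction < 0.25:
        return False
    # Condition 4: minimum speaking time attributed to each speaker
    # (min_talk_fraction is a hypothetical value, not the system's).
    total = sum(talk_time_by_speaker.values())
    if total == 0:
        return False
    return all(t / total >= min_talk_fraction
               for t in talk_time_by_speaker.values())
```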
Utterance-level and Session-level Labeling
Once the entire session is transcribed at the utterance level, we are able to employ text-based algorithms for the task of behavior code prediction. Both utterance-level and session-level behavior codes are predicted and provided back to the counselor as part of the feedback report, as described below.
Utterance-level Code Prediction
We are focusing on counselor behaviors, so we only take into account the utterances assigned to the therapist according to SRR. Each one of those needs to be assigned a single code from the 9 target labels summarized in Table 5. This is achieved through a BiLSTM network with an attention mechanism (Singla et al., 2018) which only processes textual features. The input to the system is a sequence of word-level embeddings for each utterance. The recurrent layer exploits the sequential nature of language and produces hidden vectors which take into account the entire context of each word within the utterance. The attention layer can then learn to focus on salient words carrying valuable information for the task of code prediction, thus enhancing robustness and interpretability. The network was first trained on the TOPICS-CTT data using class weights to handle the problem of skewed code distribution in the data (Table 5). The system was further fine-tuned by continuing training on the UCC train data in order to be suitably fitted to the UCC conditions.
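A compact PyTorch sketch of such a BiLSTM-with-attention classifier is given below; the embedding and hidden dimensions are illustrative, and the class-weighted training mentioned above would be realized with, e.g., nn.CrossEntropyLoss(weight=...).

```python
import torch
import torch.nn as nn

class UtteranceCoder(nn.Module):
    """Sketch of a BiLSTM-with-attention utterance classifier of the kind
    described above; dimensions and vocabulary handling are illustrative."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_codes=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)   # scalar attention score per word
        self.out = nn.Linear(2 * hidden, n_codes)

    def forward(self, token_ids):             # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids)) # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.att(h).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * h).sum(dim=1)
        return self.out(context)              # logits over the 9 code groups
```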
Session-level Code Prediction
Apart from the utterance-level codes, our system assigns a score to each one of the global codes of Table 1, ranging from 1 to 5. To that end, we represent the entire session, using the utterances assigned to both the therapist and the client, by the term frequency - inverse document frequency (tf-idf) (Salton & McGill, 1986) transformation of unigrams, bigrams, and trigrams found within the session, excluding common stop words. Those features are l2-normalized and passed to a Support Vector Regressor (SVR) which gives the final prediction. After hyper-parameter tuning, we chose a polynomial SVR kernel (4th-degree) for acceptance and autonomy support, a linear kernel for empathy, collaboration and evocation, and a gaussian kernel for direction.

Contrary to the training approach followed for the utterance-level codes, here we train using only UCC data. The reason is that there is a discrepancy between the globals assigned by human raters to the TOPICS-CTT and the UCC sessions, since different coding procedures were followed. In particular, the TOPICS-CTT sessions were coded only across two global codes (empathy and MI spirit) following the Motivational Interviewing Treatment Integrity (MITI) (Moyers, Rowell, Manuel, Ernst, & Houck, 2016) coding scheme. Thus, due to the limited amount of training data (only 188 sample points—UCC sessions—in total), we apply a 5-fold cross validation scheme across the UCC dataset (from both coding trials) for any hyper-parameter tuning and we then keep those parameters to re-train using the entire UCC set.
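A sketch of one such per-code regressor using scikit-learn, shown with the linear kernel used for empathy (acceptance and autonomy support would instead use a 4th-degree polynomial kernel, and direction a gaussian one); the stop-word list is an assumption about the exact set used, and TfidfVectorizer applies l2 normalization by default.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

def build_global_code_regressor():
    """tf-idf over uni/bi/trigrams (l2-normalized) followed by an SVR."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3), stop_words="english", norm="l2"),
        SVR(kernel="linear"),
    )

# Usage: one full session transcript (therapist + client utterances) per sample.
# model = build_global_code_regressor()
# model.fit(session_texts, empathy_scores)
```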
Final Report

After we have the automatically generated transcript and all the session-level and utterance-level predictions through our system, those are provided to the therapist as a feedback report through an interactive, web-based platform which we refer to as the Counselor Observer Ratings Expert for Motivational Interviewing (CORE-MI) (Hirsch et al., 2018; Imel et al., 2019). A video demonstration of the platform and its functionality is available online.
CORE-MI features two main views, the session view and the report view (Appendix B). In the first one, the user can listen to the recording of the therapy, watch the video (if available) and read the generated transcript, which is scrollable and searchable. Additionally, they can keep notes linked to specific timestamps and utterances of the session.

The report view provides the actual therapy session evaluation. The entire session timeline is presented in the form of a bar where talk turns of the two speakers are displayed in different colors. Hovering over a specific turn brings up the corresponding transcription and—in case the turn is assigned to the therapist—the corresponding MISC code(s). Based on the results reported later, we have decided to collapse the simple and complex reflections into one composite reflection (RE) label. The global behavior codes are also displayed, as well as a set of summary indicators which reflect the adherence to MI therapeutic skills. Those are the ratio of reflections (simple and complex) to questions (open and closed), the percentage of the open questions asked (among all the questions), the percentage of the complex reflections (among all the reflections), the percentage of each speaker's talking time, the MI adherence and the overall MI fidelity. MI adherence reflects the percentage of utterances where the counselor used MI-consistent techniques (e.g., asking open questions or giving advice with permission). Finally, the overall MI fidelity score is a composite metric rated on a 12-point scale that takes all the above into consideration and reflects the proficiency of the counselor to the different aspects of MI therapy. In particular, a provider can receive one point for passing pre-defined basic proficiency benchmarks and two points for passing advanced competency benchmarks across the following six measures of quality: empathy, MI spirit, reflection-to-question ratio, percentage of open questions, percentage of complex reflections, MI adherence. MI spirit is estimated as the average of evocation, collaboration, and autonomy support (Houck et al., 2010).

The main design characteristics of the CORE-MI platform have been tested in a past study (Hirsch et al., 2018; Imel et al., 2019) and results showed that the providers find the system easy to use and the feedback easy to understand. Additionally, most of the professional therapists that participated in the survey seemed excited about the potential opportunity to use such a system in clinical practice.
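For illustration, several of the summary indicators can be computed directly from per-session counts of the predicted codes, as in the sketch below; which grouped labels count as "MI-adherent" is our assumption here, based on the examples in the text, not the system's exact definition.

```python
def summary_indicators(code_counts):
    """Compute report summary indicators from per-session counts of the
    grouped utterance-level codes (Table 5)."""
    questions = code_counts.get("QUO", 0) + code_counts.get("QUC", 0)
    reflections = code_counts.get("RES", 0) + code_counts.get("REC", 0)
    total = sum(code_counts.values()) or 1
    # Assumption: MIA (affirm, emphasize control, support) plus open
    # questions count as MI-consistent utterances.
    adherent = code_counts.get("MIA", 0) + code_counts.get("QUO", 0)
    return {
        "reflections_to_questions": reflections / max(questions, 1),
        "pct_open_questions": code_counts.get("QUO", 0) / max(questions, 1),
        "pct_complex_reflections": code_counts.get("REC", 0) / max(reflections, 1),
        "mi_adherence": adherent / total,
    }
```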
Results and Discussion

Automatic Rich Transcription

All the submodules of the transcription pipeline are evaluated on the two UCC test sets we have described (UCC test1, UCC test2), both individually and as part of the overall system. That way, we want to evaluate the performance of each one of the models, but, more importantly, investigate any error propagation that inevitably takes place.

VAD / Diarization
During evaluation, VAD is usually viewed as part of a diarization system (e.g., Sell et al. (2018)), so for evaluation purposes we consider our diarization model as the first component of the pipeline (frame-level VAD results are provided in Appendix C). The standard evaluation metric for diarization is called Diarization Error Rate (DER; Anguera et al. (2012)) and it incorporates three sources of error: false alarms, missed speech, and speaker error. False alarm speech (the percentage of speech in the output but not in the ground truth) and missed speech (the percentage of speech in the ground truth but not in the output) are mostly associated with VAD. Speaker error is the percentage of speech assigned to the wrong speaker cluster after an optimal mapping between speaker clusters and true speaker labels. We estimate the DER on the UCC data using the md-eval tool, which was developed as part of the NIST Rich Transcription (RT) evaluation series, applying a forgiveness collar around the reference speaker boundaries.

Table 6
Diarization results (%) for the test sets of the UCC data (UCC test1, UCC test2). The Diarization Error Rate (DER) is estimated as the sum of false alarm, missed speech, and speaker error rates.

Concatenating adjacent same-speaker segments into longer turns, as described earlier, benefits the downstream modules in two ways. First, it produces sufficiently long segments so that they capture a meaningful speaker representation. Second, it ensures larger language context, which means that the LM of the ASR system can choose the right word path with higher confidence.

Automatic Speech Recognition
The evaluation of an ASR system is usually performed through the Word Error Rate (WER) metric, which is the normalized Levenshtein distance between the ASR output and the ground truth transcript. This includes errors because of word substitutions, word deletions, and word insertions. For instance, the word insertion rate is the number of new words included in the prediction which are not found in the original transcript over the total number of ground truth words. WER is calculated as the sum of those three error rates. Those errors are typically estimated for each utterance which is given to the ASR module and then summed up for all the evaluation data, in order to get an overall WER. However, when we analyze an entire therapy session which has been processed by the VAD and diarization sub-systems, the "utterances" are different than the ones identified by the human transcriber. In that case, we perform the evaluation at the session level, ignoring the speaker labels (from diarization) and concatenating all the utterances of the session. We do the same for the original transcript and we hence view the entire session as a "single utterance" for the purposes of ASR evaluation. The results are reported in Table 7.
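For reference, WER can be computed with the standard dynamic-programming edit-distance alignment; a self-contained sketch:

```python
def word_error_rate(ref, hyp):
    """Levenshtein-based WER between reference and hypothesis word lists:
    (substitutions + deletions + insertions) / len(ref)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```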
Table 7
Automatic Speech Recognition (ASR) results (%) for the test sets of the UCC data: substitution, deletion, and insertion rates, together with the total Word Error Rate (WER), estimated as the sum of those three. Results are reported when using either the machine-generated segments (pipeline) or the ones derived by the manual transcriptions (oracle).

set        segmentation  subs  del   ins  WER
UCC test1  oracle        18.3  15.3  3.5  37.1
UCC test1  pipeline      20.0  13.9  4.3  38.1
UCC test2  oracle        14.3  13.8  2.3  30.4
UCC test2  pipeline      16.1  12.5  3.1  31.6

As we can see, ASR performance is not severely degraded by any error propagation from the pre-processing step of diarization (WER is increased about 1% absolute). Interestingly, even though the insertion rate is increased, the deletion rate is decreased when the machine-generated segments are provided. This is explained by the long segments constructed by the diarization algorithm and the post-processing of its output after concatenating consecutive segments. On the one hand, labeling silence or noise as "speech" associated with some speaker occasionally leads ASR to predict words where in reality there is no speech activity—thus increasing the insertion rate. On the other hand, this minimizes the probability of missing some words because of missed speech. Such deleted words may occur when providing the oracle segments because of inaccuracies during the construction of the "ground truth" through forced alignment. We note that, even though the estimated error is high, WERs in the range reported (30%–40%) do not necessarily prohibit reliable downstream behavioral coding, as the results reported later in this article suggest.

Speaker Role Recognition
The described SRR algorithm operates at the session level, which means that, for evaluation purposes, it suffices to examine how many sessions are labeled correctly with respect to speaker roles. When oracle diarization information is provided, coupled either with the manual transcriptions or with the ASR results, our algorithm achieves a perfect recognition result for all the UCC test1 and UCC test2 sessions. When speaker segmentation and clustering is performed through the diarization algorithm of the processing system, the SRR module fails to find the right mapping between roles and speakers for seven sessions from the UCC test sets.

This behavior is associated with error propagation from the previous steps, which is made apparent from the fact that the speaker error rate is especially high for those seven sessions (about 42% on average). In other words, the diarization module failed to sufficiently distinguish between the two speakers, probably because of similar acoustic characteristics. Thus, there is not enough reliable speaker-specific linguistic information that the SRR can use during the role assignment. This example of error propagation also highlights the need for quality assurance through specific safeguards at the early steps of the processing pipeline.

Utterance Segmentation
The last step of the transcription pipeline is the utterance segmentation, which provides the basic units for behavioral coding. We get a rough indication of the quality of our segmentation process by estimating the correlation between the total number of utterances per session that have been assigned to the therapist by the human annotators and by the processing pipeline. The Spearman correlation between them, when all the UCC test1 and UCC test2 sessions are taken into account, is 0.478, a statistically significant result. The number of the manually-defined utterances is usually higher than the number of the ones identified by our system, because the automated rich transcription module often fails to capture very short utterances (e.g., 'yeah', 'right', etc.).

Quality Assurance
According to the quality safeguards introduced, 16 out of the 112 sessions in our combined test set of UCC test1 and UCC test2 are flagged as "problematic". All of those do not meet the fourth condition, related to the minimum allowed speaking time attributed to each speaker. This means that in practice the processing would halt after the end of the diarization algorithm, with an error message displayed to the user. When we ran the entire set of 5,097 UCC recordings through the pipeline, 4,268 met all the four criteria and were successfully processed.

It is interesting that, after excluding the "problematic" sessions from the test sets (UCC test1, UCC test2), the Spearman correlation between the total number of therapist utterances per session as assigned by the human coders vs. by the automated system is increased from 0.478 to 0.639. This is explained by the fact that, in several of those cases, poor diarization performance led the subsequent SRR module to assign almost the entire session to the client. As a result, the number of therapist-attributed utterances was much smaller than expected.

Utterance-level and Session-level Labeling
Utterance-level and Session-level Labeling

As in the case of the transcription pipeline submodules, we examine the effectiveness of the proposed models, both when they are provided with oracle information and when they are part of the end-to-end system. In the following sections we discuss the results of the MISC code (utterance-level and session-level) prediction models.

Utterance-level Code Prediction
When we use the manually transcribed data to perform utterance-level MISC code prediction, the overall F1 score is 0.524 for the UCC_test1 and 0.514 for the UCC_test2 set. The F1 scores for each individual code are reported in Table 8. As expected, the results are better for the highly frequent codes (Table 5), such as FA, since the machine learning models have more training examples to learn from. On the other hand, the models do not perform as well for less frequent codes, such as MIN and MIA. However, comparing Table 8 and Table 4, we can also see that for several of the codes for which our system performs relatively poorly (e.g., RES, MIA, ST), the inter-annotator agreement is also considerably low. A notable example which does not follow this pattern is the non-adherent behavior (MIN), where our system achieves the lowest results among all the codes, while there is substantial inter-annotator agreement (Table 4).

Table 8
F1 scores for the predicted utterance-level codes (FA, GI, QUC, QUO, REC, RES, MIN, MIA, ST) using the manually transcribed UCC data, reported separately for UCC_test1 and UCC_test2.

As shown in Figure 2, there is a significant (p < .01) positive correlation between the per-session code counts derived from the reference annotations and those derived from the pipeline for all the codes, apart from FA. The Spearman correlation for the 9 codes is on average 0.446 when all the sessions in the test sets (UCC_test1 and UCC_test2) are taken into consideration. This suggests that our test sets are indicative of the entire dataset.
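Per-code and overall F1 scores of this kind can be computed directly with scikit-learn; the snippet below uses toy label sequences, and whether the overall score is micro- or macro-averaged is an implementation choice not specified here.

```python
from sklearn.metrics import f1_score

CODES = ["FA", "GI", "QUC", "QUO", "REC", "RES", "MIN", "MIA", "ST"]

# Toy data: one MISC code per therapist utterance.
y_true = ["FA", "GI", "FA", "REC", "QUC", "FA", "RES", "GI"]
y_pred = ["FA", "GI", "FA", "RES", "QUC", "FA", "RES", "FA"]

per_code = f1_score(y_true, y_pred, labels=CODES, average=None, zero_division=0)
overall = f1_score(y_true, y_pred, labels=CODES, average="micro", zero_division=0)
print(dict(zip(CODES, per_code.round(3))), round(overall, 3))
```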
Session-level Code Prediction

As mentioned in the Materials and Methods section, the session-level code predictor is the only model where, due to the limited amount of training data, we apply a 5-fold cross-validation scheme across the entire coded UCC dataset (all 188 sessions). The cross-validation results are reported in Table 9. Results are given in terms of accuracy and averaged F1 score, after the output of the SVR is rounded to the closest integer in the range from 1 to 5 and after we collapse classes 1 and 2 together (due to the very limited number of sessions scored as 1 in the reference data). We also report the 'within one' accuracy, indicating whether the distance between the reference and predicted scores was at most one. In general, the predictive power of the models seems to be lower for the codes where the inter-rater reliability (Table 4) is also low. Additionally, the performance is not severely affected by the usage of the speech pipeline, when compared to using the manual transcriptions.

The distributions of the six global codes across all the 4,268 psychotherapy sessions that were successfully processed are given in Figure 4. All the codes, with the exception of direction, are skewed towards the higher scores of the scale (higher than 3). As was the case with the utterance-level codes (Figure 3), we get a very similar distribution if we plot the results only for the sessions in the UCC test sets for which manual transcription and behavioral coding information were available. This is indicative of the generalization of the system and of its performance on future therapy sessions.
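The post-processing of the SVR outputs and the 'within one' metric can be summarized in a few lines; the sketch below assumes raw regression outputs and reference Likert scores as inputs.

```python
import numpy as np

def postprocess_and_score(raw_pred, reference):
    """Round SVR outputs to the 1-5 Likert scale, collapse classes 1 and 2,
    and compute plain and 'within one' accuracy."""
    pred = np.clip(np.rint(raw_pred), 1, 5)
    pred = np.where(pred == 1, 2, pred)                   # collapse 1 into 2
    ref = np.asarray(reference, dtype=float)
    ref = np.where(ref == 1, 2, ref)
    acc = float(np.mean(pred == ref))
    acc_within_one = float(np.mean(np.abs(pred - ref) <= 1))
    return acc, acc_within_one

acc, acc1 = postprocess_and_score([2.4, 4.6, 3.1, 1.2], [2, 5, 4, 3])
```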
Limitations and Conclusions

In this article we presented and analyzed a processing pipeline able to automatically evaluate recorded psychotherapy sessions. The application of such a system in real-world settings could guarantee the provision of fast and low-cost feedback. Performance-based feedback is an essential aspect both of training new therapists and of maintaining acquired skills, and can eventually lead to improved quality of services and more positive clinical outcomes. Additionally, being able to record, transcribe, and code interventions at large scale opens up ample opportunities for psychotherapy research studies with increased statistical power.

At the time of writing, we have processed a collection of more than 5,000 recordings, 4,268 of which met our quality criteria and are now accompanied by transcriptions and behavioral coding information. Both utterance-level and session-level MISC codes are available, covering a wide range of behaviors (Figures 3 and 4). As we are planning to expand our corpus with more data, we are confident that such a dataset will lead to novel studies in the fields of psychotherapy, computational modeling, and their intersection. For example, the transcriptions of a subset of those data have already been used to study therapeutic alliance directly using text-based features (Goldberg et al., 2020) or to model clients and therapists as narrative characters (Martinez et al., 2019). Even though we have focused here on Motivational Interviewing, the basic ideas of the speech processing pipeline remain the same for other dyadic interactions as well. For instance, the same modules analyzed in this article have been used to automatically transcribe and subsequently analyze cognitive behavior therapy sessions (Chen et al., 2020).

Despite the promising results presented here, we recognize that there is room for improvement in almost all the sub-modules of the pipeline. Our analysis showed that diarization failed for some of the sessions that human transcribers had no problem processing. Additionally, there was a consistent underrepresentation of verbal fillers (e.g., uh-huh) and of the relevant MISC label (FA) in the automatically generated transcripts, as a result of the system struggling to capture and transcribe very short speaker turns. Moreover, the architecture design followed, where the various modules are trained independently and are then connected to form a pipeline, inevitably leads to error propagation. There are indications that alternative frameworks could reduce errors in specific cases, for example if diarization is aware of the different speaker roles (Flemotomos, Georgiou, & Narayanan, 2020) or if the two tasks of diarization and role recognition are performed simultaneously (Flemotomos, Papadopoulos, et al., 2018).

For this work we only used text-based methods for behavioral coding. Acoustic features, however, and especially prosodic cues, play a major role in understanding language (Cutler, Dahan, & Van Donselaar, 1997) and have been successfully used in the past for MISC code prediction (Singla et al., 2018; Xiao et al., 2014). Recent studies have even shown that audio-only approaches, where word embeddings are directly learnt from spoken language, can yield improved results (Singla, Chen, Atkins, & Narayanan, 2020). Additionally, for the most part of our analysis, we have focused only on therapist characteristics. However, specific dialog attributes, such as speech rate entrainment (Xiao, Imel, Atkins, Georgiou, & Narayanan, 2015) and language synchrony (Lord, Sheng, Imel, Baer, & Atkins, 2015; Nasir et al., 2019) between the two involved parties (therapist vs. client), can prove useful for identifying therapy-related behaviors.

Figure 2
Count of each target MISC label per session when coded by humans (reference) and when processed by the pipeline. All the sessions in the UCC_test1 and UCC_test2 sets are shown and the correlation values are calculated based on all of them. The sessions flagged as problematic by the quality safeguards are denoted by square markers. RE is a composite label containing both RES and REC. (Per-panel Spearman correlations include 0.194 for FA, and 0.303, 0.388, 0.634, 0.451, 0.428, and 0.417 for other codes; all except FA are significant at p < .01.)

Figure 3
Frequency of the utterance-level MISC codes for all the UCC recordings processed and for the subset included in the UCC test sets. Only the sessions successfully processed (i.e., those that met our quality criteria) are taken into consideration here. The total number of therapist-assigned utterances is about 1.2M for all the sessions (4,268 sessions) and 28K for only the sessions included in the UCC test sets (UCC_test1 and UCC_test2; 96 sessions).

Table 9
Averaged F1 scores and accuracy for the predicted session-level MISC codes using the manually transcribed (oracle) or the pipeline-generated data, based on a 5-fold cross-validation scheme across all the coded UCC data (188 sessions). The 'within one' accuracy indicates whether the distance between the predicted and reference scores was at most one point on the Likert scale.

metric            F1                acc               acc ('within 1')
ASR method        oracle  pipeline  oracle  pipeline  oracle  pipeline
acceptance        0.318   0.297     0.478   0.457     0.771   0.755
empathy           0.342   0.342     0.586   0.580     0.819   0.851
direction         0.380   0.261     0.426   0.389     0.740   0.697
autonomy support  0.303   0.261     0.495   0.451     0.878   0.840
collaboration     0.285   0.199     0.437   0.346     0.654   0.612
evocation         0.274   0.188     0.362   0.335     0.751   0.671

Another direction for potential future improvements is related to the modeling approach followed for the utterance-level codes. The system presented here treats all the codes evenly and employs a single neural architecture giving one output label for every utterance. However, since human coders often stack multiple codes for a single utterance (e.g., asking for permission to give advice [ADP] through a closed question [QUC]), a hierarchical algorithm which differentiates between codes with increasing granularity and allows for multiple codes per utterance may be useful. In such a scenario, a hybrid method which uses the modeling strength of neural networks and at the same time exploits knowledge-based information distilled from the coding manuals and clinical practice can potentially improve the robustness and increase the interpretability of the results. This strategy would particularly benefit codes on which our system performed relatively poorly (e.g., MIN, MIA; Table 8), due to limited training examples or due to insufficient information captured just from the available linguistic cues. Keeping in mind that psychotherapy is a dyadic interaction, incorporating contextual information from the client's neighboring utterances could also lead to performance improvements, especially for codes such as reflections (RES and REC) that depend semantically on the client's language (Table 2).

Figure 4
Distribution of the session-level MISC codes (acceptance, empathy, direction, autonomy support, collaboration, evocation) for all the UCC recordings processed and for the subset included in the UCC test sets. Only the sessions successfully processed (that met our quality criteria) are taken into consideration here.
Limitations imposed by the available number of training samples are a crucial aspect of any machine learning based model. Even though herein we present and use one of the largest available corpora constructed for the purpose of automatic behavioral coding, the performance of all the models involved is still critically dependent on the sample size. This is why we decided to use several third-party sources, both for training the behavior code predictors and for the audio and language modeling needed for the transcription pipeline. Applying external datasets, however, was not possible for all the tasks. In particular, for the session-level code prediction, we only had the internal 188 labeled samples available and we, hence, decided to apply a cross-validation scheme with a statistical model (support vector regression) that does not require as much data to converge as a more complex deep learning model. In any case, all the results were reported on evaluation sets not seen during training, while the distributions of the predicted codes (Figures 3 and 4) suggest that those results are indicative of the performance on a much bigger dataset of therapy sessions.

An aspect of crucial importance for our system is the quality assurance of the final evaluation report provided to the counselor. Being able to detect computational errors at an early stage and give relevant warning messages to the user is an essential prerequisite before mental health practitioners trust computer-based tools and introduce them into clinical settings. We have already implemented several quality safeguards, with results indicating that they are a step in the right direction. We are planning to implement more confidence metrics, which take into account the ASR and behavior coding results, in addition to VAD and diarization. Human annotators can still be used for the sessions, or parts of sessions, for which confidence is low. Such manually annotated sessions can be a valuable source of information for further adapting our algorithms. That way, we can introduce an active learning scenario where the system incrementally becomes more accurate and reliable.

Likewise, it is important that we have indicative evaluation metrics both for the individual modules and for the end-to-end system. Standard metrics, such as the Word Error Rate (WER) and the Diarization Error Rate (DER) used in ASR and in diarization, respectively, are useful during modeling in order to have benchmarks and quantifiable areas of improvement. However, they do not necessarily reflect the transcript quality from a user's perspective (Silovsky, Zdansky, Nouza, Cerva, & Prazak, 2012) and they are not always representative of the performance with respect to semantics and to clinical impact (Miner et al., 2020). Qualitative surveys where experts share their opinions on the accuracy of the system output could help highlight specific areas of clinical importance on which the modeling efforts should focus.

We should underline here that our goal is to build a system that will not replace human input, but will instead assist medical experts, increasing efficiency and accuracy. Technology-based tools have seen a rapid rise in mental healthcare; they offer opportunities for ongoing self-assessment and self-improvement and open new discussions on the development of specific skills between professionals or between trainees and supervisors.
Additionally, even with a widespread usage of automatic psychotherapy evaluation systems, the community will still need skilled and objective behavioral coders, both for the evaluation and for the training of the systems, since any machine learning algorithm is only as good as the training data we provide (Caliskan, Bryson, & Narayanan, 2017).

In any case, it is essential that users be adequately trained to understand the meaning of automatically generated feedback and what the various scores represent. It has been reported that experienced counselors are more likely to be sceptical about the validity of their ratings (Hirsch et al., 2018), as opposed to new and young therapists who may be attracted by the lure of machine learning, even without being fully aware of how their performance-based scores are estimated. Technology-based systems have the potential to transform mental healthcare. Being receptive to such a transformation should not mean uncritically accepting any machine-generated results. In fact, well-intentioned scepticism and criticism will accelerate research in the field and will lead to an incremental improvement of the relevant technologies.

Open Practices Statement
The original data collected for this study consist of real-world therapist-client sessions recorded at the University Counseling Center (UCC) of a large public Western university and have to remain within the UCC servers at all times for privacy reasons; thus they cannot be made publicly available. The psychotherapy data used from previous studies (Baer et al., 2009; Krupski et al., 2012; C. M. Lee et al., 2013, 2014; Neighbors et al., 2012; Tollison et al., 2008) for adaptation are also protected and not publicly available. The speech corpora used to train the ASR system are either freely available or provided through the Linguistic Data Consortium (LDC) to members and non-members for a fee. In particular, Librispeech (Panayotov et al., 2015), TED-LIUM (Rousseau et al., 2014) (lium.univ-lemans.fr/ted-lium2), and AMI (Carletta et al., 2005) (groups.inf.ed.ac.uk/ami/corpus) are freely available to the community; Fisher English (Cieri et al., 2004) (Part 1: LDC2004S13 and Part 2: LDC2005S13), ICSI Meeting Speech (Janin et al., 2003) (LDC2004S02), WSJ (Paul & Baker, 1992) (Part 1: LDC93S6A and Part 2: LDC94S13A), and 1997 HUB4 (Graff et al., 1997) (LDC98S71) are provided through LDC. The Counseling and Psychotherapy Transcripts (without accompanying audio) that were used for some of the language-based modeling can be accessed on request at alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series.

Our models are trained on real-world, sensitive, and protected data. Thus, our trained models cannot be made publicly available. Acoustic feature extraction and acoustic modeling were performed using the Kaldi toolkit, which is available at github.com/kaldi-asr/kaldi. The BeamformIt tool used for acoustic beamforming is available at github.com/xanguera/BeamformIt. Language models were built using the SRILM toolkit. The neural network used for utterance-level code prediction was built on TensorFlow, while the tf-idf/SVR framework used for session-level code prediction made use of the scikit-learn Python library (scikit-learn.org/stable). The md-eval tool, developed by the National Institute of Standards and Technology (NIST) and used for diarization evaluation, is no longer distributed by NIST, but can be found at github.com/nryant/dscore. The estimation of Krippendorff's alpha (α) for inter-rater reliability was based on the implementation available at github.com/pln-fing-udelar/fast-krippendorff.

Acknowledgements
Funding was provided by the National Institutes of Health / National Institute on Alcohol Abuse and Alcoholism (R01 AA018673).

References
Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, (2), 356–370.
Anguera, X., Wooters, C., & Hernando, J. (2007). Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, (7), 2011–2022.
Baer, J. S., Wells, E. A., Rosengren, D. B., Hartzler, B., Beadnell, B., & Dunn, C. (2009). Agency context and tailored training in technology transfer: A pilot evaluation of motivational interviewing training for community counselors. Journal of Substance Abuse Treatment, (2), 191–202.
Bakeman, R., & Quera, V. (2012). Behavioral observation. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology, Vol. 1. Foundations, planning, measures, and psychometrics (pp. 207–225). Washington, DC: American Psychological Association.
Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C.-C., Lammert, A. C., Christensen, A., ... Narayanan, S. S. (2013). Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features. Speech Communication, (1), 1–21.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, (6334), 183–186.
Can, D., Atkins, D. C., & Narayanan, S. S. (2015). A dialog act tagging approach to behavioral coding: A case study of addiction counseling conversations. In Proc. Annual Conference of the International Speech Communication Association.
Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., ... others (2005). The AMI meeting corpus: A pre-announcement. In Proc. International Workshop on Machine Learning for Multimodal Interaction (pp. 28–39).
Chen, Z., Flemotomos, N., Ardulov, V., Creed, T. A., Imel, Z. E., Atkins, D. C., & Narayanan, S. (2020). Feature fusion strategies for end-to-end evaluation of cognitive behavior therapy sessions. (preprint at https://arxiv.org/abs/2005.07809)
Cieri, C., Miller, D., & Walker, K. (2004). The Fisher corpus: A resource for the next generations of speech-to-text. In Proc. Language Resources and Evaluation Conference (pp. 69–71).
Cowie, M. R., Blomster, J. I., Curtis, L. H., Duclaux, S., Ford, I., Fritz, F., ... others (2017). Electronic health records to facilitate clinical research. Clinical Research in Cardiology, (1), 1–9.
Curran, J., Parry, G. D., Hardy, G. E., Darling, J., Mason, A.-M., & Chambers, E. (2019). How does therapy harm? A model of adverse process using task analysis in the meta-synthesis of service users' experience. Frontiers in Psychology, 347. doi: 10.3389/fpsyg.2019.00347
Cutler, A., Dahan, D., & Van Donselaar, W. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech, (2), 141–201.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).
Flemotomos, N., Georgiou, P., & Narayanan, S. (2020). Linguistically aided speaker diarization using speaker role information. In Proc. Odyssey: The Speaker and Language Recognition Workshop (pp. 117–124).
Flemotomos, N., Martinez, V. R., Gibson, J., Atkins, D. C., Creed, T., & Narayanan, S. (2018). Language features for automated evaluation of cognitive behavior psychotherapy sessions. In Proc. Annual Conference of the International Speech Communication Association (pp. 1908–1912).
Flemotomos, N., Papadopoulos, P., Gibson, J., & Narayanan, S. (2018). Combined speaker clustering and role recognition in conversational speech. In Proc. Annual Conference of the International Speech Communication Association (pp. 1378–1382).
Gangadharaiah, R., Shivade, C., Bhatia, P., Zhang, Y., & Kass-Hout, T. (2020). Why conversational AI won't replace healthcare providers. In Conversational Agents for Health and Wellbeing, CHI Workshop.
Gaume, J., Gmel, G., Faouzi, M., & Daeppen, J.-B. (2009). Counselor skill influences outcomes of brief motivational interventions. Journal of Substance Abuse Treatment, (2), 151–159.
Georgiou, P. G., Black, M. P., Lammert, A. C., Baucom, B. R., & Narayanan, S. S. (2011). "That's aggravating, very aggravating": Is it possible to classify behaviors in couple interactions using automatically derived lexical features? In International Conference on Affective Computing and Intelligent Interaction (pp. 87–96).
Gibson, J., Atkins, D., Creed, T., Imel, Z., Georgiou, P., & Narayanan, S. (2019). Multi-label multi-task deep learning for behavioral coding. IEEE Transactions on Affective Computing.
Goldberg, S. B., Flemotomos, N., Martinez, V. R., Tanana, M. J., Kuo, P. B., Pace, B. T., ... others (2020). Machine learning and natural language processing in psychotherapy research: Alliance as example use case. Journal of Counseling Psychology, (4), 438–448.
Graff, D., Wu, Z., MacIntyre, R., & Liberman, M. (1997). The 1996 broadcast news speech and language-model corpus. In Proc. DARPA Workshop on Spoken Language Technology (pp. 11–14).
Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, (1), 23–34.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, (1), 81–112.
Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, (4), 555–568.
Hill, C. E. (2009). Helping skills: Facilitating exploration, insight, and action. Washington, DC: American Psychological Association.
Hirsch, T., Soma, C., Merced, K., Kuo, P., Dembe, A., Caperton, D. D., ... Imel, Z. E. (2018). "It's hard to argue with a computer": Investigating psychotherapists' attitudes towards automated evaluation. In Proc. Designing Interactive Systems Conference (pp. 559–571).
Houck, J. M., Moyers, T. B., Miller, W. R., Glynn, L. H., & Hallgren, K. A. (2010). Motivational Interviewing Skill Code (MISC) version 2.5. (Available from http://casaa.unm.edu/download/misc25.pdf)
Imel, Z. E., Pace, B. T., Soma, C. S., Tanana, M., Hirsch, T., Gibson, J., ... Atkins, D. C. (2019). Design feasibility of an automated, machine-learning based feedback system for motivational interviewing. Psychotherapy, (2), 318.
Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psychotherapy research: Scaling up the evaluation of patient–provider interactions. Psychotherapy, (1), 19.
Ioffe, S. (2006). Probabilistic linear discriminant analysis. In Proc. European Conference on Computer Vision (pp. 531–542).
Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., ... others (2003). The ICSI meeting corpus. In Proc. International Conference on Acoustics, Speech, and Signal Processing.
Kessler, R. C., Berglund, P., Demler, O., Jin, R., Merikangas, K. R., & Walters, E. E. (2005). Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Archives of General Psychiatry, (6), 593–602.
Klatte, R., Strauss, B., Flückiger, C., & Rosendahl, J. (2018). Adverse effects of psychotherapy: Protocol for a systematic review and meta-analysis. Systematic Reviews, (1), 135.
Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio augmentation for speech recognition. In Proc. Annual Conference of the International Speech Communication Association.
Kodish-Wachs, J., Agassi, E., Kenny III, P., & Overhage, J. M. (2018). A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. In Proc. AMIA Annual Symposium (Vol. 2018, p. 683).
Krippendorff, K. (2018). Content analysis: An introduction to its methodology. Los Angeles, CA: Sage Publications.
Krupski, A., Joesch, J. M., Dunn, C., Donovan, D., Bumgardner, K., Lord, S. P., ... Roy-Byrne, P. (2012). Testing the effects of brief intervention in primary care for problem drug use in a randomized controlled trial: Rationale, design, and methods. Addiction Science & Clinical Practice, (1), 27.
Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, (1), 79–97.
Lambert, M. J., & Bergin, A. E. (2002). The effectiveness of psychotherapy. In M. Hersen & W. Sledge (Eds.), Encyclopedia of psychotherapy (Vol. 1, pp. 709–714). USA: Elsevier Science.
Lambert, M. J., & Ogles, B. M. (1997). The effectiveness of psychotherapy supervision. In C. E. Watkins (Ed.), Handbook of psychotherapy supervision (pp. 421–446). USA: John Wiley & Sons, Inc.
Lambert, M. J., Whipple, J. L., & Kleinstäuber, M. (2018). Collecting and delivering progress feedback: A meta-analysis of routine outcome monitoring. Psychotherapy, (4), 520.
Lee, C. M., Kilmer, J. R., Neighbors, C., Atkins, D. C., Zheng, C., Walker, D. D., & Larimer, M. E. (2013). Indicated prevention for college student marijuana use: A randomized controlled trial. Journal of Consulting and Clinical Psychology, (4), 702.
Lee, C. M., Neighbors, C., Lewis, M. A., Kaysen, D., Mittmann, A., Geisner, I. M., ... others (2014). Randomized controlled trial of a spring break intervention to reduce high-risk drinking. Journal of Consulting and Clinical Psychology, (2), 189.
Lee, F.-T., Hull, D., Levine, J., Ray, B., & McKeown, K. (2019). Identifying therapist conversational actions across diverse psychotherapeutic approaches. In Proc. Workshop on Computational Linguistics and Clinical Psychology (pp. 12–23).
Levitt, H. M. (2001). Sounds of silence in psychotherapy: The categorization of clients' pauses. Psychotherapy Research, (3), 295–309.
Lord, S. P., Sheng, E., Imel, Z. E., Baer, J., & Atkins, D. C. (2015). More than reflections: Empathy in motivational interviewing includes language style synchrony between therapist and client. Behavior Therapy, (3), 296–303.
Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1064–1074).
Madson, M. B., & Campbell, T. C. (2006). Measures of fidelity in motivational enhancement: A systematic review. Journal of Substance Abuse Treatment, (1), 67–73.
Magill, M., Gaume, J., Apodaca, T. R., Walthers, J., Mastroleo, N. R., Borsari, B., & Longabaugh, R. (2014). The technical hypothesis of motivational interviewing: A meta-analysis of MI's key causal model. Journal of Consulting and Clinical Psychology, (6), 973.
Malik, U., Barange, M., Saunier, J., & Pauchet, A. (2018). Performance comparison of machine learning models trained on manual vs ASR transcriptions for dialogue act annotation. In (pp. 1013–1017).
Martinez, V. R., Flemotomos, N., Ardulov, V., Somandepalli, K., Goldberg, S. B., Imel, Z. E., ... Narayanan, S. (2019). Identifying therapist and client personae for therapeutic alliance estimation. Proc. Annual Conference of the International Speech Communication Association, 1901–1905.
Miller, W. R., & Rollnick, S. (2012). Motivational interviewing: Helping people change. Guilford Press.
Miller, W. R., Sorensen, J. L., Selzer, J. A., & Brigham, G. S. (2006). Disseminating evidence-based practices in substance abuse treatment: A review with suggestions. Journal of Substance Abuse Treatment, (1), 25–39.
Miner, A. S., Haque, A., Fries, J. A., Fleming, S. L., Wilfley, D. E., Wilson, G. T., ... Shah, N. H. (2020). Assessing the accuracy of automatic speech recognition for psychotherapy. npj Digital Medicine, (82), 82.
Moyers, T. B., Martin, T., Manuel, J. K., Hendrickson, S. M., & Miller, W. R. (2005). Assessing competence in the use of motivational interviewing. Journal of Substance Abuse Treatment, (1), 19–26.
Moyers, T. B., Rowell, L. N., Manuel, J. K., Ernst, D., & Houck, J. M. (2016). The Motivational Interviewing Treatment Integrity code (MITI 4): Rationale, preliminary reliability and validity. Journal of Substance Abuse Treatment, 36–42.
Nasir, M., Chakravarthula, S. N., Baucom, B. R., Atkins, D. C., Georgiou, P., & Narayanan, S. (2019). Modeling interpersonal linguistic coordination in conversations using word mover's distance. Proc. Annual Conference of the International Speech Communication Association, 1423–1427.
Neighbors, C., Lee, C. M., Atkins, D. C., Lewis, M. A., Kaysen, D., Mittmann, A., ... Larimer, M. E. (2012). A randomized controlled trial of event-specific prevention strategies for reducing problematic drinking associated with 21st birthday celebrations. Journal of Consulting and Clinical Psychology, (5), 850.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. In Proc. International Conference on Acoustics, Speech and Signal Processing (pp. 5206–5210).
Paul, D. B., & Baker, J. M. (1992). The design for the Wall Street Journal-based CSR corpus. In Proc. Workshop on Speech and Natural Language (pp. 357–362).
Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015). JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In Proc. Workshop on Automatic Speech Recognition and Understanding (pp. 539–546).
Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for efficient modeling of long temporal contexts. In Proc. Annual Conference of the International Speech Communication Association.
Perry, J. C., Banon, E., & Ianni, F. (1999). Effectiveness of psychotherapy for personality disorders. American Journal of Psychiatry, (9), 1312–1321.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... Vesely, K. (2011). The Kaldi speech recognition toolkit. In Proc. Workshop on Automatic Speech Recognition and Understanding.
Prince, S. J., & Elder, J. H. (2007). Probabilistic linear discriminant analysis for inferences about identity. In Proc. International Conference on Computer Vision (pp. 1–8).
Proctor, E., Silmere, H., Raghavan, R., Hovmand, P., Aarons, G., Bunger, A., ... Hensley, M. (2011). Outcomes for implementation research: Conceptual distinctions, measurement challenges, and research agenda. Administration and Policy in Mental Health and Mental Health Services Research, (2), 65–76.
Quiroz, J. C., Laranjo, L., Kocaballi, A. B., Berkovsky, S., Rezazadegan, D., & Coiera, E. (2019). Challenges of developing a digital scribe to reduce clinical documentation burden. npj Digital Medicine, (1), 1–6.
Rousseau, A., Deléglise, P., & Esteve, Y. (2014). Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proc. Language Resources and Evaluation Conference (pp. 3935–3939).
Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York, NY: McGraw-Hill, Inc.
Saon, G., Soltau, H., Nahamoo, D., & Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. In Proc. Workshop on Automatic Speech Recognition and Understanding (pp. 55–59).
Saxon, D., Barkham, M., Foster, A., & Parry, G. (2017). The contribution of therapist effects to patient dropout and deterioration in the psychological therapies. Clinical Psychology & Psychotherapy, (3), 575–588.
Schaefer, J. D., Caspi, A., Belsky, D. W., Harrington, H., Houts, R., Horwood, L. J., ... Moffitt, T. E. (2017). Enduring mental health: Prevalence and prediction. Journal of Abnormal Psychology, (2), 212.
Schmidt, L. K., Andersen, K., Nielsen, A. S., & Moyers, T. B. (2019). Lessons learned from measuring fidelity with the Motivational Interviewing Treatment Integrity code (MITI 4). Journal of Substance Abuse Treatment, 59–67.
Schwalbe, C. S., Oh, H. Y., & Zweben, A. (2014). Sustaining motivational interviewing: A meta-analysis of training studies. Addiction, (8), 1287–1294.
Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., ... others (2018). Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proc. Annual Conference of the International Speech Communication Association (pp. 2808–2812).
Shiner, B., D'Avolio, L. W., Nguyen, T. M., Zayed, M. H., Watts, B. V., & Fiore, L. (2012). Automated classification of psychotherapy note text: Implications for quality assessment in PTSD care. Journal of Evaluation in Clinical Practice, (3), 698–701.
Silovsky, J., Zdansky, J., Nouza, J., Cerva, P., & Prazak, J. (2012). Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams. In Proc. International Workshop on Multimedia Signal Processing (pp. 118–123).
Singla, K., Chen, Z., Atkins, D. C., & Narayanan, S. (2020). Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations. In Proc. Annual Meeting of the Association for Computational Linguistics.
Singla, K., Chen, Z., Flemotomos, N., Gibson, J., Can, D., Atkins, D. C., & Narayanan, S. (2018). Using prosodic and lexical information for learning utterance-level behaviors in psychotherapy. In Proc. Annual Conference of the International Speech Communication Association (pp. 3413–3417).
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing (pp. 5329–5333).
Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In Proc. International Conference on Spoken Language Processing.
Substance Abuse and Mental Health Services Administration. (2019). Key substance use and mental health indicators in the United States: Results from the 2018 National Survey on Drug Use and Health. Rockville, MD: Center for Behavioral Health Statistics and Quality.
Sutton, R. T., Pincock, D., Baumgart, D. C., Sadowski, D. C., Fedorak, R. N., & Kroeker, K. I. (2020). An overview of clinical decision support systems: Benefits, risks, and strategies for success. npj Digital Medicine, (1), 1–10.
Thomas, S., Saon, G., Van Segbroeck, M., & Narayanan, S. S. (2015). Improvements to the IBM speech activity detection system for the DARPA RATS program. In Proc. International Conference on Acoustics, Speech and Signal Processing (pp. 4500–4504).
Tollison, S. J., Lee, C. M., Neighbors, C., Neil, T. A., Olson, N. D., & Larimer, M. E. (2008). Questions and reflections: The use of motivational interviewing microskills in a peer-led brief alcohol intervention for college students. Behavior Therapy, (2), 183–194.
Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, (3), 450.
Xiao, B., Bone, D., Segbroeck, M. V., Imel, Z. E., Atkins, D. C., Georgiou, P. G., & Narayanan, S. S. (2014). Modeling therapist empathy through prosody in drug addiction counseling. In Proc. Annual Conference of the International Speech Communication Association.
Xiao, B., Can, D., Georgiou, P. G., Atkins, D., & Narayanan, S. S. (2012). Analyzing the language of therapist empathy in motivational interview based psychotherapy. In Proc. Asia Pacific Signal and Information Processing Association Annual Summit and Conference (pp. 1–4).
Xiao, B., Can, D., Gibson, J., Imel, Z. E., Atkins, D. C., Georgiou, P. G., & Narayanan, S. S. (2016). Behavioral coding of therapist language in addiction counseling using recurrent neural networks. In Proc. Annual Conference of the International Speech Communication Association (pp. 908–912).
Xiao, B., Huang, C., Imel, Z. E., Atkins, D. C., Georgiou, P., & Narayanan, S. S. (2016). A technology prototype system for rating therapist empathy from audio recordings in addiction counseling. PeerJ Computer Science, e59.
Xiao, B., Imel, Z. E., Atkins, D. C., Georgiou, P. G., & Narayanan, S. S. (2015). Analyzing speech rate entrainment and its relation to therapist empathy in drug addiction counseling. In Proc. Annual Conference of the International Speech Communication Association (pp. 2489–2493).
Xiao, B., Imel, Z. E., Georgiou, P. G., Atkins, D. C., & Narayanan, S. S. (2015). "Rate my therapist": Automated detection of empathy in drug and alcohol counseling via speech and language processing. PLoS ONE, (12).
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., ... Zweig, G. (2017). Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, (12), 2410–2423.

Appendix A
System Training Details
The following sections provide some technical details related to the described system, including hyper-parameter values and training procedures.
Audio Feature Extraction
MFCCs are extracted every 10msec using 25msec-long windows.
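Equivalent features can be extracted, for instance, with librosa; the sampling rate and FFT size below are assumptions for illustration.

```python
import librosa

y, sr = librosa.load("session.wav", sr=16000)     # 16 kHz assumed
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,                        # 13 coefficients, as used by the VAD
    win_length=int(0.025 * sr),                   # 25 msec analysis window
    hop_length=int(0.010 * sr),                   # 10 msec frame shift
    n_fft=512,
)
# mfcc.shape == (13, n_frames)
```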
Voice Activity Detection
The feed-forward neural network comprises two layers of 512 neurons each with sigmoid activation functions, before a final inference layer giving a frame-level probability. The input is a 13-dimensional MFCC vector characterizing a frame, spliced with a context of 30 neighboring frames (15 on each side).
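A minimal Keras sketch of this topology follows; the optimizer and loss are assumptions, since only the architecture is specified above.

```python
import tensorflow as tf

FEAT_DIM = 13                        # MFCCs per frame
CONTEXT = 30                         # spliced neighboring frames (15 per side)
INPUT_DIM = FEAT_DIM * (CONTEXT + 1)

vad = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(INPUT_DIM,)),
    tf.keras.layers.Dense(512, activation="sigmoid"),
    tf.keras.layers.Dense(512, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # frame-level speech probability
])
vad.compile(optimizer="adam", loss="binary_crossentropy")   # assumed training setup
```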
Speaker Diarization

Each voiced segment, as predicted by VAD, is partitioned uniformly into short fixed-length subsegments. The x-vector extractor used is the publicly available pre-trained Kaldi model at kaldi-asr.org/models/m6, which was originally used to diarize telephone conversations and expects 23-dimensional MFCCs as input features. In particular, the first 5 layers of the neural architecture (x-vector extractor) operate at the frame level and are inspired by Time-Delay Neural Networks (TDNNs), where each layer sees a different temporal context. Those are followed by a statistics pooling layer that computes the mean and standard deviation vectors, which are the inputs to a fully-connected layer operating at the segment level, before a final softmax inference layer that maps segments to speaker labels. The 128-dimensional embeddings used are the outputs of the final hidden layer and are further mean- and length-normalized. The subsegments are finally clustered together following a Hierarchical Agglomerative Clustering (HAC) approach with average linking, using PLDA as the similarity metric. After this step, adjacent speech segments assigned to the same speaker are concatenated into a single speaker turn, allowing a maximum of 1 sec of in-turn silence.
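The clustering step can be sketched as follows; cosine distance is used here only as a stand-in for the PLDA scoring actually employed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_segments(xvectors, num_speakers=2):
    """HAC with average linking over (length-normalized) segment embeddings."""
    x = xvectors / np.linalg.norm(xvectors, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T                       # pairwise cosine distances
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=num_speakers, criterion="maxclust")

labels = cluster_segments(np.random.randn(40, 128))   # 40 subsegments, 128-dim x-vectors
```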
Automatic Speech Recognition

The input feature vectors to the TDNN architecture are 40-dimensional MFCCs which are augmented by 100-dimensional i-vectors, extracted online through a sliding window. First, word alignments are derived based on the GMM/HMM paradigm. The training data consist of the Fisher English, ICSI Meeting Speech, WSJ, 1997 HUB4, Librispeech, TED-LIUM, AMI, and TOPICS-CTT corpora. We use the officially recommended training subsets for Librispeech and TED-LIUM and the recommended training and development sets for AMI. We randomly choose 95% of the available Fisher utterances and 80% of the available ICSI, WSJ, and HUB4 utterances. We also use the 242 TOPICS-CTT training sessions described in the paper. We have kept the rest of the combined dataset for internal validation and evaluation of the ASR system. Among the aforementioned corpora, TED-LIUM and the clean portion of Librispeech are augmented with speed perturbation, noise, and reverberation. The ASR acoustic model was built and trained using the Kaldi speech recognition toolkit, following the nnet3 'chain' setup. The two 3-gram LMs are trained using the SRILM toolkit and are interpolated with a mixing weight equal to 0.8 for the in-domain model and 0.2 for the background model.
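At the probability level, the interpolation amounts to a weighted sum of the two models' predictions (SRILM implements this natively); a one-line sketch:

```python
def interpolated_prob(p_in_domain, p_background, lam=0.8):
    """Linear LM interpolation: P(w|h) = lam * P_in(w|h) + (1 - lam) * P_bg(w|h)."""
    return lam * p_in_domain + (1.0 - lam) * p_background

# e.g., a word that is frequent in therapy talk but rare in background text
p = interpolated_prob(p_in_domain=0.012, p_background=0.0004)   # 0.00968
```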
Speaker Role Recognition

The required LMs are 3-gram models with Kneser-Ney smoothing, trained with the SRILM toolkit, using the TOPICS-CTT training and CPTS corpora with mixing parameters 0.8 and 0.2, respectively.
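While the full SRR algorithm is described earlier in the paper, the final role-assignment step is commonly formulated as choosing the speaker-to-role mapping that minimizes total perplexity under the role-specific LMs. The sketch below assumes a callable lm_logprob interface (a unigram simplification) and is purely illustrative.

```python
def perplexity(tokens, lm_logprob):
    """Perplexity of a token sequence under an LM that returns
    log10 probabilities (interface assumed for illustration)."""
    lp = sum(lm_logprob(t) for t in tokens)
    return 10.0 ** (-lp / max(len(tokens), 1))

def assign_roles(words_a, words_b, therapist_lm, client_lm):
    """Pick the mapping of the two speaker clusters to roles that
    yields the lowest combined perplexity."""
    direct = perplexity(words_a, therapist_lm) + perplexity(words_b, client_lm)
    swapped = perplexity(words_a, client_lm) + perplexity(words_b, therapist_lm)
    if direct <= swapped:
        return {"A": "therapist", "B": "client"}
    return {"A": "client", "B": "therapist"}
```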
Utterance-level Code Prediction
The BiLSTM network was first trained on the TOPICS-CTT data using the Adam optimizer with a learning rate of 0.001 and an exponential decay of 0.9. The batch size was equal to 256 utterances. The system was trained on that dataset for 30 epochs with an early stopping strategy, keeping the model with the lowest validation loss. During the training process we used class weights inversely proportional to the class frequencies. The system was further fine-tuned by continuing training on the UCC_train data.
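The configuration above can be summarized in Keras as follows; the network dimensions, embedding size, decay interval, and dummy data are assumptions made so that the sketch is self-contained.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 300)),          # utterance as word-embedding sequence
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(9, activation="softmax"),    # 9 utterance-level MISC codes
])

# Adam with initial learning rate 0.001 and exponential decay of 0.9
# (the decay interval is an assumption).
lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
              loss="sparse_categorical_crossentropy")

# Class weights inversely proportional to (illustrative) class frequencies.
counts = np.array([5000, 1200, 900, 700, 650, 400, 150, 120, 300], dtype=float)
class_weight = dict(enumerate(counts.sum() / (len(counts) * counts)))

x = np.random.randn(512, 20, 300).astype("float32")    # dummy batch of utterances
y = np.random.randint(0, 9, size=512)
model.fit(x, y, batch_size=256, epochs=30, validation_split=0.1,
          class_weight=class_weight,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", restore_best_weights=True)])
```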
Appendix B
CORE-MI Final Report Example
In Figure B1, the two main views of CORE-MI are displayed: the session view and the report view. In the first one, the user can listen to the recording of the therapy session, watch the video (if available), and read the generated transcript, which is scrollable and searchable. The report view provides the actual therapy session evaluation. Among the session-level codes, only empathy is shown in the version displayed here.
Appendix C
Voice Activity Detection: Frame-Level Results
Voice Activity Detection (VAD) performance is typically incorporated in the evaluation of a diarization system in the form of false alarm and missed speech rates (Table 6 of the paper). However, especially in our system, those results are heavily influenced by post-processing steps and do not accurately represent VAD performance. The VAD results on the UCC data at the frame level are given in Table C1. In particular, the problem is treated as a binary classification task where each frame takes a binary value (voiced or unvoiced). Results are reported in terms of accuracy, precision, and recall. The Unweighted Average Recall (UAR) is also reported; it is the metric used during hyper-parameter tuning.
Table C1
Voice Activity Detection (VAD) results (%) at the frame level for the test sets of the UCC data: accuracy, precision, recall, and Unweighted Average Recall (UAR). UAR was the main metric used for optimization of the VAD system.
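UAR is simply the mean of the per-class recalls (equivalently, balanced accuracy), which makes it robust to the heavy voiced/unvoiced class imbalance in long recordings:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy frame-level VAD decisions: 1 = voiced, 0 = unvoiced.
y_true = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0]

uar = recall_score(y_true, y_pred, average="macro")    # mean of per-class recalls
assert abs(uar - balanced_accuracy_score(y_true, y_pred)) < 1e-12
```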
Figure B1
CORE-MI platform to provide therapy feedback. In the session view (top), the user can listen to the recording, watch the video, read the generated transcription, and keep notes. In the report view (bottom), the automatically generated evaluation of the session is presented.