A Study on the Manifestation of Trust in Speech
Lara Gauder a,b,1, Leonardo Pepino a,b, Pablo Riera a, Silvina Brussino c,d, Jazmín Vidal a,b, Agustín Gravano e, Luciana Ferrer a

a Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
b Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires (UBA), Argentina
c Facultad de Psicología, Universidad Nacional de Córdoba (UNC), Argentina
d Instituto de Investigaciones Psicológicas, CONICET-UNC, Argentina
e Escuela de Negocios, Universidad Torcuato Di Tella, Argentina
Abstract
Research has shown that trust is an essential aspect of human-computer interaction, directly determining the degree to which a person is willing to use a system. An automatic prediction of the level of trust that a user has in a certain system could be used to attempt to correct potential distrust by having the system take relevant actions like, for example, apologizing or explaining its decisions. In this work, we explore the feasibility of automatically detecting the level of trust that a user has in a virtual assistant (VA) based on their speech. Since, to our knowledge, no public databases were available to study the effect of trust in speech, we developed a novel protocol for collecting speech data from subjects induced to have different degrees of trust in the skills of a VA. The protocol consists of an interactive session where the subject is asked to respond to a series of factual questions with the help of a virtual assistant. In order to induce subjects to either trust or distrust the VA's skills, they are first informed that the VA was previously rated by other users as being either good or bad; subsequently, the VA answers the subjects' questions consistently with its alleged abilities. All interactions are speech-based, with subjects and VAs communicating verbally, which allows the recording of speech produced under different trust conditions. Using this protocol, we collected a speech corpus in Argentine Spanish. We show clear evidence that the protocol effectively succeeded in influencing subjects into the desired mental state of either trusting or distrusting the agent's skills, and present results of a perceptual study of the degree of trust performed by expert listeners. Finally, we found that the subject's speech can be used to detect which type of VA they were using, which could be considered a proxy for the user's trust toward the VA's abilities, with an accuracy of up to 76%, compared to a random baseline of 50%. These results are obtained using features that have been previously found useful for detecting speech directed to infants and non-native speakers. The collected speech dataset is publicly available for research use.

Lara and Leonardo contributed equally to the paper. Lara focused on the design, implementation and deployment of the data collection protocol, as well as data curation. Pablo and Lara worked on the analysis of the collected database. Leonardo worked on the machine learning approaches, experimentation and analysis of results. Silvina helped design the data collection protocol. Jazmín helped with data curation. Agustín and Luciana directed the work, with hands-on contributions in the design of the protocol, the code, the machine learning approaches and the experiments.
1. Introduction
The ability to dynamically monitor the user's mental state, including their engagement, satisfaction, and emotions in general (Kiseleva et al., 2016; Sano et al., 2016; Kiseleva and de Rijke, 2017), is becoming an increasingly important component of conversational agents. In particular, and especially for virtual assistants (VAs), tracking the user's degree of trust in the system's skills may be critical for the success of the interaction (Parasuraman and Riley, 1997; Drnec et al., 2016). Distrust in a system can cause the user to under-use its capabilities. If a user starts displaying cues of distrust and the system can effectively detect such cues, then the dialogue manager could choose to act accordingly to regain the user's trust (Muir, 1994; Okamura and Yamada, 2020; Drnec et al., 2016).

Several disciplines have investigated trust for decades, including psychology, anthropology, sociology, economics and political science. One important area of research has been the search for the factors that explain trust. Mayer et al. consider trust to depend on the trustor's perception of the trustee's ability, benevolence and integrity (Mayer et al., 1995). Other such factors have been proposed, including contextual and situational factors (Gulati, 1995; Zucker, 1986), and the propensity of a person to trust (Rotter, 1967), among others. Trust has also been described as a dynamic phenomenon: it can be created or destroyed during a conversation (Zand, 1972; Korsgaard et al., 2014).

For systems that communicate with the user through speech, one possible way to track the level of trust is through the user's voice. It is reasonable to assume that a person would change the way they speak depending on whether they trust their interlocutor or not. Trust's nature has been depicted both as rational or cognitive (Coleman, 1990; Hardin, 2006) and as emotional or affective (Miller, 1974), or a combination of both (Möllering, 2006). In either case, we hypothesize that the degree of trust affects and is affected by linguistic aspects (e.g., the form and content of discourse) and paralinguistic aspects (e.g., the intonation, pitch, speech rate and voice quality) of the trustor's and trustee's speech. The main goal of this research project is to study to what extent the trustor's degree of trust can be predicted from their speech signal during a human-computer interaction using fully automatic methods. We focus on eliciting and detecting trust or distrust of a person (the trustor) toward a VA's (the trustee's) abilities, leaving the study of the effect of benevolence and integrity for future work.

To date, very little research has been done in this area, but there are some indications that trust can be detected from the trustor's speech. Waber et al. (2015) studied paralinguistic aspects in medical conversations between nurses and found that the emphasis used by an outgoing nurse when talking to an incoming nurse was significantly related to the degree of trust that the outgoing nurse reported having in their colleagues. Further, Elkins and Derrick (2013) found that variations of pitch over time were related to the degree of trust of the speaker during human-computer interactions in the form of interviews.

In this paper, we present a preliminary set of results that we hope will contribute to answering the question of whether trust can be detected from the trustor's voice. The experiments were conducted on the
Trust-UBA Database, which was collected specifically for this purpose, since, to our knowledge, no speech corpus was available with annotations of varying degrees of trust that was large enough to allow for statistical analyses and machine learning experiments. Consequently, we designed and implemented a novel protocol for inducing varying degrees of trust, in which subjects interacted with a virtual assistant (VA) in order to respond to a series of factual questions. Before each series of questions, the subjects were told that the particular VA they were using had been rated with a high or a low score by previous users. This initial bias was later reinforced during the task by having the VA respond all or only some of the questions correctly, respectively. We subsequently used this protocol for building the
Trust-UBA Database in Argentine Spanish. We show results that indicate that the protocol succeeded in eliciting varying degrees of trust. Further, we show that a team of expert listeners achieved very low agreement in the annotation of the perceived degree of trust based only on the speech signal, indicating the difficulty of the task.

Using the Trust-UBA database, we investigated whether it is possible to automatically detect which type of VA the user is interacting with, a reliable or an unreliable one. To this end, we implemented a classification system based on a set of features extracted automatically from the user's speech. The features were motivated by the work done on speech directed to at-risk listeners like infants, non-native speakers and people with hearing impairment, which we believed would share similar characteristics with speech directed to an unreliable VA. Research shows (Hazan et al., 2015; Scarborough et al., 2007; Uther et al., 2007; Saint-Georges et al., 2013) that non-native- and infant-directed speech include some of the following characteristics: vowel hyper-articulation, a decrease in speech rate, an increase in the number and length of pauses, and an increase in pitch excursions. Further, speech directed to computers that make mistakes has been found to have similar characteristics (Oviatt et al., 1998). Based on these works, we designed a set of features aimed at capturing these effects.

Our experiments on the Trust-UBA database using these features show that it is possible to detect whether a user is talking to a reliable or an unreliable VA based on their speech, with an accuracy of up to 76%. Note that we are not directly detecting mistrust but rather a proxy, given by the reliability of the VA. Yet, as discussed below, evidence indicates that users did trust the unreliable VA less than the reliable VA. These findings suggest that we may indeed be able to detect the trust level from a user's speech, though further experiments are needed to confirm these findings on larger datasets with less controlled scenarios.

The rest of the paper is organized as follows. Section 2 describes and provides an analysis of the Trust-UBA database, Section 3 describes the machine learning experiments and results, and Section 4 provides the conclusions.
2. The Trust-UBA Database
In this section we describe the data collection protocol and the resulting database, and include an analysis of different aspects of the data. The full database is available for research purposes. Please contact the authors if you are interested in obtaining the dataset.
The protocol consists of an interactive session in which the subject must respond to a series of questions, aided by a speech-enabled virtual assistant (VA). A text-only version of this protocol was first described and evaluated in (Gauder et al., 2019). In this section we describe the speech-based version of the protocol in detail.
The subjects' task is to respond to a series of factual questions with the help of a VA. For each question, subjects are required to (1) interact verbally with the VA to find the answer to the question; (2) listen to and transcribe the answer given by the VA; (3) rate their confidence in the response given by the VA using a 5-level Likert scale; and (4) enter their own answer, based on what they believe to be correct (this may or may not match the VA's response). Figure 1 shows a screenshot of the user interface: the current factual question is shown at the top left of the screen; below that is a form in which subjects must enter the VA's response, their confidence in the VA's response, and their own response. On the right lies a voice recording button used to communicate with the VA.

At the beginning of a series of questions, subjects are told that the VA they will interact with was previously rated by other users with either a very high or very low score: 4.9 and 1.4 out of 5 stars, respectively (these two values were chosen empirically based on pilot tests; see Gauder et al., 2019). These two conditions are central in our protocol and are meant to bias the user toward either trusting or distrusting the VA's skills. We refer to them as the high-score and the low-score conditions, or simply H and L. With this setup we intend to benefit from a well-studied cognitive phenomenon called anchoring, or previous-opinion bias, in which a person's decision-making process is influenced by an initial piece of information offered to them, such as a house valuation made by another broker, or a patient diagnosis made by another doctor (Sackett, 1979; Tversky and Kahneman, 1974). Subsequently, the quality of the responses given by the VA is consistent with the informed abilities, making no mistakes in the H condition, and making some mistakes in the L condition. This is intended to reinforce in the subject the feeling that the former system is good, and the latter is bad.

Figure 1: Screenshot of the user interface for the question "What is the distance between Bariloche and Buenos Aires?" (1). Subjects enter, from top to bottom: the answer according to the VA (2), their confidence in the VA's answer (3), and their own answer (4). Subjects interact with the VA using the voice recording button shown on the right: "Talk to the Virtual Assistant" (5).

For each condition, the subject solves a questionnaire that contains 18 questions which were chosen from a set of 36. In this paper, we will call each sequence of 18 questions corresponding to one condition from a recording session a series. When both conditions were completed by a subject, questions did not repeat across the two series. Hence, in those cases, subjects saw all 36 questions. Most questions appear in both conditions, though not exactly the same number of times, and 6 of the 36 questions only appeared in the low-score condition.

Of the 18 questions in each series, 6 were selected to be easy and 12 to be difficult. Easy questions are about topics that should be obviously known by anyone (e.g., "How many days are there in a week?") and are used to generate the feeling in the subject that the VA actually works. Difficult questions, on the other hand, were selected so that their correct answers would likely be unknown to most people (e.g., "What are the three longest rivers in Argentina?"). Thus, for difficult questions, subjects should depend on the VA's responses.
Furthermore, from the subjects' perspective, difficult questions make the task more challenging and interesting; and, for our part, these questions allow us to manipulate the subjects' varying degree of trust in the VA's skills.

Difficult questions may be answered either correctly or incorrectly by the VA, as a reinforcement of the corresponding initial bias presented to the subject. In the L condition, 6 of the 12 difficult questions are answered incorrectly; in the H condition, all 12 difficult questions are answered correctly. Importantly, no easy questions are ever answered incorrectly, since we found in pilot tests that doing so typically caused unnecessary frustration in the subjects, along with an irreversible feeling that the VA was useless. For that reason, incorrect answers to difficult questions were chosen to trigger a sense of doubt in the subjects; even though they may not know the correct answer, they should feel that the VA's answer is wrong, without seriously hurting its reputation. For example, for the question "What is the distance between Barcelona and Madrid?", the VA's incorrect answer is 1000 km (it is actually 504 km).

Lastly, questions can also be divided into two types, depending on the length of the interaction they are expected to trigger. Some questions and answers were prepared to force an exchange of several conversational turns, aiming at collecting more speech data from the subjects. For example, after the subject asks "What is the melting temperature of aluminum?", the VA may ask in which unit of measurement it should provide the answer (Celsius or Fahrenheit degrees). Using this strategy, we force subjects to have longer interactions with the VA and produce more dialogue acts, not only questions but also answers.
At the beginning of each recording session, subjects are asked to complete a few questions regarding their demographic information, including gender, age, birthplace, and first and second languages. Next, they rate their self-perceived personality traits along 15 dimensions which can be mapped to the Big Five personality types, based on work by Mondak (2010): creative, curious, intelligent, tidy, careful, hardworking, outgoing/enthusiastic/expressive, conversationalist, sociable, warm/affectionate, friendly, comprehensive, relaxed, calm, and stable. Lastly, subjects rate their familiarity with, and trust in, virtual assistants and other digital systems, including GPS systems, automatic phone menus, and interactive-voice-response systems.

To assess the progress of the subjects' degree of trust throughout the series, they are required to complete simple evaluation surveys after questions 6, 12 and 18. The first question, "So far, how confident are you in the system's ability to answer questions?", is answered on a 5-level Likert scale presented using a 5-star metaphor, as seen in the top part of Figure 2. Only after answering this question, subjects are reminded that the current VA received an average of X stars from other users (as explained above, X = 4.9 or X = 1.4).

Figure 2: Screenshot of an intermediate survey (1). Subjects answer two questions: "So far, how confident are you in the system's ability to answer questions?" (2); "The Virtual Assistant you are evaluating received an average of 1.4 stars from other users. Why do you think you gave a higher rating than the average user?" (3).
The purpose of the evaluation surveys is twofold. They measure the progression of the subjects' degree of trust along the series, and they also refresh the anchoring bias introduced at the beginning of the series, thus reinforcing the H or L condition. The required textual explanation is intended to make subjects more conscious of this bias.

After completing a series of 18 questions and the third evaluation survey, subjects are required to rate how useful they found the VA, their degree of frustration with it, and how much they trusted it. They also report the extent to which they felt the following emotions and sentiments during the interaction: active, afflicted, attentive, tired, decided, disgusted, distracted, enthusiastic, inspired, uneasy, nervous, and fearful. All questions are answered using a 5-point Likert scale. These surveys are intended to further monitor and understand the subjects' behavior during their interaction with the VA.
Subjects interact with the VA via a voice recording button (right part of Figure 1), with which they may request the information needed to answer each question. The study interface was implemented online, to allow for data collection both in a controlled laboratory and remotely over the Internet. We built the VA dialogue system with the OpenDial toolkit (Lison and Kennington, 2016), using a separate 'dialogue domain' for each question, i.e., a separate set of rules to trigger the system responses. We built a set of deterministic pattern-matching rules to cover the potentially many ways in which subjects may phrase their sentences. We synthesized the VA's responses using Microsoft's publicly available speech synthesizer (https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech), with the HelenaRUS female voice in Spanish from Spain with default settings. The subjects' utterances were transcribed using Google's publicly available automatic speech recognition system (https://cloud.google.com/speech-to-text).

Subjects were recruited via ads on social media, emails to student mailing lists, and posters on the university campus. A total of 50 subjects participated at the university, in a controlled, silent environment (we call these the in-lab subjects). This group was asked to solve two series of questions (one in each condition) and received a small monetary compensation for their time. Another 110 subjects participated over the Internet (these are the remote subjects). In this case we had no control over the environment, which could result in poorer recording quality and lower concentration levels. This group was required to finish at least one series of questions, and was included in biweekly draws for a small monetary prize as compensation.

The main purpose of the protocol is to induce subjects into either trusting or distrusting the VA's skills. In this section we analyze its effectiveness, (1) by looking at the subjects' ratings given to the VA at each question and at the intermediate and final surveys, and (2) by comparing the answers subjects wrote based on the VA's response and based on their own knowledge (boxes 2 and 4 in Figure 1).

As explained in the previous section, occasionally the VA made mistakes that caused it to ask the subject to repeat a question. All analyses in this section consider only the questions and surveys within a series that occurred before the first error made by the VA, since any responses from the subject after that are likely to be affected by the error. We assumed that the negative effect of a VA error on the subject's perception would carry over until a new series started and the user was told that this was now a new VA with a different score. Hence, questions in the second series (until the first error in the series) were included even if system errors occurred in the first series. In the case of the per-question analyses, we only included difficult questions that were answered correctly by the VA. We consider these questions to be the most relevant ones for our analyses, since we expect the rating given for easy questions and for difficult questions answered incorrectly to be affected by the fact that the subject knows whether the answer to the question is correct or incorrect, and not by their level of trust toward the VA. Finally, for these analyses, we discard 39 subjects (19 in-lab and 20 remote of the 84 selected for the experiments in this paper) that experienced a system error before question number 6 and, consequently, would not have answered any survey before that first error.
Hence, the following analyses were made using the sessions from 45 subjects in total, 31 in-lab and 14 remote. The code used for the analyses in this section is included with the publicly available Trust-UBA database to ease replicability.
First, we used a Generalized Linear Mixed-Effects Model (GLMM) (Bates et al., 2015; Luke, 2017; Kuznetsova et al., 2017) to predict the condition of the series (H or L) given the score reported by subjects for each of the questions (star rating in Figure 1), including random effects given by the subjects. We assume that the logit of the probability that a certain question i corresponds to condition H, given the score s_i reported for that question and the subject identity u, is given by

logit(P(H | s_i, u)) = a · s_i + b_u + b,

where logit(p) = log(p / (1 − p)), a is a global factor, b_u is a subject-dependent bias, and b is a global bias. To estimate the parameters of this model we use the glmer function from the lme4 package in R.
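The paper estimates this model with R's glmer; for readers working in Python, the sketch below fits the same logistic random-intercept structure with statsmodels' Bayesian mixed GLM. This is a rough equivalent (a variational approximation, not the likelihood fit used in the paper), and the file and column names are hypothetical.

```python
# Hypothetical layout: one row per question with the reported star rating
# ('score'), the series condition ('is_high', 1 for H), and the subject id.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("per_question_scores.csv")

# logit P(H | s_i, u) = a * s_i + b_u + b: fixed effect for the score,
# random intercept per subject; the global intercept is added automatically.
model = BinomialBayesMixedGLM.from_formula(
    "is_high ~ score",
    {"subject": "0 + C(subject)"},
    df,
)
result = model.fit_vb()  # variational Bayes approximation
print(result.summary())
```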
We found that the prediction can be made with high significance, z-value = 4.07 and p = 4.7 × 10⁻⁵, indicating that the scores reported by the subjects were, as desired, affected by the condition of each series. A similar observation can be made when predicting the condition of the series given the scores provided by the subjects on the intermediate and final surveys (taken after questions 6, 12 and 18; see Section 2.1.3). In this case, a GLMM with subjects and survey number (1, 2 or final) as random effects gave z-value = 5.34 and p = 9.3 × 10⁻⁸.
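Since glmer reports Wald z statistics, the two-sided p-values above follow directly from the normal tail probabilities of the z values; a quick check:

```python
# Two-sided p-values implied by the reported Wald z statistics.
from scipy.stats import norm

for z in (4.07, 5.34):
    p = 2 * norm.sf(z)  # upper tail probability, doubled for a two-sided test
    print(f"z = {z:.2f} -> p = {p:.1e}")
# z = 4.07 -> p = 4.7e-05
# z = 5.34 -> p = 9.3e-08
```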
To visualize the effect captured by the GLMMs, we plot the difference between the average scores reported by the subject (before the first system error) for both conditions, on the per-question scores and on the intermediate survey scores.

Figure 3: Differences between average scores for the H and L conditions for each of the subjects. On the left, per-question scores; on the right, intermediate survey scores.

Figure 3 shows these differences for each of the 39 subjects with at least one intermediate survey before the first system error; these are the same samples used for the GLMM models. Subjects are sorted by the score difference. Positive values indicate that the effect of the condition happened in the desired direction: a higher score was given, on average, to the more reliable VA. We can see that, for both types of scores, the majority of speakers were affected by the condition in the expected direction.

As we explained in Section 2.1.1, for each question, subjects were required to type the VA's answer and their own answer. By comparing both answers for a certain question, we can get indirect information about whether or not the subject trusted the skills of the VA. That is, we expected that subjects would use the VA's answer for the difficult questions as their own answer more times in the H condition than in the L condition.

To analyze whether this expected behavior indeed occurred, we manually labelled the two answers (typed in boxes 2 and 4 in Figure 1) for a certain question as being either equivalent or different. We considered the answers to be equivalent if the content, but not necessarily the form, was the same. For example, for the question "In which year was the first traffic light installed?", the answers "It was in 1868" and "1868" were considered equivalent, while "1868" and "1890" were considered different. While doing this manual annotation, we found that 12 of the 45 subjects did not fill in their own answer with an actual response to the question, but rather used phrases such as "I don't know" or "I agree". While for some of these answers one could infer whether the subject trusted or distrusted the VA, this was not always the case. For example, the answer "I don't know" does not necessarily imply that the subject did not trust the VA's answer. Hence, we were forced to discard these speakers from this analysis.

For the remaining 33 subjects we calculated the ratio of questions for which both answers were equivalent in each condition, considering (as explained above) only difficult questions answered correctly by the VA, and prior to the first error made by the system. Figure 4 shows the difference between these ratios for conditions H and L for each subject. We can see that, as expected, the subjects repeated the answer given by the VA as their own answer for a larger percentage of questions in the H condition than in the L condition.

In addition, we found a significant positive correlation between the difference between the means of the scores for each question for both conditions (corresponding to the y-axis in the left plot of Figure 3) and the difference between the ratios of equivalent answers for both conditions (corresponding to the y-axis in Figure 4). The Pearson correlation coefficient was r(31) = 0.…, p-value = 0.….

An additional research question in the current project is whether humans are capable of telling, solely from the speech signal, whether the speaker trusted or distrusted the VA's skills. We first conducted informal pilot studies in which the authors tried to perform this task, only to find it extremely difficult.
We thus decided to gather a team of psychology researchers and practitioners who, as experts in human behavior, would be good candidates for succeeding in this task.

Figure 4: Differences between the ratio of equivalent answers given to questions in the H and L conditions.
As a result, ten female expert annotators were asked to listen to a pair of sequences of audios from each in-lab subject (they thus listened to 50 pairs of audios). Each such sequence was formed by the first recordings produced by the subject for each of the final six questions in a series, during which we expect the trust/distrust effect to be at its maximum level. The six audios in a sequence were merged into a wav file, separated by a simple tone. Each pair of sequences was presented to annotators on a web page, in random order. For each pair, they had to select which audio corresponded to utterances directed at the less trustworthy VA, together with their confidence level on a 5-level Likert scale. At the end of this study, annotators were asked to write in their own words what factors they considered when conducting this task. All annotators were paid for this task.

We examined inter-annotator agreement using Fleiss' κ measure (Fleiss, 1971), which yielded a value of 0.116. This is interpreted as "slight" agreement above chance. We also conducted a permutation test (Good, 2013) (with 1000 permutations), which confirmed that this slight agreement is indeed significantly not random (p ≈ …).
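A minimal sketch of how this agreement analysis could be reproduced, assuming the annotations are stored as a 50-pair by 10-annotator array of binary choices (the file name and layout are hypothetical, not from the paper):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

choices = np.loadtxt("annotator_choices.csv", delimiter=",", dtype=int)  # (50, 10)

# aggregate_raters converts item-by-rater labels into category counts per item.
counts, _ = aggregate_raters(choices)
observed = fleiss_kappa(counts)
print("Fleiss' kappa:", observed)

# Permutation test: shuffle each annotator's labels independently to build
# a null distribution of kappa under chance-level agreement.
rng = np.random.default_rng(0)
null = []
for _ in range(1000):
    permuted = np.column_stack([rng.permutation(col) for col in choices.T])
    c, _ = aggregate_raters(permuted)
    null.append(fleiss_kappa(c))
p = (1 + sum(k >= observed for k in null)) / (1 + len(null))
print("permutation p-value:", p)
```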
3. Automatic Detection of Trust
In this section we describe our experimental setup and results for the task of automatically predicting the condition (high- or low-score, or simply H and L) from the subject's speech, for each question or series of questions. Given the analysis presented in Section 2.3.1, where we show that the condition of each series of questions significantly affected the level of trust reported by the subjects, the condition given by the initial bias for each series of questions can be considered a proxy for the trust that the subject had in the VA. Alternatively, we could have attempted to predict the scores reported by the subject (either on the intermediate surveys or per question). Yet, these scores are likely to be affected by hidden biases that are subject-dependent, since people may use the Likert scale very differently from each other. Hence, we leave the task of predicting the scores reported by the subjects for future work, and focus on predicting the condition.
3.1. Features

As described in the introduction, the features used in this paper are motivated by previous work done on speech directed to at-risk listeners. In particular, we focused on the characteristics described by Oviatt et al. (1998) as being related to hyper-articulation, which include more frequent and longer pauses, slower speech rate, clearer differentiation of vowel space with respect to formant values, increased pitch, and expansion of pitch range. The features described below aim to represent these characteristics using measures that can be automatically extracted from the waveforms. The only manual annotation done on this dataset is the orthographic transcription, which was first done automatically and then corrected manually when necessary.

In order to detect the start and end of the speech for each utterance, and the duration of intermediate pauses, forced alignments to the manual transcriptions were performed using the Montreal Forced Aligner (McAuliffe et al., 2017). The extracted features were computed only over regions determined as speech by the forced aligner, and pauses shorter than 50 ms were considered part of the surrounding speech. Speech duration was calculated in two different ways: considering the full duration T_F from the start to the end of the utterance; and considering only the speech regions T_S, ignoring pauses between speech regions.

We computed a total of 16 features for each waveform: 3 related to speaking rate, 7 related to pitch, 2 related to energy, and 4 related to formant information.

Syllable rates, including and excluding pauses, were calculated by dividing the number of syllables by T_F and T_S, respectively. The number of syllables in each utterance was calculated from the transcriptions using Syltippy (https://github.com/nur-ag/syltippy), a Spanish syllabification tool based on (Hernández-Figueroa et al., 2013).

Pause-to-speech ratio was calculated as the total pause duration divided by the total speech duration T_S.

Pitch features were extracted using frame-level estimates of the fundamental frequency (F0). F0 was calculated using OpenSmile's smileF0 configuration file, with an F0 tracking frequency range of 100 to 620 Hz over frames of 50 ms shifted by 10 ms. OpenSmile assigns an undefined F0 value to frames estimated to be unvoiced. The resulting F0 signals were further masked, turning all F0 values detected over pause regions into undefined values. The resulting signal was turned into a logarithmic scale and split into regions, defined as sequences of consecutive frames separated by more than 50 ms of unvoiced frames. Finally, we computed the following summarized values: range, given by the difference between the 95th and 5th quantiles over all values; median over all values; mean and standard deviation over the regions of the median within each region; mean and standard deviation over the regions of the range within each region; and final slope, calculated using a linear regression over the last 25 (defined) frames of the F0 signal.
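To make the pitch functionals concrete, the sketch below computes the seven summaries from a masked, log-scale F0 track. It is a minimal illustration, not the authors' code: it assumes a 10 ms frame shift, uses NaN for undefined frames, and takes max minus min as the per-region range (the exact per-region range definition is our assumption).

```python
import numpy as np

def pitch_functionals(log_f0, frame_shift=0.01, max_gap=0.05):
    """Summarize a log-F0 track (NaN = unvoiced/pause) into 7 values."""
    voiced = log_f0[~np.isnan(log_f0)]
    overall_range = np.quantile(voiced, 0.95) - np.quantile(voiced, 0.05)

    # Split into regions separated by more than max_gap of undefined frames.
    gap_frames = int(round(max_gap / frame_shift))
    regions, current, gap = [], [], 0
    for v in log_f0:
        if np.isnan(v):
            gap += 1
            if gap > gap_frames and current:
                regions.append(np.array(current))
                current = []
        else:
            gap = 0
            current.append(v)
    if current:
        regions.append(np.array(current))

    region_medians = [np.median(r) for r in regions]
    region_ranges = [r.max() - r.min() for r in regions]

    # Final slope: linear regression over the last 25 defined frames.
    tail = voiced[-25:]
    slope = np.polyfit(np.arange(len(tail)) * frame_shift, tail, 1)[0]

    return {
        "range": overall_range,
        "median": np.median(voiced),
        "region_median_mean": np.mean(region_medians),
        "region_median_std": np.std(region_medians),
        "region_range_mean": np.mean(region_ranges),
        "region_range_std": np.std(region_ranges),
        "final_slope": slope,
    }
```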
Energy features were given by the range and the slope over the last 25 frames of the energy signal extracted using OpenSmile. As with the pitch, this signal was restricted to have undefined values over unvoiced regions and turned into a logarithmic scale before computing the features.
Formant features were extracted for the first two formants. The formant estimates were obtained over voiced frames using OpenSmile and divided into regions, as for the F0 signal. For each of the formants, we then calculated the mean and standard deviation of the ranges over the regions.

Note that we summarized the pitch contours using 7 functionals, but the energy and formant features using only subsets of 2 and 3 functionals, respectively. This selection arose from the characteristics described in Oviatt et al. (1998). We also tried summarizing the energy and formant features using the same 7 functionals applied to pitch, but these additional features did not improve classification performance.

3.2. Experimental Design

For the automatic prediction experiments in this section, we used only the sessions from the Trust-UBA database recorded at the university laboratory, since they had better sound quality and fewer network communication and system errors. As in the analysis in Section 2.3, we discarded all questions within a series that came after a VA's request to repeat a question due to ASR or dialog-system errors. Further, for each question in the series, we used only the first waveform from the user that was not a mistake requiring repetition due to the user's fault (e.g., stopping the recording before finishing the question). These waveforms had an average duration of 5 seconds.

To make the best use of the data, the experiments were done using a leave-one-speaker-out (LOSO) strategy, where a model is evaluated for each speaker using all the other speakers for training. The scores generated for all speakers using each corresponding model were pooled together to obtain one set of scores on the full dataset.

The experiments were performed using a subset of 19 speakers for which at least 12 questions were available in each of the two conditions, considering, for this count, only the questions (1) without transmission errors, (2) before the first system error in the series, and (3) that are not in the list of 6 questions that only appear in the L condition. We considered two different tasks:

• Question-level: The unit of classification is each question within each series. In this case, the goal is to detect the condition (H or L) of the series in which the question is found.

• Series-level: The units are the question series. The ground truth in this case is the condition of the series. As explained later, for this case, the question-level features are summarized into series-level features for classification.
The features described in Section 3.1 are likely to be highly affected by the speaker identity. This effect could potentially be more salient in the features than the condition we aim to detect. For this reason, we normalized the question-level features described in Section 3.1 over each subject's data, after filtering out questions with network communication errors or after the first system error in the series. For normalization we use the standard score, also called z-score (Kreyszig, 1979), where each feature is normalized by subtracting its mean and dividing by its standard deviation. Since the number of questions in each condition after filtering may be different, we used weighted statistics rather than standard ones, to avoid a per-subject bias resulting from the imbalance in the conditions. This bias would hurt performance since, for some subjects, the statistics would be computed with data mostly from condition H and, for others, with data mostly from condition L, making the resulting normalized features incomparable with each other.

We estimated the mean of feature j for a certain subject s as the weighted sample mean given by

\hat{\mu}_{j,s} = \sum_{i=1}^{N_s} w_{s,i} \, x_{j,s,i},    (1)

where N_s is the total number of questions available for both conditions for subject s, x_{j,s,i} is the value of feature j for question i from subject s, and the weight w_{s,i} is given by 1 / (2 N_{s,c_{s,i}}), where c_{s,i} is the condition (H or L) of question i for subject s and N_{s,c_{s,i}} is the number of questions for this condition available for subject s. Note that, with these definitions, N_s = N_{s,H} + N_{s,L} and \sum_{i=1}^{N_s} w_{s,i} = 1. The resulting mean is equivalent to computing the average of the by-condition means for the subject. The corresponding standard deviation was estimated using the same weights, as

\hat{\sigma}_{j,s} = \sqrt{ \sum_{i=1}^{N_s} w_{s,i} \, (x_{j,s,i} - \hat{\mu}_{j,s})^2 }.    (2)

Finally, we normalized each feature by computing z-scores with the weighted statistics. Hence, the normalized features \tilde{x} are given by

\tilde{x}_{j,s,i} = \frac{x_{j,s,i} - \hat{\mu}_{j,s}}{\hat{\sigma}_{j,s}}.    (3)

Besides the user, another factor that could affect the features is the type of question. Some questions were simple to formulate (e.g., "Define the word bank"), some were more complex or longer (e.g., "In what year did Hungary join the European Union?") and some were composed of two clear parts (e.g., "Which are the three largest countries in the world, from high to low?"), which most speakers divided up using a pause. For this reason, we explored the option of further normalizing the features by question. That is, after features are normalized by speaker, we computed the statistics per question over those features using only the training data. Those statistics were then used to normalize both the training and the test samples. This was done separately for each model being trained and the corresponding test samples for that model.

The question statistics were also computed using weights to ensure that they were not biased by the imbalance between conditions for that question. For the six questions that appeared only in the low condition, weighted statistics could not be computed, since one of the conditions has no samples. This was not a problem since, as we will see, these questions were eliminated by the balancing process described below.

As shown in Section 3.3, this normalization procedure led to significant gains in performance with respect to using raw features.
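A minimal sketch of the per-subject weighted z-score normalization of Eqs. (1)-(3), assuming a DataFrame with one row per question and hypothetical 'subject', 'condition' and feature columns:

```python
import numpy as np
import pandas as pd

def normalize_by_subject(df, feature_cols):
    df = df.copy()
    for subject, rows in df.groupby("subject"):
        # w_{s,i} = 1 / (2 * N_{s,c}) so both conditions weigh equally;
        # the weights over each subject's questions sum to 1.
        n_per_cond = rows["condition"].value_counts()
        w = 1.0 / (2 * rows["condition"].map(n_per_cond))
        for col in feature_cols:
            x = rows[col]
            mu = np.sum(w * x)                          # Eq. (1)
            sigma = np.sqrt(np.sum(w * (x - mu) ** 2))  # Eq. (2)
            df.loc[rows.index, col] = (x - mu) / sigma  # Eq. (3)
    return df
```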
Since each of the 36 questions used for the experiments did not appear the same number of times in the L and H conditions, we undersampled each of the questions across the training data for each model to obtain exactly the same number of L and H cases. At this stage, the questions that appeared only in the L condition were discarded. This balancing of condition per question avoids a possibly optimistic result where the model could be using the features to identify the questions, since the identity of the question would contain information about the condition. The undersampling is done randomly using 10 different seeds. For each seed, we ran LOSO and obtained a full set of scores on the test data. The scores obtained using different seeds were averaged, obtaining a single score per sample in the test data.

Note that the balancing was only done during training, not during testing, since we only wanted to prevent the model from learning about the imbalance. During testing, all available questions (except the 6 that never appear in the H condition, since we do not have normalization statistics for them) were used to compute the summarized features, as described below.
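A rough sketch of the per-question condition balancing applied to the training data (same hypothetical DataFrame layout as above; not the authors' code):

```python
import pandas as pd

def balance_per_question(train_df, seed):
    balanced = []
    for _, group in train_df.groupby("question_id"):
        n_h = (group["condition"] == "H").sum()
        n_l = (group["condition"] == "L").sum()
        n = min(n_h, n_l)
        if n == 0:
            continue  # drops questions that appear in only one condition
        for cond in ("H", "L"):
            subset = group[group["condition"] == cond]
            balanced.append(subset.sample(n=n, random_state=seed))
    return pd.concat(balanced)

# In the paper, this is repeated with 10 different seeds and the resulting
# test scores are averaged across seeds.
```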
Classification was done using random forests (Hastie et al., 2001), which are ensemble models designed to reduce the high variance of the estimations made by decision trees. Decision trees are very flexible models but, for this same reason, they are also very sensitive to small changes in the training set. Random forests, which are ensembles of such trees, reduce the variance and have been successfully used for many different classification problems (Hastie et al., 2001). We also tried using support vector machines, which gave similar performance in terms of accuracy but have the disadvantage of not providing a probabilistic measure at the output. Further, given the small amount of available data and the lack of held-out data, we did not want to try several different types of models or attempt to tune the different parameters in these models, since this could easily result in overly optimistic results.

We trained random forests consisting of 500 trees with a maximum depth of 20 using scikit-learn (Pedregosa et al., 2011). At each split, √N_F features were randomly considered as candidates, where N_F is the total number of input features. Splits were selected by minimizing the Gini impurity. These parameters were chosen based on our previous experience with random forests; they were not tuned to this dataset, for the reasons mentioned above.

For the question-level experiments, the features input to the random forests were the features described in Section 3.1, either raw, normalized by user, or normalized by user and question. For the series-level experiments, we computed new features that summarize the distribution of each feature over the series. That is, given the questions within a series, we computed the 25, 50 and 75% quantiles of each feature. This resulted in 48 summary features per series (16 × 3), which were used as input to the random forest classifiers.
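A schematic of the series-level pipeline, with quantile summary features fed to a random forest evaluated with leave-one-speaker-out cross-validation. The hyperparameters follow the paper (500 trees, depth 20, √N_F candidate features, Gini impurity); the data loader and column layout are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

def series_features(question_feats):
    """Summarize an (n_questions x 16) array into 48 series-level features."""
    return np.concatenate([np.quantile(question_feats, q, axis=0)
                           for q in (0.25, 0.50, 0.75)])

# X: one row of 48 summary features per series; y: condition (0=L, 1=H);
# groups: speaker identity, so each fold holds out one speaker entirely.
X, y, groups = load_series_data()  # hypothetical loader

clf = RandomForestClassifier(n_estimators=500, max_depth=20,
                             max_features="sqrt", criterion="gini")
posteriors = cross_val_predict(clf, X, y, groups=groups,
                               cv=LeaveOneGroupOut(), method="predict_proba")
accuracy = np.mean((posteriors[:, 1] > 0.5) == y)
```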
We report results in terms of cross-entropy and accuracy. The cross-entropy is given by −(1/N) ∑_i log(p_i), where N is the number of test samples and p_i is the posterior given by the system to the true class for sample i. We normalized the cross-entropy by the value it would have for a system that always outputs a 0.5 posterior, log(2). The accuracy is obtained over hard decisions made by thresholding the posterior returned by the random forests with a threshold of 0.5.

We choose to report cross-entropy as well as accuracy because, while the accuracy is a standard and intuitive classification metric, it is quite noisy on small datasets like ours. A small change in the system output can significantly change the value of the metric. On the other hand, the cross-entropy is much more robust because it does not rely on hard decisions. For example, let us consider a comparison between two systems where the output changes for a single positive sample from 0.49 to 0.51. With that small change in output, this sample goes from being misclassified to being correctly classified. In our case, where we have 38 samples in total, the accuracy would change by 0.026, a relative improvement of 3% if the accuracy of the first system is 0.76, as in our best system, which might tempt us to think that we have improved our system. Yet, this is probably not a change that would generalize to other data, given that the output changed only for one sample and only by a very small amount. Note that this would also be true for any metric that relies on hard decisions, like recall, precision or F1. On the other hand, for this same example, the cross-entropy would change by 0.00152, a relative improvement of 0.2% for an initial cross-entropy of 0.83, as in our best system. This would tell us that the change between the two systems is, indeed, insignificant. Further, if the output for that same positive sample had changed from 0.499 to 0.8 (the new system is now very sure the sample is positive), the accuracy would still change by 3%, but the cross-entropy would change by 2.3% instead of 0.2%, letting us distinguish these two cases. For these reasons, we choose to show cross-entropy alongside the more intuitive and standard accuracy values.

When evaluating a model, the value of the performance metric may depend on the test set size and composition. In order to assess the uncertainty of our results, we obtained confidence intervals using bootstrapping (Efron and Tibshirani, 1993). In this method, multiple test sets are generated by sampling instances from the original test set with replacement. The performance metric is then calculated for each of the resulting test sets. These values can be used to derive a confidence interval measuring the variability of the metric of interest due to the test set. In this paper we use a version of the bootstrap method that considers the dependency across scores from the same speaker. As explained in (Poh and Bengio, 2007), when the scores under evaluation come from a set of speakers with several instances each, the sampling should be done by speaker rather than by instance, so that the dependency across instances from the same speaker is taken into account. If this dependency were not accounted for, it could lead to an underestimation of the confidence interval. When sampling by speaker, for each bootstrap sample, some speakers might be missing and others might be repeated several times. The scores for the units (questions or series) for each speaker are discarded or repeated accordingly, and the resulting set of scores is used to compute a new estimate of the metric (cross-entropy, in our case). The confidence intervals shown in Figure 5 are given by the 2.5 and 97.5 percentiles of the distribution of cross-entropy values calculated with this method, using 1000 estimates of the cross-entropy obtained by sampling the speakers with different random seeds.

Figure 5: Normalized cross-entropy obtained at question and series levels for different normalization strategies. The height of the bars indicates the normalized cross-entropy, while the error bars are the confidence intervals obtained by bootstrapping. The numeric value at the bottom of each bar is the corresponding accuracy.
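A minimal sketch of the by-speaker bootstrap described above, computing a confidence interval for the normalized cross-entropy (array and variable names are hypothetical):

```python
import numpy as np

def normalized_xent(posteriors_true_class):
    # Cross-entropy normalized by that of a constant-0.5 system, log(2).
    return -np.mean(np.log(posteriors_true_class)) / np.log(2)

def bootstrap_ci(scores_by_speaker, n_boot=1000, seed=0):
    """scores_by_speaker: dict speaker -> array of true-class posteriors."""
    rng = np.random.default_rng(seed)
    speakers = list(scores_by_speaker)
    estimates = []
    for _ in range(n_boot):
        # Resample speakers (not individual scores) with replacement,
        # keeping all units from each sampled speaker.
        sample = rng.choice(speakers, size=len(speakers), replace=True)
        pooled = np.concatenate([scores_by_speaker[s] for s in sample])
        estimates.append(normalized_xent(pooled))
    return np.percentile(estimates, [2.5, 97.5])
```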
3.3. Results

In this section, we report results for the models trained using features obtained at question and series levels, applying the different normalization strategies discussed in Section 3.2.1. Figure 5 shows the cross-entropy and accuracy for the different tasks and normalization methods. We can see that, for both tasks, normalizing by user is better than not normalizing, and normalizing by user and question leads to the highest accuracy and lowest cross-entropy. These results indicate that, as suspected, the features are affected by the user and the question, and removing these effects helps the model predict the type of VA more effectively.

For question-level classification, Figure 5 shows that the performance is close to random, even after normalization is done (which, in effect, means that the features have information about the whole session). This indicates that just a few seconds of speech may not be enough to perform this complex task. On the other hand, the performance at series level is better than random, indicating that the model is able to extract from the speech features useful information for the task of detecting the condition of the series. For the user- and question-normalized case, the accuracy of the series-level system was 76% and the cross-entropy was 0.828, with a confidence interval that does not include the 1.0 value corresponding to a random system.

Speech features might be affected not just by the condition of the series but also by whether the person is getting tired or bored with the task. Hence, we analyzed whether the series-level predictions were affected by the order in which the two conditions occurred within the session (H followed by L, or the other way around). The scores given by the model to the 12 subjects who did the L series first had a mean of 0.574 (stdev 0.136), while the scores of the 7 subjects who did the H series first had a mean of 0.588 (stdev 0.104). The difference between these two distributions was not significant (Welch's t-test, t(17) = −…, p ≈ …).
Figure 6: Box-and-whisker plots of the syllable rate with pauses and the pitch median, normalized by user and question, for the 5 users with the highest shift of the median between the L and H conditions.

Finally, we inspected the distributions of two of the features: the syllable rate with pauses and the pitch median. We sorted subjects by their absolute difference in median across conditions for each of these features, and plotted the distributions for the five subjects that exhibit the highest differences (Figure 6). It can be seen that subjects that showed differences in syllable rate and pitch median tended to speak faster and with higher pitch in the H condition (with a single exception for the case of pitch). We also analyzed the distributions for final pitch slope, but did not find any clear patterns. We hypothesize that this feature may be used by the decision trees to condition the use of other features.
The experiments described above were performed on the 19 in-lab subjects that had at least 12 questions before the first system error (see Section 3.2). We will call this set of subjects 12Q-in-lab. As we saw in the previous section, the system performance on those subjects is significantly better than random. Encouraged by these results, we decided to test the system on the in-lab speakers with at least 6 but fewer than 12 questions before the first system error, and on the remote speakers with at least 6 questions before the first system error. We will call these sets 6Q-in-lab and 6Q-remote. They include 12 and 11 subjects, respectively. Results showed that the model trained on the 12Q-in-lab set did not provide good performance on these two other sets, achieving a cross-entropy of 1.033 for the 6Q-in-lab set and of 0.99 for the 6Q-remote set. These cross-entropy values around 1.0 indicate that the systems are not better than a random system that outputs a posterior of 0.5 for all samples. Further, retraining the models using LOSO on those new sets also resulted in poor performance. In this section we provide some hypotheses that might explain these results.

The poor results on the 6Q-remote set were expected, since the remote sessions had a much less controlled environment than those recorded in our laboratory: subjects used different microphones, and had various background noises and distortion. Further, remote subjects were likely to be more distracted and less committed to the task. One symptom of this was the larger and more variable duration of remote sessions (see Section 2.2) compared with in-lab sessions, indicating that subjects took breaks (in one case, a break of a whole night in between series). Finally, as we mentioned in Section 2.2, the age range of the remote subjects was much wider than for in-lab subjects. In summary, the remote sessions were much more heterogeneous, making the task harder than for the in-lab sessions. We hypothesize that, to obtain a robust model under these more challenging conditions, we would need to train it on a much larger number of subjects.

We also expected results on the remaining in-lab subjects to be poorer than on the 12Q-in-lab set selected for our experiments, given that the amount of questions available for each sample (i.e., for each series of questions) before the first system error was smaller. To determine whether this was enough to explain the poor performance, we tested the models for the 12Q-in-lab subjects on a downsampled version of each of the test samples, where we kept only the first 6 questions in each series. The cross-entropy for this experiment was 0.885, a degradation compared to the 0.828 value we get when using all available test questions (Section 3.3), but still better than the random performance we get on the 6Q-in-lab set. This suggests that the smaller number of questions per series is probably not the only reason why our model does not generalize to this set of subjects.

Another notable difference between the two sets is that the 12Q-in-lab subjects reported a higher difference in the average survey scores for the H and L series in the session (M = 0.…, SD = 0.… vs. M = 0.…, SD = 0.…; t(29) = 3.…, p = 0.…; r(17) = 0.…, p = 0.…; M = 0.…, SD = 0.… vs. M = 0.…, SD = 0.…; t(30) = 1.…, p = 0.…).
4. Conclusions
We have presented results on the prediction of a proxy for the trust that a user has in a virtual assistant (VA) during a dialog, based on the user's speech. The proxy is given by the reliability of the VA (high or low), which correlates with the user's trust in the VA. Our task is, given two series of speech waveforms from the user recorded under both conditions (high or low reliability), to predict the condition of each series. We show that a system can learn to perform this task with an accuracy of up to 76%, where a random baseline would have 50% accuracy. To solve the task we designed features related to those previously found to be useful for detecting speech directed to "at-risk" listeners such as infants, non-native speakers, or people with hearing loss.

We would like to emphasize that the experimental design used in this paper does not correspond to a realistic use case, since it assumes that data from both conditions are available for each user during training and testing. This setup was selected for its simplicity, as a first approach for assessing whether this task could be solved automatically. The results should only be interpreted as a preliminary analysis suggesting that the proposed features do indeed contain useful information about the task. In contrast, a set of 10 expert human listeners reached very low agreement when solving this same task, highlighting the inherent difficulty of the problem.

The results mentioned above were obtained on a subset of speakers selected for having few errors by the VA. When evaluating on the rest of the speakers, we found that the system did not perform better than random. We believe this happens due to a combination of less controlled environments, in the case of the sessions recorded through the internet, smaller amounts of data per sample before the first error by the VA, and subjects that were less committed to the task, reporting smaller differences in scores across conditions. Further data collection is needed to confirm the findings in this paper in a less controlled setting, with larger numbers of subjects and more speech per subject. We plan to start the work of collecting a new dataset in the near future, using a novel protocol designed to engage the subjects using a game scenario.

Finally, several interesting questions arise from these experiments. Why are some speakers more affected by the initial score than others? Could this behavior be predicted from their personality traits or from their prior familiarity with VAs? Are some phrases more prone to contain useful information about the condition than others? We will address these and other questions in future research.
5. Acknowledgements
This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-18-1-0026. We also gratefully acknowledge the support of NVIDIA Corporation for the donation of a Titan Xp GPU.
References
Bates, D., Mächler, M., Bolker, B., Walker, S., 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67, 1–48. doi:10.18637/jss.v067.i01.

Coleman, J.S., 1990. Foundations of Social Theory. Harvard University Press.

Drnec, K., Marathe, A.R., Lukos, J.R., Metcalfe, J.S., 2016. From trust in automation to decision neuroscience: applying cognitive neuroscience methods to understand and improve interaction decisions involved in human automation interaction. Frontiers in Human Neuroscience 10, 290.

Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. 1 ed., Chapman & Hall/CRC Monographs on Statistics and Applied Probability.

Elkins, A.C., Derrick, D.C., 2013. The sound of trust: Voice as a measurement of trust during interactions with embodied conversational agents. Group Decision and Negotiation 22, 897–913.

Fleiss, J.L., 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76, 378–382.

Gauder, L., Gravano, A., Ferrer, L., Riera, P., Brussino, S., 2019. A protocol for collecting speech data with varying degrees of trust, in: Proc. SMM19, Workshop on Speech, Music and Mind 2019, pp. 6–10.

Good, P., 2013. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer Science & Business Media.

Gulati, R., 1995. Does familiarity breed trust? The implications of repeated ties for contractual choice in alliances. Academy of Management Journal 38, 85–112.

Hardin, R., 2006. Trust. Polity Press, Malden, MA.

Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer Series in Statistics, Springer New York Inc., New York, NY, USA.

Hazan, V., Uther, M., Granlund, S., 2015. How does foreigner-directed speech differ from other forms of listener-directed clear speaking styles?, in: Proceedings of the 18th International Congress of Phonetic Sciences.

Hernández-Figueroa, Z., Carreras-Riudavets, F.J., Rodríguez-Rodríguez, G., 2013. Automatic syllabification for Spanish using lemmatization and derivation to solve the prefix's prominence issue. Expert Systems with Applications 40, 7122–7131.

Kiseleva, J., de Rijke, M., 2017. Evaluating personal assistants on mobile devices, in: Proceedings of the 1st International Workshop on Conversational Approaches to Information Retrieval (CAIR'17), Tokyo, Japan.

Kiseleva, J., Williams, K., Hassan Awadallah, A., Crook, A.C., Zitouni, I., Anastasakos, T., 2016. Predicting user satisfaction with intelligent assistants, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM. pp. 45–54.

Korsgaard, M.A., Brower, H.H., Lester, S.W., 2014. It isn't always mutual: A critical review of dyadic trust. Journal of Management, 0149206314547521.

Kreyszig, E., 1979. Advanced Engineering Mathematics. 4 ed., Wiley.

Kuznetsova, A., Brockhoff, P.B., Christensen, R.H.B., 2017. lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82, 1–26. doi:10.18637/jss.v082.i13.