Impact of Explanation on Trust of a Novel Mobile Robot
Stephanie Rosenthal and Elizabeth J. Carter
Carnegie Mellon University, Pittsburgh, PA 15213
{rosenthal,lizcarter}@cmu.edu

Abstract
One challenge with introducing robots into novel environments is misalignment between supervisor expectations and reality, which can greatly affect a user's trust and continued use of the robot. We performed an experiment to test whether the presence of an explanation of expected robot behavior affected a supervisor's trust in an autonomous robot. We measured trust both subjectively through surveys and objectively through a dual-task experiment design to capture supervisors' neglect tolerance (i.e., their willingness to perform their own task while the robot is acting autonomously). Our objective results show that explanations can help counteract the novelty effect of seeing a new robot perform in an unknown environment. Participants who received an explanation of the robot's behavior were more likely to focus on their own task at the risk of neglecting their robot supervision task during the first trials of the robot's behavior compared to those who did not receive an explanation. However, this effect diminished after seeing multiple trials, and participants who received explanations were equally trusting of the robot's behavior as those who did not receive explanations. Interestingly, participants were not able to identify their own changes in trust through their survey responses, demonstrating that the dual-task design measured subtler changes in a supervisor's trust.
Introduction and Related Work
As we introduce robots that perform tasks into our environments, the people who live and work around the robots will be expected to maintain their own productivity while largely ignoring the robots as they move around and complete jobs. While this pattern of behavior around robots can be expected to develop over time, the introduction of a new robot is frequently disruptive to people in its environment in several ways. First, people are uncertain of a robot's autonomous behaviors when it is first introduced. People for whom a robot is novel are typically observed testing the robot's abilities (e.g., Gockley et al. 2005; Bohus, Saw, and Horvitz 2014) and monitoring robot behavior in the environment rather than executing their own tasks (e.g., Burgard et al. 1998; Thrun et al. 1999; Kanda et al. 2010; Rosenthal and Veloso 2012). Additionally, even people who understand basic behaviors of robots often must intervene to help robots overcome failures or errors in their autonomy (De Visser et al. 2006). Failures impact both human productivity and their trust in the robot's behavior (Desai et al. 2012).

One proposed technique to create appropriate user expectations (Tolmeijer et al. 2020) and overcome the challenges of human uncertainty and mistrust for different types of intelligent systems is to provide feedback and explanations to users (e.g., Lim, Dey, and Avrahami 2009; Ribeiro, Singh, and Guestrin 2016; Desai et al. 2013; Abdul et al. 2018). Bussone, Stumpf, and O'Sullivan (2015) found that explanations of machine learning predictions align users' mental models such that they increase their trust and reliance on the predictions. Recent work has extended the idea of explainability to robot decision processes to help people understand, for example, why a robot performs an action based on its policy (Hayes and Shah 2017) or its reward function (Sukkerd, Simmons, and Garlan 2018), or to summarize the robot's recent actions for different people and/or purposes (Rosenthal, Selvaraj, and Veloso 2016). While explanation algorithms have been successfully compared to human explanations, little has been done to understand how explanations impact trust of autonomous mobile robots.

Most commonly, researchers use subjective surveys to measure trust on binary (Hall 1996), ordinal (Muir 1989a), and continuous (Lee and Moray 1992) scales. These scales can be measured one time or many times to build up metrics such as the Area Under the Trust Curve (AUTC) (Desai et al. 2012) and Trust of Entirety (Yang et al. 2017). Objective measures of trust have also been proposed, including neglect tolerance, which we use in this work. The neglect time of a robot is the mean amount of time that the robot can function with task performance above a certain threshold without human intervention, and neglect tolerance is the overall measure of robot autonomy when neglected (Goodrich and Olsen Jr 2003). Neglect tolerance in particular is an important objective measure because autonomous operation is a primary goal of robotics research and development. For a user, it is a key contributor to the amount of trust that can be placed in a robot.
The more a user can neglect instead of attend to a robot, the more they can focus on other tasks.

Towards the goal of measuring the effects of explanations on both subjective and objective trust, we designed a dual-task experiment in which participants were asked to allocate their attention between a robot's behavior and a video game. Participants were asked to play a basic video game while monitoring a robot as it navigated a path across a large grid map. They were directed to note if the robot entered certain squares on the grid while also playing the game to their maximum ability. Some participants also received an explanation of how the robot would move around the map and which squares it would try to avoid, and some did not. We used video game performance as a proxy for measuring neglect tolerance, trust, and reliance. Slower and less successful gameplay indicated that more attention was diverted to the robot, which in turn implied less trust in and reliance on the autonomy. Additionally, we used questionnaires to examine subjective ratings of trust in the robot across navigation conditions. We hypothesized that receiving an explanation of the robot's behavior would increase trust as measured by the time allocated to gameplay versus supervising the robot as well as by subjective trust ratings.

Our results show that our dual-task experiment was able to measure differences in trust. Key press counts were lower and key press rates slower when the robot made errors by entering target squares compared to when it did not, indicating that participants spent more time monitoring the robot during periods when an error occurred. Participants' time to report an error did not drop, indicating that participants traded off gameplay performance (key presses) in order to monitor the robot rather than miss an opportunity to report an error. In contrast, our surveys did not show any differences in trust when the robot made errors compared to when it did not. This result indicates that our dual-task experiment measured subtle changes in trust that the survey results could not identify.

The presence of an explanation additionally affected participant trust behaviors in the first two of three trials. The results show that participants who had received an explanation of the robot's behavior had higher key press counts and a lower number of game losses during times when the robot made errors in the first two trials compared to the last trial. This result indicates that participants initially were able to focus on the game more because the explanation gave them information about the robot's behavior. However, by the third trial, all participants understood the robot's behavior and the effect of explanation was no longer significant. We conclude that an explanation can help counteract the novelty effect of an unfamiliar robot by improving user trust.

Experiment Method
In order to measure the effect of explanations on trust, we performed a dual-task experiment in which we asked participants to simultaneously play an online game while they monitored each of three robots as they navigated their large grid maps. Half of our participants also received a brief explanation, before the robots executed their tasks, of which squares on the map each robot was programmed to avoid, while half were given no explanation about the robots' programming. For neglect tolerance, we measured the time that participants spent playing our game as well as the time it took them to report that the robot was making an error (i.e., entering a target type of square) during its execution. Between each robot execution, participants also completed a questionnaire about aspects of trust they had in that particular robot during that trial. We evaluated the differences in each of our dependent measures to understand how the tasks affected participant trust throughout the study.
Autonomous Robot Navigation Setup
We used a Cozmo robot from Anki for this experiment. Cozmo is a small robot that can move across flat surfaces on two treads, and it has an associated software development kit (SDK) that allows for programming and executing a large range of behaviors, including navigation.

We programmed Cozmo to navigate Adventure Maps from the Cubetto playset available from Primo Toys. These maps are large, colorful grids with six rows and columns of icons and measure approximately 1 by 1 meter each. Each square in the grid fell into one of two categories: background patterns or icons. For each category, there were multiple examples: a background on an Egypt-themed map could be water or sand, and an icon on a space-themed map included a rocketship. We used the maps from the Big City (referred to as the Street map), Deep Space (Space map), and Ancient Egypt (Egypt map) Adventure Packs mounted on foam core poster board for stability (Figure 1).

Three paths were chosen for each map: one that did not enter a particular type of square on the map (no error), one that entered that square type at the beginning of the path (early error), and one that entered that square type at the end of the path (late error). The Egypt map had ten water squares that were defined as errors. In the Space map, three comet squares were identified as errors. The Street map contained five error squares that looked like streets. The paths and indicated errors are shown in Figure 1, and these are referred to as the Error Finding conditions.

Cozmo navigated the paths in an open loop, as it was not actively sensing its location on the maps. Cozmo's paths were found to be very consistent in terms of the robot staying in the required squares throughout the study. The experimenter could select a map and a specific path at the beginning of each trial. All of the robot's motions were logged and timestamped in a file labeled by the participant number and their error condition order.

We told participants that there were three distinct robots indicated by different colored tapes on their backs in order to reduce potential confusion about whether the robots were using the same algorithms. However, only one actual robot was used, for consistency in navigation.
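The paper does not include the robot's control code; the following is a minimal sketch of how such open-loop grid navigation might be scripted with the Cozmo Python SDK. The path, square size, and speed here are illustrative assumptions, not the study's actual parameters.

```python
import cozmo
from cozmo.util import degrees, distance_mm, speed_mmps

# Illustrative value: the Cubetto maps are roughly 1 m across with a 6x6
# grid, so each square is on the order of 150 mm. The study's actual
# distances and speeds are not reported; these are assumptions.
SQUARE_MM = 150

def drive_path(robot: cozmo.robot.Robot):
    # A hypothetical open-loop path as (turn in degrees, squares forward).
    # The robot does not sense its position on the map; it simply replays
    # motions, so repeatability depends on consistent starting placement.
    path = [(0, 2), (90, 1), (-90, 3)]
    for turn_deg, squares in path:
        if turn_deg:
            robot.turn_in_place(degrees(turn_deg)).wait_for_completed()
        robot.drive_straight(distance_mm(squares * SQUARE_MM),
                             speed_mmps(50)).wait_for_completed()

cozmo.run_program(drive_path)
```

Because the robot replays motions without localizing itself, careful initial placement in the starting square is what keeps the paths consistent across trials, as described above.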
Participant Tasks
Participants were asked to simultaneously supervise the robots as they navigated the maps and maximize their scores in an online game of Snake.
Figure 1: (a) Egypt map, (b) Space map, (c) Street map. Participants were each asked to monitor three robots executing tasks (one per map). They were each randomly assigned to a condition order in which they saw a path with no error (indicated with green circles), early error (yellow circles), or late error (red circles), for 6 total possible condition combinations.
Supervisory Task
Participants were asked to indicate by a button press when the robot entered the indicated type of error square (i.e., whether/when it entered water on the Egypt map, a comet square on the Space map, or a street square on the Street map). This task required them to maintain some knowledge about where the robot was located on the map and where the potential error squares were located, typically by occasionally watching the robot's behavior.
Snake Game Task
In order to simulate a real-world scenario in which the human supervisor of a robot would need to perform other tasks at the same time (including, perhaps, supervising multiple robots or performing their own task), we created another responsibility for our participants. While the robot was navigating its path, participants were provided with a laptop on which to play a web-based, single-player game of Snake. The goal of Snake is to direct a moving chain of yellow squares (the snake) around the screen using the arrow keys and collect as many additional red food squares as possible by aiming the snake directly at them and bumping them with the head of the snake (Figure 2(b)). We asked participants to maximize their score in the game by collecting as many food pieces as possible without hitting one of the outer walls (in this case, red squares positioned along the edges of the gameplay window) or accidentally hitting the snake body with the head (which becomes more difficult as the snake becomes longer). In these cases, the snake dies and participants start over. Participants were not able to pause the game, so they had to make tradeoffs in their gameplay in order to successfully monitor the robot.

By hosting the Snake game on a website, we were able to collect data about every button press made, the score at any time, the duration of each game, and whether participants had to restart the game due to the snake hitting obstacles or itself. These data were collected on every event and measured to the millisecond. We used these logs to measure differences in the rate and count of key presses and the number of obstacles hit (game deaths) across trials. The degree to which participants were attending to the game versus visually inspecting the robot's progress and monitoring its errors should be apparent in gameplay slowdown and/or increases in obstacles hit when participants are not watching the snake's motion.
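The paper does not describe the game's server implementation. As a rough illustration of the millisecond-resolution event logging described above, here is a minimal sketch assuming a small Python web endpoint (Flask is our assumption; any server would do), with hypothetical field names.

```python
import json
import time

from flask import Flask, request

app = Flask(__name__)
LOG = open("snake_events.log", "a")

@app.route("/event", methods=["POST"])
def log_event():
    # Hypothetical event payload: participant id, event kind
    # ("key", "death", "food"), and the current score.
    record = request.get_json()
    # Stamp every event server-side in milliseconds.
    record["server_ms"] = int(time.time() * 1000)
    LOG.write(json.dumps(record) + "\n")
    LOG.flush()
    return "", 204

if __name__ == "__main__":
    app.run()
```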
Explanation Condition
The key between-subjects variable for this experiment was the explanation provided to the participants about Cozmo's navigation behavior. There are many possible explanations we could have provided, including summaries of the path the robot would take and the policy in each grid square. However, we chose a short explanation that followed a similar pattern found in prior work (Li et al. 2017) in which preferences for particular squares were noted. This brief explanation format was developed to be easy to understand and recall while not inducing the attribution of goals and mental states to the robot. In this experiment, only a single square type was avoided, so it was simple and concise to provide participants with this information.

Half of the participants (No Explanation condition) were only told the map description (Egypt, Space, Street) and to press the button if Cozmo entered one of the error squares (water, comets, or street). For example:

"This Cozmo navigates the space map. Hit the button if Cozmo hits the comets."

For the other half of the participants (Explanation condition), an additional explanation was provided of why the participants were being directed to hit the button if the Cozmo entered an error square: to report the mistake.

"This Cozmo navigates the space map and is programmed to avoid the comets. Hit the button if Cozmo hits the comets anyway."
Study Design
Experimental Setup
The experiment took place in a small conference room with an oblong table about 1.3 by 3.5 meters in size. On one half of the table were two places for people to sit facing each other, one for the experimenter and the other for the participant. A laptop was positioned at each spot, and a USB-linked button was positioned to the left of the participant laptop and connected to the experimenter laptop. The other half of the table was used for the three maps, each of which had been affixed to a piece of foam core in order to ensure that it would stay flat enough for the robot to traverse. Before each trial, the experimenter placed the appropriate map to the left of the participant and positioned the robot in the correct square. The setup is shown in Figure 2(a).
Conditions
All of the participants saw each of the three different path conditions (No Error, Early Error, Late Error), one on each of the three maps, for the within-subjects variable Error Finding. They saw the Egypt map first, followed by the Space map and the Street map. Map order was held constant because of technological constraints. The order of the three Error Finding conditions (No, Early, or Late Error) was randomized for each participant (six total combinations). Alternating participants were assigned to one of the two Explanation conditions: Explanation or No Explanation.
Participants
Participants were recruited using a community research recruitment website run by the university. In order to take part in this research, participants had to confirm that they were 18 years of age or older and had normal or corrected-to-normal hearing and vision. Sixty individuals successfully completed the experiment (29/30/1 female/male/nonbinary; age range 19-61 years, M age = 28.65, SD age = 10.39), including five in each of the twelve combinations of conditions (6 Error Finding x 2 Explanation). They provided informed consent and received compensation for their time. This research was approved by our Institutional Review Board.

Procedure
Upon arrival at the lab, each participant provided informed consent and was given the opportunity to ask the experimenter questions. They then completed a questionnaire about demographics (including age, gender, languages spoken, country of origin, field of study, and familiarity with robots, computers, and pets) and the Ten-Item Personality Inventory (Gosling, Rentfrow, and Swann Jr 2003).

The participant was then told that the goal of the experiment was to assess people's ability to simultaneously monitor the robot while completing their own task. The experimenter introduced the Snake game and the participant was given the opportunity to practice playing Snake on the laptop for up to five minutes (as long as it took for them to feel comfortable) in the absence of any other task. Next, the experimenter instructed the participant that there would be three scenarios in which the participant would play Snake as much and as well as possible while also monitoring the robot as it completed its map navigation task. The participant was told to press the yellow button to the left of the laptop when the robot entered the indicated squares and that the button would make the computer beep to record the feedback, but the robot would continue entering the square. The participant was asked to press the button for familiarization and to ensure firm presses.

The experimenter set up the Cozmo robot and the first map. She told the participant that Cozmo would be navigating the map and to press the button if it ventured into the relevant squares. The participants in the Explanation condition were told specifically that the Cozmo had been programmed to avoid these squares and to press the button if it entered them anyway. Participants in the No Explanation condition were told to hit the button if Cozmo entered specific squares. For each participant, the experimenter selected a random order of Error Finding conditions and the robot was prepared to complete the first condition. The experimenter and participant verbally coordinated so that the Snake game and the robot navigation began simultaneously. After approximately one minute (range 57-67 seconds), the Cozmo completed its journey and the experimenter instructed the participant to end the Snake game (i.e., let the snake crash into the wall). The participant then completed a survey about their trust in the robot and their ability to complete the two simultaneous tasks. The same procedure was then repeated for the second and third maps. For each map, the robot had a piece of colored tape covering its back in order to enable the conceit that three different robots were being used. This tape was switched out of sight of the participants, so it appeared as though the experimenter had brought a different robot to the table. We provided this visual differentiation to attenuate the effects of participants developing mental models of the robot across maps.
Measures
We used participant performance on the Snake game and their ability to detect robot navigation errors as objective measures. Subjective measures included questionnaire responses from the participant after each trial.
Snake Game Task Objective Measures
To analyze performance on the Snake task, we created windows that extended 10 seconds before and after the time at which the robot was programmed to commit an early or late error for each map. We created three variables: key count, the number of times a participant pressed a key to control the game during the 20-second window; key rate, the average time between each key press, measured in milliseconds; and death count, the number of times the participant died in the game during the window. We were thus able to compare behavior across the two 20-second windows for each map and determine the degree to which game performance was affected by the occurrence of an error in one specific window (errors only occurred in one of the two windows per map). We used these data as proxy measures for participant attention to the game at any given time and examined how these numbers corresponded to the status of the robot and the experiment condition.
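As a concrete illustration of these three measures, the following is a minimal sketch of the window computation. It assumes a hypothetical event log of (timestamp_ms, kind) pairs; the paper does not describe its actual log format or analysis code.

```python
from statistics import mean

def window_measures(events, error_time_ms, half_window_ms=10_000):
    """Compute key count, key rate, and death count in the 20-second
    window centered on the time the robot was programmed to err.

    events: iterable of (timestamp_ms, kind) where kind is "key" for an
    arrow-key press or "death" for a game loss (hypothetical format)."""
    lo, hi = error_time_ms - half_window_ms, error_time_ms + half_window_ms
    keys = sorted(t for t, kind in events if kind == "key" and lo <= t <= hi)
    deaths = sum(1 for t, kind in events if kind == "death" and lo <= t <= hi)
    # Key rate is the mean inter-press interval in milliseconds; larger
    # values mean slower play, i.e., more attention diverted to the robot.
    key_rate = mean(b - a for a, b in zip(keys, keys[1:])) if len(keys) > 1 else None
    return {"key_count": len(keys), "key_rate_ms": key_rate,
            "death_count": deaths}
```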
Robot Monitoring Task Objective Measures
Using the Cozmo log files, we calculated the latency between Cozmo entering an error square and the participant's button press to notify us of the error. These response times were compared across conditions to determine how the timing of an error and the task explanation affected participant performance on the Error Finding task. We also noted if the participant neglected to report any errors that did occur.

Figure 2: (a) The experimental setup shows the robot on the Egypt map, the participant computer for the online Snake game, and the experimenter's computer logging the robot's behavior and the button presses from the yellow button. (b) Participants were asked to play the Snake game by pressing the arrow keys to move the snake head (indicated with a red circle) over the red food pieces while avoiding hitting itself and the red walls around the board.
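A minimal sketch of the latency computation described above, assuming hypothetical log structures: timestamped (time, square) entries from the robot motion log and a sorted list of button-press times. The paper does not specify its analysis code.

```python
def report_latencies_ms(robot_log, presses, error_squares):
    """For each error-square entry in the robot log, return the delay until
    the first subsequent button press, or None if the error went unreported.

    robot_log: list of (timestamp_ms, square) entries (hypothetical format).
    presses: sorted list of button-press timestamps in milliseconds.
    error_squares: set of squares defined as errors for this map."""
    latencies = []
    for t_enter, square in robot_log:
        if square in error_squares:
            later = [p for p in presses if p >= t_enter]
            latencies.append(later[0] - t_enter if later else None)
    return latencies
```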
Subjective Measures
Participants completed a questionnaire after every trial of the study that included 15 rating questions and a question estimating the number of errors made by the robot. The rating questions were completed on a 7-point scale ranging from Strongly Disagree to Strongly Agree and included questions on wariness, confidence, robot dependability, robot reliability, robot predictability, the extent to which the robot could be counted on to do its job, the degree to which the robot was malfunctioning, participant trust in this robot, participant trust in all robots generally, whether the participant will trust robots as much as before, whether the robot made a lot of errors, whether the participant could focus on Snake or if the robot required too much attention, whether the participant spent more time on Snake or robot monitoring, whether it was hard to complete the Snake task during robot monitoring, and whether the participant would spend more time watching the robot if doing the study again. Many of the questions on the post-experiment questionnaire were adapted from previous research by Jian and colleagues (Jian, Bisantz, and Drury 2000) and Muir (Muir 1989b); others were created specifically by us to assess dual-task experiences.
Hypotheses
We hypothesized that an explanation of the robot's behavior would allow participants to anticipate the robot's behavior so that they could be more selective in how they focused their attention between the two tasks. In terms of our measures, we expected explanations to result in better game task performance (higher key counts, lower key rate, and fewer snake deaths) compared to no explanations (H1). We expected that the explanations would have a greater effect when the robot was novel and diminish over time (H2), and that they would lead to higher subjective measures of trust (H3). Additionally, following prior work, we expected to find that robot errors reduced both objective and subjective trust measures (H4).
Results
We used performance metrics from the two tasks in the experiment and responses to the questionnaires to assess trust in the robot both directly and indirectly.
Dual-task performance
First, we examined key count for the Snake game. Participants pressed somewhat fewer keys during time windows in which the robot made an error, but this effect was not significant, F = 3.112, p = 0.080 (Figure 3(a)). There were no significant main effects of explanation condition, error order, or map. We found a significant interaction between map and explanation condition, F = 3.161, p = 0.045, such that participants in the explanation condition had higher key counts than those in the no explanation condition for the first two maps, but similar key counts on the last map, although the pairwise comparisons were not quite significant (Figure 3(b)).

We found a significant main effect on key rate of whether there was an error, F = 4.868, p = 0.029, such that the time between key presses was higher (i.e., a lower key press rate) when the robot made an error than when it did not. No other significant main effects or interactions were found.

For death count, there were no significant main effects, but there was a significant interaction between map and explanation condition, F = 4.374, p = 0.0139 (Figure 3(c)). Although pairwise comparisons were again not significant, a pattern of effects was found in which participants who received no explanation had higher death counts for the first two maps than those who received explanations, but this difference diminished by the third map. There was also a significant interaction between error order and whether there was an error, F = 5.536, p = 0.0198. An early error with no explanation was most likely to result in death, followed by early error with explanation, late error with explanation, and late error with no explanation. Pairwise comparisons were significant between early error/no explanation and late error/no explanation only.

We also examined button press data to assess whether participants were attending to the robot as it traversed the maps. There were no significant main effects of any of our condition manipulations on how long it took participants to hit the button after an error.

Figure 3: (a) Participants' key counts when the robot was not making an error compared to when it was. (b) Participants who received an explanation made significantly more key presses in the first two maps compared to those who did not. There was no difference between explanation conditions on the last map. (c) Similarly, participants who received an explanation died in the Snake game less frequently in the first two maps, but not the third.

Table 1: Significant main effects and interactions for trust-related questionnaire items. * = p < .05; ** = p < .01.

Questionnaire Item          Significant Effects
Wary                        Error**
Confident                   Error**
Dependable                  Error**
Reliable                    Error**
Count on this robot         Error**
Trust this robot            Error**
Predictable                 Interaction Map x Error*
Malfunctioning              Error*, Interaction Map x Error*
Trust robots in general     —
Not trust robots as much    Interaction Map x Explanation*
The robot made errors       Error**

Overall, the presence of a robot error significantly affected key rate and, although it did not quite significantly affect key count, these results are partially in line with our fourth hypothesis (H4) for the objective measures of trust. This result suggests that the supervisory task did require that participants slow down their game performance to report errors for the robot. Participants were able to slow down the key rate without reducing the accuracy of reporting errors and without an increased Snake death count.
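The paper does not report its statistical software. As a rough sketch, analyses of this shape can be run as a mixed-design ANOVA; here we use the pingouin package (our assumption) with a single within-subjects factor for brevity, although the full design also crosses error timing and order.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format table: one row per participant x map, with the
# key count from the 20-second error window. Column names are illustrative.
df = pd.read_csv("snake_measures.csv")  # columns: pid, map, explanation, key_count

# Mixed ANOVA: map is within-subjects, explanation condition is
# between-subjects; an interaction here would mirror Figure 3(b).
aov = pg.mixed_anova(dv="key_count", within="map", subject="pid",
                     between="explanation", data=df)
print(aov.round(3))
```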
Additionally, although our findings do not support our hypothesis that explanations would improve gameplay overall (H1), the Explanation condition had notable effects on key count and death count at the beginning of the experiment, on the first two maps, and these effects decreased for the last map, providing some support for hypothesis H2.

Questionnaires
Participants answered 16 questions after each trial to examine their feelings about the specific robot they had just seen as well as robots in general. For many of these questions, there was a significant main effect of which error condition they had just seen on the participants' responses (Table 1).

Ratings of "I am wary of the robot" were significantly affected by error condition, F = 6.260, p = 0.003, such that ratings for early error and late error were significantly higher (measured by Tukey HSD pairwise comparisons) than ratings when there was no error. No other effect on this item reached significance; the closest was F = 3.471, p = 0.068. Similarly, there was a significant main effect of error condition for "I am confident in the robot," F = 10.628, with significantly lower confidence after an error. Ratings of the robot's dependability and reliability, and of the extent to which this robot could be counted on and trusted, showed the same main effect of error condition (Table 1). There was a significant interaction of map and error condition for "The robot was predictable," F = 3.118, p = 0.017, but no significant pairwise comparisons were identified using Tukey HSD.

Ratings of "The robot was malfunctioning" were significantly affected by error condition, F = 11.448, with significantly lower ratings for the no error condition (M = 1.517, SD = 0.965) than for the early (M = 2.183, SD = 1.432) or late (M = 2.267, SD = 1.388) error conditions. There was also a significant interaction, F = 3.205, p = 0.045, such that explanations combined with early and late errors elicited significantly higher ratings than when there were no errors, regardless of explanation condition. Having no explanation combined with either early or late error produced intermediate ratings that were not significantly different from other combinations' ratings. The explanation says that the robot is programmed to avoid those squares, resulting in an assessment of malfunction when it enters them anyway.

None of our manipulations affected ratings of "I trust robots in general." There were no significant main effects of our manipulations on ratings for "I will not trust robots as much as I did before," although there was an interaction between map and explanation condition, F = 2.550, p = 0.0416. No pairwise comparisons were significant, however. These two questions sought to measure whether our study affected trust in robots beyond the experiment itself.

We asked participants two questions specifically about how many errors the robots made. For "The robot made a lot of errors," there was a significant main effect of error condition, F = 23.093, with higher ratings when an error had occurred. Estimates of the number of errors showed a significant main effect of error condition as well, F = 149.079. There was also a significant interaction, F = 3.239, p = 0.0431, such that an early error with an explanation elicited higher ratings than no error with an explanation. A main effect of map was found, F = 4.161, p = 0.0182, with ratings for the first map being higher than the second and third maps. An effect of error condition approached significance, F = 2.821, p = 0.0641, with not-quite-significantly higher ratings for early error than for late error or no error. There was a significant interaction of map and error condition, F = 2.568, p = 0.0407, but no significant pairwise comparisons.

Overall, the questionnaire responses clearly reflect that participants were monitoring the robot's performance levels, and errors made by the robot were reflected in assessments including trust, dependability, and reliability. These findings provide partial support for our fourth hypothesis (H4) by confirming that errors reduced subjective measures of trust. Having an explanation for the robot's behavior had no major, independent effects on questionnaire responses. This fails to confirm our hypothesis H3 that explanations would improve subjective measures of trust.

Discussion
Our results partially supported our hypothesis H2 that explanations of the robot's behavior would significantly affect the participants' gameplay during early trials of the dual-task experiment but not in the last trial, when the robot was more familiar. By the third trial, the participants who received no explanation for the robot's behavior improved their gameplay enough that the explanation did not matter. However, there was no main effect of explanations on objective trust (H1) nor subjective trust (H3) throughout the entire experiment. Additionally, there was some support for our hypothesis H4 that participant trust, measured both by gameplay and by questionnaire, was significantly affected by the robot's errors.
Role of Explanations.
Neglect tolerance measures in our dual-task experiment suggest that errors in robot performance deflect effort from the game task to increase monitoring of the robot. While robot errors reduced participant neglect tolerance (supporting H4), providing explanations for the robot's behavior boosted this tolerance during early trials (supporting H2). We provided a relatively simple explanation for the robot's task: it was programmed to avoid certain squares. Alternatively, participants with no explanation were simply told to hit the button when the robot entered those squares. While the explanation was neither long nor very specific about the robot's path, it still significantly impacted the participants in the task. It is possible that the explanation led participants to maintain their focus on the game rather than spending more effort tracking the robot's movements because it suggested that the robot ought not enter those squares and would actively avoid them. It is likely that this impact on neglect tolerance was higher when the situation was still novel because the participants had not seen very many errors occur at that point and had not created their own updated mental models of the robot's performance.

Additionally, providing different types of explanations for robot behavior could also change neglect tolerance. Our explanation suggested that the robot would avoid entering certain areas of the map, which could bias the observer's mental model to assume that the robot would not make errors. Alternative explanations, including which landmarks the robot passes over or what turns it makes through the map, could bias the person further in the same direction by providing more detail about the robot's programming and/or emphasizing that entry to those areas is a mistake, or they could bias the person to think it is not particularly important whether the robot enters those areas. It is possible that the effects of any explanations would be attenuated by a more challenging task competing with supervision of the robot. Future research should examine the effects of multiple levels of explanations and task difficulty on neglect tolerance.
Subjective Ratings of Trust.
As predicted in hypothesis H4, the presence versus absence of an error had significant negative effects on many participant ratings of the robot, including measures of trust, reliability, and dependability. However, ratings of robot malfunction were generally low even after an error had occurred. Notably, whether participants had received an explanation of robot behaviors did not significantly affect their ratings of the robot (contradicting H3).

Overall, the questionnaire results did not reflect the changes in behavior that were observed, indicating that subjective measures of trust are not sensitive enough to catch subtle differences for certain tasks. Questionnaires do not properly evaluate the level and type of trust involved in monitoring a robot's autonomy while performing another task (as found in (Desai et al. 2013) and (Yang et al. 2017)). Recording and assessing data from the dual task provided a better measure of trust through neglect tolerance.
Dual Task Experiment Design.
Our task was brief, and each trial included no more than one error. To learn more about how people allocate attention and effort, future research should investigate the effects of different robot error rates and amounts on neglect tolerance. Frequent errors or near-misses might close the gap between observers who did and did not receive explanations because they would quickly force reassessment of the observers' mental models. Moreover, an increase in these factors would likely result in worse performance on the other task. Additionally, attention and effort allocation could be biased towards the alternative task by increasing the difficulty of that task. For our game, it was possible to slow down the button presses and avoid hitting obstacles in order to avoid losing the game; however, a game with more obstacles or opportunities to win points in shorter time spans might elicit more effort from the player and divert attention away from the robot. For real-world robot supervision, it is important to know what tasks are appropriate for people to do in addition to noticing robot behaviors.
Novelty Effect.
Finally, our examination of novelty was relatively limited. An increase in the number, variety, and length of trials would allow further assessment of the degree to which explanations matter as someone gains more experience with the robot. Moreover, it is possible that map order impacted our results. There also are likely long-term effects of practice on both tasks. Novelty effects might also relate to task difficulty: explanations may impact users' mental models of the robot for a longer period of time if users are expending their effort on the other task, because they do not have the cognitive effort available to update these models.
Conclusion
We conducted a dual-task experiment to study the effect of explanations on robot trust. We measured participants' neglect tolerance (the time that participants spent watching our robot versus performing their own task) as well as subjective trust through surveys. While explanations did not have a main effect on objective or subjective trust measures, they did have an effect that counteracts the novelty of seeing a new robot for the first time. Additionally, we found that our neglect tolerance measure was able to identify subtle changes in trust compared to survey measures, which did not find significant differences across conditions in the study.
References

[Abdul et al. 2018] Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B. Y.; and Kankanhalli, M. 2018. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 582. ACM.

[Bohus, Saw, and Horvitz 2014] Bohus, D.; Saw, C. W.; and Horvitz, E. 2014. Directions robot: In-the-wild experiences and lessons learned. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS '14, 637–644. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.

[Burgard et al. 1998] Burgard, W.; Cremers, A. B.; Fox, D.; Hähnel, D.; Lakemeyer, G.; Schulz, D.; Steiner, W.; and Thrun, S. 1998. The interactive museum tour-guide robot. In AAAI/IAAI, 11–18.

[Bussone, Stumpf, and O'Sullivan 2015] Bussone, A.; Stumpf, S.; and O'Sullivan, D. 2015. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics, 160–169. IEEE.

[De Visser et al. 2006] De Visser, E.; Parasuraman, R.; Freedy, A.; Freedy, E.; and Weltman, G. 2006. A comprehensive methodology for assessing human-robot team performance for use in training and simulation. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 50, 2639–2643. Los Angeles, CA: SAGE Publications.

[Desai et al. 2012] Desai, M.; Medvedev, M.; Vázquez, M.; McSheehy, S.; Gadea-Omelchenko, S.; Bruggeman, C.; Steinfeld, A.; and Yanco, H. 2012. Effects of changing reliability on trust of robot systems. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 73–80. ACM.

[Desai et al. 2013] Desai, M.; Kaniarasu, P.; Medvedev, M.; Steinfeld, A.; and Yanco, H. 2013. Impact of robot failures and feedback on real-time trust. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 251–258. IEEE Press.

[Gockley et al. 2005] Gockley, R.; Bruce, A.; Forlizzi, J.; Michalowski, M.; Mundell, A.; Rosenthal, S.; Sellner, B.; Simmons, R.; Snipes, K.; Schultz, A. C.; and Wang, J. 2005. Designing robots for long-term social interaction. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1338–1343.

[Goodrich and Olsen Jr 2003] Goodrich, M. A., and Olsen Jr, D. R. 2003. Seven principles of efficient human robot interaction. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, 3943–3948.

[Gosling, Rentfrow, and Swann Jr 2003] Gosling, S. D.; Rentfrow, P. J.; and Swann Jr, W. B. 2003. A very brief measure of the big-five personality domains. Journal of Research in Personality 37(6):504–528.

[Hall 1996] Hall, R. J. 1996. Trusting your assistant. In Proceedings of the 11th Knowledge-Based Software Engineering Conference, 42–51.

[Hayes and Shah 2017] Hayes, B., and Shah, J. A. 2017. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, 303–312. ACM.

[Jian, Bisantz, and Drury 2000] Jian, J.-Y.; Bisantz, A. M.; and Drury, C. G. 2000. Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics 4(1):53–71.

[Kanda et al. 2010] Kanda, T.; Shiomi, M.; Miyashita, Z.; Ishiguro, H.; and Hagita, N. 2010. A communication robot in a shopping mall. IEEE Transactions on Robotics 26(5):897–913.

[Lee and Moray 1992] Lee, J., and Moray, N. 1992. Trust, control strategies and allocation of function in human-machine systems. Ergonomics 35(10):1243–1270.

[Lim, Dey, and Avrahami 2009] Lim, B. Y.; Dey, A. K.; and Avrahami, D. 2009. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2119–2128. ACM.

[Muir 1989a] Muir, B. 1989a. Operators' Trust in and Use of Automatic Controllers in a Supervisory Process Control Task. University of Toronto.

[Muir 1989b] Muir, B. M. 1989b. Operators' Trust in and Use of Automatic Controllers in a Supervisory Process Control Task. Ph.D. Dissertation, University of Toronto.

[Ribeiro, Singh, and Guestrin 2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

[Rosenthal and Veloso 2012] Rosenthal, S., and Veloso, M. M. 2012. Mobile robot planning to seek help with spatially-situated tasks. In AAAI.

[Rosenthal, Selvaraj, and Veloso 2016] Rosenthal, S.; Selvaraj, S. P.; and Veloso, M. M. 2016. Verbalization: Narration of autonomous robot experience. In IJCAI, 862–868.

[Sukkerd, Simmons, and Garlan 2018] Sukkerd, R.; Simmons, R.; and Garlan, D. 2018. Towards explainable multi-objective probabilistic planning. In Proceedings of the 4th International Workshop on Software Engineering for Smart Cyber-Physical Systems (SEsCPS '18).

[Thrun et al. 1999] Thrun, S.; Bennewitz, M.; Burgard, W.; Cremers, A. B.; Dellaert, F.; Fox, D.; Hähnel, D.; Rosenberg, C.; Roy, N.; Schulte, J.; et al. 1999. Minerva: A second-generation museum tour-guide robot. In Proceedings of the 1999 IEEE International Conference on Robotics and Automation, volume 3. IEEE.

[Tolmeijer et al. 2020] Tolmeijer, S.; Weiss, A.; Hanheide, M.; Lindner, F.; Powers, T. M.; Dixon, C.; and Tielman, M. L. 2020. Taxonomy of trust-relevant failures and mitigation strategies. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI '20, 3–12. New York, NY, USA: Association for Computing Machinery.

[Yang et al. 2017] Yang, X. J.; Unhelkar, V. V.; Li, K.; and Shah, J. A. 2017. Evaluating effects of user experience and system transparency on trust in automation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM.