Mitigating belief projection in explainable artificial intelligence via Bayesian Teaching
Scott Cheng-Hsin Yang†∗, Wai Keen Vong†, Ravi B. Sojitra†, Tomas Folke, Patrick Shafto

Department of Mathematics and Computer Science, Rutgers University, 101 Warren Street, Newark, NJ 07102, USA
Center for Data Science, New York University, 60 5th Ave, New York, NY 10011, USA
Department of Management Science and Engineering, Stanford University

† Equal contribution. ∗ To whom correspondence should be addressed; E-mail: [email protected].
Abstract
State-of-the-art deep-learning systems use decision rules that are challenging for humans to model. Explainable AI (XAI) attempts to improve human understanding but rarely accounts for how people typically reason about unfamiliar agents. We propose explicitly modelling the human explainee via Bayesian Teaching, which evaluates explanations by how much they shift explainees' inferences toward a desired goal. We assess Bayesian Teaching in a binary image classification task across a variety of contexts. Absent intervention, participants predict that the AI's classifications will match their own, but explanations generated by Bayesian Teaching improve their ability to predict the AI's judgements by moving them away from this prior belief. Bayesian Teaching further allows each case to be broken down into sub-examples (here saliency maps). These sub-examples complement whole examples by improving error detection for familiar categories, whereas whole examples help predict correct AI judgements of unfamiliar cases.
While Artificial Intelligence (AI) can help address socially-relevant problems [1, 2, 3], it is important for humans to be able to scrutinize AI decisions so we may audit, understand, and improve performance; indeed, this is legally mandated in certain contexts [4, 5]. The best performing AI algorithms use decision rules that feel alien to most humans [6], which impedes the adoption of such AI in high-leverage contexts, emphasizing the need for successful explanations that facilitate human understanding and prediction of the AI's behavior.

A popular class of methods to explain AI is explanation-by-examples. Explanation-by-examples takes as input an AI model to be explained and the data that it has been trained on, and produces as output a small subset of training data that exerts high impact on the inference of the explainee. For example, if the aim is to explain whether a deep-learning model would classify a given image as a cat or a dog, explanation-by-examples selects the cat and dog images that are most representative of those categories. The utility of explanation-by-examples is supported by research that confirms humans' ability to induce principles from a few examples [7, 8, 9, 10] as well as the extensive use of examples in education [11, 12, 13]. The explanation-by-examples approach has many desirable properties: It is fully model-agnostic and applicable to all types of machine learning [14, 15, 16]; it is domain- and modality-general [17, 18]; and it can be used to generate both global explanations [19, 20, 21, 22, 23] and local explanations [24, 25, 26]. Although the technology of explanation-by-examples for XAI has been developed for at least two decades [27, 28], empirical tests and connections to its ecological roots in the social sciences have been limited.

Explanation-by-examples can be considered a social teaching act, which can be formally captured by Bayesian Teaching [29]. In Bayesian Teaching, there are two parties: a teacher (explainer) who selects examples and a learner (explainee) who draws inferences. The teacher selects examples intended to maximize the learner's probability of a correct inference, based on the teacher's model of the learner's current beliefs and their inductive biases [30, 31, 32]; the learner uses Bayesian updating to make predictions given these examples [15, 33, 34, 35]. Existing work on explanation-by-examples has demonstrated explanation effectiveness relative to several baseline conditions [14, 20, 36, 37]; however, there is rarely a principled, a priori rationale as to why the proposed improvements should work. By explicating the computations used to model the explainer, the explainee, and the explanation selection process, Bayesian Teaching provides testable predictions on the effectiveness of explanatory examples in different contexts.

The core prediction of Bayesian Teaching is that explanations which lead the learner model to correct predictions will help humans to better understand the AI. By symmetry, explanations that lead the learner model to incorrect predictions will lead humans to targeted misunderstanding. Combining these, we hypothesize that the helpfulness of the examples positively covaries with the predictive performance of the participants.
We also hypothesize that participants prefer helpful examples over random examples, and that they prefer random examples over examples selected to be detrimental.

Because explanations generated by Bayesian Teaching are tailored to a learner model, it follows that explanation effectiveness depends on how accurately the explainee is modelled. We evaluate the fidelity of a learner model in the context of image classification. We use a popular deep-learning classifier as our model of the human explainee [38]. This is a reasonable starting point because such models effectively capture labels provided by humans, and as such encode some statistical patterns of human labelling. However, the way deep learning models learn from the Bayesian Teacher does not capture all human inductive biases. Below we suggest some potential sources of human bias and how they can be tested.

We expect that participants will treat predicting the AI as a social prediction task. In many social prediction tasks (in contrast to mechanistic prediction tasks) people use their own preferences or judgements as priors for other agents [39, 40, 41, 42, 43]. In our context that would mean participants projecting their own first-order beliefs onto the artificial agent, which results in a number of testable hypotheses outlined in the next paragraph.

Because humans generally perform well on image classification [44], they will expect the AI to be highly accurate, particularly for familiar categories; successful interventions would mitigate this expectation. In our experiment this would translate to humans who predict AI classification achieving higher sensitivity (correctly predicting the AI's correct classifications) than specificity (correctly predicting the AI's mistakes), absent explanation. Explanations, if effective, should reduce this difference by increasing specificity. Second, because unfamiliar categories have fuzzier mental representations, we hypothesize that examples improve predictive performance most for trials involving unfamiliar categories, and consequently that the preference for helpful examples is strongest when dealing with unfamiliar cases. Because more familiar categories should be easier to distinguish, and because participants expect the model to get the right answer for trials they themselves find easy, belief projection implies that familiarity should increase predictive performance for model hits. Conversely, familiarity should decrease performance for model errors. Finally, if explanations successfully shift participants away from belief projection, they should mitigate these relationships between familiarity and performance.

The notion that humans model the AI as a social agent further implies that users would reason differently about the AI's correct classifications relative to its errors. Social priors tend to be "sticky" in that they are slow to update and are rarely completely overwhelmed by conflicting evidence [39, 45]. This stickiness may be driven by humans processing confirmatory and disconfirmatory evidence differently [46, 47]. The distinction between confirmatory and disconfirmatory evidence is currently not captured by our learner model, nor by most other high-performing classification models, so if people make this distinction it has important implications for XAI.
In our task we can test this by checking whether human performance and explanation effectiveness differ between trials where the aim is to confirm a correct prediction of the AI as opposed to detect a mistake.

An additional benefit of Bayesian Teaching is that it allows for selection of examples at different levels of granularity. For the current task, we consider the selection of entire images as well as pixels in an image as explanations. Surprisingly, the latter pixel-selection process derived from Bayesian Teaching turns out to be mathematically equivalent to a type of feature attribution method called Randomized Input Sampling for Explanations [48]. Thus the two levels of example granularity evaluated in this paper coincide with two popular methods of explanation: explanation-by-examples and saliency maps. We lack strong prior hypotheses for the relative impact of saliency maps and case-level examples, and how the two might interact. Consequently, we test a wide range of combinations and evaluate how they impact participants' predictions of AI classifications.

We use image classification on the ImageNet 1K dataset [44] as the testbed. The model to be explained is ResNet-50 [38]. Following an ideal-observer approach [49, 50], we instantiate Bayesian Teaching by selecting examples with differing degrees of helpfulness as judged by the predictive performance of the learner model. For the learner model, we used a ResNet-50 model where the last softmax layer is replaced by a probabilistic linear discriminant analysis (PLDA) model. This alteration introduces the probabilistic training required by Bayesian Teaching while keeping the architecture of ResNet-50, which is known to accurately fit human labels [38]. The familiarity context is captured by ensuring that the categories selected cover a wide range of familiarity as judged by humans. The context of the AI's correct and mistaken predictions is straightforwardly manipulated by selecting test images that cover both cases. We also vary the informativeness of class labels (informative vs. generic) and how the saliency maps are presented (heatmap vs. blur) as additional contextual variables. Our investigation, centered on the implications of Bayesian Teaching, offers a comprehensive and nuanced picture of how explanations can mitigate belief projection in XAI.

Results
Methodological overview
User understanding in the context of classification can be captured by how well the user can predict the model's judgement. Throughout this paper we will refer to this predictive capacity as performance. A natural measure of explanation effectiveness is how much the explanations increase such performance, relative to a control condition. We designed a two-alternative forced choice (2AFC) task in which participants were asked to predict the model's classification of a target image between two given categories. No trial-by-trial feedback was provided to participants. It is important to note that in this task high performance does not imply that participants' judgements match the ground truth of the image, which we refer to as first-order accuracy or simply accuracy. It is possible for a participant to have high accuracy (in that their judgements often match the ground-truth category of the image) but poor performance (in that their judgements rarely match the AI's).

We designed a total of 15 conditions that vary along three dimensions: (1) presence of informative labels (two levels: [GENERIC LABELS] or [SPECIFIC LABELS]), (2) types of examples (three levels: [NO EXAMPLES], [HELPFUL], or [RANDOM]), and (3) types of saliency maps (three levels: [NO MAP], [JET], or [BLUR]). The labels dimension indicates whether the categories shown were given informative labels (e.g., "Border terrier" or "Norwich terrier") or generic labels (Category A or Category B). The examples dimension indicates whether examples of the two image categories were shown, and if so, whether they were selected to be helpful or were drawn from a uniform distribution of helpfulness as determined by Bayesian Teaching. The saliency map dimension indicates whether the images were overlaid with saliency maps that highlighted which pixels the AI focused on to make its classification. If saliency maps were included, they were either visualized as a semi-transparent jet color map or as an image filter where unimportant pixels were blurred. We found no significant difference between the [BLUR] and [JET] conditions; thus, for increased clarity we use the [MAP] condition, which contains both variants, in the main text. See Supplementary Discussion D2 for the main analyses in the paper repeated with [BLUR] and [JET] coded separately. Table 1 shows the sample size of each condition. Figure 1 shows a trial where the categories are represented with informative labels, helpful examples, and blur saliency maps.
                     SPECIFIC LABELS                     GENERIC LABELS
             NO EXAMPLES    EXAMPLES                      EXAMPLES
                            HELPFUL    RANDOM             HELPFUL    RANDOM
NO MAP       N = 76         N = 35     N = 34             N = 38     N = 36
MAP   BLUR   N = 65         N = 33     N = 36             N = 35     N = 34
      JET    N = 71         N = 33     N = 35              N = 35     N = 35

Table 1: Naming convention of conditions and the number of participants in each condition. In the main text, conditions are referred to with brackets and the "&" logical operation. For example, [NO EXAMPLES] & [NO MAP] refers to the condition with 76 participants (see Experimental conditions in Methods for more detail).

Figure 1: A snapshot of the experiment (trial 47 of 150). The screenshot shows the target image, two example images from each of the categories "Flagpole" and "Barn" together with their saliency maps, the instruction reminding participants that the robot sometimes makes mistakes, and the 2AFC prompt asking which category the robot will classify the target image as. See Methods and Table 1 for the naming conventions of the conditions used below. The experimental condition above the black line is [SPECIFIC LABELS] & [HELPFUL] & [BLUR]. Under the black line is the [JET] equivalent of the second row, which is obtained by replacing the blurring maps with the jet color maps. Experimental conditions with generic labels are obtained by replacing specific labels ("Flagpole" and "Barn" in this case) with generic category names ("Category A" and "Category B"). Experimental conditions without the saliency maps, i.e., [NO MAP], show only the first row of images. Conditions without examples, i.e., [NO EXAMPLES], show only the first column of image(s). All images and saliency maps shown were 224-by-224 pixels. The prediction of the model to be explained on the target image is "Flagpole" in this case.

Each trial has three more distinct features beyond the condition it belongs to: the ResNet accuracy, the simulated learner performance, and a familiarity score. ResNet accuracy refers to the classification accuracy on the category which the target ResNet-50 model predicts that the target image belongs to (see Supplementary Table T1). Note that in contrast to the ResNet accuracy, which is an accuracy on the category level, we use the term model correctness to refer to whether the target model made a correct judgement on a specific trial. The simulated learner performance of a trial (only available in the [EXAMPLES] conditions) is an estimate of the probability that the learner model's classification would match the target ResNet-50 model's classification, given the categories and examples presented. Finally, in a separate study seven raters indicated their familiarity with each category pairing by stating whether they thought they could correctly match images of the two categories presented to their respective labels. The familiarity score is the mean value across all seven raters. See the Methods for a more technical explanation of these features.
Bayesian Teaching improves predictive performance
To evaluate whether the XAI interventions improved human performance we compared participants who obtained a full explanation ([SPECIFIC LABELS] & [HELPFUL] & [MAP]) with a control group that received no explanations ([SPECIFIC LABELS] & [NO EXAMPLES] & [NO MAP]). When interpreting these results in relation to belief projection it is instructive to consider three idealized scenarios. An agent who picked categories at random would have 50% performance, sensitivity (correctly predicting AI classifications when the AI is correct), and specificity (correctly predicting the AI's mistakes). An agent who modelled the AI perfectly would have 100% performance, sensitivity, and specificity. Finally, an agent with perfect first-order accuracy who projected their own beliefs onto the AI would have 100% sensitivity, 0% specificity, and 33% overall performance, because the experiment contains twice as many AI errors as AI correct classifications (see Methods). Absent intervention, participants behave most like the third, belief-projecting, agent (Figure 2).

The explanation interventions increase overall performance by increasing specificity (participants are better able to spot the AI's mistakes), at the cost of some sensitivity. Participants in the control condition have a mean performance of 49.83% [95% CI = 48.83% - 50.84%], significantly lower than the 55.04% [95% CI = 52.58% - 57.48%] performance of the experimental group (β = 0.21(0.03), z = 6.99, p < .0001). This is primarily driven by higher specificity in the experimental group (43.98% [95% CI = 39.68% - 48.37%] relative to the control group's 32.54% [95% CI = 30.96% - 34.13%]; β = 0.49(0.05), z = 9.20, p < .0001). The greater vigilance of the experimental group came with a minor cost to sensitivity (78.90% [95% CI = 71.59% - 84.80%] for the experimental group versus 85.26% [95% CI = 83.12% - 87.22%] for the control group; β = -0.43(0.12), z = -3.68, p = .0002), but not enough to offset the specificity gains. Collectively, these results imply that participants attempt to predict the AI by projecting their own beliefs, and that the explanations improve performance by mitigating this belief projection.
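To make the performance measures concrete, the sketch below shows how overall performance, sensitivity, and specificity can be computed from per-trial records. It is an illustrative reconstruction rather than the authors' analysis code, and the column names (predicted_ai_choice, actual_ai_choice, ai_correct) are hypothetical.

```python
import pandas as pd

# Hypothetical per-trial records: what the participant predicted the AI would
# choose, what the AI actually chose, and whether the AI's choice matched the
# ground-truth label of the target image.
trials = pd.DataFrame({
    "predicted_ai_choice": ["cat", "dog", "dog", "cat", "dog", "cat"],
    "actual_ai_choice":    ["cat", "cat", "dog", "dog", "dog", "cat"],
    "ai_correct":          [True,  False, False, False, True,  True],
})

# A trial counts toward performance when the participant's prediction matches
# the AI's actual classification (second-order accuracy).
match = trials["predicted_ai_choice"] == trials["actual_ai_choice"]

performance = match.mean()                        # all trials
sensitivity = match[trials["ai_correct"]].mean()  # trials where the AI was right
specificity = match[~trials["ai_correct"]].mean() # trials where the AI was wrong

print(f"performance={performance:.2f}, "
      f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Under this scoring, a belief-projecting participant with perfect first-order accuracy matches the AI only on the trials where the AI is correct, which in this design is one third of the trials; that is where the 33% figure above comes from.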
Participants prefer examples that are helpful according to Bayesian Teaching

Having established that examples generated by Bayesian Teaching improved participants' ability to predict AI judgements, we want to evaluate whether participants preferred helpful to random and misleading examples. To test this, we ran a second study where participants chose between helpful examples versus random examples or versus misleading examples, where helpfulness was determined by Bayesian Teaching. Participants showed a small but reliable preference for helpful relative to random examples and a substantial preference for helpful versus misleading examples. The preference for helpful examples was particularly pronounced when the image categories were unfamiliar to the participants (see Supplementary Discussion D1 for all the details).
Figure 2: The effectiveness of examples generated by Bayesian Teaching, evaluated by comparing the performance of the participants who obtained a full explanation ([SPECIFIC LABELS] & [HELPFUL] & [MAP]; 66 participants; 9,899 observations) with a control group ([SPECIFIC LABELS] & [NO EXAMPLES] & [NO MAP]; 76 participants; 11,394 observations). (A) Three idealised performance profiles, showing the performance of: a random agent, a perfect agent, and an agent with perfect access to the ground truth who assumes that the AI always mirrors their own predictions (belief projection). (B) Human performance most closely matches the belief projection profile, but the interventions increase specificity (and slightly reduce sensitivity) by making participants better at spotting the AI's errors. The violin plots show the distribution of performance within conditions. Black dots show the group mean with error bars signifying 95% bootstrapped confidence intervals. (C) Individual participants' sensitivity and specificity. The vertices of the triangle show the performance of a belief-projecting agent with perfect access to the ground truth (upper left), an agent with a perfect model of the AI (upper right), and an agent choosing at random (lower middle). The control group is clustered at high sensitivity and low specificity towards the upper left, whereas the experimental group is shifted to the right. However, the experimental group also shows greater variance, signifying inter-individual differences in the intervention effectiveness.

Bayesian Teaching can lead participants to both correct and incorrect inference

Bayesian Teaching makes explicit the existence of an explainee and suggests that a sound learner model should have the capacity to track the inference of actual explainees. In our experiment the calibration between the learner model and the participants is captured by the relationship between ResNet accuracy and participant accuracy. We estimate participant accuracy (their first-order belief about the ground truth) by using their performance in the control trials (their second-order belief about the AI with no exposure to explanation). The assumption that their attempt to predict the AI may serve as a proxy of their first-order accuracy is justified given the tendency to belief-project observed in previous sections. We found that participant performance (interpreted as accuracy for the control trials) was positively correlated with ResNet accuracy for trials where the model was correct (β = 1.74(0.20), z = 8.67, p < .0001), indicating good calibration between the model and participants in this situation (see Supplementary Figure F1). We also found a negative interaction between ResNet accuracy and model correctness (β = -2.57(0.23), z = -11.03, p < .0001). This suggests poor calibration in the special case in which the model's overall accuracy on the predicted category is high but it misclassifies the particular trial.

Bayesian Teaching should be able to modify participant performance by selecting explanations of varying helpfulness. To test this in practice, we ran three nested hierarchical logistic regression models of increasing complexity. Each regression model predicted participant performance (whether the participant correctly predicted the AI on a given trial) from the [EXAMPLES] trials only, as these are the only trials impacted by the simulated learner performance, which measures the degree to which the examples would lead the learner model to the targeted inference. The first regression model only included ResNet accuracy as a predictor as well as a dummy variable encoding whether the AI prediction for that trial matched the ground truth or not. The second regression model added simulated learner performance as a predictor. The third regression model added two two-way interactions: between model correctness (model hit and error) and ResNet accuracy, and between model correctness and simulated learner performance.
We found that the second regression model fitted the performance data better than the first regression model (χ² test, p < .0001). This means that simulated learner performance predicts participant performance above and beyond ResNet accuracy and whether the AI was correct for that specific trial. The third regression model outperformed the second regression model (χ² test, p < .0001). This indicates that the predictive power of ResNet accuracy and/or simulated learner performance differed for trials with correct or incorrect AI judgements.

To explore how model correctness interacted with ResNet accuracy and simulated learner performance, we explored the parameters of the third regression model. Participants are typically better at predicting the AI when it is correct relative to when it is wrong (β = 0.53(0.06), z = 9.15, p < .0001). This aligns with our previous results, which suggest that participants have a sense of the ground truth for most trials, and assume that the AI would make the same judgement that they would make. ResNet accuracy is positively associated with participant performance when the AI is wrong (β = 0.59(0.05), z = 12.30, p < .0001), and even more strongly associated with performance when the AI is correct (β = 0.93(0.09), z = 10.68, p < .0001; see Figure 3). Because there was a significant positive relationship with ResNet accuracy for both the control trials and the example trials, it seems plausible that the calibration between model and participant observed in the control condition carries over to the example condition, at least partially. Finally, while statistically controlling for ResNet accuracy, simulated learner performance did not predict human performance on trials when the AI was wrong (β = -0.01(0.03), z = -0.16, p = .89) but did so for trials when the AI was correct (β = 0.77(0.05), z = 14.19, p < .0001). Because the simulated learner performance determined which examples were shown, the fact that this variable could accurately predict human performance above and beyond ResNet accuracy implies that the Bayesian Teacher can successfully shift the participant judgements in either direction.
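As a simplified illustration of this nested-model comparison, the sketch below fits two plain (non-hierarchical) logistic regressions and compares them with a likelihood-ratio test. The authors' actual analysis used hierarchical models with participant-level structure; the data here are simulated and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Hypothetical trial-level data: the AI's category-level accuracy (ResNet
# accuracy), whether the AI was right on this trial, and the simulated learner
# performance of the shown examples.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "resnet_acc": rng.uniform(0, 1, n),
    "model_correct": rng.integers(0, 2, n),
    "sim_learner_perf": rng.uniform(0, 1, n),
})
logit_p = (-0.5 + 0.9 * df["resnet_acc"] + 0.5 * df["model_correct"]
           + 0.7 * df["sim_learner_perf"])
df["participant_correct"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Model 1: ResNet accuracy plus a dummy for model correctness.
X1 = sm.add_constant(df[["resnet_acc", "model_correct"]])
m1 = sm.Logit(df["participant_correct"], X1).fit(disp=False)

# Model 2: additionally includes simulated learner performance.
X2 = sm.add_constant(df[["resnet_acc", "model_correct", "sim_learner_perf"]])
m2 = sm.Logit(df["participant_correct"], X2).fit(disp=False)

# Likelihood-ratio test: does the added predictor improve the fit?
lr_stat = 2 * (m2.llf - m1.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```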
Figure 3: Simulated learner performance predicts human performance across trials with examples (419 participants; 62,820 observations). (A) The simulated learner performance, the helpfulness of the explanatory examples expected by the Bayesian Teacher, correlates significantly with participant performance for correct trials but not for incorrect trials. This suggests that the Bayesian Teaching framework can generate explanations that are informative or misleading for trials that are correctly classified by the model, but not for trials that are incorrectly classified. (B) ResNet accuracy is positively associated with participant performance, both for trials when the AI is correct and when it is wrong. A similar trend is observed in the control condition (see Supplementary Figure F1). This suggests that humans and ResNet find the same categories difficult to discriminate. The difference in performance between when the AI is correct and when the AI is wrong suggests that it is harder to teach incorrect judgements, at least in this context. (C) Two-dimensional kernel density with 25 density bins showing the distribution of trials in terms of ResNet accuracy and simulated learner performance. In this study the two are independent. Note that the higher density near perfect simulated learner performance was due to all the helpful examples being selected based on this variable, so they constitute a majority of our example trials.

Bayesian Teaching improves performance through belief-mitigation

The previous results indicate that examples improve participant predictions of the AI's classifications. Additionally, participants prefer examples that are helpful according to the Bayesian Teacher, and this preference is particularly pronounced for unfamiliar categories. Next, we will explore how explanatory examples improve performance, and evaluate the relative importance of the different explanation features employed. The preceding results imply that people belief-project by default: that is, they use their own beliefs as priors for the AI's beliefs. The interventions shift these priors, allowing the participants to distinguish their first-order beliefs about the correct classification from their second-order beliefs about the classifications of the AI.

To further evaluate whether explanations improve performance by mitigating belief projection, we compared how the interventions impacted performance and first-order accuracy in the complete data set. Specifically, we contrasted [SPECIFIC LABELS] vs [GENERIC LABELS], [MAP] vs [NO MAP], and [EXAMPLES] vs [NO EXAMPLES], while controlling for ResNet accuracy and familiarity score. We ran separate analyses for when the AI was correct and when the AI was wrong, corresponding to the distinction between sensitivity and specificity in previous sections. We treat the ground truth as a proxy of participant first-order beliefs, a defensible assumption given the reported human accuracy on ImageNet in previous work [44]. Based on this assumption, if the interventions increase performance while reducing the number of classifications that correspond to the ground truth, this would imply that the examples enable participants to move away from their first-order judgements. The [SPECIFIC LABELS] conditions are associated with higher performance than the [GENERIC LABELS] conditions regardless of whether the AI is correct (β = 0.24(0.08), z = 3.06, p = .002) or not (β = 0.07(0.03), z = 2.13, p = .03). However, because these effects are small and orthogonal to belief projection, they will not be discussed further.

The presence of the saliency maps in the [MAP] condition improves performance when the AI is wrong (β = 0.43(0.03), z = 14.24, p < .0001), but reduces performance (to a lesser extent) when the AI is correct (β = -0.56(0.07), z = -7.98, p < .0001; see Figure 4). In both cases, saliency maps reduced the first-order accuracy of the participants (model hit: β = -0.56(0.07), z = -7.98, p < .0001; model error: β = -0.43(0.03), z = -14.24, p < .0001), meaning that they were less likely to report that they believed the AI's judgement matches the ground truth of the image. This implies that the saliency maps encourage participants to consider that the AI might be mistaken. One potential explanation for this observation is that the saliency maps show when the AI attends to non-sensible features (i.e., parts that are not representative of either of the categories) as well as ambiguous features (e.g., thin metal strips that are present in both the "Electric Fan" and "Buckle" categories).

Comparing all [EXAMPLES] trials to all [NO EXAMPLES] trials, the presence of examples does not significantly improve performance when the AI is correct (β = -0.13(0.08), z = -1.61, p = .11) or when the AI is wrong (β = 0.02(0.03), z = 0.69, p = .49). However, in the conditions where examples were present, helpful examples improve performance for trials when the AI was correct (β = 0.77(0.08), z = 10.11, p < .0001), but not for trials when the AI was wrong (β = 0.06(0.04), z = 1.77, p = .08). The positive influence of the helpful but not the random examples confirms that Bayesian Teaching can shift beliefs from priors. Note that this is the opposite effect relative to what we found for the saliency maps: Whereas saliency maps help participants to identify trials when the AI has made a mistake by exposing sub-image-level features, the examples help reinforce participants' prior beliefs for trials in which the AI is correct (see Figure 4). In other words, the saliency maps and the examples serve separate and complementary functions in explaining AI judgements to the participants.

Figure 4: Task performance differs depending on AI classification accuracy. (A) & (B) are based on the entire data set, comparing all [MAP] conditions to all [NO MAP] conditions (631 participants; 94,582 observations). (C) & (D) exclude the [NO EXAMPLES] trials and contrast all [HELPFUL] trials with all [RANDOM] trials (419 participants; 62,820 observations). (A) The saliency maps improve performance for trials when the AI is wrong but reduce performance when the AI is correct. (B) The saliency maps make people less likely to classify the target image to align with the ground truth, independent of AI accuracy. Together, (A) & (B) imply that the saliency maps help people to consider that the AI might make mistakes. (C) In trials with examples, helpful examples tend to help people to accurately model the AI in cases when the AI is correct, but have a limited impact when the AI is wrong. (D) Consequently, helpful examples make participants more likely to pick the ground truth option when the AI is correct, but do not really impact the probability of selecting the ground truth option when the AI is wrong. Collectively, these results suggest that helpful examples and saliency maps improve human understanding of the AI in distinct and complementary ways. Error bars represent 95% bootstrapped confidence intervals. All point estimates have confidence intervals, though some are too narrow to see clearly.

The familiarity scores capture the ease of the discrimination task in that they are higher for trials involving categories that humans are familiar with. These scores provide clues as to whether participants project their own beliefs onto the AI: If humans use their first-order classifications to model the AI, participants should assume that the AI gets the correct answer for trials that they themselves find easy. This is indeed what we find: familiarity is positively associated with performance when the AI is correct (β = 1.10(0.04), z = 29.28, p < .0001), but negatively associated with performance for AI errors (β = -0.92(0.02), z = -42.82, p < .0001; Figure 5).

Previously, we showed that saliency maps improved performance on trials when the AI was wrong. This could be explained by saliency maps helping participants distinguish between their first-order judgements of the ground truth and their second-order beliefs about the model classification. This explanation can be evaluated by testing whether the impact of the familiarity scores on performance is attenuated by the saliency maps. In other words, if participants are more likely to predict that the AI is correct on trials that they themselves find easy, and the saliency maps work by helping people realize that the AI uses decision processes that differ from their own, the saliency maps should make participants more willing to consider that the AI might be wrong for trials they themselves find easy. This is what we find (see Figure 5): the presence of saliency maps reduces the positive impact of familiarity on performance when the AI is correct (β = -0.51(0.08), z = -6.31, p < .0001). Conversely, saliency maps reduce the negative impact of familiarity on performance when the AI is wrong (β = 0.70(0.05), z = 15.22, p < .0001; Figure 5). Collectively these results suggest that the presence of saliency maps helps participants model the AI as an agent with distinct beliefs that may conflict with their own.

Though the presence of examples did not generally impact performance, it is possible that they impacted judgements specifically for unfamiliar categories. Like the saliency maps, examples typically reduced the impact of familiarity on performance, both when the AI is correct (β = -1.01(0.08), z = -12.71, p < .0001) and when the AI is wrong (β = 0.33(0.05), z = 7.35, p < .0001). However, in contrast to the saliency maps, examples seem to be most helpful for unfamiliar trials when the AI is correct (see Figure 5C). This effect may imply that the examples help participants develop a working representation of the unfamiliar categories, which they are otherwise lacking.

Figure 5: Familiarity score predicts human performance based on the full data set (631 participants; 94,582 observations). (A) Familiarity score is positively associated with correct predictions when the AI is right, but is negatively associated with correct predictions when the AI is wrong. This provides further evidence that participants project their own beliefs to model the AI. (B) Saliency maps decrease the impact of familiarity on participant judgements. For model hits this leads to decreased performance, whereas for model errors it leads to improved performance. This pattern provides further evidence that the saliency maps work by shifting participants away from using their first-order judgments to model the AI's classifications. (C) Examples also decrease the impact of familiarity on participant judgements. For model hits this improves performance for unfamiliar items but decreases performance for familiar items, with the opposite pattern for model errors. These results suggest that examples are most beneficial for unfamiliar items when the model is correct. Error bars represent 95% bootstrapped confidence intervals. All point estimates have confidence intervals, though some are too narrow to see clearly. Shaded areas represent analytic 95% confidence intervals.
Discussion
Multiple strands of evidence from our results suggest that people default to belief projection when reasoning about the AI's decisions. Specifically, we find that participants in the control condition show higher sensitivity than specificity, and that this discrepancy becomes more extreme the more familiar participants are with the trial categories. Our results imply that such belief projection can be mitigated by Bayesian Teaching. We find that examples selected to be helpful according to Bayesian Teaching help people to predict the AI's decisions and are preferred over random examples or examples selected to be detrimental. The most compelling evidence that explanations mitigate belief projection is that the impact of familiarity on performance is reduced by explanations: explanations make participants more likely to catch AI mistakes on trials they themselves found easy. Beyond mitigating belief projection, Bayesian Teaching predicts that examples can in fact be chosen to guide people to any target, including erroneous inference. Our results confirmed this prediction in that examples designed to mislead the participants reduced their performance below the no-intervention baseline.

Bayesian Teaching also gives a coherent framework for comparing and contrasting explanatory methods that have hitherto been considered independent: explanation-by-examples and feature attribution. We apply Bayesian Teaching to study explanation-by-examples, a popular method for XAI that previously has lacked a sound theoretical footing. Explanation-by-examples has many strengths: it is model-agnostic, domain-general, and easy to use with other XAI methods. Viewed through a Bayesian Teaching lens, this method can be generalized to include feature attribution, another popular post-hoc method, by splitting each example into its component features (i.e., pixels in this study) and considering each pixel individually. When applied to images, such feature attribution at the pixel level generates saliency maps, which are arguably the most popular method for XAI in the image domain. The connection between feature attribution and pixel selection by Bayesian Teaching opens up the possibility to reinterpret all feature attribution methods (e.g., [51, 52]) as a form of teaching. By treating images and saliency maps as explanatory examples at different levels of granularity, we discover that the two explanations show complementary effects. Namely, example images are effective explanations for confirming the model's correct classification of unfamiliar categories, and saliency maps are effective explanations for exposing the model's incorrect classification of familiar categories.

The lack of a coherent theory is currently stifling XAI, as methods are developed around technical innovations without any a priori hypothesis as to whether they are appropriate for the specific use case [53]. Bayesian Teaching both exposes this blind spot and offers a solution: effective explanation is a communication act which depends on a knowledgeable teacher, a good model of the learner, and an awareness of the context in which inference takes place. Consequently, the framework encourages systematic evaluation of XAI interventions on these dimensions, and provides a way to systematically diagnose how interventions could be improved. In our study we show how such an evaluation applies to explanation-by-examples. We modeled both the explainer and explainee by a ResNet-50 architecture, focused on two contextual variables (familiarity and model correctness), and surfaced human inductive biases that are typically overlooked in XAI. Since the function of explanation is to shape the explainee's inductive reasoning [54], we expect that different inductive biases call for different kinds of explanation. Our work confirms that different inductive biases can be mitigated by different explanation methods under different contexts, indicating a fruitful avenue for further XAI research. Furthermore, Bayesian Teaching exemplifies how XAI can be improved by considering links to other fields such as education and cognitive science. A balanced synergy between the social sciences and the more technical literature of AI is much needed, as XAI is simultaneously a machine-learning problem and a human-centered endeavor.
Methods
The objective of this study was to explore the effects of explanations, in the form of examples and saliency maps, on users' understanding of high-performing machine learning models (referred to as AI throughout the paper) in the domain of image classification. We probe users' understanding by a two-alternative-forced-choice (2AFC) task in which users are asked to predict the model's classification of a target image into one of two categories. Experimental conditions vary in terms of the information presented on the screen during each classification. The information presented differs along three dimensions: types of labels, types of examples, and types of saliency maps. All the examples and saliency maps are generated by the Bayesian Teaching framework. The predictive performance of the participant is captured by sensitivity, specificity, and accuracy.
The model to be explained
The machine learning model to be explained is a ResNet-50 model [38]. For this study, we used the pre-trained version of ResNet-50 in Keras with ImageNet weights. For the selection of saliency maps, the Bayesian Teaching framework expects the model to be able to make probabilistic inference on the image classification task presented in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). The ResNet-50 model has this capability, and we can use the ResNet-50 model without any modification. However, for the selection of examples, the Bayesian Teaching framework expects the model to be able to make probabilistic inference on the 2AFC task, and the ResNet-50 model is deterministic. We replace the fully-connected classification layer of the ResNet-50 model with a probabilistic linear discriminant analysis (PLDA) model [55]. This new PLDA layer is trained using a transfer-learning-like procedure. Training images were first passed through the ResNet-50 model and transformed into feature vectors. Then, the PLDA layer was fit to these feature vectors and the corresponding class labels following the algorithm presented in [55]. Using the training dataset ImageNet 1K from the ILSVRC2012 [44], this ResNet-50-PLDA model has a top-1 accuracy of 52.86% and a top-5 accuracy of 76.29%. For the actual experiment, we focused on a subset of 100 categories that include the most difficult, easiest, and most confusable categories (see the next subsection for details). Unless otherwise stated, all the model predictions used to design the experiment are based on the ResNet-50-PLDA model trained on the training data in only these 100 categories.
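The following sketch illustrates the transfer-learning-like procedure described above: extract ResNet-50 features for the training images, then fit a PLDA model to those features. It is a schematic reconstruction under stated assumptions; the `plda` import stands in for an implementation of Ioffe's PLDA algorithm and is not part of Keras, and the data-loading details are hypothetical.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
import plda  # assumed third-party implementation of Ioffe's PLDA [55]

# Headless ResNet-50: global-average-pooled features instead of the softmax layer.
feature_extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: array of shape (n, 224, 224, 3), RGB, values in 0-255."""
    return feature_extractor.predict(preprocess_input(images.astype("float32")))

# Hypothetical placeholder training data for the 100 selected categories.
train_images = np.random.randint(0, 255, size=(8, 224, 224, 3))
train_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])

features = extract_features(train_images)   # shape (n, 2048)

# Fit the PLDA layer on the feature vectors and class labels; PLDA learns the
# shift vector m, transformation A, and prior variance Psi used in Equation 3.
classifier = plda.Classifier()
classifier.fit_model(features, train_labels)  # assumed API, not Keras
```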
Stimuli selection
Each experiment consisted of 150 trials. For 50 of the trials, the predictions of the model (or the robot) matched the ground-truth labels of the target images. For the remaining 100, the model predictions did not match the ground-truth labels. We selected the target images and the classification categories based on the model's confusion matrix, with the aim to cover a wide range of model behavior. First, we calculated the ResNet-50-PLDA model's confusion matrix on ImageNet 1K, which contains 1000 categories. Then, we randomly selected 25 categories from each of the following four subsets: the 100 categories on which the model was most accurate, the 100 categories that were most confusable with these most accurate categories, the 100 categories on which the model was least accurate, and the 100 categories that were most confusable with these least accurate categories. This resulted in 100 categories. We recorded the model's predicted labels for all the training images in these 100 categories and marked all images for which the model predictions were also among these 100 categories.

From this subset, where both the image and the top model prediction belonged to our 100 categories, we randomly sampled 50 images where the model prediction matched the ground-truth labels and 100 images for which the model predictions did not match the ground-truth labels. For the 50 trials with correctly classified target images, the two classification options participants could choose from were the correct model-predicted category and one of the two most confusable categories (out of our 100 selected categories). Which one of the two most confusable categories was presented was selected randomly for each trial. For the 100 incorrectly classified trials the two classification options were simply the ground-truth category and the incorrect model prediction. This procedure resulted in a total of 83 unique categories used in the experiment (Supplementary Table T1). This number is smaller than 100 because not all confusable categories are unique and not all categories were kept during the random sampling. Figure 6 depicts the trial-generating process. The pairs of categories used in the experiments are listed in Supplementary Table T2.
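A minimal sketch of the category-selection step described above, assuming a precomputed 1000×1000 confusion matrix whose rows are true categories and columns are predicted categories. The function names are illustrative, not the authors' code, and "most confusable" is operationalized here in one simple way (each class's largest off-diagonal entry), which may differ from the exact criterion used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_categories(confusion, n_extreme=100, n_sample=25):
    """confusion[i, j]: count of images of true class i predicted as class j."""
    per_class_acc = np.diag(confusion) / confusion.sum(axis=1)

    easiest = np.argsort(per_class_acc)[-n_extreme:]   # most accurate classes
    hardest = np.argsort(per_class_acc)[:n_extreme]    # least accurate classes

    def most_confusable(classes):
        # For each class, the off-diagonal class it is most often confused with.
        offdiag = confusion.astype(float).copy()
        np.fill_diagonal(offdiag, 0)
        return np.unique(offdiag[classes].argmax(axis=1))[:n_extreme]

    pools = [easiest, most_confusable(easiest), hardest, most_confusable(hardest)]
    chosen = np.concatenate([
        rng.choice(pool, size=min(n_sample, len(pool)), replace=False)
        for pool in pools
    ])
    return np.unique(chosen)

# Example with a random placeholder confusion matrix over 1000 categories.
confusion = (rng.integers(0, 50, size=(1000, 1000))
             + np.diag(rng.integers(50, 500, 1000)))
categories = select_categories(confusion)
print(len(categories), "categories selected")
```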
Experimental design
At the beginning of the experiment, participants were told that a robot has been trained to classify images but sometimes makes mistakes. They were asked to help by guessing how the robot will classify images. On each trial, a target image was displayed along with information about two categories, and the participants were asked to perform the 2AFC task by choosing which of the two categories they think the robot would classify the target image as.

The experimental conditions determined what information was presented during each trial and varied three dimensions: labels, examples, and saliency maps. Figure 1 shows a trial in the experimental condition with all the elements (labels, examples, and saliency maps) and describes how the conditions impact what elements are presented. More precisely, the conditions are characterized by five binary features: informative or generic labels, with or without examples, helpful or random examples (if present), with or without saliency maps, and blur or jet saliency maps (if present). The structured column and row labels of Table 1 show the naming conventions for the different conditions in terms of these features. Below, we provide more details on the conditions.
Specific or generic labels:
Conditions with informative or generic labels are referred to as [SPECIFIC LABELS] and [GENERIC LABELS], respectively. In the [SPECIFIC LABELS] conditions, the English labels of the two categories (e.g., "Flagpole" and "Barn" in Figure 1) are given. In the [GENERIC LABELS] conditions, the two categories are named "Category A" and "Category B."
With or without examples:
Conditions with and without examples are referred to as [EXAMPLES] and [NO EXAMPLES], respectively. In the [EXAMPLES] conditions, two examples are sampled from each of the two categories to represent the category. Thus, five images (one target image and four example images) are on display in each trial in these conditions. In the [NO EXAMPLES] conditions, only the target image is shown.

Figure 6: Flowchart of trial generation. The two panels of the flowchart show the model (ResNet-50-PLDA), its confusion matrix, the sampled "Easy" and "Hard" categories, the Bayesian Teaching step (random trials, learner model, teaching probability), and the generated trial (target image, examples, saliency maps). (A) Selection of examples and saliency maps with Bayesian Teaching. The inputs to Bayesian Teaching are: the model to be explained, data sets from two categories, and a target image that belongs to one of the two categories. Random image pairs are selected from each of the input categories. Along with the target image, two sets of image pairs, one set from each category, are selected at random to form a trial. The learner model, which is set to have the same architecture as the input model, takes in a large number of random trials to produce the simulated learner performance (unnormalized teaching probabilities according to Equation 2). Here, a trial with high performance (probability) is selected, exemplifying the trial generation process in the [HELPFUL] condition. Saliency maps are generated for the target image and the four selected examples using Equation 5. The final output is a set of ten images: a target image, two examples selected from each of the two input categories, and the saliency maps of the above five images. (B) Trial generation steps peripheral to Bayesian Teaching. Our model to be explained is a ResNet-50-PLDA trained on ImageNet 1K. A confusion matrix on the 1000 ImageNet categories was computed using the model. Using the confusion matrix, we sampled 25 categories where the model has high accuracy (the "Easy" categories), 25 categories where the model has low accuracy (the "Hard" categories), and the categories that are most confusable with the above 50 categories. To generate a trial, we select at random two categories from the 100 candidates mentioned above as well as a target image that belongs to one of the two selected categories. The model, the target image, and the data associated with the two categories are fed into Bayesian Teaching to produce a trial. See Methods for the full details.
Conditions with the helpful examples and random examples arereferred to as [
HELPFUL ] and [
RANDOM ], respectively. The selection of the examples are based on16he simulated learner performance , which is the numerator of the Bayesian Teaching probability, f L . The simulated learner performance characterizes the probability that the four examples willlead a “learner model” to classify the target image as the ResNet-50-PLDA model would. TheBayesian Teaching probability and its numerator f L are rigorously defined in Equations 1 and 2,respectively, in the subsection below called “Selection of examples with Bayesian Teaching.” Inthe [ HELPFUL ] conditions, the four examples are chosen such that f L > . . In the [ RANDOM ]conditions, the four examples are chosen such that the f L values across the 150 trials are uniformlydistributed over the five bins that evenly partition the [0,1] interval. With or without saliency maps:
Conditions with and without saliency maps are referred to as [MAP] and [NO MAP], respectively. A saliency map is an image mask that shows the contribution of each pixel to the model's classification decision. Details on the generation of the saliency maps are provided in the subsection below called "Selection of saliency maps with Bayesian Teaching." In the [MAP] conditions, a saliency map is shown for every image displayed. In the [NO MAP] conditions, no saliency map is shown.
Blur or jet saliency maps:
Conditions with the blur saliency maps and jet saliency maps are referred to as [BLUR] and [JET], respectively. The two types of map differ only in the rendering of the mask but not in the generation of the mask. The jet saliency map renders the importance of each pixel by colors following the jet color map. In order of decreasing importance, the jet color map goes from red to green to blue. The jet color map, overlaid on an image with some level of transparency, is one of the most commonly used renderings of saliency maps. Two disadvantages of jet saliency maps are that the colors of the map can interfere with the colors of the image and that the unimportant regions remain visible to the user and can attract involuntary visual attention. For these reasons, we created the [BLUR] conditions in which the saliency maps are rendered by blurring the image. Furthermore, blurring is a more naturalistic visual effect than any color map masking because our visual system constantly experiences a large difference in visual acuity between our fovea and peripheral vision. The implementation details of both renderings are provided in the subsection below on saliency map selection.
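The sketch below illustrates the two renderings in Matplotlib terms: a semi-transparent jet overlay and a blur rendering in which unimportant pixels are replaced by a blurred copy of the image. The jet colormap and the 0.4 alpha value are taken from the saliency-map subsection below; the mask-weighted blending rule for the blur rendering is an assumption made for illustration rather than the exact implementation.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

# Placeholder image and saliency mask, both 224x224; mask values lie in [0, 1],
# with 1 meaning "important to the model's decision".
image = np.random.rand(224, 224, 3)
mask = np.clip(np.random.rand(224, 224), 0, 1)

fig, (ax_jet, ax_blur) = plt.subplots(1, 2, figsize=(8, 4))

# Jet rendering: overlay the mask with the jet colormap at alpha = 0.4.
ax_jet.imshow(image)
ax_jet.imshow(mask, cmap="jet", alpha=0.4)
ax_jet.set_title("jet overlay")

# Blur rendering (assumed blending rule): keep important pixels sharp and
# replace unimportant pixels with a Gaussian-blurred copy of the image.
blurred = gaussian_filter(image, sigma=(8, 8, 0))
blend = mask[..., None] * image + (1 - mask[..., None]) * blurred
ax_blur.imshow(blend)
ax_blur.set_title("blur rendering")

plt.tight_layout()
plt.show()
```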
Naming convention:
As shown in Table 1, not all combinations of the five binary features are allowed. Conditions with generic labels and no examples are not tested because that would make the 2AFC task a game of pure guessing. Furthermore, conditions without examples cannot be paired with helpful or random examples, and conditions without saliency maps cannot be paired with blur or jet maps. This leaves a total of 15 experimental conditions.

The naming convention for the conditions is based on filter queries using the database structure presented in Table 1. To give a few examples: [HELPFUL] refers to the aggregate of the six conditions in columns 2 and 4; [MAP] refers to the aggregate of the 10 conditions in rows 2 and 3; [HELPFUL] & [BLUR] refers to the aggregate of the two conditions in row 2 column 2 and row 2 column 4; and [HELPFUL] & [BLUR] & [SPECIFIC LABELS] refers to the one condition in row 2 column 2.
Participants
The study protocol was approved by Rutgers University IRB. All research was performed in accordance with the approved study protocol. An IRB-approved consent page was displayed before the experiment. Informed consent was obtained from all participants. The experiment began after the participants gave consent.

656 participants (404 male, 249 female, 3 other) were recruited from Amazon Mechanical Turk and paid $2.50 for completing the experiment, which took roughly 15 minutes to complete. The mean age of participants was 34.8 years (SD = 10.1), ranging from 18 to 72 years. The participants were randomly assigned to each condition, with the aim to obtain 36-40 participants per condition. 25 participants were excluded from analysis for completing the experiment too quickly (less than one second per trial), resulting in a final sample of 631 participants completing 150 trials each. The [NO EXAMPLES] conditions received twice the sample size of the other conditions, so that they would match the sample size of the [EXAMPLES] conditions, which had two distinct versions ([HELPFUL] and [RANDOM]). Table 1 shows the number of participants in each of the 15 conditions.

All participants in the [HELPFUL] conditions experienced the same set of 150 trials, i.e., the same 150 combinations of target image, category pairs, and example images, but in randomized order. All participants in the [RANDOM] conditions experienced another set of 150 trials, also in randomized order. All the category pairs used are listed in Supplementary Table T2. Participants in the [NO EXAMPLES] conditions experienced one of these two sets of trials, selected at random. Note that because there are no examples but only English labels in the [NO EXAMPLES] conditions, the two sets of trials are functionally equivalent.
Selection of examples with Bayesian Teaching
The goal of Bayesian Teaching is to select small subsets of the training data such that the inferencemade by a learner model using this small subset will be similar to the inference made by a targetmodel using the entire training data. For this study, the target model is the ResNet-50-PLDAmodel trained on the 100 selected categories as described earlier. The inference task is to classifythe target image among the 100 categories. The inference task of the learner model is the 2AFCimage classification task presented in each trial. For the learner model, we search for an ideal-observer model [49, 50] that would capture the participant’s inference in the 2AFC task. A goodcandidate is the ResNet-50-PLDA because it is trained on human-labeled data and achieves highaccuracy on predicting humans’ labelling behavior. This means that the target model and learnermodel share the same parameters (the ResNet-50 weights and PLDA parameters mentioned afterEquation 3), and the use of Bayesian Teaching is focused on explaining the image classificationinference based on roughly 100K training examples, i.e., all the training data in the 100 selectedcategories, with only four training examples, i.e., those selected to be displayed on each trial of theexperiments in the [
EXAMPLES ] conditions.We introduce some notation to define the Bayesian Teaching probability formally. The twocategories that define the 2AFC task in each trial consist of the predicted category of the ResNet-50-PLDA model and an alternative category, which we denote by y ∗ and y , respectively. The twoexamples sampled from the model-predicted category are denoted by τ y ∗ , and the two sampledfrom the alternative category are denoted by τ y . Let the learner model be denoted by f L and thetarget image be denoted by d ∗ . The Bayesian Teaching probability, P T , is defined as the probabilitythat the selected examples, τ y ∗ and τ y , will lead the learner model to classify the target image asthe target model would. Mathematically, this probability can be expressed using Bayes’ rule as: P T ( τ y ∗ , τ y | y ∗ , d ∗ ) = f L ( y ∗ | τ y ∗ , τ y , d ∗ ) (cid:80) ( τ y ∗ ,τ y ) (cid:48) ∈ Ω f L ( y ∗ | ( τ y ∗ , τ y ) (cid:48) , d ∗ ) . (1)18he sum in the denominator is over all possible candidate sets of the four examples. The set ofall candidate sets is denoted by Ω . Equation 1 assumes a uniform prior over Ω so that the priorterms in the numerator and denominator cancel out. Technically, Ω is the Cartesian product of allpossible pairings of images in the category y ∗ with all possible pairings of images in the category y , which is on the order of for the dataset in use. Our goal here is to select τ y ∗ and τ y suchthat f L , the numerator of Equation 1, would provide good coverage of the full range of [0,1]. Thiswould ensure the existence of valid examples for both the [ RANDOM ] and [
We found that the full range can usually be covered by forming a Cartesian product of 1000 random pairings from each category (10^6 combinations). In general, given a target value of f_L, one could use a genetic algorithm [56] or other discrete optimization methods to select the examples. To sample in proportion to P_T, one could use Markov chain Monte Carlo and variational inference techniques [14, 57, 58]. These optimization and advanced inference methods would also be more efficient when more than a few examples per category are desired.

Using Bayes' rule again, we express the learner model's inference of the target image's label given the target image and examples, the numerator in Equation 1, as

\[
f_L(y^* \mid \tau_{y^*}, \tau_y, d^*) = \frac{f(d^* \mid \tau_{y^*})}{f(d^* \mid \tau_{y^*}) + f(d^* \mid \tau_y)}, \tag{2}
\]

where f(d* | τ_k) is the probability that the target image, d*, belongs to the category from which the two example images, τ_k, are sampled. Under the PLDA model, one can write this probability in closed form as a normal distribution [21]:

\[
f(d^* \mid \tau_k) = \mathcal{N}\!\left(u^* \,\middle|\, \frac{\Psi}{2\Psi + I}\,(u_{k,1} + u_{k,2}),\; \frac{\Psi}{2\Psi + I} + I\right). \tag{3}
\]

Here, u is an image transformed in two steps. First, the image is passed through ResNet-50 and transformed into a feature vector; then, this feature vector undergoes an affine transformation with shift vector m and rotation and scaling matrix A to become u. Thus, in Equation 3, u* is the transformed target image, and (u_{k,1}, u_{k,2}) are the pair of transformed examples sampled from category k. The quantities m and A in the second transformation and the Ψ in Equation 3 are parameters of the PLDA model obtained by training on the images in the 100 selected categories. The precise definitions of these parameters and the training procedure are presented in Figure 2 of Ioffe's PLDA paper [55].

To summarize this subsection, Equation 1 defines the Bayesian Teaching probability, and Equation 2 defines its numerator (simulated learner performance), f_L, used to select examples in the [EXAMPLES] conditions. A high f_L means that the selected examples will lead the model of the explainee (or learner) to classify the target image, with high probability, as the category predicted by the model to be explained. Conversely, a low f_L means that the selected examples will lead the learner model to classify the target image, with high probability, as the other category in the 2AFC. Note that f_L is trial specific, as this probability is a function of the target image, d*, the model-predicted label of the target image, y*, and the four examples, (τ_{y*}, τ_y), which precisely define a trial.
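As a concrete illustration of Equations 2 and 3, the sketch below computes f_L from PLDA-transformed feature vectors. It assumes, as in Ioffe's parameterization, that the affine transformation by A has been applied so that Ψ is diagonal and the within-class covariance is the identity; the feature vectors and the Ψ values passed in are placeholders rather than the trained parameters.

```python
import numpy as np
from scipy.stats import norm

def plda_pair_logpdf(u_star, u1, u2, psi):
    """Eq. 3 with diagonal Psi: log N(u* | (Psi/(2Psi+I))(u1+u2), Psi/(2Psi+I)+I),
    evaluated dimension-wise in the PLDA-transformed feature space."""
    shrink = psi / (2.0 * psi + 1.0)
    return norm.logpdf(u_star, loc=shrink * (u1 + u2),
                       scale=np.sqrt(shrink + 1.0)).sum()

def simulated_learner_performance(pair_star, pair_alt, u_star, psi):
    """Eq. 2: probability that the learner model classifies the target as y*
    rather than the alternative category, given two examples of each."""
    log_a = plda_pair_logpdf(u_star, *pair_star, psi)
    log_b = plda_pair_logpdf(u_star, *pair_alt, psi)
    m = max(log_a, log_b)                 # subtract the max for numerical stability
    a, b = np.exp(log_a - m), np.exp(log_b - m)
    return a / (a + b)
```

When plugged into the earlier selection sketch, the ψ argument would be bound beforehand (e.g., with functools.partial) so that the callable matches the assumed signature.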
Selection of saliency maps with Bayesian Teaching

A saliency map is an image mask that shows how important each pixel of the image is to the model's inference. In the [MAP] conditions, we generate a saliency map for every image displayed. To generate a saliency map, one needs to specify a model, an inference task, and a definition of importance. We used ResNet-50 as the model and the classification of an image into the 1000 categories in ImageNet 1K as the inference task. Using the Bayesian Teaching framework, we define importance to be the probability that a mask, m, will lead the model to predict the image, d, to be in category, y, when the mask is applied to the image. This is expressed by Bayes' rule as

\[
Q_T(m \mid y, d) = \frac{g_L(y \mid d, m)\, p(m)}{\int_{\Omega_M} g_L(y \mid d, m)\, p(m)\, dm}. \tag{4}
\]

Here, g_L(y | d, m) is the probability that the ResNet-50 model will predict d masked by m to be y; p(m) is the prior probability of m; and Ω_M = [0,1]^{W×H} is the space of all possible masks on an image with W × H pixels. The prior probability distribution p(m) on m is a sigmoid-squashed Gaussian-process prior.

Instead of sampling the saliency maps directly from Equation 4, we find the expected saliency map for each image by Monte Carlo integration:

\[
\mathbb{E}[M \mid y, d] = \int_{\Omega_M} m\, Q_T(m \mid y, d)\, dm \;\approx\; \frac{\sum_{i=1}^{N} m_i\, g_L(y \mid d, m_i)}{\sum_{i=1}^{N} g_L(y \mid d, m_i)}, \tag{5}
\]

where the m_i are samples from the prior distribution p(m), and N = 1000 is the number of Monte Carlo samples used. To see why an expected map is desirable, imagine the following case. Suppose that an image contains 7 goldfish and its category is "goldfish." In this case, a mask that reveals any one of the goldfish will have a high Q_T value. However, it is more desirable that the mask reveal all the goldfish in the image. The expectation provides this by averaging the masks appropriately weighted by their Q_T values.

Now, we describe the step-by-step procedure for generating the saliency map for an image, d. First, d is resized to 224-by-224 pixels, which is the size displayed in the experiments (Figure 1). A set of 1000 2D functions is sampled from a 2D Gaussian process (GP) with a fixed overall variance, a constant negative mean, and a radial-basis-function kernel with a length scale of 22.4 pixels in both dimensions. The sampled functions are evaluated on a 224-by-224 grid, and the function values mostly fall within a limited range. A sigmoid function, 1/(1 + exp(−x)), is applied to the sampled functions to transform each of the function values, x, to be within the range [0, 1]. This results in 1000 masks. The mean of the GP controls how many effective zeros there are in a mask, and the variance of the GP determines how fast neighboring pixel values in the mask change from zero to one. The 1000 masks are the m_i's in Equation 5. We produce 1000 masked images by element-wise multiplying the image d with each of the masks. The term g_L(y | d, m_i) is ResNet-50's predictive probability that the i-th masked image is in category y. Having obtained these predictive probabilities from ResNet-50, we average the 1000 masks according to Equation 5 to produce the saliency map of image d. If d is a target image, the y used to generate the saliency map is the ResNet-50-PLDA model's prediction. If d is an example, the y is the category from which the example is sampled.

In the [JET] conditions, the saliency maps are rendered with the Matplotlib package using the "jet" colormap and an alpha value of 0.4 and overlaid on the images (see Figure 1; images in the bottom row).
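The sketch below illustrates the two stages just described: sampling sigmoid-squashed GP masks and forming the Monte Carlo expectation of Equation 5. The GP mean and variance arguments are placeholders (their exact values are not given above), the separable Cholesky trick (valid for an RBF kernel with equal length scales) is an implementation convenience of this sketch, and `predict_prob` stands in for a wrapper around ResNet-50 that returns the predictive probability of a category for an H×W×3 image.

```python
import numpy as np

def sample_gp_masks(n_masks, size, length_scale, gp_mean, gp_var, rng):
    """Sample smooth functions from a 2D GP with an RBF kernel (equal length
    scales in both dimensions) and squash them through a sigmoid to obtain
    masks in (0, 1). Because this kernel factorizes over the two axes, a 2D
    sample can be drawn as L @ Z @ L.T with a 1D Cholesky factor L."""
    grid = np.arange(size, dtype=float)
    k1 = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2.0 * length_scale ** 2))
    L = np.linalg.cholesky(k1 + 1e-6 * np.eye(size))
    masks = np.empty((n_masks, size, size))
    for i in range(n_masks):
        f = gp_mean + np.sqrt(gp_var) * (L @ rng.standard_normal((size, size)) @ L.T)
        masks[i] = 1.0 / (1.0 + np.exp(-f))      # sigmoid squashing to (0, 1)
    return masks

def expected_saliency_map(image, masks, predict_prob, category):
    """Eq. 5: weighted average of the masks, each weighted by the model's
    predictive probability g_L(y | d, m_i) for the corresponding masked image."""
    weights = np.array([predict_prob(image * m[..., None], category) for m in masks])
    return np.tensordot(weights, masks, axes=1) / weights.sum()
```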
In the [BLUR] conditions, a saliency map is rendered by blurring the image for which it is generated (Figure 1; images in the middle row). To generate the blur, each pixel value of a saliency map, z, is assigned a blurring window width, w(z) = ceil(30 / (1 + exp(20z − c))), where c is a fixed offset inside the sigmoid. The j-th pixel value of the rendered saliency map is the average pixel value of a patch of the original image, where the patch is w-by-w in size and centered on the j-th pixel of the original image. If the j-th pixel is close to an edge of the image, the patch becomes rectangular, and the average is over whichever pixel values fall inside the w-by-w window.

To conclude this subsection, we make a few final remarks. First, a PLDA layer is unnecessary in the generation of saliency maps because the ResNet-50 model is capable of generating the probabilities g_L(y | d, m) in Equation 4. In contrast, the ResNet-50 model cannot be used directly to generate the probabilities f_L(y* | τ_{y*}, τ_y, d*) in Equation 1. Second, while the 2AFC task may be suitable for generating a saliency map for the target image, it cannot be used to generate saliency maps for the examples. This is the main reason that here we used the image classification task that the ResNet-50 model is trained on as the inference task. Lastly, Equation 5 is the same as Equation 5 in the RISE approach introduced by Petsiuk, Das, and Saenko [48], which presents a state-of-the-art method for generating saliency maps. Our implementation and theirs differ only in the way the individual masks are sampled. In our implementation, we sampled functions from a GP prior and turned them into masks by applying a sigmoid function. In [48], random binary matrices are first sampled and subsequently up-sampled to the desired mask size through bilinear interpolation. The expectation is computed in the same way.
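The following minimal sketch shows the [BLUR] rendering just described. The `offset` argument stands in for the sigmoid offset c, whose value is not specified above, and the image is assumed to be an H×W×3 array.

```python
import numpy as np

def blur_render(image, saliency, offset):
    """Sketch of the [BLUR] rendering: each output pixel is the local average of
    the original image over a w-by-w window centered on that pixel, where
    w(z) = ceil(30 / (1 + exp(20*z - offset))) shrinks as the saliency z grows."""
    h, w_img = saliency.shape
    widths = np.ceil(30.0 / (1.0 + np.exp(20.0 * saliency - offset))).astype(int)
    out = np.empty_like(image, dtype=float)
    for i in range(h):
        for j in range(w_img):
            w = max(int(widths[i, j]), 1)
            r0, r1 = max(i - w // 2, 0), min(i + w // 2 + 1, h)
            c0, c1 = max(j - w // 2, 0), min(j + w // 2 + 1, w_img)
            # Near the image edges the window is clipped, so the patch becomes
            # rectangular, as described in the text above.
            out[i, j] = image[r0:r1, c0:c1].mean(axis=(0, 1))
    return out
```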
Familiarity coding

In addition to the splits by conditions presented in Table 1, the analyses also rely on scores of human familiarity with the image categories. The familiarity of each of the [HELPFUL] and [RANDOM] trials was manually coded by 7 raters. Each rater was asked to code a trial as "familiar" if they thought they could correctly match the category labels to the images presented in that trial, and "unfamiliar" otherwise. A familiarity score for each pairing of categories was then constructed by coding each rater's judgement as 1 for familiar and 0 for unfamiliar, and computing the mean across raters. The 300 trials across the [HELPFUL] and [RANDOM] conditions resulted in 167 unique category pairings (counting the ordering of target versus other category), and their familiarity scores are presented in Supplementary Table T2.
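For concreteness, the familiarity score of a category pairing is simply the mean of the seven binary codes. A minimal sketch with hypothetical category names and rater codes:

```python
import numpy as np

# Hypothetical rater codes for two category pairings (1 = familiar, 0 = unfamiliar);
# the familiarity score is the mean across the seven raters.
ratings = {
    ("tabby cat", "golden retriever"): [1, 1, 1, 1, 1, 1, 1],
    ("tench", "coho salmon"): [0, 1, 0, 0, 1, 0, 0],
}
familiarity = {pair: float(np.mean(codes)) for pair, codes in ratings.items()}
```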
Statistical analysis
Whenever we report testing how well participants predict the model classifications (performance), or how often their judgements correspond to the image ground truth (accuracy), we used hierarchical logistic regressions with random intercepts per participant and fixed effects for the remaining terms. For the sensitivity and specificity analyses, we still used a logistic regression framework but only included trials corresponding to true positives and false negatives, or true negatives and false positives, respectively. Sensitivity captures how well participants predict trials where the AI is correct, and specificity captures how well participants predict trials where the model is wrong. To illustrate, the analyses in the Bayesian Teaching improves predictive performance section used the following equations on the full set of trials, and on subsets of the trials to capture sensitivity and specificity, respectively:
\[
\Pr(\mathrm{Performance}_i = 1) = \operatorname{logit}^{-1}\!\bigl(\alpha_{j[i]} + \beta\,\mathrm{ExplanationCondition}_i + \varepsilon_i\bigr), \quad \text{for } i = 1, \dots, I,
\]
\[
\operatorname{logit}^{-1}(x) = \frac{\exp(x)}{1 + \exp(x)},
\]
\[
\alpha_j \sim \mathcal{N}(U_j, \sigma_\alpha), \quad \text{for } j = 1, \dots, J,
\]

where performance is a binary variable coded as 1 when the participant correctly predicts the AI classification and 0 otherwise, i is the observation index, and j is the participant index. ExplanationCondition is a binary dummy variable coded as 1 if participants experienced heatmaps and helpful examples and 0 if they did not experience any explanations.

For the Participants prefer helpful examples section, we compared three hierarchical logistic models: (A) an intercept-only model that treated intercepts as nested within participants; (B) an intercept-only model that treated intercepts as nested within participants and conditions; and (C) model (B) with an added fixed effect for the familiarity score. We then compared the negative log-likelihoods of these models to determine which best accounted for the observed data.
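The sketch below illustrates the structure of this random-intercept logistic model with simulated data; it is not the lme4 fit itself, and the coefficient values are placeholders rather than fitted estimates.

```python
import numpy as np

def inv_logit(x):
    """The inverse-logit link used throughout: exp(x) / (1 + exp(x))."""
    return np.exp(x) / (1.0 + np.exp(x))

# Placeholder values, not fitted estimates: 50 simulated participants with
# random intercepts alpha_j and one fixed effect for ExplanationCondition.
rng = np.random.default_rng(0)
n_participants, n_obs = 50, 1000
beta, sigma_alpha = 0.8, 0.5
alpha = rng.normal(0.0, sigma_alpha, size=n_participants)  # alpha_j with U_j set to 0
j = rng.integers(0, n_participants, size=n_obs)             # participant index j[i]
condition = rng.integers(0, 2, size=n_obs)                  # ExplanationCondition_i
p = inv_logit(alpha[j] + beta * condition)                  # Pr(Performance_i = 1)
performance = rng.binomial(1, p)                            # simulated binary outcome
```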
We evaluated whether Bayesian Teaching can lead participants to both correct and incorrect inference by predicting performance in the conditions containing examples, fitting three nested models:

\[
\Pr(\mathrm{Performance}_i = 1) = \operatorname{logit}^{-1}\!\bigl(\alpha_{j[i]} + \beta_1\,\mathrm{ResNetAccuracy}_i + \varepsilon_i\bigr),
\]
\[
\Pr(\mathrm{Performance}_i = 1) = \operatorname{logit}^{-1}\!\bigl(\cdots + \beta_2\,\mathrm{SimLearnerPerformance}_i + \varepsilon_i\bigr),
\]
\[
\Pr(\mathrm{Performance}_i = 1) = \operatorname{logit}^{-1}\!\bigl(\cdots + \beta_3\,\mathrm{ModelCorrectness}_i + \beta_4\,\mathrm{ModelCorrectness}_i \times \mathrm{ResNetAccuracy}_i + \beta_5\,\mathrm{ModelCorrectness}_i \times \mathrm{SimLearnerPerformance}_i + \varepsilon_i\bigr),
\]

each for i = 1, …, I, where SimLearnerPerformance is the expected probability that the participant picks the same response as the target model, conditional on seeing the examples, and ModelCorrectness is a dummy variable coding whether ResNet made a correct classification on that particular trial. We then compared the negative log-likelihoods of these three models and reported the coefficients of the best-fitting model (the interaction model).

In the Bayesian Teaching improves performance through belief-mitigation section, we fitted four hierarchical logistic regression models to the full data. These models shared the following form:
\[
\Pr(Y_i = 1) = \operatorname{logit}^{-1}\!\bigl(\alpha_{j[i]} + \beta_1\,\mathrm{Familiarity}_i + \beta_2\,\mathrm{ResNetAccuracy}_i + \beta_3\,\mathrm{Examples}_i + \beta_4\,\mathrm{MAP}_i + \beta_5\,\mathrm{Labels}_i + \varepsilon_i\bigr), \quad \text{for } i = 1, \dots, I, \tag{6}
\]

where Familiarity is the proportion of raters who rated the trial categories as familiar, ResNetAccuracy is the average ResNet accuracy for the target category, and Examples, MAP, and Labels are dummy variables that capture whether examples were shown, whether heatmaps were shown, and whether category labels were informative, respectively.

These four models were distinguished by whether the AI was correct or not and by whether Y corresponded to the participant judgement matching the ground truth or matching the AI's judgement. We fitted similar models to the [EXAMPLES] trials only, with the only difference being that the Examples term, which previously had captured whether examples were present, was replaced with a dummy variable that captured whether the examples presented were helpful or not. Finally, we fitted two more models predicting participant performance from the full data. These are similar to Equation 6, but add two additional interaction terms:
\[
\Pr(Y_i = 1) = \operatorname{logit}^{-1}\!\bigl(\cdots + \beta_6\,\mathrm{MAP}_i \times \mathrm{Familiarity}_i + \beta_7\,\mathrm{Examples}_i \times \mathrm{Familiarity}_i + \varepsilon_i\bigr), \quad \text{for } i = 1, \dots, I.
\]

Coefficient tables for these models can be found in Supplementary Tables T3. All hierarchical logistic regression models were fitted using the lme4 package (version 1.1-23) [59] in R version 4.0.3.
Data availability
Raw experimental data and analysis code will be deposited at TBA upon publication.
Acknowledgments
This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-17-2-0146 to P.S. and S.Y. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. This work was also supported by DoD grant 72531RTREP, NSF SMA-1640816, and NSF MRI-1828528 to P.S. The methods described herein are covered under Provisional Application No. 62/774,976.
Author contributions
S.C.-H.Y., W.K.V., R.B.S., and P.S. conceived and designed the experiments. W.K.V. and S.C.-H.Y. conducted the experiments. R.B.S. and S.C.-H.Y. prepared materials for the experiments. T.F., W.K.V., and S.C.-H.Y. analyzed the data. S.C.-H.Y., T.F., W.K.V., R.B.S., and P.S. wrote the paper.
Competing interests
The authors declare no competing interests.
References

[1] Finale Doshi-Velez, Mason Kortz, Ryan Budish, Chris Bavitz, Sam Gershman, David O'Brien, Kate Scott, Stuart Schieber, James Waldo, David Weinberger, et al. Accountability of AI under the law: The role of explanation. arXiv preprint arXiv:1711.01134, 2017.
[2] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[3] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
[4] European Commission. 2018 reform of EU data protection rules. https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-changes_en.pdf, 2018.
[5] Diane Coyle and Adrian Weller. "Explaining" machine learning reveals policy challenges. Science, 368(6498):1433–1434, 2020.
[6] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[7] John Stuart Mill. A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation. Longmans, Green, and Company, 1889.
[8] Paul Bloom. How Children Learn the Meanings of Words. MIT Press, 2002.
[9] Fei Xu and Joshua B Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2):245, 2007.
[10] Brenden M Lake and Steven T Piantadosi. People infer recursive visual concepts from just a few examples. Computational Brain & Behavior, 3(1):54–65, 2020.
[11] Michelene TH Chi, Miriam Bassok, Matthew W Lewis, Peter Reimann, and Robert Glaser. Self-explanations: How students study and use examples in learning to solve problems. Cognitive Science, 13(2):145–182, 1989.
[12] Vincent AWMM Aleven. Teaching Case-Based Argumentation Through a Model and Examples. Citeseer, 1997.
[13] Liz Bills, Tommy Dreyfus, John Mason, Pessia Tsamir, Anne Watson, and Orit Zaslavsky. Exemplification in mathematics education. In Proceedings of the 30th Conference of the International Group for the Psychology of Mathematics Education, volume 1, pages 126–154. ERIC, 2006.
[14] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pages 882–891, 2018.
[15] Baxter S Eaves Jr, April M Schweinhart, and Patrick Shafto. Tractable Bayesian teaching. In Big Data in Cognitive Science, pages 74–99. Psychology Press, 2016.
[16] Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems, pages 3027–3035, 2016.
[17] Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. Generating counterfactual explanations with natural language. arXiv preprint arXiv:1806.09809, 2018.
[18] Atsushi Kanehira and Tatsuya Harada. Learning to explain with complemental examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8603–8611, 2019.
[19] Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Advances in Neural Information Processing Systems, pages 1952–1960, 2014.
[20] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
[21] Wai Keen Vong, Ravi B. Sojitra, Anderson Reyes, Scott Cheng-Hsin Yang, and Patrick Shafto. Bayesian teaching of image categories. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, 2018.
[22] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
[23] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR.org, 2017.
[24] Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
[25] Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, and Pradeep K Ravikumar. Representer point selection for explaining deep neural networks. In Advances in Neural Information Processing Systems, pages 9291–9301, 2018.
[26] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. arXiv preprint arXiv:1904.07451, 2019.
[27] Rich Caruana, Hooshang Kangarloo, JD Dionisio, Usha Sinha, and David Johnson. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium, page 212. American Medical Informatics Association, 1999.
[28] Mark T Keane and Eoin M Kenny. How case-based reasoning explains neural networks: A theoretical analysis of XAI using post-hoc explanation-by-example from a survey of ANN-CBR twin-systems. In International Conference on Case-Based Reasoning, pages 155–171. Springer, 2019.
[29] Scott Cheng-Hsin Yang and Patrick Shafto. Explainable artificial intelligence via Bayesian teaching. NIPS 2017 Workshop on Teaching Machines, Robots, and Humans, 2017.
[30] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 2018.
[31] Lee S Shulman. Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2):4–14, 1986.
[32] Helen L Chick. Teaching and learning by example. Mathematics: Essential Research, Essential Practice, 1:3–21, 2007.
[33] Patrick Shafto, Noah D. Goodman, and Thomas L. Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55–89, 2014.
[34] Baxter S Eaves Jr, Naomi H Feldman, Thomas L Griffiths, and Patrick Shafto. Infant-directed speech is consistent with teaching. Psychological Review, 123(6):758, 2016.
[35] Scott Cheng-Hsin Yang, Yue Yu, Arash Givchi, Pei Wang, Wai Keen Vong, and Patrick Shafto. Optimal cooperative inference. In International Conference on Artificial Intelligence and Statistics, pages 376–385, 2018.
[36] Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, and Yisong Yue. Teaching categories to human learners with visual explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3820–3828, 2018.
[37] Yuxin Chen, Oisin Mac Aodha, Shihan Su, Pietro Perona, and Yisong Yue. Near-optimal machine teaching via explanatory teaching sets. In International Conference on Artificial Intelligence and Statistics, pages 1970–1978, 2018.
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[39] Tor Tarantola, Dharshan Kumaran, Peter Dayan, and Benedetto De Martino. Prior preferences beneficially influence social and non-social learning. Nature Communications, 8(1):1–14, 2017.
[40] Shinsuke Suzuki, Emily LS Jensen, Peter Bossaerts, and John P O'Doherty. Behavioral contagion during learning about another agent's risk-preferences acts on the neural representation of decision-risk. Proceedings of the National Academy of Sciences, 113(14):3755–3760, 2016.
[41] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
[42] Branden J Bio, Taylor W Webb, and Michael SA Graziano. Projecting one's own spatial bias onto others during a theory-of-mind task. Proceedings of the National Academy of Sciences, 115(7):E1684–E1689, 2018.
[43] Ágnes Melinda Kovács, Ernő Téglás, and Ansgar Denis Endress. The social sense: Susceptibility to others' beliefs in human infants and adults. Science, 330(6012):1830–1834, 2010.
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[45] Russell W Clement and Joachim Krueger. The primacy of self-referent information in perceptions of social consensus. British Journal of Social Psychology, 39(2):279–299, 2000.
[46] Stefano Palminteri, Germain Lefebvre, Emma J Kilford, and Sarah-Jayne Blakemore. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing. PLoS Computational Biology, 13(8):e1005684, 2017.
[47] Doris Pischedda, Stefano Palminteri, and Giorgio Coricelli. The effect of counterfactual information on outcome value coding in medial prefrontal and cingulate cortex: From an absolute to a relative neural code. Journal of Neuroscience, 40(16):3268–3277, 2020.
[48] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of black-box models. arXiv preprint, 2018.
[49] Wilson S Geisler. Ideal observer analysis. The Visual Neurosciences, 10(7):12–12, 2003.
[50] Wilson S Geisler. Contributions of ideal observer theory to vision research. Vision Research, 51(7):771–781, 2011.
[51] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[52] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[53] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[54] Tania Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470, 2006.
[55] Sergey Ioffe. Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pages 531–542. Springer, 2006.
[56] Thomas Bäck. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
[57] Heikki Haario, Marko Laine, Antonietta Mira, and Eero Saksman. DRAM: Efficient adaptive MCMC. Statistics and Computing, 16(4):339–354, 2006.
[58] Dougal Maclaurin and Ryan Prescott Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[59] Douglas Bates, Deepayan Sarkar, Maintainer Douglas Bates, and L Matrix. The lme4 package. R package version, 2(1):74, 2007.

Supplementary information: Mitigating belief projection in explainable artificial intelligence via Bayesian Teaching

Supplementary Table T1
ImageNet categories used in the experiment. The 83 categories and their corresponding ResNet accuracy are given in this table. The accuracy scores are computed on the test set of ImageNet 1K over the 100 selected categories described in Methods. See separate csv file.
Supplementary Table T2

The table lists all 167 unique pairs of categories used in the experiment along with each pair's familiarity score. See separate csv file.
Supplementary Tables T3
Coefficient tables for the 15 regression models reported in the main text. See separate excel file.

Supplementary Discussion D1: Participants prefer helpful examples
Methods
To test the subjective preference for helpful versus unhelpful or random examples, we used a different task. In a trial of this task, we presented a target image, its category, and two sets of example pairs for that category. The participants were asked to select which pair they thought influenced the AI's classification more. Note that this experiment differs from the 2AFC experiment described previously in that there is only one category and the decision is between teaching sets. Helpful examples are the teaching examples for the target category in trials where f_L exceeds a high threshold; likewise, unhelpful examples are the examples for the target category in trials where f_L falls below a low threshold. On average, these examples are expected to be helpful or detrimental regardless of what the other category is; thus, they can be approximated as examples that aim to maximize or minimize the marginal teaching probability. We extracted 67 target images that have both helpful and unhelpful examples. Given a target image's category, random examples are simply random samples from the training images in ImageNet 1K that are not the target image or the helpful examples.

80 participants (25 male, 54 female, 1 other) were recruited from Amazon Mechanical Turk and paid $1.00 for completing the experiment, which took roughly 5 minutes. The participants were randomly assigned to the two conditions (helpful-vs-unhelpful and helpful-vs-random), with 40 in each condition. The mean age of participants was 36.7 years (SD = 10.5), ranging from 16 to 68 years. 6 participants were excluded from analysis for completing the experiment too quickly (less than one second per trial), resulting in a final sample of 74 participants.

The study protocol was approved by the Rutgers University IRB. All research was performed in accordance with the approved study protocol. An IRB-approved consent page was displayed before the experiment, and informed consent was obtained from all participants. The experiment began after the participants gave consent.
Results
We wanted to evaluate whether participants preferred informative examples to uninformative and misleading ones. To test this, we ran a second study where participants chose between helpful examples and random examples (n=37) or between helpful and unhelpful examples (n=37). The helpfulness of the examples is determined by Bayesian Teaching. The helpful examples are chosen to best represent the target category by maximizing the marginal teaching probability; random examples are uniformly sampled from the target category; and unhelpful examples are chosen to mislead the learner to infer any other category by minimizing the marginal teaching probability. The marginal teaching probability is the probability that a set of examples will lead the explainee model to infer the target category compared to any other category in a 2AFC task (see Methods for more details).

Participants showed a small but reliable preference for helpful relative to random examples (53.05% [95% CI = 51.08%–55.01%], z = 3.03, p = .002) and a substantial preference for helpful relative to unhelpful examples (64.14% [95% CI = 61.68%–66.59%], z = 10.95, p < .0001). These two conditions were reliably different (χ² test, p < .0001), implying that the Bayesian Teacher is not only capable of selecting helpful examples, but can also select examples that are actively confusing (see Figure D1-1). As this pattern of preferences matches the predictions stated in the introduction, a natural next step is to evaluate whether these preferences are particularly pronounced for unfamiliar examples, as hypothesised. We found that participants were more likely to prefer helpful examples when the choice categories were unfamiliar to them (β = -0.57(0.08), z = -7.02, p < .0001), irrespective of whether helpful examples were contrasted with random or unhelpful examples.

Figure D1-1: The popularity of helpful examples. (A) The probability that a participant chose helpful examples over random (37 participants; 2479 observations) or unhelpful examples (37 participants; 2479 observations), respectively. (B) The less familiar participants are with the stimulus categories, the more they prefer helpful examples. Familiarity ratings were continuous in the analyses reported in the main text, but are dichotomized here for visual clarity. Each transparent point represents the average probability across participants for one specific stimulus pair. Solid points represent the mean across all stimulus pairs. Error bars signify 95% bootstrapped confidence intervals.

Supplementary Discussion D2: Analysing MAP conditions separately

In the main text we combined the two
[MAP] conditions in our analyses. To show that this decision did not meaningfully impact our conclusions, we repeat the same analyses here with [JET] and [BLUR] as separate predictors. We ran hierarchical logistic regressions on the complete dataset predicting performance from the interventions ([SPECIFIC-LABELS] vs [GENERIC-LABELS], [BLUR] vs [JET] vs [NO MAP], and [EXAMPLES] vs [NO EXAMPLES]), while controlling for ResNet-50-PLDA accuracy and familiarity ratings.

[BLUR] improves performance when the AI is wrong (β = 0.43(0.03), z = 12.27, p < .0001), as does [JET] (β = 0.43(0.03), z = 12.34, p < .0001). However, the masks reduce performance (to a lesser extent) when the AI is correct, both for [BLUR] (β = -0.49(0.08), z = -6.06, p < .0001) and for [JET] (β = -0.62(0.08), z = -7.76, p < .0001); see Figure D2-1. In both cases, the saliency maps reduced the first-order accuracy of the participants ([BLUR] AI correct: β = -0.49(0.08), z = -6.05, p < .0001, AI wrong: β = -0.43(0.03), z = -12.27, p < .0001; [JET] AI correct: β = -0.62(0.08), z = -7.76, p < .0001, AI wrong: β = -0.43(0.03), z = -12.34, p < .0001). This reduction in first-order accuracy means that participants were less likely to believe that the AI judgements matched the ground truth of the image. This in turn implies that the saliency maps encourage participants to consider that the AI might be mistaken. This interpretation assumes that participants know the ground truth for most of the trials, which seems plausible given the familiarity ratings.

The familiarity ratings capture the ease of the discrimination task in that they are higher for trials involving categories that humans are familiar with. We can use these ratings to further explore whether participants project their own beliefs onto the AI. Specifically, if humans use their first-order classifications to model the AI, familiarity should positively correlate with predictive performance when the AI is correct, but negatively correlate with predictive performance when the AI is wrong. This is indeed what we find: participants are more likely to accurately predict AI classifications when they are familiar with the item categories and the AI is correct (β = 1.10(0.04), z = 29.28, p < .0001), but they are less likely to correctly predict AI errors (β = -0.92(0.02), z = -42.82, p < .0001). In other words, participants are more likely to assume that the AI gets it right on trials that they themselves find easy.

Previously we showed that saliency maps improved prediction accuracy on trials where the AI was wrong. We suggested that this might be explained by saliency maps helping participants distinguish between their first-order judgements of the ground truth and their predictions of the model classification. This can be evaluated directly by testing whether the impact of the familiarity ratings on classification accuracy is attenuated by the saliency maps (see Figure D2-1). In other words, if participants are more likely to predict that the AI is correct on trials that they themselves find easy, and the saliency maps work by helping people realise that the AI uses different decision processes, the saliency maps should make participants more willing to consider that the AI is wrong on trials they themselves find easy. This is what we find (see Figure D2-1). Specifically, the presence of [BLUR] maps reduces the positive impact of familiarity on second-order accuracy when the AI is correct (β = -0.61(0.09), z = -6.67, p < .0001), and the same is true for [JET] maps (β = -0.44(0.09), z = -4.87, p < .0001). Conversely, saliency maps reduce the negative impact of familiarity on second-order accuracy when the AI is wrong, for both [BLUR] (β = 0.74(0.05), z = 13.95, p < .0001) and [JET] (β = 0.67(0.05), z = 12.73, p < .0001). Collectively, these results suggest that the presence of saliency maps helps participants model the AI as an agent with distinct beliefs that may conflict with their own.

Figure D2-1: Masks improve human performance for identifying model errors, and reduce performance for identifying model hits, irrespective of how they are presented. All subplots are based on the entire data set, comparing the [JET], [BLUR], and [NO MAP] conditions (631 participants; 94,582 observations). (A) The saliency maps improve performance for trials when the AI is wrong but reduce performance when the AI is correct. (B) The saliency maps make people less likely to classify the target image to align with the ground truth, independent of AI accuracy. Together, A & B imply that the saliency maps help people to consider that the AI might make mistakes. (C) Saliency maps decrease the impact of familiarity on participant judgements. For model hits this leads to decreased performance, whereas for model errors it leads to improved performance. This pattern provides further evidence that the saliency maps work by shifting participants away from using their first-order judgments to model the AI classifications. Collectively these figures suggest that [JET] and [BLUR] have very similar impacts on participant judgements. Error bars represent 95% bootstrapped confidence intervals. All point estimates have confidence intervals, though some are too narrow to see clearly. Shaded areas represent analytic 95% confidence intervals.

Supplementary Figure F1: The relationship between ResNet accuracy and participant performance in the control condition
Focusing exclusively on the control trials, we see that ResNet accuracy is positively associated with human predictive performance when the AI is wrong (β = 0.81(0.11), z = 6.95, p < .0001), but even more so when the AI is correct (β = 0.92(0.23), z = 3.96, p < .0001).

Figure F1: (A) ResNet accuracy is positively associated with human performance during the control condition, both for trials when the AI is correct and when the AI is wrong. However, the base rate of human performance is much higher when the AI is correct. (B)