EUCA: A Practical Prototyping Framework towards End-User-Centered Explainable Artificial Intelligence
Weina Jin, Jianyu Fan, Diane Gromala, Philippe Pasquier, Ghassan Hamarneh, Simon Fraser University
[Figure 1 panels: the explanatory forms grouped into four categories (Features, Examples, Rules, Supplementary information), covering Feature Attribute, Feature Shape, Feature Interaction, Similar Example, Typical Example, Counterfactual Example, Decision Rule, Decision Tree, Input, Output, Dataset, and Performance, each shown with a prototyping card example (e.g., autonomous driving, diabetes risk, and house price prediction).]
Fig. 1.
End-user-friendly explanatory forms in the EUCA framework. The explanatory forms are shown on the right grids, each accompanied by a prototyping card example across four tasks used in the user study. The forms are a familiar language to both AI designers and end-users, thus overcoming the technical communication barriers between the two. The 12 explanatory forms are grouped into four categories to explain AI's prediction on a new data point (the red dot in the leftmost 2D feature-space plot), or the model's overall behavior: explaining using features, examples, rules, and supplementary information. These categories correspond to the different aspects of showing AI's learned representations at the feature, instance, and decision boundary level, indicated in the plot.
The ability to explain decisions to its end-users is a necessity for deploying AI as critical decision support. Yet making AI explainable to end-users is a relatively ignored and challenging problem. To bridge the gap, we first identified twelve end-user-friendly explanatory forms that do not require technical knowledge to comprehend, including feature-, example-, and rule-based explanations. We then instantiated the explanatory forms as prototyping cards in four AI-assisted critical decision-making tasks, and conducted a user study to co-design low-fidelity prototypes with 32 layperson participants. The results verified the relevance of using the explanatory forms as building blocks of explanations, and identified their properties (pros, cons, applicable explainability needs, and design implications). The explanatory forms, their properties, and prototyping support constitute the End-User-Centered explainable AI framework EUCA. It serves as a practical prototyping toolkit for HCI/AI practitioners and researchers to build end-user-centered explainable AI.
Authors' addresses: Weina Jin, [email protected], Simon Fraser University; Jianyu Fan, Simon Fraser University; Diane Gromala, Simon Fraser University; Philippe Pasquier, Simon Fraser University; Ghassan Hamarneh, [email protected], Simon Fraser University.
CCS Concepts: • Computing methodologies → Artificial intelligence; • Human-centered computing → User studies.
Additional Key Words and Phrases: Explainable Artificial Intelligence; Machine Learning Interpretability; Usability Study; Human-AI Collaboration; User-Centered Design
Problem statement. Doctors, judges, drivers, bankers, and other decision-makers require explanations from artificial intelligence (AI) when they use AI for critical decision support. As AI becomes pervasive in high-stake decision-making tasks, such as in supporting medical, military, legal, and financial judgments, making AI explainable to its users is crucial to identify potential errors and establish trust [39]. The growing research community of eXplainable AI (XAI) aims to address such problems and "open the black box of AI" [32]. The XAI literature generally divides its users into two groups according to their level of technical knowledge in AI: technical users and non-technical users [21, 74, 75, 78, 86]. The primary focus of current XAI research, however, is on debugging, understanding, and improving AI models for technical users, such as data scientists, AI researchers and developers, leaving the largest and most diverse group of XAI users largely ignored: the non-technical end-users [21, 69]. Non-technical end-users, or end-users for short, can be either laypersons, such as drivers overseeing autonomous driving vehicles, or domain experts, such as doctors using AI-assisted technology in diagnostic tasks [26, 44, 47], judges using AI to support reaching a verdict of guilt [51], and bankers using AI to assist in approving loan applications.
Challenges. Compared to developing XAI for technical users, which mainly involves technical challenges [33], developing XAI for end-users faces even greater challenges: 1) No technical knowledge: unlike technical users, end-users typically do not possess technical knowledge in AI, machine learning, data science, or programming, making explanation methods that presume users' prior knowledge in AI (such as gradients, activations, neurons, layers) unviable. 2) Diverse users, tasks, and explainability needs: when developing XAI for technical users, users have a relatively unified need: they utilize explanations mainly for debugging, gaining insights on the model, and improving it accordingly [45]. In contrast, developing XAI for end-users must adapt to the variability in end-users' roles, tasks, and needs for explanation. For example, a doctor may demand distinct explanations from AI when using it as a diagnostic support system, whereas a human resources specialist resorts to explanations to support her hiring decisions (different end-users and tasks). Even if an XAI system is built for the same task, the needs and requirements for explanation may vary. For example, a house seller may leverage the explanation of AI predictions to boost her property value, whereas a realtor may need an explanation to verify why AI's prediction diverges from her own judgment.
Research gaps. Given these challenges, there is an urgent need for end-user-centered XAI design guidance to support AI practitioners' and researchers' XAI design process on critical decision support tasks. Although recent years have witnessed booming research on XAI in both the human-computer interaction (HCI) and AI communities [12, 32, 78], research on end-user-centered XAI is still in its infancy. The AI community lacks and calls for such a user-centered perspective [62, 68]. In the HCI community, the existing user-centric XAI design guidance [57, 61, 67, 85] utilized a traditional user-centered approach informed by users' requirements only, which may lead to technically unachievable solutions limited by current AI capacities or training data availability [94]. They also lack support for prototyping, participatory design, and UX/UI (user interaction/user interface) design, which are the kinds of support most desired by XAI design practitioners in prior user studies [57, 92, 93].

The EUCA framework is available at http://weina.me/end-user-xai
[Figure 2 panels: Suggested Prototyping Workflow; End-User-Friendly Explanatory Forms; Design Examples & Templates; Properties of Explanatory Forms; Associated Algorithms for Implementation; Technical literature survey; Prototype design for user study; User study results; Discussion; with pointers to the corresponding paper sections (Section 3, Fig. 1/Section 4, Sections 5 & 6, Section 7.1.2, Table 1).]
Fig. 2.
EUCA framework components and creation process.
Top: EUCA contains 5 components to support the XAI prototyping process. Bottom: The length of each arrow covers the creation of a EUCA component and its corresponding paper sections. The light blue arrows are preparation stages before the user study, and dark blue arrows indicate the user study phase.
Solution. To address the above challenges and research gaps, we propose the End-User-Centered explainable AI framework EUCA as a practical prototyping framework to support the design and implementation process of end-user-centered XAI. The EUCA framework (Fig. 2) contains a suggested prototyping workflow (Section 7.1.2), a series of end-user-friendly explanatory forms (Section 3) that consider both end-user literacy and technically viable solutions, their design examples and templates (Fig. 1), their identified properties (pros, cons, applicable needs for checking explanations, and UI/UX design implications) from our user study for prototyping support (Section 5), and their associated XAI algorithms for implementation (Table 1). The full content of the framework is in the Appendix, and key messages are summarized in Table 1. The process of creating the EUCA framework is illustrated in Fig. 2.

To tackle challenge 1) lack of technical knowledge, we screened existing XAI techniques, summarized their final representation forms for explanation, and selected those forms that do not require any prior technical knowledge to understand. We finally curated twelve end-user-friendly explanatory forms (Fig. 1): explaining using features (including feature attribute, feature shape, and feature interaction), examples (similar, prototypical, and counterfactual examples), rules (decision rule and decision tree), and some necessary supplementary information (input, output, dataset, performance). Since they were derived from technical works, the explanatory forms naturally link design representations to XAI algorithms. They enable designers to fully explore the technically feasible solution space without having to worry that their design solutions are technically infeasible. These forms are also a familiar language for end-users, thus providing opportunities to involve users in the prototyping and participatory design process.

To address challenge 2) the diversity of end-users' roles, tasks, and needs, EUCA incorporates user-centered design in its prototyping process, so that users' context-specific requirements are fully understood and addressed in prototypes. To do so, we first instantiated the explanatory forms as prototyping cards for four AI-assisted critical decision-making tasks across health, safety, finance, and education, and conducted a user study with 32 layperson participants. The interview and card sorting demonstrated the use of the prototyping cards as building blocks of explanation. And through a participatory design process, designers and users could discuss and identify a suitable strategy to combine the prototyping cards to construct XAI prototypes that address users' needs for explanation. The user study also identified the strengths, weaknesses, applicable explanation needs, and UI/UX design implications for the explanatory forms.
Contribution. The main contribution of EUCA is that it provides a practical prototyping framework for AI practitioners (UX designers, developers, etc.) to build end-user-centered XAI prototypes. The prototyping workflow and tangible design examples/templates support a user-centered prototyping and co-design process, and enable end-users to communicate their context-specific explainability needs to practitioners. The suggested prototyping process (illustrated in Fig. 10) is intuitive to follow even for people outside the HCI/UX community. The explanatory forms are simple and familiar for both technical creators and non-technical users, and thus can easily invite all stakeholders into the co-design conversation. The coupled XAI algorithms facilitate implementing the low-fidelity prototypes as functional high-fidelity ones.

In addition to the above support for XAI practitioners,
HCI and AI researchers may prototype using EUCA to propose novel XAI interfaces/algorithms, with the idea of using explanatory forms as building blocks. The user study findings uncovered the strengths, weaknesses, and design implications of each explanatory form, providing opportunities to improve them and create new ones.

Besides bridging the communication gap between XAI creators and their end-users, EUCA is also a boundary object [3] that bridges the knowledge gap between AI and HCI/UX expertise. Designing end-user-centered XAI is challenging since it requires expertise in both HCI and AI [90]. EUCA is built with a collaborative effort combining AI and HCI expertise, and XAI creators working in either field can use EUCA to compensate for the missing expertise, or to scaffold the conversation and collaborate with teams from the other field. HCI designers and researchers can use the prototype as the representation of the underlying algorithm ("form followed by functions"). For HCI/UX designers who lack constant access to AI experts, we provide a design method that abstracts XAI techniques into tangible design patterns and exemplars. By offering a tool to directly talk to users, EUCA introduces the notion of prototyping and user-centered design to AI researchers and the AI community. It joins the recent effort in the XAI field to synergize the HCI and AI communities and to facilitate interdisciplinary collaboration and communication [12, 90].

In addition to the EUCA framework contribution, the user study also identified fine-grained end-user requirements for different explainability needs, such as calibrating trust, detecting bias, resolving disagreement with AI, and improving the outcome.
Explainable artificial intelligence (XAI), or interpretable machine learning (ML), is usually regarded as a sub-field of ML. XAI can be narrowly defined as revealing the model's decision-making process. But a broader definition includes all necessary background information to make the AI model and its decision-making process transparent and understandable [72], including the training data and model performance. We adopt the broad definition of XAI in this paper.

Unlike other ML fields that rarely involve end-users in the technical development and evaluation phases, XAI inherently has a close relationship with its end-users: in the end it is the users who will interpret the explanations resulting from XAI techniques. Although the XAI field has been booming in the past few years (largely due to the pervasive use of AI in critical tasks and the legal and ethical requirements on model transparency and accountability [1]), and many XAI techniques have been proposed, most works remain at an algorithmic level. It is unknown whether or how they will work in practice, and what their suitable use cases are. As Lipton criticized, "with a surfeit of hammers, and no agreed-upon nails," ... "we fail to ask what end the proposed interpretability serves" [62].
Fig. 3.
Visualizing the distinction between EUCA and prior frameworks regarding their user-centered origin.
The left half shows two distinct streams of developing user-centered XAI frameworks: informed by users' requirements only (top: existing frameworks), and informed by both technical capabilities and users (bottom: EUCA). The curved and straight lines indicate the knowledge sources or prior works they originated from. The right half outlines the workflows of using the two types of frameworks in practice. While user-informed frameworks imply a linear workflow that does not provide opportunities to incorporate users' feedback before implementation, the EUCA framework supports an iterative prototyping process. A detailed comparison of the two workflows is expanded in the next figure.

Although a number of XAI technical taxonomies [72], surveys [41, 87], and technique selection guidance [18] have been proposed, they only support the technique selection process in the back-end, not the whole system design including the front-end UI/UX design and user requirements analysis. These technical guidelines also mainly consider the technical aspect rather than usability requirements, and are technique-oriented, not end-user-oriented. The ML and XAI communities have realized such pitfalls and call for a human-centered perspective and collaboration between the HCI and AI fields [62, 68]. The explanation problem is also regaining visibility in the HCI field in recent years. Based on extensive literature analysis, Abdul et al. proposed an HCI research agenda on XAI [12]. Vaughan and Wallach discussed the importance of taking a human-centered strategy when designing and evaluating XAI techniques [90]. Furthermore, XAI is usually part of an AI system, with XAI being the main component or an embedded feature. Therefore, human-AI interaction guidance may also be applicable to the XAI system design process as general design principles [14, 94].
To inform the design of user-centered explanations, existing works identify human-centered insights from explanation theories and human-subject studies. We illustrate the comparison of existing user-centered XAI frameworks with EUCA in Fig. 3 and Fig. 4, and state the details below.

Miller summarized the characteristics of explanations from philosophy, psychology, and social science [67]: users prefer simple, selected (but may be biased), and causal explanations; explanations are contrastive to other related predictions, and people tend to seek causal reasoning in a counterfactual fashion, i.e., what would the prediction be if certain features of the input had been different. The explanation is a social process in that humans tailor explanatory contents to different explainability needs and audiences.
Fig. 4.
Workflow comparison of using the two types of frameworks in practice.
Prior frameworks imply a user-requirement-informed workflow, as shown on the top: user requirements → design → implementation, while EUCA follows a user-and-technology-informed workflow (bottom) by considering technical capabilities. EUCA replaces the direct mapping from user requirements to explanation information with an iterative prototyping and co-design process (indicated by the back-and-forth arrow in the workflow). For each step in the workflow, we highlight the key actions or results using bold font. ✓ and × indicate whether a step is supported or not supported by a framework's content, and another mark indicates the limitations of applying a framework in a certain step.

Following this line, Wang et al. conducted a review of the explanation theory literature, and further provided a theory-driven, user-centered XAI framework that describes how human reasoning processes and explanation theories guide explanation system requirements [85]. They suggested that an XAI system should support reasoning while mitigating heuristics and bias. Their work is a first attempt at developing user-centered XAI design guidance, but it remained at a conceptual and abstract level, and lacked actionable guidance on how to practically implement explanation theories for context-specific tasks and needs.

In their follow-up paper, Lim et al. [61] extended the framework by detailing the explanation types (input, output, certainty, why, why not, what if, how to, and when), and proposing pathways to link these types to users' three explanation goals: filter causes, generalize and learn, and predict and control. The explanation type taxonomy was first identified by Lim and Dey in 2009, by surveying users' questions in crowdsourced user studies for context-aware systems [60]. Our explanatory forms overlap with their taxonomy in our category of "supplementary information": input, output, and certainty. As a position paper, their proposed linkage is mainly conceptual and lacks user study evidence. They also did not provide practical guidance or implementation support to illustrate its usefulness in real-world tasks. In contrast, we included a variety of explanation needs/goals in our user study, and backed our findings on the correlation between explanatory forms and explanation needs with quantitative and qualitative user study data.

Liao et al. [57] further explored the idea of providing mapping guidance between users' requirements and explanation types to facilitate human-centered explanation design. The explanation types were identified based on questions users may ask to understand AI. Their framework also provides an additional mapping from explanation types to algorithmic implementations, saving practitioners the effort and expertise needed to identify the right algorithm to implement. Using the question list and explanation types as a study probe, they further conducted a user study with 20 UX designers to explore the opportunities and challenges of putting XAI techniques into practice.
Their results revealed rich details on users' needs for XAI, but did not show evidence (such as user studies with end-users) that the corresponding XAI methods would answer users' questions. Different from EUCA, which focuses on the prototyping process, their framework directly guides the choice of explanation types, and it does not provide opportunities to take in users' feedback on design solutions before the system is implemented.
While prior frameworks utilized a general user-centered approach by proposing design solutions based on users' requirements, such a direct user-requirement-informed paradigm may not be applicable in the context of XAI, or, more generally, AI system development. If we abstract the human-centered technology development process as the following workflow:

User requirements --1--> Design --2--> Implementation    (Workflow 1)

For common technological development, the user-centered challenge is usually in Step 1: to guide design by getting informed by users, which is the focus of previous frameworks. This approach implies that once we find the design solutions, such designs can be easily fulfilled by technical implementation (Step 2). This is usually the case for traditional technology development, as the functionality of the system can largely be specified and determined by design, but it is not the case for AI [94]. Yang illustrated this in a case study on designing an AI-driven clinical decision support system [92], where considering users' requirements only led to a technically unachievable solution (Step 2 is blocked), due to designers' and end-users' limited understanding of current AI's technical capabilities, or the lack of training data to train the proposed AI model. In real-world practice, since UX designers usually "know very little about how AI works", to come up with a technically viable design, UX designers need to work closely with technical teams to incorporate technical solutions in the craft of design, which is the central stage in the AI system design process. Such a relatively novel and unique workflow is the most challenging and unsupported part for design practitioners, and, according to Yang et al.'s interviews with 13 UX designers who are experienced in AI products, "working with AI took much longer than when designing other UX products and services" [93].

To sum up, the uniqueness of AI and XAI system design requires taking into consideration not only the user side, but also the viable technological solution space [94]. The workflow of designing a user-centered AI or XAI system thus becomes:

User requirements + Technical capabilities --1--> Design --2--> Implementation    (Workflow 2)
As analyzed in Section 2.2.1, prior XAI design frameworks [57, 61, 85] follow the user-requirement-informed paradigm (Workflow 1), whereas ours is informed by both end-users and technical capabilities (Workflow 2) (Fig. 3, 4). Although previous frameworks tried to address the unique challenge of technical capability in XAI design by constraining users' requirements to a pre-defined space (such as pre-defined XAI goals or questions) and providing direct mappings from users' requirements to design solutions, such an approach compromises design possibilities residing outside the pre-defined space and limits users' choices, and hence may not fully address users' needs. It also leaves many existing technical solutions under-explored. In contrast, the EUCA framework is a strategic synergy of XAI techniques and user-centered design methods. To the best of our knowledge, we are the first to take a user-and-technology-informed paradigm to develop an end-user-centered XAI framework.
In addition, prior frameworks implicitly suggest using the framework to directly guide the choice of explanation information. The explanation selection and design processes do not provide opportunities to take in users' feedback until the candidate solution has been designed or implemented. The shortcoming of such a solution-first design approach is that the initial, premature solution becomes hard to move past, as the effort spent crafting the design and implementation becomes an expensive sunk cost [79]. In contrast, prototypes, especially low-fidelity ones, enable quick and inexpensive trial-and-error, allowing full exploration of the solution space before implementation. We discuss prototyping further in the next section.
Given the complex scenarios in real-world design and development practice, the one-size-fits-all mapping between users' requirements and explanation information provided by existing frameworks may not be applicable. Prototyping provides a quick-and-dirty way to help XAI system designers understand user-, task-, and scenario-specific requirements, and to assess and improve their design. It also sensitizes designers to the scope of AI capabilities: "it is through sketching and prototyping that designers understand what the technology is and can do" [94]. Prior works demonstrated using prototypes and involving stakeholders in co-design processes in various user-oriented XAI development settings.

To bridge technical XAI with its ambiguous and dynamic real-world use, Wolf [89] proposed to apply scenario-based design [79], an HCI method that mimics the prototyping idea without a tangible prototype [46]. It creates a narrative description to envision the user experience after deployment to guide the system design. Cirqueira et al. demonstrated applying scenario-based requirements elicitation in designing a user-centric XAI system for fraud detection [30].

Similarly, after a literature review, Eiband et al. found there is no consensus in prior works on what to explain, and users' demands vary case by case [34]. They then presented their six-month participatory design process on the transparency interface design for a commercial intelligent fitness coach, and demonstrated an iterative prototyping process to answer how to explain as follows: in a focus group workshop with stakeholders, the team members brainstormed and sketched the UI and user workflow, followed by voting and discussion of the ideas to generate a list of promising implementation ideas. Next they implemented the two most promising ideas as a series of low- and high-fidelity prototypes, and evaluated and refined the prototypes in several rounds of user testing. The process resulted in two high-fidelity prototypes.

Despite prior attempts at incorporating the prototyping process in XAI design, creating XAI prototypes is still a challenging task and requires technical expertise. Practitioners desire support
on prototyping tools and methodology, according to a number of user studies with UX designers [57, 92, 93]. Our EUCA framework provides prototyping tools and methodologies to facilitate the creation of low- and high-fidelity prototypes that are technologically feasible. This support is particularly useful for designers who do not have ML expertise or lack constant access to capable data scientists.
Existing user studies with non-technical end-user participants were conducted in a case-by-case manner, to understand users' perception of the explanation information and provide insights for XAI design.

Cai et al. [25] examined the effect of similar examples and counterfactual examples (named comparative explanations in the paper) in a study involving 1150 layperson participants on an online drawing-and-guessing platform. They found that users who received similar-example explanations felt they had a better understanding of AI, and perceived AI to have a higher capability.
Counterfactual examples, however, did not always improve the perceptions of AI, as they exposed the limitations of AI and may have led to confusing or unexpected results.

Narayanan et al. [71] conducted a controlled user study with 600 Amazon Mechanical Turkers to identify how varying the complexity of a decision set explanation affects users' ability to interpret it. They found that while almost all types of complexity resulted in longer response times, some types of complexity, such as the number of rules or the number of new features introduced, had a much bigger effect than others, such as repeated features.

While prior works provide individual evidence based on a specific XAI application, our study systematically compares and assesses the strengths, weaknesses, applicable explanation needs, and design implications of the explanatory forms in a variety of tasks and explanation need scenarios.
Distinct from prior frameworks, in which the explanation information is informed by users' requirements only, by applying the EUCA framework the design of explanation information is informed by both user requirements and technical capabilities. This ensures that the resulting design is technologically achievable. To do so, we began with the observation on existing XAI systems/taxonomies that, although the XAI algorithms, models, tasks, and visual representations vary, their resulting explanation information can be abstracted into several recurrent forms, such as feature attributes generated by a linear model or algorithms that mimic a linear model [13, 63, 76], similar examples from different content-based retrieval algorithms [52], or decision trees and rules. And some of these forms may be consumed by non-technical users without technical knowledge as a prerequisite. Since the explanatory forms are the final explanation information resulting from existing XAI algorithms, and the explanatory forms are a finite set, we may use them to guide the choice and design of explanations in an XAI system. Because the explanatory forms originate from existing XAI algorithms, once the forms are decided, it is straightforward to implement their associated algorithms, i.e., form followed by function (we reverse the famous design maxim "form follows function" [7]). Our approach also echoes the matchmaking design process that identifies potential user domains ("nails") for numerous existing XAI techniques ("hammers") [22].

Based on the above insights, we explored the XAI solution space by extracting the resulting explanation information from existing technical literature in the AI, HCI, and information visualization fields via a literature review, then selected and summarized end-user-friendly explanatory forms based on the following criteria:

(1) The explanatory forms must be end-user-friendly, i.e., users are not required to possess technical knowledge to understand the explanation.

(2) The explanatory forms are mutually exclusive regarding the information they represent. We noted that the explanation information can sometimes be attributed to different XAI types and concepts that entangle with each other, e.g., causal explanations may be expressed as feature attributes or rules; counterfactual explanations can be represented as counterfactual features, examples, or rules; feature attributes can be global as well as local. We selected forms that are mutually exclusive, so that they can act as building blocks, represent the elemental explanation information, and their combination would not be redundant/repeated in an XAI system.

The literature review process and the list of surveyed literature are detailed in Supplementary Material S1. We ended up with 8 explanatory forms in three categories: explaining using features, examples, and rules. In addition, we added necessary supplementary information to make the explanation complete, including input, output, dataset, and performance. A total of 12 end-user-friendly explanatory forms are included in the EUCA framework.

The end-user-friendly explanatory forms are a familiar and mutual language for both end-users and XAI practitioners, so they can facilitate communication about users' requirements and the co-design process. Since the forms are summarized from the technically achievable solution space and shown as UI design patterns, they also bridge the expertise gap between HCI/UX designers and AI developers.
Next we introduce each explanatory form, accompanied by their possible visual representations summarized from the surveyed literature to facilitate UI/UX design. Figure 1 shows their visual examples.
Table 1. The End-User-Friendly Explanatory Forms. We indicate whether a form is a global (explaining the model's overall behavior) or local explanation (explaining the decision for the individual instance). We also give its applicable input data types: Tabular - tabular data; Img - spatial-structured data (e.g., image, graph); Txt - sequential data (e.g., text, signal). The number of ★ indicates our rated user-friendliness level (from least to most friendly). Its pros, cons, design implications, and applicable explanation needs are summarized from the user study findings, followed by associated algorithms for implementation.

Feature-based explanation | Feature Attribute (Local/Global; Tabular/Img/Txt; ★★★)
- Visual representations: saliency map; bar chart.
- Pros: simple and easy to understand; can answer how and why AI reaches its decisions.
- Cons: illusion of causality, confirmation bias.
- UI/UX design implications: alert users about the causality illusion; allow users to set thresholds on the feature importance score, and show details on demand.
- Applicable needs: to verify AI's decision.
- XAI algorithm examples: LIME, SHAP, CAM, LRP, TCAV.

Feature-based explanation | Feature Shape (Global; Tabular; ★★)
- Visual representations: line plot.
- Pros: graphical representation, easy to understand the relationship between one feature and the prediction.
- Cons: lacks feature interaction; information overload if multiple feature shapes are presented.
- UI/UX design implications: users can inspect the plot of the features they are interested in; may indicate the position of local data points (usually users' input data).
- Applicable needs: to control and improve the outcome; to reveal bias.
- XAI algorithm examples: PDP, ALE, GAM.

Feature-based explanation | Feature Interaction (Global; Tabular; ★)
- Visual representations: 2D or 3D heatmap.
- Pros: shows feature-feature interaction.
- Cons: the diagram on multiple features is difficult to interpret.
- UI/UX design implications: users may select their interested feature pairs and check feature interactions; or the XAI system can prioritize significant feature interactions.
- Applicable needs: to control and improve the outcome.
- XAI algorithm examples: PDP, ALE, GAM.

Example-based explanation | Similar Example (Local; Tabular/Img/Txt; ★★★)
- Visual representations: data instances as examples.
- Pros: easy to comprehend; users intuitively verify AI's decision using analogical reasoning on similar examples.
- Cons: it does not highlight features within examples to enable users' side-by-side comparison.
- UI/UX design implications: support side-by-side feature-based comparison among examples.
- Applicable needs: to verify the decision.
- XAI algorithm examples: nearest neighbour, CBR.

Example-based explanation | Typical Example (Local/Global; Tabular/Img/Txt; ★★)
- Visual representations: data instances as examples.
- Pros: uses prototypical instances to show the learned representation; reveals potential problems of the model.
- Cons: users may not appreciate the idea of typical cases.
- UI/UX design implications: may show within-class variations, or edge cases.
- Applicable needs: to verify the decision; to reveal bias.
- XAI algorithm examples: k-Medoids, MMD-critic, Generate prototype, CNN prototype.

Example-based explanation | Counterfactual Example (Local; Tabular/Img/Txt; ★★)
- Visual representations: two counterfactual data instances with their highlighted contrastive features, or a progressive transition between the two.
- Pros: helpful to identify the differences between the current outcome and another contrastive outcome.
- Cons: hard to understand, may cause confusion.
- UI/UX design implications: users can define the predicted outcome to be contrasted with, and receive personalized counterfactual constraints; may show controllable features only.
- Applicable needs: to differentiate between similar instances; to control and improve the outcome.
- XAI algorithm examples: inverse classification, MMD-critic, progression, visual counterfactual.

Rule-based explanation | Decision Rules/Sets (Global; Tabular/Img/Txt; ★★)
- Visual representations: present rules as text, table, or matrix.
- Pros: present decision logic, "like human explanation".
- Cons: need to carefully balance between completeness and simplicity of explanation.
- UI/UX design implications: trim rules and show on demand; highlight local rule clauses related to the user's interested instances.
- Applicable needs: facilitate users' learning, report generation, and communication with other stakeholders.
- XAI algorithm examples: Bayesian Rule Lists, LORE, Anchors.

Rule-based explanation | Decision Tree (Global; Tabular/Img/Txt; ★)
- Visual representations: tree diagram.
- Pros: shows the decision process, explains the differences.
- Cons: too much information, complicated to understand.
- UI/UX design implications: trim the tree and show on demand; support highlighting branches for the user's interested instances.
- Applicable needs: comparison; counterfactual reasoning.
- XAI algorithm examples: model distillation, Disentangle CNN.

Feature-based explanations are the most common form of explanation information. We refer to a feature as a piece of information that can describe the input data. It could be the raw representation of the input (such as image pixels or sound wave signals), descriptive characteristics of the input summarized/designed by humans (such as house features presented in tabular data), or features automatically learned by AI. For example, a real estate agent can describe a house by its size, location, and age, three descriptive features; the feature of an image can be each individual pixel, a group of pixels highlighting the object of a car, or the explicit concept of "car".

To use features for explanations, the feature representation must be human-interpretable. The feature space is also the cornerstone of other explanatory forms: example-based explanations are instances with similar or contrastive features, and rule-based explanations are features connected by logic and conditional statements. The feature-based explanations consist of three explanatory forms:
Feature attribute. It indicates which features are important for the decision, and what their attributions to the prediction are. For example, it can be a list of key features and their importance scores for the house price prediction, or a color map overlaid on the input image indicating the parts/objects that were important for image recognition. It assumes the prediction is explainable (often locally) by a linear, additive combination of important features.
Visual representation: Its visual representations largely depend on the data type of the features. For image and text data, overlaying a saliency map or color map on the input is the most common visualization. It uses sequential colors to encode the fine-grained feature importance score for each individual feature (a pixel for image input, a word for text data). For image/video input data, other popular visualizations include segmentation masks or bounding boxes on important image objects/parts. To visualize multiple feature attributes for tabular or text data, a bar chart is a typical choice. Variations of the bar chart include the waterfall plot, treemap, wrapped bars, packed bars, piled bars, Zvinca plots, and tornado plot. Compared with a bar chart, which shows a point estimate of feature importance, a box plot can be used to visualize the probabilistic distribution of the feature importance score. Its variations include the violin plot and beeswarm plot, which show more detailed data distribution and skewness.
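To make the form concrete, the following is a minimal sketch, not the EUCA implementation, of how a feature-attribute bar chart could be produced for a hypothetical tabular house-price model using permutation importance from scikit-learn; the feature names and data are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Hypothetical tabular house data: size (sq ft), age (years), distance to school (km)
rng = np.random.default_rng(0)
X = rng.uniform([500, 0, 0.1], [3000, 60, 10], size=(500, 3))
y = 300 * X[:, 0] - 2000 * X[:, 1] - 15000 * X[:, 2] + rng.normal(0, 20000, 500)
names = np.array(["size", "age", "distance_to_school"])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Feature attribution: how much each feature contributes to the model's predictions
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
order = np.argsort(result.importances_mean)

plt.barh(names[order], result.importances_mean[order])
plt.xlabel("Permutation importance (feature attribution)")
plt.tight_layout()
plt.show()
```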
Feature shape. It shows the relationship between one particular feature and the outcome, such as the house size and the predicted house price.
Visual representation: For a continuous feature (such as height or temperature, i.e., a measurement on a scale), a line chart is the most common visualization, depicting whether the relationship between the feature and the outcome is monotonic, linear, or more complex. The line chart can be accompanied by a scatter plot detailing the positions of individual data points. For a categorical feature (such as gender or season), a bar chart can be used.
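As a rough illustration of a feature-shape view, the sketch below (again on invented house data, not the study's models) plots the one-dimensional partial dependence of the predicted price on house size with scikit-learn.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.uniform([500, 0, 0.1], [3000, 60, 10], size=(500, 3))  # size, age, distance
y = 300 * X[:, 0] - 2000 * X[:, 1] - 15000 * X[:, 2] + rng.normal(0, 20000, 500)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Feature shape: how the predicted price changes as house size varies,
# averaged over the other features
PartialDependenceDisplay.from_estimator(
    model, X, features=[0],
    feature_names=["size", "age", "distance_to_school"], kind="average")
plt.show()
```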
Feature interaction. When features interact with each other, their total effect on the outcome may not be a linear summation of each feature's individual effect. Feature interaction considers such an interactive effect, and shows the total interaction effect of multiple features on the outcome. It can be regarded as an extension of feature shape that takes multiple features (instead of one feature) into account.
Visual representation: A 2D or 3D heatmap is usually used to visualize the total effect of feature interactions on the prediction. Limited by the visualization, a heatmap shows feature interaction for at most three features (using a 3D heatmap). More complicated sets of pairwise feature-feature interactions can be visualized using a matrix heatmap, a node-link network, or a contingency wheel.
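A two-feature partial dependence plot is one way to obtain such a heatmap; the sketch below (hypothetical data, not from the study) shows the joint effect of house size and age on the predicted price.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.uniform([500, 0, 0.1], [3000, 60, 10], size=(500, 3))  # size, age, distance
y = (300 * X[:, 0] - 2000 * X[:, 1] - 15000 * X[:, 2]
     - 2.0 * X[:, 0] * X[:, 1] / 60          # a size-age interaction term
     + rng.normal(0, 20000, 500))
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Feature interaction: joint effect of (size, age) on the prediction as a 2D map
PartialDependenceDisplay.from_estimator(
    model, X, features=[(0, 1)],
    feature_names=["size", "age", "distance_to_school"], kind="average")
plt.show()
```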
Humans use examples to learn and to explain. Examples carry contextual information and are intuitive for end-users to interpret. Three types of examples are included:
Similar example. Similar examples are instances that are similar to the input data regarding their features. For example, for a house to sell, its similar examples can be houses in the adjacent area with similar features such as house size, age, etc.
Typical example. A typical or prototypical example is a representative instance for a certain prediction. For example, a typical example for the diabetes prediction could be a patient who exhibits typical characteristics (such as a high blood sugar level and an abnormal hemoglobin A1C level) and could be diagnosed with diabetes.
Visual representation: For similar and typical examples, it is straightforward to show several examples with their corresponding predictions.
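A similar-example view can be generated with a plain nearest-neighbour search in a standardized feature space, as in the hypothetical sketch below; a typical example could instead be the training instance closest to a class or cluster centre.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform([500, 0, 0.1], [3000, 60, 10], size=(500, 3))  # size, age, distance

scaler = StandardScaler().fit(X)
nn = NearestNeighbors(n_neighbors=3).fit(scaler.transform(X))

query = np.array([[1000.0, 20.0, 2.0]])  # the user's own house
_, idx = nn.kneighbors(scaler.transform(query))
print("Similar examples (training instances closest to the query):")
print(np.round(X[idx[0]], 1))
```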
Counterfactual example. Its features are similar to the input's, but with minimal feature changes such that its prediction is distinct from the input's. For example, an instance C that is predicted as healthy is a counterfactual example for the input I that is predicted to have diabetes, if C has all the same features as I except that its blood sugar level is lower than I's. We note that counterfactual explanations can also be expressed as counterfactual features or rules. However, a counterfactual feature or rule cannot be a standalone explanation in an XAI system; it must reside within a containing context by assuming all other features are held constant. To make the explanation information complete, we include counterfactual explanation in the form of an example.
Visual representation: Counterfactual examples can be shown as two instances with their counterfactual/contrastive features highlighted, or as a transition from one instance to the other by gradually changing the counterfactual features.
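Below is a deliberately simple counterfactual sketch, not one of the cited algorithms: holding all other features of a hypothetical diabetes input fixed, it scans a single controllable feature until the model's prediction flips.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: blood sugar, body weight, weekly exercise hours
rng = np.random.default_rng(0)
X = rng.uniform([70, 45, 0], [220, 130, 14], size=(600, 3))
y = (0.04 * X[:, 0] + 0.02 * X[:, 1] - 0.3 * X[:, 2]
     + rng.normal(0, 1, 600) > 6).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

x = np.array([180.0, 95.0, 1.0])  # input instance
print("Original prediction (1 = high diabetes risk):", model.predict([x])[0])

# Scan blood sugar downwards while all other features stay constant
for sugar in np.linspace(x[0], 70, 200):
    candidate = x.copy()
    candidate[0] = sugar
    if model.predict([candidate])[0] != model.predict([x])[0]:
        print(f"Counterfactual example: blood sugar {x[0]:.0f} -> {sugar:.0f}, "
              "other features unchanged")
        break
```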
Rule-based explanations are explanations where decisions of the model, in whole or in part, can be described succinctly by a set of logical if/else statements, mimicking human reasoning and decision making. A rule-based explanation also implies the decision boundary, and thus may be convenient for counterfactual reasoning. It is a global explanation of the model's overall behavior, and includes the two explanatory forms of decision rule and decision tree. We note that rule and decision tree carry similar explanation information, but since they are usually generated by different XAI algorithms, and their representation formats (text vs. diagram) are distinct to end-users, we include them as two separate explanatory forms.
Rule. Decision rules or decision sets are simple IF-THEN statements with a condition and a prediction. For example: IF blood sugar is high, AND body weight is in the overweight range, THEN the estimated diabetes risk is over 80%.
Visual representation: Rules are usually represented using text. Other representation formats include a table [27] or a matrix [70] to align, read, and compare rule clauses more easily.
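As one hedged illustration (a shallow-tree approach, not Bayesian Rule Lists, LORE, or Anchors), a depth-limited decision tree fitted to hypothetical data can be exported as IF-THEN style text with scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical diabetes data: blood sugar, body weight, weekly exercise hours
rng = np.random.default_rng(0)
X = rng.uniform([70, 45, 0], [220, 130, 14], size=(600, 3))
y = (0.04 * X[:, 0] + 0.02 * X[:, 1] - 0.3 * X[:, 2] > 6).astype(int)

# A shallow tree kept short on purpose, so its rules stay readable
rules = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(rules, feature_names=["blood_sugar", "body_weight", "exercise"]))
```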
Decision tree. A decision tree represents rules graphically using a tree structure, with branches representing the decision pathways, and leaves representing the predicted outcomes.
Visual representation: The most common representation is a node-link tree diagram. Other visual representations that show the hierarchical structure include the treemap, cladogram, hyperbolic tree, dendrogram, and flow chart.
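A rough sketch of the node-link rendering, again on hypothetical data and not one of the cited distillation methods:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)
X = rng.uniform([70, 45, 0], [220, 130, 14], size=(600, 3))  # blood sugar, weight, exercise
y = (0.04 * X[:, 0] + 0.02 * X[:, 1] - 0.3 * X[:, 2] > 6).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
plot_tree(tree, feature_names=["blood_sugar", "body_weight", "exercise"],
          class_names=["low risk", "high risk"], filled=True)
plt.show()
```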
In addition to the above explanatory forms generated by XAI algorithms, an XAI system needs to present some necessary background or supplementary information to end-users, such as the input, output, decision confidence/certainty, model performance metrics, and training dataset information such as the data distribution. We included the following essential and common ones in our framework, and indicate whether each is a global (explaining the model's overall behavior) or local explanation (explaining the decision on an individual instance):
(1) Input, output (local): the input is the end-user's input data, and the output is AI's prediction on the given input.
(2) Certainty (local): since the prediction from AI models is usually probabilistic, the certainty score shows the case-specific certainty level, i.e., how confident the model is in making this particular decision.
(3) Performance (global): performance metrics (such as accuracy, confusion matrix, ROC curve, and mean squared error) help end-users judge the overall decision quality of the model, and set a proper expectation of the model's capability, as suggested in the human-AI interaction guidelines [14].
(4) Dataset (global): it describes information about the dataset on which the AI model was trained, such as the training data distribution. It may help end-users understand the model and identify potential flaws in the data.
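For concreteness, here is a minimal sketch (hypothetical data and model, not the study's systems) of how the performance and certainty items could be computed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform([70, 45, 0], [220, 130, 14], size=(600, 3))  # blood sugar, weight, exercise
y = (0.04 * X[:, 0] + 0.02 * X[:, 1] - 0.3 * X[:, 2]
     + rng.normal(0, 1, 600) > 6).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Performance (global): overall decision quality on held-out data
print("Accuracy:", accuracy_score(y_te, model.predict(X_te)))
print("Confusion matrix:\n", confusion_matrix(y_te, model.predict(X_te)))

# Certainty (local): the model's confidence on one particular input
x = [[180.0, 95.0, 1.0]]
print("Prediction:", model.predict(x)[0],
      "with certainty", round(model.predict_proba(x)[0].max(), 2))
```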
We conducted a user study with 32 layperson participants. The user study utilized interview and card sorting methodologies, and demonstrates the process of using the EUCA prototyping workflow to design XAI low-fidelity prototypes in different AI-assisted critical decision-making tasks. The primary goal of the user study is to identify the strengths, weaknesses, applicable explanation needs, and UI/UX design implications of the explanatory forms, as their properties to incorporate into the EUCA framework. The secondary goal is to use EUCA as a study probe to understand end-users' explanation needs for XAI. Our research questions are:
RQ1: What are the strengths, weaknesses, applicable explanation needs, and design implications of the end-user-friendly explanatory forms?
RQ2 : What are end-users’ requirements under various explanation needs?
We recruited layperson participants via convenience sampling, by advertising with posters at public libraries, community centers, and online community boards in the Greater Vancouver area over a 3-month period in 2019. The inclusion criteria were: 1) adult (19 years old and above); and 2) no prior technical knowledge in machine learning, data science, or artificial intelligence. A total of 32 participants were enrolled in the study (female = 16; age: 38.2±16.0, range 19-73). Participants' occupations covered a variety of industries, e.g., technology, design, car insurance, finance, psychology, construction, sales, food & cooking, law, healthcare, government/social services, and retired. The participants who use AI in work or life (6 participants, 19%) used AI software such as Google Assistant to play music, navigate traffic, chat with clients, and help drive investment decisions. Figure 5 shows the distribution of participants' age, educational background, and familiarity with and attitudes towards AI. Participants' detailed demographics are in Supplementary Material S2. The participants were compensated with $25 CAD for their time and effort in the study. The study was approved by the university's ethics board (ethics number: 2019s0244).
Fig. 5.
Participants' demographic information. (A) Histogram of participants' age distribution; the sticks on the x axis show each participant's age. (B) Pie chart of participants' educational level; numbers in parentheses are the number of participants in each category. (C) Pie chart of participants' familiarity with AI. (D) Pie chart of participants' attitudes towards AI; positive attitudes include being "interested" in and "excited" to use AI; negative attitudes consist of being "skeptical" and "concerned" about AI; a mixed attitude means participants hold both positive and negative attitudes towards AI.

Critical decision-making tasks. We focus the scope of the study on AI-assisted critical decision-support tasks, where explanations have high utility, as shown in previous research [24, 32, 59], and where AI cannot be delegated full automation because of the high-stakes nature of the tasks and liability issues. We designed four decision-making tasks reflecting the diversity of AI-supported critical decision-making. The four tasks are:
House task: users use AI to get a proper estimate of their house price.
Health task: users use AI to predict diabetes risk.
Car task: users decide whether to buy an autonomous driving vehicle.
Bird task: users use an AI bird recognition tool to prepare for an important biology exam. The tasks are critical decision-making scenarios, and the decisions have significant consequences for one's health and life (Health and Car tasks), finance (House task), or education (Bird task). At this stage, we had not included domain experts in our study; thus we deliberately designed the tasks so that decisions could be made based on common sense without requiring domain knowledge. The four tasks covered the common input data types of tabular, sequential, image, and video data, and their corresponding datasets are publicly available (see Table 2), so that the resulting paper prototypes from the user study can be actualized as working prototypes for case-specific studies in future work.

End-users' explanation needs. Even for the same user and task, end-users' needs for explanation, i.e., the trigger points or motivations to check the explanation of an AI system, may vary from time to time based on different contexts or usage scenarios. In our study, we aim to capture the fine-grained details of end-users' requirements in different explanation need scenarios. We summarized the following potential explanation needs from prior works [32, 39, 72, 78]:
Table 2. The four tasks and their explanation needs used in the interview.

AI-assisted critical decision-making tasks: House task (sell a house), Health task (check diabetes risk), Car task (buy a self-driving car), Bird task (prepare for an exam).
• Trust: House - you doubt whether to trust the AI tool or not; Health - you doubt whether to trust the software prediction on your diabetes risk; Bird - you don't know whether to trust the results from the website or not.
• Safety: Car - you need to know whether the autopilot mode is safe and reliable.
• Bias: Health - you doubt whether the software will perform the same among people of different gender, age, or ethnicity groups; Car - you want to know if the autopilot mode performs robustly under varying road, weather, and light conditions.
• Disagreement with AI: House - AI's prediction aligns/does not align with your own estimation; Health - you maintain good health with no major diseases or family history of diabetes / diabetes tends to run in your family and you're afraid of getting it someday, and AI predicts your chance of getting diabetes is low/high; Car - you notice the car sometimes drives much slower than the expected speed limit; Bird - the results sometimes do not align with your knowledge.
• Differentiation: Bird - in the exam, you need to write a short statement to differentiate different birds.
• Learning: Bird - is it a good tool to improve your learning and help you know more about bird taxonomy?
• Improvement: House - you need to decide whether to do a renovation or a replacement of appliances to increase your house value, and which action is the most cost-effective; Health - you want to know how to adjust your lifestyle to lower the risk of diabetes.
• Communication: House - you need to communicate your decision with your family; Health - you need to inform family members and consult your doctor; Car - you need to communicate with your family about your judgment on the car's safety.
• Report: Bird - in the exam, you need to write a short statement on how you recognized the bird as such a species.
• Multi-objectives trade-off: Health - you're aware that the insurance company may use such a prediction from the software to determine your insurance premium and benefits; Car - you easily get motion sickness, and you notice you seem to get car sick more frequently in autopilot mode.
• ML problem type: regression (House), regression (Health), classification (Car), classification (Bird).
• Input data type: tabular data (House), tabular/sequential data (Health), image/video data (Car), image data (Bird).
• Available dataset: Boston housing [11] (House), Diabetes dataset [6] (Health), BDD100K [96] (Car), CUB-200 dataset [88] (Bird).

• Calibrate trust: trust is key to establishing a human-AI decision-making partnership. Since users can easily distrust or overtrust AI, it is important to calibrate trust to reflect the capabilities of AI systems [84, 99].
• Ensure safety: users need to ensure the safety of the decision consequences.
• Detect bias: users need to ensure the decision is impartial and unbiased.
• Unexpected prediction: the AI prediction is unexpected, and users disagree with AI's prediction.
• Expected prediction: AI's prediction aligns with users' expectations.
• Differentiate similar instances: due to the consequences of wrong decisions, users sometimes need to discern similar instances or outcomes. For example, a doctor differentiates whether a diagnosis is a benign or malignant tumor.
• Learn: users need to gain knowledge, improve their problem-solving skills, and discover new knowledge.
• Improve: users seek causal factors to control and improve the predicted outcome.
• Communicate with stakeholders: many critical decision-making processes involve multiple stakeholders, and users need to discuss the decision with them.
• Generate reports: users need to utilize the explanations to perform particular tasks such as report production. For example, a radiologist generates a medical report on a patient's X-ray image.
• Trade-off multiple objectives: AI may be optimized on an incomplete objective while users seek to fulfill multiple objectives in real-world applications. For example, a doctor needs to ensure a treatment plan is effective and has acceptable patient adherence. Ethical and legal requirements may also be included as objectives.
Each task is accompanied by several explanation needs, as shown in Table 2. The tasks and explanation needs were presented in the form of storyboards using graphics and text. Supplementary Material S2 details the interview schedule and materials.

Creating prototyping cards from explanatory forms. We demonstrate our process of creating low-fidelity prototyping cards out of the 12 explanatory forms:
(1) Create prototyping card templates. We started by creating templates according to the visual representations in Section 3. The visualizations were the vanilla and most common formats appearing in previous literature; for example, we used a bar chart and a color map to visualize feature attribute for tabular and image data, respectively. Each card shows one explanatory form. For some explanatory forms (such as feature attribute and counterfactual example), we created multiple cards with different variations of their visual representations.
(2) Extract features as content placeholders. We then manually extracted several interpretable features for each AI task. For instance, in the house prediction task, we extracted house size, age, etc.; in the self-driving car task, we extracted salient objects such as traffic signs, road markers, cars, and pedestrians. As quick prototyping, the feature content does not necessarily reflect the true content generated by XAI algorithms; it served as a content placeholder.
(3) Fill the prototyping templates with content placeholders. The extracted features were then used to fill in the prototyping card templates. The final prototyping cards are shown in Figure 1 and Supplementary Material S2.
After interviewing the first five participants, we revised some prototyping cards based on participants' feedback. For instance, we indicated the position of the input data point on the feature shape and feature interaction cards. We also removed several variations of the cards, since participants found them harder to interpret.
Fig. 6.
The user study procedure. The study consisted of two rounds, corresponding to the two research questions. In the example shown, we use the Bird task and the explanatory purpose of calibrating trust.
The study session consisted of a one-to-one, in-person, open-ended, semi-structured interview and a card sorting activity.
Interview. The interview consisted of two rounds (Fig. 6). The first round was to familiarize participants with the tasks and explanation needs, and to understand end-users' explanation needs for XAI (RQ2) before showing them the prototyping cards. The participant was first introduced to an AI-assisted decision-making task and its corresponding explanation need scenarios. Each task and need scenario was shown as a storyboard color-printed on paper. For each explanation need, we asked the participants whether they would accept AI as decision support, and whether they needed AI to explain its decision. If explanations were needed, we then asked what explanations or further information they would request. After discussing all the explanation needs for one task, the participant entered the second round: card sorting, which is detailed in the next section. At the end of the interview, the participants filled out a demographic questionnaire. The average study session lasted 65 minutes (each participant's study duration is in Supplementary Material S2). We audio-recorded the interviews, made observational notes on the card selection and sorting process, and took pictures of the card sorting results.
Prototyping via card selection and sorting. For each decision-making task, the participants first revisited the task. Then the researcher walked through the prototyping cards created for the explanatory forms of that task. In this process, the participants could ask questions if they did not understand or needed clarification, and could comment on each card. Before moving to the next step, we made sure they had no questions or concerns about the cards. Next, for each explanation need scenario, the researcher asked participants to select, rank, and combine the prototyping cards that they found the most useful and that could meet their explainability needs. They could also sketch on blank cards to create new prototyping cards, and add the newly created cards to the card sorting. After sorting the cards, they were asked to comment on why they selected or did not select a card, and their rationales for the sorting (RQ1). After the card sorting, they were asked whether the combination of cards would fulfill their explainability needs.
We used a mixed-methods approach to analyze the data. For the qualitative analysis, we analyzed the interview data using an inductive thematic analysis approach [23]. About 2800 minutes of interviews were recorded and transcribed. We performed coding using the NVivo software. Three members of the research team started with an open coding pass to individually create a list of potential codes. Two additional sets of codes were also applied: 1) the 10 explanation needs (listed in Section 4.2.2); and 2) the 12 explanatory forms in our framework (listed in Section 3). Upon discussion and applying an affinity diagram process, a unified coding scheme was devised, and two team members independently coded one transcript using this scheme. The first-pass inter-rater reliability kappa score was 0.43. After an in-depth discussion within the research team, we further clarified the code definitions, merged overlapping codes, and removed redundant codes from the coding scheme. The second-pass inter-rater reliability kappa score was 0.88 on two new transcripts. The first author analyzed all interview transcripts twice, and the other coder analyzed half of the transcripts. We also conducted quantitative analysis on the card sorting and on participants' responses to the questions asked in the interview. Since the quantitative results are less relevant to the research questions, we put the quantitative methods and results in Supplementary Material S2.
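As a side note on the computation, a kappa agreement score between two coders can be obtained as in the minimal sketch below; the label sequences are invented placeholders rather than the study's codes, and Cohen's kappa is assumed as the two-coder statistic.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical codes assigned by two coders to the same ten transcript segments
coder_1 = ["trust", "bias", "trust", "improve", "learn",
           "trust", "report", "bias", "learn", "improve"]
coder_2 = ["trust", "bias", "learn", "improve", "learn",
           "trust", "report", "trust", "learn", "improve"]

print("Inter-rater reliability (Cohen's kappa):",
      round(cohen_kappa_score(coder_1, coder_2), 2))
```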
To avoid redundancy, we present the quantitative and qualitative results together. The explanation needs are marked in blue, and the explanatory forms in orange. The ratio of responses is shown in the form rule (12/15), which means that out of 15 card-ranking responses, 12 selected the explanatory form rule. We highlight key messages in bold font. Whenever necessary, we include participants' verbatim quotes despite some minor grammatical errors. Some quotes have their task and explanatory purpose indicated.
We present the primary findings from the user study by detailing the fine-grained properties of the 12 explanatory forms (pros, cons, applicable explanation needs, and design implications).
Pros. In the study, we used a bar chart to represent feature importance scores for tabular data, and a color map and bounding-box object detection for image data (Fig. 1). All participants intuitively understood feature attribute, and over half selected it (143/248) and ranked it at top positions.
"Feature attribute uses a simple way to highlight the most important parts, and you can see very clearly at your first sight how this can be recognized." (P04, Bird, Learning)
"It's easy to read. ...And you have a bar (chart) here it's really clear information that people understand instantly." (P28, House, Trust)
By showing "finer details" (P10) and the "breakdown and weights of features" (P23) "that AI took into account" (P31), participants perceived that feature attribute can answer "how" and "why" questions:
"tells me why" (P20), "gives me the behind the scenes" (P24), "tells me how AI read things and how it makes decisions" (P03), "have an understanding of how much weight AI is giving to each of the factors" (P22), and "identify key aspect, ...support its reasoning" (P18).
Applicable Explanation Needs. By checking the feature importance ranking, participants would instantly "compare with my own judgment, to see if that aligns with my feature attribute" (P01, Car, Safety), especially when participants need to verify AI's decision.
Cons. Although a causal relationship may not be confirmed, some participants tended to assume feature attribute is causal, or to simplify the relationships among features by assuming they are independent from each other. This usually occurred when they were seeking an explanation to improve the predicted outcome, and participants were likely to use the feature importance scores to prioritize the most important features to take action upon.
"Seeing that body weight is more important than exercise, I think I will focus on changing what I ate, instead of like responding by going to the gym everyday." (P16, Health, Improvement) - Relies on feature attribute to improve the outcome.
"It (feature attribute) shows what are the most important factors that AI has taken into account, so you could target the biggest factors." (P31, Health, Improvement) - Assumes a causal relationship and prioritizes the action.
"If my blood sugar puts me at a super high risk here, but my caloric intake doesn't actually put me at that higher risk, it's like a lower risk, then I would rather just focus on blood sugar." (P22, Health task) - Ignores the complex interaction between blood sugar and caloric intake.
Design Implications. To avoid the above causal illusion, the UI/UX design may need to warn users, either implicitly or explicitly, that changing the important features may not necessarily lead to a change in the real-world outcome, because correlation does not necessarily imply causality.
For the UI/UX design of its prototyping card, designers may consider varying the representation of feature importance, such as showing the feature ranking only and allowing users to check the detailed attribution scores on demand, or allowing users to set a threshold on the attribution score and only showing features above the cut-off value, as suggested by a few participants.
"If the percentage (of the feature) is below the cutoff value, the users does not need to see (the feature), reduce the cognitive load." (P04, Bird, Learning)
Pros . Participants liked its graphical representation of showing the relationship betweenone feature and prediction. “It (feature shape on exercise and diabetes risk) feels so easy to latch onto like it’ssomething that you can impact and something that’s very tangible.” (P22, Health,Trust)
Applicable Explanation Needs . The slope of the curve in feature shape line chart allowsusers to easily check how changing one feature would lead to the change of the outcome, thusmany participants intuitively used feature shape for counterfactual reasoning , especially toimprove the predicted outcome. “I would be interested to see how much like here (feature shape) increasing the exerciseby a small amount actually makes a really big difference. So that’s also helpful to decidewhat you should be focusing on to try to avoid it (diabetes). The shape of the curveactually helps. Coz if I was out here [pointing to the flat part of the curve], then it wouldnot be as helpful for me to increase my exercise.” (P16, Health, Improvement)By showing the relationship between the protected feature and outcomes, it also helps to revealbias, i.e.: to check if the different values of the protected features (such as male, female) will leadto differences in prediction (such as loan approval). “If these features are related to diabetes, then it (AI) should present some (feature shape)cards to tell me if the gender, age and ethnicity (will affect diabetes prediction), so thisimage (feature shape) would be really helpful.” (P02, Health, Bias)
Cons. One drawback of feature shape pointed out by a few participants is that it does not consider feature interactions.
"This one (feature shape on house size and price) is not based on the bigger the house, the higher you can sell, because it is based on a lot of features. Let's say the house is 2000 square feet. It was built in 1980. Another one is 1000 square feet, but it's just built a decade ago. So its (the latter) price will be much higher than this one (the former). You cannot just base on a house area and then determine the price." (P30)
Another drawback is that one feature shape graph can only present one feature; showing multiple features' shapes requires multiple graphs, which may -
"make your page so overloaded, so people just get tired. You want to make it as clear as possible. So if (there is) some unnecessary information people just intimidated." (P28)
Design Implications. One suggestion for the above weaknesses is that feature shape can be accompanied by other explanatory forms and shown on demand. Users can select the features they are interested in from a feature list or from other explanatory forms such as feature attribute, counterfactual example, or rule, and choose to view feature shape diagrams of the selected features, as participants suggested:
"If I can click on this (feature attribute) and then I can get this chart (feature shape), I think that would be good. I don't think everyone is going to click it, but I think (if) people want more information, you will click it." (P20, House)
Many participants tended to check the local position of their input data point on the global feature shape diagram.
"It's good to see where exactly on a (house price) scale you are." (P20, House, Trust)
And P30 suggested that feature shape could assume the other, held-constant features to be as similar to the user's input features as possible: "The AI should assume all the other features are almost the same as mine, considering this hypothesis then this is the (feature shape) curve".
Applicable Explanation Needs. Since feature interaction just adds one more feature to the (feature-outcome) diagram to show feature-feature interactions, it can be regarded as an expanded version of feature shape, and many of the above findings on feature shape apply to feature interaction as well. Similar to feature shape, feature interaction also supports counterfactual reasoning, by including two or more features instead of the one in feature shape.
"(feature interaction on age-body weight interaction) If you put yourself in a hypothetical guessing, you're in this age and this is your body weight, and you can already tell the chances (of diabetes) are high." (P23, Health, Trust)
Cons. "The graph is less accessible to understand" (P22). In our study, only a few participants could correctly interpret the 2D heatmap of two interacting features.
Design Implications. Similar to feature shape, participants would like to choose the feature pairs they are interested in and check their interactions on the feature interaction diagram. Since the number of feature combinations is large, the XAI system may suggest interesting feature interactions and prioritize the feature pairs that have significant interactions.
"If I click on any two of them (features), show the relationship between them. If I can choose age and blood sugar level, then probably there is some correlation between them.
If it is statistically significant, then I would want to know that. If there is no significancebetween, for instance, age and body weight, then I don’t think it should tell me that. Ifthe AI can tell me that this combination really is important for you to look into, thenthe priority would also make a lot of sense.” (P23, Health, Unexpected)
In our study, most participants regarded both similar example and typical example as similarexamples. Only a few participants got the idea of typical example that “you’re getting the average” (P20). Thus in this section, we state the themes on similar example as well as the common themesof similar and typical example.
Pros. Participants intuitively understood the concept of similar example. Similar example uses analogical reasoning to facilitate the sense-making process.
"It just intuitively makes sense to me. ...similar and typical example are much easier. I don't have to think about them before figuring it out." (P16, Bird, Trust)
"(similar and typical example) It's similar to how humans make decisions, like we compare similar images to the original (input) one." (P02, Bird, Trust)
Applicable Explanation Needs. Unlike other explanatory forms that reveal AI's decision-making process (such as rule-based explanations), "even though these (similar and typical example) aren't much specific about how it's actually doing the (decision) process" (P16), participants' minds automatically made up such a process by themselves by comparing instances. Such comparison mainly allows users to verify AI's decisions and to calibrate their trust. The common explanation needs for which similar example was selected are:
1) To build trust, especially on a personal and emotional level.
"This (similar example) made me trust on an emotional level. Because I'm thinking, 'Oh really? I am only 33 years old.' Like I probably not going to get diabetes. But then I'm reading about somebody that does (get diabetes), that sounds a lot like me, it kind of emotionally makes me feel like, 'Oh geez, maybe it is accurate.' So this (performance, output) is like using my brain, and this one (similar example) kind of got me in the gut like, 'Oh, okay. This could actually happen to me. It happened to this person who sounds a lot like me.'" (P16, Health, Trust)
2) To verify the decision quality of AI.
"It's like a proof for my final decision." (P30)
"Because AI has only 85% accuracy, I want to see similar ones, and what AI thinks they are." (P14, Bird, Trust)
"If it doesn't align (with my prediction), then I want to see some similar houses to remake the judgment." (P04, House, Unexpected)
3) To assess the level of disagreement when AI made an unexpected prediction, and to reveal potential flaws of AI.
"If my prediction appears in (a list of) similar examples, it allows me to judge whether AI is completely unreliable or just need some improvement." (P01, Bird, Unexpected)
Cons. Showing examples for comparison may not be applicable when the input data is incomprehensible or difficult to read and compare.
"I think (similar and typical example) it's not important to me. Because I need to read other people's status, read their records." (P02, Health, Trust)
In addition, participants easily got confused when instances in similar example have divergentpredictions . This problem might be solved by typical example which is stated in Section 5.5. “(similar example) It’s not really telling you if it (the input) is the one (prediction), soit could be this (prediction) or this or this [pointing to different predictions on similarexample card].” (P26, Bird, Trust) “This one (similar example) has too many choices (predictions), it’s too confusing.” (P05,Bird, Trust)
Design Implications . As mentioned above, participants had to compare the features insimilar example by themselves. It’s important for the XAI system to support such side-by-sidefeature-based comparison among instances such as input, similar, typical, or counterfactualexample, especially when the input data format is difficult to read through. “I don’t want to read the text (in similar and typical example), it is better to showthose features and examples in a table for me to compare directly, also highlight theimportant features as an analysis process.” (P29, Health, Trust) “ Maybe it could help the doctor to pinpoint things that are similar or different betweenthese cases.” (P31, Health, Communication) “I would like a comparison. That’s my own house (input), which probably will be off thetop somewhere. And I’m comparing it with other information (typical example andcounterfactual example). So in a column, and I can compare it. For the layout, maybeyou can do a product comparison.” (P03, House, Expected)
Pros . One drawback of similar example is that it may make users confused about simi-lar instances. Typical example may solve this problem since the typical examples for differentpredictions are more distinct and separable than nearest neighbors of similar example. “(typical example) You actually made a category of each one. I remember in cognitivepsychology, there’s a theory. I don’t remember the name, but if you clearly separate eachcategory, that helps people to differentiate the different categories, then remember. Butfor this one (similar example), you have to read every one (instance) of them.” (P04,Bird, Learning)
Applicable Explanation Needs . Since typical example represents the typical case for theoutcome, it may help to reveal class-specific characteristics or even potential problems in the AImodel or data, for example to reveal bias . “If I’m concerned about what group the data is coming from, I would love if the typicalcase like the average that comes up says like, male, this age, and the factors were quitedifferent from mine, then I kinda go, ‘huh?’ But if it could give me a typical case that’sactually quite similar to me, then I would be less worried about it not performing wellwith my group.” (P22, Health, Bias)Unfortunately, most participants did not realize the meaning of typical example and did not makeuse of such “debugging” property. Design Implications . In addition to show typical example of different predictions (between-class variation), in some cases, it might be beneficial to show different variations of typicalexample for a particular prediction (within-class variation). “It’s showing different pictures of the same bird, and the colors even look different. Soit’s saying maybe, ‘Oh, I get it, we have the male and female.’ So it’s showing differentlooks that the bird can have.” (P06, Bird, Learning)Opposite to typical example, some participants expected to see non-typical or edge cases thatrepresent rare but severe consequences, mainly due to safety and bias concerns. “So they (similar and typical example) don’t really provide enough information aboutwhen the weather is different and when you’re driving at night, the results from non-typical conditions.” (P27, Car, Bias) “I still don’t know if the dog jumps out of nowhere. so maybe the (similar example)similar traffic conditions can see the extreme cases.” (P03, Car, Safety)
Pros and Applicable Explanation Needs. In our study, counterfactual example was shown as two instances with different predictions, with their feature differences highlighted while keeping the other features the same (Figure 1). This format can serve different explanation needs depending on the task context. In the predictive tasks (House and Health), participants regarded counterfactual example as the most direct explanatory form for suggesting an improvement.
"For renovations, I think that's (counterfactual example) the only card I would choose. The only one that really tells me that I can do something to increase the price." (P20, House, Improvement)
In the recognition task (Bird), counterfactual example is suitable for showing the differences that differentiate two similar predictions.
"Counterfactual example let me learn their relationship, highlight the difference between the two (birds). Help me remember the different features." (P11, Bird, Learning)
Cons. Some participants did not understand the meaning of counterfactual example, and could not capture the nuance between feature attribute and counterfactual example, since both have features highlighted but for different reasons (feature attribute highlights the features important for the prediction, whereas counterfactual example highlights the features that need to change for the alternative outcome to happen). Counterfactual example also risks confusing participants about similar instances, especially in recognition tasks.
"I think this tool (counterfactual example) will make me remember the wrong thing. I'm already confused. It shows information that is similar." (P11, Bird)
Thus it may not be suitable as an initial explanation and may only show up on demand, for example for the two explanation needs of improvement and differentiation mentioned above.
Design Implications. The two contrastive outcomes in counterfactual example can be user-defined or pre-generated depending on the specific explanation needs. One outcome is usually from the user's current instance such as the input, and the alternative outcome can be: "the next possible prediction" (P18, Bird, Report), the user's own prediction when there is a disagreement (Unexpected), the prospective outcome for improvement, or the easily confused outcome for differentiation.
The generation of counterfactual features may also accept user-defined or pre-defined constraints, such as: 1) constraints on the counterfactual feature type to include controllable features only (see Section 6.8 Improvement on controllable features); 2) personalized counterfactual suggestions based on features that users look upon: "the recommendation should be a lot based on what I do" (P24, Health); and 3) constraints on the range of specific counterfactual features: "AI should accept my personalized constraints on budget" (P01, House, Improvement). Given these constraints, the XAI system can also provide multiple improvement suggestions for users to choose from (P01, P11), and may give weights or a relative ranking of the multiple suggestions.
Many participants noticed that the formats of rule-based explanations (rule, decision tree) provided "basically the same information" (P02, Health) and "all show the decision process" (P10, Bird), and differ only in their text (rule) or graphical (decision tree) representation.
Pros. Several participants regarded that rule can "explain the logic behind how the AI makes decisions" (P27). Particularly, the text description format is "like human explanation" (P01, House, Trust), and "simple enough and understandable" (P11).

Applicable Explanation Needs. The above pros make it suitable for verbal (Section 6.9 Communication) and written communication (Section 6.10 Report). The text format may also help to dispel confusion, since some participants regarded text as more precise than images, thus facilitating learning.

"In this case (Bird, Unexpected), I don't want to see the highlights (feature attribute). I want it to see points, the specific parts and give me some explanation. If I'm trying to prove myself wrong, or if I want to see how AI system can prove me wrong, I want to see more precise text, and precisely point out the important information." (P04)

"The written helps because it's more exact, whereas the pictures, ...the blue in the picture might not be the blue that was in the written." (P05, Bird, Learning)

"(rule) It's listing out something that a person might miss in the picture." (P18, Bird)

However, when the input is image data, some participants also mentioned that providing text explanations alone was not enough.

"(rule) It doesn't really show you the bird that you were looking at. Lots of birds have small thin bills, short tails... if I can't see a picture of it, then it's not as helpful." (P06, Bird, Trust)

And many participants suggested "ideally you'd want both written and pictures" (P05) to complement each other.

Cons. Rule is very sensitive to the degree of complexity in its text description: an increase in rule length or number of features will dramatically reduce its simplicity and the above advantages [71]. However, if the rule clauses are short, the explanation may not be precise and satisfying either, as P06 pointed out: "It (rule) is just too broad, it could apply to so many other birds."

Another concern is that, since participants lack technical knowledge, some of them misinterpreted rule as instructions humans fed to the AI.

"(rule) it is giving very clear instructions to the AI, like written text instructions, these are already fed into the system." (P09, Car, Safety)
Design Implications. To reduce the cognitive load of complex rules, a few participants suggested trimming the rules by only showing the shallow levels, or only showing rules containing the current input ("just show rule related to my own house features", P30); users may then query details on demand.
To balance explanation completeness and usability, if the full rule set is shown, it is beneficial to highlight the local rule clauses describing the current instance on top of the global rule explanation.
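One way to realize this design implication is to derive, from the global rule set, the clause chain that applies to the user's own instance and surface it first. The following is a minimal sketch assuming a scikit-learn decision tree and hypothetical housing features; it is an illustration only, not the system used in the study.

# Extract the rule clauses along the path the user's instance follows in a trained
# tree, so the UI can highlight this local rule on top of the full (global) rule set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, _tree

X, y = make_regression(n_samples=200, n_features=3, random_state=0)   # placeholder data
feature_names = ["house_area", "dist_to_school", "appliance_age"]     # hypothetical feature names
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def rules_for_instance(tree, x, feature_names):
    """Return the chain of rule clauses that instance x satisfies, plus the leaf prediction."""
    t = tree.tree_
    node, clauses = 0, []
    while t.children_left[node] != _tree.TREE_LEAF:
        f, thr = feature_names[t.feature[node]], t.threshold[node]
        if x[t.feature[node]] <= thr:
            clauses.append(f"{f} <= {thr:.1f}")
            node = t.children_left[node]
        else:
            clauses.append(f"{f} > {thr:.1f}")
            node = t.children_right[node]
    return clauses, t.value[node][0][0]

clauses, prediction = rules_for_instance(tree, X[0], feature_names)
print(" AND ".join(clauses), "->", round(float(prediction), 1))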
Pros. Similar to rule, participants regarded decision tree as "the most logical one" (P20) that "tells you the decision-making process" (P04):

"(decision tree) shows the process of thinking with AI, what it's going to do with the information." (P10)

"how the algorithm is working, what the machine is thinking about when it's coming up with the prediction." (P16)

Applicable Explanation Needs. Participants mentioned that an advantage of decision tree is to differentiate, possibly due to its unique tree layout:

"It explained very well what's the difference between them (the two confusing instances)." (P04, Bird, Report)

"It would show you how to pick up the different types of variants." (P10, Bird, Report)

"I think this (decision tree) is the graphic comparison, like this beak might be sharper or smaller than this one, all those comparisons help." (P09, Bird, Unexpected)

Such an advantage also supports counterfactual reasoning by checking alternative feature values on the adjacent branches.

"(Decision tree) can see how to improve. It has a comparison with different outputs." (P29, Health, Trust)

"Where does my house stands, if I'd be here, then I maybe try to change some of my features, to see how do these features affect my house price, or other houses compared to my own house." (P30, House)
Cons. Several participants brought up its weakness in communication and interpretation.

"(Decision tree) is not natural language, it is more difficult to explain to my family." (P01, House, Communication)

"This is more like a logical thing for me to see. But I wouldn't use this as an explanation to family, because that's just weird. I don't want to rack their brains too much." (P20, House, Communication)

Indeed, even with a two-feature, two-layer decision tree in the study, a number of participants commented:

"It's confusing." (P05, Bird, Learning)

"It got too much information." (P16, Bird, Unexpected)

"I don't really understand this one. I think it's a little bit complicated." (P08, Bird, Learning)

Since it is less interpretable than other forms, some participants suggested showing it on demand.

"I don't think these two (decision tree, decision flow chart) are necessary to show in the first UI. Maybe these two can be hidden in an icon that says 'process'. Because it (decision tree) is more like a program in process." (P04, Bird, Trust)

Besides the tree structure, we used another flow-chart visual representation (decision flow chart) in the study. In the tasks where the inputs were images (Bird and Car tasks), quite a few participants found neither the tree nor the flow-chart structure helpful, and they only focused on the salient features or objects the flow chart shows.

"I don't think it (the flow chart structure) matters, just the head and the belly (the highlighted region shown in the flow chart) matter." (P14, Bird)
Design Implications. Similar to the suggestions for rule, to reduce its complexity, one participant suggested trimming the tree to show just the main branches, hiding the deeper branch details and only showing them on demand.

"You could use this one (the two-feature decision tree) as a beginning, based on this, and you click (one branch) to another in-depth version of the price calculation. Because this (price prediction) range is still very far wide, and the features given is not enough, so if you want to (check details) maybe click and (it will) add more features to it (that branch), then get a narrow range (of prediction)." (P28, House)

Although rule-based explanations are global explanations, many participants tended to seek the branch pathway where their own input resides. It served as a local explanation on top of the global explanation. This suggests an XAI system may only show the branches containing the instances of interest, or highlight the branches for those instances. Participants did so to verify AI's decisions, and to compare with other counterfactual instances.

"I know there're factors that could be other houses that lead to different prices, but I still see it as, 'okay, I plug in my own numbers here and what's my price?' So it's still specific to me." (P20, House, Trust) – Displays local as well as global explanations

"The only thing we need is to indicate my own position on this (decision tree) branch. ...Then I can chase the features of my house." (P30, House, Unexpected) – Suggests highlighting the pathway for the user's instance of interest
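A minimal sketch of this suggestion, using scikit-learn's decision-path utility on a placeholder regression tree (the dataset and model are stand-ins, not the study's): the node ids returned for the user's instance are exactly the branch a UI could highlight on top of the globally displayed tree.

# Find which branch of a global decision tree the user's own instance follows,
# so a renderer can bold exactly those nodes and edges.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)                       # placeholder data
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

path = tree.decision_path(X[:1])                            # sparse indicator of visited nodes
highlighted_nodes = path.indices.tolist()
print("nodes to highlight:", highlighted_nodes)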
Input serves as necessary background information, and participants regarded input as a "profile" (P24) "stating the facts" (P20). It allows participants to understand what information AI's decision is based on, and can help "debug" to see "if AI is missing the most important feature" in the input (P22, Health, Bias), and "whether or not the input is enough for it (AI) to make that decision" (P16, Health, Trust).

When checking input, participants tended to intuitively "look for certain features" (P14) to judge by themselves. And in the card sorting, some participants used input as an anchor, putting it side by side with example-based explanatory forms (similar, typical, and counterfactual example) for comparison. Quantitative results led to the same finding, as input was clustered together with the other example-based explanatory forms (Figure 7; Supplementary Material S2).
Fig. 7. Visualizing the similarities of the explanatory forms. Explanatory forms that are close to each other are more likely to be selected together to construct an explanation. The total number of times an explanatory form was selected is indicated as the number below its name, and is proportional to its dot size. The dot color indicates which of the four categories in the framework the explanatory form belongs to: feature, example, rule, and supplementary information. K-means clustering analysis on the 2D positional data yielded four clusters:
Cluster 1: output, performance, dataset;
Cluster 2: feature attribute, decision tree, rule, decision flow chart;
Cluster 3: typical example, similar example, counterfactual example, input;
Cluster 4: feature shape, feature interaction.

In our study, the output card contained a point prediction, a prediction range, and their corresponding uncertainty level (for regression tasks), or the top three predictions and their likelihood (for classification tasks) (Fig. 1). For the presentation of output information, some participants preferred to check the point prediction at the beginning, and to check the detailed prediction range and uncertainty level on demand or leave them to the end, since they "need a longer time to understand what these numbers mean" (P02).

Participants had divergent preferences and understandings of the prediction presentation form. Compared to a point prediction (e.g., the house price prediction is 650k), some preferred to see a prediction range in regression tasks (e.g., the house price is 638-662k), or a list of top predictions in classification tasks, because such a prediction range "give choices" (P05, Bird classification task, Differentiation), "acknowledges a possibility" (P18, Bird, Unexpected), ranks the decision priorities (P03, Car classification task, Safety), helps them "(the range) to see how different between my and AI prediction" (P01, House regression task, Unexpected), and provides room for adjustment and negotiation:

"If I want to sell it higher, and I'll put 662k (the upper bound). Or if I wanted to sell it fast, then I'll put 638k (the lower bound). There's always a range, it's not necessarily just one price. And people will always bargain too." (P20, House, Communication)

And sometimes they "don't even need to know the (prediction) number exactly. This (range) tells me that (my diabetes risk) it's high. I have to do something. So that's what I want to know" (P17, Health regression task, Trust), and the range gives a higher certainty than a single point prediction, which enhanced participants' trust.

In contrast, some other participants were more receptive to a narrower range or a point prediction, because they saw that a wider prediction range has its drawbacks: "(the prediction range) shows too much fluctuation" (P07, House, Trust); and seeing the full prediction list (some entries with lower prediction likelihoods) may confuse them and discredit AI's decisions. Thus a narrower range may give them more confidence about AI's prediction.
"Seeing that the range is pretty small makes me a lot more confident that they've got enough data to actually be drawing conclusions." (P16, Health regression task, Trust)

For the prediction likelihood/uncertainty/confidence, some participants had a hard time understanding the meaning of uncertainty and required the researchers' explanations. (Although the AI community has distinct methods to compute output likelihood and uncertainty, in our study we used likelihood, confidence, and uncertainty interchangeably to avoid participants' confusion.) A high certainty "reassure AI's performance" (P22) and "help a lot of persuading yourself into believing in AI" (P10), which is consistent with the recent quantitative finding on certainty level and trust calibration [99]. Especially for the explanation need where AI's prediction is unexpected, participants may abandon their own judgment due to AI's high certainty.

"If it had a high certainty, then I would want to know why my estimation is wrong." (P10, House, Unexpected)

After checking the performance information, most participants realized the probabilistic nature of AI decisions: "AI is not perfect" (P20), "they (AI) make errors sometimes" (P05). If the performance is within their acceptable range, participants would accept the "imperfect AI", and it helped them to set a proper expectation for AI's performance.

"I get its downside. Performance warns me to, 'Hey, you know, it's not really accurate. There's some room for error.'" (P24)

And sometimes participants may calibrate their trust according to the error rate (in classification tasks) or error margin (in regression tasks).

"If there is a really big margin (of error), then it would probably demean the trust." (P23)

Almost all participants understood the meaning of accuracy (error rate) in classification tasks, whereas many participants had a difficult time understanding the margin of error in regression tasks.

"Performance is really in detail. I mean not everyone is familiar with statistics, like mean error." (P30, House regression task)

Unlike the uncertainty level in output (Section 5.10), which is case-specific decision quality information, a few participants noticed that performance is model-wide information, and just providing "general information showing the trust level of the system" (P04) is "too general, I would want to know specifically why (the speed) it's going down in this particular case of driving" (P05, Car, Unexpected). Thus they suggested there was no need to show it every time: "you should know before you use AI" (P11). However, for particular explanation needs such as detecting bias (Section 6.3), participants may require fine-grained performance analysis on the outcomes of interest.

"It (fine-grained performance on road/weather conditions) explains how often I should be confident in rainy days." (P19, Car, Safety)

In our study, the dataset card contained the training dataset distribution of the prediction outcomes. Even after the researchers' explanation, some participants did not understand or misunderstood the information on this card (for example, some misinterpreted the distribution graph as feature shape), indicating it requires a higher level of AI/math/visualization literacy. Among those who comprehended the dataset information, some tended to link the dataset size with model accuracy and trust.
"The higher the (training data distribution) curve goes, then I would be more confident that they have a big pool of data to pull from." (P31, Health, Unexpected)

Some participants intuitively wanted to check their own data point within the training data distribution, and to use it as a dashboard to navigate, identify, and filter instances of interest (such as similar, typical, and counterfactual examples), comparing which features are the same and which differ between their input and the instances of interest.

"I want to see which region I fall in the population, and compare with people around to see why my (diabetes) risk is only 10% with a family history." (P01, Health, Unexpected)

Nevertheless, in practice there may be restrictions on reviewing detailed dataset information because the data is proprietary or private, as brought up by P19:

"I want to know the number of data and the details of it to verify. But I don't know if that's going to be able to be viewed. That's probably secret, right?" (P19, House, Expected)
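As a rough illustration of this "dashboard" idea, the dataset card could be rendered as the training-set outcome distribution with the user's own predicted value marked; the data and values below are placeholders, not from the study.

# Plot the distribution of predicted outcomes over the training records and mark
# where the current user's prediction falls.
import numpy as np
import matplotlib.pyplot as plt

train_risk = np.random.beta(2, 8, size=5000)   # placeholder: predicted diabetes risk for training records
user_risk = 0.10                               # placeholder: the current user's predicted risk

plt.hist(train_risk, bins=40)
plt.axvline(user_risk, linestyle="--")
plt.annotate("your estimated risk", (user_risk, plt.ylim()[1] * 0.9))
plt.xlabel("Predicted diabetes risk")
plt.ylabel("Number of patient records")
plt.show()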
Although participants may have various motivations to check explanations, two main themes of explanation needs emerged in the interviews. Quantitative clustering analysis confirmed similar trends, as visualized in Fig. 9.

The first and fundamental driving force for checking explanations is to verify AI's prediction, and to gain trust in and understanding of the AI system.

"Like boyfriend and girlfriend, I want to know what my boyfriend is thinking. Similarly, I want to know what the car's thinking before I'm with the car." (P32, Car task, Safety) – To gain understanding of AI

"I will also want to know how the software can predict the 80% chance that I'm going to have diabetes. And also, how did they come up with that numbers? Just giving me a number without justification or some verifiable reasons, it's just unlikely I would accept it because it may not be true." (P25, Health task, Trust)

The following explanation needs are more related to this motivation: calibrating trust (Section 6.1), ensuring safety (6.2), detecting bias (6.3), and resolving disagreement (6.4).

The second motivation to check explanations is personal improvement, i.e., to improve users' own welfare, such as enhancing personal problem-solving skills and learning, or improving the predicted outcome. This is built upon the trust and verification from previous experience, as one participant stated:

"(When trust has been established,) what has to be done in the next phase of the (AI) software, is how the software is being helpful to me. ...If I know the result, I don't think I would want to dig in to see why it is, but I would want to see how I can reduce the chances of diabetes." (P23, Health)

The following explanation needs are more related to this motivation: seeking suggestions to improve the outcome (6.8), learning and discovering new knowledge (6.7), differentiating similar instances (6.6), facilitating verbal (Communication, 6.9) and written (Report, 6.10) communication, and balancing among multiple objectives (6.11).
The process of calibrating trust involves multiple factors and their complicated interactions. We summarize the following key themes that emerged from what participants requested in order to calibrate their trust toward AI.
(1) Performance.
Trust towards AI is fundamental for incorporating AI's opinion into the critical decision-making process [97], and many explanation needs below are built on trust. Since end-users usually do not have complete computational and domain knowledge to judge AI's decision process, model performance becomes an important surrogate for establishing trust.

"Even if AI tells me how it reaches its decision, I cannot judge whether it's correct since it is a medical analysis and requires professional medical knowledge. I just know the accuracy and that'll be fine." (P01, Health)

Prior work identified two types of performance, stated and observed performance [95], and both were mentioned in the interviews.
Stated performance, or accuracy, refers to performance metrics tested on previous hold-out test data, and it was mentioned by most participants as a requirement to build trust towards AI.

"I understand maybe AI is learning from past examples, and it may be finding patterns in the data that might not be easy to explain. So I'm less concerned about how it's getting there. I think I do have a trust that is doing it right, as long as there's something you can test after how accurate it's been." (P16, Health)

Compared to assessing the performance metrics, some participants tended to test AI by themselves and gain hands-on observed performance to be convinced. This requires users to have a reference ground truth from their own judgment or reliable external sources.

"(My own test driving experience) is way more useful than watching a (test driving) video, because you shouldn't trust everything. The video might be just made to wait for you to buy the car. So talking from a customer perspective, I would like to try it myself, because I also sell things. So I would always like to try it myself instead of watching a video." (P21, Car)
(2) Feature.
The important features that AI based its decision on were the next most frequently mentioned information.

"I would like to know the list of criteria that the AI chose the price based on, and which one weighs more." (P30, House)
(3) The ability to discriminate similar instances.
This information was requested by several participants to demonstrate AI's capability.

"(Decision tree) It's showing me that it's picking from a few similar ones, not just like a random ray of blue, purple, green birds. It's not random, it's a calculated response. More of that would help me trust AI." (P06, Bird)

"Typical example seems to be pretty good at picking up on differences. Similar example I can see that it's got a good variety of similar birds. So I found these ones make me trust it more." (P16, Bird)
(4) Dataset.
The size of the dataset that AI was trained on is another surrogate mentioned by some participants to enhance trust.

"To me what artificial intelligence does is just collecting a lot of data, and tries to make sense for behavioral patterns. So I would actually trust it, because I think it's just based on data, it is a more accurate measurement of what market rate is for house prices." (P03, House)
"If I know that the AI comes from a large database, it seems like the database is actually the experience that AI has. So the larger it (dataset) gets, the more experienced AI would be, so I can trust it more." (P30, House)

Fig. 8. Heatmap of the explanatory form—explanation need matrix. The darkness level and number in each grid cell is the percentage of an explanatory form selected for that explanation need. The number under each need (on the horizontal top) is the total number of card-sorting responses collected for that need. The number beside each explanatory form (on the vertical left) is the total number of times that explanatory form was selected in the card-sorting data.
(5) External information.
This is another surrogate mentioned by participants to judge whether AI is trustworthy. The external information could include:

(a) Peer reviews, endorsement, and the AI company's credibility.

"Since I'm not really a tech person, so I'm not sure how I look at it in a technical way. So that's why I just really depend on the company's reputation, and also how people feel about the website." (P28, House)

(b) Authority approval and liability.

"I trust more if the government themselves kind of stands behind it, getting some sort of government approval helps it a little bit more. So if there's some health authority like Health Canada or FDA support gives it more legitimacy." (P24, Health)

"For me personally, I would prefer if an actual person is there in the end, at least in the beginning stage. So if somebody is there to just say, 'hi, I'm so and so', and then AI takes control. Then we still know that there is somebody who's liable in the end for whatever happens." (P23, Health)
Preferred explanatory forms.
The top three most selected forms for the need to calibrate trust were performance (20/40), output (20/40), and feature attribute (17/40). This quantitative result corresponds to the qualitative themes above (Figure 8).
To ensure the safety and reliability of the AI system in critical tasks (the autonomous driving vehicle task in our study), participants frequently mentioned checking AI's performance in test cases, expecting the testing to cover a variety of scenarios to show the robustness of safety. Although it is impossible to enumerate a complete list of potential failure cases in testing, extreme cases and potential accidents were the main concerns and focus of end-users.

"Potential crashes or just like someone speeding or a pedestrian jumping out of nowhere." (P19)

"There is likely to be someone running around, so it needs to show me the extreme cases. ...I need to see something like FMEA, failure modes and effects analysis, just to be like, 'okay this is how it works.' Because I know nothing is foolproof. There are always to be something, but to what extent." (P03)

Similar to the need to calibrate trust, alongside the stated performance above, a few participants required observed performance in order to emotionally accept AI as an emerging technology.

"Definitely I would want to be in one car. I think information is not helpful, it's not an intellectual factual thing, it's emotionally not acceptable. It (AI) is new and I have to learn to trust it." (P17)
Preferred explanatory forms.
Regarding the specific information to present in performance testing, participants would like to check the objects detected by AI (feature attribute, 9/13):

"It shows how it detects the important objects and how it makes decision." (P03, P05, P27)

"See if (the feature attributes) align with my own judgment of feature importance." (P01)

Performance (6/13) was also favoured for checking the summary of performance metrics. A performance analysis broken down by test scenario may also serve as a safety alert by revealing the weaknesses of the system.

"Let's say I'm driving on a rainy day, then I know that I should be a lot more careful than when I'm with the car in a normal condition." (P27)

Similar example (7/13) was preferred since it showed "what's the condition or what kind of decision the car gonna make" (P32), although participants did not focus on its similarity nature, but rather assumed it could showcase a variety of cases including the extreme cases. Several participants chose decision tree (6/13) because it "gave me an overview of how the car makes decision" (P27).
Participants were concerned about population bias [66], or distribution shift, where AI models are applied to a population different from the training dataset. Such concern is more prominent when a prediction is based on users' own personal data, and when users are in minority subgroups. Participants wanted to compare and see whether their own subgroup is included in the training data.

"I know I'm in a class, they talked about how a lot of studies haven't been done specifically on women, even though they (diabetes) affect men and women differently. That is probably something I would want to know about, like if it gave me this result and then it had a little note that explained the research was done more on that demographic, so it may be more true for that demographic, but they're just trying to, what's the word, extrapolate to this group where I sit." (P22, female, Health)

Unlike the common bias and fairness problem in AI, where protected features should not affect the prediction, in our Health task on diabetes prediction the protected features (age, gender, ethnicity) do lead to a difference in diabetes outcome (referred to as explainable discrimination in [66]). Participants who were aware of this point required AI to account for such differences among subgroups.

"I know some ethnic groups just by genetic makeup could be more predisposed to diabetes. In order for it (AI) to arrive at this decision, I would think that it has maybe like a sample size of different people with different ethnicities to try to figure out. I would think there'll be years and years of research has already been done of the different groups, different ages that would then be factored in by AI. If I can see it (AI) is using that information, I'll be a lot more comfortable to actually using the AI's recommendation." (P17, Health)

In cases where the AI task is not related to personal information (in our study, the self-driving car task), participants required AI to be able to detect objects and perform equally well in all potentially biased conditions.

"Now we are operating in night time, or different weather, but they (the self-driving cars) still have to be able to see the signs and identify the objects." (P13, Car)

Preferred explanatory forms.
A fine-grained performance (12/24) analysis based on protected-feature-defined subgroups [66] can help users identify potential biases.

"I would want to see the certainty and what the prediction error can potentially be for my demographic versus other groups. If it (the prediction error) is quite low, then I would probably worry less about that." (P22, Health)

Participants chose similar + typical example (12/24; that is, of the 24 card-selection responses on Bias, 12 selected either similar or typical example) to help inspect the data and model, and to compare with other similar instances to confirm their subgroup is included in the model.

"You would want to know what the data that it's being drawn from, is it similar to you?" (P16)

Feature attribute (12/24) was also chosen, since participants wanted to check whether AI could still detect important features in minority conditions.

"I want to see how well AI is performing at night to see what it detected." (P05, Car)
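A minimal sketch of such fine-grained, subgroup-level performance reporting on tabular data (column names and values are hypothetical): the error is computed separately for each protected-feature-defined subgroup, so a user can look up the error for their own group.

# Report the model's error per subgroup instead of one overall number.
import pandas as pd

df = pd.DataFrame({
    "gender":    ["F", "F", "M", "M", "F", "M"],
    "true_risk": [0.30, 0.10, 0.55, 0.20, 0.70, 0.40],
    "pred_risk": [0.35, 0.20, 0.50, 0.25, 0.60, 0.45],
})
df["abs_error"] = (df["true_risk"] - df["pred_risk"]).abs()

# Mean absolute error per subgroup; a user can then check the error for their group.
print(df.groupby("gender")["abs_error"].mean())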
When AI's predictions did not align with participants' own expectations, most participants would "question AI" (P16, P20) and "the contradiction may let me confuse" (P02). Some may lose trust in AI and thus would not go further to check its explanations, if they were confident about their own judgment. Some would check "a trusted second opinion" (P06, P10), or refer to human experts (P12).

But for the majority of participants, explanations were needed "to know why" (P01) and to resolve conflicts.

"I'm feeling conflicted because it's giving me two different information, my own personal belief and AI. So in order to convince me that AI does know what it's talking about, you need to go through the mental validation step [pointing to the ranked explanatory cards]. So by the time I go through this (explanatory cards) and I come out of it, I am extremely convinced." (P24, Health)

Explanations help to identify AI's flaws and reject AI, or to check the detailed differences, be convinced, and correct the user's own judgment, although "it might be harder to persuade me" (P31). Specifically, participants "try to understand what makes a difference (between AI and my prediction)" (P03), which is similar to the need to differentiate (Section 6.6).
To show why the predictions are different, many participants required a list of key features.

"Because AI cannot think like a human, so the reason that I ask for the criteria list is trying to think how similar to me is AI's thinking. So maybe AI is thinking better, or is seeing a wider range, so it's checking things that I've never thought about." (P03, House)

In case AI made errors, seeing what AI's decision is based on can facilitate the user's "debugging" process. Although end-users cannot debug the algorithmic part, they may be able to debug the input, to see whether AI "have the complete information" (P03) as users have. Furthermore, if some key input information is missing from AI's decision, the system needs to allow users to provide feedback by inputting more information (P03, P24), or to "correct the error" (P16) for AI.

Preferred explanatory forms.
Feature attribute: 28/61; similar example: 25/61; decision tree: 23/61; performance: 20/61.
In contrast, when the prediction matched participants' expectations, participants "will trust the AI more" (P10), and the motivation to check explanations was "not as strong as the previous one (unexpected explanation need)" (P02). Some participants stopped at the prediction, willing to accept the "black-box" AI and perhaps "not even waste my time (checking explanations)" (P20). A few participants still wanted to check further explanations, with the following motivations:

(1) To boost the user's confidence.

"Even in this (expected) scenario, it would be nice to have some bullet points, like the reasons behind it the estimation being accurate, because if someone says that you're charging me way too much, I can have point by point reasons explaining to you why this house worth this price, it actually kind of as a confidence boost to think you are not overcharging or undercharging." (P03, House)

(2) To improve the outcome (see Section 6.8 for more findings).

"If diabetes already runs in my family, (and AI predicts my risk of diabetes is 80%), it would probably make me more confident about the software. So I might want to ask for more information about which aspects of my health records were the most important for making this decision? Coz then maybe that can help me with my future activities and changing things in the future." (P31, Health)

Preferred explanatory forms.
Feature attribute: 13/34; similar example: 12/34; typical example: 10/34.
To facilitate end-users' need to differentiate similar instances, AI is first required to have the ability to discern similar instances.

"Depends on how good it is... So I think you would have to improve how AI picks up the birds, like maybe these are the same color birds, but maybe they have slightly different characteristics. So if AI can pick that up, then I think it would be better." (P10)

And in case of a doubtful prediction, participants expected AI to indicate how certain it is about the prediction.

"I would expect AI if it doesn't know, it would give choices. So it would say 100% or 99% that's an indigo bunting, and 89% it thinks it's a finch." (P05)
Based on that, AI needs to be able to "pinpoint unique features that made them really different from each other" (P06). In addition, the interface may also need to support users' own comparison.

"AI can tell you what the differences are. I guess it could be some list of the beak is longer for this and that. But I think visually bringing the differences up side by side, and then I can directly compare what the differences are." (P16)

Preferred explanatory forms.
Rule (12/14) and counterfactual example (10/14) were the most preferred forms. Participants chose rule since "you could write that you differentiated the bird's tail were long or short, or beak thin or thick" (P10). The counterfactual examples "identify where specifically to look" (P16), and "describe the change, the progress" (P11).
Using AI for the user's personal learning, improving problem-solving skills, and knowledge discovery "depends on how reliable it (AI) really is" (P10). And participants expected AI to "receive human feedback to correct its error and improve itself" (P01).

To facilitate learning and knowledge discovery, "just looking at (input) pictures and (output) names isn't enough" (P10), and participants expected a wide range of explanations depending on the particular learning goal, such as "more details to systematically learn, go over that same bird, ...a mind map to build a category of birds by one feature" (P02), or "the specific characteristic about this bird, and how can I differentiate this bird from other birds" (P04). Other learning features mentioned by participants include referring to an external "respectable source" (P18), supporting personalized learning for unfamiliar terms (P04), and "collecting information about how well I'm doing on it, like if I guess wrong, does it record that? To see if I'm progressing" (P16).
Preferred explanatory forms.
Rule-based explanations (rule: 12/14, decision flow chart: 10/14, decision tree: 8/14) were the most favoured for the need to learn, since they showed "a learning process. It has like how you could recognize a bird. So help me to learn some new knowledge" (P02). As in Report, participants preferred to see "the graphics and text combined" (P02): "It combines text and pictures, and they are relevant to each other. It's kind of a multi-modal learning" (P04).
Participants intuitively sought explanations to improve the predicted outcome when predictions were related to their personal data (in our study, the House and Health tasks). However, they tended to unwarrantedly assume the explanations were causal (a causal illusion [65], i.e., believing there is a causal connection between the breakdown factors and the outcome), even though the cause-effect relationship has not been confirmed and AI largely relies on correlation for prediction [73]. Only a few participants required more solid evidence to support the explanations on improving the predicted outcome, especially when the action related to critical consequences (personal health outcomes).

"I presume the recommendation (on improvement actions from AI) is also has been backed up by Health Canada, because I think I would tend to follow the recommendations if I know there's definitely medical support behind it." (P24)

"I would definitely want to know like what can I do to mitigate those risk factors or to address those things so that I can decrease the risk. I would really like to know if it had an explanation of how reliable each source was. Coz I know some studies, they might seem like a correlation, but it doesn't mean it's a direct cause. So I would really love it if it could potentially explain how powerful those studies are suggesting." (P22)
Fig. 9.
Clusters of the explanation needs. Explanation needs that are close to each other have similar patterns in participants' explanatory form selection. Specifically, each explanation need is represented by a 12-dimensional vector, where each number in the vector is the total number of times an explanatory form was selected for that explanation need. We visualize their relative distances in a 2D scatter plot using PCA dimensionality reduction. Explanation needs are marked by different colors indicating the cluster they belong to using k-means clustering:
Cluster 1: Trust, Communication, Unexpected;
Cluster 2: Safety, Multi-objective alignment, Bias;
Cluster 3: Expected, Improvement;
Cluster 4: Differentiation, Learning, Report.
Regarding the specific requirements on the explanations for improvement, participants were looking for controllable features and ignored the features that cannot be changed.

"I can not change my age, but I'm able to reduce my weights." (P02)

Knowing the controllable features has a positive psychological effect, giving users a sense of control, and vice versa.

"If I'm afraid of getting diabetes, and assume that I'm going to sentence, it feels like there's nothing I can do about it. But when I see this one (feature attribute), I think, 'oh geez, maybe there are other factors here that I can do something about.' So this may make me more positive about doing something about my condition." (P16)

"I know it (feature interaction) is comparing my house area and my number of rooms with other houses. I can understand 'okay if I increase my room number, the price will be increased that much.' But the problem is I cannot change any of them (the house features). It just gives me the feeling of disappointment." (P30)

To counterbalance the unchangeable features, users may intuitively apply counterfactual reasoning to compare different feature adjustment settings.

"If I make any change in my house appliance and renew, then I can still reach the same price as if my house was bigger." (P30)
Preferred explanatory forms
Counterfactual example (18/26) and feature shape (13/26) were the top two selected forms. While counterfactual example (Section ??) provides how to achieve the target outcome change by adjusting the input features (counterfactual reasoning), feature shape (Section 5.2) (and feature interaction) allows users to adjust features and see how that leads to an outcome change (transfactual reasoning [42]).

To communicate with other stakeholders, some participants chose to communicate verbally about their opinions without mentioning AI. Others preferred to present stakeholders with more evidence by bringing AI's additional information explicitly into the discussion. In the latter case, the other stakeholders need to establish a basic understanding of, and trust towards, AI before discussing AI's explanations.

"I'd sit down and get my family together and explain about the artificial intelligence thing." (P12, House)

"I would try to get some evidence from it (AI) that I could take to the doctor to get them to buy into it." (P16, Health)

To do so, most participants chose to present AI's performance information to build trust.

"As long as the backstage is accurate and then I can just provide accuracy to my wife and she'll be able to get that. Trustworthy is the most fundamental." (P28, House)

Different audiences and explanation needs of communication may require distinct explanations, as described by P32: "I'm pretty sure my husband or my mother has a different way to decide or they want to know different things."
In addition, in the Health task, we asked participants to communicate with family members or doctors about their diabetes predictions. Since the requested explanations covered a wide range of contents, we did not identify any distinct differences in the communicated contents between the two audiences.

A formal summary or report from AI may facilitate communication with other stakeholders, as requested by many participants.

"A written report from AI that I would be able to reference to, in order to talk to my family about that. It would feel a little bit more official rather than just, 'oh, this is what somebody said', there's no real evidence, whereas this sort of creates that paper trail." (P31, Health)
Preferred explanatory forms
While output (21/46) and performance (17/46) provide AI's result and help to build trust, feature attribute (27/46) and decision tree (17/46) show the breakdown factors and the internal logic behind the prediction.
The content of a report may largely depend on the specific explanation need and the readers of the report. In our study, participants frequently mentioned that the report should include "key identifying features", a "list of distinguishing characteristics or what makes it unique" (P09), or "a summary of factors that were part of the input led to the diabetic prediction" (P31). Users also mentioned including supporting information to back up the decisions, such as the training dataset size of the predicted class and the decision certainty level (P01).

Preferred explanatory forms.
Rule (12/14), decision flow chart (7/14), and feature attribute (5/14) were the most frequently selected explanatory forms. Rule descriptions can conveniently generate text reports.

"I have to write the explanation." (P08, P09)

"You can not only by looking at the images and get some explanation. You need some more specific description." (P08)

In addition, adding images to the text "would be complementary" (P10), and the image + text format was favoured by many participants.
"Rule is just describing and writing. It doesn't really show you a visual on how to compare them." (P06)

"Feature attribute and decision flow chart (presented in image format on the bird recognition task) highlights what rule is saying, this knowledge complements your statement." (P10)
In AI-assisted decision-making tasks, it is usually the human user rather than AI who trades off among multiple objectives. Thus, when multiple objectives conflict (in our study, the scenarios were when the car drives autonomously and the passenger gets car sick, and when AI predicts diabetes and the prediction is used to determine the insurance premium), AI was required to allow users to take over or to receive users' input.

"It's the most important thing I would want to do is to allow me to stop, or asking to slow down if I'm feeling sick." (P03, Car)

Explanations are required when the multiple objectives conflict and need to be traded off. Users could use such explanations to defend for or against certain objectives.

"I think it's like a defensive thing, like if I'm expecting that they're going to cause an increase in my payments or whatever they're going to deny me (health insurance) coverage, I would be trying to find out what it's based on for the opposite reason maybe to discredit it." (P16, Health)
We describe how the EUCA framework supports HCI/AI practitioners and researchers in building end-user-oriented XAI systems and algorithms, discuss our findings, and compare them with prior literature.
Explanatory Forms as Building Blocks. In the user study, participants sorted the explanatory form prototyping cards and combined them to construct a low-fidelity prototype. Participants rated that the resulting prototype fulfilled the majority of explanation needs (231 out of 279 responses, 83%). This shows that the explanatory forms in the EUCA framework may serve as building blocks that can be combined to complete an explanation. The finding resonates with previous user studies finding that an XAI system should support "integrating multiple explanations", as "users employed a diverse range of explanations to reason variedly" [85]. The combination helps to overcome the weaknesses of an individual explanatory form, and may make the explanation more robust, complete, and versatile. With multiple explanatory forms that complement each other, users may more easily construct a whole picture of AI's decision process, and may mitigate confirmation bias, attribution bias, and anchoring bias [58].

Different explanatory forms can be combined statically, as different modules in the UI, or interactively, incorporated into the UX to show detailed explanatory forms on demand [29, 81, 82]. The contents of the combination can be fixed or dynamically generated, i.e., the XAI system learns to use different explanatory forms as vocabularies to respond to the user's follow-up questions, so as to construct an interactive explanatory conversation [87] with end-users.

Suggested Prototyping Process. Our user study demonstrated the prototyping and co-design process with end-users. To determine the most feasible combination of explanatory forms for constructing prototypes for a particular XAI system, we summarize the prototyping process from our user study and suggest the following co-design and prototyping workflow (Fig. 10).
Fig. 10.
Suggested Prototyping Workflow using the EUCA Framework. Blue bold text highlights the supporting contents provided by EUCA.
(1) Create prototyping cards from explanatory forms.
The designer starts by manually extracting several interpretable features given the AI task and the input/output data type. For example, for tabular data, the features could be the column names describing the input, such as house size, age, and location. For image data, the features could be salient image parts or objects for recognition, such as cars, traffic signs, or the pathological appearance of a disease on a chest X-ray. For quick prototyping, the feature content need not reflect the true content generated by XAI algorithms; it serves as a content placeholder for the prototyping card design template.

Then the designer can use the prototyping card design template provided by EUCA, and fill in the template with the extracted features. In the design template and examples, we provide the basic visualization of the explanatory forms used in the user study. Designers can also create their own templates from scratch by referring to the design examples. The EUCA framework website (http://weina.me/end-user-xai) supports designers in sharing their prototype designs to expand the available design templates, and to encourage the reuse of design patterns for similar XAI applications.

For a particular explanatory form, the designer may prepare multiple versions varying the visual representation (e.g., graphics or text) and UI layout, alternating contents from brief to detailed, and providing different options, such as whether to use a pre-defined or user-defined contrastive outcome for counterfactual example, or whether to give users the option to set a threshold level for feature attribute; designers can also refer to the UI/UX design implications in the user study findings (Section 5). Each explanatory form and its variations are presented on individual prototyping cards.

While designing UI/UX variations for the prototyping cards, designers may also consider and apply the general human-AI interaction guidelines. We selected the following design guidelines as the most relevant to XAI systems: "remember recent interactions", "support efficient invocation, dismissal and correction", "learn from user behavior", and "encourage granular feedback". Designers can refer to the guideline paper [14] for details.
(2) Co-design and iterate the low-fidelity prototype with end-users.
With the prepared prototyping cards, the designer can then meet and discuss with the target end-users and/or other stakeholders of the XAI system, and apply user-centered methods informally or formally. Such methods may include interviews, focus groups, and card sorting. The communication aims to use the created cards as a prototyping tool to understand users' requirements under their potential explanation needs, and to involve end-users in the co-design and prototype iteration process.

To quickly create a low-fidelity paper prototype from the prototyping cards, end-users can select, rank, combine, and modify the prototyping cards, and sketch new ones. In this process, designers may ask users why they selected or did not select a card, their rationales for making a particular combination, whether the combination fulfills their requirements, and what is lacking in the current prototype. Users can easily manipulate the card positions to try out different layouts and examine different UI design possibilities (for example, on browsers, tablets, or mobile phones).

Users can also comment on and revise each variation of the same explanatory form. With the tangible prototyping card examples, designers can learn in detail about users' specific requirements on the UI/UX design. The prototyping cards may also facilitate the discussion of UX design; for example, users may choose to hide some cards and only show them on demand, or to present different explanatory information in different contexts.

After the initial communication with users, designers need to synthesize users' comments and decide on one or several prototype designs (such as by majority voting; a small aggregation sketch follows this step). Based on the prototyping card ranking and combination, the designer may then create low-fidelity prototypes, and continue to seek users' and/or other stakeholders' feedback and iterate the prototype.

During the above process, the designer may refer to the user study findings to be informed about the properties of the explanatory forms (pros, cons, applicable explanation needs, and design implications in Section 5), and to understand end-users' diverse explanation needs (to calibrate trust, detect bias, resolve disagreement with AI, etc.; Section 6).
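To make the synthesis step concrete, the sketch below aggregates several users' card rankings with a simple Borda-style count to decide which explanatory forms seed the first low-fidelity prototype. This is an illustrative assumption on our part, not a procedure prescribed by EUCA, and the ranking data are hypothetical.

```python
from collections import defaultdict

# Each user's ranked selection of prototyping cards (first = most preferred); hypothetical data.
user_rankings = [
    ["feature attribute", "output certainty", "similar example"],
    ["output certainty", "feature attribute", "decision rule"],
    ["feature attribute", "counterfactual example", "output certainty"],
]

scores = defaultdict(int)
for ranking in user_rankings:
    for position, card in enumerate(ranking):
        scores[card] += len(ranking) - position   # Borda-style: higher rank earns more points

# Explanatory forms ordered by aggregate preference; the top few enter the low-fidelity prototype.
for card, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{card}: {score}")
```

In practice the designer would weigh such counts against the qualitative rationales collected in the session rather than follow the tally mechanically.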
(3) Implement the functional prototype.
After co-design and several iterations, when the low-fidelity prototype is ready to be implemented, the development team can identify the most viable technical solution from the corresponding XAI algorithms (Table 3) for the explanatory forms selected in the low-fidelity prototype, given that many existing XAI techniques are available as open-source toolkits (e.g., [2, 4, 5, 8, 10, 17]).
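For instance, if the low-fidelity prototype settles on feature attribute for an image task, the functional prototype could be backed by an off-the-shelf attribution method. The sketch below uses Integrated Gradients from the Captum toolkit [4]; the tiny stand-in classifier and random input are placeholders for the team's own model and data, not part of the EUCA framework.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in image classifier; in practice this is the team's trained model.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),   # e.g., four driving actions
)
model.eval()

input_image = torch.rand(1, 3, 64, 64, requires_grad=True)   # placeholder input

ig = IntegratedGradients(model)
pred_class = model(input_image).argmax(dim=1)
attributions = ig.attribute(input_image, target=pred_class)

# Aggregate over color channels to get a saliency map that populates the feature-attribute card.
saliency = attributions.abs().sum(dim=1, keepdim=True)
print(saliency.shape)
```

The same pattern applies to the other forms: each card in the final design maps to one algorithm in Table 3, which keeps the implementation effort proportional to what end-users actually asked for.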
Insights for novel XAI algorithms/interfaces. Our findings provide design implications and insights from end-users' perspectives, which may motivate HCI and AI researchers to develop novel interfaces and algorithms towards end-user-centered XAI. We give some examples for inspiration (a minimal sketch of the last idea follows the list):

• The design implication for similar example (Section 5.4) indicates that users need to pinpoint the corresponding features among similar examples for easy comparison. This requirement can be regarded as a combination of the two explanatory forms similar example and feature attribute. Such an insight may inspire UX researchers to design novel interfaces that highlight and compare important features across instances of tabular data. This interface-level solution is not applicable to image data, however, for which new XAI algorithms need to be proposed, such as [28, 31].

• Participants suggested clicking features in feature attribute to check the details of feature shape. This can likewise be regarded as a combination of two explanatory forms; the combination can be achieved at the interface level (e.g., Gamut [43]) or at the algorithmic level (e.g., COGAM [13]).

• The advantages and disadvantages of similar example and typical example appear complementary. A new type of example-based explanation may be proposed accordingly: one that selects typical examples that are representative of the target class while being as similar as possible to the input instance. By inheriting the advantage of similar example, it stays close to the input instance and is easy to understand, overcoming typical example's weakness of being unrelated to the input; by inheriting the advantage of typical example, it remains distinctive and not confusing.

The above examples demonstrate using the explanatory forms as building blocks to create novel XAI algorithms/interfaces. In addition to the design implications identified in our study, XAI researchers can use the above prototyping method to identify users' requirements in their particular tasks and propose new interface or algorithmic solutions accordingly. Since explanation is a social process [67], an advanced XAI system may be trained to construct an explanation dialog [87] that mimics the human explanation process. New XAI algorithms can also be created in this manner, for example, by using reinforcement learning to select among explanatory forms in response to the user's current query and to adapt the explanations to users' preferences or explanation needs.
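To make the last idea concrete, here is a minimal sketch (our own illustration, not a method evaluated in the user study) that scores candidate examples from the target class by trading off typicality (closeness to the class centroid) against similarity to the query instance. The weight `alpha`, the Euclidean metric, and the toy data are assumptions.

```python
import numpy as np

def typical_yet_similar(query, class_examples, alpha=0.5, k=3):
    """Rank examples of the target class that are both representative of the class
    (near its centroid) and similar to the query instance.

    query: (d,) feature vector of the input instance
    class_examples: (n, d) feature vectors of the target class
    alpha: trade-off between typicality (alpha) and similarity (1 - alpha)
    """
    centroid = class_examples.mean(axis=0)
    typicality = -np.linalg.norm(class_examples - centroid, axis=1)   # higher = more typical
    similarity = -np.linalg.norm(class_examples - query, axis=1)      # higher = more similar
    score = alpha * typicality + (1 - alpha) * similarity
    return np.argsort(score)[::-1][:k]   # indices of the top-k candidate explanations

# Toy usage with random features.
rng = np.random.default_rng(0)
X_class = rng.normal(size=(100, 5))
x_query = rng.normal(size=5)
print(typical_yet_similar(x_query, X_class))
```

A learned similarity metric or prototype-based model could replace the Euclidean distances, but the trade-off structure is the point of the idea.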
XAI techniques are abundant, but the understanding of end-users' needs is limited. In this section, we present the user study findings on end-users' diverse needs for explainability within the context of explanatory forms. Our findings reveal two major themes of the need for explainability: explanations for verification/justification, and explanations for betterment.
After acknowledging that AI's decisions are probabilistic, most users need explanations for decision verification, so that they can incorporate AI into their own decision process on high-stake tasks. In our study, we discovered that users frequently seek decision quality metrics, followed by explanations answering why or how questions (such as feature attribute and similar example), to verify the decisions. Users usually request such information 1) during the initial deployment stage [57], when trust has not been established and users have no knowledge of the observed performance [95], as in the explanation needs of safety and communicate; and 2) when AI's decision is being challenged, as in the explanation needs of bias and unexpected. Our quantitative results also revealed that the decision quality-related metrics (output, performance, and dataset) were frequently selected and ranked higher for the above explanation needs (Fig. 8).

In our study, we found that most participants accepted and understood the decision certainty in output, followed by performance. The training data distribution in dataset was the least comprehensible form. Interpreting these metrics may require a certain degree of data analytic skill and can be time-consuming, and the numbers may be contradictory and cause users' frustration. Thus, in real-world applications, it may not be feasible for end-users to check all the metrics. To indicate the probabilistic nature of AI, our findings (Section 5.10) suggest a possible workaround is to provide the range of prediction on demand, or a point prediction together with its range. The range may bring the additional benefit of leaving room for flexible and negotiable decisions in specific tasks. Another suggestion is to provide a unified and precise uncertainty estimation metric [20] that is case-specific and incorporates all sources of decision uncertainty (such as model capability reflected in performance, prior knowledge about the training data distribution, and noise in the input data), to indicate the capabilities and limitations of the AI prediction.
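As an illustration of the range-of-prediction workaround (our own sketch, not part of the study materials), a regression model can report a point estimate together with a quantile-based interval. The 5th/95th quantiles, the single house-area feature, and the synthetic prices below are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(500, 1500, size=(500, 1))                # e.g., house area in sq ft
y = 400 * X[:, 0] + rng.normal(0, 30_000, size=500)      # noisy synthetic price

# One model per quantile: lower bound, point estimate (median), upper bound.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
    for q in (0.05, 0.5, 0.95)
}

x_new = np.array([[1000.0]])
low, mid, high = (models[q].predict(x_new)[0] for q in (0.05, 0.5, 0.95))
print(f"Predicted price ${mid:,.0f} (range ${low:,.0f} to ${high:,.0f} with ~90% certainty)")
```

Surfacing only the point estimate by default and revealing the interval on demand follows the on-demand presentation our participants preferred.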
Alongside the above metrics, various other explanatory forms were selected by participants to verify AI's decision (trust, safety, bias, unexpected). The selection of explanatory forms largely depends on the specific task, the explanation need, and users' preferences, and there are no definite patterns. Previous quantitative studies showed discrepancies between providing local explanations (feature attribute or similar example) and their effect on trust calibration and users' decision accuracy [54, 99], indicating that explanations may play a complex role in the AI-assisted decision process. This may involve complex interactions among factors such as users' perception of the explanatory forms and their visual representations/layouts, AI's and humans' different error zones [99], explanatory information overload, users' cognitive biases when interpreting the explanations [9, 58], and how faithful the explanations are to the underlying AI model. Future research is needed to explore these factors and their effects on human-AI collaborative decision quality. Because there is no universal model to predict the outcomes of the various explanatory forms, our proposed EUCA framework can serve as a practical prototyping tool to quickly test the effects of various explanatory forms and guide the design process.
The other major motivation to check AI's explanation is to move beyond decision verification and improve users' current status, such as to improve the predicted outcome, enhance users' learning and problem-solving skills, discover new knowledge, and trade off among multiple objectives. These explanation needs may emerge once users have established trust and adopted AI into their decision workflow. As AI surpasses human performance in some critical tasks, AI can act as a knowledgeable source providing insights for humans to improve their own welfare. Although research in this direction is relatively limited, some prior works provide promising results on using machine explanations to improve users' knowledge and task performance [15, 53].
The limitations and future work include:

• We summarized the end-user-friendly explanatory forms from technically-achievable solutions via a literature/critical review. We aimed to include the majority of existing explanatory forms under an information saturation criterion: i.e., no additional explanatory forms could be identified. This process resulted in a conceptual model of the 12 end-user-friendly explanatory forms that served as a starting point for the subsequent user study [38]. We did not aim to conduct an exhaustive, comprehensive systematic review, which is beyond what one paper could achieve. Since XAI techniques are fast evolving, the current framework may not cover all possible algorithms. The EUCA framework aims to serve as a moderate initial step towards a practical end-user-centered XAI framework, and can be extended with emerging XAI techniques on the EUCA website (http://weina.me/end-user-xai).

• Due to the high-stake nature and limited adoption of AI in critical decision support, it is challenging to gain access to real-world AI systems in high-stake facilities (such as police offices, courts, clinics/hospitals, and banks) to conduct user studies on multiple critical tasks, and to recruit domain-specific end-users (such as physicians, police officers, judges, and bankers). This is beyond the scope of a single paper. Therefore, in the user study, we designed four fictional vignettes to represent the variability of AI-supported critical decision-making tasks, and participants' responses were based on conjecture rather than their real experience with AI. Our ongoing future work on XAI system design involves physicians as domain-expert end-users using AI as support in their day-to-day clinical decision tasks. Future work may apply the EUCA framework in other domain-specific XAI design and development practices to iterate and improve the framework.

• Bias may be involved in the card sorting of explanatory forms, as we noticed a few participants selected a card because it contained certain features rather than because of its distinct form, even though in the follow-up questions we asked "what if the specific feature was or wasn't included". The explanatory forms, their contents, the particular visual representations, the task, users' current explanation needs, and the user type all played a role in participants' selection choices under the specific study context, and our study design could not disentangle them. The quantitative results from card sorting are meant to serve as a reference only; they are not meant to be used directly to choose explanatory forms without the prototyping process, due to the above complex factors. Future work may design randomized controlled user studies to quantitatively examine the effects of the above factors in detail to guide the choice of explanatory forms in specific contexts. The EUCA framework website allows community users to share their prototypes, encouraging the reuse of design patterns for certain XAI applications.
Designing end-user-oriented explainable AI systems faces many challenges. From the user side, 1) end-users have diverse roles, tasks, and explanation needs; and 2) end-users lack the technical knowledge that some XAI systems require in order to interpret the explanation. From the XAI practitioner side, 1) practitioners' expertise in AI and in HCI/UX/UI design usually does not overlap, and there is a lack of boundary objects to connect the two fields and facilitate collaboration between AI and HCI practitioners; and 2) there is a lack of tools to support the UI/UX design, prototyping, and co-design process.

To address the above challenges, we developed the end-user-oriented XAI framework EUCA with a collaborative effort combining AI and HCI expertise. EUCA considers not only the human-centered perspective but also the technological capabilities, so that the design solutions are both end-user-oriented and technically achievable. It acts as a boundary object between the AI and HCI fields and provides UI/UX design, prototyping, and co-design support.

To apply EUCA in practice, XAI designers can use the provided design templates to create prototyping cards for the twelve explanatory forms. The explanatory forms are end-user-friendly and were identified from a technically achievable solution space; they are a familiar, mutual language for both end-users and XAI practitioners. With the prototyping cards, designers can conduct a participatory design process that involves multiple stakeholders to receive their feedback and iterate the prototype. In this process, the stakeholders can comment on, sort, combine, and revise the prototyping cards, using them as building blocks to construct a low-fidelity prototype. Designers can also refer to the user study findings to be informed by end-users about the properties of the explanatory forms (their strengths, weaknesses, UI/UX design implications, and applicable explanation needs), and to understand end-users' diverse needs for explainability within the context of explanatory forms. The corresponding XAI algorithms for each explanatory form can help developers implement a functional prototype. As an initial step towards end-user-centered XAI, the EUCA framework provides a practical prototyping toolkit that supports HCI/AI practitioners and researchers in developing end-user-oriented XAI systems.

ACKNOWLEDGMENTS
We thank all the study participants for their time, effort, and valuable input in the study. We thank Sheelagh Carpendale, Parmit Chilana, Ben Cardoen, Pegah Kiaei, and Zipeng Liu for the helpful discussions in shaping this work. We thank all reviewers for their valuable comments. The first author was supported by the Simon Fraser University Big Data Initiative The Next Big Question Funding. The first author would like to thank their family for their generous support in completing this work during the difficult times in 2020.
REFERENCES
[1] [n.d.]. The impact of the General Data Protection Regulation (GDPR) on artificial intelligence. ([n. d.]). https://doi.org/10.2861/293
[2] 2020. Alibi. https://docs.seldon.io/projects/alibi/en/v0.2.0/index.html Accessed: 2020-09-10.
[3] 2020. Boundary object. https://en.wikipedia.org/wiki/Boundary_object
[4] 2020. Captum · Model Interpretability for PyTorch. https://captum.ai/ Accessed: 2020-09-10.
[5] 2020.
DALEX, moDel Agnostic Language for Exploration and eXplanation
Proceedings of the 2018 CHIConference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI ’18) . Association for ComputingMachinery, New York, NY, USA, 1–18. https://doi.org/10.1145/3173574.3174156[13] Ashraf Abdul, Christian von der Weth, Mohan Kankanhalli, and Brian Y. Lim. 2020. COGAM: Measuring andModerating Cognitive Load in Machine Learning Model Explanations. In
Proceedings of the 2020 CHI Conference onHuman Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20) . Association for Computing Machinery, NewYork, NY, USA, 1–14. https://doi.org/10.1145/3313831.3376615[14] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, ShamsiIqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AIInteraction. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, ScotlandUk) (CHI ’19) . Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300233[15] Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, and Yisong Yue. 2018. Teaching Categories to Human Learn-ers with Visual Explanations. In
Proceedings of the IEEE Computer Society Conference on Computer Vision and PatternRecognition . IEEE Computer Society, 3820–3828. https://doi.org/10.1109/CVPR.2018.00402 arXiv:1802.06924[16] Daniel W. Apley. 2016. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. (dec2016). arXiv:1612.08468 http://arxiv.org/abs/1612.08468[17] Vijay Arya, Rachel K. E. Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, StephanieHoude, Q. Vera Liao, Ronny Luss, Aleksandra Mojsilovi´c, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra,John Richards, Prasanna Sattigeri, Karthikeyan Shanmugam, Moninder Singh, Kush R. Varshney, Dennis Wei, andYunfeng Zhang. 2019. One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques.(sep 2019). arXiv:1909.03012 http://arxiv.org/abs/1909.03012[18] Vijay Arya, Rachel K. E. Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, StephanieHoude, Q. Vera Liao, Ronny Luss, Aleksandra Mojsilovi´c, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra,John Richards, Prasanna Sattigeri, Karthikeyan Shanmugam, Moninder Singh, Kush R. Varshney, Dennis Wei, andYunfeng Zhang. 2019. One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques.arXiv:1909.03012 [cs.AI][19] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and WojciechSamek. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation.
PLOS ONE
10, 7 (jul 2015), e0130140. https://doi.org/10.1371/journal.pone.0130140[20] Edmon Begoli, Tanmoy Bhattacharya, and Dimitri Kusnezov. 2019. The need for uncertainty quantification inmachine-assisted medical decision making.
Nature Machine Intelligence
1, 1 (jan 2019), 20–23. https://doi.org/10.1038/s42256-018-0004-1[21] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri,José M.F. Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In
FAT* 2020 - Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency . 648–657. https://doi.org/10.1145/3351095.3375624 arXiv:1909.06342[22] Sara Bly and Elizabeth F. Churchill. 1999. Design through Matchmaking: Technology in Search of Users.
Interactions
6, 2 (March 1999), 23–31. https://doi.org/10.1145/296165.296174[23] Virginia Braun and Victoria Clarke. 2012. Thematic analysis. In
APA handbook of research methods in psychology,Vol 2: Research designs: Quantitative, qualitative, neuropsychological, and biological.
American PsychologicalAssociation, Washington, DC, US, 57–71. https://doi.org/10.1037/13620-004[24] Andrea Bunt, Matthew Lount, and Catherine Lauzon. 2012. Are explanations always important?. In
Proceedings ofthe 2012 ACM international conference on Intelligent User Interfaces - IUI ’12 . ACM Press, New York, New York, USA,169. https://doi.org/10.1145/2166966.2166996[25] Carrie J. Cai, Jonas Jongejan, and Jess Holbrook. 2019. The effects of example-based explanations in a machinelearning interface. In
Proceedings of the 24th International Conference on Intelligent User Interfaces - IUI ’19 . ACMPress, New York, New York, USA, 258–262. https://doi.org/10.1145/3301275.3302289[26] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noémie Elhadad. 2015. Intelligible modelsfor healthcare: Predicting pneumonia risk and hospital 30-day readmission. In
Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining , Vol. 2015-Augus. Association for ComputingMachinery, New York, New York, USA, 1721–1730. https://doi.org/10.1145/2783258.2788613[27] Federica Di Castro and Enrico Bertini. 2019. Surrogate Decision Tree Visualization. http://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-15.pdf[28] Alina Barnett Jonathan Su Cynthia Rudin Chaofan Chen, Oscar Li. 2019. This Looks Like That: Deep Learning forInterpretable Image Recognition. In
Proceedings of Neural Information Processing Systems (NeurIPS) .[29] Hao-Fei Cheng, Ruotong Wang, Zheng Zhang, Fiona O’Connell, Terrance Gray, F. Maxwell Harper, and Haiyi Zhu.2019. Explaining Decision-Making Algorithms through UI. (2019), 1–12. https://doi.org/10.1145/3290605.3300789[30] Douglas Cirqueira, Dietmar Nedbal, Markus Helfert, and Marija Bezbradica. 2020. Scenario-Based RequirementsElicitation for User-Centric Explainable AI. In
Machine Learning and Knowledge Extraction , Andreas Holzinger,Peter Kieseberg, A Min Tjoa, and Edgar Weippl (Eds.). Springer International Publishing, Cham, 321–341.[31] Noel C.F. Codella, Chung Ching Lin, Allan Halpern, Michael Hind, Rogerio Feris, and John R. Smith. 2018. Col-laborative human-AI (CHAI): Evidence-based interpretable melanoma classification in dermoscopic images. In
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notesin Bioinformatics) , Vol. 11038 LNCS. Springer Verlag, 97–105. https://doi.org/10.1007/978-3-030-02628-8_11arXiv:1805.12234[32] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. (feb 2017).arXiv:1702.08608 http://arxiv.org/abs/1702.08608[33] Mengnan Du, Ninghao Liu, and Xia Hu. 2020. Techniques for interpretable machine learning.
Commun. ACM
63, 1(2020), 68–77. https://doi.org/10.1145/3359786 arXiv:1808.00033[34] Malin Eiband, Hanna Schneider, Mark Bilandzic, Julian Fazekas-Con, Mareike Haug, and Heinrich Hussmann. 2018.Bringing Transparency Design into Practice. In (Tokyo,Japan) (IUI ’18) . Association for Computing Machinery, New York, NY, USA, 211–223. https://doi.org/10.1145/3172944.3172961[35] Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine.
The Annals of Statistics
29, 5 (oct 2001), 1189–1232. https://doi.org/10.1214/aos/1013203451[36] Nicholas Frosst and Geoffrey Hinton. 2017.
Distilling a Neural Network Into a Soft Decision Tree . Technical Report.arXiv:1711.09784v1 https://arxiv.org/pdf/1711.09784.pdf[37] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. [n.d.].
Counterfactual Visual Explanations .Technical Report. arXiv:1904.07451v2[38] Maria J. Grant and Andrew Booth. 2009. A typology of reviews: an analysis of 14 review types and associatedmethodologies.
Health Information & Libraries Journal
26, 2 (2009), 91–108. https://doi.org/10.1111/j.1471-1842.2009.00848.x arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1471-1842.2009.00848.x[39] Shirley Gregor and Izak Benbasat. 1999. Explanations from Intelligent Systems: Theoretical Foundations andImplications for Practice.
MIS Quarterly
23, 4 (dec 1999), 497. https://doi.org/10.2307/249487[40] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Dino Pedreschi, Franco Turini, and Fosca Giannotti. 2018.Local Rule-Based Explanations of Black Box Decision Systems. (may 2018). arXiv:1805.10820 http://arxiv.org/abs/1805.10820[41] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. ASurvey of Methods for Explaining Black Box Models.
ACM Comput. Surv
51, 93 (2018). https://doi.org/10.1145/3236009
[42] R. R. Hoffman and G. Klein. 2017. Explaining Explanation, Part 1: Theoretical Foundations.
IEEE Intelligent Systems
32, 3 (2017), 68–73. https://doi.org/10.1109/MIS.2017.54[43] Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, and Steven M. Drucker. 2019. Gamut: A DesignProbe to Understand How Data Scientists Understand Machine Learning Models. In
Proceedings of the 2019CHI Conference on Human Factors in Computing Systems - CHI ’19 . ACM Press, New York, New York, USA, 1–13.https://doi.org/10.1145/3290605.3300809[44] Andreas Holzinger, Bernd Malle, Peter Kieseberg, Peter M. Roth, Heimo Müller, Robert Reihs, and Kurt Zat-loukal. 2017. Towards the Augmented Pathologist: Challenges of Explainable-AI in Digital Pathology. (dec 2017).arXiv:1712.06657 http://arxiv.org/abs/1712.06657[45] Sungsoo Ray Hong, Jessica Hullman, and Enrico Bertini. 2020. Human Factors in Model Interpretability: IndustryPractices, Challenges, and Needs.
Proceedings of the ACM on Human-Computer Interaction
4, CSCW1 (2020), 1–26.https://doi.org/10.1145/3392878 arXiv:2004.11440[46] James W. Hooper and Pei Hsia. 1982. Scenario-Based Prototyping for Requirements Identification.
SIGSOFT Softw.Eng. Notes
7, 5 (April 1982), 88–93. https://doi.org/10.1145/1006258.1006275[47] Weina Jin, Mostafa Fatehi, Kumar Abhishek, Mayur Mallya, Brian Toyota, and Ghassan Hamarneh. 2020. Artificialintelligence in glioma imaging: challenges and advances.
Journal of neural engineering
17, 2 (2020), 021002.https://doi.org/10.1088/1741-2552/ab8131 arXiv:1911.12886[48] Jeremy Kawahara, Kathleen P Moriarty, and Ghassan Hamarneh. 2017. Graph geodesics to find progressively similarskin lesion images. In
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligenceand Lecture Notes in Bioinformatics) , Vol. 10551 LNCS. 31–41. https://doi.org/10.1007/978-3-319-67675-3_4[49] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism forInterpretability. In
Advances in Neural Information Processing Systems 29 , D. D. Lee, M. Sugiyama, U. V. Luxburg,I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 2280–2288. http://papers.nips.cc/paper/6300-examples-are-not-enough-learn-to-criticize-criticism-for-interpretability.pdf[50] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory sayres. 2018.Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (Proceed-ings of Machine Learning Research, Vol. 80) , Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsmässan,Stockholm Sweden, 2668–2677. http://proceedings.mlr.press/v80/kim18d.html[51] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Sendhil Mullainathan, David Abrams, MattAlsdorf, Molly Cohen, Alexander Crohn, Gretchen Ruth Cusick, Tim Dierks, John Donohue, Mark Dupont, Meg Egan,Elizabeth Glazer, Joan Gottschall, Nathan Hess, Karen Kane, Leslie Kellam, Angela LascalaGruenewald, CharlesLoeffler, Anne Milgram, Lauren Raphael, Chris Rohlfs, Dan Rosenbaum, Terry Salo, Andrei Shleifer, Aaron Sojourner,James Sowerby, Cass Sunstein, Michele Sviridoff, Emily Turner, and Judge John. 2017. Human Decisions andMachine Predictions. September (2017), 1–53. https://doi.org/10.1093/qje/qjx032/4095198/Human-Decisions-and-Machine-Predictions[52] Janet L. Kolodner. 1992. An introduction to case-based reasoning.
Artificial Intelligence Review
6, 1 (mar 1992), 3–34.https://doi.org/10.1007/BF00155578[53] Vivian Lai, Han Liu, and Chenhao Tan. 2020. "Why is ’Chicago’ Deceptive?" Towards Building Model-Driven Tutorialsfor Humans. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI,USA) (CHI ’20) . Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376873[54] Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learningmodels: A case study on deception detection. In
FAT* 2019 - Proceedings of the 2019 Conference on Fairness,Accountability, and Transparency . Association for Computing Machinery, Inc, New York, New York, USA, 29–38.https://doi.org/10.1145/3287560.3287590 arXiv:1811.07901[55] Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. 2018. Comparison-Based Inverse Classification for Interpretability in Machine Learning. In
Information Processing and Management ofUncertainty in Knowledge-Based Systems. Theory and Foundations , Jesús Medina, Manuel Ojeda-Aciego, José LuisVerdegay, David A Pelta, Inma P Cabrera, Bernadette Bouchon-Meunier, and Ronald R Yager (Eds.). SpringerInternational Publishing, Cham, 100–111.[56] O. Li, H. Liu, C. Chen, and C. Rudin. 2018. Deep Learning for Case-based Reasoning through Prototypes: A NeuralNetwork that Explains its Predictions. In
AAAI .[57] Q. Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for ExplainableAI User Experiences. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu,HI, USA) (CHI ’20) . Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3313831.3376590 [58] Geoffrey K. Lighthall and Cristina Vazquez-Guillamet. 2015. Understanding decision making in critical care.
ClinicalMedicine and Research
13, 3-4 (dec 2015), 156–168. https://doi.org/10.3121/cmr.2015.1289[59] Brian Y. Lim and Anind K. Dey. 2009. Assessing demand for intelligibility in context-aware applications. In
Proceed-ings of the 11th international conference on Ubiquitous computing - Ubicomp ’09 . ACM Press, New York, New York,USA, 195. https://doi.org/10.1145/1620545.1620576[60] Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. 2009. Why and Why Not Explanations Improve the Intelligibilityof Context-Aware Intelligent Systems. In
Proceedings of the SIGCHI Conference on Human Factors in ComputingSystems (Boston, MA, USA) (CHI ’09) . Association for Computing Machinery, New York, NY, USA, 2119–2128.https://doi.org/10.1145/1518701.1519023[61] Brian Y Lim, Qian Yang, Ashraf Abdul, and Danding Wang. 2019. Why these Explanations? Selecting IntelligibilityTypes for Explanation Goals. (2019), 7. https://doi.org/10.1145/1234567890[62] Zachary C Lipton. 2017. The Doctor Just Won’t Accept That!. In
NIPS Symposium on Interpretable ML
Advances in Neural Information Processing Systems 30 . 4765–4774. https://github.com/slundberg/shap[64] Aravindh Mahendran and Andrea Vedaldi. 2014. Understanding Deep Image Representations by Inverting Them.(nov 2014). arXiv:1412.0035 http://arxiv.org/abs/1412.0035[65] Helena Matute, Fernando Blanco, Ion Yarritu, Marcos Díaz-Lago, Miguel A. Vadillo, and Itxaso Barberia. 2015.Illusions of causality: how they bias our everyday thinking and how they could be reduced.
Frontiers in Psychology
6, July (2015), 1–14. https://doi.org/10.3389/fpsyg.2015.00888[66] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A Survey on Biasand Fairness in Machine Learning. (2019). arXiv:1908.09635 http://arxiv.org/abs/1908.09635[67] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. , 38 pages. https://doi.org/10.1016/j.artint.2018.07.007 arXiv:1706.07269[68] Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of Inmates Running the Asylum Or: How ILearnt to Stop Worrying and Love the Social and Behavioural Sciences. arXiv:1712.00547 [cs.AI][69] Tim Miller, Piers Hower, and Liz Sonenberg. 2017. Explainable AI: beware of inmates running the asylum. In
IJCAI2017 workshop on explainable artificial intelligence (XAI) . 363. https://doi.org/10.1016/j.foodchem.2017.11.091arXiv:1712.00547[70] Yao Ming, Huamin Qu, and Enrico Bertini. 2019. RuleMatrix: Visualizing and Understanding Classifiers with Rules.
IEEE Transactions on Visualization and Computer Graphics
25, 1 (jan 2019), 342–352. https://doi.org/10.1109/TVCG.2018.2864812[71] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2018. How doHumans Understand Explanations from Machine Learning Systems? An Evaluation of the Human-Interpretabilityof Explanation. (feb 2018). arXiv:1802.00682 http://arxiv.org/abs/1802.00682[72] Ingrid Nunes and Dietmar Jannach. 2017. A systematic review and taxonomy of explanations in decision supportand recommender systems.
User Modeling and User-Adapted Interaction
27, 3-5 (2017), 393–444. https://doi.org/10.1007/s11257-017-9195-0[73] Judea Pearl. 2000.
Causality: Models, Reasoning, and Inference .[74] Alun Preece, Dan Harborne, Dave Braines, Richard Tomsett, and Supriyo Chakraborty. 2018.
Stakeholders inExplainable AI . Technical Report. arXiv:1810.00184v1[75] Gabriëlle Ras, Marcel Van Gerven, and Pim Haselager. 2018. Explanation Methods in Deep Learning: Users, Values,Concerns and Challenges. (2018). arXiv:1803.07517v2 https://arxiv.org/pdf/1803.07517.pdf[76] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictionsof Any Classifier. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery andData Mining (San Francisco, California, USA) (KDD ’16) . Association for Computing Machinery, New York, NY, USA,1135–1144. https://doi.org/10.1145/2939672.2939778[77] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-Precision Model-Agnostic Explana-tions. In
AAAI Conference on Artificial Intelligence (AAAI) .[78] Mireia Ribera and Agata Lapedriza. 2019. Can we do better explanations? A proposal of user-centered explainableAI. In
Joint Proceedings of the ACM IUI 2019 Workshops . http://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-12.pdf[79] Mary Beth Rosson and John M. Carroll. 2002.
Scenario-Based Design . L. Erlbaum Associates Inc., USA, 1032–1050.[80] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013.
Deep Inside Convolutional Networks: VisualisingImage Classification Models and Saliency Maps . Technical Report. arXiv:1312.6034v2 https://arxiv.org/abs/1312.6034v2[81] Alison Smith-Renner, Ron Fan, Melissa Birchfield, Tongshuang Wu, Jordan Boyd-Graber, Daniel S. Weld, andLeah Findlater. 2020. No Explainability without Accountability. In
Proceedings of the 2020 CHI Conference on
Human Factors in Computing Systems . Association for Computing Machinery (ACM), New York, NY, USA, 1–13.https://doi.org/10.1145/3313831.3376624[82] Kacper Sokol and Peter Flach. 2020. Explainability fact sheets: A framework for systematic assessment of explainableapproaches.
FAT* 2020 - Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (2020),56–67. https://doi.org/10.1145/3351095.3372870 arXiv:1912.05100[83] Sarah Tan, Rich Caruana, Giles Hooker, Paul Koch, and Albert Gordo. 2018.
Learning Global Additive Explanationsfor Neural Nets Using Model Distillation . Technical Report. arXiv:1801.08640v2 https://youtu.be/ErQYwNqzEdc.[84] Amy Turner, Meena Kaushik, Mu-Ti Huang, and Srikar Varanasi. 2020. Calibrating Trust in AI-Assisted DecisionMaking. (2020).[85] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing Theory-Driven User-Centric ExplainableAI. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI’19) . Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3290605.3300831[86] Jonas Wanner and Christian Janiesch. 2020. How much is the black box? The value of explainability in machinelearning models. (2020).[87] Daniel S. Weld and Gagan Bansal. 2019. The challenge of crafting intelligible intelligence.
Commun. ACM
62, 6 (mar2019), 70–79. https://doi.org/10.1145/3282486 arXiv:1803.04263[88] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. 2010.
Caltech-UCSD Birds 200 .Technical Report CNS-TR-2010-001. California Institute of Technology.[89] Christine T. Wolf. 2019. Explainability scenarios: Towards scenario-based XAI design.
International Conference onIntelligent User Interfaces, Proceedings IUI
Part F1476 (2019), 252–257. https://doi.org/10.1145/3301275.3302317[90] Jennifer Wortman Vaughan and Hanna Wallach. [n.d.]. A Human-Centered Agenda for Intelligible Machine Learning.
Jennwv.Com
Proceedings of the 34th Inter-national Conference on Machine Learning - Volume 70 (Sydney, NSW, Australia) (ICML’17) . JMLR.org, 3921–3930.[92] Qian Yang. 2018. Machine Learning as a UX Design Material: How Can We Imagine Beyond Automation, Recom-menders, and Reminders? https://aaai.org/ocs/index.php/SSS/SSS18/paper/view/17471[93] Qian Yang, Alex Scuito, John Zimmerman, Jodi Forlizzi, and Aaron Steinfeld. 2018. Investigating How ExperiencedUX Designers Effectively Work with Machine Learning. In
Proceedings of the 2018 Designing Interactive SystemsConference (Hong Kong, China) (DIS ’18) . Association for Computing Machinery, New York, NY, USA, 585–596.https://doi.org/10.1145/3196709.3196730[94] Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020. Re-Examining Whether, Why, and HowHuman-AI Interaction Is Uniquely Difficult to Design. In
Proceedings of the 2020 CHI Conference on Human Factorsin Computing Systems (Honolulu, HI, USA) (CHI ’20) . Association for Computing Machinery, New York, NY, USA,1–13. https://doi.org/10.1145/3313831.3376301[95] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trustin Machine Learning Models. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19) . Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300509[96] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. 2018.BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling.
CoRR abs/1805.04687 (2018).arXiv:1805.04687 http://arxiv.org/abs/1805.04687[97] Kun Yu, Shlomo Berkovsky, Ronnie Taib, Jianlong Zhou, and Fang Chen. 2019. Do I Trust My Machine Teammate?An Investigation from Perception to Decision. In
Proceedings of the 24th International Conference on Intelligent UserInterfaces (Marina del Ray, California) (IUI ’19) . Association for Computing Machinery, New York, NY, USA, 460–468.https://doi.org/10.1145/3301275.3302277[98] Quanshi Zhang, Yu Yang, Haotian Ma, and Ying Nian Wu. 2018. Interpreting CNNs via Decision Trees. (jan 2018).arXiv:1802.00121 http://arxiv.org/abs/1802.00121[99] Yunfeng Zhang, Q. Vera Liao, and Rachel K.E. Bellamy. 2020. Efect of confidence and explanation on accuracyand trust calibration in AI-assisted decision making.
FAT* 2020 - Proceedings of the 2020 Conference on Fairness,Accountability, and Transparency (2020), 295–305. https://doi.org/10.1145/3351095.3372852 arXiv:2001.02114[100] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2015.