To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making
ZANA BUÇINCA,
Harvard University, USA
MAJA BARBARA MALAYA,
Lodz University of Technology, Poland
KRZYSZTOF Z. GAJOS,
Harvard University, USA

People supported by AI-powered decision support tools frequently overrely on the AI: they accept an AI's suggestion even when that suggestion is wrong. Adding explanations to the AI decisions does not appear to reduce the overreliance and some studies suggest that it might even increase it. Informed by the dual-process theory of cognition, we posit that people rarely engage analytically with each individual AI recommendation and explanation, and instead develop general heuristics about whether and when to follow the AI suggestions. Building on prior research on medical decision-making, we designed three cognitive forcing interventions to compel people to engage more thoughtfully with the AI-generated explanations. We conducted an experiment (N=199), in which we compared our three cognitive forcing designs to two simple explainable AI approaches and to a no-AI baseline. The results demonstrate that cognitive forcing significantly reduced overreliance compared to the simple explainable AI approaches. However, there was a trade-off: people assigned the least favorable subjective ratings to the designs that reduced the overreliance the most. To audit our work for intervention-generated inequalities, we investigated whether our interventions benefited equally people with different levels of Need for Cognition (i.e., motivation to engage in effortful mental activities). Our results show that, on average, cognitive forcing interventions benefited participants higher in Need for Cognition more. Our research suggests that human cognitive motivation moderates the effectiveness of explainable AI solutions.

CCS Concepts: • Human-centered computing → Interaction design.

Additional Key Words and Phrases: explanations; artificial intelligence; trust; cognition
ACM Reference Format:
Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making.
Proc. ACM Hum.-Comput. Interact.
5, CSCW1, Article 188 (April 2021), 21 pages. https://doi.org/10.1145/3449287
From loan approval to disease diagnosis, humans are increasingly being assisted by artificially intelligent (AI) systems in decision-making tasks. By combining two types of intelligence, these emerging sociotechnical systems (i.e., human+AI teams) were expected to perform better than either people or AIs alone [35, 36]. Recent studies, however, show that although human+AI teams typically outperform people working alone, their performance is usually inferior to the AI's [2, 7, 9, 26, 30, 41, 59]. There is evidence that instead of combining their own insights with suggestions generated
by the computational models, people frequently overrely on the AI, following its suggestions even when those suggestions are wrong and the person would have made a better choice on their own [9, 30, 41].

Explainable AI (XAI), an approach where AI's recommendations are accompanied by explanations or rationales, was intended to address the problem of overreliance: By giving people an insight into how the machine arrived at its recommendations, the explanations were supposed to help them identify the situations in which AI's reasoning was incorrect and the suggestion should be rejected. However, evidence suggests that explainable systems, also, have not had substantial success in reducing human overreliance on the AIs: when the AI suggests incorrect or suboptimal solutions, people still on average make poorer final decisions than they would have without AI's assistance [2, 7, 30, 40, 67].

We posit that the dual-process theory provides a useful lens through which to understand the reasons why the explanations do not eliminate human overreliance on AIs. According to the dual-process theory, humans mostly operate on System 1 thinking, which employs heuristics and shortcuts when making decisions [32, 63]. Because most of the daily decisions are successfully accomplished by these heuristics, analytical thinking (i.e., System 2) is triggered rarely, as it is slower and costlier in terms of effort. Yet, System 1 thinking leaves us vulnerable to cognitive biases that can result in incorrect or suboptimal decisions [32]. We do not want to suggest, however, that System 1 thinking is always bad or inappropriate. Indeed, successful use of pattern matching and heuristics that arise from extensive experience are important components of expertise [49]. But even experts can fall prey to cognitive biases [42], so a judicious combination of both processes is needed.

In the context of explainable AI, the implicit assumption behind the design of most systems is that people will engage analytically with each explanation and will use the content of these explanations to identify which of the AI's suggestions are plausible and which appear to be based on faulty reasoning. Because evaluating every explanation requires substantial cognitive effort, which humans are averse to [38], this assumption is likely incorrect.
Instead, people appear to develop heuristics about the competence of the AI partner overall. Indeed, some studies demonstrate that explanations are interpreted as a general signal of competence (rather than being evaluated individually for their content) and just by their presence can increase the trust in and overreliance on the AI [2].

We argue that to reduce human overreliance on the AI and improve performance, we need to not only develop effective explanation techniques, but also ways to increase people's cognitive motivation for engaging analytically with the explanations.

A number of approaches have been explored in other domains for engaging people in more analytical thinking to reduce the impact of cognitive biases on decision-making. One of the most promising appears to be the cognitive forcing functions: interventions that are applied at the decision-making time to disrupt heuristic reasoning and thus cause the person to engage in analytical thinking [42]. Examples of cognitive forcing functions include checklists, diagnostic time-outs, or asking the person to explicitly rule out an alternative. Bringing these insights into AI-assisted decision-making, we hypothesized that adding cognitive forcing functions to existing explainable AI designs would help reduce human overreliance on the AI. An important aspect of the design of forcing functions in this context, however, is the usability and the acceptability of these interventions. We hypothesized that stricter interventions will push people to think harder, but will be found complex and less usable.

We conducted an experiment with 199 participants on Amazon Mechanical Turk, in which we compared three cognitive forcing designs to two simple explainable AI approaches, and to a no-AI baseline. Our results demonstrate that cognitive forcing functions significantly reduced overreliance on the AI compared to the simple explainable AI approaches. However, cognitive forcing functions did not completely eliminate overreliance on the AI: even in cognitive forcing function conditions, participants were prone to rely on incorrect AI predictions for instances where they would have made a better decision without AI assistance.

As hypothesized, we also observed a trade-off between the acceptability of the designs and their effectiveness at reducing the overreliance on the AI. Specifically, people over-relied less on the AI when exposed to the conditions that they found more difficult, preferred less, and trusted less.

We also audited our work for intervention-generated inequalities [60]. Because prior work suggested that people with high Need for Cognition (NFC, a stable personality trait that captures one's motivation to engage in effortful mental activities) tend to benefit more from complex user interface features [12, 22, 24], we did so by disaggregating our results by NFC level. We found that cognitive forcing functions benefited the more advantaged group, people with high NFC, the most.

In summary, we made two contributions in this paper:

First, we introduced cognitive forcing functions as interaction design interventions for human+AI decision-making and demonstrated their potential to reduce overreliance on the AI. Our study also shows that there exists a trade-off between the effectiveness and the acceptability of interventions that cause people to exert more cognitive effort.
Our research demonstrates that the effectiveness of human+AI collaborations on decision-making tasks depends on human factors, such as cognitive motivation, that go beyond the choice of the right form and content of AI-generated explanations. Hence, in addition to developing explanation techniques and tuning explanation attributes such as soundness and completeness [39], or faithfulness, sensitivity, and complexity [5], more thought should be put into the design of the interaction to ensure that people will make effective use of the AI-generated recommendations and explanations.

Second, by self-auditing our work for potential intervention-generated inequalities, we showed that our approach, while effective on average, appears to benefit individuals with high NFC more than those with low NFC. These results add to a small but growing body of work suggesting that the user interface innovations that allow people to tackle ever more complex tasks, but which also require increasing amounts of cognitive effort to operate, systematically benefit high-NFC individuals more than those with lower levels of cognitive motivation. Thus, reducing these disparities emerges as a novel challenge for the HCI community. Our results also demonstrate the value and feasibility of auditing HCI innovations (whether in the area of explainable AI or elsewhere) for disparate effects. While there is a growing body of work on auditing the behavior of algorithms underlying interactive systems, the need to audit interaction design choices has received less attention so far but appears equally important.
Dual processing theory postulates that humans make decisions through one of two distinct cognitive processes: fast and automatic (i.e., System 1 thinking) or slow and deliberative (i.e., System 2 thinking) [32, 33, 63]. System 1 thinking employs heuristics that lead to effective decisions in most daily decision-making settings. These heuristics are highly useful and mostly effective; however, they can lead to systematic and predictable errors [34]. Importantly, as part of their expertise, experts also develop heuristics, which at times make them prone to faulty decisions even in high-stakes domains [42]. Therefore, shifting people's reasoning from System 1 to the more deliberative and rational System 2 thinking remains a challenge that researchers have tackled in numerous high-stakes domains [14, 15, 42].
For example, substantive effort has been spent in clinical settings to improve diagnostic outcomes by eliciting analytical thinking from clinicians. That is because evidence suggests that very often diagnostic errors are not cases of lack of knowledge, but rather of biases and faulty information synthesis [25]. Interventions that seek to mitigate these types of errors by invoking deliberative thinking can be categorized into educational strategies and cognitive forcing functions [16]. Educational strategies are metacognitive debiasing techniques that aim at enhancing future decision-making by increasing clinicians' awareness about the existence of different decision-making pitfalls. They are introduced through educational curricula, simulation training, and other instruction techniques. On the other hand, cognitive forcing functions are interventions which take place at the decision-making time and encourage the decision-maker to engage analytically with the decision at hand. For instance, in clinical settings, these forcing functions can take the form of checklists, diagnostic time-outs, and slow decision-making. Studies indicate that cognitive forcing functions, as interventions at the time of decision-making, are more effective than educational strategies in increasing accuracy of diagnostic processes in clinical decision making [42, 55].
As machine learning models have achieved high accuracies across numerous domains, they are increasingly being deployed as decision-support aids for humans. The implicit assumption is that because these models have high accuracies, so will the overall human+AI teams. However, mounting results suggest that these teams systematically under-perform the AI alone in tasks where AI's accuracy is higher than that of humans working alone (e.g., medical treatment selection [30], deceptive review detection [40, 41], loan default prediction [26], high-fat nutrient detection [7], income category prediction [67]). Many expected that if humans were shown explanations for AI decisions, the team performance would be complementary, as the human would be able to understand why the AI came up with a prediction and, more importantly, detect when it was wrong. Yet recent studies found that explanations did not help in detecting AI's incorrect recommendations [30]. While explanations have generally improved the overall performance compared to humans working alone and to providing the human only with predictions, team performance is still inferior to that of the AI [2]. A necessary next step, then, is to understand where this gain in performance (if any) comes from, and why the combined human+AI performance is still lower than the performance of AI models working independently.

A leading view to explain the inferior team performance is that humans overtrust the AI [1]. In other words, when the AI makes an incorrect prediction, humans rely on it even when they would have made a better decision on their own. Thus, numerous studies have highlighted the importance of calibrated trust in improving AI-assisted decision-making [7, 9, 31, 44, 51, 53, 67]. The research community is aware of the risk of overreliance on AIs in general and guidelines have been proposed to reduce overtrust. For example, Wagner et al. [62], in the context of people overtrusting robots, suggest avoiding features that may nudge the user towards the anthropomorphization of robots, or, for self-driving cars, they advise developing tools that understand when the driver is not paying attention. However, to the best of our knowledge, there are no specific interventions that are designed explicitly to mitigate overreliance on AIs and that are shown empirically to reduce overtrust.
Cognitive forcing functions, an umbrella term for interventions that elicit thinking at the decision-making time, have been implemented in various forms by prior work. Often these functions were designed to explicitly disrupt the quick, heuristic (i.e., System 1) decision-making process [14, 15, 19, 20, 42, 50], but may not necessarily be referred to as cognitive forcing functions even in the healthcare domain. Listed below are some of these strategies that have either previously been employed for
AI-assisted decision-making or for which we found appropriate analogs in the AI-assisted decision-making domain:

• Asking the person to make a decision before seeing the AI's recommendation. Prior studies in human-AI decision-making have shown the anchoring bias that occurs when presenting people with the AI's recommendation before allowing them to make a decision first [26]. People made better decisions when they saw the AI's recommendation after making an unassisted decision.

• Slowing down the process. As shown by other HCI researchers, simply delaying the presentation of the AI recommendation can improve outcomes [50].

• Letting the person choose whether and when to see the AI recommendation. There is evidence that showing unsolicited advice that contradicts a person's initial idea may trigger reactance (resistance to the advice) [21]. To prevent this, one could only show AI recommendations when a person requests them.

A growing number of studies demonstrate, however, that people prefer simpler constructs, even though they learn more and perform better with more complex ones. Visualization literature reveals that visual difficulties, while not necessarily preferred, improve participants' comprehension and recall of the displayed content [29]. Recent education research also indicates that students preferred and thought they learned more with easier, passive instruction than with more cognitively demanding, active instruction. But when evaluated objectively, their actual learning and performance was better with the more cognitively demanding, active instruction [18]. Therefore, while cognitive forcing functions may enhance user performance, there likely exists a tension with users' preference for the system.
We conducted an experiment with 3 different cognitive forcing interventions, two simple explainable AI conditions, and a no-AI baseline, to examine whether cognitive forcing functions are successful in reducing human overreliance on the AI when working on a decision-making task. Specifically, we hypothesized that:
H1a: Compared to simple explainable AI approaches, cognitive forcing functions will improve the performance of human+AI teams in situations where the AI's top prediction is incorrect.

And, consequently:

H1b: Compared to simple explainable AI approaches, cognitive forcing functions will improve the performance of human+AI teams.

However, because cognitive forcing functions cause people to exert extra cognitive effort, we expected that there would be a trade-off between the acceptability of the design of the human+AI collaboration interface and its effectiveness in reducing the overreliance: people would over-rely less on the AI when they were forced to think harder, but they would prefer such interfaces less than those that require less thinking. Thus, we hypothesized that:

H2: There will be a negative correlation between the self-reported acceptability of the interface and the performance of human+AI teams in situations where the AI's prediction is incorrect.
Accessing experts, such as judges or clinicians, for experiments is notoriously challenging and costly. We, therefore, designed the task around nutrition as an approachable domain for laypeople.
[Figure 1: screenshots of the practice question interface ("Turn this plate of food into a low carb meal") and of the four AI-assisted interface variants: (a) explanation (SXAI), (b) uncertainty (SXAI), (c) on demand (CFF), (d) wait (CFF).]
Fig. 1. Multiple conditions. (a) depicts the main interface with the explanation condition, where the ingredients are recognized correctly and an explanation is provided for top replacements. In the uncertainty condition (b) participants were shown the AI's confidence along with the explanation. In the on demand condition (c) participants could click to see the AI's suggestion and explanation, whereas in the wait condition (d) they were shown a message "AI is processing the image" for 30 seconds before the suggestion and explanation were presented to them.
Participants were shown meal images and were asked to replace the ingredient highest in carbohydrates on the plate with an ingredient that was low in carbohydrates, but similar in flavor. Each of the meals had an ingredient with substantially more carbohydrates than other ingredients on the plate. Participants had to make two discrete choices: First, they had to pick the ingredient to take out from the meal (by selecting from a list of approximately 10 choices, that included all the main ingredients in the image plus several other ingredients). Next, they had to select a new ingredient to put in place of the removed ingredient (again, by selecting from a list of roughly 10 choices).
We designed six conditions that varied in whether and how a simulated AI assisted the participants in making their decisions.
No AI.
In the no AI condition, participants were shown just the image of the meal and the two pull-down menus to select which ingredient to remove and which new ingredient to put in its place.
Explanation.
The explanation condition (Figure 1a) included the list of the ingredients that the simulated AI (described in the next section) recognized on the plate and the top four substitutions for the ingredient highest in carbohydrates among the recognized ingredients. Each substitution was accompanied by a specific feature-based explanation showing the estimated carbohydrate reduction and flavor similarity. While there is a great diversity in the designs of explanations in the explainable AI community, this condition reflects a common approach for designing the human-AI collaborative interface: the explanation and the AI suggestion are shown immediately to the human decision-maker.
Uncertainty.
The uncertainty condition was like the explanation condition, except that participants were also shown a confidence prompt, "The AI is X% confident in its suggestion." (Figure 1b). The prompt was shown only when the AI was uncertain about its suggestion. This condition captures another common approach for designing interfaces for human-AI collaboration [2, 40, 66].
On demand.
In the on demand condition, the AI suggestion was not shown to the users by default. Users could see the suggestion and the explanation (identical to the one in the explanation condition) if they clicked on the "See AI's suggestion" button (Figure 1c). We hypothesized this to be a light form of cognitive forcing function, where the human would engage with the explanation because they explicitly requested it.
Update.
Participants in the update condition had to make the initial decision without the help of an AI (i.e., like in the no AI condition). Having made the initial decision, they were shown the AI's suggestion and explanation and could update their decision. We hypothesized that this condition would motivate the participants to engage with the explanation, especially when the AI disagreed with their initial decision. In other words, participants would be curious as to why the AI disagreed with them.
Wait.
Participants in the wait condition were shown a message "AI is processing the image" (Figure 1d) for 30 seconds before being shown the AI suggestion and the explanation (identical to the one in the explanation condition). Informed by prior work showing that a slow algorithm improves users' accuracy [50], we hypothesized that this, too, is a form of cognitive forcing function: a user forms a hypothesis for the correct answer while waiting for the AI's suggestion and then evaluates the AI explanation to check whether it supports their hypothesis.
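To make the differences between these conditions concrete, the following is a minimal, hypothetical sketch of how an interface controller might gate the visibility of the AI's suggestion and explanation in each condition. It is not the study's actual implementation, and all names in it are ours:

```python
# Hypothetical sketch of when each condition reveals the AI suggestion and explanation.
from dataclasses import dataclass
from enum import Enum, auto


class Condition(Enum):
    NO_AI = auto()        # no AI assistance at all
    EXPLANATION = auto()  # SXAI: suggestion + explanation shown immediately
    UNCERTAINTY = auto()  # SXAI: as above, plus a confidence prompt when the AI is uncertain
    ON_DEMAND = auto()    # CFF: shown only after the user clicks "See AI's suggestion"
    UPDATE = auto()       # CFF: shown only after the user commits an initial unassisted decision
    WAIT = auto()         # CFF: shown after a 30-second "AI is processing the image" delay


@dataclass
class TrialState:
    condition: Condition
    requested_ai: bool = False            # user clicked "See AI's suggestion"
    made_initial_decision: bool = False   # user already made an unassisted decision
    seconds_elapsed: float = 0.0          # time since the question was shown


def ai_suggestion_visible(state: TrialState) -> bool:
    """Return True if the AI suggestion and explanation should currently be displayed."""
    c = state.condition
    if c is Condition.NO_AI:
        return False
    if c in (Condition.EXPLANATION, Condition.UNCERTAINTY):
        return True
    if c is Condition.ON_DEMAND:
        return state.requested_ai
    if c is Condition.UPDATE:
        return state.made_initial_decision
    if c is Condition.WAIT:
        return state.seconds_elapsed >= 30.0
    return False
```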
We designed a simulated AI for the experiment, which had a 75% accuracy of correctly recognizing the ingredient with the highest carbohydrate impact in the image of the meal. Note that we did not train an actual machine learning model for this task because we wanted to have control over the type and prevalence of errors the AI would make. Once the ingredients were recognized, the
simulated AI always provided suggestions for replacing the ingredient highest in carbohydrates and an explanation which included the four top replacements for the ingredient, each accompanied by an estimate of carbohydrate reduction and flavor similarity. These top replacements were ranked in terms of their carb reduction and flavor similarity compared to the replaced ingredient.

We built a lookup table for each of the ingredients present in the images and their top ten replacements. These top ten replacements were selected as the ingredients that were most similar in terms of flavor to the ingredient to be replaced. We used a flavor database comprised of flavor molecules representing an array of tastes and odors associated with 936 ingredients [23]. Each ingredient is made up of different flavor molecules, and an ingredient can be paired with the other ingredients with which it shares the most flavor molecules. However, in addition to the number of flavor molecules that the ingredients share, the number of molecules that the ingredients do not share is also important for understanding how similar two ingredients are. Hence, we computed the flavor similarity between two ingredients with sets of flavor molecules A and B as F_AB = |A ∩ B| / |A ∪ B|. Having selected the top ten ingredients in terms of flavor, we computed the percent of carbs that would be reduced with respect to the original ingredient. We ranked the top ingredients in terms of the harmonic mean of the flavor similarity and carb reduction. We chose to compute the harmonic mean because we sought an optimization for both flavor and carbs, not only one of them (a code sketch of this ranking logic is given below). The AI explanations included this information about the flavor similarity and carb content of the suggested replacements. These feature-based explanations (i.e., carb reduction and flavor similarity) were intended to be comparable to commonly used explanations generated by techniques such as SHAP [46]. For all the conditions, once the participants selected an ingredient to replace, the second list of ingredients to replace with was populated by these top ten ingredients.

We designed errors of the simulated AI to stem from visual misrecognition. To produce incorrect model predictions, the AI would not recognize the ingredient highest in carbohydrates. Therefore, it would suggest replacing the second ingredient highest in carbs on the plate. Note that we designed the questions such that all meals included a single ingredient that had substantially more carbs than the other ingredients on the plate. As a result, the second ingredient highest in carbs on the plate was a low-carb ingredient.

The study was conducted online on Amazon Mechanical Turk (MTurk). Participants were first presented with a consent form and instructions. They completed 26 questions, split into two blocks of 13 questions. Out of the 9 conditions (the 6 mentioned earlier, plus 3 additional exploratory designs not reported in this paper), participants were randomly shown a different condition in each block. The first question of each block was a practice question to help participants familiarize themselves with a new design, and was discarded from analysis. Six questions with incorrect model predictions (three per block) were shown overall. While the positions of these questions were fixed, the order of questions where the model predictions were correct/incorrect was randomized for each participant.
Participants completed a questionnaire after each block to indicate their subjective experience with the system. After the first block, they also completed a four-item Need for Cognition questionnaire, consisting of the four items with the highest factor loading from [11].
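As a concrete illustration of the simulated AI's ranking logic described above (flavor similarity as the Jaccard overlap of the two ingredients' flavor-molecule sets, candidates ranked by the harmonic mean of flavor similarity and carbohydrate reduction, and errors produced by targeting the second-highest-carb ingredient), here is a minimal sketch. It is not the authors' code; the data structures and names are hypothetical:

```python
# Hypothetical sketch of the simulated AI's replacement ranking and error behavior.

def flavor_similarity(molecules_a: set, molecules_b: set) -> float:
    """F_AB = |A ∩ B| / |A ∪ B|: Jaccard overlap of flavor-molecule sets."""
    union = molecules_a | molecules_b
    return len(molecules_a & molecules_b) / len(union) if union else 0.0


def carb_reduction(original_carbs: float, replacement_carbs: float) -> float:
    """Fraction of carbohydrates removed relative to the original ingredient."""
    return max(0.0, (original_carbs - replacement_carbs) / original_carbs)


def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def top_replacements(original, candidates, molecules, carbs, k=4):
    """Rank candidate replacements by the harmonic mean of flavor similarity and carb reduction."""
    scored = []
    for cand in candidates:
        f = flavor_similarity(molecules[original], molecules[cand])
        r = carb_reduction(carbs[original], carbs[cand])
        scored.append((harmonic_mean(f, r), cand, f, r))
    scored.sort(reverse=True)
    return scored[:k]


def replacement_target(plate_ingredients, carbs, recognizes_top_carb: bool) -> str:
    """On error trials the simulated AI 'misses' the highest-carb ingredient
    and suggests replacing the second-highest-carb ingredient instead."""
    ranked = sorted(plate_ingredients, key=lambda i: carbs[i], reverse=True)
    return ranked[0] if recognizes_top_carb else ranked[1]
```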
A total of 260 participants were recruited to complete the task via Amazon MTurk in three batches. Participation was limited to adults residing in the United States. The study took 15 minutes on average to complete. Each participant was paid $2.50 (USD) for an estimated rate of $10 per hour. Participants could take part in the study only once. To motivate participants to perform well on the
task, the top performer of each batch was rewarded with a bonus of $3. Out of the 260 participants, 49 were filtered out because, for more than 55% of the questions, they selected as the ingredient to replace one that was not on the plate. We noticed that these excluded participants also had an accuracy of 0 on the task. An additional 12 participants were excluded from the analyses as they were assigned to only exploratory conditions not reported in the results. Of the retained 199 participants, 191 completed an optional demographic survey at the beginning of the study (M_age = 37.09, SD = 10.74). 125 of the participants (65.44%) self-identified as male, and the rest self-identified as female. 164 of the participants (85.86%) had either received or were pursuing a college degree, 13 of them a high school degree, and the remaining 13 were either pursuing or had received a PhD degree. For only one of the participants, the highest level of education received was a pre-high school degree.

The study was a mixed between- and within-subject design. Both the within-subject factor and the between-subject factor was the condition. Each participant interacted with two of the nine conditions.

We collected the following performance measures (a computation sketch of the overreliance and human error measures follows these lists):

• Overall performance: Percentage of top replacements, including the correct ingredient to be replaced and the top replacement for it
• Carb source detection performance: Percentage of correct selections of ingredients to be replaced
• Carb reduction: Percentage of carbohydrate reduction with respect to replaced ingredients
• Flavor similarity: Percentage of flavor similarity with respect to replaced ingredients
• Overreliance: Percentage of agreement with the AI when the AI made incorrect predictions
• Human error (on incorrect AI predictions): Percentage of incorrect decisions (different from the AI's suggestion) when the AI made incorrect predictions

We also collected several self-reported subjective measures:

• Preference: Participants rated the statement "I would like to use this system frequently." on a 5-point Likert scale from 1=Strongly disagree to 5=Strongly agree after each block.
• Trust: Participants rated the statement "I trust this AI's suggestions for optimal replacement." on a 5-point Likert scale from 1=Strongly disagree to 5=Strongly agree after each block.
• Mental demand: Participants rated the statement "I found this task difficult." on a 5-point Likert scale from 1=Strongly disagree to 5=Strongly agree for each block.
• System complexity: Participants rated the statement "The system was complex." on a 5-point Likert scale from 1=Strongly disagree to 5=Strongly agree for each block.
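The following is a minimal sketch of how the overreliance and human error measures defined above could be computed from per-question decisions on the trials where the AI's top prediction was incorrect. It is not the authors' analysis code, and the field names are hypothetical:

```python
# Hypothetical sketch of the overreliance and human error measures.
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    ai_correct: bool      # was the AI's top prediction correct on this question?
    participant: str      # participant's final choice
    ai_suggestion: str    # the AI's suggested choice
    optimal: str          # the optimal (correct) choice


def overreliance(answers: List[Answer]) -> float:
    """Share of incorrect-AI questions on which the participant followed the AI."""
    wrong_ai = [a for a in answers if not a.ai_correct]
    return sum(a.participant == a.ai_suggestion for a in wrong_ai) / len(wrong_ai)


def human_error(answers: List[Answer]) -> float:
    """Share of incorrect-AI questions on which the participant was wrong
    but did not follow the AI's suggestion."""
    wrong_ai = [a for a in answers if not a.ai_correct]
    return sum(a.participant != a.ai_suggestion and a.participant != a.optimal
               for a in wrong_ai) / len(wrong_ai)
```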
We conducted our analyses first by category (no AI, simple explainable AI (SXAI), and cognitive forcing functions (CFF)). Subsequently, we conducted additional analyses to test for differences among individual designs within the SXAI and CFF categories. For performance on incorrect model predictions, we also added the no AI category to the analysis even though participants in that category did not see any model predictions. We compared participants' performance on questions when they saw incorrect predictions (i.e., CFF and SXAI) to the performance of other participants on the same questions but with no AI assistance (i.e., no AI).

We used analysis of variance to analyze the impact of the different designs on both objective and subjective measures. Our data was analyzed using mixed-effects models. Category/condition was modeled as a fixed effect, and participant as a random effect to account for the fact that each participant saw two out of several possible conditions (i.e., all the measurements were not statistically independent) [3]. Mixed-effects models also properly handle the imbalance in our data [57], due to participants being randomly assigned to conditions and not all participants having interacted with both categories (CFF and SXAI). Note that unbalanced data can lead to fractional denominator degrees of freedom.
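As an illustration of the kind of model described above (condition category as a fixed effect, participant as a random effect), here is a minimal sketch using the Python statsmodels library. This is not the authors' analysis pipeline; the file and column names are hypothetical:

```python
# Hypothetical sketch of a mixed-effects model with category as a fixed effect
# and participant as a random (grouping) effect.
import pandas as pd
import statsmodels.formula.api as smf

# Long-format data: one row per participant per block, with the category they saw
# ("noAI", "SXAI", or "CFF") and their performance score in that block.
df = pd.read_csv("block_scores.csv")

model = smf.mixedlm("overall_performance ~ C(category)",
                    data=df,
                    groups=df["participant_id"])
print(model.fit().summary())
```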
We used Student's t-test for post-hoc pairwise comparisons with Holm-Bonferroni corrections to account for multiple comparisons [28]. We report marginal means, which are means for groups that are adjusted for the means of the other factors in the model (i.e., participant). The effect size (Cohen's d) was calculated accounting for the random effect as described in Westfall et al. [64]. We used Pearson's correlation to analyze associations between subjective measures and performance. Throughout the results and discussion sections, significantly corresponds to statistically significantly for improved readability.

Results for the performance measures are summarized in Table 1. Participants had to pick both the ingredient to replace and an ingredient to replace it with, each from a list that contained an average of 10 ingredients. Therefore, the probability that a participant was correct by chance in this task was approximately 1% for the overall decision (roughly 1/10 for the ingredient to remove times 1/10 for its replacement) and 10% for carb source detection. Note that we use the term correct when referring to the optimal ingredient replacement in terms of flavor similarity and carb reduction for the ingredient highest in carbs on the plate.

Table 1. Results of objective measures. Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• Overall performance: all questions, {no AI} < {CFF, SXAI}; incorrect AI predictions, {SXAI} < {CFF} < {no AI}
• Carb source detection performance: all questions, {no AI} < {SXAI, CFF}; incorrect AI predictions, {SXAI} < {CFF} < {no AI}
• Carb reduction: all questions, {no AI} < {SXAI, CFF}; incorrect AI predictions, {SXAI} < {CFF} < {no AI}
• Flavor similarity: all questions, {no AI} < {SXAI, CFF}; incorrect AI predictions, {SXAI} < {CFF} < {no AI}

When analyzing performance on all task instances (both those where the top AI predictions were correct and those where they were incorrect), both simple explainable AI conditions and cognitive forcing functions improved participants' performance on all measures (overall performance, carb source detection, carb reduction, and flavor similarity) compared to the no AI baseline. There were no significant differences on any of these metrics between cognitive forcing functions and simple explainable AI. There were also no significant differences within categories.

When the top AI predictions were incorrect, however, cognitive forcing functions improved the objective metrics (i.e., overall performance, carb source detection performance, carb reduction and flavor similarity) significantly more compared to simple explainable AI. Yet, performance of participants in the cognitive forcing functions conditions on these questions remained significantly lower than that of participants in the no AI condition.
On incorrect model predictions, participants could either follow the incorrect AI suggestion (i.e., overrely), provide a different incorrect answer (i.e., human error), or provide a correct answer (Table 2).

Table 2. Distribution of decisions on instances where the model was incorrect; marginal means (standard errors). Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• Overall: correct, {SXAI} < {CFF}; overrelied, {CFF, SXAI} (SXAI 0.30 (0.03), CFF 0.26 (0.03); n.s.); human error, {CFF, SXAI} (SXAI 0.68 (0.03), CFF 0.65 (0.03); n.s.)
• Carb source detection: correct, {SXAI} < {CFF}; overrelied, {CFF} < {SXAI}; human error, {CFF, SXAI} (SXAI 0.27 (0.04), CFF 0.26 (0.03); n.s.)

Distributions of overreliance, human error, and correctness were significantly different across categories for overall performance. Participants in cognitive forcing functions conditions made significantly more correct decisions than participants in simple explainable AI. They also overrelied less, but not significantly so. There were also no significant differences between categories for human errors.

For carb source detection, distributions of overreliance, human error, and correctness were also significantly different across categories. Participants in cognitive forcing functions conditions overrelied significantly less and made significantly more correct decisions than participants in simple explainable AI. There were no significant differences between categories for human errors.

For all the metrics, there were no significant differences among conditions within either category (i.e., simple explainable AI and cognitive forcing functions).

Table 3 summarizes the comparisons of subjective measures across categories. Overall, participants reported higher trust in the AI in simple explainable AI conditions compared to cognitive forcing functions, albeit not significantly so. They preferred completing the task significantly more with AI assistance (either with simple explainable AI or cognitive forcing functions) than without it (no AI). They found the no AI condition to be significantly more mentally demanding than the cognitive forcing functions and simple explainable AI conditions. They perceived the system as significantly less complex in simple explainable AI conditions compared to either cognitive forcing functions or no AI. Conditions within categories did not differ significantly on any of the subjective ratings.

Table 3. Results of subjective measures; marginal means (standard errors). Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• Trust: {CFF, SXAI} (SXAI 3.91 (0.09), CFF 3.72 (0.08); n.s.)
• Preference: {no AI} < {CFF, SXAI}
• Mental demand: {SXAI, CFF} < {no AI}
• System complexity: {SXAI} < {CFF, no AI}

Figure 2 depicts relationships between subjective measures and performance. Trust and preference were significantly negatively correlated with both overall performance and carb source detection performance for incorrect model predictions. Preference was also significantly negatively correlated with carb source detection for correct model predictions. Mental demand and system complexity were significantly negatively correlated with overall and carb source detection performance for correct model predictions.
However, mental demand was significantly positively correlated with carb source detection performance for incorrect model predictions. Trust was significantly positively correlated with reliance on incorrect model predictions (i.e., overreliance) for carb source detection, but not so for correct model predictions. Trust was not significantly correlated with reliance for the overall decision either for correct or incorrect model predictions.

[Fig. 2. Subjective measures vs. performance: (a) trust vs. performance and preference vs. performance; (b) mental demand vs. performance and system complexity vs. performance.]

[Fig. 3. Trust vs. reliance on AI for correct model predictions and incorrect model predictions (i.e., overreliance).]

We are mindful of the fact that technological interventions can lead to intervention-generated inequalities (IGIs). IGIs arise when the benefits of an intervention disproportionately accrue to a group that is already privileged in a particular context [60]. Thus, even though everyone might benefit to some extent, the gaps between different groups increase. Therefore, we conducted a limited internal audit [54] of our work to check for inequalities that might emerge and to understand how different groups of people are affected by the introduction of cognitive forcing functions into the AI-assisted decision-making processes.

While some prior research in ethical AI documented disparities by disaggregating results by race and gender [8], we believe that the relevant variable in our case is intrinsic cognitive motivation. In psychology, it is captured through the concept of Need for Cognition (NFC), a stable personality trait that reflects how much a person enjoys engaging in cognitively demanding activities [10, 52]. The impact of NFC on cognitive engagement with information has been studied in fields such as advertising and purchasing behavior [27, 43, 45], skill acquisition [13], web usage [56, 58], team function [37], health communication [61, 65], and AI-assisted decision-making [24]. The evidence that has accrued is very consistent: high-NFC participants seek out more information and process it more deeply, while low-NFC participants are more likely to resort to cognitive shortcuts such as relying on surface cues to assess the information, such as the authority or celebrity of the source
of the information, or the aesthetics of the presentation. In the context of HCI, people with high NFC are more likely to adopt novel productivity-enhancing features in complex software [12]. Also, given multiple ways of getting a task done, high-NFC participants are more likely than those with low NFC to choose the method that saves manual effort but requires increased cognitive exertion [22]. However, there is also some evidence that people with high NFC benefit less from explanations (in terms of confidence in their decisions) when using recommender systems than people with low NFC [47], although there appear to be limits to the generalizability of this result [48].

Therefore, in the context of human-AI collaboration on decision making, we considered individuals with high NFC to be the already privileged group and we investigated whether cognitive forcing functions were equally effective for people with low NFC as they were for people with high NFC, or whether they increased the performance gap (i.e., inequality) between the two groups. In anticipation of this analysis, we had included a 4-item subset of the NFC questionnaire [11] as part of our study (we used the same subset as [22]).

Participants were split into two groups, low NFC and high NFC, at the median of the NFC score distribution. Table 4 summarizes the results of performance disaggregated by NFC, and Table 5 shows the distribution of decisions on instances where the model was incorrect, disaggregated by NFC.

Table 4. Performance disaggregated by NFC; marginal means (standard errors). Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• High NFC, overall performance: all questions, {CFF, SXAI} (SXAI 0.44 (0.03), CFF 0.43 (0.03); n.s.); incorrect AI predictions, {SXAI} < {CFF}
• High NFC, carb source detection: all questions, {SXAI, CFF} (SXAI 0.66 (0.03), CFF 0.68 (0.03); n.s.); incorrect AI predictions, {SXAI} < {CFF}
• Low NFC, overall performance: all questions, {CFF, SXAI} (SXAI 0.25 (0.04), CFF 0.21 (0.03); n.s.); incorrect AI predictions, {SXAI, CFF} (SXAI 0.03 (0.02), CFF 0.06 (0.02); n.s.)
• Low NFC, carb source detection: all questions, {CFF, SXAI} (SXAI 0.45 (0.04), CFF 0.44 (0.04); n.s.); incorrect AI predictions, {SXAI} < {CFF}

Table 5. Distribution of decisions on instances where the model was incorrect, disaggregated by NFC; marginal means (standard errors). Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• High NFC, overall performance: correct, {SXAI} < {CFF}; overrelied, {CFF, SXAI} (SXAI 0.39 (0.05), CFF 0.32 (0.04); n.s.); human error, {CFF, SXAI} (SXAI 0.60 (0.05), CFF 0.57 (0.04); n.s.)
• High NFC, carb source detection: correct, {SXAI} < {CFF}; overrelied, {CFF} < {SXAI}; human error, {CFF, SXAI} (SXAI 0.16 (0.04), CFF 0.15 (0.03); n.s.)
• Low NFC, overall performance: correct, {SXAI, CFF} (SXAI 0.03 (0.02), CFF 0.07 (0.02); n.s.); overrelied, {CFF, SXAI} (SXAI 0.22 (0.04), CFF 0.19 (0.04); n.s.); human error, {CFF, SXAI} (SXAI 0.75 (0.05), CFF 0.77 (0.05); n.s.)
• Low NFC, carb source detection: correct, {SXAI} < {CFF}; overrelied, {CFF, SXAI} (SXAI 0.50 (0.06), CFF 0.37 (0.06); n.s.); human error, {CFF, SXAI} (SXAI 0.38 (0.05), CFF 0.39 (0.05); n.s.)

Objective Measures. As an initial check, we compared the overall performance across categories (i.e., cognitive forcing functions and simple explainable AI) between the two NFC groups.
High-NFC participants demonstrated significantly higher overall performance than low-NFC participants. They also detected the source of carbs significantly better than low-NFC participants. These results demonstrate that high-NFC participants generally perform better at this task than low-NFC participants.

Consistent with the main findings (reported in Section 4), compared to simple explainable AI, cognitive forcing functions did not have a significant effect on overall performance and carb source detection performance on all questions for either of the groups. On incorrect model predictions, however, high-NFC participants benefited from cognitive forcing functions, as they significantly improved both their overall and carb source detection performance. In contrast, while cognitive forcing functions significantly improved low-NFC participants' carb source detection performance on incorrect model predictions, they did not significantly improve their overall performance.

Detailed analysis of decisions when model predictions were incorrect. For high-NFC participants, consistent with the main findings, distributions of overreliance, human error, and correctness were significantly different between the CFF and SXAI categories for both overall performance and carb source detection. Cognitive forcing functions significantly improved high-NFC participants' overall and carb source detection performance compared to simple explainable AI conditions. High-NFC participants in cognitive forcing functions conditions also overrelied on the AI significantly less for carb source detection than those in simple explainable AI conditions.

For low-NFC participants, distributions of overreliance, human error, and correctness were not significantly different between the CFF and SXAI categories for overall performance. For carb source detection, however, the distributions were significantly different: low-NFC participants in cognitive forcing function conditions detected the carb source significantly better than those in simple explainable AI conditions. Differently from the combined findings, there was no significant difference across categories for overreliance and human errors.

Subjective Measures. We first compared subjective measures reported by high-NFC participants with those reported by low-NFC participants by combining the data across categories (i.e., cognitive forcing functions and simple explainable AI). Low-NFC participants generally found the task significantly more mentally demanding than high-NFC participants.
Similarly, they found the system to be significantly more complex than high-NFC participants did. Low-NFC participants reported on average higher trust than high-NFC participants, albeit not significantly so. They also preferred the systems on average more than high-NFC participants, but this difference was also not statistically significant.

Subsequently, we investigated the effect of category on subjective ratings separately for each of the two NFC groups. These results are summarized in Table 6. High-NFC participants reported significantly higher trust for simple explainable AI conditions compared to cognitive forcing functions. They also preferred simple explainable AI conditions more than cognitive forcing functions and perceived them to be significantly less complex. There were no significant differences for subjective ratings across categories for low-NFC participants.

Table 6. Results of subjective measures disaggregated by NFC; marginal means (standard errors). Differences indicated by the < symbol are statistically significant; categories that share brackets do not differ significantly.
• High NFC: trust, {CFF} < {SXAI} (p = .02); preference, {CFF} < {SXAI} (p = .03); mental demand, {SXAI, CFF} (SXAI 2.50 (0.16), CFF 2.70 (0.14); n.s.); system complexity, {SXAI} < {CFF} (p = .01)
• Low NFC: trust, {CFF, SXAI} (SXAI 3.93 (0.13), CFF 3.93 (0.13); n.s.); preference, {SXAI, CFF} (SXAI 3.73 (0.13), CFF 3.86 (0.13); n.s.); mental demand, {CFF, SXAI} (SXAI 3.27 (0.20), CFF 2.96 (0.19); n.s.); system complexity, {SXAI, CFF} (SXAI 2.96 (0.16), CFF 3.11 (0.15); n.s.)

Consistent with prior research [7, 9, 26, 41], our results demonstrate that people aided by simple explainable AI interfaces performed better overall than people who completed the task without any AI support. Also consistent with the prior research, these human+AI teams performed worse than the AI model alone (its accuracy was set to 75%). Our results suggest that part of the reason is that people frequently overrelied on the AI: they followed the suggestions of the AI even when its predictions were incorrect. Consequently, when the AI predictions were incorrect, people aided by the simple explainable AI approaches performed less well than people who had no AI support.

Our results demonstrate that cognitive forcing functions reduced overreliance on AI compared to the simple explainable AI approaches: When the simulated AI model was incorrect, participants disregarded the AI suggestions and made the optimal choices significantly more frequently in the cognitive forcing function conditions than in the simple explainable AI conditions. These results support the hypothesis (H1a). However, even with the cognitive forcing functions, the human+AI teams in our study continued to perform worse than the AI model alone: the cognitive forcing functions reduced, but did not yet eliminate, overreliance on the AI.

Our results did not provide support for the hypothesis (H1b) that cognitive forcing functions will improve the performance of human+AI teams. There was no significant difference in performance between simple explainable AI approaches and cognitive forcing functions.

Subjective ratings of the conditions indicate that participants in cognitive forcing function conditions perceived the system as more complex than those in simple explainable AI conditions. The tension between subjective measures and performance is observed in the correlation analyses of the subjective measures versus the performance. When the AI model was making incorrect predictions, people performed best in conditions that they preferred and trusted the least, and that they rated as the most difficult. Thus, as hypothesized (H2), there appears to be a trade-off between the acceptability of a design of the human-AI collaboration interface and the performance of the human+AI team.

Overall, our research suggests that human cognitive motivation moderates the effectiveness of explainable AI solutions.
Hence, in addition to tuning explanation attributes such as soundness and completeness [39], or faithfulness, sensitivity, and complexity [5], explainable AI researchers should ensure that people will exert effort to attend to those explanations.

Our results also lend further support to the recent research that demonstrated that using proxy tasks (where participants are asked, for example, to predict the output of an algorithm based on the explanations it generated) to evaluate explainable AI systems may lead to different results than testing the systems on actual decision-making tasks (where participants are asked to make decisions aided by an algorithm and its explanations) [7]. To explain their results, the authors hypothesized that proxy tasks artificially forced participants to pay attention to the AI, but when participants were presented with actual decision-making tasks, they focused on making the decisions and allocated fewer cognitive resources to analyzing AI-generated suggestions and explanations. Our results showed that cognitive forcing interventions improved participants' ability to detect the AI's errors in an actual decision-making task, where the human is assisted by the AI while making decisions. Hence, we concur: If we consider proxy tasks to be a strong form of a cognitive forcing intervention, our results suggest that evaluations that use proxy tasks are likely to produce more optimistic results (i.e., show higher human performance) than evaluations that use actual decision-making tasks with simple explainable AI approaches.

Because there is prior evidence that cognitively-demanding user interface designs benefit people differently depending on the level of their cognitive motivation, we disaggregated our results by dividing our participants into two halves based on their Need for Cognition (NFC), a stable personality trait that captures how much a person enjoys engaging in effortful cognitive activities. The significant improvements in performance stemming from cognitive forcing functions, when disaggregated, held mostly for people with high NFC. However, high-NFC participants trusted and preferred cognitive forcing functions less than simple explainable AI approaches, while there was no negative impact of these interventions on the low-NFC participants. These findings suggest that cognitive forcing functions might produce intervention-generated inequalities by exacerbating the differences in performance between people based on their NFC. These findings also point to a possible way to mitigate this undesired effect. Future interventions might be tailored to account for the differences in intrinsic cognitive motivation: stricter interventions might benefit people with lower intrinsic cognitive motivation more and might still be accepted by them.

A key limitation of our work is that it was conducted in the context of a single non-critical decision-making task. Additional work is needed to examine whether the effects we observe generalize across domains and settings. However, because prior research provides ample evidence that even experts making critical decisions resort to heuristic thinking (and fall prey to cognitive biases) [4, 6, 17, 42], we have some confidence that our results will generalize broadly.

Another limitation is that cognitive forcing functions improved the performance of the human+AI teams at the cost of reducing the perceived usability.
As AIs increasingly assist humans in decision-making tasks ranging from trivial to high-stakes, it is imperative for people to be able to detect incorrect AI recommendations. This ability is also an important milestone towards building human+AI teams that outperform both people and AIs alone. Thus, in this study, we investigated the effect of cognitive forcing functions as interventions for reducing human overreliance on the AI in collaborative human+AI decision-making.

First, our results demonstrated that cognitive forcing functions, which elicit analytical thinking, significantly reduced overreliance on the AI compared to the simpler approaches of presenting explanations for the AI recommendations to the user.

Second, our results also showed that there exists a trade-off between subjective trust in and preference for a system and performance with that system in human+AI decision-making. Participants preferred and trusted more the systems that they perceived as less mentally demanding, but with which they performed more poorly.

Third, our research suggests that, as researchers, we should audit our work for emerging inequalities among relevant groups. Our results show that cognitive forcing functions disproportionately benefited participants with high Need for Cognition, a group that has been shown in other settings to benefit the most from useful but cognitively demanding user interface features.

Together, these findings suggest that research in explainable AI should not assume that people will engage with the explanations by default. Hence, more effort should be spent on devising interventions that elicit analytical thinking and engagement with explanations when necessary, so as to avoid unquestioning trust from humans. Further research is also necessary to account for individual differences in cognitive motivation and to explore the right amount and timing of cognitive forcing for optimal human performance with AI-powered decision-support tools.

ACKNOWLEDGMENTS

We thank Isaac Lage, Andrzej Romanowski, Ofra Amir, Zilin Ma, and Vineet Pandey for helpful suggestions and discussions.

REFERENCES

[1] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S Lasecki, Daniel S Weld, and Eric Horvitz. 2019. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7. 2–11.
[2] Gagan Bansal, Tongshuang Wu, Joyce Zhu, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S Weld. 2021. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI '21). Association for Computing Machinery, New York, NY, USA, 1–16. To appear.
[3] Dale J Barr, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68, 3 (2013), 255–278.
[4] Eta S Berner and Mark L Graber. 2008. Overconfidence as a cause of diagnostic error in medicine. The American Journal of Medicine.
[5] Umang Bhatt, Adrian Weller, and José M. F. Moura. 2020. Evaluating and Aggregating Feature-based Model Explanations. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Christian Bessiere (Ed.). International Joint Conferences on Artificial Intelligence Organization, 3016–3022. https://doi.org/10.24963/ijcai.2020/417
[6] Brian H Bornstein and A Christine Emler. 2001. Rationality in medical decision making: a review of the literature on doctors' decision-making biases. Journal of Evaluation in Clinical Practice 7, 2 (2001), 97–107.
[7] Zana Buçinca, Phoebe Lin, Krzysztof Z. Gajos, and Elena L. Glassman. 2020. Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces (IUI '20). ACM, New York, NY, USA.
[8] Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. 77–91.
[9] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan. 2015. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics. IEEE, 160–169.
[10] John T. Cacioppo and Richard E. Petty. 1982. The need for cognition. Journal of Personality and Social Psychology 42, 1 (1982), 116–131. https://doi.org/10.1037/0022-3514.42.1.116
[11] J T Cacioppo, R E Petty, and C F Kao. 1984. The efficient assessment of need for cognition. Journal of Personality Assessment 48, 3 (1984), 306–307. https://doi.org/10.1207/s15327752jpa4803_13
[12] Giuseppe Carenini. 2001. An Analysis of the Influence of Need for Cognition on Dynamic Queries Usage. In CHI '01 Extended Abstracts on Human Factors in Computing Systems (Seattle, Washington) (CHI EA '01). ACM, New York, NY, USA, 383–384. https://doi.org/10.1145/634067.634293
[13] Ana-Maria Cazan and Simona Elena Indreica. 2014. Need for cognition and approaches to learning among university students. Procedia - Social and Behavioral Sciences 127 (2014), 134–138.
[14] Jim Q Chen and Sang M Lee. 2003. An exploratory cognitive DSS for strategic decision making. Decision Support Systems 36, 2 (2003), 147–160.
[15] Glinda S Cooper and Vanessa Meterko. 2019. Cognitive bias research in forensic science: a systematic review. Forensic Science International 297 (2019), 35–46.
[16] Pat Croskerry. 2003. Cognitive forcing strategies in clinical decisionmaking. Annals of Emergency Medicine 41, 1 (2003), 110–120.
[17] Pat Croskerry. 2003. The importance of cognitive errors in diagnosis and strategies to minimize them. Academic Medicine 78, 8 (2003), 775–780.
[18] Louis Deslauriers, Logan S McCarty, Kelly Miller, Kristina Callaghan, and Greg Kestin. 2019. Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. Proceedings of the National Academy of Sciences (2019), 201821936.
[19] Jennifer L Eberhardt. 2020. Biased: Uncovering the hidden prejudice that shapes what we see, think, and do. Penguin Books.
[20] John W Ely, Mark L Graber, and Pat Croskerry. 2011. Checklists to reduce diagnostic errors. Academic Medicine 86, 3 (2011), 307–313.
[21] Gavan J Fitzsimons and Donald R Lehmann. 2004. Reactance to recommendations: When unsolicited advice yields contrary responses. Marketing Science 23, 1 (2004), 82–94.
[22] Krzysztof Z. Gajos and Krysta Chauncey. 2017. The Influence of Personality Traits and Cognitive Load on the Use of Adaptive User Interfaces. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (Limassol, Cyprus) (IUI '17). ACM, New York, NY, USA, 301–306. https://doi.org/10.1145/3025171.3025192
[23] Neelansh Garg, Apuroop Sethupathy, Rudraksh Tuwani, Rakhi Nk, Shubham Dokania, Arvind Iyer, Ayushi Gupta, Shubhra Agrawal, Navjot Singh, Shubham Shukla, et al. 2018. FlavorDB: a database of flavor molecules. Nucleic Acids Research 46, D1 (2018), D1210–D1216.
[24] Bhavya Ghai, Q. Vera Liao, Yunfeng Zhang, Rachel Bellamy, and Klaus Mueller. 2021. Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers. Proc. ACM Hum.-Comput. Interact. 4, CSCW3, Article 235 (2021), 28 pages. https://doi.org/10.1145/3432934
[25] Mark L Graber, Stephanie Kissam, Velma L Payne, Ashley ND Meyer, Asta Sorensen, Nancy Lenfestey, Elizabeth Tant, Kerm Henriksen, Kenneth LaBresh, and Hardeep Singh. 2012. Cognitive interventions to reduce diagnostic error: a narrative review. BMJ Quality & Safety 21, 7 (2012), 535–557.
[26] Ben Green and Yiling Chen. 2019. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24.
[27] Curtis P Haugtvedt, Richard E Petty, and John T Cacioppo. 1992. Need for cognition and advertising: Understanding the role of personality variables in consumer behavior. Journal of Consumer Psychology 1, 3 (1992), 239–260.
[28] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70.
[29] Jessica Hullman, Eytan Adar, and Priti Shah. 2011. Benefitting infovis with visual difficulties. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2213–2222.
[30] Maia Jacobs, Melanie F. Pradier, Thomas H. McCoy Jr, Roy H. Perlis, Finale Doshi-Velez, and Krzysztof Z. Gajos. 2021. How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational Psychiatry 11 (2021). https://doi.org/10.1038/s41398-021-01224-x
[31] Heinrich Jiang, Been Kim, Melody Y Guan, and Maya Gupta. 2018. To trust or not to trust a classifier. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5546–5557.
[32] Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.
[33] Daniel Kahneman and Shane Frederick. 2002. Representativeness revisited: Attribute substitution in intuitive judgment. In Heuristics and Biases: The Psychology of Intuitive Judgment. Cambridge University Press, New York, 49–81.
[34] Daniel Kahneman, Stewart Paul Slovic, Paul Slovic, and Amos Tversky. 1982. Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press.
[35] Ece Kamar. 2016. Directions in Hybrid Intelligence: Complementing AI Systems with Human Intelligence. In IJCAI. 4070–4073.
[36] Ece Kamar, Severin Hacker, and Eric Horvitz. 2012. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 467–474.
[37] Eric Kearney, Diether Gebert, and Sven C Voelpel. 2009. When and how diversity benefits teams: The importance of team members' need for cognition. Academy of Management Journal 52, 3 (2009), 581–598.
[38] Wouter Kool and Matthew Botvinick. 2018. Mental labour. Nature Human Behaviour 2, 12 (2018), 899–908.
[39] Todd Kulesza, Simone Stumpf, Margaret Burnett, Sherry Yang, Irwin Kwan, and Weng-Keen Wong. 2013. Too much, too little, or just right? Ways explanations impact end users' mental models. In 2013 IEEE Symposium on Visual Languages and Human Centric Computing. IEEE, 3–10.
[40] Vivian Lai, Han Liu, and Chenhao Tan. 2020. "Why is 'Chicago' Deceptive?" Towards Building Model-Driven Tutorials for Humans. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376873
[41] Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 29–38.
[42] Kathryn Ann Lambe, Gary O'Reilly, Brendan D Kelly, and Sarah Curristan. 2016. Dual-process cognitive interventions to enhance diagnostic reasoning: a systematic review. BMJ Quality & Safety 25, 10 (2016), 808–820.
[43] G Daniel Lassiter, Michael A Briggs, and R David Slaw. 1991. Need for cognition, causal processing, and memory for behavior. Personality and Social Psychology Bulletin 17, 6 (1991), 694–700.
[44] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human Factors 46, 1 (2004), 50–80.
[45] Chin-Lung Lin, Sheng-Hsien Lee, and Der-Juinn Horng. 2011. The effects of online reviews on purchasing intention: The moderating role of need for cognition. Social Behavior and Personality: An International Journal 39, 1 (2011), 71–81.
[46] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 4765–4774.
[47] Martijn Millecamp, Nyi Nyi Htun, Cristina Conati, and Katrien Verbert. 2019. To Explain or Not to Explain: The Effects of Personal Characteristics When Explaining Music Recommendations. In Proceedings of the 24th International Conference on Intelligent User Interfaces (Marina del Ray, California) (IUI '19). Association for Computing Machinery, New York, NY, USA, 397–407. https://doi.org/10.1145/3301275.3302313
[48] Martijn Millecamp, Nyi Nyi Htun, Cristina Conati, and Katrien Verbert. 2020. What's in a User? Towards Personalising Transparency for Music Recommender Interfaces. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (Genoa, Italy) (UMAP '20). Association for Computing Machinery, New York, NY, USA, 173–182. https://doi.org/10.1145/3340631.3394844
[49] Carol-anne E Moulton, Glenn Regehr, Maria Mylopoulos, and Helen M MacRae. 2007. Slowing down when you should: a new model of expert judgment. Academic Medicine 82, 10 (2007), S109–S116.
[50] Joon Sung Park, Rick Barber, Alex Kirlik, and Karrie Karahalios. 2019. A Slow Algorithm Improves Users' Assessments of the Algorithm's Accuracy. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–15.
[51] Avi Parush, Shir Ahuvia, and Ido Erev. 2007. Degradation in spatial knowledge acquisition when using automatic navigation systems. In International Conference on Spatial Information Theory. Springer, 238–254.
[52] Richard E. Petty and John T. Cacioppo. 1986. The Elaboration Likelihood Model of Persuasion. Communication and Persuasion 19 (1986), 1–24. https://doi.org/10.1007/978-1-4612-4964-1_1
[53] Vlad L Pop, Alex Shrewsbury, and Francis T Durso. 2015. Individual differences in the calibration of trust in automation. Human Factors 57, 4 (2015), 545–556.
[54] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* '20). Association for Computing Machinery, New York, NY, USA, 33–44. https://doi.org/10.1145/3351095.3372873
[55] Jonathan Sherbino, Kulamakan Kulasegaram, Elizabeth Howey, and Geoffrey Norman. 2014. Ineffectiveness of cognitive forcing strategies to reduce biases in diagnostic reasoning: a controlled trial. Canadian Journal of Emergency Medicine 16, 1 (2014), 34–40.
[56] Maria Sicilia, Salvador Ruiz, and Jose L Munuera. 2005. Effects of interactivity in a web site: The moderating effect of need for cognition. Journal of Advertising 34, 3 (2005), 31–44.
[57] J Spilke, HP Piepho, and X Hu. 2005. Analysis of unbalanced data by mixed linear models using the MIXED procedure of the SAS system. Journal of Agronomy and Crop Science.
[58] Tracy L Tuten and Michael Bosnjak. 2001. Understanding differences in web usage: The role of need for cognition and the five-factor model of personality. Social Behavior and Personality: An International Journal 29, 4 (2001), 391–398.
[59] Michelle Vaccaro and Jim Waldo. 2019. The effects of mixing machine learning and human judgment. Commun. ACM 62, 11 (2019), 104–110.
[60] Tiffany C Veinot, Hannah Mitchell, and Jessica S Ancker. 2018. Good intentions are not enough: how informatics interventions can worsen inequality. Journal of the American Medical Informatics Association 25, 8 (2018), 1080–1088.
[61] Jennifer Irvin Vidrine, Vani Nath Simmons, and Thomas H. Brandon. 2007. Construction of smoking-relevant risk perceptions among college students: The influence of need for cognition and message content. Journal of Applied Social Psychology 37, 1 (2007), 91–114. https://doi.org/10.1111/j.0021-9029.2007.00149.x
[62] Alan R Wagner, Jason Borenstein, and Ayanna Howard. 2018. Overtrust in the robotic age. Commun. ACM 61, 9 (2018), 22–24.
[63] Peter C Wason and J St BT Evans. 1974. Dual processes in reasoning? Cognition 3, 2 (1974), 141–154.
[64] Jacob Westfall, David A Kenny, and Charles M Judd. 2014. Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General.
[65] Health Communication 15, 4 (2003), 375–392.
[66] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12.
[67] Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. 2020. Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcelona, Spain) (FAT* '20). Association for Computing Machinery, New York, NY, USA, 295–305. https://doi.org/10.1145/3351095.3372852