Explainable AI and Adoption of Financial Algorithmic Advisors: an Experimental Study
Daniel Ben David (a,b), Yehezkel S. Resheff (b), Talia Tron (b)
(a) Hebrew University of Jerusalem; (b) Intuit Inc.
Abstract
Machine learning is becoming a commonplace part of our technological experience. The notion of explainable AI (XAI) is attractive when regulatory or usability considerations necessitate the ability to back decisions with a coherent explanation. A large body of research has addressed algorithmic methods of XAI, but it is still unclear how to determine what is best suited to create human cooperation and adoption of automatic systems. Here we develop an experimental methodology where participants play a web-based game, during which they receive advice from either a human or an algorithmic advisor, accompanied by explanations that vary in nature between experimental conditions. We use a reference-dependent decision-making framework and evaluate the game results over time and in various key situations to determine whether the different types of explanations affect the readiness to adopt, willingness to pay for, and trust in a financial AI consultant. We find that the types of explanations that promote adoption during the first encounter differ from those that are most successful following failure or when cost is involved. Furthermore, participants are willing to pay more for AI advice that includes explanations. These results add to the literature on the importance of XAI for algorithmic adoption and trust.
Keywords:
Explainable AI, Financial Advice, Trust, Algorithm Adoption, Experiment

1. Introduction

The use of machine learning and other automation methods is becoming overwhelmingly popular and increasingly available in different facets of everyday life. From basic research these methods have made their way into medicine, transportation, business processes, retail, customer service, and diverse financial services. At times these methods are utilized to do work for individuals, driving down the costs of previously labour-intensive products and services. Other times the same methods are utilized to automatically make decisions about individuals.

When algorithms take part in decision making with significant impact on individuals, there is a burden of explainability that naturally falls on the providers of the system. In some cases there are regulations that impose an obligatory explanation of decisions as an integral part of their output (for example, the recent GDPR legislation by the European Union, which requires the makers of AI algorithms that "significantly affect" decisions to explain how any output was obtained [9]). In other cases, a deep understanding of how an output was generated is crucial for human-based decision making in interaction with the automatic process (e.g. in some medical and security applications). However, in many other cases, and in various fields, the importance of explanations is first and foremost in the effect on the perceived trustworthiness of the system [26, 12, 20], and hence on the readiness of consumers to adopt (RTA) the AI service, and pay for it.

In recent years, the field of explainable AI (XAI) has seen a boost in interest from the community, with many new approaches and ideas. Most of the effort is on the technological and algorithmic side – borrowing ideas from game theory, statistics, and machine learning to develop fast and accurate techniques which explain black-box or opaque models [21, 19, 14], or to train AI which is more interpretable by nature [1, 18]. Another part of the literature deals with the independently important question of evaluating explanations based on their attributes, and with user evaluation of explanations and the consequent effect on behavior.
The first important attribute when considering what explanation to generate is the type of information the explanation should convey. We find in the literature three main approaches to explanations: (1) Global Explanations, (2) Local Explanations, and (3) what might be called Social Influence Explanations.
Global explanation techniques provide an overview of what an algorithm is doing as a whole. The aim of this type of explanation is to convey to a human what the algorithm is doing rather than explain the process that led to a specific prediction or decision. These methods often include summarized information about how a model uses features to produce predictions (some popular approaches include various notions of feature importance, dependence plots, and global Shapley values), prototype example predictions, or a simplified, interpretable approximation of a black-box model (a.k.a. surrogate models) [21, 14].

Another type of global explanation that is completely independent of the algorithm used is the presentation of any type of meta-information that sheds light on the process. This includes transparency around how the model was trained, the type of data that was used, or even simply reporting model performance statistics. Global explanations are for the most part less costly to produce in real-world systems, compared to the alternatives, and are readily available in most cases, making them appealing in practice.
Local explanation techniques, on the other hand, provide a more detailed description of how the model came up with a specific prediction. These may include information about how a model uses features to generate a specific output [19, 22, 23, 2], how a perturbation of the input will influence the output [25, 4], or a comparison of the specific input-output pair at hand to the model's output on similar input data [13]. In general, local explanation techniques are more costly in time and resources since they must be computed on a case-by-case basis rather than globally for the entire system. Furthermore, local explanations are inherently only available once the system is being used, and are not applicable when the aim is to convince a user beforehand and build a-priori trust and consent that is independent of actual experience with the system.
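To make the distinction concrete, the sketch below (an illustration only, not the advisor used in this study) computes a global explanation as the mean absolute SHAP attribution over a dataset, and a local explanation as the per-feature attributions for a single prediction. It assumes the shap and scikit-learn packages and a generic tree-based model.

```python
# Illustrative sketch: global vs. local explanations via SHAP values.
# Not the procedure used in the study; any model and data would do.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # one attribution per feature per sample

# Global explanation: average magnitude of each feature's contribution over the dataset.
global_importance = np.abs(shap_values).mean(axis=0)
print("Global feature importance:", global_importance)

# Local explanation: contribution of each feature to one specific prediction.
local_explanation = shap_values[0]
print("Local attribution for sample 0:", local_explanation)
```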
Social Influence Explanation techniques contain a type of information which relates to the way socially relevant others behave. These methods are typically discussed in the context of recommendation systems; a system using this sort of explanation may show a report on model adoption statistics, or the ranking of a specific item or of the entire system by users with a similar profile or shared characteristics [24, 11]. We do not use this technique in our study, and it is only briefly discussed here for the completeness of the attributes review.

An additional factor which should be taken into account is the way explanations are presented. The amount of information provided, the phrasing, and the choice of specific words may all affect what individuals perceive [6, 11]. In parallel with the textual message, the literature on explanations explores the visual interface and which visual method is the more effective way to present a recommendation (e.g. star rankings, histograms, neighbor ranking tables, pies, word clouds) [24].

With the multitude of methods that currently exist, it is not clear how to choose for a specific case. What constitutes a good explanation, and is it constant or individual and context dependent?
Unlike the well-defined metrics used to evaluate machine learning, such as accuracy and precision for model performance, the definition of a good explanation of the output or nature of a model is somewhat vague. The definition is presumably not universal, but rather highly dependent on what we are trying to achieve by augmenting an algorithm with an explanation. Objectives may vary dramatically for different use-cases, ranging for instance from helping the user understand how a decision was made, to attempting to convince a user to adopt the system or take a recommendation. As a result, evaluation approaches and methodologies vary between different domains.

Explanation evaluation can be conceptually divided into qualitative and quantitative evaluation techniques. Within the quantitative measures, one can find lines of work that focus on evaluating the mathematical (or statistical) attributes of the explanation that is generated. For example, when the explanation is presented as a report of feature importance (i.e. a list or ranking of the features that were instrumental in making the decision), statistical properties such as local accuracy [19] help determine the quality of the list. This type of evaluation completely avoids the question of what is useful or leads to adoption and trust. In some cases the truthfulness of explanations takes precedence over all else, justifying this view. Other times it makes sense to sacrifice on this front in order to achieve the actual goal of the system. Much of the relevant literature from the field of recommendation systems focuses on measuring usage indicators to evaluate effectiveness – the extent to which the system assists the user in making better decisions compared to previous behaviour – and efficiency – the extent to which it helps users make faster decisions [24, 11].

The qualitative measurement literature is focused mostly on user understanding of, and reaction to, the explanations. This line of research uses various questionnaires. Many measures have been suggested to reflect the different aspects of what is important when designing explanations. These include transparency (the level of detail provided), scrutability (the extent to which users can give feedback to alter the AI system when it is wrong), trust (the level of confidence in the system), persuasiveness (the degree to which the system itself is convincing in making users buy or try recommendations given by it), satisfaction (the level to which the system is enjoyable to use), and user understanding (the extent to which a user understands the nature of the AI service offered, or alternatively the level of similarity of an explanation generated by the automatic method to explanations produced by a human being) [22, 17, 16, 24, 11].

In our study, we attempt to form a synergy between explainable AI and the multiple fields that have previously studied algorithmic adoption and machine-human relations. We use the quantitative measure of Readiness to Adopt (RTA), simply defined as the fraction of users that use the AI system when presented with the choice, and later the Willingness to Pay (WTP), to explore acceptance and its relation to trust and user satisfaction. We study the impact of the explanation type (both global and local) on the adoption of, and payment willingness towards, an AI financial decision-making advisor. We do so in a unique experimental framework, in a controlled environment with real money consequences, and with repeating interactions which evolve over time.
In addition, we relate the Readiness To Adopt (RTA) and the Willingness to Pay (WTP) in the different treatments to known constructs from the literature on trust [7, 20, 15] and perceived quality of explanations [15, 12]. This innovative framework allows us to examine whether there are differences in adoption and payment when the advice is labeled as coming from a human advisor compared to an AI-based algorithm, and what the effects of the types of explanations are on initial adoption, adoption over time, algorithm aversion when the model fails, fully autonomous decision-making services, and willingness to pay. The research questions we aim to explore include: are different types of explanations important for the adoption of a financial AI algorithm when all else is equal (the actual advice is the same)? Will providing more detailed information (via local explanations) increase trust and RTA? How will this effect change after multiple interactions with the model? How will a failure in model performance affect the way people perceive financial algorithms with different explanations? Do explanations influence the RTA of autonomous advice, or have an effect when the advice is costly? Do explanations influence the advice consumer's WTP?

To the best of our knowledge, this study is the first to compare the effect of different types of textual explanations of AI advice in terms of RTA over time and in different situations. Our approach stems from the conjecture that the model-consumer interaction has a dynamic, reference-dependent nature, and may be influenced by various factors such as familiarity, past performance, and potential cost. This implies that one explanation can be optimal for creating a good sense of trustworthiness and gaining initial trust and understanding, while a different one could be better fitting after a period of using the model, or in cases where it fails. In addition, and to our knowledge, this study is the first to explore the effect of explanations on customers' Willingness to Pay (WTP) for AI advice.

The rest of the paper is organized as follows: in the next section we provide a detailed description of the experiment methodology and design, discuss the choice of experimental treatments and game flow, and relate them to the existing literature from the fields of explainable AI and algorithmic trust, respectively. Next, we present the results from an online experiment showing how explanations affect adoption and trust in the different phases of human-AI interactions, as well as the consequences for participants' willingness to pay for the service. Finally, we discuss the findings from this study and suggest future directions with the potential to broaden the scope of the current work and generalize to other domains.
2. Methods
The study consisted of three parts, as follows:

1. Pre-game quantitative questionnaire. Participants had a time limit of 3 minutes to answer 3 simple mathematical questions (addition and subtraction). Upon completing this part, participants earned 20 initial game coins that were used in the main part of the study. Each game coin during the game was worth 2 U.S. cents. The purpose of this stage was twofold: the first was to ensure participants' attention during the experiment; the second relates to the mental accounting literature [10] – we wanted participants to treat the game-coins account as money that they earned while investing effort, rather than money that they obtained as a "reward".

2. The main part of the study: a fun, interactive, decision-making game – the Lemonade Stand (see the full description below). In this part participants could gain more coins depending on their decisions, and could potentially use an advisor, labeled differently and accompanied by several types of explanations, in doing so.

3. A post-game questionnaire about trust, engagement, explanation satisfaction, and personal demographic details. The questionnaires about trust and explanation satisfaction are based on [7, 15, 20, 12]; the metrics and questionnaires can be found in Appendix A.

Figure 1 illustrates the Lemonade Stand game. At the beginning of the game participants were instructed as follows: "You own a lemonade stand, your goal is to make as much money as you can in 2 weeks by selling lemonade. Decide how many cups you want to make, per day, based on the price of lemons and the weather forecast." At the beginning of each day, a weather forecast (sunny, cloudy, or rainy) was displayed together with the varying price of lemons (0.45-0.55 coins per cup). The lemonade selling price was fixed (1 coin). Participants had to decide how many cups to produce; the daily demand depended on the actual weather (Table 1).
Table 1: The ranges of demand level conditioned on weather. The demand for lemonade cups is sampled uniformly from the range associated with the actual weather each day.
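As an illustration of the game's economics, the sketch below simulates a single game day: demand is drawn uniformly from a weather-dependent range, and the day's profit follows from the quantity produced, the lemon cost, and the fixed selling price. The demand ranges in this sketch are placeholders, not the actual values of Table 1.

```python
# Illustrative simulation of one lemonade-stand game day.
# The demand ranges below are hypothetical placeholders, not the study's Table 1 values.
import random

DEMAND_RANGE = {          # cups demanded, conditioned on the actual weather (placeholders)
    "sunny": (6, 10),
    "cloudy": (3, 7),
    "rainy": (0, 3),
}
SELL_PRICE = 1.0          # fixed selling price per cup, in game coins

def play_day(cups_made: int, actual_weather: str, lemon_cost: float) -> float:
    """Return the day's profit in game coins."""
    low, high = DEMAND_RANGE[actual_weather]
    demand = random.randint(low, high)     # demand sampled uniformly from the range
    sold = min(cups_made, demand)          # unsold cups are wasted
    return sold * SELL_PRICE - cups_made * lemon_cost

# Example: produce 6 cups on a cloudy day with lemons at 0.5 coins per cup.
print(play_day(6, "cloudy", lemon_cost=0.5))
```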
The game starts with a short learning session (3 days) designed to verify participants' understanding. These days are not counted in the day numbering used throughout the analysis. After training, participants play the game for 14 game-days for real game coins. The complete game flow is described in the bottom graphic of Figure 1.
Initial adoption – On day 3, after completing 2 days of playing for real game money, participants were given the option to take recommendations from a so-called "advisor". Each recommendation by the advisor was accompanied by a short explanation which varied between experimental conditions (details below). To avoid an anchoring effect, recommendations were always displayed after participants had already decided how many cups to make. They could then change their decision based on the recommendation (a "Take Advice" button allowed them to switch to the number prescribed by the advisor; no other change to the amount of lemonade production was allowed at this stage); alternatively, they could stay with their original choice. Initial adoption following the first impression was measured by the Readiness To Adopt (RTA) in this phase – the percent of users that took the advice each day – and was compared between the different explanation treatments (see 2.3).
Adoption gain – During the first days of the experiment (between days 3 and 7), the advice in all explanation treatments was set to perform perfectly (i.e. the advice given was precisely the actual demand that would occur). During this period we test how confidence in the algorithmic advice is gained upon exposure to the accurate advisor, and specifically we compare the different conditions on day 7, after 4 rounds of exposure.
Advice failure and algorithm aversion – On day 7, the weather forecast was sunny for all participants and the recommendation drastically fails, with a recommendation to produce the maximal amount of lemonade (10 cups) and an actual demand for the minimal amount (0 cups). We explore the effect of the experimental treatment (different explanations) on RTA the next day (i.e. day 8). From day 9 until day 10 we look at the accumulation of renewed trust, as the algorithm returns to its accurate predictions.
Automatic adoption – Starting from day 11, an "Auto-pilot" mode is made available. Namely, participants are able to request that the model decide how many cups to make, without ever seeing its recommendations or entering their decision manually ("Decide for me"). By using this, users get the option not to think about production amounts at all, which makes sense for them if they know they are going to take the advice of the model anyway. The RTA here counts participants who adopt the algorithm either automatically or manually as in the previous days.
Willingness to pay – On the last day of the game (day 14), participants are told that advice will no longer be free, and are asked how much they are willing to pay to keep receiving recommendations in the following days. The purpose of this part is to check whether the adoption rate is affected by the algorithm no longer being free, and to explore whether there are differences in willingness to pay between the different mechanisms [3].
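Throughout the analysis, the RTA on a given day is simply the fraction of participants in a condition who took (or auto-took) the advice. A minimal sketch of how it might be computed from per-participant game logs is shown below; the long-format table and its column names are hypothetical.

```python
# Minimal sketch of computing daily RTA per treatment from game logs.
# Column names (participant_id, condition, day, took_advice) are hypothetical.
import pandas as pd

logs = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3, 3],
    "condition": ["No Explanation", "No Explanation",
                  "Feature-Based", "Feature-Based",
                  "Performance-Based", "Performance-Based"],
    "day": [3, 7, 3, 7, 3, 7],
    "took_advice": [1, 1, 0, 1, 1, 0],   # 1 if the advice was adopted that day
})

# RTA = mean of the adoption indicator, per condition and game day.
rta = (logs.groupby(["condition", "day"])["took_advice"]
           .mean()
           .rename("RTA")
           .reset_index())
print(rta)
```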
Participants were randomly assigned to 5 experimental conditions, which differ only in the explanations for the advice that is given (see Table 2 for the full text of the explanations provided to each group). Each participant was consistently shown only one type of advice throughout the experiment. Participants in the Human Expert group were informed that the advice was being generated by a so-called human expert in the field of lemonade stand operations. This condition was designed to assess whether the mere usage of a computer algorithm has an effect on trust and adoption, and whether there is an algorithmic aversion as previously found by [5]. In all other groups, participants were told that the suggestions are based on a so-called computer algorithm. In the No Explanation condition, participants were given no information about how the algorithm generates the predictions. This condition was used as the baseline against which all other algorithm explanation treatments are compared.
[Figure 1 graphic: a sample game-day screen (weather forecast, cash, lemon cost, the advisor's recommendation, and the end-of-day summary), and the game flow: learning session → start playing for real coins → Day 3: advice becomes available → Day 7: the model fails → Day 11: auto-pilot becomes available → Day 14: advice is no longer free.]
Figure 1: Illustration of one day at the lemonade stand (top), and the general flow of the lemonade stand game (bottom). Each day the player is given the weather forecast and asked how many cups of lemonade they would like to produce. After selecting the desired amount, advice is presented and the user can either accept the advice or produce the amount previously stated. Finally, the day is concluded with a screen that summarizes the events of the day, including the amount produced, demand, and earnings.
In the Global Explanation condition, participants were given very general information about how the model operates, namely the type and extent of data it uses (sales data from many lemonade stands over several years). This condition simulates a very prevalent real-world scenario where we aim to increase model transparency with minimal cost in computation time and resources. In the Feature-Based condition, on the other hand, participants were given a fully detailed account of the features that were used to generate the specific prediction, similar to using local explanations in a real-world scenario. To test the additive value of this information, it was attached to the baseline global explanation. Finally, to test the potential effect of reporting model performance, in the Performance-Based condition, information regarding performance ("with 90 percent certainty") was also attached to the baseline global explanation (we note that the performance observed by participants during the experiment is indeed statistically consistent with the 90 percent accuracy value mentioned to them).
Condition           | Explanation text
Human Expert        | The human advisor recommendation is to make 6 cups of lemonade.
No Explanation      | The algorithm recommendation is to make 6 cups of lemonade.
Global Explanation  | Based on data from lemonade stands over several years, the algorithm recommendation is to make 6 cups of lemonade.
Feature-Based       | Based on data from lemonade stands over several years, your previous sales, and market demand, the algorithm recommendation is to make 6 cups of lemonade.
Performance-Based   | Based on data from lemonade stands over several years, the algorithm recommendation, with 90 percent certainty, is to make 6 cups of lemonade.
Table 2: The text presented to users in each experimental treatment. The recommended number presented changed according to the specific day's advice and varied between 1 and 10.
To clarify, all participants faced the same game phases and process, with only one difference: the label on the advice offered. Given that the only difference was the "label" of the advice, and the fact that participants were randomly assigned to the conditions, no significant difference in RTA and WTP should be observed. Any observed differences serve as an indication of prior perceptions of, and preferences related to, the given label.
The study was conducted via the online Amazon Mechanical Turk (AMT) platform. The total time participants had to complete the experiment was 25 minutes; the average completion time was 12.66 minutes. There were initially 514 respondents to the experiment. We added a screening mechanism with a minimal completion time of 6 minutes in order to filter out individuals who did not play the game with sufficient attention and engagement; 65 participants who finished in an unrealistic time of less than 6 minutes were excluded. Our final sample consists of 449 subjects. These were randomly distributed between the five explanation conditions: 84 of the participants received the Human Expert treatment, 104 received the algorithm with No Explanation, 89 were allocated to the algorithmic advice with Global Explanation, 87 received the Feature-Based explanation, and the remaining 85 participants received the Performance-Based explanation. All participants were located in the U.S., with an average age of 36.6 (Std 11.2); 58.8% were male, 68% report working full time, 21.3% part time, and the rest are unemployed. 55% of our participants report having an academic degree, 12% an associate degree, and the rest a high school education or below.
3. Results

Figure 2: The average acceptance rate of model advice per game-day in each of the treatment groups.
See Figure 1 for the description of the flow of the game.
We compare the Readiness to Adopt (RTA) for the different experimental condition groups throughout the experiment's "two-week" game duration. We start by describing the results with descriptive statistics and t-tests, and then provide Probit multivariate regressions of the RTA while controlling for several relevant variables such as gender, age, and the trust and perceived-explanation-goodness indexes as measured by the questionnaires. The regression tables include the exact values for the statistics reported throughout this section.

Figure 2 summarizes the overall adoption rate (RTA) for the different experimental conditions. The overall model adoption changed throughout the game, supporting our reference-dependent approach. On the very first day on which the model was introduced (day 3) we see an average of 60% adoption when averaging over all participants and all experimental conditions. Over the next few days, during which the algorithmic performance was good, overall adoption increased, leading to an average adoption of 85.5% on day 7. Following the model failure on day 7, adoption plummeted, returning approximately to the initial level with 61% on day 8; this was followed by a slow recovery up until day 10 (66%). With the introduction of the auto-pilot option on day 11, adoption soared and reached a peak value of 94%. Finally, when participants were told advice would no longer be free, and were asked about willingness to pay, we see an overall decrease in adoption on the final day (77%).

Next, we break down the effect of the experimental explanation treatments and provide the RTA results for each group.
First, we examine the effect of human versus algorithmic origin of advice with no further explanation. Results show no evidence of algorithm aversion: there is no significant difference in the initial adoption of the Human Advisor (RTA of 56%) as opposed to the computer algorithm No Explanation condition (RTA of 57.7%). However, over the first few days, the algorithmic advice with No Explanation showed significant gains in adoption compared to the Human Advisor, leading to a 16.3 percentage-point gap on day 7 (P-value = 0.0129, t = 2.5114). Following model failure on day 7, adoption levels fell back to around the initial values in both cases (54% and 56% for No Explanation and Human Advisor, respectively). Recovery over the next few days is slightly, though not significantly, more favorable for the AI-based advisor (65% compared to 58% by day 10), and this trend remains throughout the experiment. The decline in adoption following the probe of willingness to pay is significantly larger in the Human Advisor condition, with 79% compared to 63% for the algorithmic No Explanation and the Human Advisor, respectively (P-value = 0.0169, t = 2.4104).

For the regression analysis we proceed by conducting Probit multivariate regressions. The base specification is:
RTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (1)

where RTA is the outcome variable associated with individual i at game "day" t of the experiment. T is a dummy variable for the type of advice: Human Advisor (where the underlying condition is the "AI – No Explanation" alternative). TE is a vector of trust and explanation-satisfaction dummies indicating whether an individual is above the median on the index based on the questionnaires developed by [7, 15, 20, 12] (the questionnaires can be found in Appendix A). X is the vector of age and gender variables. Table 1 shows the Probit analysis results. We observe similar results when including a range of control variables: significant differences in adoption in favor of the "AI – No Explanation" alternative remain.
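As a sketch of how a specification like model (1) might be estimated, the snippet below builds median-split trust and explanation-goodness dummies and fits a Probit with statsmodels. The data and all variable names are hypothetical placeholders; this is not the authors' actual analysis code.

```python
# Sketch of estimating a Probit specification like model (1) with statsmodels.
# All variable names and the synthetic data are hypothetical, for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "human_advisor": rng.integers(0, 2, n),       # 1 = Human Advisor, 0 = AI, No Explanation
    "trust_score": rng.normal(3.0, 1.0, n),       # questionnaire index (Likert average)
    "expl_goodness_score": rng.normal(3.0, 1.0, n),
    "age": rng.integers(20, 65, n),
    "male": rng.integers(0, 2, n),
})

# Median-split dummies for the trust and explanation-goodness indices.
df["trust_hi"] = (df["trust_score"] > df["trust_score"].median()).astype(int)
df["expl_hi"] = (df["expl_goodness_score"] > df["expl_goodness_score"].median()).astype(int)

# Synthetic adoption outcome, only so the example runs end to end.
df["rta_day7"] = (rng.random(n) < 0.7 - 0.15 * df["human_advisor"]).astype(int)

# Probit of day-7 adoption on the treatment dummy and controls, with robust standard errors.
probit = smf.probit("rta_day7 ~ human_advisor + trust_hi + expl_hi + age + male", data=df)
print(probit.fit(cov_type="HC1", disp=0).summary())
```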
Next, we consider the effect of the type of explanation provided by the algorithmic advisor on RTA during the different phases of the game. As before, we start with descriptive statistics and continue with a multivariate regression analysis. For the analysis, we conduct Probit cross-section multivariate regressions on the different treatments, while controlling for gender, age, the trust-antecedents index, and the explanation-satisfaction index, with two approaches: (A) a between-conditions analysis, as in model (1), on the different experiment days, and (B) an analysis of the change in adoption between the different experiment phases. The base specifications are:

RTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (2)

ΔRTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (3)

For model (2), the outcome variable is the RTA for individual i on the different "days" t of the experiment, and the rest is identical to what is specified above in model (1). Table 2 shows the Probit analysis results. For model (3), the analysis is done on the change in adoption between the experiment phases. The outcome variables are as follows: RTA day 7 minus RTA day 3 (gaining trust), RTA day 7 minus RTA day 8 (loss of trust), RTA day 10 minus RTA day 8 (recovery), RTA day 13 minus RTA day 11 (auto advice available), and RTA day 14 minus RTA day 13 (willingness to pay). The explanatory variables are as specified above in model (1). Table 3 shows the Probit analysis results for these models. The results are not sensitive to the choice of regression model, such as OLS or Logit.

When comparing the Global Explanation with No Explanation, we find no significant difference in initial adoption. When we attach the more detailed feature-based information to the global explanation, we likewise find no significant difference. Only the Performance-Based condition had significantly more adoption compared to No Explanation. In addition, when explanation satisfaction was perceived as high (above the median), initial adoption was significantly higher as well.
During these first four days, the advice provided in all experimental conditions allowed participants to make the highest possible profit (i.e. perfect predictions and advice). In line with this, all advice treatment alternatives gained user confidence, and adoption rose significantly, by an average of 42.5%. When comparing the different experimental conditions, we find no significant differences except for the Performance-Based explanation, which gained less adoption compared to the No Explanation alternative (P-value = 0.0403, t = 2.0652). From the regressions with controls, we find, similarly to the above, that only the Performance-Based condition gained less adoption compared to No Explanation. This result was found in both regression approaches. These results suggest that when advice performance is immaculate, the type of explanation presented with it is less important to individuals.
Following the algorithm failure on day 7, we find an average adoption drop of 28% on day 8. When we explore the different explanations we find the following: the two more elaborate explanations, the Performance-Based (P-value = 0.0445, t = -2.0225) and the Feature-Based (P-value = 0.0498, t = -1.9741), prove to be more resistant to the effect of algorithm error. Their adoption drop was significantly lower compared to the No Explanation alternative, with 16% and 21% reductions for the Feature-Based and the Performance-Based, respectively, versus a 41% reduction for the advice with No Explanation. We find no significant differences between the Global Explanation and the other alternative explanations. When we explore the adoption recovery during the next few days and until day 10, we find that the Performance-Based explanation had a marginally significant (P-value = 0.0633, t = 1.8691) tendency to recover better than the Global Explanation (12% compared to only 2% for the Feature-Based and the Global Explanation, respectively).

In the regressions with controls, we find in both model (2) and model (3) that the Feature-Based and the Performance-Based conditions have a significantly lower adoption drop compared to the No Explanation alternative. In addition, when trust was perceived as higher, it lowered the drop rates and increased the recovery from the failure. Lastly, on day 10, we observe a marginally significantly higher adoption rate for the Feature-Based condition compared to the No Explanation alternative.
On day 11, subjects have the option to select auto advice before starting the day. This new instrument, which enables participants to avoid deciding by themselves, has a major impact. We observe an adoption jump of 40% on average, to a mean RTA of 95% on day 11 and up to 97% on day 13 for the algorithmic treatment alternatives that included explanations (Global, Feature-Based, Performance-Based). Among the alternative explanations, we find that the Feature-Based explanation achieved the highest adoption rates, which are marginally significant compared to the No Explanation alternative (P-value = 0.0651, t = 1.8557) and significant compared to the Global Explanation (P-value = 0.0249, t = 2.2627). In the regressions, similarly to the above, we find no significant differences between the experimental conditions and the No Explanation alternative.

Concerning the adoption jump, we relate our results to the fact that making financial decisions is a challenging task that people reluctantly enter [8]. The way the advice is now offered enables the participants to transfer the responsibility for the decision making to the advisor and avoid making it by themselves. Also, the relatively low stakes of the game (winning several U.S. dollars at the end of the experiment) can serve as an explanation of why participants transfer responsibility and avoid effort.
On the last day we introduce a new situation where the advisor is no longer free, and ask participants whether they want the advice and, if so, how much they are willing to pay for it. During this round we observe an adoption drop of 21% for the algorithmic alternatives, despite the fact that we allowed the subjects to self-determine the value (the price) of the advice. The Feature-Based condition showed a tendency toward a smaller adoption reduction (only a 14.5% RTA drop). This result was significant compared to the Global Explanation (P-value = 0.0398, t = -2.0712). No other significant differences were observed between the other algorithmic alternatives. From the regression analyses, we find no significant differences between the experimental conditions and the No Explanation alternative.

3.4. Willingness to pay
The Willingness to Pay measurement, as specified above, can further attest to the way participants perceive the advice given by the different advice mechanisms. We find that participants assigned to the No Explanation alternative were willing to pay the least for the use of the advice, with a mean payment of 1.005 game coins, compared to a mean payment of 1.774 game coins for all other "AI advice" alternatives – a 76.5% gap. The difference in willingness to pay was significant for the Global Explanation and Performance-Based conditions (P-value = 0.0150, t = 2.4535 and P-value = 0.0063, t = 2.7621, respectively) and marginally significant for the Feature-Based condition (P-value = 0.0566, t = 1.9181). No significant differences were observed between the other alternatives. In addition, we observe a marginally significant positive difference in the willingness to pay for the Human Expert compared to the No Explanation condition (P-value = 0.0611, t = 1.8836). This result is in contrast to what we observe with respect to the RTA on the same day.

For the regression analysis, we conduct a Tobit cross-section multivariate regression on the different treatments, while controlling for gender, age, the trust-antecedents index, and the explanation-satisfaction index. The base specification is:

WTP_i = α + β·T_i + δ·TE_i + γ·X_i + ε_i    (4)

The outcome variable is the Willingness to Pay (WTP) for individual i. T is a dummy variable for the type of advice: Global Explanation, Performance-Based, Feature-Based, or Human Expert (where the underlying condition is the "AI – No Explanation" alternative). The rest is identical to what is specified above in model (1). Regression analysis results are as described above and specified as part of model (4) in Table 4.
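Since WTP is bounded below at zero, a Tobit (censored) regression is a natural choice. statsmodels has no built-in Tobit, so the sketch below writes the left-censored log-likelihood directly and maximizes it with SciPy; it is an illustration under hypothetical variable names, not the authors' estimation code.

```python
# Sketch of a Tobit (left-censored at 0) regression for a WTP outcome.
# Illustrative only; the design matrix and data are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X, lower=0.0):
    """Negative log-likelihood of a Tobit model censored from below at `lower`."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                           # keep sigma positive
    xb = X @ beta
    censored = y <= lower
    ll = np.where(
        censored,
        norm.logcdf((lower - xb) / sigma),              # probability mass at the censoring point
        norm.logpdf((y - xb) / sigma) - np.log(sigma),  # density for uncensored observations
    )
    return -ll.sum()

# Tiny synthetic example so the sketch runs end to end.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.normal(36, 11, n)])  # const, treatment, age
latent = X @ np.array([0.5, 0.8, 0.0]) + rng.normal(0, 1.0, n)
y = np.clip(latent, 0.0, None)                          # observed WTP, censored at zero

start = np.zeros(X.shape[1] + 1)
fit = minimize(tobit_negloglik, start, args=(y, X), method="BFGS")
print("Estimated coefficients:", fit.x[:-1], "sigma:", np.exp(fit.x[-1]))
```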
From the regression analyses, we find that the willingness to pay in the Global Explanation and Performance-Based conditions is significantly higher compared to the No Explanation alternative, as observed in the t-tests, with the same marginally significant tendency toward higher payment for the Feature-Based condition compared to No Explanation. In addition, we find a significant U shape in age for the willingness to pay, consistent with the results of [3]. For the comparison of the Human Expert vs. No Explanation, we find that with the controls the observed effect is not significant, meaning that we do not find significant differences in willingness to pay between those advice alternatives. The results are not sensitive to using an OLS regression.

Overall, the explanations conveying general information about the algorithm, Global or Performance-Based, had the highest impact on participants' willingness to pay. We relate our results to the theory of [8], in which people are willing to pay more for an advisor that they trust more. It seems that measuring WTP in addition to RTA captures an additional dimension of trust relating to algorithmic advice.
To reinforce our measure of adoption, and to validate its relation to trust, we examined the effect of trust-antecedent questions, which were found relevant in the literature, on the adoption of the advice alternatives, using within-treatment analysis over the various days. We used a validated index including 7 trust questions as the measure. We find that on day 8 and day 10, after the algorithm failed, individuals who indicated that they trust the advice more indeed adopted it more. In particular, individuals who, by our index, trusted the advice more in the Feature-Based and Performance-Based explanation conditions showed higher adoption rates. In addition to the trust index, when we explore the explanation goodness questionnaire index, we find that it significantly influenced day 3 adoption: individuals who valued the explanation more adopted the advice more. Regression analysis results are as described above and specified as part of models (2) and (3).
4. Conclusion
Machine learning has become a prevalent technology, especially over the last decade, with applications and implications in many aspects of everyday life. In order to promote a much-needed cooperation between human and machine, one of the research goals in the field of explainable AI is to find the most successful ways of explaining and presenting AI-based decisions to individuals. The majority of this line of work is focused on algorithmic methods to produce explanations for complex machine learning models. However, the complementary question of whether different types of explanations are important for creating trust and adoption is for the most part still open. To this end, we constructed a decision-making game framework to test adoption, trust, and willingness to pay for financial machine learning models. Unlike questionnaire-based paradigms, the lemonade stand game allows us to test these issues quantitatively, based on actual behavior in the presence of real financial stakes.

To the best of our knowledge, this is the first study to directly evaluate the behavior of users in response to varying types of textual global and local explanations of AI over time and in different situations. Our experimental paradigm allows testing initial adoption in several experimental conditions (defined by the different types of explanations), as well as the evolution of the relations over time when the AI advice proves to be useful, or after it fails. Post-game questionnaires further allow the integration of measures common in the behavioral sciences for additional validation of our findings.

We find that attaching different explanations created a significant difference. We observed no algorithm aversion; in our experiment, participants were more inclined to adopt so-called AI-given advice compared to so-called human advice. We find that an accuracy-based explanation of the model in the initial phases led to higher adoption rates. When performance is immaculate, there is less importance associated with the kind of explanation for adoption. In addition, in cases of failure, using more elaborate feature-based or accuracy-based explanations helps substantially in reducing the adoption drop and the negative influence on trust towards the algorithm.

Furthermore, using an autopilot increased adoption significantly, a finding which we relate to the unwillingness to make decisions when this can be avoided. Presenting a feature-based explanation partially mitigated the adoption drop caused by the question about willingness to pay. Participants who were assigned to "AI advice" with explanations were willing to pay more for the advice compared to the No Explanation alternative. We find a correlation between trust antecedents and our measurement of RTA: when levels of trust were high, we observed mitigation of the adoption drop when failure occurred, and a correlation with the recovery after this failure. We also observe a positive correlation between explanation-satisfaction perceptions and initial adoption levels. Lastly, we find that age is negatively correlated with taking the automatic advice, and a (non-significant) tendency of older age to be associated with lower adoption rates.

We contribute to the literature in three key ways. (1) First, we show that there is no single best explanation that fits all, and we show the importance of time and situation dependence in presenting the type of explanation that is best suited to promote trust; namely, "what" we should explain depends on "when". Interestingly, participants were more inclined overall to adopt AI-given advice than to follow a human expert. The effect of the model (and human advice) failure on subsequent advice taking is remarkable, with a single failure causing adoption rates to plunge. (2) Second, our results show that often the end-user does not need to know more than very general facts to accept the system. Our study shows that such general explanations yield good results in terms of RTA and WTP compared to the alternative of no explanation, especially after AI failure. Moreover, stating accuracy statistics about the algorithm can further strengthen the aforementioned effect. Furthermore, explanations increased the participants' willingness to pay for the advice. The results of the current study highlight the potential utility of simple explainability solutions in these respects. (3) Third, our experimental paradigm of the lemonade stand game brings the RTA measurement, in different and evolving situations, into the explainable AI literature. This framework may be utilized to answer many other questions in the field of explainable AI and trust in AI. Future work can investigate the effect of different explanation types in broader conditions such as, but not limited to, varying reported accuracy values of the algorithm and more extended periods. Does willingness to pay change over time and across situations? Will additional experimental methods, such as a more controlled lab experiment or a field experiment on a live system, show the same results? Moreover, some of the key components can be implemented in research studies on different explainability techniques, such as visual explanations. Answering these questions in addition to our study will help bridge the gap between the immense technical success we see in machine learning and the difficulty often faced when trying to understand and amplify adoption and trust among users.
References

[1] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.
[2] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.
[3] Daniel Ben-David and Orly Sade. Robo-advisor adoption, willingness to pay, and trust—an experimental investigation. (December 2018), 2018.
[4] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pages 592–603, 2018.
[5] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114, 2015.
[6] Derek Doran, Sarah Schulz, and Tarek R Besold. What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv:1710.00794, 2017.
[7] David Gefen, Elena Karahanna, and Detmar W Straub. Trust and TAM in online shopping: An integrated model. MIS Quarterly, 27(1):51–90, 2003.
[8] Nicola Gennaioli, Andrei Shleifer, and Robert Vishny. Money doctors. The Journal of Finance, 70(1):91–114, 2015.
[9] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57, 2017.
[10] Pamela W Henderson and Robert A Peterson. Mental accounting and categorization. Organizational Behavior and Human Decision Processes, 51(1):92–117, 1992.
[11] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pages 241–250. ACM, 2000.
[12] Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. Metrics for explainable AI: Challenges and prospects, 2018.
[13] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
[14] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.
[15] Sherrie YX Komiak and Izak Benbasat. The effects of personalization and familiarity on trust and adoption of recommendation agents. MIS Quarterly, pages 941–960, 2006.
[16] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 126–137. ACM, 2015.
[17] Tania Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470, 2006.
[18] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–631. ACM, 2013.
[19] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[20] D Harrison McKnight, Vivek Choudhury, and Charles Kacmar. Developing and validating trust measures for e-commerce: An integrative typology. Information Systems Research, 13(3):334–359, 2002.
[21] Christoph Molnar. Interpretable Machine Learning. 2019.
[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[23] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[24] Nava Tintarev and Judith Masthoff. A survey of explanations in recommender systems. In 2007 IEEE 23rd International Conference on Data Engineering Workshop, pages 801–810. IEEE, 2007.
[25] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[26] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, page 279. ACM, 2019.
Table 1 - Algorithmic advice with No Explanation treatment compared to the Human Advice with No Explanation treatment. The analysis is for the Readiness to Adopt (RTA) on the various days.

Probit                         Day 3       Day 7        Day 8        Day 10        Day 13      Day 14
Human_advisor                  -0.0109     -0.514**     0.115        -0.0574       -0.525*     -0.428**
                               (0.190)     (0.227)      (0.192)      (0.197)       (0.274)     (0.202)
Gender(male)                   -0.0410     -0.0665      -0.0647      -0.313        -0.119      -0.0921
                               (0.189)     (0.245)      (0.187)      (0.193)       (0.289)     (0.202)
Age                            -0.00811    -0.0107      -0.00589     -0.00583      -0.0399**   -0.00373
                               (0.00817)   (0.00935)    (0.00783)    (0.00816)     (0.0188)    (0.00824)
Age_exp                        -3.62e-31   3.12e-32**   4.23e-32***  -3.86e-32***  2.38e-25**  4.17e-32***
                               (4.07e-31)  (1.54e-32)   (1.34e-32)   (1.40e-32)    (1.21e-25)  (1.45e-32)
Trust index                    0.113       0.155        0.262        0.337*        0.387       0.352*
                               (0.196)     (0.248)      (0.197)      (0.205)       (0.327)     (0.211)
Explanation_goodness index     0.246       -0.641**     -0.245       -0.426**      -0.355      0.123
                               (0.204)     (0.256)      (0.201)      (0.208)       (0.320)     (0.217)
Constant                       0.380       1.992***     0.291        0.753**       3.258***    0.774**
                               (0.372)     (0.496)      (0.360)      (0.375)       (0.700)     (0.386)
N                              188         188          188          188           188         188
Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA on the different experiment days. Human_advisor is a dummy variable where the underlying condition is the algorithmic advice with No Explanation. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on the seven trust questions (see Appendix A) is above the median; Explanation_goodness index is a dummy variable that receives 1 for individuals whose index score on the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Table 2 - Algorithmic advice with No Explanation treatment compared to the other "AI" treatments. The analysis is for the Readiness to Adopt (RTA) on the various days.

Probit                         Day 3       Day 7       Day 8       Day 10      Day 13      Day 14
Global with Accuracy           0.343*      -0.452*     0.373*      0.108       0.190       -0.0317
                               (0.196)     (0.242)     (0.193)     (0.195)     (0.381)     (0.215)
Global with Features           -0.106      -0.286      0.354*      0.367*      0           0.229
                               (0.191)     (0.254)     (0.191)     (0.202)     (.)         (0.230)
Global Explanation             -0.0324     -0.147      0.222       -0.0109     -0.280      -0.168
                               (0.187)     (0.255)     (0.190)     (0.191)     (0.294)     (0.209)
Gender(male)                   -0.0291     0.163       0.0837      -0.0189     -0.0987     0.308*
                               (0.142)     (0.178)     (0.141)     (0.143)     (0.287)     (0.159)
Age                            0.00514     -0.00707    -0.00824    -0.00776    -0.0138     -0.00519
                               (0.00679)   (0.00905)   (0.00677)   (0.00711)   (0.0119)    (0.00732)
Age_exp                        2.36e-31    1.35e-30*   1.20e-31    7.53e-30    7.79e-31*   -3.56e-31
                               (4.39e-31)  (6.92e-31)  (4.57e-31)  (1.57e-29)  (4.29e-31)  (3.56e-31)
Trust index                    0.0976      0.103       0.376**     0.331**     0.150       0.153
                               (0.150)     (0.186)     (0.150)     (0.152)     (0.284)     (0.166)
Explanation_goodness index     0.374**     -0.159      -0.0393     -0.304*     0.288       0.274
                               (0.154)     (0.195)     (0.152)     (0.157)     (0.312)     (0.170)
Constant                       -0.154      1.543***    0.173       0.612*      2.215***    0.680**
                               (0.316)     (0.438)     (0.314)     (0.324)     (0.574)     (0.335)
N                              365         365         365         365         278         365
Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA on the different days. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Table 3 - Algorithmic advice with No Explanation treatment compared to the other "AI" treatments. The analysis is for the Readiness to Adopt (RTA) differences over the experiment phases.

Probit                         Gained_trust  Lost_trust  Recovery    Auto_advice_available  No_longer_free
Global with Accuracy           -1.240***     -0.501**    0.0986      0.492                  0.149
                               (0.435)       (0.211)     (0.323)     (0.543)                (0.219)
Global with Features           -0.519        -0.419**    0.426       0                      -0.115
                               (0.464)       (0.209)     (0.312)     (.)                    (0.235)
Global Explanation             -0.606        -0.262      0.0886      -0.0565                0.166
                               (0.442)       (0.201)     (0.288)     (0.392)                (0.217)
Gender(male)                   0.0646        -0.0139     -0.203      -0.153                 -0.301*
                               (0.321)       (0.155)     (0.232)     (0.444)                (0.161)
Age                            -0.0208       0.00727     -0.00680    -0.0249                0.00460
                               (0.0140)      (0.00748)   (0.0104)    (0.0204)               (0.00745)
Age_exp                        2.33e-30**    -5.97e-32   2.83e-29*   1.59e-25               4.36e-31
                               (1.05e-30)    (4.48e-31)  (1.45e-29)  (1.58e-25)             (3.49e-31)
Trust index                    -0.0615       -0.323**    0.628**     -0.0154                -0.101
                               (0.303)       (0.162)     (0.259)     (0.353)                (0.170)
Explanation_goodness index     -0.613**      0.150       -0.417      0.570                  -0.337*
                               (0.310)       (0.166)     (0.263)     (0.378)                (0.175)
Constant                       2.695***      -0.291      0.101       2.020**                -0.776**
                               (0.787)       (0.346)     (0.479)     (0.903)                (0.346)
N                              135           318         137         97                     365

Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA difference over the experiment phases: Gained_trust - RTA day 7 minus RTA day 3; Lost_trust - RTA day 7 minus RTA day 8; Recovery - RTA day 10 minus RTA day 8; Auto_advice_available - RTA day 13 minus RTA day 11; No_longer_free - RTA day 14 minus RTA day 13. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation_goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.

Table 4 - Algorithmic advice with No Explanation treatment compared to the other treatments ("AI" treatments and Human Advisor). The analysis is for the Willingness to Pay (WTP) for the advice in the last round of the game.

Tobit                          WTP
Global with Accuracy           0.840**
                               (0.344)
Global with Features           0.537*
                               (0.318)
Global Explanation             0.788**
                               (0.380)
Human_advisor                  0.454
                               (0.286)
Gender(male)                   -0.118
                               (0.248)
Age                            -0.156**
                               (0.0702)
Age_exp                        0.00169**
                               (0.000806)
Trust index                    -0.0460
                               (0.241)
Explanation goodness index     0.213
                               (0.258)
Constant                       4.279***
                               (1.528)
N                              449
Notes: Tobit regression; robust standard errors are in parentheses. The dependent variable is the WTP on the last day of the game. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Appendix A

The questionnaires on trust and perceived explanation goodness are described below. Participants were asked to state their agreement on a Likert scale from 1 (strongly agree) to 5 (strongly disagree).

The trust-antecedent questionnaire is based on McKnight et al. 2002, Gefen et al. 2003, and Komiak and Benbasat 2006:
1. The advisor explanations made me feel more secure.
2. I feel comfortable relying on the advice offered to me.
3. I feel the advisor has good knowledge about the advice.
4. The advice offered to me has nothing to gain by being dishonest with me.
5. The advice offered to me has nothing to gain by not caring about me.
6. I feel safer that there is an advisor that is making this financial decision.

The perceived explanation goodness questionnaire is based on Hoffman et al. 2018:
1. I understood what the advisor's recommendations were based on.
2. From the explanation, I understand how the advisor works.
3. This explanation of how the advisor works is satisfying.
4. This explanation of how the advisor works has sufficient detail.
5. This explanation of how the advisor works seems complete.
6. This explanation of how the advisor works tells me how to use it.
7. This explanation of how the advisor works is useful to my goals.
8. This explanation of the advisor shows me how accurate the advisor is.
9. This explanation lets me judge when I should trust and not trust the advisor.