Explainable AI and Adoption of Financial Algorithmic Advisors: an Experimental Study
Daniel Ben David (a,b), Yehezkel S. Resheff (b), Talia Tron (b)
(a) Hebrew University of Jerusalem; (b) Intuit Inc.
Abstract
Machine learning is becoming a commonplace part of our technological experience. The notion of explainable AI (XAI) is attractive when regulatory or usability considerations necessitate the ability to back decisions with a coherent explanation. A large body of research has addressed algorithmic methods of XAI, but it is still unclear how to determine what is best suited to create human cooperation and adoption of automatic systems. Here we develop an experimental methodology where participants play a web-based game, during which they receive advice from either a human or an algorithmic advisor, accompanied by explanations that vary in nature between experimental conditions. We use a reference-dependent decision-making framework and evaluate the game results over time and in various key situations to determine whether the different types of explanations affect the readiness to adopt, willingness to pay for, and trust in a financial AI consultant. We find that the types of explanations that promote adoption during the first encounter differ from those that are most successful following failure or when cost is involved. Furthermore, participants are willing to pay more for AI advice that includes explanations. These results add to the literature on the importance of XAI for algorithmic adoption and trust.
Keywords:
Explainable AI, Financial Advice, Trust, Algorithm Adoption, Experiment

1. Introduction

The use of machine learning and other automation methods is becoming overwhelmingly popular and increasingly available in different facets of everyday life. From basic research these methods have made their way into medicine, transportation, business processes, retail, customer service, and diverse financial services. At times these methods are utilized to do work for individuals, driving down the costs of previously labour-intensive products and services. Other times the same methods are utilized to automatically make decisions about individuals.

When algorithms take part in decision making with significant impact on individuals, there is a burden of explainability that naturally falls on the providers of the system. In some cases there are regulations that impose an obligatory explanation of decisions as an integral part of their output (for example, the recent GDPR legislation by the European Union, which requires the makers of AI algorithms that "significantly affect" decisions to explain how any output was obtained [9]). In other cases, a deep understanding of how an output was generated is crucial for human-based decision making in interaction with the automatic process (e.g. in some medical and security applications). However, in many other cases, and in various fields, the importance of explanations is first and foremost in the effect on the perceived trustworthiness of the system [26, 12, 20], and hence on the readiness of consumers to adopt (RTA) the AI service, and pay for it.

In recent years, the field of explainable AI (XAI) has seen a boost in interest from the community, with many new approaches and ideas. Most of the effort is on the technological and algorithmic side – borrowing ideas from game theory, statistics, and machine learning to develop fast and accurate techniques which explain black-box or opaque models [21, 19, 14], or to train AI which is more interpretable by nature [1, 18]. Another part of the literature deals with the independently important question of evaluating explanations based on their attributes, and with user evaluation of explanations and the consequent effect on behavior.
The first important attribute when considering what explanation to generate is the type of information the explanation should convey. We find in the literature three main approaches to explanations: (1) Global Explanations, (2) Local Explanations, and (3) what might be called Social Influence Explanations.
Global explanation techniques provide an overview of what an algorithm is doing as a whole. The aim of this type of explanation is to convey to a human what the algorithm is doing rather than explain the process that led to a specific prediction or decision. These methods often include summarized information about how a model uses features to produce predictions (some popular approaches include various notions of feature importance, dependence plots, and global Shapley values), prototype example predictions, or a simplified, interpretable approximation of a black-box model (a.k.a. surrogate models) [21, 14].

Another type of global explanation that is completely independent of the algorithm used is the presentation of any type of meta-information that sheds light on the process. This includes transparency around how the model was trained, the type of data that was used, or even simply reporting model performance statistics. Global explanations are for the most part less costly to produce in real-world systems, compared to the alternatives, and are readily available in most cases, making them appealing in practice.
Local explanation techniques, on the other hand, provide a more detailed description of how the model came up with a specific prediction. These may include information about how a model uses features to generate a specific output [19, 22, 23, 2], how a perturbation of the input will influence the output [25, 4], or a comparison of the specific input-output pair at hand to the model's output on similar input data [13]. In general, local explanation techniques are more costly in time and resources since they must be computed on a case-by-case basis rather than globally for the entire system. Furthermore, local explanations are inherently only available once the system is being used, and are not applicable when the aim is to convince a user beforehand and build a-priori trust and consent that is independent of actual experience with the system.
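To make the distinction concrete, the sketch below (an illustration only, not the advisor used in this study) computes a global explanation as the mean absolute SHAP attribution over a dataset, and a local explanation as the per-feature attributions for a single prediction. It assumes the shap and scikit-learn packages and a generic tree-based model.

```python
# Illustrative sketch: global vs. local explanations via SHAP values.
# Not the procedure used in the study; any model and data would do.
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # one attribution per feature per sample

# Global explanation: average magnitude of each feature's contribution over the dataset.
global_importance = np.abs(shap_values).mean(axis=0)
print("Global feature importance:", global_importance)

# Local explanation: contribution of each feature to one specific prediction.
local_explanation = shap_values[0]
print("Local attribution for sample 0:", local_explanation)
```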
Social Influence Explanation techniques contain a type of information which relates to the way socially relevant others behave. These methods are typically discussed in the context of recommendation systems; a system using this sort of explanation may show a report on model adoption statistics, or the ranking of a specific item or of the entire system by users with a similar profile or shared characteristics [24, 11]. We do not use this technique in our study, and it is only briefly discussed here for the completeness of the attributes review.

An additional factor which should be taken into account is the way explanations are presented. The amount of information provided, the phrasing, and the choice of specific words may all affect what individuals perceive [6, 11]. In parallel with the textual message, the literature on explanations explores the visual interface and which visual method is the more effective way to present a recommendation (e.g. star rankings, histograms, neighbor ranking tables, pies, word clouds) [24].

With the multitude of methods that currently exist, it is not clear how to choose for a specific case. What constitutes a good explanation, and is it constant or individual and context dependent?
Unlike the well-defined metrics used to evaluate machine learning, such as accuracy and precision for model performance, the definition of a good explanation of the output or nature of a model is somewhat vague. The definition is presumably not universal, but rather highly dependent on what we are trying to achieve by augmenting an algorithm with an explanation. Objectives may vary dramatically for different use-cases, ranging for instance from helping the user understand how a decision was made, to attempting to convince a user to adopt the system or take a recommendation. As a result, evaluation approaches and methodologies vary between different domains.

Explanation evaluation can be conceptually divided into qualitative and quantitative evaluation techniques. Within the quantitative measures, one can find lines of work that focus on evaluating the mathematical (or statistical) attributes of the explanation that is generated. For example, when the explanation is presented as a report of feature importance (i.e. a list or ranking of the features that were instrumental in making the decision), statistical properties such as local accuracy [19] help determine the quality of the list. This type of evaluation completely avoids the question of what is useful or leads to adoption and trust. In some cases the truthfulness of explanations takes precedence over all else, justifying this view. Other times it makes sense to sacrifice on this front in order to achieve the actual goal of the system. Much of the relevant literature from the field of recommendation systems focuses on measuring usage indicators to evaluate effectiveness – the extent to which the system assists the user in making better decisions compared to previous behaviour – and efficiency – the extent to which it helps users make faster decisions [24, 11].

The qualitative measurement literature is focused mostly on user understanding of, and reaction to, the explanations. This line of research uses various questionnaires. Many measures have been suggested to reflect the different aspects of what is important when designing explanations. These include transparency (the level of detail provided), scrutability (the extent to which users can give feedback to alter the AI system when it is wrong), trust (the level of confidence in the system), persuasiveness (the degree to which the system itself is convincing in making users buy or try recommendations given by it), satisfaction (the level to which the system is enjoyable to use), and user understanding (the extent to which a user understands the nature of the AI service offered, or alternatively the level of similarity of an explanation generated by the automatic method to explanations produced by a human being) [22, 17, 16, 24, 11].

In our study, we attempt to form a synergy between explainable AI and the multiple fields that have previously studied algorithmic adoption and machine-human relations. We use the quantitative measure of Readiness to Adopt (RTA), simply defined as the fraction of users that use the AI system when presented with the choice, and later the Willingness to Pay (WTP), to explore acceptance and its relation to trust and user satisfaction. We study the impact of the explanation type (both global and local) on the adoption of, and payment willingness towards, an AI financial decision-making advisor. We do so in a unique experimental framework, in a controlled environment with real money consequences, and with repeating interactions which evolve over time.
In addition, we relate the Readiness To Adopt (RTA) and the Willingness to Pay (WTP) in the different treatments to known constructs from the literature on trust [7, 20, 15] and perceived quality of explanations [15, 12]. This innovative framework allows us to examine whether there are differences in adoption and payment when the advice is labeled as coming from a human advisor compared to an AI-based algorithm, and what the effects of the types of explanations are on initial adoption, adoption over time, algorithm aversion when the model fails, fully autonomous decision-making services, and willingness to pay. The research questions we aim to explore include: are different types of explanations important for the adoption of a financial AI algorithm when all else is equal (the actual advice is the same)? Will providing more detailed information (via local explanations) increase trust and RTA? How will this effect change after multiple interactions with the model? How will a failure in model performance affect the way people perceive financial algorithms with different explanations? Do explanations influence the RTA of autonomous advice, or have an effect when the advice is costly? Do explanations influence the advice consumer's WTP?

To the best of our knowledge, this study is the first to compare the effect of different types of textual explanations of AI advice in terms of RTA over time and in different situations. Our approach stems from the conjecture that the model-consumer interaction has a dynamic, reference-dependent nature, and may be influenced by various factors such as familiarity, past performance, and potential cost. This implies that one explanation can be optimal for creating a good sense of trustworthiness and gaining initial trust and understanding, while a different one could be better fitting after a period of using the model, or in cases where it fails. In addition, and to our knowledge, this study is the first to explore the effect of explanations on customers' Willingness to Pay (WTP) for AI advice.

The rest of the paper is organized as follows: in the next section we provide a detailed description of the experiment methodology and design, discuss the choice of experimental treatments and game flow, and relate them to the existing literature from the fields of explainable AI and algorithmic trust, respectively. Next, we present the results from an online experiment showing how explanations affect adoption and trust in the different phases of human-AI interactions, as well as the consequences for participants' willingness to pay for the service. Finally, we discuss the findings from this study and suggest future directions with the potential to broaden the scope of the current work and generalize to other domains.
2. Methods
The study consisted of three parts, as follows:

1. Pre-game quantitative questionnaire. Participants had a time limit of 3 minutes to answer 3 simple mathematical questions (addition and subtraction). Upon completing this part, participants earned 20 initial game coins that were used in the main part of the study. Each game coin during the game was worth 2 U.S. cents. The purpose of this stage was twofold: the first was to ensure participants' attention during the experiment; the second relates to the mental accounting literature [10] – we wanted participants to treat the game-coins account as money that they earned while investing effort, rather than money that they obtained as a "reward".

2. The main part of the study: a fun, interactive, decision-making game – the Lemonade Stand (see the full description below). In this part participants could gain more coins depending on their decisions, and could potentially use an advisor, labeled differently and accompanied by several types of explanations, in doing so.

3. A post-game questionnaire about trust, engagement, explanation satisfaction, and personal demographic details. The questionnaires about trust and explanation satisfaction are based on [7, 15, 20, 12]; the metrics and questionnaires can be found in Appendix A.

Figure 1 illustrates the Lemonade Stand game. At the beginning of the game participants were instructed as follows: "You own a lemonade stand, your goal is to make as much money as you can in 2 weeks by selling lemonade. Decide how many cups you want to make, per day, based on the price of lemons and the weather forecast." At the beginning of each day, a weather forecast (sunny, cloudy, or rainy) was displayed together with the varying price of lemons (0.45-0.55 coins per cup). The lemonade selling price was fixed (1 coin). Participants had to decide how many cups to produce; the daily demand depended on the actual weather (Table 1).
Table 1: The ranges of demand level conditioned on weather. The demand for lemonade cups is sampled uniformly from the range associated with the actual weather each day.
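As an illustration of the game's economics, the sketch below simulates a single game day: demand is drawn uniformly from a weather-dependent range, and the day's profit follows from the quantity produced, the lemon cost, and the fixed selling price. The demand ranges in this sketch are placeholders, not the actual values of Table 1.

```python
# Illustrative simulation of one lemonade-stand game day.
# The demand ranges below are hypothetical placeholders, not the study's Table 1 values.
import random

DEMAND_RANGE = {          # cups demanded, conditioned on the actual weather (placeholders)
    "sunny": (6, 10),
    "cloudy": (3, 7),
    "rainy": (0, 3),
}
SELL_PRICE = 1.0          # fixed selling price per cup, in game coins

def play_day(cups_made: int, actual_weather: str, lemon_cost: float) -> float:
    """Return the day's profit in game coins."""
    low, high = DEMAND_RANGE[actual_weather]
    demand = random.randint(low, high)     # demand sampled uniformly from the range
    sold = min(cups_made, demand)          # unsold cups are wasted
    return sold * SELL_PRICE - cups_made * lemon_cost

# Example: produce 6 cups on a cloudy day with lemons at 0.5 coins per cup.
print(play_day(6, "cloudy", lemon_cost=0.5))
```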
The game starts with a short learning session (3 days) designed to verify participants' understanding. These days are not counted in the day numbering used throughout the analysis. After training, participants play the game for 14 game-days for real game coins. The complete game flow is described in the bottom graphic of Figure 1.
Initial adoption – On day 3, after completing 2 days of playing for real game money, participants were given the option to take recommendations from a so-called "advisor". Each recommendation by the advisor was accompanied by a short explanation which varied between experimental conditions (details below). To avoid an anchoring effect, recommendations were always displayed after participants had already decided how many cups to make. They could then change their decision based on the recommendation (a "Take Advice" button allowed them to switch to the number prescribed by the advisor; no other change to the amount of lemonade production was allowed at this stage); alternatively, they could stay with their original choice. Initial adoption following the first impression was measured by the Readiness To Adopt (RTA) in this phase – the percent of users that took the advice each day – and was compared between the different explanation treatments (see 2.3).
Adoption gain – During the first days of the experiment (between days 3 and 7), the advice in all explanation treatments was set to perform perfectly (i.e. the advice given was precisely the actual demand that would occur). During this period we test how confidence in the algorithmic advice is gained upon exposure to the accurate advisor, and specifically we compare the different conditions on day 7, after 4 rounds of exposure.
Advice failure and algorithm aversion – On day 7, the weather forecast was sunny for all participants and the recommendation drastically fails, with a recommendation to produce the maximal amount of lemonade (10 cups) and an actual demand for the minimal amount (0 cups). We explore the effect of the experimental treatment (different explanations) on RTA the next day (i.e. day 8). From day 9 until day 10 we look at the accumulation of renewed trust, as the algorithm returns to its accurate predictions.
Automatic adoption – Starting from day 11, an "Auto-pilot" mode is made available. Namely, participants are able to request that the model decide how many cups to make, without ever seeing its recommendations or entering their decision manually ("Decide for me"). By using this, users get the option not to think about production amounts at all, which makes sense for them if they know they are going to take the advice of the model anyway. The RTA here counts participants who adopt the algorithm either automatically or manually as in the previous days.
Willingness to pay – On the last day of the game (day 14), participants are told that advice will no longer be free, and are asked how much they are willing to pay to keep receiving recommendations in the following days. The purpose of this part is to check whether the adoption rate is affected by the algorithm no longer being free, and to explore whether there are differences in willingness to pay between the different mechanisms [3].
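Throughout the analysis, the RTA on a given day is simply the fraction of participants in a condition who took (or auto-took) the advice. A minimal sketch of how it might be computed from per-participant game logs is shown below; the long-format table and its column names are hypothetical.

```python
# Minimal sketch of computing daily RTA per treatment from game logs.
# Column names (participant_id, condition, day, took_advice) are hypothetical.
import pandas as pd

logs = pd.DataFrame({
    "participant_id": [1, 1, 2, 2, 3, 3],
    "condition": ["No Explanation", "No Explanation",
                  "Feature-Based", "Feature-Based",
                  "Performance-Based", "Performance-Based"],
    "day": [3, 7, 3, 7, 3, 7],
    "took_advice": [1, 1, 0, 1, 1, 0],   # 1 if the advice was adopted that day
})

# RTA = mean of the adoption indicator, per condition and game day.
rta = (logs.groupby(["condition", "day"])["took_advice"]
           .mean()
           .rename("RTA")
           .reset_index())
print(rta)
```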
Participants were randomly assigned to 5 experimental conditions, which differ only in the explanations for the advice that is given (see Table 2 for the full text of the explanations provided to each group). Each participant was consistently shown only one type of advice throughout the experiment. Participants in the Human Expert group were informed that the advice was being generated by a so-called human expert in the field of lemonade stand operations. This condition was designed to assess whether the mere usage of a computer algorithm has an effect on trust and adoption, and whether there is an algorithmic aversion as previously found by [5]. In all other groups, participants were told that the suggestions are based on a so-called computer algorithm. In the No Explanation condition, participants were given no information about how the algorithm generates the predictions. This condition was used as the baseline against which all other algorithm explanation treatments are compared.
[Figure 1 graphic: a sample game-day screen (weather forecast, cash, lemon cost, the advisor's recommendation, and the end-of-day summary), and the game flow: learning session → start playing for real coins → Day 3: advice becomes available → Day 7: the model fails → Day 11: auto-pilot becomes available → Day 14: advice is no longer free.]
Figure 1: Illustration of one day at the lemonade stand (top), and the general flow of the lemonade stand game (bottom). Each day the player is given the weather forecast and asked how many cups of lemonade they would like to produce. After selecting the desired amount, advice is presented and the user can either accept the advice or produce the amount previously stated. Finally, the day is concluded with a screen that summarizes the events of the day, including the amount produced, demand, and earnings.
In the Global Explanation condition, participants were given very general information about how the model operates, namely the type and extent of data it uses (sales data from many lemonade stands over several years). This condition simulates a very prevalent real-world scenario where we aim to increase model transparency with minimal cost in computation time and resources. In the Feature-Based condition, on the other hand, participants were given a fully detailed account of the features that were used to generate the specific prediction, similar to using local explanations in a real-world scenario. To test the additive value of this information, it was attached to the baseline global explanation. Finally, to test the potential effect of reporting model performance, in the Performance-Based condition, information regarding performance ("with 90 percent certainty") was also attached to the baseline global explanation (we note that the performance observed by participants during the experiment is indeed statistically consistent with the 90 percent accuracy value mentioned to them).
Condition           | Explanation text
Human Expert        | The human advisor recommendation is to make 6 cups of lemonade.
No Explanation      | The algorithm recommendation is to make 6 cups of lemonade.
Global Explanation  | Based on data from lemonade stands over several years, the algorithm recommendation is to make 6 cups of lemonade.
Feature-Based       | Based on data from lemonade stands over several years, your previous sales, and market demand, the algorithm recommendation is to make 6 cups of lemonade.
Performance-Based   | Based on data from lemonade stands over several years, the algorithm recommendation, with 90 percent certainty, is to make 6 cups of lemonade.
Table 2: The text presented to users in each experimental treatment. The recommended number presented changed according to the specific day's advice and varied between 1 and 10.
To clarify, all participants faced the same game phases and process, with only one difference: the label on the advice offered. Given that the only difference was the "label" of the advice, and the fact that participants were randomly assigned to the conditions, no significant difference in RTA and WTP should be observed. Any observed differences serve as an indication of prior perceptions of, and preferences related to, the given label.
The study was conducted via the online Amazon Mechanical Turk (AMT) platform. The total time participants had to complete the experiment was 25 minutes; the average completion time was 12.66 minutes. There were initially 514 respondents to the experiment. We added a screening mechanism with a minimal completion time of 6 minutes in order to filter out individuals who did not play the game with sufficient attention and engagement; 65 participants who finished in an unrealistic time of less than 6 minutes were excluded. Our final sample consists of 449 subjects. These were randomly distributed between the five explanation conditions: 84 of the participants received the Human Expert treatment, 104 received the algorithm with No Explanation, 89 were allocated to the algorithmic advice with Global Explanation, 87 received the Feature-Based explanation, and the remaining 85 participants received the Performance-Based explanation. All participants were located in the U.S., with an average age of 36.6 (Std 11.2); 58.8% were male, 68% report working full time, 21.3% part time, and the rest are unemployed. 55% of our participants report having an academic degree, 12% an associate degree, and the rest a high school education or below.
3. Results

Figure 2: The average acceptance rate of model advice per game-day in each of the treatment groups.
See Figure 1 for the description of the flow of the game.
We compare the Readiness to Adopt (RTA) for the different experimental condition groups throughout the experiment's "two-week" game duration. We start by describing the results with descriptive statistics and t-tests, and then provide Probit multivariate regressions of the RTA while controlling for several relevant variables such as gender, age, and the trust and perceived-explanation-goodness indexes as measured by the questionnaires. The regression tables include the exact values for the statistics reported throughout this section.

Figure 2 summarizes the overall adoption rate (RTA) for the different experimental conditions. The overall model adoption changed throughout the game, supporting our reference-dependent approach. On the very first day on which the model was introduced (day 3) we see an average of 60% adoption when averaging over all participants and all experimental conditions. Over the next few days, during which the algorithmic performance was good, overall adoption increased, leading to an average adoption of 85.5% on day 7. Following the model failure on day 7, adoption plummeted, returning approximately to the initial level with 61% on day 8; this was followed by a slow recovery up until day 10 (66%). With the introduction of the auto-pilot option on day 11, adoption soared and reached a peak value of 94%. Finally, when participants were told advice would no longer be free, and were asked about willingness to pay, we see an overall decrease in adoption on the final day (77%).

Next, we break down the effect of the experimental explanation treatments and provide the RTA results for each group.
First, we examine the effect of human versus algorithmic origin of advice with no further explanation. Results show no evidence of algorithm aversion: there is no significant difference in the initial adoption of the Human Advisor (RTA of 56%) as opposed to the computer algorithm No Explanation condition (RTA of 57.7%). However, over the first few days, the algorithmic advice with No Explanation showed significant gains in adoption compared to the Human Advisor, leading to a 16.3 percentage-point gap on day 7 (P-value = 0.0129, t = 2.5114). Following model failure on day 7, adoption levels fell back to around the initial values in both cases (54% and 56% for No Explanation and Human Advisor, respectively). Recovery over the next few days is slightly, though not significantly, more favorable for the AI-based advisor (65% compared to 58% by day 10), and this trend remains throughout the experiment. The decline in adoption following the probe of willingness to pay is significantly larger in the Human Advisor condition, with 79% compared to 63% for the algorithmic No Explanation and the Human Advisor, respectively (P-value = 0.0169, t = 2.4104).

For the regression analysis we proceed by conducting Probit multivariate regressions. The base specification is:
RTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (1)

where RTA is the outcome variable associated with individual i at game "day" t of the experiment. T is a dummy variable for the type of advice: Human Advisor (where the underlying condition is the "AI – No Explanation" alternative). TE is a vector of trust and explanation-satisfaction dummies indicating whether an individual is above the median on the index based on the questionnaires developed by [7, 15, 20, 12] (the questionnaires can be found in Appendix A). X is the vector of age and gender variables. Table 1 shows the Probit analysis results. We observe similar results when including a range of control variables: significant differences in adoption in favor of the "AI – No Explanation" alternative remain.
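As a sketch of how a specification like model (1) might be estimated, the snippet below builds median-split trust and explanation-goodness dummies and fits a Probit with statsmodels. The data and all variable names are hypothetical placeholders; this is not the authors' actual analysis code.

```python
# Sketch of estimating a Probit specification like model (1) with statsmodels.
# All variable names and the synthetic data are hypothetical, for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "human_advisor": rng.integers(0, 2, n),       # 1 = Human Advisor, 0 = AI, No Explanation
    "trust_score": rng.normal(3.0, 1.0, n),       # questionnaire index (Likert average)
    "expl_goodness_score": rng.normal(3.0, 1.0, n),
    "age": rng.integers(20, 65, n),
    "male": rng.integers(0, 2, n),
})

# Median-split dummies for the trust and explanation-goodness indices.
df["trust_hi"] = (df["trust_score"] > df["trust_score"].median()).astype(int)
df["expl_hi"] = (df["expl_goodness_score"] > df["expl_goodness_score"].median()).astype(int)

# Synthetic adoption outcome, only so the example runs end to end.
df["rta_day7"] = (rng.random(n) < 0.7 - 0.15 * df["human_advisor"]).astype(int)

# Probit of day-7 adoption on the treatment dummy and controls, with robust standard errors.
probit = smf.probit("rta_day7 ~ human_advisor + trust_hi + expl_hi + age + male", data=df)
print(probit.fit(cov_type="HC1", disp=0).summary())
```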
Next, we consider the effect of the type of explanation provided by the algorithmic advisor on RTA during the different phases of the game. As before, we start with descriptive statistics and continue with a multivariate regression analysis. For the analysis, we conduct Probit cross-section multivariate regressions on the different treatments, while controlling for gender, age, the trust-antecedents index, and the explanation-satisfaction index, with two approaches: (A) a between-conditions analysis, as in model (1), on the different experiment days, and (B) an analysis of the change in adoption between the different experiment phases. The base specifications are:

RTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (2)

ΔRTA_{i,t} = α + β·T_{i,t} + δ·TE_{i,t} + γ·X_{i,t} + ε_i    (3)

For model (2), the outcome variable is the RTA for individual i on the different "days" t of the experiment, and the rest is identical to what is specified above in model (1). Table 2 shows the Probit analysis results. For model (3), the analysis is done on the change in adoption between the experiment phases. The outcome variables are as follows: RTA day 7 minus RTA day 3 (gaining trust), RTA day 7 minus RTA day 8 (loss of trust), RTA day 10 minus RTA day 8 (recovery), RTA day 13 minus RTA day 11 (auto advice available), and RTA day 14 minus RTA day 13 (willingness to pay). The explanatory variables are as specified above in model (1). Table 3 shows the Probit analysis results for these models. The results are not sensitive to the choice of regression model, such as OLS or Logit.

When comparing the Global Explanation with No Explanation, we find no significant difference in initial adoption. When we attach the more detailed feature-based information to the global explanation, we likewise find no significant difference. Only the Performance-Based condition had significantly more adoption compared to No Explanation. In addition, when explanation satisfaction was perceived as high (above the median), initial adoption was significantly higher as well.
During these first four days, the advice provided in all experimental conditions allowed participants to make the highest possible profit (i.e. perfect predictions and advice). In line with this, all advice treatment alternatives gained user confidence, and adoption rose significantly, by an average of 42.5%. When comparing the different experimental conditions, we find no significant differences except for the Performance-Based explanation, which gained less adoption compared to the No Explanation alternative (P-value = 0.0403, t = 2.0652). From the regressions with controls, we find, similarly to the above, that only the Performance-Based condition gained less adoption compared to No Explanation. This result was found in both regression approaches. These results suggest that when advice performance is immaculate, the type of explanation presented with it is less important to individuals.
Following the algorithm failure on day 7, we find an average adoption drop of 28% on day 8. When we explore the different explanations we find the following: the two more elaborate explanations, the Performance-Based (P-value = 0.0445, t = -2.0225) and the Feature-Based (P-value = 0.0498, t = -1.9741), prove to be more resistant to the effect of algorithm error. Their adoption drop was significantly lower compared to the No Explanation alternative, with 16% and 21% reductions for the Feature-Based and the Performance-Based, respectively, versus a 41% reduction for the advice with No Explanation. We find no significant differences between the Global Explanation and the other alternative explanations. When we explore the adoption recovery during the next few days and until day 10, we find that the Performance-Based explanation had a marginally significant (P-value = 0.0633, t = 1.8691) tendency to recover better than the Global Explanation (12% compared to only 2% for the Feature-Based and the Global Explanation, respectively).

In the regressions with controls, we find in both model (2) and model (3) that the Feature-Based and the Performance-Based conditions have a significantly lower adoption drop compared to the No Explanation alternative. In addition, when trust was perceived as higher, it lowered the drop rates and increased the recovery from the failure. Lastly, on day 10, we observe a marginally significantly higher adoption rate for the Feature-Based condition compared to the No Explanation alternative.
On day 11, subjects have the option to select auto advice before starting the day. This new instrument, which enables participants to avoid deciding by themselves, has a major impact. We observe an adoption jump of 40% on average, to a mean RTA of 95% on day 11 and up to 97% on day 13 for the algorithmic treatment alternatives that included explanations (Global, Feature-Based, Performance-Based). Among the alternative explanations, we find that the Feature-Based explanation achieved the highest adoption rates, which are marginally significant compared to the No Explanation alternative (P-value = 0.0651, t = 1.8557) and significant compared to the Global Explanation (P-value = 0.0249, t = 2.2627). In the regressions, similarly to the above, we find no significant differences between the experimental conditions and the No Explanation alternative.

Concerning the adoption jump, we relate our results to the fact that making financial decisions is a challenging task that people reluctantly enter [8]. The way the advice is now offered enables the participants to transfer the responsibility for the decision making to the advisor and avoid making it by themselves. Also, the relatively low stakes of the game (winning several U.S. dollars at the end of the experiment) can serve as an explanation of why participants transfer responsibility and avoid effort.
On the last day we introduce a new situation where the advisor is no longer free, and ask participants whether they want the advice and, if so, how much they are willing to pay for it. During this round we observe an adoption drop of 21% for the algorithmic alternatives, despite the fact that we allowed the subjects to self-determine the value (the price) of the advice. The Feature-Based condition showed a tendency toward a smaller adoption reduction (only a 14.5% RTA drop). This result was significant compared to the Global Explanation (P-value = 0.0398, t = -2.0712). No other significant differences were observed between the other algorithmic alternatives. From the regression analyses, we find no significant differences between the experimental conditions and the No Explanation alternative.

3.4. Willingness to pay
The Willingness to Pay measurement, as specified above, can further attest to the way participants perceive the advice given by the different advice mechanisms. We find that participants assigned to the No Explanation alternative were willing to pay the least for the use of the advice, with a mean payment of 1.005 game coins, compared to a mean payment of 1.774 game coins for all other "AI advice" alternatives – a 76.5% gap. The difference in willingness to pay was significant for the Global Explanation and Performance-Based conditions (P-value = 0.0150, t = 2.4535 and P-value = 0.0063, t = 2.7621, respectively) and marginally significant for the Feature-Based condition (P-value = 0.0566, t = 1.9181). No significant differences were observed between the other alternatives. In addition, we observe a marginally significant positive difference in the willingness to pay for the Human Expert compared to the No Explanation condition (P-value = 0.0611, t = 1.8836). This result is in contrast to what we observe with respect to the RTA on the same day.

For the regression analysis, we conduct a Tobit cross-section multivariate regression on the different treatments, while controlling for gender, age, the trust-antecedents index, and the explanation-satisfaction index. The base specification is:

WTP_i = α + β·T_i + δ·TE_i + γ·X_i + ε_i    (4)

The outcome variable is the Willingness to Pay (WTP) for individual i. T is a dummy variable for the type of advice: Global Explanation, Performance-Based, Feature-Based, or Human Expert (where the underlying condition is the "AI – No Explanation" alternative). The rest is identical to what is specified above in model (1). Regression analysis results are as described above and specified as part of model (4) in Table 4.
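Since WTP is bounded below at zero, a Tobit (censored) regression is a natural choice. statsmodels has no built-in Tobit, so the sketch below writes the left-censored log-likelihood directly and maximizes it with SciPy; it is an illustration under hypothetical variable names, not the authors' estimation code.

```python
# Sketch of a Tobit (left-censored at 0) regression for a WTP outcome.
# Illustrative only; the design matrix and data are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, y, X, lower=0.0):
    """Negative log-likelihood of a Tobit model censored from below at `lower`."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                           # keep sigma positive
    xb = X @ beta
    censored = y <= lower
    ll = np.where(
        censored,
        norm.logcdf((lower - xb) / sigma),              # probability mass at the censoring point
        norm.logpdf((y - xb) / sigma) - np.log(sigma),  # density for uncensored observations
    )
    return -ll.sum()

# Tiny synthetic example so the sketch runs end to end.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.normal(36, 11, n)])  # const, treatment, age
latent = X @ np.array([0.5, 0.8, 0.0]) + rng.normal(0, 1.0, n)
y = np.clip(latent, 0.0, None)                          # observed WTP, censored at zero

start = np.zeros(X.shape[1] + 1)
fit = minimize(tobit_negloglik, start, args=(y, X), method="BFGS")
print("Estimated coefficients:", fit.x[:-1], "sigma:", np.exp(fit.x[-1]))
```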
From the regression analyses, we find that the willingness to pay in the Global Explanation and Performance-Based conditions is significantly higher compared to the No Explanation alternative, as observed in the t-tests, with the same marginally significant tendency toward higher payment for the Feature-Based condition compared to No Explanation. In addition, we find a significant U shape in age for the willingness to pay, consistent with the results of [3]. For the comparison of the Human Expert vs. No Explanation, we find that with the controls the observed effect is not significant, meaning that we do not find significant differences in willingness to pay between those advice alternatives. The results are not sensitive to using an OLS regression.

Overall, the explanations conveying general information about the algorithm, Global or Performance-Based, had the highest impact on participants' willingness to pay. We relate our results to the theory of [8], in which people are willing to pay more for an advisor that they trust more. It seems that measuring WTP in addition to RTA captures an additional dimension of trust relating to algorithmic advice.
To reinforce our measure of adoption, and to validate its relation to trust, we examined the effect of trust-antecedent questions, which were found relevant in the literature, on the adoption of the advice alternatives, using within-treatment analysis over the various days. We used a validated index including 7 trust questions as the measure. We find that on day 8 and day 10, after the algorithm failed, individuals who indicated that they trust the advice more indeed adopted it more. In particular, individuals who, by our index, trusted the advice more in the Feature-Based and Performance-Based explanation conditions showed higher adoption rates. In addition to the trust index, when we explore the explanation goodness questionnaire index, we find that it significantly influenced day 3 adoption: individuals who valued the explanation more adopted the advice more. Regression analysis results are as described above and specified as part of models (2) and (3).
4. Conclusion
Machine learning has become a prevalent technology, especially over the last decade, with applications and implications in many aspects of everyday life. In order to promote a much-needed cooperation between human and machine, one of the research goals in the field of explainable AI is to find the most successful ways of explaining and presenting AI-based decisions to individuals. The majority of this line of work is focused on algorithmic methods to produce explanations for complex machine learning models. However, the complementary question of whether different types of explanations are important for creating trust and adoption is for the most part still open. To this end, we constructed a decision-making game framework to test adoption, trust, and willingness to pay for financial machine learning models. Unlike questionnaire-based paradigms, the lemonade stand game allows us to test these issues quantitatively, based on actual behavior in the presence of real financial stakes.

To the best of our knowledge, this is the first study to directly evaluate the behavior of users in response to varying types of textual global and local explanations of AI over time and in different situations. Our experimental paradigm allows testing initial adoption in several experimental conditions (defined by the different types of explanations), as well as the evolution of the relations over time when the AI advice proves to be useful, or after it fails. Post-game questionnaires further allow the integration of measures common in the behavioral sciences for additional validation of our findings.

We find that attaching different explanations created a significant difference. We observed no algorithm aversion; in our experiment, participants were more inclined to adopt so-called AI-given advice compared to so-called human advice. We find that an accuracy-based explanation of the model in the initial phases led to higher adoption rates. When performance is immaculate, there is less importance associated with the kind of explanation for adoption. In addition, in cases of failure, using more elaborate feature-based or accuracy-based explanations helps substantially in reducing the adoption drop and the negative influence on trust towards the algorithm.

Furthermore, using an autopilot increased adoption significantly, a finding which we relate to the unwillingness to make decisions when this can be avoided. Presenting a feature-based explanation partially mitigated the adoption drop caused by the question about willingness to pay. Participants who were assigned to "AI advice" with explanations were willing to pay more for the advice compared to the No Explanation alternative. We find a correlation between trust antecedents and our measurement of RTA: when levels of trust were high, we observed mitigation of the adoption drop when failure occurred, and a correlation with the recovery after this failure. We also observe a positive correlation between explanation-satisfaction perceptions and initial adoption levels. Lastly, we find that age is negatively correlated with taking the automatic advice, and a (non-significant) tendency of older age to be associated with lower adoption rates.

We contribute to the literature in three key ways. (1) First, we show that there is no single best explanation that fits all, and we show the importance of time and situation dependence in presenting the type of explanation that is best suited to promote trust; namely, "what" we should explain depends on "when". Interestingly, participants were more inclined overall to adopt AI-given advice than to follow a human expert. The effect of the model (and human advice) failure on subsequent advice taking is remarkable, with a single failure causing adoption rates to plunge. (2) Second, our results show that often the end-user does not need to know more than very general facts to accept the system. Our study shows that such general explanations yield good results in terms of RTA and WTP compared to the alternative of no explanation, especially after AI failure. Moreover, stating accuracy statistics about the algorithm can further strengthen the aforementioned effect. Furthermore, explanations increased the participants' willingness to pay for the advice. The results of the current study highlight the potential utility of simple explainability solutions in these respects. (3) Third, our experimental paradigm of the lemonade stand game brings the RTA measurement, in different and evolving situations, into the explainable AI literature. This framework may be utilized to answer many other questions in the field of explainable AI and trust in AI. Future work can investigate the effect of different explanation types in broader conditions such as, but not limited to, varying reported accuracy values of the algorithm and more extended periods. Does willingness to pay change over time and across situations? Will additional experimental methods, such as a more controlled lab experiment or a field experiment on a live system, show the same results? Moreover, some of the key components can be implemented in research studies on different explainability techniques, such as visual explanations. Answering these questions in addition to our study will help bridge the gap between the immense technical success we see in machine learning and the difficulty often faced when trying to understand and amplify adoption and trust among users.
References

[1] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1):8753–8830, 2017.
[2] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.
[3] Daniel Ben-David and Orly Sade. Robo-advisor adoption, willingness to pay, and trust—an experimental investigation. (December 2018), 2018.
[4] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pages 592–603, 2018.
[5] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114, 2015.
[6] Derek Doran, Sarah Schulz, and Tarek R Besold. What does explainable AI really mean? A new conceptualization of perspectives. arXiv preprint arXiv:1710.00794, 2017.
[7] David Gefen, Elena Karahanna, and Detmar W Straub. Trust and TAM in online shopping: An integrated model. MIS Quarterly, 27(1):51–90, 2003.
[8] Nicola Gennaioli, Andrei Shleifer, and Robert Vishny. Money doctors. The Journal of Finance, 70(1):91–114, 2015.
[9] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57, 2017.
[10] Pamela W Henderson and Robert A Peterson. Mental accounting and categorization. Organizational Behavior and Human Decision Processes, 51(1):92–117, 1992.
[11] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pages 241–250. ACM, 2000.
[12] Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. Metrics for explainable AI: Challenges and prospects, 2018.
[13] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280–2288, 2016.
[14] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.
[15] Sherrie YX Komiak and Izak Benbasat. The effects of personalization and familiarity on trust and adoption of recommendation agents. MIS Quarterly, pages 941–960, 2006.
[16] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 126–137. ACM, 2015.
[17] Tania Lombrozo. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470, 2006.
[18] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 623–631. ACM, 2013.
[19] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[20] D Harrison McKnight, Vivek Choudhury, and Charles Kacmar. Developing and validating trust measures for e-commerce: An integrative typology. Information Systems Research, 13(3):334–359, 2002.
[21] Christoph Molnar. Interpretable Machine Learning. 2019.
[22] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[23] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[24] Nava Tintarev and Judith Masthoff. A survey of explanations in recommender systems. In 2007 IEEE 23rd International Conference on Data Engineering Workshop, pages 801–810. IEEE, 2007.
[25] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[26] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, page 279. ACM, 2019.
Table 1 - Algorithmic advice with No Explanation treatment compared to the Human Advice with No Explanation treatment. The analysis is for the Readiness to Adopt (RTA) on the various days.

Probit                         Day 3       Day 7        Day 8        Day 10        Day 13      Day 14
Human_advisor                  -0.0109     -0.514**     0.115        -0.0574       -0.525*     -0.428**
                               (0.190)     (0.227)      (0.192)      (0.197)       (0.274)     (0.202)
Gender(male)                   -0.0410     -0.0665      -0.0647      -0.313        -0.119      -0.0921
                               (0.189)     (0.245)      (0.187)      (0.193)       (0.289)     (0.202)
Age                            -0.00811    -0.0107      -0.00589     -0.00583      -0.0399**   -0.00373
                               (0.00817)   (0.00935)    (0.00783)    (0.00816)     (0.0188)    (0.00824)
Age_exp                        -3.62e-31   3.12e-32**   4.23e-32***  -3.86e-32***  2.38e-25**  4.17e-32***
                               (4.07e-31)  (1.54e-32)   (1.34e-32)   (1.40e-32)    (1.21e-25)  (1.45e-32)
Trust index                    0.113       0.155        0.262        0.337*        0.387       0.352*
                               (0.196)     (0.248)      (0.197)      (0.205)       (0.327)     (0.211)
Explanation_goodness index     0.246       -0.641**     -0.245       -0.426**      -0.355      0.123
                               (0.204)     (0.256)      (0.201)      (0.208)       (0.320)     (0.217)
Constant                       0.380       1.992***     0.291        0.753**       3.258***    0.774**
                               (0.372)     (0.496)      (0.360)      (0.375)       (0.700)     (0.386)
N                              188         188          188          188           188         188
Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA on the different experiment days. Human_advisor is a dummy variable where the underlying condition is the algorithmic advice with No Explanation. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on the seven trust questions (see Appendix A) is above the median; Explanation_goodness index is a dummy variable that receives 1 for individuals whose index score on the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Table 2 - Algorithmic advice with No Explanation treatment compared to the other "AI" treatments. The analysis is for the Readiness to Adopt (RTA) on the various days.

Probit                         Day 3       Day 7       Day 8       Day 10      Day 13      Day 14
Global with Accuracy           0.343*      -0.452*     0.373*      0.108       0.190       -0.0317
                               (0.196)     (0.242)     (0.193)     (0.195)     (0.381)     (0.215)
Global with Features           -0.106      -0.286      0.354*      0.367*      0           0.229
                               (0.191)     (0.254)     (0.191)     (0.202)     (.)         (0.230)
Global Explanation             -0.0324     -0.147      0.222       -0.0109     -0.280      -0.168
                               (0.187)     (0.255)     (0.190)     (0.191)     (0.294)     (0.209)
Gender(male)                   -0.0291     0.163       0.0837      -0.0189     -0.0987     0.308*
                               (0.142)     (0.178)     (0.141)     (0.143)     (0.287)     (0.159)
Age                            0.00514     -0.00707    -0.00824    -0.00776    -0.0138     -0.00519
                               (0.00679)   (0.00905)   (0.00677)   (0.00711)   (0.0119)    (0.00732)
Age_exp                        2.36e-31    1.35e-30*   1.20e-31    7.53e-30    7.79e-31*   -3.56e-31
                               (4.39e-31)  (6.92e-31)  (4.57e-31)  (1.57e-29)  (4.29e-31)  (3.56e-31)
Trust index                    0.0976      0.103       0.376**     0.331**     0.150       0.153
                               (0.150)     (0.186)     (0.150)     (0.152)     (0.284)     (0.166)
Explanation_goodness index     0.374**     -0.159      -0.0393     -0.304*     0.288       0.274
                               (0.154)     (0.195)     (0.152)     (0.157)     (0.312)     (0.170)
Constant                       -0.154      1.543***    0.173       0.612*      2.215***    0.680**
                               (0.316)     (0.438)     (0.314)     (0.324)     (0.574)     (0.335)
N                              365         365         365         365         278         365
Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA on the different days. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Table 3 - Algorithmic advice with No Explanation treatment compared to the other "AI" treatments. The analysis is for the Readiness to Adopt (RTA) differences over the experiment phases.

Probit                         Gained_trust  Lost_trust  Recovery    Auto_advice_available  No_longer_free
Global with Accuracy           -1.240***     -0.501**    0.0986      0.492                  0.149
                               (0.435)       (0.211)     (0.323)     (0.543)                (0.219)
Global with Features           -0.519        -0.419**    0.426       0                      -0.115
                               (0.464)       (0.209)     (0.312)     (.)                    (0.235)
Global Explanation             -0.606        -0.262      0.0886      -0.0565                0.166
                               (0.442)       (0.201)     (0.288)     (0.392)                (0.217)
Gender(male)                   0.0646        -0.0139     -0.203      -0.153                 -0.301*
                               (0.321)       (0.155)     (0.232)     (0.444)                (0.161)
Age                            -0.0208       0.00727     -0.00680    -0.0249                0.00460
                               (0.0140)      (0.00748)   (0.0104)    (0.0204)               (0.00745)
Age_exp                        2.33e-30**    -5.97e-32   2.83e-29*   1.59e-25               4.36e-31
                               (1.05e-30)    (4.48e-31)  (1.45e-29)  (1.58e-25)             (3.49e-31)
Trust index                    -0.0615       -0.323**    0.628**     -0.0154                -0.101
                               (0.303)       (0.162)     (0.259)     (0.353)                (0.170)
Explanation_goodness index     -0.613**      0.150       -0.417      0.570                  -0.337*
                               (0.310)       (0.166)     (0.263)     (0.378)                (0.175)
Constant                       2.695***      -0.291      0.101       2.020**                -0.776**
                               (0.787)       (0.346)     (0.479)     (0.903)                (0.346)
N                              135           318         137         97                     365

Notes: Probit regressions; robust standard errors are in parentheses. The dependent variable is the RTA difference over the experiment phases: Gained_trust - RTA day 7 minus RTA day 3; Lost_trust - RTA day 7 minus RTA day 8; Recovery - RTA day 10 minus RTA day 8; Auto_advice_available - RTA day 13 minus RTA day 11; No_longer_free - RTA day 14 minus RTA day 13. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation_goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.

Table 4 - Algorithmic advice with No Explanation treatment compared to the other treatments ("AI" treatments and Human Advisor). The analysis is for the Willingness to Pay (WTP) for the advice in the last round of the game.

Tobit                          WTP
Global with Accuracy           0.840**
                               (0.344)
Global with Features           0.537*
                               (0.318)
Global Explanation             0.788**
                               (0.380)
Human_advisor                  0.454
                               (0.286)
Gender(male)                   -0.118
                               (0.248)
Age                            -0.156**
                               (0.0702)
Age_exp                        0.00169**
                               (0.000806)
Trust index                    -0.0460
                               (0.241)
Explanation goodness index     0.213
                               (0.258)
Constant                       4.279***
                               (1.528)
N                              449
Notes: Tobit regression; robust standard errors are in parentheses. The dependent variable is the WTP on the last day of the game. The algorithmic advice with No Explanation is the underlying condition, and the other treatments are dummy variables. Age_exp is the age exponent; Trust index is a dummy variable that receives 1 for individuals whose index score on all seven trust questions (see Appendix A) is above the median; Explanation goodness index is a dummy variable that receives 1 for individuals whose index score on all the perceived-explanation-goodness questions (see Appendix A) is above the median. * p<0.10, ** p<0.05, *** p<0.01.
Appendix A

The questionnaires on trust and perceived explanation goodness are described below. Participants were asked to state their agreement on a Likert scale from 1 (strongly agree) to 5 (strongly disagree).

The trust-antecedent questionnaire is based on McKnight et al. 2002, Gefen et al. 2003, and Komiak and Benbasat 2006:
1. The advisor explanations made me feel more secure.
2. I feel comfortable relying on the advice offered to me.
3. I feel the advisor has good knowledge about the advice.
4. The advice offered to me has nothing to gain by being dishonest with me.
5. The advice offered to me has nothing to gain by not caring about me.
6. I feel safer that there is an advisor that is making this financial decision.

The perceived explanation goodness questionnaire is based on Hoffman et al. 2018:
1. I understood what the advisor's recommendations were based on.
2. From the explanation, I understand how the advisor works.
3. This explanation of how the advisor works is satisfying.
4. This explanation of how the advisor works has sufficient detail.
5. This explanation of how the advisor works seems complete.
6. This explanation of how the advisor works tells me how to use it.
7. This explanation of how the advisor works is useful to my goals.
8. This explanation of the advisor shows me how accurate the advisor is.
9. This explanation lets me judge when I should trust and not trust the advisor.