Disparate Impact Diminishes Consumer Trust Even for Advantaged Users
Tim Draws, Zoltán Szlávik, Benjamin Timmermans, Nava Tintarev, Kush R. Varshney, Michael Hind
Tim Draws*, Zoltán Szlávik*, Benjamin Timmermans*, Nava Tintarev, Kush R. Varshney, and Michael Hind
IBM Center for Advanced Studies Benelux; Delft University of Technology ([email protected]); myTomorrows ([email protected]); IBM Research ([email protected], [email protected], [email protected]); Maastricht University ([email protected])
* current affiliation
Abstract.
Systems aiming to aid consumers in their decision-making (e.g., by implementing persuasive techniques) are more likely to be effective when consumers trust them. However, recent research has demonstrated that the machine learning algorithms that often underlie such technology can act unfairly towards specific groups (e.g., by making more favorable predictions for men than for women). An undesired disparate impact resulting from this kind of algorithmic unfairness could diminish consumer trust and thereby undermine the purpose of the system. We studied this effect by conducting a between-subjects user study investigating how (gender-related) disparate impact affected consumer trust in an app designed to improve consumers' financial decision-making. Our results show that disparate impact decreased consumers' trust in the system and made them less likely to use it. Moreover, we find that trust was affected to the same degree across consumer groups (i.e., advantaged and disadvantaged users) despite both of these consumer groups recognizing their respective levels of personal benefit. Our findings highlight the importance of fairness in consumer-oriented artificial intelligence systems.
Keywords: disparate impact · algorithmic fairness · consumer trust

Applications that seek to advise or nudge consumers into better decision-making (e.g., concerning personal health or finance) can only be effective when consumers trust their guidance. Trustworthiness is an essential aspect in the design of such persuasive technology (PT), i.e., technology aiming to change attitudes or behaviors without using coercion or deception [26,28,41], because consumers are unlikely to use (or be persuaded by) systems that they do not trust [23,35]. Recent research has identified several factors that affect consumer trust in this context, including consumers' emotional states [1] as well as the system's reliability [26] and transparency [35]. Moreover, it has been argued that trust also depends on moral expectations that consumers have towards the technology they use [26,35]. Consumer trust could increasingly depend on such moral expectations as more systems implement machine learning algorithms (e.g., in personal health [33,35] or finance [20] applications) that make them harder to scrutinize.

A specific moral expectation that acts as a requirement for trust in this context may be fairness [39]. When nudges and advice are tailored to the individual consumer using machine learning, consumers may expect that the system acts fairly towards different consumer groups (e.g., concerning race or gender). Nudging or advising such that the degree of positive impact that the system has on consumers' lives varies with group membership could constitute an undesired disparate impact. For example, a robo-advisor (i.e., PT designed to improve consumers' financial situation [20]) could have a disparate impact by systematically recommending "safer", lower-risk investments to female consumers compared to male consumers, yielding them lower returns. Such disparate impact would violate the moral expectation of fairness and thereby undermine consumer trust.

Employing machine learning in consumer-oriented applications often holds the promise of increasing their usefulness to the individual consumer [33,46] but also bears a greater vulnerability for disparate impact. Recent research has demonstrated that machine learning algorithms may unfairly discriminate based on group membership [2,6,27]. Such discrimination is referred to as algorithmic unfairness if a pre-defined notion of fairness is violated [27,42] and can easily lead to an undesired disparate impact [6,13]. For example, outcomes in advice from robo-advisors may differ between groups, given that financial advice has historically been gender-biased to the disadvantage of female consumers [5,24] and algorithmic unfairness often results from disparities in the historical data that is used to train the algorithm [27]. Although several methods have been developed to mitigate algorithmic unfairness [7], in many cases it is currently not possible to do so to a satisfactory degree [8].

Disparate impact is thus a realistic issue that could undermine the efficacy of consumer-oriented artificial intelligence (AI) systems. It has been argued that fairness plays a key role in fostering trust in AI [4,34,37,39,40]. However, to the best of our knowledge, no previous work has studied the influence of undesired disparate impact (i.e., as a result of algorithmic unfairness) on consumer trust. It is further unclear whether unfairly advantaged consumers are affected to the same degree as disadvantaged consumers in this context.
That is, the influence of disparate impact on consumer trust may depend on perceived personal benefit (i.e., advantaged users trusting the system's advice despite disparate impact as long as they personally benefit) or not (i.e., advantaged users losing trust in lockstep with disadvantaged users despite a perceived personal benefit). We study the effect of disparate impact on consumer trust in the use case of gender bias in robo-advisors by investigating the following research questions:

– RQ1.
Does an apparent disparate impact of a robo-advisor affect the degree of trust that consumers place in it?

– RQ2.
Does disparate impact affect the trust of unfairly advantaged consumers to a different degree than that of unfairly disadvantaged consumers?

To answer these questions, we conducted a between-subjects user study where we exposed participants to varying degrees of disparate impact of a robo-advisor (i.e., advantaging male users; see Section 3). Our results show that disparate impact negatively affected consumers' trust in the robo-advisor and decreased their willingness to use it (see Section 4). Furthermore, we find that, despite both groups recognizing their respective personal (dis)advantage, both the disadvantaged group (women) and the advantaged group (men) experienced the same decrease in trust when they learned about a disparate impact of the robo-advisor. Our findings underline the importance of ensuring algorithmic fairness in consumer-oriented (AI) systems when aiming to maintain consumer trust.
We study the effect of disparate impact on consumer trust in the use case of gender bias in robo-advisors. Our reasons for choosing the financial domain here are threefold. First, algorithmic decision-making is already widespread in consumer-oriented financial applications (e.g., in robo-advisors) [20]. Second, algorithmic decision-making in such systems is highly impactful: it directly affects consumers' financial situations and thereby their quality of life. Third, (human) financial advice has traditionally been gender-biased, underestimating and disadvantaging female consumers [5,24]. Historical data on financial advice thus contain these biases. If the algorithms that underlie robo-advisors are trained on these data, robo-advisors may exhibit a corresponding disparate impact.
Trust in AI systems.
Consumers do not use systems that they do not trust [23]. That is why trust is an important aspect of the interaction between consumer-oriented AI systems (e.g., those implementing PT) and consumers [1,26,35,41]. Recent research has linked trust in such systems to the reliability [26] and transparency [35] of the system at hand as well as to consumers' emotional states [1] and moral expectations [26,35]. Such moral expectations may gain in importance as systems increasingly rely on machine learning algorithms [20,30,33,35,46]. Moreover, whereas in some cases consumers fall prey to automation bias (i.e., a tendency to prefer automated over human decisions) [10], in other cases they experience what has been referred to as algorithm aversion: a tendency to prefer human over algorithmic advice [11,29,32]. Research has shown that algorithm aversion can be the result of witnessing how an algorithm errs [12]. Especially in cases where a machine learning algorithm acted unfairly, leading to an undesired disparate impact (i.e., violating consumers' moral expectations and reflecting erroneous decision-making), consumer trust could thus be diminished.
Measuring and mitigating algorithmic unfairness.
Research has demonstrated that machine learning algorithms can make biased (unfair) predictions to the disadvantage of specific groups [2,6,27,43]. For instance, AI systems may discriminate between white and black defendants in predicting their likelihood of re-offending [2] and between male and female consumers in predicting their creditworthiness [43]. Several methods have been proposed to measure and mitigate biases in algorithmic decision-making [7,15,21,22,27,47]. Despite these efforts, the measurement and mitigation of algorithmic bias remain challenging [8,27].
Disparate impact and trust.
Algorithmic fairness has been identified as a core building block of trustworthy AI systems [4,34,37,39,40], yet few studies directly investigate the relationship between algorithmic fairness (or disparate impact) and consumer trust. Participants in one study reported that learning about algorithmic unfairness induced negative feelings and that it might cause them to lose trust in a company or product [45]. Consumers have further expressed general concerns about disparate impact of AI on a societal level [3] and are more likely to judge decisions as less fair and trustworthy if they are made by an algorithm as opposed to a human [19]. However, it has also been shown that the degree to which people are concerned about disparate impact depends on their personal biases [31,36]. What remains unclear is to what extent disparate impact (as a result of algorithmic unfairness) affects consumer trust and, if so, who (i.e., unfairly advantaged or disadvantaged consumers) is affected in particular.
To investigate the two research questions identified in Section 1, we conducted a between-subjects user study. The setting of this study was a fictional scenario in which a bank offers a robo-advisor – called the AI Advisor – to its customers. We aimed to perform a granular analysis of the effect of disparate impact on consumer trust by exposing participants to different degrees of disparate impact supposedly caused by the AI Advisor and measuring their attitudes towards this system. Specifically, we analyzed whether the different degrees of disparate impact affected participants' trust (i.e., whether they believed that the AI Advisor would make correct predictions and therefore benefit its users). To differentiate between this general notion of trust and related attitudes, we also measured willingness to use and perceived personal benefit concerning the AI Advisor.

Dependent Variables.
Our experiment involved measuring participants' attitudes towards the AI Advisor; specifically trust, willingness to use, and perceived personal benefit. Each variable was measured twice: once after participants saw general user statistics (Step 1; see Section 3.3) and once after participants saw gender-specific user statistics on the AI Advisor (Step 2). We computed difference scores from these two measurements that reflected how seeing the gender-specific statistics affected participants' attitudes as compared to their baseline attitudes.

– Change in Trust (Continuous).
Participants rated their trust by responding to the item "In general, the AI Advisor can be trusted to make correct recommendations" on a 7-point Likert scale. We coded all responses on an ordinal scale ranging from −3 to 3 and computed the change in trust from the difference between the two measurements. Values could thus range from −6 to 6.

– Change in Willingness to Use (Categorical).
Participants could respond to the item "I would personally use the AI Advisor" with either "yes" or "no". We recorded whether their answer had changed (i.e., "yes" to "no" or vice versa) or stayed the same in the second measurement. This variable thus encompassed three categories.

– Change in Perceived Personal Benefit (Continuous).
Participants rated their perceived personal benefit by responding to the item "I would personally benefit from using the AI Advisor" on a 7-point Likert scale. To compute the change in perceived personal benefit, we again subtracted the second measurement from the first. Values could thus range from −6 to 6.

Independent Variable.
Our experiment varied depending on the condition that a participant was placed in (see Section 3.3):

– Condition.
During the experiment, we showed participants a table with user statistics of bank customers that use the AI Advisor. These statistics, supposedly showing the average change in bank account balance for users and non-users of the AI Advisor, split by gender, differed depending on the condition a participant had been placed in. Each participant saw only one of four conditions: the control condition (in which the statistics were balanced across genders, reflecting an absence of disparate impact) or one of three experimental conditions – which we call little bias, strong bias, and extreme bias – that reflected varying degrees of disparate impact in favor of male consumers. Specifically, these different degrees of disparate impact represented scenarios in which female users of the AI Advisor were disadvantaged but still benefited from using the AI Advisor (little bias), did not benefit from the AI Advisor (strong bias), or would in fact benefit from not using the AI Advisor (extreme bias). Table 1 shows the numbers that were shown in the second statistics table in each of the conditions; a small worked example of how these numbers encode different degrees of disparate impact follows below.
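The following sketch is our own illustration (not part of the study materials): it summarizes the gender gap encoded by each condition in Table 1, using the female-to-male ratio of the average yearly balance gain of AI Advisor users as one possible summary measure. All variable names and the choice of ratio are hypothetical additions for illustration only.

```python
# Hypothetical illustration (not study material): summarize the gender gap that
# each condition in Table 1 encodes, using the female-to-male ratio of the
# average yearly balance gain of AI Advisor users as one possible measure.
conditions = {
    "control":      {"male": 0.20, "female": 0.20},  # balanced, no disparate impact
    "little_bias":  {"male": 0.25, "female": 0.15},
    "strong_bias":  {"male": 0.30, "female": 0.10},
    "extreme_bias": {"male": 0.35, "female": 0.05},
}
BASELINE_GAIN = 0.10  # yearly gain of customers who do not use the AI Advisor

for name, gain in conditions.items():
    ratio = gain["female"] / gain["male"]             # 1.0 means parity
    female_benefit = gain["female"] - BASELINE_GAIN   # benefit of using the advisor
    print(f"{name:>12}: female/male ratio = {ratio:.2f}, "
          f"female benefit over non-users = {female_benefit:+.0%}")
```

Under this (assumed) measure, the control condition yields parity, while female users still benefit in the little bias condition (+5%), break even in the strong bias condition (0%), and are worse off than non-users in the extreme bias condition (−5%), mirroring the scenarios described above.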
Individual Differences and Descriptive Statistics.
We took two additional measurements to enable more fine-grained analyses and to describe our sample:

– Gender.
Participants could state which gender they identified with by picking from the options "male", "female", and "other / not specified".

– Age.
Participants could write their age in an open text field.
Table 1. Fictional gender-specific statistics shown to participants during the second step of the study. Only the top left two cells (concerning users of the AI Advisor) differed across conditions, reflecting varying degrees of disparate impact.

Control condition
         Using AI Advisor   Not using AI Advisor
Male     20%                10%
Female   20%                10%
All      20%                10%

Little bias condition
         Using AI Advisor   Not using AI Advisor
Male     25%                10%
Female   15%                10%
All      20%                10%

Strong bias condition
         Using AI Advisor   Not using AI Advisor
Male     30%                10%
Female   10%                10%
All      20%                10%

Extreme bias condition
         Using AI Advisor   Not using AI Advisor
Male     35%                10%
Female   5%                 10%
All      20%                10%

Based on the research questions RQ1 and RQ2 introduced in Section 1, the related work from Section 2, and the experimental setup described in this section, we formulated several hypotheses. We expected that disparate impact would decrease consumer trust (H1a) and that consumers would be less likely to use the AI Advisor (H1b) if it has disparate impact (i.e., the stronger the disparate impact, the lower consumer trust and willingness to use the AI Advisor). We predicted that disparate impact would affect the perceived personal benefit of male consumers differently compared to female consumers (i.e., following what the displayed statistics suggest; H2a). Accordingly, we further expected that the decrease in trust described in H1a would be moderated by gender (H2b). That is, we predicted that the trust of advantaged consumers (i.e., men) would be affected differently compared to disadvantaged consumers (i.e., women).

– H1a.
Consumers who are exposed to statistics that reveal a disparate impact of a robo-advisor in favor of male users will trust this system less to give correct recommendations compared to consumers who are exposed to balanced statistics.

– H1b.
Consumers who are exposed to statistics that reveal a disparate impact of a robo-advisor in favor of male users will be less likely to use this system compared to consumers who are exposed to balanced statistics.

– H2a.
The effect of statistics suggesting a disparate impact of a robo-advisor in favor of men on perceived personal benefit is moderated by gender.

– H2b.
The effect of statistics suggesting a disparate impact of a robo-advisor in favor of men on consumer trust is moderated by gender.
We set up our user study by creating a task on the online study platform Figure Eight (since conducting this study in June 2019, Figure Eight has been renamed to Appen; more information can be found at https://appen.com). Before commencing with the experiment, participants were shown a short introduction and asked to state their gender and age. The experiment consisted of two steps. Whereas Step 1 was the same for all participants, Step 2 differed depending on which one of four conditions a participant had been assigned to.
Step 1.
We introduced participants to a fictional scenario in which they could activate a robo-advisor – called the AI Advisor – in their banking app:

"Imagine your bank offers a digital assistant called the 'AI advisor'. If you activate the AI advisor in your banking app, it will monitor your financial situation and give you relevant recommendations that may improve your financial situation. For example, it may suggest saving strategies or recommend investments."
Additionally, to promote the idea that the AI Advisor is generally reliable, participants were given an idea of whether people benefit from using the AI Advisor:

"Overall statistics suggest that people benefit from using the AI advisor. The bank account balance of bank customers who use the AI advisor increases by an average of 20% every year, whereas the balance of customers who don't use the AI advisor increases by an average of only 10% per year."

Below was a table displaying the mentioned statistics. We then measured trust, willingness to use, and perceived personal benefit concerning the AI Advisor.

Step 2.
Participants were led to a new page for the second step of the experiment. Here we added some additional information on the AI Advisor:

"Next to general statistics on all bank customers, we can also look at how the AI advisor performs for subgroups of bank customers. Below you can see the change in bank account balance for men and women in particular."
Below this text was a table similar to the table in Step 1, but with two added rows that showed the average change in bank account balance per year for men and women in particular (see Table 1). Whereas the statistics for all bank customers overall, as well as for men and women not using the AI Advisor, were the same in all conditions, the statistics for men and women using the AI Advisor varied depending on the condition a participant had been assigned to (see Section 3.1). Table 1 shows the displayed statistics for male and female users per condition. We then again measured trust, willingness to use, and perceived personal benefit.
Testing H1a and H2b.
To test whether there is an effect of disparate impact on consumer trust (H1a) that is moderated by gender (H2b), we conducted a classical ANOVA with condition and gender as between-subjects factors and change in trust as the dependent variable. A significant main effect of condition on change in trust in this analysis would suggest that change in trust differed between conditions (H1a). In this case, we would perform posthoc analyses to investigate the differences between the conditions in more detail. A significant interaction effect between condition and gender would suggest that the conditions had a different effect for the disadvantaged group (i.e., female participants) compared to the advantaged group (i.e., male participants; H2b).
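As an illustration only (this is not the authors' analysis code), such a two-way between-subjects ANOVA could be run in Python with statsmodels, assuming a hypothetical long-format file responses.csv with one row per participant and the hypothetical columns change_in_trust, condition, and gender:

```python
# A minimal sketch of the 4 (condition) x 2 (gender) between-subjects ANOVA on
# change in trust; the file name and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("responses.csv")  # one row per participant
model = smf.ols("change_in_trust ~ C(condition) * C(gender)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects of condition and gender + their interaction
```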
We further conducted a Bayesian ANOVA according to the protocol proposed by van den Bergh et al. [38]. Bayesian hypothesis tests involve the computation of the Bayes factor, a quantitative comparison of the predictive power of two competing statistical models [44]. The Bayes factor weighs the evidence provided by the data and thus allows for direct model comparison. Practically, comparing different models (i.e., including or excluding an interaction effect of condition and gender) this way allowed for a richer interpretation of our results. We performed the Bayesian ANOVA using the software JASP [16] with default settings. We computed Bayes Factors (BFs) by comparing the models of interest to a null model (consisting of only an intercept) and interpreted them according to the guidelines proposed by Lee and Wagenmakers [18], who adopted them from Jeffreys [17].
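For readers without JASP, a rough sense of such a Bayes factor can be obtained from the standard BIC-based approximation BF10 ≈ exp((BIC_null − BIC_alternative) / 2). The sketch below is our own illustration under that assumption – it is not the JASP analysis and is not equivalent to its default priors – and reuses the hypothetical responses.csv layout from the previous sketch:

```python
# Hypothetical sketch: BIC-based approximation of BF10 for the condition-only
# model against an intercept-only null model (not equivalent to JASP's priors).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")  # hypothetical file name
null_fit = smf.ols("change_in_trust ~ 1", data=df).fit()
cond_fit = smf.ols("change_in_trust ~ C(condition)", data=df).fit()
bf10_approx = np.exp((null_fit.bic - cond_fit.bic) / 2)  # > 1 favors the condition model
print(f"Approximate BF10 (condition vs. null): {bf10_approx:.2f}")
```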
Testing H1b. We tested whether disparate impact affected participants' willingness to use the AI Advisor by conducting a chi-squared test between condition and change in willingness to use. A significant result in this analysis would suggest that the number of participants who changed their willingness to use the AI Advisor differed across conditions.
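A minimal sketch of this chi-squared test (again our own illustration, not the authors' code), assuming a hypothetical column change_in_willingness holding the categories shown in Table 3:

```python
# Minimal sketch of the chi-squared test between condition and the change in
# willingness to use; file and column names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("responses.csv")
table = pd.crosstab(df["condition"], df["change_in_willingness"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```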
Testing H2a.
We conducted another ANOVA with condition and gender as between-subjects factors and change in perceived personal benefit as the dependent variable to test whether gender acted as a moderator here. A significant interaction effect in this analysis would indicate that this was the case.
Significance Threshold and Correction for Multiple Testing.
In all classical analyses we conducted, we aimed for a type 1 error probability of no more than 0.05. However, by conducting our planned analyses we automatically tested a total of seven hypotheses: three in each ANOVA (i.e., two main effects and one interaction) and one in the chi-squared test. This meant that the probability of committing a type 1 error rose considerably [9]. Therefore, we adjusted our significance threshold by applying a Bonferroni correction, where the desired type 1 error rate is divided by the number of hypotheses that are tested [25]. In our main analyses we thus used a significance threshold of 0.05 / 7 ≈ 0.007 and only considered a result significant if its p-value fell below this adjusted threshold. The same procedure was applied for posthoc analyses comparing each of the four conditions with each other, as this meant conducting (4 choose 2) = 6 hypothesis tests (i.e., adjusting the threshold to 0.05 / 6 ≈ 0.008).
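The adjusted thresholds can be reproduced with a few lines of Python (a small illustration, not taken from the paper):

```python
# Reproduce the Bonferroni-adjusted significance thresholds described above.
from math import comb

alpha = 0.05
main_tests = 7              # 3 effects per ANOVA x 2 ANOVAs + 1 chi-squared test
posthoc_tests = comb(4, 2)  # pairwise comparisons between 4 conditions = 6

print(f"Main analyses threshold:    {alpha / main_tests:.4f}")     # ~0.0071
print(f"Posthoc analyses threshold: {alpha / posthoc_tests:.4f}")  # ~0.0083
```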
We recruited 567 participants via the Figure Eight pool of contributors (554) and direct contacts (13). Seventy-three participants were excluded from the study because they either filled at least one of the obligatory text fields with less than 10 characters, took less than 60 seconds to complete the task, or took more than 10 minutes to complete the task. Furthermore, we did not analyze the data of five participants who stated "other / not specified" as their gender because our study involved a disparate impact between male and female consumers.
After exclusion, 489 participants remained. Of those, 238 (49%) were male and 251 (51%) were female, with a mean age of 41 years. Participants were distributed across the control, little bias, strong bias, and extreme bias conditions.

H1a: Disparate Impact Decreased Consumer Trust.
As hypothesized, change in trust differed across conditions (F = 6.906, p < .001; BF_10 = 70.02, see Table 2). To test for differences between the individual conditions, we conducted posthoc analyses (i.e., Mann-Whitney U tests). Only the difference between the control and extreme bias conditions was significant (W = 9368, p below the Bonferroni-adjusted threshold of .0083).
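For illustration only (not the authors' code), one such pairwise posthoc comparison could be computed as follows, reusing the hypothetical responses.csv layout from the sketches above; the resulting p-value would then be compared against the adjusted threshold of roughly .0083:

```python
# Minimal sketch of the posthoc Mann-Whitney U test comparing the control and
# extreme bias conditions on change in trust; all names are hypothetical.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("responses.csv")
control = df.loc[df["condition"] == "control", "change_in_trust"]
extreme = df.loc[df["condition"] == "extreme_bias", "change_in_trust"]
stat, p = mannwhitneyu(control, extreme, alternative="two-sided")
print(f"W = {stat:.0f}, p = {p:.4f}")
```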
Table 2. Bayesian ANOVA with change in trust as the dependent variable.

Models                                    P(M)    P(M | Data)   BF_M     BF_10     Error %
Null model                                0.200   8.209e-4      0.003    1.000
condition                                 0.200   0.057         0.244    70.024    0.001
gender                                    0.200   0.004         0.018    5.413     1.533e-6
condition + gender                        0.200   0.735         11.116   895.772   1.638
condition + gender + condition * gender   0.200   0.202         1.012    245.894   1.978
Fig. 1. Change in trust across conditions for all participants (left-hand panel) and split by gender (right-hand panel). The error bars represent 95% confidence intervals.
H1b: Disparate Impact Decreased Willingness to Use.
In accordance with disparate impact negatively affecting trust (H1a), it decreased participants' willingness to use the AI Advisor (see Table 3). The increasing proportion of participants who changed their attitude from "yes" to "no" as conditions reflected stronger disparate impact was statistically significant (χ² = 25, p < .001).
Table 3. Change in willingness to use across conditions. The labels −, =, and + reflect changes from "yes" to "no", no change, and "no" to "yes", respectively.

H2a: Gender Moderated the Effect of Disparate Impact on Perceived Personal Benefit.
As expected, the results from the second ANOVA show a significant interaction effect of condition and gender on change in perceived personal benefit (F = 8.525, p < .001).

H2b: Male Consumers Experienced the Same Decrease in Trust as Female Consumers.
In contrast to what we hypothesized, we do not find a significant interaction effect of condition and gender on change in trust: male participants experienced a similar change in trust compared to that of female participants. The Bayesian ANOVA confirms this result: the model containing just the two main effects of condition and gender explains the data best (BF_10 = 895.77; see Table 2), roughly four times better than the model that includes the interaction effect (BF_10 = 245.89). This suggests that unfairly advantaged and disadvantaged participants (i.e., men and women, respectively) experienced the same decrease in trust due to algorithmic unfairness despite diverging levels of perceived personal benefit (H2a).
Fig. 2. Change in perceived personal benefit across conditions and split by gender. The error bars represent 95% confidence intervals.
In this paper, we presented a between-subjects user study that aimed to investigate the influence of algorithmically-driven disparate impact on consumer trust in the use case of gender bias in robo-advisors. Our results suggest that disparate impact – at least when it is extreme – decreases trust and makes consumers less likely to use such systems. We further find that, although disadvantaged and advantaged users recognize their respective levels of personal benefit in scenarios of disparate impact, both experience equally decreasing levels of trust when they learn about a disparate impact caused by the system at hand. Our work contributes to a growing body of literature that highlights the importance of ensuring fairness and avoiding disparate impact of consumer-oriented AI systems.
Our findings have implications for consumers as well as industry. Consumers should be aware that machine-learning-based applications can be biased. If disparate impact is an important factor for consumer trust, consumers need to think critically when using such systems. One potential way forward for consumers would be to demand that companies publish independently conducted research into the (algorithmic) fairness and impact of their products.

Publishers of consumer-oriented AI systems need to establish algorithmic fairness in their products and avoid disparate impact to serve consumers effectively. Our findings show that failing to do so may lead to a decrease in consumers' trust and willingness to use such systems.
Our study is subject to at least five important limitations. First, we studied the effect of disparate impact on consumer trust in a specific use case: a binary gender bias in robo-advisors. This makes our results difficult to generalize because many other forms of bias (including those based on race, religion, or sexual orientation) as well as other AI systems (e.g., for recommendations of medical treatment, tourist attractions, or movies) exist. It is easy to imagine how consumer trust could be affected differently when, for example, disparate impact concerns small minorities, multitudes of gender identities (or another consumer characteristic), a chosen group membership such as consumers' profession, or a system that is less impactful on consumers' personal lives than a robo-advisor. On a related note, we here positioned women as the disadvantaged and men as the advantaged group (i.e., the setting that corresponds to biases in human financial advice), but it is not certain that we would obtain the same results if the (dis)advantage were distributed the other way around. Future work could explore these different scenarios to help generalize and better understand the effect of disparate impact on consumer trust.

Second, our finding that advantaged and disadvantaged users experienced the same decrease in trust appears to run counter to previous research suggesting that people make stronger fairness judgments when they are personally affected [14]. However, it is not clear from our results to what degree advantaged users (i.e., men) felt personally affected; e.g., because they have women in their lives who they deeply care about. The role of personal relevance in the effect of disparate impact on consumer trust thus remains to be clarified by future research.

Third, our results show a decreasing trend in consumer trust as conditions become more extreme, but show a statistically significant difference only between the control and extreme bias conditions. Future work could examine these differences (also across domains) in more detail to establish the relationship between the level of disparate impact and consumer trust (e.g., to determine what lies within and beyond an "acceptable margin" of disparate impact).

Fourth, we studied fairness related to group membership (i.e., gender), which might elicit a different (moral) evaluation than fairness on the individual level. Our results show that trust can decrease despite perceived personal benefit. However, this effect might have been caused by a sense of loyalty towards the disadvantaged group. An interesting direction for future work is to study whether similar patterns emerge when disparate impact concerns individuals; e.g., when advantaged and disadvantaged subjects are randomly chosen.
We presented a user study investigating the effect of algorithmically-driven disparate impact (i.e., when algorithm outcomes adversely affect one group of consumers compared to another) on consumer trust. Specifically, we studied the effect of gender bias in an application that aimed to persuade consumers to make better financial decisions. We found that disparate impact decreased participants' trust and willingness to use the application. Furthermore, our results show that the trust of unfairly advantaged participants was just as affected as that of disadvantaged participants. These findings imply that disparate impact (i.e., as a result of algorithmic unfairness) can undermine trust in consumer-oriented AI systems and should therefore be avoided or mitigated when aiming to create trustworthy technology.
Acknowledgements
This research has been supported by the Think Forward Initiative (a partnership between ING Bank, Deloitte, Dell Technologies, Amazon Web Services, IBM, and the Center for Economic Policy Research – CEPR). The views and opinions expressed in this paper are solely those of the authors and do not necessarily reflect the official policy or position of the Think Forward Initiative or any of its partners.
References
1. Ahmad, W.N.W., Ali, N.M.: A Study on Persuasive Technologies: The Relationship between User Emotions, Trust and Persuasion. Int. J. Interact. Multimed. Artif. Intell. (1), 57–61 (2018). https://doi.org/10.9781/ijimai.2018.02.010
2. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: There's software used across the country to predict future criminals and it's biased against blacks. ProPublica (2019)
3. Araujo, T., Helberger, N., Kruikemeier, S., de Vreese, C.H.: In AI we trust? Perceptions about automated decision-making by artificial intelligence. AI Soc. (3), 611–623 (2020). https://doi.org/10.1007/s00146-019-00931-w
4. Arnold, M., Piorkowski, D., Reimer, D., Richards, J., Tsay, J., Varshney, K.R., Bellamy, R.K., Hind, M., Houde, S., Mehta, S., Mojsilovic, A., Nair, R., Ramamurthy, K.N., Olteanu, A.: FactSheets: Increasing trust in AI services through supplier's declarations of conformity. IBM J. Res. Dev. (4-5) (2019). https://doi.org/10.1147/JRD.2019.2942288
5. Baeckström, Y., Silvester, J., Pownall, R.A.: Millionaire investors: financial advisors, attribution theory and gender differences. Eur. J. Financ. (15), 1333–1349 (2018). https://doi.org/10.1080/1351847X.2018.1438301
6. Barocas, S., Selbst, A.D.: Big data's disparate impact. Calif. Law Rev. (671), 671–732 (2016)
7. Bellamy, R.K., Mojsilovic, A., Nagar, S., Ramamurthy, K.N., Richards, J., Saha, D., Sattigeri, P., Singh, M., Varshney, K.R., Zhang, Y., Dey, K., Hind, M., Hoffman, S.C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S.: AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Dev. (4-5) (2019). https://doi.org/10.1147/JRD.2019.2942287
8. Corbett-Davies, S., Goel, S.: The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv preprint arXiv:1808.00023 (2018), http://arxiv.org/abs/1808.00023
9. Cramer, A.O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R.P., Waldorp, L.J., Wagenmakers, E.J.: Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychon. Bull. Rev. (2), 640–647 (2016). https://doi.org/10.3758/s13423-015-0913-5
10. Cummings, M.L.: Automation bias in intelligent time critical decision support systems. Collect. Tech. Pap. - AIAA 1st Intell. Syst. Tech. Conf., 557–562 (2004). https://doi.org/10.2514/6.2004-6313
11. Diab, D.L., Pui, S.Y., Yankelevich, M., Highhouse, S.: Lay perceptions of selection decision aids in US and non-US samples. Int. J. Sel. Assess. (2), 209–216 (2011). https://doi.org/10.1111/j.1468-2389.2011.00548.x
12. Dietvorst, B.J., Simmons, J.P., Massey, C.: Algorithm aversion: People erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen. (1), 114–126 (2015). https://doi.org/10.1037/xge0000033
13. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and Removing Disparate Impact. In: Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 259–268 (2015)
14. Ham, J., van den Bos, K.: Not fair for me! The influence of personal relevance on social justice inferences. J. Exp. Soc. Psychol. (3), 699–705 (2008). https://doi.org/10.1016/j.jesp.2007.04.009
15. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Adv. Neural Inf. Process. Syst., pp. 3323–3331 (2016)
16. JASP Team: JASP (Version 0.14) (2020)
17. Jeffreys, H.: Theory of Probability. Oxford University Press, Oxford (1939)
18. Lee, M.D., Wagenmakers, E.J.: Bayesian cognitive modeling: A practical course. Cambridge University Press (2014). https://doi.org/10.1017/CBO9781139087759
19. Lee, M.K.: Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management. Big Data Soc. (1), 1–16 (2018). https://doi.org/10.1177/2053951718756684
20. Lieber, R.: Financial Advice for People Who Aren't Rich (Apr 2014)
21. Mary, J.J., Calauzènes, C., Karoui, N.E.: Fairness-Aware Learning for Continuous Attributes and Treatments. In: ICML, pp. 4382–4391 (2019), http://proceedings.mlr.press/v97/mary19a.html
22. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A Survey on Bias and Fairness in Machine Learning. arXiv preprint arXiv:1908.09635 (2019), http://arxiv.org/abs/1908.09635
23. Muir, B.M., Moray, N.: Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation. Ergonomics (3), 429–460 (1996)
24. Mullainathan, S., Noeth, M., Schoar, A.: The Market for Financial Advice: An Audit Study. SSRN Electron. J. (2012). https://doi.org/10.2139/ssrn.1572334
25. Napierala, M.A.: What Is the Bonferroni correction? (2012)
26. Nickel, P., Spahn, A.: Trust, Discourse Ethics, and Persuasive Technology. In: Persuasive Technology: Design for Health and Safety; 7th Int. Conf. Persuasive Technology 2012, pp. 37–40. Linköping University Electronic Press (2012)
27. Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M.E., Ruggieri, S., Turini, F., Papadopoulos, S., Krasanakis, E., Kompatsiaris, I., Kinder-Kurlanda, K., Wagner, C., Karimi, F., Fernandez, M., Alani, H., Berendt, B., Kruegel, T., Heinze, C., Broelemann, K., Kasneci, G., Tiropanis, T., Staab, S.: Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. (3), 1–14 (2020). https://doi.org/10.1002/widm.1356
28. Oinas-Kukkonen, H., Harjumaa, M.: Towards deeper understanding of persuasion in software and information systems. In: Proc. 1st Int. Conf. Adv. Comput. Interact. (ACHI 2008) (2008). https://doi.org/10.1109/ACHI.2008.31
29. Önkal, D., Goodwin, P., Thomson, M., Gönül, S., Pollock, A.: The relative influence of advice from human experts and statistical methods on forecast adjustments. J. Behav. Decis. Mak. (4), 390–409 (2009). https://doi.org/10.1002/bdm.637
30. Orji, R., Moffatt, K.: Persuasive technology for health and wellness: State-of-the-art and emerging trends. Health Informatics J. (1), 66–91 (2018). https://doi.org/10.1177/1460458216650979
31. Otterbacher, J., Checco, A., Demartini, G., Clough, P.: Investigating user perception of gender bias in image search: The role of sexism. In: 41st Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR 2018), pp. 933–936 (2018). https://doi.org/10.1145/3209978.3210094
32. Promberger, M., Baron, J.: Do patients trust computers? J. Behav. Decis. Mak. (5), 455–468 (2006). https://doi.org/10.1002/bdm.542
33. Purpura, S., Schwanda, V., Williams, K., Stubler, W., Sengers, P.: Fit4Life: The Design of a Persuasive Technology Promoting Healthy Behavior and Ideal Weight. In: Proc. SIGCHI Conf. Hum. Factors Comput. Syst., pp. 423–432 (2011)
34. Rossi, F.: Building trust in artificial intelligence. J. Int. Aff. (1), 127–133 (2019)
35. Sattarov, F., Nagel, S.: Building trust in persuasive gerontechnology: User-centric and institution-centric approaches. Gerontechnology (1), 1–14 (2019). https://doi.org/10.4017/gt.2019.18.1.001.00
36. Smith, J., Sonboli, N., Fiesler, C., Burke, R.: Exploring User Opinions of Fairness in Recommender Systems. In: CHI'20 Workshop on Human-Centered Approaches to Fair and Responsible AI (2020), http://arxiv.org/abs/2003.06461
37. Toreini, E., Aitken, M., Coopamootoo, K., Elliott, K., Zelaya, C.G., van Moorsel, A.: The relationship between trust in AI and trustworthy machine learning technologies. In: FAT* 2020 - Proc. 2020 Conf. Fairness, Accountability, and Transparency, pp. 272–283 (2020). https://doi.org/10.1145/3351095.3372834
38. Van den Bergh, D., van Doorn, J., Marsman, M., Draws, T., van Kesteren, E.J., Derks, K., Dablander, F., Gronau, Q.F., Kucharský, Š., Gupta, A.R.N., Sarafoglou, A., Voelkel, J.G., Stefan, A., Ly, A., Hinne, M., Matzke, D., Wagenmakers, E.J.: A tutorial on conducting and interpreting a Bayesian ANOVA in JASP. Année Psychol. (1), 73–96 (2020). https://doi.org/10.3917/anpsy1.201.0073
39. Varshney, K.R.: Trustworthy machine learning and artificial intelligence. XRDS: Crossroads, ACM Mag. Students (3) (2019). https://doi.org/10.1145/3313109
40. Varshney, K.R.: On Mismatched Detection and Safe, Trustworthy Machine Learning. In: 2020 54th Annu. Conf. Inf. Sci. Syst. (CISS 2020) (2020). https://doi.org/10.1109/CISS48834.2020.1570627767
41. Verbeek, P.P.: Persuasive Technology and Moral Responsibility: Toward an ethical framework for persuasive technologies. Persuasive, 1–15 (2006)
42. Verma, S., Rubin, J.: Fairness definitions explained. In: Proc. Int. Workshop Softw. Fairness (FairWare '18), pp. 1–7. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3194770.3194776
43. Vigdor, N.: Apple card investigated after gender discrimination complaints. New York Times (2019)
44. Wagenmakers, E.J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., Selker, R., Gronau, Q.F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J.N., Morey, R.D.: Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychon. Bull. Rev. (1), 35–57 (2018). https://doi.org/10.3758/s13423-017-1343-3
45. Woodruff, A., Fox, S.E., Rousso-Schindler, S., Warshaw, J.: A qualitative exploration of perceptions of algorithmic fairness. Conf. Hum. Factors Comput. Syst. - Proc., 1–14 (2018). https://doi.org/10.1145/3173574.3174230
46. Yang, Q., Banovic, N., Zimmerman, J.: Mapping machine learning advances from HCI research to reveal starting places for design innovation. Conf. Hum. Factors Comput. Syst. - Proc., 2018-April