Disparate Impact Diminishes Consumer Trust Even for Advantaged Users
Tim Draws, Zoltán Szlávik, Benjamin Timmermans, Nava Tintarev, Kush R. Varshney, Michael Hind
Tim Draws*, Zoltán Szlávik*, Benjamin Timmermans*, Nava Tintarev, Kush R. Varshney, and Michael Hind
IBM Center for Advanced Studies Benelux; Delft University of Technology ([email protected]); myTomorrows ([email protected]); IBM Research ([email protected], [email protected], [email protected]); Maastricht University ([email protected])
* current affiliation
Abstract.
Systems aiming to aid consumers in their decision-making (e.g., by implementing persuasive techniques) are more likely to be effective when consumers trust them. However, recent research has demonstrated that the machine learning algorithms that often underlie such technology can act unfairly towards specific groups (e.g., by making more favorable predictions for men than for women). An undesired disparate impact resulting from this kind of algorithmic unfairness could diminish consumer trust and thereby undermine the purpose of the system. We studied this effect by conducting a between-subjects user study investigating how (gender-related) disparate impact affected consumer trust in an app designed to improve consumers' financial decision-making. Our results show that disparate impact decreased consumers' trust in the system and made them less likely to use it. Moreover, we find that trust was affected to the same degree across consumer groups (i.e., advantaged and disadvantaged users) despite both of these consumer groups recognizing their respective levels of personal benefit. Our findings highlight the importance of fairness in consumer-oriented artificial intelligence systems.
Keywords: disparate impact · algorithmic fairness · consumer trust

Applications that seek to advise or nudge consumers into better decision-making (e.g., concerning personal health or finance) can only be effective when consumers trust their guidance. Trustworthiness is an essential aspect in the design of such persuasive technology (PT), i.e., technology aiming to change attitudes or behaviors without using coercion or deception [26,28,41], because consumers are unlikely to use (or be persuaded by) systems that they do not trust [23,35]. Recent research has identified several factors that affect consumer trust in this context, including consumers' emotional states [1] as well as the system's reliability [26] and transparency [35]. Moreover, it has been argued that trust also depends on moral expectations that consumers have towards the technology they use [26,35]. Consumer trust could increasingly depend on such moral expectations as more systems implement machine learning algorithms (e.g., in personal health [33,35] or finance [20] applications) that make them harder to scrutinize.

A specific moral expectation that acts as a requirement for trust in this context may be fairness [39]. When nudges and advice are tailored to the individual consumer using machine learning, consumers may expect that the system acts fairly towards different consumer groups (e.g., concerning race or gender). Nudging or advising such that the degree of positive impact that the system has on consumers' lives varies with group membership could constitute an undesired disparate impact. For example, a robo-advisor (i.e., PT designed to improve consumers' financial situation [20]) could have a disparate impact by systematically recommending "safer", lower-risk investments to female consumers compared to male consumers, yielding them lower returns. Such disparate impact would violate the moral expectation of fairness and thereby undermine consumer trust.

Employing machine learning in consumer-oriented applications often holds the promise of increasing their usefulness to the individual consumer [33,46] but also bears a greater vulnerability for disparate impact. Recent research has demonstrated that machine learning algorithms may unfairly discriminate based on group membership [2,6,27]. Such discrimination is referred to as algorithmic unfairness if a pre-defined notion of fairness is violated [27,42] and can easily lead to an undesired disparate impact [6,13]. For example, outcomes in advice from robo-advisors may differ between groups, given that financial advice has historically been gender-biased to the disadvantage of female consumers [5,24] and algorithmic unfairness often results from disparities in the historical data that is used to train the algorithm [27]. Although several methods have been developed to mitigate algorithmic unfairness [7], in many cases it is currently not possible to do so to a satisfactory degree [8].

Disparate impact is thus a realistic issue that could undermine the efficacy of consumer-oriented artificial intelligence (AI) systems. It has been argued that fairness plays a key role in fostering trust in AI [4,34,37,39,40]. However, to the best of our knowledge, no previous work has studied the influence of undesired disparate impact (i.e., as a result of algorithmic unfairness) on consumer trust. It is further unclear whether unfairly advantaged consumers are affected to the same degree as disadvantaged consumers in this context.
That is, the influence of disparate impact on consumer trust may depend on perceived personal benefit (i.e., advantaged users trusting the system's advice despite disparate impact as long as they personally benefit) or not (i.e., advantaged users losing trust in lockstep with disadvantaged users despite a perceived personal benefit). We study the effect of disparate impact on consumer trust in the use case of gender bias in robo-advisors by investigating the following research questions:

– RQ1.
Does an apparent disparate impact of a robo-advisor affect the degree of trust that consumers place in it?

– RQ2.
Does disparate impact affect the trust of unfairly advantaged consumers to a different degree than that of unfairly disadvantaged consumers?

To answer these questions, we conducted a between-subjects user study where we exposed participants to varying degrees of disparate impact of a robo-advisor (i.e., advantaging male users; see Section 3). Our results show that disparate impact negatively affected consumers' trust in the robo-advisor and decreased their willingness to use it (see Section 4). Furthermore, we find that, despite both groups recognizing their respective personal (dis)advantage, both the disadvantaged group (women) and the advantaged group (men) experienced the same decrease in trust when they learned about a disparate impact of the robo-advisor. Our findings underline the importance of ensuring algorithmic fairness in consumer-oriented (AI) systems when aiming to maintain consumer trust.
We study the effect of disparate impact on consumer trust in the use case of gender bias in robo-advisors. Our reasons for choosing the financial domain here are threefold. First, algorithmic decision-making is already widespread in consumer-oriented financial applications (e.g., in robo-advisors) [20]. Second, algorithmic decision-making in such systems is highly impactful: it directly affects consumers' financial situations and thereby their quality of life. Third, (human) financial advice has traditionally been gender-biased, underestimating and disadvantaging female consumers [5,24]. Historical data on financial advice thus contain these biases. If the algorithms that underlie robo-advisors are trained on these data, robo-advisors may exhibit a corresponding disparate impact.
Trust in AI systems.
Consumers do not use systems that they do not trust [23]. That is why trust is an important aspect of the interaction between consumer-oriented AI systems (e.g., those implementing PT) and consumers [1,26,35,41]. Recent research has linked trust in such systems to the reliability [26] and transparency [35] of the system at hand as well as to consumers' emotional states [1] and moral expectations [26,35]. Such moral expectations may gain in importance as systems increasingly rely on machine learning algorithms [20,30,33,35,46]. Moreover, whereas in some cases consumers fall prey to automation bias (i.e., a tendency to prefer automated over human decisions) [10], in other cases they experience what has been referred to as algorithm aversion: a tendency to prefer human over algorithmic advice [11,29,32]. Research has shown that algorithm aversion can be the result of witnessing how an algorithm errs [12]. Especially in cases where a machine learning algorithm acted unfairly, leading to an undesired disparate impact (i.e., violating consumers' moral expectations and reflecting erroneous decision-making), consumer trust could thus be diminished.
Measuring and mitigating algorithmic unfairness.
Research has demonstrated that machine learning algorithms can make biased (unfair) predictions to the disadvantage of specific groups [2,6,27,43]. For instance, AI systems may discriminate between white and black defendants in predicting their likelihood of re-offending [2] and between male and female consumers in predicting their creditworthiness [43]. Several methods have been proposed to measure and mitigate biases in algorithmic decision-making [7,15,21,22,27,47]. Despite these efforts, the measurement and mitigation of algorithmic bias remain challenging [8,27].
Disparate impact and trust.
Algorithmic fairness has been identified as a core building block of trustworthy AI systems [4,34,37,39,40], yet few studies directly investigate the relationship between algorithmic fairness (or disparate impact) and consumer trust. Participants in one study reported that learning about algorithmic unfairness induced negative feelings and that it might cause them to lose trust in a company or product [45]. Consumers have further expressed general concerns about disparate impact of AI on a societal level [3] and are more likely to judge decisions as less fair and trustworthy if they are made by an algorithm as opposed to a human [19]. However, it has also been shown that the degree to which people are concerned about disparate impact depends on their personal biases [31,36]. What remains unclear is to what extent disparate impact (as a result of algorithmic unfairness) affects consumer trust and, if so, who (i.e., unfairly advantaged or disadvantaged consumers) is affected in particular.
To investigate the two research questions identified in Section 1, we conducted a between-subjects user study. The setting of this study was a fictional scenario in which a bank offers a robo-advisor – called the AI Advisor – to its customers. We aimed to perform a granular analysis of the effect of disparate impact on consumer trust by exposing participants to different degrees of disparate impact supposedly caused by the AI Advisor and measuring their attitudes towards this system. Specifically, we analyzed whether the different degrees of disparate impact affected participants' trust (i.e., whether they believed that the AI Advisor would make correct predictions and therefore benefit its users). To differentiate between this general notion of trust and related attitudes, we also measured willingness to use and perceived personal benefit concerning the AI Advisor.

Dependent Variables.
Our experiment involved measuring participants' attitudes towards the AI Advisor; specifically trust, willingness to use, and perceived personal benefit. Each variable was measured twice: once after participants saw general user statistics (Step 1; see Section 3.3) and once after participants saw gender-specific user statistics on the AI Advisor (Step 2). We computed difference scores from these two measurements that reflected how seeing the gender-specific statistics affected participants' attitudes as compared to their baseline attitudes.

– Change in Trust (Continuous).
Participants rated their trust by responding to the item "In general, the AI Advisor can be trusted to make correct recommendations" on a 7-point Likert scale. We coded all responses on an ordinal scale ranging from −3 to 3 and computed the change in trust from the difference between the two measurements. Values could thus range from −6 to 6.

– Change in Willingness to Use (Categorical).
Participants could respond to the item "I would personally use the AI Advisor" with either "yes" or "no". We recorded whether their answer had changed (i.e., "yes" to "no" or vice versa) or stayed the same in the second measurement. This variable thus encompassed three categories.

– Change in Perceived Personal Benefit (Continuous).
Participants rated their perceived personal benefit by responding to the item "I would personally benefit from using the AI Advisor" on a 7-point Likert scale. To compute the change in perceived personal benefit, we again subtracted the second measurement from the first. Values could thus range from −6 to 6.

Independent Variable.
Our experiment varied depending on the condition that a participant was placed in (see Section 3.3):

– Condition.
During the experiment, we showed participants a table with user statistics of bank customers that use the AI Advisor. These statistics, supposedly showing the average change in bank account balance for users and non-users of the AI Advisor, split by gender, differed depending on the condition a participant had been placed in. Each participant saw only one of four conditions: the control condition (in which the statistics were balanced across genders, reflecting an absence of disparate impact) or one of three experimental conditions – which we call little bias, strong bias, and extreme bias – that reflected varying degrees of disparate impact in favor of male consumers. Specifically, these different degrees of disparate impact represented scenarios in which female users of the AI Advisor were disadvantaged but still benefited from using the AI Advisor (little bias), did not benefit from the AI Advisor (strong bias), or would in fact benefit from not using the AI Advisor (extreme bias). Table 1 shows the numbers that were shown in the second statistics table in each of the conditions; a small worked example of how these numbers encode different degrees of disparate impact follows below.
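The following sketch is our own illustration (not part of the study materials): it summarizes the gender gap encoded by each condition in Table 1, using the female-to-male ratio of the average yearly balance gain of AI Advisor users as one possible summary measure. All variable names and the choice of ratio are hypothetical additions for illustration only.

```python
# Hypothetical illustration (not study material): summarize the gender gap that
# each condition in Table 1 encodes, using the female-to-male ratio of the
# average yearly balance gain of AI Advisor users as one possible measure.
conditions = {
    "control":      {"male": 0.20, "female": 0.20},  # balanced, no disparate impact
    "little_bias":  {"male": 0.25, "female": 0.15},
    "strong_bias":  {"male": 0.30, "female": 0.10},
    "extreme_bias": {"male": 0.35, "female": 0.05},
}
BASELINE_GAIN = 0.10  # yearly gain of customers who do not use the AI Advisor

for name, gain in conditions.items():
    ratio = gain["female"] / gain["male"]             # 1.0 means parity
    female_benefit = gain["female"] - BASELINE_GAIN   # benefit of using the advisor
    print(f"{name:>12}: female/male ratio = {ratio:.2f}, "
          f"female benefit over non-users = {female_benefit:+.0%}")
```

Under this (assumed) measure, the control condition yields parity, while female users still benefit in the little bias condition (+5%), break even in the strong bias condition (0%), and are worse off than non-users in the extreme bias condition (−5%), mirroring the scenarios described above.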
Individual Differences and Descriptive Statistics.
We took two additional measurements to enable more fine-grained analyses and to describe our sample:

– Gender.
Participants could state which gender they identified with by picking from the options "male", "female", and "other / not specified".

– Age.
Participants could write their age in an open text field.
Table 1. Fictional gender-specific statistics shown to participants during the second step of the study. Only the top left two cells (concerning users of the AI Advisor) differed across conditions, reflecting varying degrees of disparate impact.

Control condition
         Using AI Advisor   Not using AI Advisor
Male     20%                10%
Female   20%                10%
All      20%                10%

Little bias condition
         Using AI Advisor   Not using AI Advisor
Male     25%                10%
Female   15%                10%
All      20%                10%

Strong bias condition
         Using AI Advisor   Not using AI Advisor
Male     30%                10%
Female   10%                10%
All      20%                10%

Extreme bias condition
         Using AI Advisor   Not using AI Advisor
Male     35%                10%
Female   5%                 10%
All      20%                10%

Based on the research questions RQ1 and RQ2 introduced in Section 1, the related work from Section 2, and the experimental setup described in this section, we formulated several hypotheses. We expected that disparate impact would decrease consumer trust (H1a) and that consumers would be less likely to use the AI Advisor (H1b) if it has disparate impact (i.e., the stronger the disparate impact, the lower consumer trust and willingness to use the AI Advisor). We predicted that disparate impact would affect the perceived personal benefit of male consumers differently compared to female consumers (i.e., following what the displayed statistics suggest; H2a). Accordingly, we further expected that the decrease in trust described in H1a would be moderated by gender (H2b). That is, we predicted that the trust of advantaged consumers (i.e., men) would be affected differently compared to disadvantaged consumers (i.e., women).

– H1a.
Consumers who are exposed to statistics that reveal a disparate impact of a robo-advisor in favor of male users will trust this system less to give correct recommendations compared to consumers who are exposed to balanced statistics.

– H1b.
Consumers who are exposed to statistics that reveal a disparate impact of a robo-advisor in favor of male users will be less likely to use this system compared to consumers who are exposed to balanced statistics.

– H2a.
The effect of statistics suggesting a disparate impact of a robo-advisor in favor of men on perceived personal benefit is moderated by gender.

– H2b.
The effect of statistics suggesting a disparate impact of a robo-advisor in favor of men on consumer trust is moderated by gender.
We set up our user study by creating a task on the online study platform Figure Eight (since conducting this study in June 2019, Figure Eight has been renamed to Appen; more information can be found at https://appen.com). Before commencing with the experiment, participants were shown a short introduction and asked to state their gender and age. The experiment consisted of two steps. Whereas Step 1 was the same for all participants, Step 2 differed depending on which one of four conditions a participant had been assigned to.
Step 1.
We introduced participants to a fictional scenario in which they could activate a robo-advisor – called the AI Advisor – in their banking app:

"Imagine your bank offers a digital assistant called the 'AI advisor'. If you activate the AI advisor in your banking app, it will monitor your financial situation and give you relevant recommendations that may improve your financial situation. For example, it may suggest saving strategies or recommend investments."
Additionally, to promote the idea that the AI Advisor is generally reliable, participants were given an idea of whether people benefit from using the AI Advisor:

"Overall statistics suggest that people benefit from using the AI advisor. The bank account balance of bank customers who use the AI advisor increases by an average of 20% every year, whereas the balance of customers who don't use the AI advisor increases by an average of only 10% per year."

Below was a table displaying the mentioned statistics. We then measured trust, willingness to use, and perceived personal benefit concerning the AI Advisor.

Step 2.
Participants were led to a new page for the second step of the experiment. Here we added some additional information on the AI Advisor:

"Next to general statistics on all bank customers, we can also look at how the AI advisor performs for subgroups of bank customers. Below you can see the change in bank account balance for men and women in particular."
Below this text was a table similar to the table in Step 1, but with two added rows that showed the average change in bank account balance per year for men and women in particular (see Table 1). Whereas the statistics for all bank customers overall, as well as for men and women not using the AI Advisor, were the same in all conditions, the statistics for men and women using the AI Advisor varied depending on the condition a participant had been assigned to (see Section 3.1). Table 1 shows the displayed statistics for male and female users per condition. We then again measured trust, willingness to use, and perceived personal benefit.
Testing H1a and H2b.
To test whether there is an effect of disparate impact on consumer trust (H1a) that is moderated by gender (H2b), we conducted a classical ANOVA with condition and gender as between-subjects factors and change in trust as the dependent variable. A significant main effect of condition on change in trust in this analysis would suggest that change in trust differed between conditions (H1a). In this case, we would perform posthoc analyses to investigate the differences between the conditions in more detail. A significant interaction effect between condition and gender would suggest that the conditions had a different effect for the disadvantaged group (i.e., female participants) compared to the advantaged group (i.e., male participants; H2b).
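As an illustration only (this is not the authors' analysis code), such a two-way between-subjects ANOVA could be run in Python with statsmodels, assuming a hypothetical long-format file responses.csv with one row per participant and the hypothetical columns change_in_trust, condition, and gender:

```python
# A minimal sketch of the 4 (condition) x 2 (gender) between-subjects ANOVA on
# change in trust; the file name and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("responses.csv")  # one row per participant
model = smf.ols("change_in_trust ~ C(condition) * C(gender)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects of condition and gender + their interaction
```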
We further conducted a Bayesian ANOVA according to the protocol proposed by van den Bergh et al. [38]. Bayesian hypothesis tests involve the computation of the Bayes factor, a quantitative comparison of the predictive power of two competing statistical models [44]. The Bayes factor weighs the evidence provided by the data and thus allows for direct model comparison. Practically, comparing different models (i.e., including or excluding an interaction effect of condition and gender) this way allowed for a richer interpretation of our results. We performed the Bayesian ANOVA using the software JASP [16] with default settings. We computed Bayes Factors (BFs) by comparing the models of interest to a null model (consisting of only an intercept) and interpreted them according to the guidelines proposed by Lee and Wagenmakers [18], who adopted them from Jeffreys [17].
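For readers without JASP, a rough sense of such a Bayes factor can be obtained from the standard BIC-based approximation BF10 ≈ exp((BIC_null − BIC_alternative) / 2). The sketch below is our own illustration under that assumption – it is not the JASP analysis and is not equivalent to its default priors – and reuses the hypothetical responses.csv layout from the previous sketch:

```python
# Hypothetical sketch: BIC-based approximation of BF10 for the condition-only
# model against an intercept-only null model (not equivalent to JASP's priors).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")  # hypothetical file name
null_fit = smf.ols("change_in_trust ~ 1", data=df).fit()
cond_fit = smf.ols("change_in_trust ~ C(condition)", data=df).fit()
bf10_approx = np.exp((null_fit.bic - cond_fit.bic) / 2)  # > 1 favors the condition model
print(f"Approximate BF10 (condition vs. null): {bf10_approx:.2f}")
```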
Testing H1b. We tested whether disparate impact affected participants' willingness to use the AI Advisor by conducting a chi-squared test between condition and change in willingness to use. A significant result in this analysis would suggest that the number of participants who changed their willingness to use the AI Advisor differed across conditions.
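A minimal sketch of this chi-squared test (again our own illustration, not the authors' code), assuming a hypothetical column change_in_willingness holding the categories shown in Table 3:

```python
# Minimal sketch of the chi-squared test between condition and the change in
# willingness to use; file and column names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("responses.csv")
table = pd.crosstab(df["condition"], df["change_in_willingness"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```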
Testing H2a.
We conducted another ANOVA with condition and gender as between-subjects factors and change in perceived personal benefit as the dependent variable to test whether gender acted as a moderator here. A significant interaction effect in this analysis would indicate that this was the case.
Significance Threshold and Correction for Multiple Testing.
In all classical analyses we conducted, we aimed for a type 1 error probability of no more than 0.05. However, by conducting our planned analyses we automatically tested a total of seven hypotheses: three in each ANOVA (i.e., two main effects and one interaction) and one in the chi-squared test. This meant that the probability of committing a type 1 error rose considerably [9]. Therefore, we adjusted our significance threshold by applying a Bonferroni correction, where the desired type 1 error rate is divided by the number of hypotheses that are tested [25]. In our main analyses we thus used a significance threshold of 0.05 / 7 ≈ 0.007 and only considered a result significant if its p-value fell below this adjusted threshold. The same procedure was applied for posthoc analyses comparing each of the four conditions with each other, as this meant conducting (4 choose 2) = 6 hypothesis tests (i.e., adjusting the threshold to 0.05 / 6 ≈ 0.008).
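The adjusted thresholds can be reproduced with a few lines of Python (a small illustration, not taken from the paper):

```python
# Reproduce the Bonferroni-adjusted significance thresholds described above.
from math import comb

alpha = 0.05
main_tests = 7              # 3 effects per ANOVA x 2 ANOVAs + 1 chi-squared test
posthoc_tests = comb(4, 2)  # pairwise comparisons between 4 conditions = 6

print(f"Main analyses threshold:    {alpha / main_tests:.4f}")     # ~0.0071
print(f"Posthoc analyses threshold: {alpha / posthoc_tests:.4f}")  # ~0.0083
```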
We recruited 567 participants via the Figure Eight pool of contributors (554) and direct contacts (13). Seventy-three participants were excluded from the study because they either filled at least one of the obligatory text fields with less than 10 characters, took less than 60 seconds to complete the task, or took more than 10 minutes to complete the task. Furthermore, we did not analyze the data of five participants who stated "other / not specified" as their gender because our study involved a disparate impact between male and female consumers.
After exclusion, 489 participants remained. Of those, 238 (49%) were male and 251 (51%) were female, with a mean age of 41 years. Participants were distributed across the control, little bias, strong bias, and extreme bias conditions.

H1a: Disparate Impact Decreased Consumer Trust.
As hypothesized, change in trust differed across conditions (F = 6.906, p < .001; BF_10 = 70.02, see Table 2). To test for differences between the individual conditions, we conducted posthoc analyses (i.e., Mann-Whitney U tests). Only the difference between the control and extreme bias conditions was significant (W = 9368, p below the Bonferroni-adjusted threshold of .0083).
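For illustration only (not the authors' code), one such pairwise posthoc comparison could be computed as follows, reusing the hypothetical responses.csv layout from the sketches above; the resulting p-value would then be compared against the adjusted threshold of roughly .0083:

```python
# Minimal sketch of the posthoc Mann-Whitney U test comparing the control and
# extreme bias conditions on change in trust; all names are hypothetical.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("responses.csv")
control = df.loc[df["condition"] == "control", "change_in_trust"]
extreme = df.loc[df["condition"] == "extreme_bias", "change_in_trust"]
stat, p = mannwhitneyu(control, extreme, alternative="two-sided")
print(f"W = {stat:.0f}, p = {p:.4f}")
```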
Table 2. Bayesian ANOVA with change in trust as the dependent variable.

Models                                    P(M)    P(M | Data)   BF_M     BF_10     Error %
Null model                                0.200   8.209e-4      0.003    1.000
condition                                 0.200   0.057         0.244    70.024    0.001
gender                                    0.200   0.004         0.018    5.413     1.533e-6
condition + gender                        0.200   0.735         11.116   895.772   1.638
condition + gender + condition * gender   0.200   0.202         1.012    245.894   1.978
Fig. 1. Change in trust across conditions for all participants (left-hand panel) and split by gender (right-hand panel). The error bars represent 95% confidence intervals.
H1b: Disparate Impact Decreased Willingness to Use.
In accordance with disparate impact negatively affecting trust (H1a), it decreased participants' willingness to use the AI Advisor (see Table 3). The increasing proportion of participants who changed their attitude from "yes" to "no" as conditions reflected stronger disparate impact was statistically significant (χ² = 25, p < .001).
Table 3. Change in willingness to use across conditions. The labels −, =, and + reflect changes from "yes" to "no", no change, and "no" to "yes", respectively.

H2a: Gender Moderated the Effect of Disparate Impact on Perceived Personal Benefit.
As expected, the results from the second ANOVA show a significant interaction effect of condition and gender on change in perceived personal benefit (F = 8.525, p < .001).

H2b: Male Consumers Experienced the Same Decrease in Trust as Female Consumers.
In contrast to what we hypothesized, we do not find a significant interaction effect of condition and gender on change in trust: male participants experienced a similar change in trust compared to that of female participants. The Bayesian ANOVA confirms this result: the model containing just the two main effects of condition and gender explains the data best (BF_10 = 895.77; see Table 2), roughly four times better than the model that includes the interaction effect (BF_10 = 245.89). This suggests that unfairly advantaged and disadvantaged participants (i.e., men and women, respectively) experienced the same decrease in trust due to algorithmic unfairness despite diverging levels of perceived personal benefit (H2a).
Fig. 2. Change in perceived personal benefit across conditions and split by gender. The error bars represent 95% confidence intervals.
In this paper, we presented a between-subjects user study that aimed to investigate the influence of algorithmically-driven disparate impact on consumer trust in the use case of gender bias in robo-advisors. Our results suggest that disparate impact – at least when it is extreme – decreases trust and makes consumers less likely to use such systems. We further find that, although disadvantaged and advantaged users recognize their respective levels of personal benefit in scenarios of disparate impact, both experience equally decreasing levels of trust when they learn about a disparate impact caused by the system at hand. Our work contributes to a growing body of literature that highlights the importance of ensuring fairness and avoiding disparate impact of consumer-oriented AI systems.
Our findings have implications for consumers as well as industry. Consumers should be aware that machine-learning-based applications can be biased. If disparate impact is an important factor for consumer trust, consumers need to think critically when using such systems. One potential way forward for consumers would be to demand that companies publish independently conducted research into the (algorithmic) fairness and impact of their products.

Publishers of consumer-oriented AI systems need to establish algorithmic fairness in their products and avoid disparate impact to serve consumers effectively. Our findings show that failing to do so may lead to a decrease in consumers' trust and willingness to use such systems.
Our study is subject to at least five important limitations. First, we studied the effect of disparate impact on consumer trust in a specific use case: a binary gender bias in robo-advisors. This makes our results difficult to generalize because many other forms of bias (including those based on race, religion, or sexual orientation) as well as other AI systems (e.g., for recommendations of medical treatment, tourist attractions, or movies) exist. It is easy to imagine how consumer trust could be affected differently when, for example, disparate impact concerns small minorities, multitudes of gender identities (or another consumer characteristic), a chosen group membership such as consumers' profession, or a system that is less impactful on consumers' personal lives than a robo-advisor. On a related note, we here positioned women as the disadvantaged and men as the advantaged group (i.e., the setting that corresponds to biases in human financial advice), but it is not certain that we would obtain the same results if the (dis)advantage were distributed the other way around. Future work could explore these different scenarios to help generalize and better understand the effect of disparate impact on consumer trust.

Second, our finding that advantaged and disadvantaged users experienced the same decrease in trust appears to run counter to previous research suggesting that people make stronger fairness judgments when they are personally affected [14]. However, it is not clear from our results to what degree advantaged users (i.e., men) felt personally affected; e.g., because they have women in their lives who they deeply care about. The role of personal relevance in the effect of disparate impact on consumer trust thus remains to be clarified by future research.

Third, our results show a decreasing trend in consumer trust as conditions become more extreme, but show a statistically significant difference only between the control and extreme bias conditions. Future work could examine these differences (also across domains) in more detail to establish the relationship between the level of disparate impact and consumer trust (e.g., to determine what lies within and beyond an "acceptable margin" of disparate impact).

Fourth, we studied fairness related to group membership (i.e., gender), which might elicit a different (moral) evaluation than fairness on the individual level. Our results show that trust can decrease despite perceived personal benefit. However, this effect might have been caused by a sense of loyalty towards the disadvantaged group. An interesting direction for future work is to study whether similar patterns emerge when disparate impact concerns individuals; e.g., when advantaged and disadvantaged subjects are randomly chosen.
We presented a user study investigating the effect of algorithmically-driven disparate impact (i.e., when algorithm outcomes adversely affect one group of consumers compared to another) on consumer trust. Specifically, we studied the effect of gender bias in an application that aimed to persuade consumers to make better financial decisions. We found that disparate impact decreased participants' trust and willingness to use the application. Furthermore, our results show that the trust of unfairly advantaged participants was just as affected as that of disadvantaged participants. These findings imply that disparate impact (i.e., as a result of algorithmic unfairness) can undermine trust in consumer-oriented AI systems and should therefore be avoided or mitigated when aiming to create trustworthy technology.
Acknowledgements
This research has been supported by the Think Forward Initiative (a partnership between ING Bank, Deloitte, Dell Technologies, Amazon Web Services, IBM, and the Center for Economic Policy Research – CEPR). The views and opinions expressed in this paper are solely those of the authors and do not necessarily reflect the official policy or position of the Think Forward Initiative or any of its partners.
References
1. Ahmad, W.N.W., Ali, N.M.: A Study on Persuasive Technologies: The Relationship between User Emotions, Trust and Persuasion. Int. J. Interact. Multimed. Artif. Intell. (1), 57–61 (2018). https://doi.org/10.9781/ijimai.2018.02.010
2. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: There's software used across the country to predict future criminals and it's biased against blacks. ProPublica (2019)
3. Araujo, T., Helberger, N., Kruikemeier, S., de Vreese, C.H.: In AI we trust? Perceptions about automated decision-making by artificial intelligence. AI Soc. (3), 611–623 (2020). https://doi.org/10.1007/s00146-019-00931-w
4. Arnold, M., Piorkowski, D., Reimer, D., Richards, J., Tsay, J., Varshney, K.R., Bellamy, R.K., Hind, M., Houde, S., Mehta, S., Mojsilovic, A., Nair, R., Ramamurthy, K.N., Olteanu, A.: FactSheets: Increasing trust in AI services through supplier's declarations of conformity. IBM J. Res. Dev. (4-5) (2019). https://doi.org/10.1147/JRD.2019.2942288
5. Baeckström, Y., Silvester, J., Pownall, R.A.: Millionaire investors: financial advisors, attribution theory and gender differences. Eur. J. Financ. (15), 1333–1349 (2018). https://doi.org/10.1080/1351847X.2018.1438301
6. Barocas, S., Selbst, A.D.: Big data's disparate impact. Calif. Law Rev. (671), 671–732 (2016)
7. Bellamy, R.K., Mojsilovic, A., Nagar, S., Ramamurthy, K.N., Richards, J., Saha, D., Sattigeri, P., Singh, M., Varshney, K.R., Zhang, Y., Dey, K., Hind, M., Hoffman, S.C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S.: AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Dev. (4-5) (2019). https://doi.org/10.1147/JRD.2019.2942287
8. Corbett-Davies, S., Goel, S.: The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv preprint arXiv:1808.00023 (2018), http://arxiv.org/abs/1808.00023
9. Cramer, A.O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R.P., Waldorp, L.J., Wagenmakers, E.J.: Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychon. Bull. Rev. (2), 640–647 (2016). https://doi.org/10.3758/s13423-015-0913-5
10. Cummings, M.L.: Automation bias in intelligent time critical decision support systems. Collect. Tech. Pap. - AIAA 1st Intell. Syst. Tech. Conf., 557–562 (2004). https://doi.org/10.2514/6.2004-6313
11. Diab, D.L., Pui, S.Y., Yankelevich, M., Highhouse, S.: Lay perceptions of selection decision aids in US and non-US samples. Int. J. Sel. Assess. (2), 209–216 (2011). https://doi.org/10.1111/j.1468-2389.2011.00548.x
12. Dietvorst, B.J., Simmons, J.P., Massey, C.: Algorithm aversion: People erroneously avoid algorithms after seeing them err. J. Exp. Psychol. Gen. (1), 114–126 (2015). https://doi.org/10.1037/xge0000033
13. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and Removing Disparate Impact. In: Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., pp. 259–268 (2015)
14. Ham, J., van den Bos, K.: Not fair for me! The influence of personal relevance on social justice inferences. J. Exp. Soc. Psychol. (3), 699–705 (2008). https://doi.org/10.1016/j.jesp.2007.04.009
15. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Adv. Neural Inf. Process. Syst., pp. 3323–3331 (2016)
16. JASP Team: JASP (Version 0.14) (2020)
17. Jeffreys, H.: Theory of Probability. Oxford University Press, Oxford (1939)
18. Lee, M.D., Wagenmakers, E.J.: Bayesian cognitive modeling: A practical course. Cambridge University Press (2014). https://doi.org/10.1017/CBO9781139087759
19. Lee, M.K.: Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management. Big Data Soc. (1), 1–16 (2018). https://doi.org/10.1177/2053951718756684
20. Lieber, R.: Financial Advice for People Who Aren't Rich (Apr 2014)
21. Mary, J.J., Calauzènes, C., Karoui, N.E.: Fairness-Aware Learning for Continuous Attributes and Treatments. In: ICML, pp. 4382–4391 (2019), http://proceedings.mlr.press/v97/mary19a.html
22. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A Survey on Bias and Fairness in Machine Learning. arXiv preprint arXiv:1908.09635 (2019), http://arxiv.org/abs/1908.09635
23. Muir, B.M., Moray, N.: Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation. Ergonomics (3), 429–460 (1996)
24. Mullainathan, S., Noeth, M., Schoar, A.: The Market for Financial Advice: An Audit Study. SSRN Electron. J. (2012). https://doi.org/10.2139/ssrn.1572334
25. Napierala, M.A.: What Is the Bonferroni correction? (2012)
26. Nickel, P., Spahn, A.: Trust, Discourse Ethics, and Persuasive Technology. In: Persuasive Technology: Design for Health and Safety; 7th Int. Conf. Persuasive Technology 2012, pp. 37–40. Linköping University Electronic Press (2012)
27. Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M.E., Ruggieri, S., Turini, F., Papadopoulos, S., Krasanakis, E., Kompatsiaris, I., Kinder-Kurlanda, K., Wagner, C., Karimi, F., Fernandez, M., Alani, H., Berendt, B., Kruegel, T., Heinze, C., Broelemann, K., Kasneci, G., Tiropanis, T., Staab, S.: Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. (3), 1–14 (2020). https://doi.org/10.1002/widm.1356
28. Oinas-Kukkonen, H., Harjumaa, M.: Towards deeper understanding of persuasion in software and information systems. In: Proc. 1st Int. Conf. Adv. Comput. Interact. (ACHI 2008) (2008). https://doi.org/10.1109/ACHI.2008.31
29. Önkal, D., Goodwin, P., Thomson, M., Gönül, S., Pollock, A.: The relative influence of advice from human experts and statistical methods on forecast adjustments. J. Behav. Decis. Mak. (4), 390–409 (2009). https://doi.org/10.1002/bdm.637
30. Orji, R., Moffatt, K.: Persuasive technology for health and wellness: State-of-the-art and emerging trends. Health Informatics J. (1), 66–91 (2018). https://doi.org/10.1177/1460458216650979
31. Otterbacher, J., Checco, A., Demartini, G., Clough, P.: Investigating user perception of gender bias in image search: The role of sexism. In: 41st Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval (SIGIR 2018), pp. 933–936 (2018). https://doi.org/10.1145/3209978.3210094
32. Promberger, M., Baron, J.: Do patients trust computers? J. Behav. Decis. Mak. (5), 455–468 (2006). https://doi.org/10.1002/bdm.542
33. Purpura, S., Schwanda, V., Williams, K., Stubler, W., Sengers, P.: Fit4Life: The Design of a Persuasive Technology Promoting Healthy Behavior and Ideal Weight. In: Proc. SIGCHI Conf. Hum. Factors Comput. Syst., pp. 423–432 (2011)
34. Rossi, F.: Building trust in artificial intelligence. J. Int. Aff. (1), 127–133 (2019)
35. Sattarov, F., Nagel, S.: Building trust in persuasive gerontechnology: User-centric and institution-centric approaches. Gerontechnology (1), 1–14 (2019). https://doi.org/10.4017/gt.2019.18.1.001.00
36. Smith, J., Sonboli, N., Fiesler, C., Burke, R.: Exploring User Opinions of Fairness in Recommender Systems. In: CHI'20 Workshop on Human-Centered Approaches to Fair and Responsible AI (2020), http://arxiv.org/abs/2003.06461
37. Toreini, E., Aitken, M., Coopamootoo, K., Elliott, K., Zelaya, C.G., van Moorsel, A.: The relationship between trust in AI and trustworthy machine learning technologies. In: FAT* 2020 - Proc. 2020 Conf. Fairness, Accountability, and Transparency, pp. 272–283 (2020). https://doi.org/10.1145/3351095.3372834
38. Van den Bergh, D., van Doorn, J., Marsman, M., Draws, T., van Kesteren, E.J., Derks, K., Dablander, F., Gronau, Q.F., Kucharský, Š., Gupta, A.R.N., Sarafoglou, A., Voelkel, J.G., Stefan, A., Ly, A., Hinne, M., Matzke, D., Wagenmakers, E.J.: A tutorial on conducting and interpreting a Bayesian ANOVA in JASP. Année Psychol. (1), 73–96 (2020). https://doi.org/10.3917/anpsy1.201.0073
39. Varshney, K.R.: Trustworthy machine learning and artificial intelligence. XRDS: Crossroads, ACM Mag. Students (3) (2019). https://doi.org/10.1145/3313109
40. Varshney, K.R.: On Mismatched Detection and Safe, Trustworthy Machine Learning. In: 2020 54th Annu. Conf. Inf. Sci. Syst. (CISS 2020) (2020). https://doi.org/10.1109/CISS48834.2020.1570627767
41. Verbeek, P.P.: Persuasive Technology and Moral Responsibility: Toward an ethical framework for persuasive technologies. Persuasive, 1–15 (2006)
42. Verma, S., Rubin, J.: Fairness definitions explained. In: Proc. Int. Workshop Softw. Fairness (FairWare '18), pp. 1–7. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3194770.3194776
43. Vigdor, N.: Apple card investigated after gender discrimination complaints. New York Times (2019)
44. Wagenmakers, E.J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., Selker, R., Gronau, Q.F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J.N., Morey, R.D.: Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychon. Bull. Rev. (1), 35–57 (2018). https://doi.org/10.3758/s13423-017-1343-3
45. Woodruff, A., Fox, S.E., Rousso-Schindler, S., Warshaw, J.: A qualitative exploration of perceptions of algorithmic fairness. Conf. Hum. Factors Comput. Syst. - Proc., 1–14 (2018). https://doi.org/10.1145/3173574.3174230
46. Yang, Q., Banovic, N., Zimmerman, J.: Mapping machine learning advances from HCI research to reveal starting places for design innovation. Conf. Hum. Factors Comput. Syst. - Proc., 2018-April