Soliciting Stakeholders' Fairness Notions in Child Maltreatment Predictive Systems
Hao-Fei Cheng, Logan Stapleton, Ruiqi Wang, Paige Bullock, Alexandra Chouldechova, Zhiwei Steven Wu, Haiyi Zhu
Hao-Fei Cheng, University of Minnesota
Logan Stapleton, University of Minnesota
Ruiqi Wang, Carnegie Mellon University
Paige Bullock, Kenyon College
Alexandra Chouldechova, Carnegie Mellon University
Zhiwei Steven Wu, Carnegie Mellon University
Haiyi Zhu, Carnegie Mellon University
ABSTRACT
Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeholders' subjective fairness notions. Combining a user interface that allows stakeholders to examine the data and the algorithm's predictions with an interview protocol to probe stakeholders' thoughts while they are interacting with the interface, we can identify stakeholders' fairness beliefs and principles. We conduct a user study to evaluate our framework in the setting of a child maltreatment predictive system. Our evaluations show that the framework allows stakeholders to comprehensively convey their fairness viewpoints. We also discuss how our results can inform the design of predictive systems.
KEYWORDS
human-centered AI; machine learning; algorithmic fairness; algorithm-assisted decision-making; child welfare
Machine learning (ML) algorithms are increasingly being used to support human decision-making in high-stakes contexts such as online information curation, resume screening, mortgage lending, police surveillance, public resource allocation, and pretrial detention. However, concerns have been raised that algorithmic systems might inherit human biases from historical data, and thereby perpetuate discrimination against already vulnerable subgroups. These concerns have given rise to a rapidly growing research area of fair machine learning. Recent work in this area has produced dozens of quantitative notions of algorithmic fairness [2, 17, 24, 30, 57, 71], and provided methods for enforcing these notions [1, 2, 24, 41, 42, 80].

Existing research on fair machine learning has primarily focused on fairness at the level of pre-defined groups. This group fairness approach first fixes a small collection of groups defined by protected attributes (e.g., race or gender) and then asks for approximate equality of some statistic of the predictor, such as positive classification rate or false positive rate, across these groups (see, e.g., [1, 30, 46]). While notions of group fairness are easy to operationalize, they are aggregate in nature and make no promises of fairness to finer subgroups or individuals [24, 31, 42]. In contrast, the individual fairness approach aims to address this limitation by asking for explicit fairness criteria at an individual level. For example, Dwork et al. [24] propose an individual fairness notion that requires that similar people are treated similarly. Their formulation of fairness crucially relies on a task-specific metric that captures whether two individuals are similar for the purpose of the task at hand. Due to the challenges of specifying such a metric in any given real-world decision-making problem, it remains difficult to operationalize individual fairness in practice.

Irrespective of the approach one takes to quantify fairness, it is important to engage relevant stakeholders in the design of real-world decision-making systems. As Shah [66] has argued, achieving legitimacy or "social license" from the broader community is critical to the ability of even the best-conceived technologies to have a positive social impact. Similarly, [47, 77] recommend that stakeholders affected by the decisions should be centered in these processes. One example of a "fair" technology that failed to be adopted due to a lack of stakeholder support is a school start time scheduling tool proposed in Boston, intended to decrease bussing costs while improving racial equity and better accommodating differences in circadian rhythms across students of different ages. The system's design failed to account for the excess burden that the proposed times would place on families with multiple children who attend different schools, particularly for lower-income parents who tend to have inflexible work schedules [76]. This is not an isolated example. In a recent study, Veale et al. [70] interviewed 27 public sector ML practitioners across 5 OECD countries and noted the common disconnects between current fair ML approaches and the organizational and institutional realities, constraints, and needs in which algorithms are applied.

False positives occur when a subject has a true negative label, but a classifier erroneously classifies the subject positively. For example, if a child is truly at low risk of maltreatment, but a classifier predicts that they are at high risk, this is a false positive.
False negatives occur when subjects with true positive labels are negatively classified.

Thus, involving affected stakeholders in the algorithm design process, particularly in the process of defining fairness, is of utmost importance. To this end, we propose a novel framework for eliciting stakeholders' opinions around algorithmic fairness. The framework combines two components: an interactive interface that allows stakeholders to examine the data and audit an algorithm's predictions, and an interview protocol that is designed to probe stakeholders' thoughts and beliefs on fairness and biases of the algorithm while they are interacting with the interface.

We evaluated our framework in the high-stakes context of developing machine learning-based risk assessment tools to assist child abuse hotline call workers in their screening decisions. Our work is motivated by the Allegheny Family Screening Tool (AFST), which has been used in Allegheny County, PA since the summer of 2016 [69]. We conducted in-depth interviews with 12 participants from two groups of stakeholders (parents and social workers) to understand their fairness viewpoints. The interviews allow us to identify fairness approaches that align with stakeholders' beliefs, and allow stakeholders to provide rich reasoning to explain their viewpoints. For child maltreatment risk assessment, the stakeholders we interviewed slightly preferred equalized odds (i.e., equalizing accuracy at identifying low- and high-risk cases across the sensitive attributes) over unawareness (i.e., not considering the sensitive attributes at all) and statistical parity (i.e., equalizing high-risk predictions across sensitive attributes). When asked to make individual fairness comparisons, there was little agreement between these stakeholders in most scenarios.

We propose a novel method for engaging human stakeholders in the algorithm design process, specifically in the process of defining fairness. Our work also contributes an empirical understanding of stakeholders' fairness opinions in the high-stakes context of developing machine learning-based risk assessment tools.
There has been significant development in research on machine learning fairness and accountability in recent years [2, 17, 30, 57, 60, 71]. Prior literature on ML fairness can generally be classified in two categories: group fairness and individual fairness. The more commonly studied notion, group fairness, requires parity of some statistical measure across a fixed number of protected groups. In this paper, we ask study participants about three of the most popular notions of group fairness: fairness through unawareness (henceforth unawareness), which is the notion that in order to be fair, an algorithm should explicitly not consider a protected attribute (e.g., race or gender) when making its decisions [59]; statistical (or demographic) parity, which entails that a fair algorithm have parity of positive classification rates across a fixed number of protected groups; and equalized odds [30], which entails that a fair algorithm have equal accuracies (true positive and false positive rates) across a fixed number of protected groups. While all three notions offer some theoretical fairness guarantees, they also have different shortcomings. First, unawareness has long been critiqued: [59] argue that even when protected attributes are not considered as predictive features, there may be "background knowledge" (other data which serves as a proxy or strong predictor for the removed attributes) which recreates the effect of including the removed attributes; for example, someone's zip code may be a strong predictor of their race. In general, policy and decision making which actively disregards sensitive attributes, e.g., color-blind policies to mitigate racial discrimination, has been critiqued as well [4]. Second, [30] critique statistical parity on the grounds that 1) it is not fair insofar as it "permits that we accept the qualified applicants in one demographic, but random individuals in another, so long as the percentages of acceptance match"; and 2) we might incorrectly classify a number of samples in order to maintain equal rates of positive classification. Third, equalized odds can be impossible to achieve simultaneously with other common fairness notions, like statistical parity or calibration [16, 46]. In general, group fairness metrics provide no meaningful guarantees of fairness to individuals or more refined subgroups [24, 31, 42].

On the other hand, notions of individual fairness explicitly constrain algorithmic decisions at an individual level [24, 37]. For example, the individual fairness notion in [24] requires "treating similar individuals similarly;" the meritocratic fairness notion in [37] requires that the algorithm should "prioritize more deserving individuals." However, these approaches require strong assumptions, such as a consistent measure for similarity or merit across individuals, which usually do not hold in real-world contexts. Furthermore, [77] suggest centering research not on proposing new technical definitions, but rather on proposing new procedures for involving stakeholders to determine which notion of fairness is best. To this end, recent work [8, 33, 38] provides theoretical models of human auditors or arbiters who can provide fairness feedback to assist an algorithm to provably enforce individual fairness without an explicit similarity measure.
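For reference, the two statistical criteria above can be stated in standard notation (this formalization is ours, consistent with the cited literature rather than quoted from it): let $\hat{Y}$ denote the algorithm's binary prediction, $Y$ the true label, and $A$ the sensitive attribute.

```latex
% Statistical parity: equal positive classification rates across groups
\Pr[\hat{Y} = 1 \mid A = a] \;=\; \Pr[\hat{Y} = 1 \mid A = b]
  \qquad \text{for all groups } a, b.

% Equalized odds: equal error rates conditional on the true label,
% i.e., equal false positive rates (y = 0) and equal true positive rates (y = 1)
\Pr[\hat{Y} = 1 \mid Y = y, A = a] \;=\; \Pr[\hat{Y} = 1 \mid Y = y, A = b]
  \qquad \text{for } y \in \{0, 1\} \text{ and all groups } a, b.
```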
Beyond individual fairness, there has also been work that involves human efforts in developing algorithms [5, 29, 51, 81], making final decisions after algorithm recommendations [48], and making decisions about fairness trade-offs [79] (as satisfying the criteria for all fairness definitions is mathematically impossible [26, 45]). However, some critique previous elicitation methods for not capturing the reasons behind responses [34, 77]. Additionally, it remains a major challenge to devise mechanisms to involve stakeholders in algorithmic development and auditing that do not require unrealistic levels of technical knowledge among participants.
With the increased attention on algorithmic fairness, researchers and practitioners have designed novel visualization techniques to help people examine machine learning algorithms and identify biases. For example, the What-If Tool and AI Fairness 360 are two open-source tools that allow users to visually examine the behavior of their machine learning models and identify potential biases [9, 75]. Other visual analytics systems have also been developed that allow users to audit the group and subgroup fairness of machine learning models [13], or to help data scientists and practitioners make fair decisions [3].

More recently, HCI researchers have begun to investigate human perspectives on algorithmic fairness. Several recent studies have investigated public [28, 64, 65, 72, 74] and practitioner [32, 56, 70] perspectives on the use of algorithmic systems for public-sector decisions. This body of work suggests that fairness principles need to be context-specific, and that algorithmic systems should embody the fairness notions derived from the community of stakeholders [10, 11, 23, 49, 50, 63]. There has been encouraging work in this direction. For example, researchers have conducted workshops and interviews to understand what people think fairness means in the context of resource allocation [52] or targeted online ads [78]. Researchers have also conducted surveys to gauge how well non-technical subjects understand existing fairness metrics [63], how explaining these fairness metrics differently affects subjects' beliefs about fairness [10], and what features should or should not be used by a fair learning algorithm [28, 65]. While these studies provide us with a better understanding of general public and user perceptions of justice and fairness, it is often difficult to translate these qualitative understandings into system criteria and to directly inform algorithm development.

More recently, interdisciplinary research teams have begun investigating how hybrid approaches combining the tools of both HCI and machine learning can be effectively applied in developing fair and accountable algorithmic systems. Such work includes the WeBuildAI framework [53] and related approaches [40, 61] for incorporating stakeholder preferences into allocation decisions. A recent workshop on Participatory Approaches to Machine Learning held at ICML 2020 featured recent and ongoing work in this emerging space. Featured research included studies of recommender systems [19], approaches to patient triage during the COVID-19 pandemic [36], and critical assessments of the role of participatory methods in algorithm design [62].
In this paper, we aim to answer the following research question:
How can we effectively elicit fairness notions from a community of stakeholders who are not technical experts?
To answer this question, we propose an elicitation framework that consists of two components: (1) an interactive interface that allows stakeholders to express their subjective fairness notions, and (2) an associated interview protocol that further probes stakeholders' reasoning behind their elicited notions (see Figure 1).
The goal of the interface is to enable stakeholders to express their perspectives by reasoning about the algorithm's impact at different levels, ranging from individual decisions to the effects on demographic groups.
Goal 1: Elicitation at the “macro” level.
Corresponding to the "group fairness" notion, the interface should enable users to examine the data and algorithm performance in groups defined by the users (not limited to groups defined by common protected attributes such as gender and race). The interface should present various statistical metrics for each subgroup and visualize them for stakeholders to investigate. The stakeholders can then express whether each statistical group fairness measure is aligned with their perspectives. (In our study, we changed the order of the interview slightly by asking participants about their viewpoints on common group fairness approaches in step one due to time considerations.)

Figure 1: Fairness elicitation framework using our interactive interface and interview protocol. Arrows indicate the progression of our interviews, starting from the case-by-case view (step 1) and proceeding to the group view (step 4).

Goal 2: Elicitation at the "micro" level.
Corresponding to the "individual fairness" notion, the interface should enable users to inspect the data and algorithm recommendations at a case-by-case level. Combining the approaches from prior work [8, 38, 53], the interface elicits individual fairness feedback by asking stakeholders to make two types of pairwise comparisons: (1) whether the pair of individuals should be treated similarly or not, and (2) whether one individual should be prioritized over the other or not.
Goal 3: Elicitation at the “meso” level.
The goal is to enable stakeholders to compare any single selected case with all other cases in the dataset. Different stakeholders may have different criteria for evaluating the similarity and priority across the cases. Thus, the interface should allow users to specify their own metrics when exploring the data.
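As one illustration of a user-specified metric, the sketch below ranks all cases by similarity to a selected reference case using a weighted Euclidean distance of the kind the prototype's similarity comparison view uses (described below). The feature encoding, the weight values, and the toy data are hypothetical assumptions for illustration; categorical features would need to be numerically encoded first.

```python
import numpy as np

def rank_by_similarity(reference, cases, weights):
    """Order cases by weighted Euclidean distance to a reference case.

    reference: 1-D array of the reference case's numeric features
    cases:     2-D array with one row per case, same feature order
    weights:   per-feature weights a stakeholder sets to reflect which
               features matter most for "similarity" in this context
    """
    diffs = cases - reference                          # broadcast over rows
    dists = np.sqrt((weights * diffs ** 2).sum(axis=1))
    order = np.argsort(dists)                          # most similar first
    return order, dists[order]

# Hypothetical toy data: three cases described by (victim_age, prior_referrals).
cases = np.array([[4.0, 2.0], [12.0, 2.0], [4.0, 0.0]])
reference = np.array([4.0, 1.0])
weights = np.array([0.2, 0.8])   # this user weighs referral history more heavily
print(rank_by_similarity(reference, cases, weights))
```

In the interface, adjusting the weights in this way re-ranks the cases along the x-axis of the similarity plot to match the stakeholder's own notion of which cases are comparable.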
Our interactive interface prototype consists of three primary views: (i) a group view corresponding to Goal 1 (Figure 2a), (ii) a case-by-case view corresponding to Goal 2 (Figure 2b), and (iii) a similarity comparison view corresponding to Goal 3 (Figure 2c).

Group view:
This view aims to give users a holistic view of the algorithm's performance by showing how it varies across groups according to different metrics. Users have the option to select from a list of common classification performance metrics. The drop-down menus allow the user to select attributes with which to separate the data into subgroups. The interface displays a bar chart depicting the algorithm's performance across the specified subgroups. A textual description is also provided below the graph to give an alternate description of the algorithm's performance. The visualization corresponds to group fairness notions, and the interactive interface allows users to explore any group or performance metric they are interested in. A demo of our interface can be accessed here (note that the data shown in the demo are synthetic): https://z.umn.edu/fairnessElicitationInterface

Figure 2: Our fairness elicitation interface contains three different views, which allow stakeholders to examine the algorithm at different levels: (a) group view, (b) case-by-case view, and (c) similarity comparison view. In this example, the interface presents synthetic data from a child maltreatment prediction system (see Section 4.3); the group view lists, for each gender-by-race subgroup, the share of cases predicted as high risk that led to re-referral (ranging from 62% to 75% across subgroups).
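A minimal sketch of the computation behind the group view just described might look like the following. The column names, the metric definitions, and the use of pandas are our own assumptions for illustration, not the prototype's actual implementation; `pred` and `label` are assumed binary columns (1 = high risk or re-referral occurred, 0 = low risk or no re-referral).

```python
import pandas as pd

# Hypothetical metric implementations, keyed by the names a user might pick
# from the group view's drop-down menu.
METRICS = {
    # Of the cases predicted high risk, what share led to re-referral? (as in Figure 2a)
    "precision": lambda g: (g.loc[g["pred"] == 1, "label"] == 1).mean(),
    # Of the truly high-risk cases, what share did the algorithm flag?
    "recall": lambda g: (g.loc[g["label"] == 1, "pred"] == 1).mean(),
    # What share of all cases were predicted high risk?
    "high_risk_rate": lambda g: (g["pred"] == 1).mean(),
}

def group_view(df, attrs, metric):
    """Summarize one performance metric for every subgroup defined by the
    user-selected attributes, e.g. attrs=["victim_gender", "family_race"]."""
    rows = []
    for key, g in df.groupby(attrs):
        key = key if isinstance(key, tuple) else (key,)
        rows.append({**dict(zip(attrs, key)), metric: METRICS[metric](g)})
    return pd.DataFrame(rows)
```

Calling, say, group_view(df, ["victim_gender", "family_race"], "precision") would produce the kind of per-subgroup breakdown shown in Figure 2a, which the interface then renders as a bar chart and a textual description.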
Case-by-case view:
This view allows users to deliberate on the algorithm at the granular level of individual predictions. Each case of algorithm prediction is presented as a card; the interface shows two cases at a time for pairwise comparison. On each card, the algorithmic prediction is shown on top, followed by the features the algorithm used to make the prediction. Hovering over each feature shows users a detailed description of that feature and the possible values the feature can take. Users can browse through the cases back and forth: the tool randomly selects a new case from the dataset and replaces the currently displayed case, and users can explore new cases by changing the case on either the left or the right. The interface lets stakeholders inspect the profiles and detailed features of any two individuals treated by the algorithm, allowing them to determine if the pair should be treated equally. This aligns with the definition of individual fairness [24] and with work that operationalizes it (e.g., [8, 39]).
Similarity comparison view:
This view shows a one-dimensional scatter plot that compares a selected reference case with all other cases in the dataset. This allows users to explore the dataset at a macro view and narrow down to individual cases for inspection. The scatter plot displays all the cases in the dataset, with each case represented by a dot, color-coded according to the algorithm's prediction. The reference case is positioned at the far left of the plot, with other cases ordered by similarity to the reference case along the x-axis. A weighted Euclidean distance metric is used to calculate the similarity of the cases: the distance between cases $p$ and $q$ is $\sqrt{\sum_{i=1}^{n} w_i (q_i - p_i)^2}$, where $w_i$ denotes the user-assigned weight for feature $i$. The y-axis shows the distribution of the cases at that similarity level. A control panel allows users to change the weight associated with each feature; users can customize the weights to re-rank the cases in an order that aligns with their viewpoints. Users can select a case from the plot to compare with the reference case, or set a new case as the reference case. The similarity comparison view thus allows stakeholders to compare a reference case with all the other cases. Dots at the same position on the x-axis have the same similarity score (the same distance from the reference case), which allows users to quickly see the distribution of similarity scores across a large number of cases. This lets them quickly narrow down to individual cases from the whole dataset for comparison, such as looking at cases with a high level of feature similarity and evaluating whether they should receive the same decision.

To complement this interface, we developed interview protocols to probe stakeholders' fairness viewpoints and principles. Our protocols are based on the think-aloud approach, which is one of the most valuable usability engineering methods in HCI [58]. We ask stakeholders to use the interface described above "while continuously thinking out loud—that is, verbalizing their thoughts as they move through the user interface" [58]. Think-aloud serves as "a window on the soul," letting us discover what participants really think about the fairness and bias of the algorithm [58].

First, we ask stakeholders to compare pairs of cases in the data without showing the algorithmic predictions, in the case-by-case view (Figure 2b). We ask stakeholders if both cases should be treated equally (i.e., receive the same prediction by the algorithm), and if not, what alternative outcomes the two cases should receive to align with the stakeholders' fairness principles. In this stage, we only show users the features for the cases, as we aim to collect stakeholders' fairness notions regardless of the algorithm's predictions, and the factors they would consider when evaluating the cases in this context. Participants start by comparing pairs of cases which differ by only one factor, then move on to pairs which differ by two or more factors. At this stage, we also ask our participants whether three common group fairness approaches (unawareness, statistical parity, and equalized odds) are appropriate with respect to sensitive attributes (for child maltreatment, these are victim age, victim gender, family race, use of public assistance services, and perpetrator gender).
For a given sensitive attribute, we elicit opinions on whether the following approaches should be met for the algorithm to be fair:

(1) the sensitive attribute should not be a predictive factor (unawareness);
(2) the rates of positive classification should be equal across the sensitive attribute (statistical parity); or
(3) the false positive and false negative rates should be equal across the sensitive attribute (equalized odds).

See Figure 3 for visual explanations of statistical parity and equalized odds (a small computational sketch of these checks follows the figure); we used similar visuals in our interviews. (Though there may be significant problems with unawareness as an approach to fairness, as noted in Section 2, we think it important to gather stakeholder beliefs about it, as it is a common and widely-used policy, e.g., color-blind policies.)

Second, we ask stakeholders to make pairwise comparisons again, with cases showing the algorithmic prediction (Figure 2b). Participants compare cases that are selected randomly from the dataset. We ask users to identify and explain (pairs of) cases that are being treated unfairly. We also ask them to evaluate whether the algorithm's predictions are in general biased according to their fairness notions.

(a) Two examples which violate and satisfy statistical parity between infants and adolescents (two subgroups along the sensitive attribute victim age), respectively. The orange box with no fringe on top contains children predicted to be at high risk of maltreatment; the blue fringe box at the bottom is the low risk prediction. On the left, the proportion of high risk predictions for infants is 75%, whereas for adolescents this rate is 50%. On the right, the proportions of high risk predictions for infants and adolescents are both 50%.

(b) Two examples which violate and satisfy equalized odds between infants and adolescents, respectively. Green children are truly at high risk of maltreatment; white are truly at low risk. False positives (children who are truly at low risk of maltreatment but are predicted to be at high risk) are in the upper right-hand corner; false negatives (children who are truly high risk but are predicted to be at low risk) are in the lower left corner. In the left example, the false positive and false negative rates for infants are 50% and 20%, respectively; for adolescents these rates are 33.3% and 66.6%, respectively. On the right, for both infants and adolescents, the false positive and false negative rates are 50% and 66.6%, respectively.
Figure 3: High risk prediction rates must be equal along a sensitive attribute to satisfy statistical parity. False positive and false negative rates must be equal to satisfy equalized odds.
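As a concrete companion to the criteria above, the following sketch (our own illustration, not part of the study instrument) computes the quantities one would compare across subgroups: high-risk prediction rates for statistical parity and false positive/false negative rates for equalized odds. The toy data and group labels are hypothetical; the sketch also assumes each group contains at least one truly high-risk and one truly low-risk case.

```python
from collections import defaultdict

def fairness_summary(preds, labels, groups):
    """Per-group rates relevant to the elicited group fairness criteria.

    preds:  predicted labels (1 = high risk, 0 = low risk)
    labels: true labels      (1 = high risk, 0 = low risk)
    groups: subgroup of each case along one sensitive attribute,
            e.g. "infant" / "adolescent" for victim age
    """
    buckets = defaultdict(list)
    for p, y, g in zip(preds, labels, groups):
        buckets[g].append((p, y))

    summary = {}
    for g, cases in buckets.items():
        n_pos = sum(1 for _, y in cases if y == 1)   # truly high-risk cases
        n_neg = len(cases) - n_pos                   # truly low-risk cases
        summary[g] = {
            # compared across groups for statistical parity
            "high_risk_rate": sum(p for p, _ in cases) / len(cases),
            # compared across groups for equalized odds
            "false_positive_rate": sum(1 for p, y in cases if p == 1 and y == 0) / n_neg,
            "false_negative_rate": sum(1 for p, y in cases if p == 0 and y == 1) / n_pos,
        }
    return summary

# Hypothetical toy data for two victim-age subgroups:
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0, 0, 1]
groups = ["infant"] * 4 + ["adolescent"] * 4
print(fairness_summary(preds, labels, groups))
```

Unawareness, by contrast, is a property of the model's inputs rather than its outputs, so it is checked by inspecting the feature set rather than computed from predictions.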
Third, we ask stakeholders to use the similarity comparison view to compare reference cases with all the other cases in the data (Figure 2c). We ask stakeholders to define their own similarity metrics by ranking the importance of each feature in determining similar pairs. We then ask participants to identify cases that should be prioritized by the algorithm. We also invite participants to identify pairs which are similar to each other but received different predictions from the algorithm. Stakeholders are free to explain the reasons behind their selections and the information they rely on to identify them.

Lastly, we show stakeholders the group view of the interface (Figure 2a). Stakeholders can define the groups they want to inspect and see the algorithm's performance on those groups. We ask stakeholders to explore the groups and subgroups they are most concerned with. If participants believe any particular groups (or subgroups) are being treated unfairly, we ask follow-up questions to probe the reasons for this belief.

Throughout the interview, participants are encouraged to share their views on the cases before them even if those views do not reflect perceptions of fairness per se. Participants may indicate, for instance, that they are uncomfortable with the use of algorithms in certain cases, that particular case characteristics are of paramount importance to the decision-making process, or that having model explanations would improve their understanding of the tool. This is all valuable, actionable feedback that may be incorporated into the algorithm re-training process.

While the use of predictive models in critical societal domains has only recently begun to receive widespread attention from the computer science community, predictive "risk assessment tools" have a long history in child welfare and beyond. Machine learning tools of the kind we discuss in this paper fall into a family of methods traditionally referred to as 'actuarial risk assessment.' The term 'actuarial' is used to indicate that a tool relies on associations inferred from data between an outcome and so-called risk factors (i.e., input features). This terminology is used to contrast with 'clinical risk assessment,' also known as professional judgment, in which experts subjectively assess risk. One of the earliest actuarial risk assessment tools was developed by Burgess [12] to calculate the recidivism risk for offenders being released from Illinois state prisons. Actuarial risk assessment instruments are now widely used throughout the criminal justice system, from pre-trial [20, 21], to sentencing [43, 55], to probation and parole [7]. They are also used in academic advising, healthcare, welfare allocation, homelessness services, and many other settings [6, 14, 44, 67].

Over the past couple of decades, many child welfare agencies have incorporated actuarial risk assessment tools, or hybrid models that combine prediction with professional judgment, into various stages of the child protection decision-making process [35, 69]. While the most widely-used tools take the form of simple point systems that consider only a handful of manually-entered factors, machine learning models such as neural networks have been considered since at least the early 2000s [54]. Contemporary tools such as the AFST differ from the majority of existing tools in that they rely on a much larger set of features that are automatically populated from multi-system administrative data.
This obviates the problem of inter-rater reliability, wherein different users may have different assessments of manually-entered features in a manner that results in different risk scores. But it leaves open the possibility of more systematic errors potentially going undetected for long periods of time [18]. See Figure 4 for further explanation of the child welfare screening process used at the Allegheny County Department of Human Services (DHS).
Figure 4: Allegheny DHS screening process. Step 1: An external caller, e.g., the child's teacher or a family member, calls a child welfare hotline to make a report, which directs to either a state or county hotline. The state hotline takes down the caller's information and forwards it to the county staff in step 2 for screening. The county hotline goes directly to the call screening staff (call screeners and supervisors) in step 2, where they gather information from the caller, retrieve existing information from the department's database about the case, and assess the risk of harm to the child. Here, the AFST risk score is considered. Step 3: the call screening staff determines whether to screen out the report, meaning that the call is not investigated further by the department (step 4), or to screen in, which can entail assigning a caseworker to the case and investigating further (step 5). Steps 6+: the case may be referred to a caseworker and/or other child welfare staff (supervisors, administrators, etc.), where they might proceed in a number of directions, e.g., further observation, investigation, or intervention (e.g., removal of the child or referral to other governmental services).
The use of algorithmic decision support tools in the child welfare context is a contentious issue. For instance, there is the possibility that communities of color and families experiencing poverty may be disadvantaged by virtue of having more comprehensive data available on them in the government administrative data systems used to evaluate algorithmic risk scores. Such concerns have been given voice by authors such as Virginia Eubanks, who in her book
Automating Inequality [25] argues that such tools oversample the poor and present a fundamentally flawed approach to improving child welfare decision-making. Similar objections have been raised by Richard Wexler, a long-time critic of algorithmic tools in child welfare.

While child welfare agencies are sensitive to these concerns, many also believe that data-driven decision-making approaches hold the promise of significantly improving decisions and family outcomes. Few would argue against using all available resources to promote child safety. Indeed, it can be viewed as unethical to knowingly do otherwise. Administrative system data is one increasingly available resource, but it is one that is challenging for human decision-makers to make effective and systematic use of in every instance. This is where algorithmic tools enter.

In exploring how to realize the potential upside of such tools, it is essential that agencies take a rigorous approach to development, deployment, and evaluation, and that this approach be informed by ethical considerations. Allegheny County's work on the Allegheny Family Screening Tool (AFST) used for call screening, for instance, involved both a pre-deployment ethical analysis conducted by researchers Tim Dare and Eileen Gambrill [69] and an independent post-deployment impact evaluation [27]. Studies of affected community perspectives [11] have also found that certain proposed uses of algorithmic tools are viewed by families and child welfare workers as providing considerable benefit. However, study participants voiced concern over the potential for such tools to exhibit biases and emphasized the need for a human-in-the-loop approach.

Our proposed elicitation framework is intended to respond to the clear need for algorithmic systems in sensitive domains to reflect relevant fairness and equity desiderata. We hope that the methodology we propose can work in concert with participatory design, community engagement, and impact evaluation strategies to develop tools that achieve meaningful legitimacy and demonstrably improve child and family outcomes.

We evaluated our framework in a real-world, high-stakes context: child maltreatment prediction. Our study was motivated by recent efforts by child welfare agencies around the country to incorporate algorithmic decision support tools into their existing processes. Our study is most closely related to the AFST, which since August 2016 has been used during child abuse call screening in Allegheny County, Pennsylvania [69]. The AFST score is based on data related to the victim child(ren), parents, legal guardians, perpetrators, prior child welfare history, criminal history, and use of public assistance. Call screeners are presented with an AFST score for each referral. Due to the sensitive nature of the data, our study relied exclusively on synthetic data based on the real dataset provided by the Allegheny County Department of Human Services. We also converted the AFST score into binary labels: high risk and low risk cases.
We recruited two groups of stakeholders for the user study: (1) social workers with experience of investigating allegations of child abuse and (2) parents. To recruit the social workers, we reached out to the departments of social work of four public universities in the US, which helped us send out the recruitment emails to their undergraduate and graduate students. To recruit the parents, we posted recruitment messages on the social media of the authors and sent out recruitment emails to parent groups. We recruited 12 participants in total for the study, and Table 1 shows the details of the participants.

Table 1: Participant Summary

Social workers
ID | Age   | Gender     | Race   | Social worker experience | Enrolled in social work program
S1 | 25-34 | Woman      | Latinx | Yes | Yes
S2 | 25-34 | Woman      | Black  | Yes | Yes
S3 | 25-34 | Woman      | White  | No  | Yes
S4 | 18-24 | Non-Binary | Other  | No  | Yes
S5 | 18-24 | Woman      | White  | No  | Yes
S6 | 18-24 | Woman      | White  | Yes | Yes
S7 | 25-34 | Woman      | White  | Yes | Yes
S8 | 18-24 | Woman      | White  | Yes | Yes

Parents
ID | Age   | Gender | Race  | No. of children | Children age(s)
P1 | 25-34 | Man    | Asian | 2 | 2 or younger, 3-7
P2 | 45-54 | Woman  | Asian | 2 | 13-17, 18+
P3 | 45-54 | Woman  | Other | 1 | 18+
P4 | 45-54 | Woman  | White | 3 | 18+

We conducted the user studies over video chat. In each study, the researcher first gave an introduction to the study and an overview of how to use the interactive interface described in Section 3.2. Then, we invited each participant to use the tool to explore the child maltreatment prediction data. We followed the interview protocol introduced in Section 3.3 to elicit participants' fairness notions surrounding the algorithm's decisions. Participants shared their screen in the video chat so that researchers could see the same information in the process. Participants were also encouraged to think aloud during the study. Each user study lasted about 90 minutes. Each participant was compensated with a $30 Amazon gift card. The user study was reviewed and approved by the Carnegie Mellon University Institutional Review Board.
All study sessions were audio-recorded with consent from the participants. The first two authors transcribed all 12 interviews, comprising 20.5 hours of recorded audio. We employed a qualitative, grounded theory analysis to inductively analyze our data and generate the findings and insights from the interviews. We adopted Charmaz's approach to grounded theory analysis, which allows us to consider prior ideas and theory in the analysis [15]. We open coded interview transcripts, held team meetings to discuss emerging themes and ideas, and iterated on our codebook. We describe the findings of our analysis in the next section.
In this section, we summarize participants' viewpoints on both group and individual fairness gathered throughout the interview process. Overall, we see no difference in patterns of responses between social workers and parents. Throughout this section, we follow the prior literature and refer to the three group fairness approaches by these abbreviated phrases: unawareness means to leave a sensitive attribute (e.g., race) out of the model; statistical parity means to equalize the positive classification rates between groups within a sensitive attribute (e.g., different racial groups); and equalized odds means to equalize the false positive and false negative rates between groups within a sensitive attribute. The highlights of our results are as follows:

(1) Among the three group fairness approaches, equalized odds was the most supported group fairness criterion (66.7%), followed by statistical parity (43.3%) and unawareness (41.7%).
(2) Even though equalized odds was the most supported, there were nuances within this; e.g., many participants were willing to accept disparities in accuracy across groups rather than sacrifice overall accuracy.
(3) Even though there are heated discussions around the over-representation of Black children in the child welfare system [22], participants thought that statistical parity was not necessarily fair, though it is a good goal.
(4) Participants thought awareness could either address or enforce systemic discrimination.
(5) Among the individual fairness comparisons, we did not observe unanimous agreement between participants for most of the pairs.
(6) Participants maintained consistent responses to each group fairness question across different protected attributes (i.e., victim age, victim gender, family race, use of public assistance, and perpetrator gender).
(7) Participants interpreted our fairness questions differently and experienced cognitive overload when examining cases with a high number of differing attributes, leading to additional challenges.

We asked the participants whether three common group fairness approaches (unawareness, statistical parity, and equalized odds) are appropriate for each of the five sensitive attributes (victim age, victim gender, family race, use of public assistance services, and perpetrator gender). In particular, for a given sensitive attribute, we asked whether the following criteria should be met for the algorithm to be fair:

(1) the sensitive attribute should not be considered for decision-making (unawareness);
(2) the high-risk predictions should occur at the same rate across the sensitive attribute (statistical parity); or
(3) the predictions should be equally accurate at identifying low- and high-risk cases across the sensitive attribute (equalized odds).

See Figure 3 for visual explanations of statistical parity and equalized odds.

Among the three group fairness criteria, equalized odds was the most preferred (66.7%), followed by statistical parity (43.3%) and unawareness (41.7%) (see Figure 6). The observant reader might notice that the equalized odds responses are split 8-4, which is the same as the proportion of social workers versus parents. This is simply a coincidence: the four participants who responded 'No' were half social workers and half parents. In general, we tested for clustering of our social workers' and parents' responses to all questions to see if there were any divisions between the two groups of participants and found little.
(a) Responses to the question "Should a fair algorithm be aware of a given sensitive attribute?" (awareness). A 'No' means the participant believes the algorithm should be unaware of the attribute; 'Yes' means the participant believes the algorithm should be aware of it.

(b) Responses to "Should a fair algorithm classify equal proportions of cases as high risk between subgroups within a given sensitive attribute?" (statistical parity).

(c) Responses to "Should a fair algorithm have equal accuracy (false positive and false negative rates) between subgroups within a given sensitive attribute?" (equalized odds).

Figure 5: Frequencies of responses to the three group fairness questions for sensitive attributes victim age through perpetrator gender. Also, e.g., infants and adolescents are two subgroups within the victim age sensitive attribute.

Figure 6: Frequencies of responses to the three group fairness questions. See Figure 5 for further explanation of the questions asked.

Our participants think unawareness is an appropriate fairness criterion in 41.7% of situations. 6 out of 12 participants believed that the algorithm should be unaware of victim gender, family race, and perpetrator gender; only 2 out of 12 participants believed that the algorithm should be unaware of victim age (see Figure 5a).
Algorithms should be aware of important predictive attributes.
Many participants disfavored unawareness, i.e., endorsed awareness, for sensitive attributes they thought were important indicators of risk. For example, S2 thought age should be taken into account, since "when you're a younger child that doesn't have language, you definitely are higher risk than a child that does have language" (S2). Many other participants reiterated that victim age is an indicator of language and, thus, of risk. This is one reason why we saw only 2 of 12 participants endorsing unawareness for victim age (see Figure 5a). Others thought an algorithm should be aware of only important predictive attributes. For example, S7, who endorsed unawareness for victim age, said, "I wouldn't want to like teach the algorithm to prioritize [a case] based on age... I would want it more about the type of alleged abuse," indicating that type of abuse is an important predictor, whereas victim age is not.
Awareness could reinforce systemic discrimination.
Some participants endorsed unawareness because they were concerned that awareness of sensitive attributes would lead to systemic discrimination. For example, P3 thought that the algorithm ought to be unaware of family race, since awareness "opens up... room for systemic racism" and would "bring unconscious biases" (P3). Furthermore, some suggested awareness of sensitive attributes to audit the algorithm's decisions, but not when making predictions. P3 said that the algorithm should be aware of family race "only to report certain disparities" between races (P3). Similarly, S4 endorsed unawareness so that the algorithm would not inherit biases based on gender, saying: "There is a big bias on gender. Everybody has different opinions on what gender is [and] what entails gender" and a fair algorithm should be unaware of gender "to erase all of that" (S4). One participant, S7, endorsed unawareness to prevent bias, not just for marginalized groups, saying that an algorithm that is aware of victim gender "would definitely learn to prioritize female victims" over male victims, so it should be unaware of victim gender (S7).
Awareness could help address systemic discrimination.
Other participants thought the opposite: that an algorithm should be aware of sensitive attributes, because awareness could help address and correct for historical disparities. For example, S1 thought an algorithm should be aware of family race in order to "negate any... racial biases" (S1). S4 thought that family race should be one of the most important predictive factors, saying "people of color are disproportionately affected and they always have been;" so "if you're a person of color, you should be prioritized over white people," because fair predictive systems "should be targeted to help people who we... know need the help" (S4).

Endorsing awareness may come with apprehension.
Some participants identified a reason for unawareness, but ultimately decided against it. For example, S6 thought that the algorithm ought to be aware of all sensitive attributes, saying, "The more information you have, the better. But I can understand how that can also lead to bias" (S6). S5 said that victim "age is an important factor," but it should not be overemphasized as a predictor: there is "room for algorithm error if it's focusing so much on age, rather than [on] other factors" (S5).
Our participants thought statistical parity was a slightly more appropriate fairness approach than unawareness, supported in 43.3% of the cases. Similarly, statistical parity was most supported for victim gender (7/12) and family race (6/12) and least supported for victim age (3/12) (see Figure 5b).
Statistical parity is a good goal, but is not necessarily fair.
While many participants wanted to see similar positive predictive rates across these groups, they also thought that disparities in the prediction rates do not indicate that an algorithm is unfair. For example, with regards to high-risk prediction rates, S5 said, "You'd want to see similar numbers achieved, but if the numbers aren't quite the same, I think that's also okay as long as you're still detecting fairly" (S5). P1 said, "I wouldn't say that if the algorithm predicts differently across different races, then it's an unfair algorithm" (P1).
Statistical parity overlooks contextual differences between cases.
Some participants who disfavored statistical parity reasoned that equalizing positive classification rates may overlook contextual differences specific to each case. For example, P2 said that statistical parity "is arbitrary... It depends on what the cases are" (P2).
Statistical parity overlooks different base rates between groups.
Many participants recognized that different groups may have different base rates of being at high risk of abuse, especially among different victim age groups. As a result, many expressed apprehension towards statistical parity. For example, P4 and S4 disfavored statistical parity among victims of different ages because, as P4 said, "I think the prediction [rates] will be higher in the lower age group, so [different age groups] shouldn't be looked at the same" (P4). S8 explained why they disfavored statistical parity using a scenario where the base rates of high-risk classification among different ages are different due to circumstances unrelated to abuse: "Children under five are not school age yet. So, there may be fewer adult eyes on the child. There might be more eyes on older children, so there might be higher rates of referrals; but, that doesn't mean there's higher rates of abuse" (S8). P1 disfavored statistical parity between victim genders because "there's probably difference, in terms of the risk, between the two genders" (P1). When asked whether statistical parity is fair, S7 said, "Yes, with an asterisk, because I do think in cases of alleged sexual abuse, I would expect to see a higher percentage of high risk for female victims" (S7).

Furthermore, some participants advocated for something like calibrating the rates of high-risk predictions to the base rates. For example, S3 said an algorithm "should be making high-risk predictions based on what demographic data says is the most at risk" (S3). They even recognized that "there might be certain populations that are under-reported or over-represented, but I think that to the best of our ability high risk predictions from the algorithm should match... demographic information" (S3). This comment is particularly interesting, considering concerns over the child welfare system stemming from over-representation of Black children in foster care [22], as will be further explained in Section 6.1.1.

Mandating statistical parity is not fair.
Some participants disfavored statistical parity because they did not think it was fair to mandate that this condition be met. For example, if the algorithm were focused on achieving statistical parity, S8 said they "would worry that [the algorithm] would be focused on meeting a number" (S8). P2 said, "The algorithm should not try to balance [the rates of classification] out so that it appears to be fair... Trying to balance out the rates of high-risk predictions means... you're manipulating the situation" (P2).
Equalized odds was the most supported fairness approach: participants thought it was an appropriate fairness approach 66.7% of the time. There were no differences in the frequencies of support across the sensitive attributes (see Figure 5c).
Equalized odds aligns with existing fairness beliefs.
We found that participants tended to agree with equalized odds since it is more closely aligned with their existing fairness beliefs: many participants expressed that they want the accuracy of the algorithm to be as high and as even as possible. For example, S4 said, "I think the goal is for [the algorithm] to be 100% accurate. But if I had to choose one over the other. Yeah, I would want [the accuracies between groups] to be [equal]. It should have the same accuracy and it should be high" (S4). Numerous participants echoed almost this exact sentiment; e.g., S6 said, "I would want for it to have the same accuracy and for the accuracy to be really high across all groups" (S6). This is consistent with equalizing the true positive and true negative rates across groups, which equalized odds entails.
Accuracy should not be sacrificed to achieve equalized odds.
However, an algorithm cannot always achieve equalized odds. Participants also discussed the trade-offs they are willing to accept if equalized odds is not attainable. To achieve equalized odds, one can lower the accuracy (i.e., increase the false positive rate or false negative rate) for the group where the algorithm originally performs better. The participants generally disagreed with such practices to attain equalized odds. S8 said, "I wouldn't want to lower the accuracy of a group. That seems counter-intuitive. You want it to be accurate as much as it can" (S8). For this reason, S8 disfavored equalized odds. Other participants, however, thought that the accuracies should not be lowered yet still endorsed equalized odds. For example, S5 endorsed equalized odds for all sensitive attributes, but responded, "I don't think the accuracy should be lowered" (S5). It is worth noting that other participants, such as P3, were fine with lowering the accuracy of one group to match the other in order to achieve equalized odds.
Improvements in accuracy should help as many high-risk people as possible.
In the case that equalized odds cannot be achieved, some participants wanted the algorithm to improve accuracy for larger groups of people in order to help as many people as possible. For example, S6 said, "I obviously want [the algorithm] to be as accurate as possible across all groups. I would want it to have the same accuracy and for the accuracy to be really high across all groups. But, when there's different accuracies, you would want a higher accuracy for groups that are bigger. Hopefully, you would help the most... children as possible." (S6). S6 clarified that not only large groups, but those who are high-risk should be prioritized: "If the most abused or neglected group is like Hispanic girls you would want that [group] to have higher accuracy than like white boys" (S6). S4 also expressed similar ideas, saying "I think [the algorithm] should be more accurate on non-white races and ethnicities than white ones, just because I feel like they are people of color and are more at risk" (S4).
Equalized odds hides historical inaccuracies.
Instead of equalized odds, some participants would rather be aware of an algorithm's historical false positive and false negative rates for a given group. As a social worker, S8 thought that mandating equalized odds hides where the algorithm is more or less accurate; instead, they said, "I would want to know where [the algorithm] was less accurate, so that we could be looking at what's getting in the way. And how can we improve" (S8). Participants recognized the role of the human decision-maker who considers the algorithm's prediction as a non-binding recommendation. For example, P1 did not think that you have to make "your decision purely based on the algorithm: you can know that the algorithm has different prediction accuracy across different factors" (P1). It is worth noting that other participants recognized the importance of knowing historical inaccuracies, yet endorsed equalized odds. For example, when asked whether differences in false positive or false negative rates across groups were unfair, S5 (who endorsed equalized odds for all sensitive attributes) responded, "it depends on why the difference is there" (S5).
In addition to the established group fairness definitions that often ask for parity between the metrics, participants also discussed an additional guideline that the algorithm should follow. S3 expressed the idea that, instead of having statistical parity between the groups, the positive prediction rate should be matched to the historical demographic data: "I think that [the algorithm] should be making high risk predictions based on what demographic data says is the most high risk. The high risk predictions... from the algorithm should match as close as possible to the demographic information that showed you who was being victims of abuse" (S3). This approach, however, has the risk of potentially reinforcing historical injustice. To put that into practice, S3 thought that statistical parity was not appropriate across perpetrator genders: "I think it should match demographic information and demographic information has shown us that perpetrators are more likely to be male" (S3). When asked which sensitive attributes equalized odds is an appropriate fairness measure for, S2 said, "my thought process is that I want to make sure that whatever decision that I would make is informed by research" (S2).

At the same time, when discussing whether the algorithm should consider a feature in its prediction, participants thought that if the algorithm's decisions are supported by research findings and existing data, then this is a signal that the algorithm is fair. For example, when asked about which sensitive attributes a fair algorithm should be aware of, S7 said that "any factor that's supported by research as being linked to a higher frequency of outcomes should be considered by the algorithm... If it's not supported in the research, then I don't think it needs to be introduced" (S7).

Figure 7: Frequencies of responses by case number, where the case pair numbers correspond to differences between the pair as follows. Case 1: Victim age; 2: Victim gender; 3: Family race; 4: Use of public assistance; 5: Perpetrator gender; 6: Allegation type, Perpetrator age; 7: Family race, Referral history; 8: Use of public assistance, Victim age, Reporter type; 9: Victim age, Perpetrator age (Perpetrator not related to victim); 10: Victim age, Perpetrator age (Perpetrator related to victim); 11: Number of parents, Region wealth, Perpetrator relationship to victim; 12: Region wealth, Use of public assistance, Referral history; 13: Family race, Region wealth, Use of public assistance; 14: Number of parents, Victim age, Victim gender.

Participants used the case-by-case view (Figure 2b) to evaluate different pairs of cases treated by the algorithm. For the first 14 pairs of cases, the actual algorithm predictions are not shown, and participants discussed their opinions on how the algorithm should treat these cases. Figure 7 shows participants' responses to the 14 pairs of fixed cases. For each pair of cases A and B, we offered participants the following five options to choose from:

(1) (Equally prioritize): cases A and B should be given the same classification (either high- or low-risk);
(2) (Prioritize A): only case A should be classified as high-risk;
(3) (Prioritize B): only case B should be classified as high-risk;
(4) (Not comfortable answering); or
(5) (No opinion).

We did not observe unanimous agreement.
Though there isfrequent majority consensus among the individual fairness compar-isons —only three case pairs (1, 11, and 14) do not exceed a majority(above the 50% line),— we only saw unanimous agreement betweenparticipants in one (case pair 6) of fourteen cases. This is signifi-cant, because if we expected there to be one best fair response foreach case, then (even among our small number of participants) wemight have expected unanimous consensus among more case pairs.Yet, only five pairs had 10 or more participants who responded thesame. One common theme between these 5 pairs was that the onlydifferences between the pairs were allegation type, gender, race ora combination of them. Most agreed on equally prioritizing thosein case pairs 2, 3, and 5, wherein the cases differ only along victimgender, family race, and perpetrator gender, respectively. In casepair 6, the pairs differed only along allegation type and perpetratorage: all participants voted to prioritize the case with the more seri-ous allegation. In case pair 7, the pairs differed only along referralhistory and family race: all participants but one voted to prioritizethe case with with more prior referrals.In all the other cases, responses were more contentious. In casepair 1, where the cases differ only in victim age, we saw an almosteven split between prioritizing the case with younger age (6 partic-ipants) and prioritizing both cases equally (5 participants). For casepair 11, we saw another even split between prioritizing both casesequally (5 participants) and prioritizing the case with the singleparent, non-parent perpetrator in a less wealthy region (6 partici-pants). For case 14, we saw an even split between prioritizing bothcases equally (4 participants), prioritizing the case with a younger,male child and two parents (5 participants), and prioritizing thecase with an older, female child and single parents (3 participants).
For the highly agreed-upon pairs, all participants had a strong and clearly defined consensus on how these features affect the prioritization that a maltreatment case should receive. In one situation, the implication of a single feature was so clear that participants universally agreed on prioritizing the case with the more severe allegation type and did not need to look at the other, less significant features: "I think for me immediately what gets me to prioritize B over A is the allegation type. For me I think allegation types should be prioritized over the age. So that's the first thing... I would screen... and I don't really think I have to look at the age anymore" (S4).
In some scenarios, the decision was far more contentious, especially when participants disagreed on the implications of a specific factor. For example, participants disagreed about whether a difference in victim age was a strong enough reason for an algorithm to prioritize two cases differently. Some participants thought an age difference alone should not be a reason to differentiate two cases: "Now the only thing that's different is the age, so... it doesn't really feel right to say one should be prioritized over the other, necessarily... I don't really think [age] should make a difference" (S5). Other participants thought younger children were generally more vulnerable than older children and, therefore, should be prioritized first by the algorithm: "the twelve-year-old is going to be able to say what's happening easier than the four year old, because the twelve-year-old girl would have a better idea what's wrong and right than a four year old" (S6).
When navigating the case-by-case and similarity comparison views, participants were asked to make decisions for child maltreatment cases based on the limited information available to the algorithm. As a result, some information that participants wanted to know might not have been available when making the decisions. We observed two different types of heuristics when participants compared cases side by side.
Participants constructed stories from the available case information.
Some participants used the information available in each case to reconstruct the story behind the case and the situation the victim was facing. They would attempt to compare and visualize which of the cases faced the more imminent risk. When information was missing, participants would come up with narratives of the potential scenarios. For example, S1 said, "A really important factor definitely would be like what kind of abuse, we're talking about here like is it a neglect case? Is it a case of a babysitter not paying attention to the kids or is this like, if it was something physical or sexual then it's obviously one I would be much more concerned about" (S1). These participants weighed the risks and probabilities of all these alternatives in order to make a final decision.
Participants focused on the differences between cases.
The other way participants compared the cases was to focus primarily on the differences in features between the two cases. These participants would look through the features one by one and identify the differences between the cases. They would then weigh how each difference affected the cases' potential risk and make a final decision on that basis. For example, S6 explained how they chose to prioritize a case based on the difference in the number of prior referrals: "the only difference really is the income and the fact that this one had a referral. I guess I would choose the one that's already had a referral" (S6). Participants prioritized cases with features that they thought were more significant. We also asked participants to rank the importance of each feature in determining whether two cases are similar. We found that participants were more likely to use the features they ranked as more important as the primary factor in deciding which case the algorithm should prioritize.
When the differences were of equal significance to them, these participants would simply count the pros and cons of prioritizing each case. For example, S4 reasoned, "So going off the race. I would prioritize the right [i.e. case B]. But, then because of public assessor, [I would] prioritize the left [i.e. case A]. And then on the socioeconomic factors I would prioritize [the] left. So, because it's two to one, I would prioritize case A over case B in this situation" (S4).
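The pros-and-cons tallying that S4 and S6 describe can be read as an importance-weighted comparison of feature differences. The sketch below is a loose illustration of that heuristic, not a model we built; the feature names, weights, and direction rules are hypothetical.

```python
# Illustrative sketch of the comparison heuristic some participants described:
# weigh each differing feature by its perceived importance and tally which
# case the difference favors. All names, weights, and rules are hypothetical.
def compare_cases(case_a, case_b, importance, favors_a):
    """Return 'A', 'B', or 'equal' by summing importance-weighted differences."""
    score = 0.0
    for feature, weight in importance.items():
        if case_a[feature] == case_b[feature]:
            continue  # identical features carry no weight in the comparison
        score += weight if favors_a[feature](case_a[feature], case_b[feature]) else -weight
    return "A" if score > 0 else "B" if score < 0 else "equal"

importance = {"prior_referrals": 3.0, "victim_age": 2.0, "region_income": 1.0}
favors_a = {
    "prior_referrals": lambda a, b: a > b,  # more prior referrals -> higher priority
    "victim_age": lambda a, b: a < b,       # younger victim -> higher priority
    "region_income": lambda a, b: a < b,    # lower-income region -> higher priority
}
case_a = {"prior_referrals": 2, "victim_age": 4, "region_income": 40_000}
case_b = {"prior_referrals": 0, "victim_age": 12, "region_income": 55_000}
print(compare_cases(case_a, case_b, importance, favors_a))  # -> 'A'
```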
When asked group fairness questions, each participant often answered similarly across all sensitive attributes. For each group fairness question, when participants responded for all five sensitive attributes (victim age, victim gender, perpetrator gender, family race, and use of public assistance), on average about 4.4 of their 5 answers (87.8%) were the same. For equalized odds, each participant gave the same answer, either all 'yes' or all 'no', across the sensitive attributes. This indicates that our participants had internally consistent beliefs across protected attributes. Participant responses indicate that this was because participants had a common reason for favoring or disfavoring a group fairness approach across all protected attributes. For example, when asked whether equalized odds across different ages was fair, S8 gave a reason for responding 'no'; then, they said they "feel similarly regardless of the identity" (S8). This kind of response was common among participants: they held the same reason, so they answered the same across all protected attributes.
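The consistency figure above amounts to counting, for each participant, how many of their five per-attribute answers match their most common answer. A minimal sketch with a hypothetical response matrix:

```python
# Illustrative sketch of the consistency measure; the answers below are made up.
from collections import Counter

attributes = ["victim age", "victim gender", "perpetrator gender",
              "family race", "public assistance"]
# One row per participant: 'yes'/'no' answers to one group fairness question.
answers = [
    ["yes", "yes", "yes", "yes", "no"],
    ["no", "no", "no", "no", "no"],
]
for i, row in enumerate(answers, start=1):
    modal_count = Counter(row).most_common(1)[0][1]
    print(f"participant {i}: {modal_count}/{len(attributes)} answers consistent")
```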
Many participants maintained consistent reasoning over protected attributes between individual-level and group-level responses. For example, when asked about case pair 1, which differed only by victim age, S3 thought that the younger child ought to be prioritized. Later, when asked whether statistical parity across different ages is fair, S3 answered 'no,' reasoning: "because the cases that I looked at, I selected the lower-age children [to be prioritized]" (S3). Participants reasoned about individual cases when asked group-level questions and vice versa. For example, S7 also reasoned about a previous pair of cases when asked whether a fair algorithm should be aware of victim gender. From the other direction, when presented with case pair 3, wherein the only difference was that one family was Caucasian and the other was African-American, S2 reasoned that "race... holds a lot of social meaning in America" (S2) and chose to equally prioritize the pair as a proxy for making sure that these different races are treated equally. These results indicate internal consistency in reasoning across individual and group fairness.
We speculate that this is because participants held preconceived notions of fairness that ran through all of their answers.
When participants were asked about group fairness criteria (statistical parity and equalized odds in particular), they gave reasons that indicated different understandings of these questions. As a result, the meanings of 'yes' and 'no' answers to these questions varied.
Some understood statistical parity and equalized odds questions as asking about sufficient conditions. Some participants understood these questions as asking, "Are statistical parity or equalized odds sufficient conditions for fairness?" For example, when asked whether statistical parity was fair, S3 said they "want to see similar numbers [i.e. rates of positive classification] achieved, but if the numbers aren't quite the same,... that's also okay" (S3). Thus, S3 treated statistical parity as a sufficient condition for fairness: if the algorithm meets statistical parity, the outcome is fair. Based on their reasoning, had they understood the question as asking whether statistical parity is a necessary condition, they might have answered 'no.'
Some understood statistical parity and equalized odds questions as asking about necessary conditions. Some participants understood these questions as asking,
"Are statistical parity or equalized odds necessary conditions for fairness?" For example, when asked whether a fair algorithm should achieve statistical parity, P1 said, "The question [is] weird because if I say 'No,' what I'm saying is that a fair algorithm shouldn't make high risk prediction at the same rate across groups" (P1). P1 answered 'no' to all statistical parity questions because, even though they indicated that ideally the algorithm should classify different groups at similar rates, it "does not have to" to be fair.
Some understood statistical parity and equalized odds questions as asking whether they should be mandatory. Another group of participants understood these group fairness questions as asking, "Should the algorithm be mandated to fulfill statistical parity or equalized odds in order to be fair?" For example, P2 answered 'no' to all statistical parity questions because "the algorithm should not try to balance [the rates of classification] out so that it appears to be fair... Trying to balance out the rates of high risk predictions means... you're manipulating the situation" (P2). P1 and S8 answered 'no' to all statistical parity and equalized odds questions: S8 said they "would worry that [the algorithm] would be focused on meeting a number" (S8); P1 said that "deliberately changing the algorithm to be lower accuracy so that [the classification accuracies across protected groups] match... just doesn't make sense" (P1). These participants thought that if group fairness constraints were met only because the algorithm was mandated to meet them, the result was not necessarily fair. Thus, their 'yes' or 'no' responses have no bearing on whether statistical parity or equalized odds would be fair if they were met without being mandated; it is possible they would have answered 'yes' had we asked this.
These varied understandings of group fairness questions illustrate that group fairness approaches can be complex and ambiguous. Questions about group fairness definitions should therefore specify whether they concern necessary or sufficient conditions, as well as whether the criterion is to be mandated.
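For concreteness, the two criteria participants were asked about can be expressed as simple gap statistics over a classifier's predictions. The sketch below is illustrative, with made-up predictions and groups; it deliberately leaves open the question participants raised of whether a small gap should be treated as sufficient, necessary, or mandated.

```python
# Minimal sketch of statistical parity (equal high-risk prediction rates across
# groups) and equalized odds (equal true and false positive rates across groups).
# The data below is fabricated for illustration only.
import numpy as np

def _rate(mask):
    return mask.mean() if mask.size else float("nan")

def statistical_parity_gap(y_pred, group):
    """Largest difference in positive prediction rate between any two groups."""
    rates = [_rate(y_pred[group == g]) for g in np.unique(group)]
    return max(rates) - min(rates)

def equalized_odds_gap(y_true, y_pred, group):
    """Largest difference in FPR (label 0) or TPR (label 1) between any two groups."""
    gaps = []
    for label in (0, 1):
        rates = [_rate(y_pred[(group == g) & (y_true == label)])
                 for g in np.unique(group)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(statistical_parity_gap(y_pred, group))            # 0.5
print(equalized_odds_gap(y_true, y_pred, group))        # 0.5
```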
When participants were making pairwise comparisons between cases, we found that some participants perceived the algorithm as making an unfair decision if they thought the algorithm made a mistake on a single case, rather than on a pair of cases. For example, S6 explained why she thought a particular case (currently predicted as low-risk) was treated unfairly by the algorithm: "I would say, like this one is unfair. They have one parent and allegation is parents, substance abuse, it should be a higher risk prediction" (S6). When asked if the decision was reached based on a comparison with a similar case, S6 explained, "I just think it's unfair. This kid only has one parent and the allegation is substance abuse, that would mean that parent is not really able to take care of that kid, and the kid doesn't have any other parents there" (S6). This participant did not identify unfairness by comparing individual case pairs, but rather through a perceived misclassification of a single case by the algorithm.
The notion of individual fairness calls for comparisons between individuals to determine whether they should be treated similarly. However, we found that such comparisons were not always easy for human stakeholders, especially in real-life scenarios where each case included numerous features. Participants explained that objectively comparing the effect of multiple differences between individuals can be challenging: "I think it's tricky to compare things this way,... because of the multiple factor difference. It's hard to say. If you can control all of them and only one is changing, then that might be easier" (P1). We did not measure participants' cognitive load directly; thus, we cannot tell whether the task of comparing individual pairs was mentally taxing. Participants' responses suggest that it would have been easier for them to compare pairs with few differences (e.g., selecting pairs with high similarity in the similarity comparison view) than to compare randomly selected pairs (e.g., random pairs in the case-by-case view).
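One way to make the comparison burden concrete is to represent an elicited similarity judgment as a weighted distance over case attributes, in the spirit of the task-specific similarity metric that individual fairness requires. The sketch below is illustrative only; the attribute names and weights are hypothetical, not values elicited in our study.

```python
# Illustrative sketch: a participant-style similarity metric as a weighted count
# of attribute disagreements between two cases. Names and weights are hypothetical.
def case_distance(case_a, case_b, weights):
    """Weighted number of attributes on which the two cases disagree."""
    return sum(w for attr, w in weights.items() if case_a[attr] != case_b[attr])

weights = {"allegation_type": 3.0, "prior_referrals": 2.0,
           "victim_age_band": 1.5, "region_wealth": 0.5}
case_a = {"allegation_type": "neglect", "prior_referrals": 1,
          "victim_age_band": "0-5", "region_wealth": "low"}
case_b = {"allegation_type": "neglect", "prior_referrals": 0,
          "victim_age_band": "6-12", "region_wealth": "low"}
# A small distance would mean individual fairness asks for similar risk scores.
print(case_distance(case_a, case_b, weights))  # 3.5
```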
In this section, we discuss the implications of our findings for the design, development, and evaluation of future predictive systems for use in child welfare decision-making. While there is existing work on how to directly incorporate pairwise comparison feedback into model training [38, 53], these mechanisms focus on simple aggregation rules. In comparison, our framework elicits richer perspectives, especially the stakeholders' reasoning behind their responses. These findings can potentially lead to more structural changes to risk assessment tools. We discuss three directions for incorporating our findings into model design and deployment.
One interesting takeaway from our study is that broadly accepted group-level indicators of potentially unwarranted disparity within the child welfare system are not unanimously held as indicators of algorithmic bias. For instance, one of the most commonly cited indicators of bias is the over-representation of Black children among those investigated and placed in foster care [22]. In the language of group fairness, these decisions fail to satisfy statistical parity. However, as we note in our results, participants were evenly split on whether they viewed statistical parity as a desirable algorithmic fairness property, especially across different races (see Figure 5b). This finding indicates that common measures of bias in historical decision-making may not constitute reliable fairness metrics for the purpose of algorithm design and evaluation. Our results also do not fully agree with prior work on human perceptions of fairness [68], which suggests that statistical parity most closely matches people's existing notion of fairness. Our work differs from [68] in context and elicitation method: [68] investigates opinions on criminal risk and skin cancer screening by crowdsourcing opinions from the public, whereas our work investigates child welfare by engaging with relevant stakeholders. These differences in methodology may explain the differences between our results and those of [68]. This may also indicate that the appropriate fairness approach depends on both context and stakeholders, which further exemplifies the need for algorithm designers to incorporate relevant stakeholders' fairness viewpoints into the design process. While we found that, for child maltreatment prediction, there was no single group fairness approach that was universally supported by all participants, equalized odds aligned with the existing fairness beliefs of most participants, as they tended to want an algorithm to be as accurate as possible and similarly accurate for as many people as possible.
Our participants' individual fairness feedback can be used in the model re-training process. A starting point is to follow the same approach as in [8, 38] and formulate the pairwise comparison responses as constraints on the predictive model. For example, if a participant indicated that case A should be prioritized over case B, then the model should assign a higher risk score to case A. Our study showed frequent disagreements among participants' responses, so incorporating this individual fairness feedback requires a mechanism to resolve disagreements. Prior work [53] resolves such disagreements with the Borda rule from voting theory; alternatively, one can resolve such disagreements through a deliberation process among stakeholders. Furthermore, our interface provides richer information that enables feedback beyond these pairwise constraints. For example, the similarity interface in Figure 2c elicits from each participant a similarity metric, which can potentially define individual fairness criteria beyond sets of pairwise comparisons. Another important consideration is that we chose not to present all of the attributes used in child welfare screening, since it would have been excessively cognitively demanding for participants to process over 100 attributes. For any given comparison, the response of prioritizing case A over case B can therefore be interpreted as (1) asking for a higher average risk score for all cases like case A, or (2) asking for higher risk scores for typical cases like case A.
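To illustrate, but not to reproduce the cited methods [8, 38, 53], the two ideas in this paragraph can be sketched as (1) a penalty on a risk model whose scores violate an elicited preference, and (2) a simple Borda-style tally for aggregating disagreeing responses. The scores and responses below are hypothetical.

```python
# (1) Turn an elicited preference "case A over case B" into a hinge-style
# penalty on the model's risk scores; such penalties could be added to a
# training loss. Values are made up for illustration.
def pairwise_violation_penalty(score_a, score_b, margin=0.0):
    """Zero when case A scores at least `margin` above case B; positive otherwise."""
    return max(0.0, margin - (score_a - score_b))

print(pairwise_violation_penalty(score_a=0.42, score_b=0.61, margin=0.05))  # 0.24

# (2) A Borda-style tally over disagreeing responses for one case pair:
# each response awards a point to the preferred case; "equal" splits the point.
responses = ["A", "A", "equal", "B", "A", "equal"]
points = {"A": 0.0, "B": 0.0}
for r in responses:
    if r == "equal":
        points["A"] += 0.5
        points["B"] += 0.5
    else:
        points[r] += 1.0
print(points)  # {'A': 4.0, 'B': 2.0} -> aggregate preference: prioritize case A
```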
While the primary focus of our study and interface design is on fairness preference elicitation, our interview protocol enables us to learn about stakeholder perspectives along other dimensions as well. For example, our study also indicates that the child maltreatment risk model and the overall decision process could benefit from the inclusion of additional attributes. From the outset of our study, participants expressed how challenging it was to make prioritization decisions based on the limited information provided. For example, S8 noted that there can be all sorts of reasons a family might need public assistance and, depending on what those reasons are, they could be either a protective factor or a potential indicator that the family is struggling. Access to this additional information may help both the model and the human decision-maker determine the risk in each specific case. Our participants naturally put more emphasis on certain features when evaluating whether one of two cases should be prioritized by the algorithm. Participants universally agreed that victim age, allegation type, and prior referrals were more important when evaluating which case should be prioritized. More generally, in any given context, an automated model selection procedure is prone to produce an algorithm that does not rely on or prioritize many of the features that expert stakeholders believe are the most important. This is problematic because model predictions that frequently disagree with users' perceptions may be viewed as not credible by those users [73]; or, worse, users who frequently disagree with the model predictions may themselves be viewed as not credible. In the present context, an algorithm that is trained to prioritize attributes that stakeholders find important may see greater uptake than one that optimizes for predictive accuracy alone.
Ultimately, a predictive model should be only one step in the child maltreatment screening pipeline. In the end, the final decision is made by human call screeners, who treat the model's prediction as a non-binding recommendation. It is important to examine the fairness of the decisions of the entire socio-technical system, not just the predictive model within it. We argue that it is critical that the fairness viewpoints of these stakeholders be heard and incorporated into the model. At the same time, while it may not be possible for the algorithm to be fair in every possible scenario, it is important for users (e.g., the call screeners in the child welfare context) to recognize the limitations of the model in order to make a final decision informed by, but not bound to, the model's recommendation.
As with any study, it is important to note the limitations of this work. Since this is a qualitative study, the insights we report represent only the fairness viewpoints of the 12 participants we interviewed. Other stakeholders may hold different opinions from those of our participants, and the proportions of responses in our results may not reflect exact population distributions. Nevertheless, our results highlight that participants did not unanimously agree on what should be considered fair, and they offered qualitatively different reasoning for their responses. For ethical reasons, we did not interview parents who have interacted with or are interacting with the child welfare system in Allegheny County, PA. Therefore, the demographics of our parent sample may not reflect the demographics of parents in the child welfare system. In addition, while children are also stakeholders of the child welfare system, we were unable to interview children directly due to the ethical concerns of asking children sensitive questions and their limited understanding of the system. Finally, we primarily interviewed social workers with casework experience. Further studies that aim to capture the beliefs of all stakeholders within a child welfare department should likely include more supervisors, administrators, and executives as well.
In this work, we present a general framework to elicit stakeholders' subjective fairness notions regarding algorithmic systems. We evaluate our framework on a child maltreatment predictive system and conduct a user study with relevant stakeholders. The interviews provide us with a comprehensive understanding of stakeholders' perspectives on algorithmic fairness. We find that equalized odds is the slightly more preferred group fairness approach for child maltreatment risk assessment, but that stakeholders do not unanimously agree on individual fairness comparisons. We relate our findings to how stakeholders' feedback can be incorporated into the design and development of algorithmic predictive systems.
ACKNOWLEDGMENTS
We thank our anonymous reviewers and colleagues from GroupLens Research at the University of Minnesota and the HCI Institute at Carnegie Mellon University for their feedback. This work was supported by the National Science Foundation (NSF) under Award No. 2001851, 2000782, and 1952085, the NSF Program on Fairness in AI in collaboration with Amazon under Award No. 1939606, and a J.P. Morgan Faculty Award.
REFERENCES
[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. 2018. A Reductions Approach to Fair Classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. 60–69. http://proceedings.mlr.press/v80/agarwal18a.html
[2] Alekh Agarwal, Miroslav Dudík, and Zhiwei Steven Wu. 2019. Fair Regression: Quantitative Definitions and Reduction-Based Algorithms. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. 120–129. http://proceedings.mlr.press/v97/agarwal19d.html
[3] Yongsu Ahn and Yu-Ru Lin. 2019. FairSight: Visual analytics for fairness in decision making. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 1086–1095.
[4] E. P. Apfelbaum, K. Pauker, S. R. Sommers, and N. Ambady. 2010. In blind pursuit of racial equality? Psychological Science 21 (2010), 1587–1592. https://doi.org/10.1177/0956797610384741
[5] Edmond Awad, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and Iyad Rahwan. 2018. The Moral Machine experiment. Nature 563 (2018), 59–64. https://doi.org/10.1038/s41586-018-0637-6
[6] Caroline Balagot, Hector Lemus, Megan Hartrick, Tamera Kohler, and Suzanne P Lindsay. 2019. The homeless Coordinated Entry System: the VI-SPDAT and other predictors of establishing eligibility for services for single homeless adults. Journal of Social Distress and the Homeless 28, 2 (2019), 149–157.
[7] Geoffrey Barnes and Jordan M Hyatt. 2012. Classifying adult probationers by forecasting future offending. (2012).
[8] Yahav Bechavod, Christopher Jung, and Zhiwei Steven Wu. 2020. Metric-Free Individual Fairness in Online Learning. CoRR abs/2002.05474 (2020). arXiv:2002.05474 https://arxiv.org/abs/2002.05474
[9] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, A Mojsilović, et al. 2019. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development 63, 4/5 (2019), 4–1.
[10] Reuben Binns, Max Van Kleek, Michael Veale, Ulrik Lyngs, Jun Zhao, and Nigel Shadbolt. 2018. 'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 377.
[11] Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward Algorithmic Accountability in Public Services: A Qualitative Study of Affected Community Perspectives on Algorithmic Decision-making in Child Welfare Services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019. 41. https://doi.org/10.1145/3290605.3300271
[12] Ernest W. Burgess. 1928. Factors determining success or failure on parole. In The Workings of the Indeterminate-Sentence Law and Parole System in Illinois, A. A. Bruce, A. J. Harno, E. W. Burgess, and J. Landesco (Eds.). Springfield, IL: State Board of Parole, 221–234.
[13] Ángel Alexander Cabrera, Will Epperson, Fred Hohman, Minsuk Kahng, Jamie Morgenstern, and Duen Horng Chau. 2019. FairVis: Visual analytics for discovering intersectional bias in machine learning. IEEE, 46–56.
[14] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1721–1730.
[15] Kathy Charmaz. 2014. Constructing Grounded Theory. Sage.
[16] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
[17] Sam Corbett-Davies and Sharad Goel. 2018. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. CoRR abs/1808.00023 (2018). arXiv:1808.00023 http://arxiv.org/abs/1808.00023
[18] Maria De-Arteaga, Riccardo Fogliato, and Alexandra Chouldechova. 2020. A Case for Humans-in-the-Loop: Decisions in the Presence of Erroneous Algorithmic Scores. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[19] Sarah Dean, Mihaela Curmei, and Benjamin Recht. 2020. Designing Recommender Systems with Reachability in Mind. Workshop on Participatory Approaches to Machine Learning (2020).
[20] Matthew DeMichele, Peter Baumgartner, Michael Wenger, Kelle Barrick, Megan Comfort, and Shilpi Misra. 2018. The public safety assessment: A re-validation and assessment of predictive utility and differential prediction by race and gender in Kentucky. Available at SSRN 3168452 (2018).
[21] Sarah L Desmarais, Samantha A Zottola, Sarah E Duhart Clarke, and Evan M Lowder. 2020. Predictive Validity of Pretrial Risk Assessments: A Systematic Review of the Literature. Criminal Justice and Behavior (2020), 0093854820932959.
[22] Alan J Dettlaff. 2014. The evolving understanding of disproportionality and disparities in child welfare. In Handbook of Child Maltreatment. Springer, 149–168.
[23] Jonathan Dodge, Q. Vera Liao, Yunfeng Zhang, Rachel K. E. Bellamy, and Casey Dugan. 2019. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In Proceedings of the 24th International Conference on Intelligent User Interfaces (Marina del Ray, California) (IUI '19). ACM, New York, NY, USA, 275–285. https://doi.org/10.1145/3301275.3302310
[24] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through awareness. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012. 214–226. https://doi.org/10.1145/2090236.2090255
[25] Virginia Eubanks. 2018. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin's Press.
[26] Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. On the (im)possibility of fairness. arXiv preprint arXiv:1609.07236 (2016).
[27] Jeremy D Goldhaber-Fiebert and Lea Prince. 2019. Impact Evaluation of a Predictive Risk Modeling Tool for Allegheny County's Child Welfare Office. Pittsburgh: Allegheny County (2019).
[28] Nina Grgic-Hlaca, Elissa M. Redmiles, Krishna P. Gummadi, and Adrian Weller. 2018. Human Perceptions of Fairness in Algorithmic Decision Making: A Case Study of Criminal Risk Prediction. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW '18). International World Wide Web Conferences Steering Committee, 903–912. https://doi.org/10.1145/3178876.3186138
[29] Nina Grgić-Hlača, Muhammad Bilal Zafar, Krishna P. Gummadi, and Adrian Weller. 2018. Beyond distributive fairness in algorithmic decision making: Feature selection for procedurally fair learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
[30] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 3315–3323. http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning
[31] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. 2018. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. 1944–1953. http://proceedings.mlr.press/v80/hebert-johnson18a.html
[32] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miroslav Dudík, and Hanna M. Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019. 600. https://doi.org/10.1145/3290605.3300830
[33] Christina Ilvento. 2020. Metric Learning for Individual Fairness. In 1st Symposium on Foundations of Responsible Computing (FORC 2020), Aaron Roth (Ed.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2:1–2:11. https://doi.org/10.4230/LIPIcs.FORC.2020.2
[34] Abby Everett Jaques. 2019. Why the Moral Machine Is a Monster. In University of Miami Law School: We Robot Conference. https://robots.law.miami.edu/2019/wp-content/uploads/2019/03/MoralMachineMonster.pdf
[35] Will Johnson. 2004. Effectiveness of California's child welfare structured decision making (SDM) model: a prospective study of the validity of the California Family Risk Assessment. Madison (Wisconsin, USA): Children's Research Center (2004).
[36] Caroline M Johnston, Simon Blessenohl, and Phebe Vayanos. [n.d.]. Preference Elicitation and Aggregation to Aid with Patient Triage during the COVID-19 Pandemic. Workshop on Participatory Approaches to Machine Learning ([n.d.]).
[37] Matthew Joseph, Michael J. Kearns, Jamie H. Morgenstern, and Aaron Roth. 2016. Fairness in Learning: Classic and Contextual Bandits. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 325–333. http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits
[38] Christopher Jung, Michael J. Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. 2019. Eliciting and Enforcing Subjective Individual Fairness. CoRR abs/1905.10660 (2019). arXiv:1905.10660 http://arxiv.org/abs/1905.10660
[39] Christopher Jung, Michael J. Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. 2019. Eliciting and Enforcing Subjective Individual Fairness. CoRR abs/1905.10660 (2019). arXiv:1905.10660 http://arxiv.org/abs/1905.10660
[40] Anson Kahng, Min Kyung Lee, Ritesh Noothigattu, Ariel D. Procaccia, and Christos-Alexandros Psomas. 2019. Statistical Foundations of Virtual Democracy. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. 3173–3182. http://proceedings.mlr.press/v97/kahng19a.html
[41] Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
[42] Michael J. Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. 2569–2577. http://proceedings.mlr.press/v80/kearns18a.html
[43] Danielle Leah Kehl and Samuel Ari Kessler. 2017. Algorithms in the criminal justice system: Assessing the use of risk assessments in sentencing. (2017).
[44] Amir E Khandani, Adlar J Kim, and Andrew W Lo. 2010. Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance 34, 11 (2010), 2767–2787.
[45] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics.
[47] Felicitas Kraemer, Kees van Overveld, and Martin Peterson. 2011. Is there an ethics of algorithms? Ethics and Information Technology 13 (2011), 251–260.
[48] Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 29–38.
[49] Min Kyung Lee. 2018. Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management. Big Data & Society 5, 1 (2018), 2053951718756684.
[50] Min Kyung Lee and Su Baykal. 2017. Algorithmic mediation in group decisions: Fairness perceptions of algorithmically mediated vs. discussion-based social division. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 1035–1048.
[51] Min Kyung Lee, Anuraag Jain, Hea Jin Cha, Shashank Ojha, and Daniel Kusbit. 2019. Procedural justice in algorithmic fairness: Leveraging transparency and outcome control for fair algorithmic mediation. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–26.
[52] Min Kyung Lee, Ji Tae Kim, and Leah Lizarondo. 2017. A human-centered approach to algorithmic services: Considerations for fair and motivating smart community service management that allocates donations to non-profit organizations. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 3365–3376.
[53] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Seong Tae, Xinran Yuan, A. D. C. Chan, Ritesh Noothigattu, Daniel See, Siheon Lee, Christos-Alexandros Psomas, and Ariel D. Procaccia. 2018. WeBuildAI: Participatory Framework for Fair and Efficient Algorithmic Governance.
[54] David B Marshall and Diana J English. 2000. Neural network modeling of risk assessment in child protective services. Psychological Methods 5, 1 (2000), 102.
[55] John Monahan. 2017. Risk assessment in sentencing. Academy for Justice, a Report on Scholarship and Criminal Justice Reform (Erik Luna ed., 2017, Forthcoming) (2017).
[56] John Monahan, Anne Metz, and Brandon L Garrett. 2018. Judicial Appraisals of Risk Assessment in Sentencing. (2018).
[57] Arvind Narayanan. 2018. Translation tutorial: 21 fairness definitions and their politics. In Proc. Conf. Fairness Accountability Transp., New York, USA.
[58] Jakob Nielsen. 1994. Usability Engineering. Morgan Kaufmann.
[59] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. 2008. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 560–568.
[60] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon M. Kleinberg, and Kilian Q. Weinberger. 2017. On Fairness and Calibration. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 5684–5693. http://papers.nips.cc/paper/7151-on-fairness-and-calibration
[61] Ritesh Noothigattu, Snehalkumar 'Neil' S. Gaikwad, Edmond Awad, Sohan Dsouza, Iyad Rahwan, Pradeep Ravikumar, and Ariel D. Procaccia. 2018. A voting-based system for ethical decision making. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).
[62] Samantha Robertson and Niloufar Salehi. 2020. What If I Don't Like Any Of The Choices? The Limits of Preference Elicitation for Participatory Algorithm Design. arXiv preprint arXiv:2007.06718 (2020).
[63] Debjani Saha, Candice Schumann, Duncan C. McElfresh, John P. Dickerson, Michelle L. Mazurek, and Michael Carl Tschantz. 2020. Measuring Non-Expert Comprehension of Machine Learning Fairness Metrics. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Vienna, Austria, July 12-18, 2020.
[64] Nripsuta Ani Saxena, Karen Huang, Evan DeFilippis, Goran Radanovic, David C. Parkes, and Yang Liu. 2019. How Do Fairness Definitions Fare?: Examining Public Attitudes Towards Algorithmic Definitions of Fairness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. ACM, 99–106.
[65] Nicholas Scurich and John Monahan. 2016. Evidence-based sentencing: Public openness and opposition to using gender, age, and race as risk factors for recidivism. Law and Human Behavior 40, 1 (2016), 36.
[66] Hetan Shah. 2018. Algorithmic accountability. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.
[67] Journal of Asynchronous Learning Networks 16, 3 (2012), 51–61.
[68] Megha Srivastava, Hoda Heidari, and Andreas Krause. 2019. Mathematical notions vs. human perception of fairness: A descriptive approach to fairness for machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[70] Michael Veale, Max Van Kleek, and Reuben Binns. 2018. Fairness and Accountability Design Needs for Algorithmic Support in High-Stakes Public Sector Decision-Making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–14.
[71] Sahil Verma and Julia Rubin. 2018. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare). IEEE, 1–7.
[72] AJ Wang. 2018. Procedural Justice and Risk-Assessment Algorithms. (2018).
[73] Jiaxuan Wang, Jeeheh Oh, Haozhu Wang, and Jenna Wiens. 2018. Learning credible models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2417–2426.
[74] Ruotong Wang, F Maxwell Harper, and Haiyi Zhu. 2020. Factors Influencing Perceived Fairness in Algorithmic Decision-Making: Algorithm Outcomes, Development Procedures, and Individual Differences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–14.
[75] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viégas, and Jimbo Wilson. 2019. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 56–65.
[76] Meredith Whittaker, Kate Crawford, Roel Dobbe, Genevieve Fried, Elizabeth Kaziunas, Varoon Mathur, Sarah Myers West, Rashida Richardson, Jason Schultz, and Oscar Schwartz. 2018. AI Now Report 2018. AI Now Institute at New York University.
[77] Pak-Hang Wong. 2020. Democratizing Algorithmic Fairness. Philosophy & Technology 33 (2020), 225–244.
[78] Allison Woodruff, Sarah E Fox, Steven Rousso-Schindler, and Jeffrey Warshaw. 2018. A qualitative exploration of perceptions of algorithmic fairness. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 656.
[79] Bowen Yu, Ye Yuan, Loren Terveen, Zhiwei Steven Wu, Jodi Forlizzi, and Haiyi Zhu. 2020. Keeping Designers in the Loop: Communicating Inherent Algorithmic Trade-offs Across Multiple Objectives. In Proceedings of the 2020 ACM Designing Interactive Systems Conference. 1245–1257.
[80] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P. Gummadi. 2017. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In Proceedings of the 26th International Conference on World Wide Web, WWW. ACM, 1171–1180.
[81] Haiyi Zhu, Bowen Yu, Aaron Halfaker, and Loren Terveen. 2018. Value-sensitive algorithm design: Method, case study, and lessons. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 194.