Biased Programmers? Or Biased Data? A Field Experiment in Operationalizing AI Ethics
Bo Cowgill and Fabrizio Dell'Acqua
Graduate School of Business, Columbia University
{bo.cowgill,fdellacqua21}@gsb.columbia.edu

Samuel Deng, Daniel Hsu, Nakul Verma, Augustin Chaintreau
Department of Computer Science, Columbia University
[email protected], {djhsu,verma,augustin}@cs.columbia.edu
Abstract
Why do biased predictions arise? What interventions can prevent them? We evaluate 8.2 million algorithmic predictions of math performance from ≈400 AI engineers, each of whom developed an algorithm under a randomly assigned experimental condition. Our treatment arms modified programmers' incentives, training data, awareness, and/or technical knowledge of AI ethics. We then assess out-of-sample predictions from their algorithms using randomized audit manipulations of algorithm inputs and ground-truth math performance for 20K subjects. We find that biased predictions are mostly caused by biased training data. However, one-third of the benefit of better training data comes through a novel economic mechanism: Engineers exert greater effort and are more responsive to incentives when given better training data. We also assess how performance varies with programmers' demographic characteristics, and their performance on a psychological test of implicit bias (IAT) concerning gender and careers. We find no evidence that female, minority, and low-IAT engineers exhibit lower bias or discrimination in their code. However, we do find that prediction errors are correlated within demographic groups, which creates performance improvements through cross-demographic averaging. Finally, we quantify the benefits and tradeoffs of practical managerial or policy interventions such as technical advice, simple reminders, and improved incentives for decreasing algorithmic bias. Our full results are available at https://ssrn.com/abstract=3615404.
Our submission is a 4-page summary to comply with the workshop page limits. Our full results are available at https://ssrn.com/abstract=3615404.
Why do biased predictions arise, and what interventions can prevent them? Across a wide variety of theoretical models of behavior, biased predictions are responsible for demographic segregation and outcome disparities in settings including labor markets, criminal justice, and advertising. Although many classic theoretical models feature decision-makers with accurate (if discriminating) statistical predictors (Phelps, 1972; Arrow, 1973), empirical evidence often shows that predictions are systematically inaccurate in practice (Bohren et al., 2019). How do these biased predictions arise? What theoretical mechanisms produce them, and what practical interventions can reduce prediction bias?
Navigating the Broader Impacts of AI Research Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020).

In this paper, we address these questions through a field experiment applying machine learning to predict workers' performance. Automated hiring is a leading concern of policymakers questioning the ethics and fairness of AI systems. Research and public discourse on this topic have grown enormously in the past five years, along with a growth in programs to introduce ethics into technical training. However, few studies have attempted to evaluate, audit, or learn from these interventions or connect them back to theory. This paper aims to step in that direction.

We examine the formation of biased predictions in a unique field experiment about the development of AI technology. Our subjects are approximately 400 machine learning engineers. Our experiment gives us a direct view of prediction technology while it is being assembled. We then assess the resulting predictive algorithms using ground truth outcomes, and using randomized audit-like manipulations of algorithmic inputs. This setting creates measurement opportunities that would be impossible for learning processes in other settings, allowing us to study the mechanisms behind biased learning and prediction more directly. Through randomized treatments, we show how policy or managerial interventions can change the formation of biased predictions.
We conducted our experiment in two empirical settings. Approximately 80% of our subjects were participants in a machine learning bootcamp at a large research university. The bootcamp taught machine learning programming techniques at a CS master's or advanced undergraduate level. We conducted our experiment in this setting twice: once in the Spring and once in the Fall version of the bootcamp.

Over half the bootcamp participants had already graduated; approximately half had undergraduate degrees and half had (or were pursuing) MS or Ph.D. degrees. The average participant had 1.2 years of work experience (median 0.67 years) at the time of the experiment. These are attractive research subjects for our topic. Students from this program are frequently hired to work at large Internet companies such as Facebook and Google; the algorithms they will develop in the future may plausibly affect billions of Internet users. At the time of the experiment, 31% of the bootcamp participants had already been employed by a household-name company as a software engineer.

The programming task we studied was a graded homework assignment, and performance was highly incentivized and competitive. Although this setting has many attractive qualities for our experiment, it required some design adaptations to address the possibility of cheating and contamination between treatment groups (discussed in our full manuscript).

To complement these relatively inexperienced subjects, we recruited a second population: freelance machine learning engineers. Using an online platform, we created a private job listing and randomly invited 60 engineers (20% of total subjects) from the machine learning section of the platform, restricting invitations to engineers who had at least 1.5 years of work experience, were based inside the United States, and whose rates were between $20 and $80. The average contractor in our study was more experienced than our bootcamp programmers, with 3.89 years of work experience before our experiment (median 3.16 years).

Our full manuscript summarizes the characteristics of both sets of engineers in our experiment. Our engineers are 71% male, 29% female, 28% White, 52% East Asian, 15% Indian, and only about 5% Black or Latino/a/x. Like the broader population of engineers, our subject population lacks diversity in key characteristics. However, our sample is slightly more diverse than the US software engineering population as a whole.

Task. All subjects in our experiment were assigned the same job: develop an algorithm to predict math performance from biographical features on a job application, and apply it to 20,000 new individuals who do not appear in the training data. This task allows us to study a fundamental mechanism behind biased performance predictions. (There might be other biases in algorithms besides prediction bias, which we do not discuss in this paper.)

For reasons we discuss in our full paper, math is an attractive topic for empirical studies of algorithmic bias. As training data, engineers were given a sample of the OECD's Programme for the International Assessment of Adult Competencies ("PIAAC") dataset. PIAAC is the canonical dataset in this area, and it solves several critical research challenges for algorithmic bias researchers.
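To make the task concrete, the following is a minimal sketch of what each engineer was asked to produce, assuming tabular training data with illustrative column names. The file names, features, and model choice here are hypothetical placeholders, not the actual materials engineers received:

```python
# Illustrative sketch of the prediction task (file names, features, and
# model choice are hypothetical, not the engineers' actual pipelines).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("piaac_train.csv")    # hypothetical PIAAC training sample
holdout = pd.read_csv("holdout_20k.csv")  # 20,000 new individuals, no labels

features = ["age", "gender", "education", "employment_status"]  # illustrative
X_train = pd.get_dummies(train[features])
y_train = train["math_score"]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Align one-hot columns between training and holdout data before predicting.
X_holdout = pd.get_dummies(holdout[features]).reindex(
    columns=X_train.columns, fill_value=0)
holdout["predicted_math"] = model.predict(X_holdout)
```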
Our paper features high-quality performance metrics about two groups of subjects. The first is our 400 AI programmers, and the second is the OECD test subjects. Performance for both groups can be measured objectively, avoiding the typical pitfalls of "outcome tests" such as subjective performance criteria.

For the math subjects, performance on a standardized test is available for a professionally-weighted sample of the entire OECD population, and not a limited, non-random subset. For the engineers, predictive outcomes are available for every engineer in our sample. For our engineering sample, we also have natural ways to aggregate individual performance to estimate team contributions. We can measure how correlated each engineer's model is with potential teammates', and how well a simple average (or more complicated aggregation) of two engineers' predictions performs as an ensemble.
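A minimal sketch of this aggregation exercise, assuming each engineer's predictions and the ground truth are available as NumPy arrays (the paper's actual aggregation may be more elaborate): the lower the correlation between two engineers' errors, the more their averaged ensemble improves over either member alone.

```python
# Sketch: error correlation between two engineers and the benefit of averaging.
import numpy as np

def ensemble_gain(pred_a, pred_b, y_true):
    """Return within-pair error correlation and MSEs of each model and their average."""
    err_a, err_b = pred_a - y_true, pred_b - y_true
    rho = np.corrcoef(err_a, err_b)[0, 1]  # correlation of prediction errors
    mse_a = np.mean(err_a ** 2)
    mse_b = np.mean(err_b ** 2)
    mse_avg = np.mean(((pred_a + pred_b) / 2 - y_true) ** 2)
    return rho, mse_a, mse_b, mse_avg
```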
Experimental Treatment Arms. Our research design included four main treatment arms. The first was a control group in which engineers were given PIAAC data featuring realistic sample selection problems in the training data; we therefore label this group as receiving "biased training data." The second, randomly-selected group received PIAAC data featuring no sample selection problems. In our remaining two treatments, engineers were given the first group's training data (the biased training data, featuring sample selection problems), but these groups were also given policy interventions. The third group was given a non-technical reminder about the possibility of algorithmic bias. The fourth was given this reminder as well as a simple, jargon-free white paper about sample selection correction methods in machine learning (Zadrozny et al., 2003; Chawla et al., 2002).
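We do not reproduce the white paper here, but one standard correction in this family works by reweighting: estimate each training row's probability of having been selected into the sample, then upweight underrepresented rows. Below is a hedged sketch under the assumption that the biased sample (X_sel, y_sel) and an unbiased reference sample of features (X_ref) are available as arrays; the white paper's actual recommendations may differ from this example:

```python
# Sketch of inverse-probability reweighting for sample selection
# (one method in the family the white paper covered; details are assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def reweighted_fit(X_sel, y_sel, X_ref):
    """Fit a regressor on a selected sample, weighting by inverse selection probability."""
    # Label rows 1 if they appear in the selected (biased) training sample,
    # 0 if they come from the unbiased reference population.
    X = np.vstack([X_sel, X_ref])
    s = np.concatenate([np.ones(len(X_sel)), np.zeros(len(X_ref))])
    p_sel = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X_sel)[:, 1]
    weights = 1.0 / np.clip(p_sel, 1e-3, None)  # upweight underrepresented rows
    return GradientBoostingRegressor().fit(X_sel, y_sel, sample_weight=weights)
```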
Subtreatments: Performance Incentives. Within each of the above treatments, subjects were randomized into varying performance incentive schemes. The goal of this randomization was to measure the effectiveness of using incentives to reduce algorithmic bias.
Audit Manipulations. Our design also features an audit-like manipulation of algorithmic inputs. As part of our test data, we ask engineers to evaluate candidates whose characteristics are identical, except that a single covariate (gender) has changed. We can then compare the predicted outcomes of identical candidates who differ only in their gender. Because the gender of the OECD subjects is accessible to the engineers' programs, we can avoid the issues raised by manipulating first names. The outcomes of all predictions are far richer than in a typical audit study; we estimate a continuous measure of predicted math performance that spans the entire spectrum.

Because our design features randomization both on screeners (programmers) and on candidates (subjects evaluated by the resume), our design resembles a two-sided audit (Agan et al., 2019; Cowgill and Perkowski, 2019). In these experiments, researchers hire professional recruiters to select resumes under experimental conditions. The recruiters are then asked to evaluate job applications with audit-like manipulations of names and characteristics. Our design is essentially identical, but we use software engineers to build algorithms for decision-making, rather than human screeners.
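In code, the manipulation amounts to scoring each test row twice, once under each value of the gender covariate, and comparing the predictions. A minimal sketch, assuming a fitted model and a binary-encoded gender column (the column name and encoding are illustrative):

```python
# Sketch of the audit-style manipulation: compare predictions for otherwise
# identical candidates whose gender covariate has been flipped.
import numpy as np
import pandas as pd

def gender_audit_gap(model, X_test: pd.DataFrame) -> float:
    """Mean prediction gap between female- and male-coded versions of each row."""
    X_f = X_test.copy()
    X_f["gender"] = 1  # female-coded (assumes a hypothetical binary 0/1 column)
    X_m = X_test.copy()
    X_m["gender"] = 0  # male-coded
    return float(np.mean(model.predict(X_f) - model.predict(X_m)))
```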
Engineer Demographics. Our subject population contained substantial variation in gender, race, and other demographic characteristics. We utilized this diversity to measure whether demographically non-traditional programmers were more likely to notice and reduce algorithmic bias, and whether prediction errors were correlated within demographic groups.
We find that biased predictions are mostly caused by biased training data. Having access to better training data helps engineers lower bias. One-third of the benefit of better training data comes through a novel economic mechanism: Engineers exert greater effort and are more responsive to incentives when given better training data. Our cleaned version of this data, ready for use by other researchers, is available at https://doi.org/10.7910/DVN/JAJ3CP.
As algorithms spread in influence, concern has grown about bias. However, the root causes of algorithmic bias are often unclear. In public discourse and academic literature, two theories of algorithmic bias have gained prominence. The first theory emphasizes "Biased Training Data": because machine learning applications are developed using historical data about outcomes, the resulting algorithms would reflect and perpetuate any bias in the real world. A second theory emphasizes another factor, "Biased Programmers": programmers are highly non-representative and may exhibit biases (consciously or otherwise) that are passed on to the algorithms they write.

Both of these theories are likely contributors to algorithmic bias. However, the two theories require different policy solutions. In this paper, we seek to measure the relative contributions of biased data and biased programmers.

Many of our results may be specific to our particular setting and interventions. Our paper should not be the final word on any of these topics, and there might be other sources of algorithmic bias besides prediction bias which we do not discuss in this paper. However, we do believe that empirical and experimental studies offer a novel, important, and underutilized perspective on algorithmic bias. Questions about algorithmic bias are often framed as theoretical computer science problems. However, productionized algorithms are developed by humans working inside organizations, subject to training, persuasion, culture, incentives, and implementation frictions. An empirical, field-experimental approach is also useful for evaluating practical policy solutions. As an example, the experiment above tests several plausible managerial and educational interventions for reducing algorithmic bias. Our approach is similar to "empirical software engineering" approaches (Shull et al., 2007; Wohlin et al., 2012); however, that field has been less likely to study algorithmic bias and fairness topics.

The empirics of algorithmic bias also present unique opportunities for discrimination researchers. Algorithms are increasingly influential in core areas of discrimination, including hiring, lending, and criminal justice. In addition, algorithms offer unique measurement opportunities that would be impossible in other economic settings. The "source code" for human-driven decision-making (or even a random sample of outputs) is rarely available to any researcher. By contrast, algorithmic settings allow inequality researchers to study the mechanisms behind discrimination more directly and transparently. This paper attempts to shed light on these processes, and their relationship to realized prediction bias in algorithms, using a large, preregistered randomized controlled trial.
References
Agan, Amanda, Bo Cowgill, and Laura Gee, "The Effects of Salary History Bans: Evidence from a Field Experiment," Working paper, 2019.

Arrow, Kenneth, "The theory of discrimination," Discrimination in Labor Markets, 1973, (10), 3–33.

Bohren, J. Aislinn, Kareem Haggag, Alex Imas, and Devin G. Pope, "Inaccurate statistical discrimination," Technical Report, National Bureau of Economic Research, 2019.

Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 2002, 16, 321–357.

Cowgill, Bo and Patryk Perkowski, "Agency and Homophily: Evidence from a Two-Sided Audit," Working paper, 2019.

Phelps, Edmund S., "The statistical theory of racism and sexism," The American Economic Review, 1972, 62(4), 659–661.

Shull, Forrest, Janice Singer, and Dag I. K. Sjøberg, Guide to Advanced Empirical Software Engineering, Springer, 2007.

Wohlin, Claes, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén, Experimentation in Software Engineering, Springer Science & Business Media, 2012.