On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach∗

Junpei Komiyama†  Shunya Noda‡

First Draft: October 2, 2020. Current Version: October 29, 2020
Abstract
We analyze statistical discrimination using a multi-armed bandit model where myopic firms face candidate workers arriving with heterogeneous observable characteristics. The association between the worker's skill and characteristics is unknown ex ante; thus, firms need to learn it. In such an environment, laissez-faire may result in a highly unfair and inefficient outcome—myopic firms are reluctant to hire minority workers because the lack of data about minority workers prevents accurate estimation of their performance. Consequently, minority groups could be perpetually underestimated—they are never hired, and therefore, data about them are never accumulated. We prove that this problem becomes more serious when the population ratio is imbalanced, as is the case in many extant discrimination problems. We consider two affirmative-action policies for solving this dilemma: one is a subsidy rule based on the popular upper confidence bound (UCB) algorithm, and the other is the Rooney Rule, which requires firms to interview at least one minority worker for each hiring opportunity. Our results indicate that temporary affirmative actions are effective against statistical discrimination caused by data insufficiency.
JEL Codes:
C44, D82, D83, J71
Keywords:
Statistical Discrimination, Affirmative Action, Multi-Armed Bandit, Social Learning, Strategic Experimentation

∗ We are grateful to Itai Ashlagi, Tomohiro Hara, Yoko Okuyama, Masayuki Yagasaki, and all the participants of Happy Hour Seminar! for helpful comments. All remaining errors are our own.
† Leonard N. Stern School of Business, New York University, 44 West 4th Street, New York, NY 10012, United States. E-mail: [email protected].
‡ Vancouver School of Economics, University of British Columbia, 6000 Iona Dr, Vancouver, BC V6T 1L4, Canada. E-mail: [email protected]. Noda has been supported by the Social Sciences and Humanities Research Council of Canada.

Introduction
Statistical discrimination refers to discrimination against minority groups practiced by fully rational and non-prejudiced agents. In contrast to taste-based discrimination (Becker, 1957), which regards agents' preferences (e.g., racism, sexism) as the primary source of discrimination, the model of statistical discrimination does not assume a preference for hating discriminated groups. Previous studies have shown that, even in the absence of prejudice, discrimination can occur persistently for various reasons, such as the discouragement of human capital investment (Arrow, 1973; Foster and Vohra, 1992; Coate and Loury, 1993; Moro and Norman, 2004), information friction (Phelps, 1972; Cornell and Welch, 1996), and search friction (Mailath, Samuelson, and Shaked, 2000). The literature has proposed a variety of affirmative-action policies to address statistical discrimination, and many of them are implemented in practice.

The contribution of this paper is to articulate a new channel of statistical discrimination—underestimation of minority workers that appears as a consequence of social learning. Most of the extant literature focuses on the behavior of rational agents in an equilibrium where agents have a correct belief about the relationship between observable characteristics and unobservable skills. However, several empirical studies have shown that real-world people often hold biased beliefs about minority groups. The aim of this study is to endogenize the evolution of such biased beliefs and analyze their consequences. In our model, (i) all firms (decision makers) are fully rational and non-prejudiced (i.e., they attempt to hire the most productive worker), and (ii) all workers are ex ante symmetric.
We show that, even in such an environment, a biased belief can be generated endogenously and persist in the long run. For illustration, we use the terminology of hiring markets (while our model is applicable to other settings as well). However, the true statistical relationship between characteristics and actual skills is not observed directly; thus, firms need to learn it from data about past hiring cases. Firms tend to have insufficient data about minority groups because (i) minority groups are literally a "minority" (in terms of the population), and (ii) they have been discriminated against and not hired in the past. The lack of data makes it difficult to assess the skills of minority workers. Hence, it tends to be safer and more profitable to hire a majority candidate, whose skill is accurately estimable.

According to Moro's (2009) definition, statistical discrimination is a theory of inequality between demographic groups based on stereotypes that do not arise from prejudice or racial and gender bias. Although some previous studies of statistical discrimination consider the consequences of exogenously endowed biased beliefs (e.g., Bohren, Haggag, Imas, and Pope, 2019a; Bohren, Imas, and Rosenberg, 2019b; Monachou and Ashlagi, 2019), we additionally require that agents are fully rational. De Paola, Scoppa, and Lombardo (2010) analyze an Italian local administration record in which a gender quota was introduced for a short period (1993-1995); the quota increased the representation of women politicians even after it was terminated. Battaglini, Harris, and Patacchini (2020) show that increased professional exposure to women judges promotes the hiring of other women judges, and hypothesize that this is due to a reinforced belief in women's professional capabilities. Bohren et al. (2019a) show that there is a widespread misconception about the mathematical competence of American people.
This situation persists because no firm is willing to "experiment" with hiring minorities. Consequently, minority workers may never be hired, and firms may miss many skillful workers from the minority group. We also show that some temporary affirmative-action policies can effectively prevent this form of discrimination.

We develop a multi-armed bandit model of social learning, in which many myopic and short-lived firms sequentially make hiring decisions. In each round, a firm faces multiple candidate workers. Each firm wants to hire only one person. Each firm's utility is determined by the hired worker's skill, which cannot be observed directly until employment. However, as in the standard statistical discrimination model, each worker also has an observable characteristic that is associated with the worker's hidden skill. In the beginning, no one knows the precise way to interpret the worker's observable characteristic for predicting that skill. Hence, firms first need to learn the relationship between the characteristic and the skill, and then apply the statistical model to evaluate the predicted skill of workers. We assume that firms submit all the information about their hiring cases to a public database; therefore, each firm can observe all the past hiring cases (the characteristics and skills of all the workers actually hired in the past).

Observable characteristics can be very informative in one's career. Wang, Zhang, Posse, and Bhasin (2013) show that the CV is very informative for predicting whether a software engineer switches to another senior position within three years.

Each worker belongs to a group that represents the worker's gender, race, and ethnicity. We assume that the characteristics of workers who belong to different groups should be interpreted differently. This assumption is realistic. First, previous studies have revealed that underrepresented groups receive unfairly low evaluations in many places. When the observable characteristic
is an evaluation provided by an outside rater, the characteristic information itself could be biased because of the prejudice of the rater. Second, evaluations may also reflect differences in culture, living environment, and social systems. For example, firms must be familiar with the custom of writing recommendation letters to interpret letters correctly. Hence, observable characteristics (curriculum vitae, exam scores, grading reports, recommendation letters, teaching evaluations, etc.) may carry very different implications even when their appearances are similar. If firms are impartial and aware of these biases, they should adjust the way they interpret the characteristics by applying different statistical models for different groups.

Firms are typically less knowledgeable about minority workers. In many cases, discriminated groups are literally "minorities," and therefore the number of candidate workers itself tends to be smaller. Furthermore, even when discriminated groups are demographically a majority, they might not have been hired in the past for historical reasons. Hence, compared with majority workers, the data about minority workers are often insufficient.

The lack of data results in inaccurate predictions of minority workers' skills, and the inaccuracy discourages firms from hiring minorities. Many workers apply for each job opening. To get hired, a worker must have the highest predicted skill. Once the minority group is underestimated, it is difficult for a minority worker to appear to be the best candidate—even if the true skill is the highest, the firm will not be convinced. Underestimation rarely happens once society acquires a sufficiently rich data set. However, in an early stage of the game, the minority group may be underestimated due to bad realizations of the unpredictable component.

The structure described above causes perpetual underestimation. Firms tend to hire majority workers because of the imbalance of data richness. However, as long as firms only hire majority workers, society cannot learn about the minority group; thus, the imbalance remains even in the long run. Here, the minority group is perpetually underestimated: the lack of data prevents hiring, and therefore, minority workers are never hired. We prove that perpetual underestimation may occur with a positive probability: laissez-faire results in the underprovision of a public good—the information about minority groups. By enforcing or incentivizing early movers (firms) to review minority groups, late movers can refer to a more useful data set of hiring cases, leading to an improvement of social welfare. Note that the policy intervention need not be persistent: once sufficiently rich data are collected, the government can terminate the affirmative action and return to laissez-faire.

We analyze the equilibrium consequence of laissez-faire and study desirable policy interventions. Multi-armed bandit models are useful for quantifying the value of information as the width of confidence bounds. We use a linear contextual bandit model to study whether a policy can lead society to achieve "no regret" in the long run.

For example, Trix and Psenka (2003) study letters of recommendation for medical faculty and find that letters written for female applicants differ systematically from those written for male applicants. Hanna and Linden (2012) suggest that students who belong to a lower caste (in India) tend to receive unfairly lower exam scores. Conversely, as for teaching evaluations, MacNell, Driscoll, and Hunt (2015) and Mitchell and Martin (2018) demonstrate that students rated a male identity significantly higher than a female identity. Hannák, Wagner, Garcia, Mislove, Strohmaier, and Wilson (2017) study online freelance marketplaces and find that gender and race are significantly correlated with worker evaluations. Precht (1998) and Al-Ali (2004) report cross-cultural differences in letters of recommendation (that do not originate from discrimination). Through a randomized experiment, Williams and Ceci (2015) demonstrate that in STEM tenure-track hiring, female applicants are favored over male applicants. This result is consistent with our assumption here: if the observed characteristics are systematically biased, an impartial employer would debias the data before interpreting it. This may lead to reverse discrimination.
The regret is one of the most popular criteria for evaluating the performance of algorithms in multi-armed bandit problems. The regret measures the welfare loss compared with the first-best decision rule (which firms would follow if they had perfect information about the statistical model). When the regret grows sublinearly in N, firms make fair and efficient decisions after a certain point in time.

In our theoretical analyses of laissez-faire, we first prove that it achieves no regret in the long run: when the groups are ex ante symmetric and the population ratio is equal, the expected regret of laissez-faire is shown to be Õ(√N), where Õ is a Landau notation that ignores a logarithmic factor. In contrast, when the population ratio is imbalanced (i.e., the number of majority workers is larger than the number of minority workers), this result no longer holds. In such a case, the expected regret is proven to be Ω̃(N), which implies that efficiency is not attained even in the long run.

This paper studies two policy interventions toward fair and efficient social learning. The first policy is a subsidy rule based on the idea of the upper confidence bound (UCB) algorithm. UCB is an effective solution for balancing exploration and exploitation in the (standard single-agent) multi-armed bandit problem (Lai and Robbins, 1985; Auer, Cesa-Bianchi, and Fischer, 2002). By incentivizing firms to take actions that are consistent with the recommendation of UCB, we can lead social learning to no regret in the long run. We achieve this by providing subsidies to firms when they hire a worker who belongs to an underexplored group. The subsidy is adjusted to the degree of information externality; thus, its total amount shrinks as time goes by. Formally, we show that the UCB mechanism has expected regret of Õ(√N).
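To fix ideas, the exploration bonus behind such a subsidy can be sketched as follows. This is a minimal illustration under our own assumptions: the bonus takes the standard linear-UCB confidence-width form, `alpha` is a tuning constant we introduce, and the paper's formal subsidy rule is the one defined in Section 5.

```python
import numpy as np

def confidence_width(x, V_bar):
    """Width of the confidence interval for x'theta_g under the design matrix
    V_bar = X'X + lam*I; wider when group g's data are scarce in direction x."""
    return float(np.sqrt(x @ np.linalg.solve(V_bar, x)))

def ucb_subsidy(x, V_bar, alpha=1.0):
    """Subsidy proportional to the confidence width: underexplored groups
    receive larger subsidies, which shrink as data about them accumulate."""
    return alpha * confidence_width(x, V_bar)

d = 3
x = np.ones(d)
V_scarce = np.eye(d)                          # almost no observations for this group
V_rich = np.eye(d) + 100.0 * np.outer(x, x)   # many observations along direction x
# ucb_subsidy(x, V_rich) < ucb_subsidy(x, V_scarce): more data, smaller subsidy
```

The design choice mirrors the text: the subsidy tracks the information externality, so it automatically vanishes as the minority group's data set grows.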
The subsidy amount required for implementing the UCB mechanism is also Õ(√N).

This paper further proposes a hybrid mechanism, which terminates affirmative actions once a sufficiently rich data set is collected and returns to laissez-faire. In our setting, once firms obtain a certain amount of data, the diversity of workers' characteristics naturally promotes learning about the minority group. Hence, even if we terminate the policy intervention earlier than a standard UCB algorithm would, society rarely falls into perpetual underestimation. We prove that our hybrid mechanism achieves Õ(√N) regret with Õ(1) subsidy in N rounds. Furthermore, in our simulation, the hybrid mechanism achieved smaller regret than the UCB mechanism.

The second policy is the Rooney Rule. The Rooney Rule is a "soft" affirmative action (in that no hiring quota is required), and it does not require monetary compensation. Instead, the Rooney Rule requires each firm to select at least one minority candidate as a finalist for each job opening. In the final selection, firms obtain additional signals beyond the observable characteristics shown in the application documents. The Rooney Rule thus leaves minority workers an opportunity to be hired. Even when a firm underestimates a minority worker's skill in the beginning (due to prediction inaccuracy), the worker may turn out to be the most attractive candidate once the interview is done. As long as minority workers have a chance to be hired, perpetual underestimation does not occur. However, our analysis also shows that the Rooney Rule may hinder the hiring of skilled majority candidates, and therefore should not be adopted as a permanent policy.

The remainder of this paper is organized as follows. Section 2 reviews the literature. Section 3 introduces the model. Section 4 studies the equilibrium consequence of laissez-faire. Section 5 develops the upper-confidence-bound subsidy rules, and Section 6 improves upon them. Section 7 analyzes the Rooney Rule.

The Rooney Rule was originally introduced in the National Football League; the original version of the rule required league teams to interview ethnic-minority candidates for head coaching and senior football operation jobs. The rule is named after Dan Rooney, the former chairman of the league's diversity committee (Eddo-Lodge, 2017).
A survey by Fang and Moro (2011) classifies the literature on statistical discrimination broadly into two strands. The first strand, originating from Arrow (1973), assumes that groups are ex ante identical and analyzes how statistical discrimination occurs as an asymmetric equilibrium (e.g., Foster and Vohra, 1992; Coate and Loury, 1993; Mailath et al., 2000; Moro and Norman, 2003, 2004; Gu and Norman, 2020). This strand interprets statistical discrimination as a random selection among multiple equilibria and does not explain why demographic minorities tend to be discriminated against. The second strand, originating from Phelps (1972), studies discrimination triggered by unexplained exogenous differences between groups, coupled with incomplete information about workers' skills (e.g., Aigner and Cain, 1977; Lundberg and Startz, 1983; Cornell and Welch, 1996). A difference in the signal distribution of workers' skills is one of the most popular assumptions in this strand. This paper unifies these two strands in that we endogenize the difference in the signal distribution. We consider otherwise ex ante identical individuals from different groups. Using a social learning model, we demonstrate how the difference in the prediction of skills is generated and persists. We find that when the population ratio of a group is small, the group tends to be statistically discriminated against. Hence, in contrast to most papers in the first strand, our results indicate that a minority group tends to suffer as an inevitable consequence under laissez-faire. More recently, several works (e.g., Bohren et al., 2019a, 2019b; Monachou and Ashlagi, 2019) demonstrate how misspecified beliefs about groups result in discrimination.
Thus far, this literature has attributed belief misspecification to psychological bias (e.g., Judd and Park, 1993; Hilton and Von Hippel, 1996) and bounded rationality (e.g., Fryer and Jackson, 2008; Schwartzstein, 2014; Bordalo, Coffman, Gennaioli, and Shleifer, 2019). In contrast, we develop a model of fully rational agents and show that a misspecified belief persists (i.e., a minority group is perpetually underestimated) even in the long run. Our result supports a fundamental assumption of the belief-based literature. Bardhi et al. (2020) show that a small difference in the prior belief about each worker's type (associated with the group the worker belongs to) can generate a significant difference in workers' payoffs. In contrast, the focus of this paper is on how society endogenously acquires a persistent misspecified belief about the minority group.

The population imbalance is not an "unexplained" difference.

Some previous studies consider a linear contextual bandit problem and study the performance of a "greedy" algorithm, which myopically makes decisions in accordance with the current information (Bastani, Bayati, and Khosravi, 2020; Kannan, Morgenstern, Roth, Waggoner, and Wu, 2018). As firms take greedy actions under laissez-faire, their results are also relevant to our model. They show that the greedy algorithm can lead to no regret in the long run if (i) the contexts (corresponding to workers' characteristics in our model) are diverse enough, and (ii) the decision maker acquires sufficiently many uniform samples in the beginning. While their results suggest that laissez-faire performs well, uniform sampling is not adaptive, and thus does not adequately quantify the value of information. Subsection 8.6 provides a detailed analysis of this point: we show that our hybrid mechanism performs better than uniform sampling followed by laissez-faire.

Theoretical analyses of the Rooney Rule are relatively scarce.
Kleinberg and Raghavan (2018) show that, when the recruiter has an unconscious bias against the discriminated group, the Rooney Rule can improve not only the representation of the discriminated group but also the quality of the recruiter's own selection.

As the decision maker is a long-lived employer, Bardhi et al. (2020) belongs to the literature on dynamic employer learning (Farber and Gibbons, 1996; Altonji and Pierret, 2001), not to the social learning literature. Bardhi et al. (2020) also describe the effect of population imbalance in an independent section. De Paola et al. (2010) empirically show that a soft gender quota imposed on an Italian local administration broke down negative stereotypes toward women even after it was terminated; this quota can be regarded as a version of the Rooney Rule.
Basic Setting
We develop a linear contextual bandit problem with myopic agents (firms). We consider a situation where N firms (indexed by n = 1, ..., N) sequentially hire one worker each. In each round n, a set of workers I(n) arrives. Each worker i ∈ I(n) takes no action, and firm n selects one worker ι(n) ∈ I(n). We denote the set of all workers by I := ∪_{n=1}^{N} I(n). Both firms and workers are short-lived. Once round n is finished, firm n's payoff is finalized, and all the workers not hired leave the market.

Each worker i belongs to a group g ∈ G. We assume that the population ratio is fixed: for every round n, the number of arriving workers who belong to group g is K_g ∈ ℕ, and K = Σ_{g ∈ G} K_g. Slightly abusing notation, we denote the group worker i belongs to by g(i). Each worker i ∈ I also has an observable characteristic x_i ∈ ℝ^d, where d ∈ ℕ is its dimension. Finally, each worker i also has a skill y_i ∈ ℝ, which is not observable until worker i is hired. The characteristics and skills are random variables.

Because each firm's payoff is equal to the hired worker's skill y_i (plus the subsidy assigned to worker i as an affirmative action, if any), firms want to predict the skill y_i based on the characteristics x_i. We assume that the characteristics and skills are associated in the following way:

    y_i = x_i′ θ_{g(i)} + ε_i,

where θ_g ∈ ℝ^d is a coefficient parameter, and ε_i ~ N(0, σ_ε²) i.i.d. is an unpredictable error term. We assume ||θ_g|| ≤ S for some S ∈ ℝ_+, where ||·|| is the standard L2-norm. Since ε_i is unpredictable,

    q_i := x_i′ θ_{g(i)}    (1)

is the best predictor of worker i's skill y_i.

The coefficient parameters (θ_g)_{g ∈ G} are unknown in the beginning. Hence, unless firms share information about past hiring cases, firms are unable to predict each worker's skill y_i. We assume that all firms share information about hiring cases.
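As a concrete sketch of this data-generating process (the group labels, group sizes K_g, coefficients θ_g, and noise scale below are our own illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2                                    # dimension of characteristics x_i
K = {"majority": 10, "minority": 2}      # arrivals per round, K_g (illustrative)
theta = {"majority": np.array([0.6, -0.2]),   # group-specific coefficients theta_g
         "minority": np.array([-0.3, 0.8])}
sigma_eps = 0.5                          # std. dev. of the unpredictable term eps_i

def draw_round():
    """Draw one round's candidate pool: (group, x_i, y_i) for each worker i.
    Skill follows y_i = x_i' theta_{g(i)} + eps_i; y_i is hidden until hiring."""
    workers = []
    for g, K_g in K.items():
        for _ in range(K_g):
            x = rng.normal(size=d)
            y = x @ theta[g] + sigma_eps * rng.normal()
            workers.append((g, x, y))
    return workers

pool = draw_round()   # 12 candidates per round: 10 majority, 2 minority
```

The imbalance K_majority > K_minority is deliberate: it is the configuration under which the paper's negative results for laissez-faire arise.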
Accordingly, when firm n makes a decision, besides the current workers' characteristics and groups (x_i, g(i))_{i ∈ I(n)}, firm n can observe all the past candidate workers' characteristics and groups, (x_i, g(i)) for all i ∈ ∪_{n′=1}^{n−1} I(n′), the past firms' decisions (ι(n′))_{n′=1}^{n−1}, and the past hired workers' skills (y_{ι(n′)})_{n′=1}^{n−1}. We refer to all of the realizations of these variables as the history in round n, and denote it by h(n). Formally, h(n) is given by

    h(n) = ( (x_i, g(i))_{i ∈ I(n)}, (x_i, g(i))_{i ∈ ∪_{n′=1}^{n−1} I(n′)}, (ι(n′))_{n′=1}^{n−1}, (y_{ι(n′)})_{n′=1}^{n−1} ).

Note that h(n) does not include information about (i) the worker hired by firm n, and (ii) that worker's actual skill. This is because h(n) represents the information set firm n faces when it makes a hiring decision. We define the set of all histories in round n as H(n), and the set of all histories as H := ∪_{n=1}^{N} H(n). The firm's decision rule for hiring and the government's subsidy rule will be defined as functions that map a history to a hiring decision and a subsidy amount (described later). For notational convenience, we often drop h(n).

Prediction
We assume that firms are not Bayesian but frequentist. Hence, firms have no prior distribution, but they estimate the true parameter θ using the available data set. We assume that each firm predicts skills by ridge regression (also known as regularized least squares) to stabilize small-sample inference. Let N_g(n) be the number of rounds at which group-g workers are hired by round n. Let X_g(n) ∈ ℝ^{N_g(n)×d} be the matrix that lists the characteristics of group-g workers hired by round n: each row of X_g(n) corresponds to an element of {x_{ι(n′)} : g(ι(n′)) = g}_{n′=1}^{n−1}. Likewise, let Y_g(n) ∈ ℝ^{N_g(n)} be the vector that lists the skills of group-g workers hired by round n: each element of Y_g(n) corresponds to an element of {y_{ι(n′)} : g(ι(n′)) = g}_{n′=1}^{n−1}. We define V_g(n) := (X_g(n))′ X_g(n). For a parameter λ > 0, we define V̄_g(n) = V_g(n) + λ I_d, where I_d denotes the d × d identity matrix. Firm n estimates the parameter as follows:

    θ̂_g(n) := (V̄_g(n))^{−1} (X_g(n))′ Y_g(n).    (2)

Unlike ordinary least squares (OLS), for λ > 0 the inverse (V̄_g(n))^{−1} is always well-defined. Firm n predicts worker i's skill by (1), substituting θ_g with θ̂_g(n):

    q̂_i(n) := x_i′ θ̂_{g(i)}(n).

Note that both q̂_i(n) and θ̂_g(n) depend on the history h(n). We often drop h(n) for notational simplicity.

Mechanism
Besides the (predicted) skill of workers, firms also take into consideration the subsidies provided as affirmative actions. We assume that firms' preferences are risk-neutral and quasi-linear. Hence, if firm n hires worker i, firm n's payoff (von Neumann–Morgenstern utility) is given by y_i + s_i, where s_i ∈ ℝ_+ denotes the amount of the subsidy assigned to worker i. At the beginning of the game, the government commits to a subsidy rule s_i(n, ·): H → ℝ_+, which maps a history to a subsidy amount. Hence, once a history h(n) is specified, firm n can identify the subsidy assigned to each worker i ∈ I(n). Firm n attempts to maximize

    E[y_i + s_i(n; h(n)) | h(n)] = q̂_i(n; h(n)) + s_i(n; h(n)).

Firm n's decision rule ι(n, ·): H(n) → I(n) specifies the worker firm n hires given a history h(n). We say that a decision rule ι is implemented by a subsidy rule s_i if for all n and all h(n), we have

    ι(n; h(n)) = arg max_{i ∈ I(n)} { q̂_i(n; h(n)) + s_i(n; h(n)) }.    (3)

We call a pair of a decision rule and a subsidy rule a mechanism. Throughout this paper, any ties are broken in an arbitrary way. Again, we often drop h(n) from the input of the decision rule ι when it does not cause confusion.

Remark 1 (Observability of the Past Hiring Data). While we assume for simplicity that firms share the entire history of past hiring data, in practice each firm may have limited access to the database. Even if we assume such limited access, our analysis and results would not require qualitative changes. Rational firms estimate θ_g based on the available data and use it to predict workers' skills. The smaller the sample size of the available data, the more severe the data insufficiency for minority workers.

Social Welfare
We measure social welfare by the smallness of the regret, which is the standard measure for evaluating the performance of algorithms in multi-armed bandit models. The regret is defined as follows:
    Reg(N) := Σ_{n=1}^{N} ( max_{i ∈ I(n)} q_i − q_{ι(n)} ).

Since ε_i is unpredictable, it is natural to evaluate the performance of an algorithm (or the equilibrium consequence of a policy intervention) by checking the value of the predictors q_i. If the parameters (θ_g)_{g ∈ G} were known, each firm could easily calculate q_i for each worker i and choose ι(n) = arg max_{i ∈ I(n)} q_i. In this case, the regret would be zero. However, since (θ_g)_{g ∈ G} is unknown, it is too demanding to aim at zero regret. The goal of the policy design is to set up a mechanism that minimizes the expected regret E[Reg(N)], where the expectation is taken over the random draw of the workers. This aim is equivalent to maximizing the sum of the skills of the hired workers.

Following the literature, we mainly evaluate performance by the limiting behavior (order) of the expected regret. One useful benchmark is whether the expected regret is linear (i.e., E[Reg(N)] = Ω(N)) or sublinear (i.e., E[Reg(N)] = o(N)). As described above, once (θ_g)_{g ∈ G} is known, firms can use the best predictor q_i to evaluate workers; after that point, the regret does not increase. Although (θ_g)_{g ∈ G} is unknown ex ante, firms can learn it from the data. A linear regret means that society fails to learn the underlying parameters (θ_g)_{g ∈ G}, and therefore firms are hiring less-skilled workers even in the long run. In our model, perpetual underestimation is often a consequence of statistical discrimination—typically, minority workers are more likely to be underexplored, and therefore, they are unfairly rejected.

Budget
Some of the policies we study incentivize exploration by subsidization. The total budget required by a subsidy rule is also an important policy concern. The total amount of the subsidy is given by

    Sub(N) := Σ_{n=1}^{N} s_{ι(n)}(n).

All the mechanisms proposed in this paper are deterministic; hence, there is no algorithmic randomness. In the literature on the multi-armed bandit problem, sublinear regret is also referred to as no regret, since the regret per round approaches zero as N → ∞.

Algorithm 1 Initial Sampling Phase
    {g_n}_{n=1}^{N(0)} is allocated such that Σ_{n=1}^{N(0)} 1[g_n = g] = N(0)_g.
    for n = 1, ..., N(0) do
        Hire ι(n; h(n)) = min{ i ∈ I(n) : g(i) = g_n }.    ▷ Firm n blindly hires a group-g_n candidate.
    end for

Initial Sampling Phase
For analytical tractability, we assume that for the first N(0) rounds, each firm n is forced to hire from a pre-specified group g_n. We refer to the first N(0) rounds as the initial sampling phase (Algorithm 1). Namely, for all n = 1, ..., N(0), firm n hires the group-g_n candidate who has the smallest agent number:

    ι(n; h(n)) = min{ i ∈ I(n) : g(i) = g_n }.    (4)

Choosing the agent with the smallest number is just a random choice: whenever agents belong to the same group, their characteristics and skill distributions are the same. Accordingly, (4) is equivalent to choosing a group-g_n worker blindly (i.e., uniformly at random, without looking at workers' predicted skills). We define N(0)_g := Σ_{n=1}^{N(0)} 1[g_n = g] as the data size of the initial sampling for group g. The initial sampling phase is exogenous and not regarded as part of the mechanism. Hence, we ignore the incentives and payoffs of firms hiring in the initial sampling phase.

This section analyzes the equilibrium under laissez-faire, that is, the consequence of social learning when policy intervention is absent. Subsection 4.1 introduces a basic fact: laissez-faire has linear regret in a general domain. However, a general domain is not suitable for the analysis of statistical discrimination. Hence, in Subsection 4.2, we define a symmetric and diverse environment, with which we can discuss how statistical discrimination grows. In Subsection 4.3, we formally define perpetual underestimation and discuss its implications. Subsection 4.4 describes the case where
(i) both groups have sufficient variation, and (ii) the population ratio is balanced. In this case, the underestimation of minority groups is spontaneously resolved, and therefore, laissez-faire performs well. However, as shown in Subsection 4.5, when the population ratio is imbalanced, laissez-faire tends to result in perpetual underestimation, and therefore, performs poorly.

Algorithm 2 Laissez-Faire
    Complete the initial sampling phase by running Algorithm 1.
    for n = N(0) + 1, ..., N do    ▷ Laissez-faire starts.
        Offer s_i(n) = 0 for all i ∈ I(n).    ▷ No subsidy is provided.
        Firm n hires ι(n) = arg max_i x_i′ θ̂_{g(i)}(n) as an equilibrium consequence.
    end for

We first define laissez-faire.
Definition 1 (Laissez-Faire). The laissez-faire decision rule always selects the worker who has the highest predicted skill, i.e., $\iota(n) = \arg\max_{i \in I(n)} \hat q_i(n)$.

Clearly, the laissez-faire decision rule is implemented by the laissez-faire subsidy rule, which provides no subsidy ($s_i = 0$) after any history (Algorithm 2).

Laissez-faire makes no intervention, and therefore, each firm hires the worker whose expected skill, predicted from the current data set, is the highest. In the multi-armed bandit literature, the laissez-faire decision rule is referred to as the greedy algorithm. The greedy algorithm often results in a catastrophic outcome due to insufficient exploration. Since information is a public good, its supply is inefficiently low if the government makes no policy intervention. This well-known result applies to our environment if no structure is assumed. We state this basic result as a benchmark.

Theorem 1 (Failure of Laissez-Faire in General Domain). Let
$\mathrm{Reg}^{\mathrm{LF}}$ be the regret under the laissez-faire decision rule. There exists an instance with which $\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] = \Omega(N)$.

Proof. See Appendix B.2.

The analysis in Appendix B.2 is essentially the same as the analysis of the greedy algorithm in the standard $K$-armed bandit problem, whose regret is well known to be $\Omega(N)$. We show Theorem 1 by constructing an instance explicitly. By assuming that the distribution of the characteristics $(x_i)$ is degenerate, our linear contextual bandit problem reduces to a basic $K$-armed bandit problem, where the expected skill (reward) of each group (arm) is fixed. We assume that one group is more productive than the other, and therefore, the first-best decision rule would always hire from the better group. With a constant probability, firms happen to underestimate the more productive group in the beginning. When the less productive group constantly performs better than the underestimated predicted skill of the better group, firms never want to investigate the better group further. Consequently, with a significant probability, a worker from the better group is never hired again, implying linear expected regret. Once an underestimation of the minority group occurs, it tends to persist: when the "context" of the majority group is fixed, there is a constant probability that the minority group is never chosen throughout all the rounds.

It would be too naive, however, to conclude from Theorem 1 alone that the laissez-faire decision rule causes statistical discrimination. First, the instance constructed in the proof of Theorem 1 assumes an unexplained exogenous difference (in expected skills) between groups, while our aim is to endogenize the difference. Second, we assumed that one group has a higher expected skill than the other. Under this assumption, it is efficient to always hire a worker from one group, so when social learning is successful, workers from the inferior group are never hired.
Third, we reduced the contextual bandit model to a $K$-armed bandit model by assuming that the distribution of characteristics is degenerate. (The example we consider in the proof fixes the context of each worker, so the problem boils down to the standard $K$-armed bandit problem without context.) However, in the real world, candidate workers have diverse characteristics, even when they belong to the same group.

To provide a better analysis of the laissez-faire decision rule, we make the following three assumptions. First, we focus on the case of two groups.

Assumption 1 (Two Groups). The population consists of two groups, $G = \{1, 2\}$.

We refer to group 1 as the majority (dominant) group and group 2 as the minority (discriminated) group. The two-group assumption helps us to elucidate how the minority group is discriminated against by the majority group.

Second, we assume that groups are symmetric.

Assumption 2 (Symmetric Groups). The characteristics of all groups are identically distributed, and the coefficient parameters are the same across all groups. Namely, there exists a probability distribution $F$ such that $x_i \sim F$ for all $i \in I$, and there exists $\theta \in \mathbb{R}^d$ such that $\theta_g = \theta$ for all $g \in G$.

Note that although we assume that groups are symmetric, firms do not see them as symmetric, and therefore, apply different statistical models to different groups. In other words, even though the true coefficients are identical ($\theta_g = \theta_{g'}$ for all $g, g' \in G$), firms estimate them separately; thus, the values of the estimated coefficients are typically different ($\hat\theta_g(n) \neq \hat\theta_{g'}(n)$ for $g \neq g'$).

Although Assumption 2 is unrealistic (as it is evident that the characteristics should be interpreted differently), it is useful for elucidating how laissez-faire nourishes statistical discrimination. Under Assumption 2, there is no ex ante difference between groups (as assumed in Arrow, 1973; Foster and Vohra, 1992; Coate and Loury, 1993; Moro and Norman, 2004, etc.).
Hence, all the differences we observe in the equilibrium consequence are purely due to the properties of the equilibrium learning process.

Under Assumption 2, statistical discrimination implies inefficiency: although the best candidate belongs to the minority group with substantial probability ($K_2/K$), that candidate is not hired due to underexploration. Hence, when the groups are symmetric, the resolution of statistical discrimination makes the hiring process not only fair but also efficient. By contrast, when there is exogenous asymmetry between groups, fairness and efficiency are often conflicting. For example, demographic parity is one of the most popular fairness notions studied in the machine learning (or supervised learning) literature. In our model, demographic parity requires that the probability of hiring from the minority group be equal to the population ratio, i.e., $K_2/K$. Clearly, when the groups are asymmetric, the "first-best decision rule" does not satisfy this condition: it hires more from the "more productive group," while it is arguable whether such a decision rule is socially desirable. As long as we assume group symmetry, our argument avoids this controversy: the first-best decision rule is fair and efficient. Thus, we should attempt to approximate it.

Third, we assume that characteristics are normally distributed, and therefore, the distribution is non-degenerate. This assumption captures the diversity of workers, which is the nature of the real-world labor market.

Assumption 3 (Normally Distributed Characteristics). For every candidate $i$, $x_i \sim \mathcal{N}(\mu_{x,g(i)}, \sigma_{x,g(i)}^2 I_d)$, where $\mu_{x,g} \in \mathbb{R}^d$ and $\sigma_{x,g} \in \mathbb{R}_{++}$ for every $g \in G$. We also denote $x_i = \mu_{x,g(i)} + e_{x,i}$ to highlight the noise term $e_{x,i}$.

Note that when both Assumptions 2 and 3 hold, there exist $\mu_x, \sigma_x$ such that $\mu_{x,g} = \mu_x$ and $\sigma_{x,g} = \sigma_x$ for all $g \in G$. Hence, $x_i \sim \mathcal{N}(\mu_x, \sigma_x^2 I_d)$ for all $i \in I$.
To determine whether social learning incurs linear expected regret or not, it is useful to checkwhether it results in perpetual underestimation with a significant probability.
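As a toy illustration (our own sketch, with hypothetical parameters), the following simulation runs the greedy rule on a degenerate-context instance in the spirit of Theorem 1's proof: two groups with fixed contexts, where group 0 is truly better. In a non-vanishing fraction of runs, a bad initial draw for group 0 keeps it from (almost) ever being rehired:

```python
import random

def greedy_two_groups(n_rounds, mu=(1.0, 0.0), sigma=1.0, seed=0):
    """Greedy play of a two-armed Gaussian bandit (degenerate contexts).
    After one initial sample per group, the group with the higher sample
    mean is always hired. Returns how often each group was hired."""
    rng = random.Random(seed)
    counts, sums = [0, 0], [0.0, 0.0]
    for g in (0, 1):  # initial sampling phase: one draw per group
        counts[g] += 1
        sums[g] += rng.gauss(mu[g], sigma)
    for _ in range(n_rounds - 2):
        g = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        counts[g] += 1
        sums[g] += rng.gauss(mu[g], sigma)
    return counts

# Fraction of runs in which the truly better group 0 is (almost) abandoned:
p_abandon = sum(greedy_two_groups(2000, seed=s)[0] < 20
                for s in range(300)) / 300
```

Because `p_abandon` stays bounded away from zero as the horizon grows, the per-run regret of such runs is linear in the number of rounds, which is the constant-probability event behind the $\Omega(N)$ bound.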
Definition 2 (Perpetual Underestimation). A group $g$ is perpetually underestimated if, for all $n > N^{(0)}$, we have $g(\iota(n)) \neq g$.

Namely, when group $g$ is perpetually underestimated, no worker from group $g$ is hired after the initial sampling phase.

If social learning results in perpetual underestimation with a significant probability, then it often incurs linear expected regret. In particular, under Assumption 2, perpetual underestimation of any group $g \in G$ implies that firms fail to hire at least $(K_g/K)(N - N^{(0)})$ of the best candidate workers, which is linear in $N$. Hence, if the probability of perpetual underestimation is constant (independent of $N$), then we have linear expected regret.

In our model, perpetual underestimation is also closely related to statistical discrimination. When perpetual underestimation occurs, a candidate who belongs to the underestimated group is not hired, even though the groups are symmetric. This outcome happens because society cannot accurately predict the skills of minority workers due to the lack of data. Hence, in our model, perpetual underestimation can be regarded as a form of statistical discrimination.

This subsection analyzes the case where only one candidate arrives in each round for each group. In this case, the variation of contexts implicitly urges the firms to explore all the groups with some frequency. Consequently, laissez-faire has sublinear regret, implying that statistical discrimination is eventually resolved, spontaneously.
Theorem 2 (Sublinear Regret with Balanced Population). Suppose Assumptions 1, 2, and 3. Suppose also that $K_g = 1$ for $g = 1, 2$. Then, the expected regret is bounded as
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] \le C_{\mathrm{bal}} \sqrt{N},$$
where $C_{\mathrm{bal}}$ is a $\tilde O(1)$ factor that depends on the model parameters. Here, $\tilde O(1)$ is Landau notation that ignores polylogarithmic factors. Letting $\mu_x = \|\mu_x\|$, the factor $C_{\mathrm{bal}}$ is inversely proportional to $\Phi^c(\mu_x/\sigma_x)$, which scales approximately as $\exp(-(\mu_x/\sigma_x)^2/2)$.

Proof.
See Appendix B.3. The explicit form of $C_{\mathrm{bal}}$ is found at the end of Appendix B.3. (Namely, there exist $N_0 \in \mathbb{N}$ and a function $f(N)$ that is a finite-order polynomial of $\log N$ such that $\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] \le f(N)\sqrt{N}$ for all $N \ge N_0$. In this and subsequent theorems, we often ignore polylogarithmic factors, that is, factors that are finite-order polynomials of the logarithm of $N$, because they grow very slowly as $N$ grows large. We remark on the important dependence on model parameters and refer to the equations with the explicit formulae of each factor.)

Intuitively, regret is incurred when one of the groups is underestimated, which happens with some constant probability. The ratio $\mu_x/\sigma_x$ represents the stability of characteristics: the larger this value is, the more stable the skills of candidates are. If $\mu_x/\sigma_x$ is small, there is some probability that the skill of the candidate from the currently favored group is predicted to be bad. In such a case, the candidate from the underestimated group might be chosen, which updates the belief about that group and resolves the underestimation. As expected from the theory of least squares, the standard deviation of $\hat\theta_g(n)$ is proportional to $(\bar V_g(n))^{-1/2}$, and we show that its diameter, $(\lambda_{\min}(\bar V_g(n)))^{-1/2}$, shrinks as $\tilde O(1/\sqrt{n})$. The regret per error is governed by this quantity, and the total regret is $\tilde O(\sum_{n \le N} 1/\sqrt{n}) = \tilde O(\sqrt{N})$.

Theorem 2 shows that statistical discrimination is resolved spontaneously when the candidate variation is large. At a glance, this appears to contradict the widely known results stating that laissez-faire may lead to suboptimal outcomes in bandit problems due to underexploration. Since selfish firms do not want to experiment with underrepresented groups at their own risk, laissez-faire may perpetually underestimate the skill of the minority group (as demonstrated in Theorem 1).
However, the variation in characteristics naturally incentivizes selfish agents to explore the underestimated group, and therefore, under some additional conditions, we can bound the probability of perpetual underestimation.

Theorem 2 shares some intuition with previous results (Kannan et al., 2018; Bastani et al., 2020), which have shown that variation in contexts (characteristics) improves the performance of the greedy algorithm (laissez-faire) in contextual multi-armed bandit problems. Kannan et al. (2018) assume that there is a sufficiently long initial sampling phase, in which society can collect uniformly sampled data until the model parameters are stabilized. Theorem 1 in Bastani et al. (2020) corresponds to Theorem 2 in our paper, and we further attribute the performance to the stability $\mu_x/\sigma_x$ rather than the diameter of the characteristics.

More importantly, in the next subsection, we prove that these positive results are "special cases": we will show that, even when there is variation in characteristics, if the population ratio is imbalanced, the laissez-faire decision rule may cause perpetual underestimation with a substantial probability.

4.5 Large Regret with Imbalanced Population

While Theorem 2 implies that statistical discrimination might be spontaneously resolved in the long run (if we admit that workers' characteristics are diverse enough), it crucially relies on one unrealistic assumption: the balanced population ratio. In many real-world problems, the population ratio is imbalanced. The dominant group is often the majority of the population, and the discriminated group is the minority. Even when the population is demographically balanced, if we look at a specific labor market, the population ratio could be imbalanced due to an imbalanced wealth distribution or the discouragement of human capital investments.

We indeed find that the population ratio between groups plays a crucial role in the welfare under laissez-faire.
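The mechanics can be previewed with a small Monte Carlo computation (our own illustration, not from the paper; the numbers are hypothetical). The best of $K_1$ majority candidates has a context realization roughly $\sqrt{2 \log K_1}$ standard deviations above the mean, so a lone minority candidate evaluated with an underestimated coefficient rarely wins the round:

```python
import random

def mean_max_of_k(k, trials=20000, seed=1):
    """Monte Carlo estimate of E[max of k i.i.d. standard normal draws]:
    the 'best of many majority candidates' effect grows like sqrt(2 log k)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(k))
    return total / trials

# With one candidate the best draw averages ~0; with many, it is far above.
e1, e16 = mean_max_of_k(1), mean_max_of_k(16)
e256 = mean_max_of_k(256, trials=4000)
```

As the pool of majority candidates grows, the bar that a single (underestimated) minority candidate must clear in every round keeps rising, which is why imbalance makes perpetual underestimation more likely.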
In the following theorem, we assume that, in each round, only one minority worker arrives (i.e., $K_2 = 1$), while many majority workers ($K_1 > 1$) arrive. When the population is imbalanced, perpetual underestimation becomes more likely, and therefore, society suffers a large expected regret.

Theorem 3 (Large Regret with Imbalanced Population). Suppose Assumptions 1, 2, and 3. Suppose also that $K_2 = 1$ and $d = 1$. Let $K_1 > \log N$. Then, under the laissez-faire decision rule, group 2 is perpetually underestimated with probability at least $C_{\mathrm{imb}} = \tilde\Theta(1)$. Accordingly, the expected regret of the laissez-faire decision rule is
$$\mathbb{E}\left[\mathrm{Reg}^{\mathrm{LF}}(N)\right] \ge \frac{C_{\mathrm{imb}}\,(N - N^{(0)})}{K} = \tilde\Omega(N).$$

Proof.
See Appendix B.4. The explicit form of $C_{\mathrm{imb}}$ is found in Eq. (39).

In the proof of Theorem 3, we evaluate the probability of the following two events occurring: (i) the coefficient parameter for the minority candidates, $\theta_2$, is underestimated; and (ii) the characteristics and skills of the hired majority workers are consistently good throughout the rounds. The probability of (i) is constant (independent of $N$), and the probability of (ii) is constant if $K_1 > \log N$. When both (i) and (ii) occur, minority workers are perpetually underestimated, and therefore, we have a large regret.

Theorem 3 indicates that we should not be too optimistic about the consequences of laissez-faire. The imbalance in the population ratio naturally favors the majority group by helping society collect a richer data set about them, leading to statistical discrimination. This insight applies to many real-world problems because an imbalanced population is commonplace.

Remark 2.
In the proof of Theorem 3, we explicitly bound the probability that each event happens. Hence, the effect of the initial sample size is revealed. The probability of underestimating the minority group is exponentially small in the number of initial samples for minorities, $N^{(0)}_2$, which implies that a small number of initial samples from the minority group can prevent the underestimation from being perpetuated. Note also that this is consistent with the prior results of Kannan et al. (2018), which state that a sufficiently large initial sample prevents perpetual underestimation because it alleviates the underestimation of $\hat\theta_2$. In Subsection 8.6, we demonstrate that this solution is not desirable because uniform sampling is costly and difficult to implement. According to our simulations, the UCB-based subsidy rule (the hybrid mechanism, proposed in Section 6) outperforms uniform sampling followed by laissez-faire.

Remark 3.
In our framework, statistical discrimination is purely attributable to a failure of social learning and the resultant misinformation. Hence, even when perpetual underestimation occurs, the true skills of minority workers (the distribution of $y_i$) are not lowered. However, if we additionally incorporate the choice of education level and human capital investments (as in Foster and Vohra, 1992; Coate and Loury, 1993), the misinformation naturally discourages minority workers from improving their skills. Therefore, if education is endogenous, the welfare loss and inequality caused by social learning would be even more serious.

Section 4 has discussed the equilibrium consequence under laissez-faire. We observed that, when the population ratio is imbalanced (as in the real-world job market), there is a substantial probability that the underestimation is perpetuated. This result indicates that a policy intervention (affirmative action) is effective for improving the social welfare and fairness of the hiring market.

This section proposes a subsidy rule to resolve such perpetual underestimation. We use the idea of the upper confidence bound (UCB) algorithm, which is widely used in the multi-armed bandit literature. (See Lemma 24 in the Appendix for the full detail.) The UCB algorithm balances exploration and exploitation by allocating handicaps to less explored arms (groups), whose rewards (skills) cannot be predicted accurately. To achieve this balance, the UCB algorithm develops a confidence interval for the true reward and evaluates each arm's performance by its upper confidence bound. Although firms are not willing to follow the UCB's recommendation under laissez-faire, the government can provide a subsidy to promote a candidate worker who has the highest UCB. In this section, we establish a UCB-based subsidy rule and evaluate its performance.
To establish the UCB-based subsidy rule, we first define the hiring decision suggested by the UCB algorithm. After that, we construct a subsidy rule that incentivizes firms to hire workers based on the UCB. A challenge is that the adaptive selection of candidates based on history can induce bias, and the standard confidence bound no longer applies to our case. To overcome this issue, we use martingale inequalities (Peña, Lai, and Shao, 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori, Pál, and Szepesvári, 2011). We here introduce the confidence interval for the true coefficient parameters, $(\theta_g)_{g \in G}$.

Definition 3 (Confidence Interval, Abbasi-Yadkori et al., 2011). Given group $g$'s collected data matrix $\bar V_g(n)$, the confidence interval of group $g$'s coefficient parameter $\theta_g$ is given by
$$\mathcal{C}_g(n) = \left\{ \bar\theta_g \in \mathbb{R}^d : \left\| \bar\theta_g - \hat\theta_g(n) \right\|_{\bar V_g(n)} \le \sigma_\epsilon \sqrt{d \log\left( \frac{\det(\bar V_g(n))^{1/2} \det(\lambda I_d)^{-1/2}}{\delta} \right)} + \lambda^{1/2} S \right\},$$
where $\|v\|_A = \sqrt{v' A v}$ for a $d$-dimensional vector $v$ and a $d \times d$ matrix $A$.

The standard confidence interval, $\mathcal{C}_g(n)$, shrinks as firm $n$ has a richer set of data about group $g$. Abbasi-Yadkori et al. (2011) study the properties of this confidence interval, and they prove that the true parameter $\theta_g$ lies in $\mathcal{C}_g(n)$ with probability $1 - \delta$ (Lemma 19). If we choose a sufficiently small $\delta$, it is "safe" to assess that worker $i$'s skill is at most
$$\tilde q_i(n) := \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}.$$
We call $\tilde q_i(n)$ the upper confidence bound index (UCB index) of worker $i$'s skill. Intuitively, $\tilde q_i(n)$ is worker $i$'s skill in the most optimistic scenario. (The idea of UCB goes back to at least the 1980s. The seminal paper by Lai and Robbins (1985) analyzed a version of UCB. More recently, Auer et al. (2002) introduced UCB1, which is widely known in the machine learning literature.)
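The maximization defining $\tilde q_i(n)$ has a closed form: over the ellipsoid $\{\bar\theta : \|\bar\theta - \hat\theta_g(n)\|_{\bar V_g(n)} \le \beta\}$, the maximum of $x_i'\bar\theta$ equals $x_i'\hat\theta_g(n) + \beta \|x_i\|_{\bar V_g(n)^{-1}}$. A minimal sketch of our own (plain-Python linear algebra for small $d$; the scalar $\beta$ stands in for the right-hand side of Definition 3):

```python
import math

def ucb_index(x, theta_hat, V_bar_inv, beta):
    """UCB index q~ = x' theta_hat + beta * sqrt(x' V_bar^{-1} x),
    i.e., the maximum of x' theta over the confidence ellipsoid of
    radius beta around theta_hat (in the V_bar norm).
    x, theta_hat: length-d lists; V_bar_inv: d x d list of lists."""
    d = len(x)
    point_estimate = sum(x[i] * theta_hat[i] for i in range(d))
    width_sq = sum(x[i] * V_bar_inv[i][j] * x[j]
                   for i in range(d) for j in range(d))
    return point_estimate + beta * math.sqrt(width_sq)
```

With $\bar V_g(n) = I_2$, $\hat\theta = (0,0)$, $x = (3,4)$, and $\beta = 1$, the index is $0 + \|x\| = 5$; richer data (a larger $\bar V_g(n)$, hence a smaller inverse) shrinks the optimistic bonus toward the point estimate.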
The UCB decision rule makes a decision based on this UCB index.

Definition 4 (UCB Decision Rule). The UCB decision rule selects the worker who has the highest UCB index; i.e.,
$$\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n). \tag{5}$$

The UCB index $\tilde q_i(n)$ is close to the pointwise estimate $\hat q_i(n)$ when society has rich data about group $g(i)$, because $\mathcal{C}_{g(i)}(n)$ is small in such a case. However, when the information about group $g(i)$ is insufficient, $\tilde q_i(n)$ is much larger than $\hat q_i(n)$, because the firm is not sure about the true skill of worker $i$ and $\mathcal{C}_{g(i)}(n)$ is large. In this sense, the UCB decision rule offers affirmative action to underexplored groups. In contrast to the greedy algorithm (laissez-faire), the UCB algorithm appropriately balances exploration and exploitation, and therefore, it has sublinear expected regret in general environments.

The UCB decision rule recommends the exploration of majority candidates as well as minority candidates. The amount of the subsidy is proportional to the uncertainty about the candidate's characteristics, which is represented by the confidence interval $\mathcal{C}_g(n)$. The confidence interval $\mathcal{C}_g(n)$ is inversely proportional to $\bar V_g(n) = (X_g(n))' X_g(n) + \lambda I_d$. Hence, if the data $X_g(n)$ do not have large variation in a particular dimension of $x_i$, then the prediction along that dimension can be inaccurate. In such a case, the UCB decision rule recommends hiring a candidate who contributes to increasing the data variation in that dimension. For example, when a candidate has some skills that previous candidates do not have, that candidate's UCB index tends to be large.

As the UCB decision rule nicely balances exploration and exploitation, it has sublinear regret. (We typically choose $\delta = 1/N$ so that the confidence interval is asymptotically correct in the limit of $N \to \infty$.)
(The standard ordinary least squares estimator has a confidence bound of the form $\theta_g - \hat\theta_g(n) \sim \mathcal{N}(0, \sigma_\epsilon^2 V_g^{-1}(n))$, and thus $|\theta_g - \hat\theta_g(n)| \sim \sigma_\epsilon V_g^{-1/2}(n)$. The martingale confidence bound $\mathcal{C}_g(n)$ is larger than the OLS confidence bound by two factors because of the price of adaptivity, namely, (1) a $\sqrt{d}$ factor and (2) a $\sqrt{\log(\det(\bar V_g(n)))}$ factor. As discussed in Xu, Honda, and Sugiyama (2018), the first $\sqrt{d}$ factor unnecessarily overestimates the confidence bound in most cases.)

Theorem 4 (Sublinear Regret of UCB). Suppose Assumption 3. Let $\mathrm{Reg}^{\mathrm{UCB}}$ be the regret from the UCB decision rule. Let $\lambda \ge \max(1, L)$. Then, by choosing a sufficiently small $\delta$, the regret under the UCB decision rule is bounded as
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{UCB}}(N)] \le C_{\mathrm{ucb}} \sqrt{N},$$
where $C_{\mathrm{ucb}}$ is a $\tilde O(1)$ factor with respect to $N$ that depends on the model parameters.

Proof.
See Appendix B.5. The explicit form of $C_{\mathrm{ucb}}$ is found in Eq. (43) therein.

Note that $\tilde O(\sqrt{N})$ regret is the optimal rate for these sequential optimization problems under partial feedback (Chu, Li, Reyzin, and Schapire, 2011). Hence, Theorem 4 states that the UCB decision rule effectively prevents perpetual underestimation and is asymptotically efficient. The analysis here does not depend on the size of the candidate pool $K$, and thus it is effective regardless of the population ratio.

Remark 4.
Although we have made several strong assumptions for the analysis of laissez-faire (e.g.,two groups, symmetry), Theorem 4 does not rely on them, and therefore, it is applicable to a verygeneral environment. The groups need not be symmetric. The normal characteristic assumption(Assumption 3) can be relaxed to a weaker condition that guarantees that the distributions arelight-tailed, or the characteristics can even be arbitrary as long as they are bounded with highprobability.
To implement the UCB decision rule, we need to satisfy the firms' obedience condition (3) along with the UCB decision rule (5). In this paper, we focus on two types of subsidy rules: one is the UCB index subsidy rule, and the other is the UCB cost-saving subsidy rule.

First, we formally define the UCB index subsidy rule. The UCB index subsidy rule induces firms to hire a candidate with the largest UCB index by aligning each firm's profit with the UCB index.

Algorithm 3
The UCB Index Subsidy Rule
Complete the initial sampling phase by running Algorithm 1.
for $n = N^{(0)} + 1, \dots, N$ do  ▷ The UCB index subsidy rule starts.
  for $i \in I(n)$ do
    Compute $\tilde q_i(n) = \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}$.  ▷ Obtain UCB indices.
  end for
  Offer $s_i = \tilde q_i(n) - \hat q_i(n)$ for all $i \in I(n)$.  ▷ Align firm $n$'s payoff with the UCB index.
  Firm $n$ hires $\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n)$ as an equilibrium consequence.
end for

Definition 5 (UCB Index Subsidy Rule). The
UCB index subsidy rule $s$ subsidizes worker $i$ who arrives in round $n$ by
$$s_i(n; h(n)) = \tilde q_i(n; h(n)) - \hat q_i(n; h(n)).$$

The formal algorithm is shown as Algorithm 3. The UCB index subsidy rule is named "index" because it belongs to the class of index policies (Gittins, 1979) in the terminology of the multi-armed bandit literature.
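The subsidy $s_i = \tilde q_i - \hat q_i$ is exactly the confidence width, which shrinks as data about group $g(i)$ accumulates. A back-of-the-envelope sketch of our own (assuming, as the paper's analysis suggests, a per-round width of order $1/\sqrt{n}$ with a hypothetical constant $c$) shows why the cumulative subsidy grows only like $\sqrt{N}$:

```python
import math

def total_index_subsidy(c, N):
    """Sum of per-round subsidies c / sqrt(n) over rounds n = 1..N.
    Since sum_{n<=N} 1/sqrt(n) is about 2*sqrt(N), the total subsidy
    grows like sqrt(N), even though it never hits exactly zero."""
    return sum(c / math.sqrt(n) for n in range(1, N + 1))
```

For example, `total_index_subsidy(1.0, 10000)` is close to $2\sqrt{10000} = 200$, while the average per-round subsidy, total divided by $N$, vanishes as $N$ grows.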
Definition 6 (Index Policy). A subsidy rule $s$ is an index policy if, for all $n$ and $i \in I(n)$, $s_i(n; \cdot)$ depends only on $X_{g(i)}(n)$, $Y_{g(i)}(n)$, and $x_i$.

To be more precise, our definition of an index rule is slightly weaker than the standard definition. The standard definition requires that the index of an arm depend only on the data generated by that arm. However, since we regard a set of arms as a group, it does not make sense to focus on the data generated by "an arm." Hence, we utilize all the data about group $g(i)$. Having said that, our definition requires that the subsidy for worker $i$ be independent of (i) the other agents' characteristics $x_j$ for any $j \in I(n) \setminus \{i\}$, and (ii) the data about other groups, $X_{g'}(n)$ for any $g' \neq g(i)$.

If a subsidy rule is an index policy, the government need not observe the characteristics of $I(n) \setminus \{i\}$ to determine the subsidy assigned to the employment of worker $i$. This is a practically desirable property: in many real-world problems, it is difficult for the government to observe the characteristics of candidate workers who are not hired.

The following theorem states the properties of the UCB index subsidy rule. Among all index subsidy rules that implement the UCB decision rule, the UCB index subsidy rule requires the minimum amount of the subsidy. Its expected amount is proven to be $\tilde O(\sqrt{N})$.

Theorem 5 (Sublinear Subsidy of the UCB Index Rule).
1. The UCB index subsidy rule implements the UCB decision rule.

2. The UCB index subsidy rule needs the minimum amount of subsidies among all subsidy rules that (i) implement the UCB decision rule, and (ii) are index policies. Formally, let $s^{\text{U-I}}$ be the UCB index subsidy rule and $s$ be an arbitrary subsidy rule that satisfies (i) and (ii). Then, for all $i$, $n$, and $h(n)$, we have $s^{\text{U-I}}_i(n; h(n)) \le s_i(n; h(n))$.
3. Under the same assumptions as Theorem 4, the amount of the subsidy required by the UCB index subsidy rule is bounded as
$$\mathbb{E}[\mathrm{Sub}^{\text{UCB-I}}(N)] \le C_{\mathrm{ucb}} \sqrt{N},$$
where $C_{\mathrm{ucb}}$ is the same $\tilde O(1)$ factor as in Theorem 4.

Proof.
See Appendix B.6.

The square-root subsidy implies that the government can eventually end the subsidy because $\mathrm{Sub}^{\text{UCB-I}}(N)/N \to 0$ as $N \to \infty$. Alternatively, Theorem 5 implies that society can terminate affirmative actions once a sufficiently rich data set about the minority groups is obtained.

If the mechanism does not have to be an index policy (i.e., the subsidy for worker $i \in I(n)$ may depend on $(x_j)_{j \in I(n)}$ of the other candidates), then we can save the budget without modifying the decision rule. To achieve this, we subsidize the minimum amount such that candidate $\iota$ is more profitable than the other candidates. Formally, the UCB cost-saving subsidy rule is defined as follows.

Algorithm 4 The UCB Cost-Saving Subsidy Rule
Complete the initial sampling phase by running Algorithm 1.
for $n = N^{(0)} + 1, \dots, N$ do  ▷ The UCB cost-saving subsidy rule starts.
  for $i \in I(n)$ do
    Compute $\hat q_i(n) = x_i' \hat\theta_{g(i)}(n)$.
    Compute $\tilde q_i(n) = \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}$.
  end for
  Compute $\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n)$.  ▷ $\iota(n)$ is the UCB winner.
  Offer $s_{\iota(n)}(n) = \max_{j \in I(n)} \hat q_j(n) - \hat q_{\iota(n)}(n)$.  ▷ Make $\iota(n)$ the most profitable candidate.
  Offer $s_j(n) = 0$ for all $j \in I(n) \setminus \{\iota(n)\}$.
  Firm $n$ hires $\iota(n)$ as an equilibrium consequence.
end for

Definition 7 (UCB Cost-Saving Subsidy Rule). For every round $n$, the UCB cost-saving subsidy rule chooses $s_i(n) = 0$ for every $i \in I(n) \setminus \{\iota(n)\}$, where $\iota(n)$ is the candidate worker selected by the UCB algorithm, (5). For $i = \iota(n)$, the subsidy $s_i$ is given by
$$s_i(n; h(n)) = \max_{j \in I(n)} \hat q_j(n; h(n)) - \hat q_i(n; h(n)).$$

The formal algorithm is shown as Algorithm 4.

The UCB cost-saving subsidy rule subsidizes only the targeted worker, $\iota(n)$. Hence, for the other workers $j \neq \iota(n)$, the payoff from the employment is $\hat q_j(n)$.
The UCB cost-saving subsidy rule sets the subsidy amount $s_{\iota(n)}$ in such a way that the payoff from hiring worker $\iota(n)$, which is $\hat q_{\iota(n)}(n) + s_{\iota(n)}$, is equal to (or slightly larger than) the payoff from hiring the worker who has the highest predicted skill, $\max_{j \in I(n)} \hat q_j(n)$.

Clearly, the UCB cost-saving subsidy rule is the subsidy rule that requires the minimum budget to implement the UCB decision rule. As fines (negative subsidies) are not allowed in our model, the government cannot further discourage the employment of the other candidate workers, $j \in I(n) \setminus \{\iota(n)\}$. Hence, the UCB cost-saving subsidy rule requires the smallest budget among all subsidy rules that implement the decision rule (5).

Combining this observation with Theorem 5, we obtain the following theorem.

Theorem 6 (Sublinear Subsidy of the UCB Cost-Saving Rule).
1. The UCB cost-saving subsidy rule implements the UCB decision rule.

2. The UCB cost-saving subsidy rule requires the minimum budget for implementing the UCB decision rule. Formally, let $s^{\text{U-CS}}$ be the UCB cost-saving subsidy rule and $s$ be an arbitrary subsidy rule that implements the UCB decision rule. Then, for all $i$, $n$, and $h(n)$, we have $s^{\text{U-CS}}_i(n; h(n)) \le s_i(n; h(n))$.
3. The amount of the subsidy required by the UCB cost-saving subsidy rule is bounded as
$$\mathbb{E}[\mathrm{Sub}^{\text{UCB-CS}}(N)] \le \mathbb{E}[\mathrm{Sub}^{\text{UCB-I}}(N)] \le C_{\mathrm{ucb}} \sqrt{N}.$$
Proof.
The first two statements straightforwardly follow from the argument above. The last statement follows from the first two and Theorem 5.

The cost-saving subsidy rule has some drawbacks. It depends on the characteristics of all the potential candidates. Hence, the government must have precise knowledge about the candidates who appeared in each round but were not hired by the firm. Still, as a theoretical benchmark, it is useful for studying the minimum subsidy amount incurred. In Subsection 8.4, we compare the index rule and the cost-saving rule numerically. Our simulation results indicate that the cost-saving rule outperforms the index rule by a large margin in terms of the total amount of subsidy.
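The difference between the two rules within a single round can be sketched as follows (our own illustration with made-up numbers): the index rule pays the hired worker the full confidence width $\tilde q - \hat q$, while the cost-saving rule pays only the gap to the best predicted skill, which is never larger, since $\max_j \hat q_j \le \max_j \tilde q_j = \tilde q_{\iota(n)}$.

```python
def round_subsidies(q_hat, q_ucb):
    """Per-round subsidies paid to the hired (UCB-winning) candidate.
    q_hat[i]: predicted skill; q_ucb[i]: UCB index (q_ucb[i] >= q_hat[i]).
    Index rule pays the winner q_ucb - q_hat; the cost-saving rule pays
    only max_j q_hat[j] - q_hat[winner]."""
    winner = max(range(len(q_ucb)), key=q_ucb.__getitem__)
    s_index = q_ucb[winner] - q_hat[winner]
    s_cost_saving = max(q_hat) - q_hat[winner]
    return winner, s_index, s_cost_saving
```

For instance, with predicted skills `[1.0, 0.4]` and UCB indices `[1.1, 1.3]` (the second candidate belongs to a data-poor group), candidate 1 wins; the index rule pays 0.9 while the cost-saving rule pays only 0.6.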
In the previous section, we showed that the UCB mechanism effectively prevents perpetual underestimation and achieves sublinear regret in general environments. However, the UCB mechanism has one drawback: it assigns subsidies forever. Although the confidence interval $\mathcal{C}_g(n)$ shrinks as $n$ grows large, it does not degenerate to a singleton for any finite $n$. Accordingly, even for a large $n$, there remains a gap between the expected skill $\hat q_i(n)$ and the UCB index $\tilde q_i(n)$ (though small in size). This feature is not desirable for the following reasons. First, introducing a permanent policy is often more politically difficult than introducing a temporary policy. If the government declares that the hiring of minority workers is permanently subsidized, the policy may look quite unfair to the majority group. The appearance of unfairness would cause significant opposition. Second, if we keep distributing subsidies over the long run, the required budget tends to grow. Third, besides the subsidy itself, the permanent allocation of the subsidy comes with (unmodeled) administration costs.

To overcome these limitations of the UCB mechanism, we propose the hybrid mechanism, which starts with the UCB mechanism and turns to laissez-faire by terminating the subsidy at some point. We terminate the UCB phase once the amount of data about the minority group is sufficient to induce spontaneous exploration. We prove that our hybrid mechanism has $\tilde O(\sqrt{N})$ regret (as the UCB mechanism does), and its expected total subsidy amount is $\tilde O(1)$ (as opposed to the $\tilde O(\sqrt{N})$ subsidy of UCB).

The construction of the hybrid mechanism is as follows. Let $s^{\text{U-I}}_i(n) = \tilde q_i(n) - \hat q_i(n)$ be the size of the confidence bound. Note that $s^{\text{U-I}}_i(n)$ corresponds to the amount of the subsidy allocated by the UCB index subsidy rule (Definition 5).
The hybrid index ˜q^H_i is defined as

˜q^H_i(n; h(n)) := ˜q_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and ˆq_i(n; h(n)) otherwise,   (6)

where a ≥ 0 is the mechanism's parameter.

The hybrid index is literally a "hybrid" of the predicted skill ˆq_i(n) and the UCB index ˜q_i(n). If the difference between the UCB index and the predicted skill is larger than the threshold (i.e., s^{U-I}_i(n) > aσ_x ||ˆθ_{g(i)}(n)||), the hybrid index is equal to the UCB index ˜q_i(n). The confidence bound |˜q_i(n) − ˆq_i(n)| is large while we have insufficient knowledge about group g(i); this is typically the case in an early stage of the game. Once this gap becomes smaller than the threshold (i.e., s^{U-I}_i(n) ≤ aσ_x ||ˆθ_{g(i)}(n)||), the hybrid index becomes equal to the predicted skill ˆq_i(n). Naturally, the hybrid decision rule is defined as the rule that hires the worker with the highest hybrid index.

Definition 8 (The Hybrid Decision Rule). The hybrid decision rule selects the worker who has the highest hybrid index; i.e., ι^H(n; h(n)) = arg max_{i ∈ I(n)} ˜q^H_i(n; h(n)).

As the hybrid decision rule is a hybrid of the UCB decision rule and the laissez-faire decision rule, it can be implemented by mixing the laissez-faire subsidy rule and either the UCB index subsidy rule or the UCB cost-saving subsidy rule.
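For concreteness, the switch embedded in the hybrid index can be sketched in a few lines of Python. This is an illustrative sketch with d = 1 (so that ||ˆθ_g|| = |ˆθ_g|); the function name and the symmetric-interval simplification s^{U-I}_i = |x_i| × width are our own assumptions, not the paper's code.

```python
def hybrid_subsidy(x, theta_hat, width, a=1.0, sigma_x=2.0):
    """Hybrid index subsidy for one worker (illustrative sketch, d = 1).

    x         : worker's observable characteristic x_i
    theta_hat : current estimate of the group coefficient (d = 1, so the
                norm ||theta_hat|| is just its absolute value)
    width     : confidence width on the coefficient, so the UCB gap is
                s^{U-I}_i = q~_i - q^_i = |x| * width (symmetric interval)
    a         : switching parameter of the hybrid mechanism (a >= 0)
    """
    s_ucb = abs(x) * width                     # gap between UCB index and predicted skill
    threshold = a * sigma_x * abs(theta_hat)   # switching threshold a * sigma_x * ||theta_hat||
    return s_ucb if s_ucb > threshold else 0.0 # UCB-phase subsidy, else laissez-faire
```

Early on, `width` is large, so the subsidy equals the full UCB gap; once the group's accumulated data shrink `width` below the threshold, the subsidy is zero and the rule coincides with laissez-faire.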
Definition 9 (The Hybrid Index Subsidy Rule). Let s^{U-I}_i be the UCB index subsidy rule. The hybrid index subsidy rule s^{H-I} is defined by

s^{H-I}_i(n; h(n)) := s^{U-I}_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and 0 otherwise.

Equivalently, the hybrid index subsidy rule can be written as s^{H-I}_i(n; h(n)) = ˜q^H_i(n; h(n)) − ˆq_i(n; h(n)).

Definition 10 (The Hybrid Cost-Saving Subsidy Rule). Let s^{U-CS}_i be the UCB cost-saving subsidy rule. The hybrid cost-saving subsidy rule s^{H-CS} is defined by

s^{H-CS}_i(n; h(n)) := s^{U-CS}_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and 0 otherwise.

Theorem 7 (The Properties of the Hybrid Subsidy Rules).
1. The hybrid index subsidy rule s^{H-I} and the hybrid cost-saving subsidy rule s^{H-CS} implement the hybrid decision rule ι^H.
2. The hybrid index subsidy rule s^{H-I} requires the minimum subsidy among all index subsidy rules that implement ι^H.
3. The hybrid cost-saving subsidy rule s^{H-CS} requires the minimum subsidy among all subsidy rules that implement ι^H.

The proof of Theorem 7 is analogous to those of Theorems 5 and 6, and thus is omitted.

The hybrid index subsidy rule and its equilibrium consequence are stated as Algorithm 5. As it is straightforward to modify Algorithm 5 to construct a hybrid cost-saving subsidy rule, we omit the algorithm for the hybrid cost-saving subsidy rule here.

Algorithm 5: The Hybrid Index Subsidy Rule
  Complete the initial sampling phase by running Algorithm 1.
  for n = N^(0) + 1, ..., N do                ▷ The hybrid index subsidy rule starts.
    for each i ∈ I(n) do
      Compute ˆq_i(n) = x'_i ˆθ_{g(i)}(n).
      Compute ˜q_i(n) = max_{θ̄ ∈ C_{g(i)}(n)} x'_i θ̄.
      Compute s^{U-I}_i(n) = ˜q_i(n) − ˆq_i(n).
      Offer s_i(n) = 0 if s^{U-I}_i(n) ≤ aσ_x ||ˆθ_{g(i)}(n)||, and s_i(n) = s^{U-I}_i(n) otherwise.   ▷ The hybrid index subsidy.
    end for
    Firm n hires ι(n) = arg max_{i ∈ I(n)} { ˆq_i(n) + s_i(n) } as an equilibrium consequence.
  end for

The following two theorems characterize the regret and the amount of subsidy of the hybrid decision rule.
Theorem 8 (Regret Bound for the Hybrid Decision Rule). Suppose Assumptions 1, 2, and 3. Then, by choosing a sufficiently small δ, the regret under the hybrid decision rule ι^H is bounded as

E[Reg^H(N)] ≤ C_hyb √N,

where C_hyb is a factor that is Õ(1) with respect to N.

Theorem 9 (Subsidy Bound for the Hybrid Subsidy Rules). Suppose Assumptions 1, 2, and 3. By choosing a sufficiently small δ, for any a > 0, the total amounts of the subsidy under the hybrid index subsidy rule (Sub^{H-I}) and the hybrid cost-saving subsidy rule (Sub^{H-CS}) are bounded as

Sub^{H-CS}(N) ≤ Sub^{H-I}(N) ≤ C_hyb-sub,

where C_hyb-sub is a factor that is Õ(1) with respect to N.

Proof.
See Appendix B.7. The explicit form of C_hyb is found in Eq. (54), and the explicit form of C_hyb-sub in Eq. (61) therein.

Theorem 8 states that the order of the regret under the hybrid decision rule is Õ(√N), which is the same as under the original UCB decision rule. Theorem 9 states that the amount of the subsidy is polylogarithmic in N, which is a substantial improvement over the standard UCB mechanism, where an Õ(√N) subsidy is required.

The threshold for switching from the UCB mechanism to laissez-faire is crucial for guaranteeing the performance of the hybrid mechanism. Our threshold, aσ_x ||ˆθ(n)||, is determined in such a way that the hybrid decision rule ι^H satisfies proportionality, a new concept established in this paper. The formal statement appears in Lemma 28 in Appendix B.7, but it requires additional notation that does not appear in the main body of this paper. In what follows, we provide a high-level intuition for the concept of proportionality.

We evaluate the expected regret of the hybrid decision rule by comparing it with the expected regret of the UCB decision rule. However, since different decision rules generate different histories and data, neither decision rule dominates the other, which makes the comparison challenging. We overcome this problem by proving that the hybrid decision rule ι^H is proportional to ι^U in the sense that there exists a constant c > 0 such that, whenever the UCB rule ι^U hires worker i with probability p_i, the hybrid rule ι^H hires worker i with probability at least c·p_i given the same history. This property guarantees that the hybrid rule escapes from underexploring the minority group and secures an expected regret of Õ(√N).

The timing of switching to laissez-faire is crucial for proportionality. When the data about the minority group are insufficient, firms rarely hire minority workers under laissez-faire.
We prove that, when the threshold is set to aσ_x ||ˆθ_g(n)||, firms keep hiring minority workers with sufficiently high frequency, and therefore, statistical discrimination is eventually resolved.

Remark 5 (Dependence on Parameter a). There is a tradeoff between the regret and the subsidy. The constant in front of the regret (Theorem 8) is exp(a²/2), which is increasing in a. By contrast, the constant in front of the subsidy (Theorem 9) is exp(3a²/2)/a², which goes to infinity as a → 0. Theorem 9 guarantees that the subsidy is Õ(1) whenever a > 0. However, when a is small, the bound provided by Theorem 9 becomes large and may not be insightful. To balance the tradeoff, the government should select a "right-size" value for a. In our simulations (Section 8), we adopt a = 1. For small a, because the hybrid mechanism is close to the UCB mechanism, we can divert our analysis of the UCB mechanism (Theorem 5). When a is small and N is finite, the square-root subsidy bound established in Theorem 5 may provide a tighter characterization of the total subsidy.

Although the UCB-based subsidy rule is a powerful policy intervention to resolve statistical discrimination, the subsidy rule is sometimes difficult to implement in practice. This section articulates the advantages and disadvantages of the Rooney Rule, which requires each firm to invite at least one candidate of each group to an on-site interview. The Rooney Rule is relatively easy to implement because it requires neither a subsidy nor a hard hiring quota.

To incorporate the additional information the firms acquire through the interview, we make the following modification to the model. In the modified model, each round n consists of two stages. In the first stage, firm n observes the characteristics x_i of each arriving agent i ∈ I(n). Based on x_i, firm n selects a shortlist of finalists I^F(n) ⊆ I(n), where |I^F(n)| = K^F for some K^F ∈ ℕ.
In the second stage, by interviewing the finalists, firm n observes an additional signal η_i for each finalist i (as assumed in Kleinberg and Raghavan, 2018). Firm n predicts each finalist i's skill from the characteristics x_i and the additional signal η_i, and hires one worker from the set of finalists, ι(n) ∈ I^F(n). Firms are not allowed to hire a worker who was not selected as a finalist. After the firm makes a decision, the skill of the hired worker, y_{ι(n)}, is publicly disclosed.

We assume the following linear relationship between the skill y_i and the observable variables x_i and η_i:

y_i = x'_i θ_{g(i)} + η_i + ε_i.

The "noise" term comprises two variables: η_i and ε_i. The signal η_i is revealed when the firm chooses i as a finalist. However, ε_i remains unpredictable even after the hiring decision; firms only observe y_i after worker i is hired.

For analytical tractability, besides Assumptions 1, 2, and 3, we make the following two assumptions.

Assumption 4 (Two Finalists). Each firm can invite only two finalists; i.e., K^F = 2.

Assumption 4 generates a minimal environment in which to study the performance of the Rooney Rule.

Assumption 5 (Normal Additional Signals). The signal that a finalist reveals is an independent and identically distributed normal random variable: η_i ∼ N(0, σ²_η).

Remark 6. If σ_η = 0, then the two-stage model is the same as the one-stage model that we have considered in the previous sections.

This subsection analyzes the performance of laissez-faire in this two-stage setting. The result is analogous to the one-stage case (Theorem 3): laissez-faire often falls into perpetual underestimation, and therefore, has linear regret.

First, we formally define the regret. As in the one-stage model, the benchmark is the first-best decision rule, which is the rule firms would take if the coefficient parameter θ were known.
Clearly, the first-best decision rule would greedily invite the top-K^F workers in terms of q_i to the final interview. We denote the set of finalists chosen by the first-best decision rule in round n by Ī^F(n). Formally, Ī^F(n) is obtained by solving the following problem:

Ī^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} q_i   s.t. |I'| = K^F.

After that, the first-best decision rule observes the realization of η_i for i ∈ Ī^F(n) and then hires the worker i who has the highest skill predictor, q_i + η_i. The unconstrained two-stage regret is defined as the loss relative to this first-best decision rule. (This regret is named "unconstrained" because we introduce an alternative definition of regret later.)

Definition 11 (Unconstrained Two-Stage Regret). In the two-stage hiring model, the unconstrained two-stage regret
U2S-Reg of decision rule ι is defined as follows:

U2S-Reg(N) = Σ_{n=1}^{N} { max_{i ∈ Ī^F(n)} (q_i + η_i) − (q_{ι(n)} + η_{ι(n)}) }.

Under laissez-faire, firm n's optimal strategy is to greedily choose finalists based on its belief, i.e.,

I^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} ˆq_i(n)   s.t. |I'| = K^F.

After observing the realization of the additional signals η_i, firm n again selects the candidate with the highest predicted skill: ι(n) = arg max_{i ∈ I^F(n)} { ˆq_i(n) + η_i }.

Even in the two-stage model, laissez-faire has linear regret when the population ratio is imbalanced.
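The two-stage laissez-faire round just described can be sketched end to end: draw workers from the linear model y_i = x_i θ_{g(i)} + η_i + ε_i (with d = 1), shortlist by predicted skill, hire on ˆq_i + η_i, and score the round against the first best. All parameter values and names here are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_worker(theta_g, mu_x=3.0, sigma_x=2.0, sigma_eta=1.0, sigma_eps=2.0):
    """One worker: returns (x, eta, y) with y = x * theta_g + eta + eps.

    x is public; eta is revealed only if the worker becomes a finalist;
    eps stays unpredictable and is observed only through y after hiring.
    """
    x = rng.normal(mu_x, sigma_x)
    eta = rng.normal(0.0, sigma_eta)
    eps = rng.normal(0.0, sigma_eps)
    return x, eta, x * theta_g + eta + eps

def lf_round(q_hat, q_true, eta, K_F=2):
    """One laissez-faire round; returns (hired, unconstrained round regret).

    q_hat, q_true, eta: dicts mapping worker id -> q^_i, q_i, eta_i.
    """
    # Stage 1: shortlist the top-K_F workers by predicted skill.
    finalists = sorted(q_hat, key=q_hat.get, reverse=True)[:K_F]
    # Stage 2: hire the finalist with the highest q^_i + eta_i.
    hired = max(finalists, key=lambda i: q_hat[i] + eta[i])
    # First best shortlists by *true* skill, then hires on q_i + eta_i.
    fb = sorted(q_true, key=q_true.get, reverse=True)[:K_F]
    fb_value = max(q_true[i] + eta[i] for i in fb)
    return hired, fb_value - (q_true[hired] + eta[hired])
```

For instance, with q_hat = {'a': 2, 'b': 1, 'c': 0}, q_true = {'a': 0, 'b': 1, 'c': 2}, and all-zero signals, laissez-faire hires 'a' while the first best would hire 'c', so the round regret is 2; the underestimated worker ('c' here) never even reaches the interview stage.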
Theorem 10 (Failure of Laissez-Faire in the Two-Stage Model). Suppose Assumptions 1, 2, 3, 4, and 5. Suppose also that K₂ = 1 and d = 1, and let K₁ − log(K₁ + 1) > log N. Then, under the laissez-faire decision rule, group 2 is perpetually underestimated with probability at least C_imb = Θ̃(1). Accordingly, the expected regret of the laissez-faire decision rule is E[U2S-Reg^LF] = Ω̃(N).

Proof.
See Appendix B.8.

The proof idea of Theorem 10 is as follows. Under laissez-faire, each firm n interviews the two candidates who have the highest expected skills ˆq_i(n). If both of these workers are majorities, then minority workers are never hired, no matter what the signal η_i of each finalist is. By evaluating the probability that both finalists are majorities, we derive the probability that perpetual underestimation occurs. Note that, to meet K₁ − log(K₁ + 1) ≥ log N, K₁ should be Ω(log N) = Ω̃(1). Hence, Theorems 3 and 10 require the same rate of imbalance in the population ratio.

To summarize, even in the two-stage setting, the laissez-faire decision rule has linear regret (when the population ratio is imbalanced). This is because the laissez-faire decision rule results in perpetual underestimation with a significant probability.

7.3 The Rooney Rule and Exploration

As laissez-faire does not perform well, we need to seek a desirable policy intervention. The Rooney Rule, which requires each firm to invite at least one minority finalist to the final interview, is one of the natural affirmative actions in this setting and is widely implemented in real-world problems.
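To get a feel for the shortlist bottleneck behind Theorem 10's proof idea above, consider a crude symmetric benchmark (our own illustration, not the paper's argument, which works with estimated skills): if the K^F finalist seats were filled exchangeably across all K₁ + K₂ candidates, the per-round chance that every seat goes to a majority worker is a simple counting ratio.

```python
from math import comb

def p_all_majority(K1, K2, K_F=2):
    """P(all K_F finalists are majority workers) when each K_F-subset of
    the K1 + K2 candidates is equally likely -- a symmetric benchmark,
    not the estimation-driven bound in Theorem 10."""
    return comb(K1, K_F) / comb(K1 + K2, K_F)
```

With (K₁, K₂) = (10, 2), this benchmark already gives 45/66 ≈ 0.68 per round; under laissez-faire, the estimated skills of an underexplored minority are additionally biased downward, which pushes the exclusion probability toward one and makes perpetual exclusion from the shortlist plausible.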
Definition 12 (The Rooney Rule). In the two-stage hiring model, the Rooney Rule requires each firm n to select at least one finalist from every group g ∈ G; i.e., for every n and every g ∈ G, I^F(n) must satisfy

|{ i ∈ I^F(n) | g(i) = g }| ≥ 1.   (7)

The Rooney Rule is relatively easy to implement because it imposes no hiring quota and gives no hiring preference to minorities. The Rooney Rule was originally introduced as a National Football League policy to promote the hiring of ethnic-minority candidates for head coaching positions, but variations of the Rooney Rule are now implemented in many industries. Although the Rooney Rule has been used in many places, its theoretical performance has not been studied intensively.

For example, in a Securities and Exchange Commission filing posted in 2018, Amazon declares that "The Amazon Board of Directors has adopted a policy that the Nominating and Corporate Governance Committee include a slate of diverse candidates, including women and minorities, for all director openings. This policy formalizes a practice already in place". In addition, according to O'Brien (2018), Facebook COO Sheryl Sandberg said that "The company's 'diverse slate approach' is a sort of 'Rooney Rule,' the National Football League policy that requires teams to consider minority candidates."

To understand how the Rooney Rule resolves statistical discrimination, we introduce an alternative (weaker) notion of regret, the constrained two-stage regret.

Definition 13 (Constrained Two-Stage Regret). In the two-stage hiring model, the constrained two-stage regret (C2S-Reg) of decision rule ι is defined as follows:

C2S-Reg(N) = Σ_{n=1}^{N} { max_{i ∈ Ĭ^F(n)} (q_i + η_i) − (q_{ι(n)} + η_{ι(n)}) },

where Ĭ^F(n) is given by

Ĭ^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} q_i   (8)
s.t. |I'| = K^F, and for all g ∈ G, |{ i ∈ Ĭ^F(n) | g(i) = g }| ≥ 1.

In plain words, Ĭ^F(n) is the best list of finalists satisfying the constraint (7). If (7) were imposed as an "exogenous constraint" (rather than a policy), the first-best decision rule would interview Ĭ^F(n) to maximize social welfare. Clearly, the unconstrained regret is larger than the constrained regret.

The constrained regret is useful in that it enables us to identify whether the Rooney Rule prevents perpetual underestimation: if perpetual underestimation occurs under the Rooney Rule, then the constrained regret is linear in N. To the contrary, if social learning is successful (i.e., ˆq_i is very close to q_i for all workers), the constrained regret would be close to zero.

Under the Rooney Rule, a myopic firm n greedily chooses candidates based on the estimator ˆq_i(n), subject to the constraints:

I^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} ˆq_i(n)   (9)
s.t. |I'| = K^F, and for all g ∈ G, |{ i ∈ I^F(n) | g(i) = g }| ≥ 1,

and ι(n) = arg max_{i ∈ I^F(n)} { ˆq_i(n) + η_i }. Note that the only difference between Eqs. (8) and (9) is that q_i is replaced by ˆq_i(n).

The following theorem states that the Rooney Rule is able to resolve perpetual underestimation when the signal η_i is sufficiently revealing.

Theorem 11 (Sublinear Constrained Regret under the Rooney Rule). Suppose Assumptions 1, 2, 3, 4, and 5. Then, the constrained regret under the Rooney Rule is bounded as

E[C2S-Reg
Rooney ( N ) (cid:3) ≤ C √ N where C is ˜ O (1) to N . Proof.
See Appendix B.9. The explicit form of C is found in Eq. (68) therein. Note that C depends exponentially on the signal variance σ_η (see the definition of C in Eq. (64)), which implies that a sufficiently large value of σ_η is required to obtain a reasonable bound.

The proof idea is as follows. When a group is underrepresented, with a significant probability no candidate from that group is regarded as the most promising finalist. Hence, laissez-faire may result in perpetual underestimation. The Rooney Rule mitigates this problem by securing a finalist seat for each group. If the additional signal is informative enough (i.e., σ_η is large), there is some probability that the minority finalist beats the majority finalist and is hired. In other words, the additional signal naturally induces exploration for the minority group and prevents perpetual underestimation.

Remark 7.
The Rooney Rule is analogous to the ε-greedy algorithm, which is widely studied in the multi-armed bandit and reinforcement learning literature. The ε-greedy algorithm usually makes a decision based on the greedy algorithm (equivalent to laissez-faire in our model), but with a small probability ε it chooses a worker uniformly at random. In the bandit literature, the ε-greedy algorithm is known to be robust to the choice of the exploration probability ε: in fact, one can prove that the regret of the ε-greedy algorithm is sublinear for any value ε > 0. In our model, the Rooney Rule successfully resolves underexploration because the randomness in the additional signal η_i induces ε-greedy-style experiments.

This subsection shows that, although the Rooney Rule successfully prevents statistical discrimination, it may worsen social welfare as evaluated by the original unconstrained regret. When the population ratio is imbalanced (i.e., K₁/K₂ is large), there is a significant probability that more than one majority worker has high skills. In that case, the true skill of the second-best majority worker (q_i) is likely to be higher than that of the minority champion. This feature raises a constant regret per round: when η_i is normally distributed, any finalist has a positive probability of being hired. Hence, the skills of all candidates matter, and therefore, firms want to interview the top-K^F candidates who have the highest skills. The Rooney Rule prevents this outcome. This effect would be present even if firms had perfect information about the coefficients θ. Furthermore, the loss from the constraint (7) is constant per round, and therefore, results in an unconstrained regret of Ω(N) in total.

Theorem 12 (Linear Unconstrained Regret under the Rooney Rule). Suppose Assumptions 1, 2, 3, 4, and 5. Then, the regret under the Rooney Rule is bounded as

E[U2S-Reg
^Rooney(N)] = Ω(N).

The proof is straightforward from the argument above, and therefore, is omitted.

In summary, both laissez-faire and the Rooney Rule have linear unconstrained regret. However, the structures behind these results are different: laissez-faire has linear regret due to underexploration, whereas the Rooney Rule has linear regret due to underexploitation.

One way to resolve this trade-off is to mix the Rooney Rule and laissez-faire (as the hybrid mechanism does). By starting with the Rooney Rule and abolishing it after sufficiently rich data are obtained, we can mitigate the disadvantage of the Rooney Rule. In Section 8, we also test the performance of such a mechanism.
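To see why a more informative interview signal induces exploration (the mechanism behind Theorem 11's proof idea), consider a Rooney shortlist of two finalists in which the minority finalist trails the majority finalist by a gap Δ in predicted skill. With independent N(0, σ_η²) signals and a hire-the-highest rule on ˆq_i + η_i, the minority is hired whenever its signal advantage exceeds Δ, which happens with probability 1 − Φ(Δ/(√2 σ_η)). A quick check of this comparative static (our own illustration, not a result from the paper):

```python
from math import erf, sqrt

def p_minority_hired(gap, sigma_eta):
    """P(minority finalist wins on q^ + eta) when the minority trails by
    `gap` in predicted skill and each finalist draws an independent
    N(0, sigma_eta^2) interview signal."""
    if sigma_eta == 0.0:
        return 0.0 if gap > 0 else 1.0
    z = gap / (sqrt(2.0) * sigma_eta)        # eta_min - eta_maj ~ N(0, 2 sigma_eta^2)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))  # = 1 - Phi(z)
```

For a fixed gap, this exploration probability rises from essentially zero (uninformative signal) toward 1/2 (very informative signal), consistent with the comparative static in σ_η reported in Section 8.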
This section reports the results of the simulations that we ran to support our theoretical findings. Unless specified otherwise, the model parameters are set as follows: d = 1, μ_x = 3, σ_x = 2, σ_ε = 2. The regularizer of the regression is set to λ = 1. The group sizes are set to (K₁, K₂) = (10, 2). The initial sample size is N⁽⁰⁾ = K₁ + K₂, and the sample size for each group is equal to its population ratio: N₁⁽⁰⁾ = K₁, N₂⁽⁰⁾ = K₂. All the results are averaged over independent runs. The value of δ in the confidence bound is held fixed across all simulations.

We first test the effect of the population ratio on the frequency of perpetual underestimation (i.e., the event that group 2 is never hired after the initial sampling phase). The decision rule is fixed to laissez-faire (LF). We fix the number of minority candidates in each round to two (i.e., K₂ = 2) and vary the number of majority candidates (K₁ = 2, 10, 30, 100).

The source code of the simulations is available at https://github.com/jkomiyama/FairSocialLearning/
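A much-simplified replication of this first experiment can be sketched as follows (one-stage model, d = 1, ridge-regularized per-group least squares; this is our own sketch with placeholder coefficients, not the code in the repository linked above):

```python
import numpy as np

def lf_run(K1, K2, N=200, mu_x=3.0, sigma_x=2.0, sigma_eps=2.0,
           theta=(1.0, 1.0), lam=1.0, seed=0):
    """One laissez-faire run; returns True if group 2 is perpetually
    underestimated (never hired after the initial sampling phase)."""
    rng = np.random.default_rng(seed)
    sxx = {1: lam, 2: lam}   # ridge term folded into the Gram "matrix" (d = 1)
    sxy = {1: 0.0, 2: 0.0}

    def observe(g, x):       # record one (x, y) pair for group g
        y = x * theta[g - 1] + rng.normal(0.0, sigma_eps)
        sxx[g] += x * x
        sxy[g] += x * y

    # Initial sampling phase: K_g observations per group, as in the setup above.
    for g, n0 in ((1, K1), (2, K2)):
        for _ in range(n0):
            observe(g, rng.normal(mu_x, sigma_x))

    hired_minority = False
    for _ in range(N):
        cand = [(g, rng.normal(mu_x, sigma_x)) for g in [1] * K1 + [2] * K2]
        # Myopic firm: hire the candidate with the highest predicted skill.
        g, x = max(cand, key=lambda c: c[1] * sxy[c[0]] / sxx[c[0]])
        observe(g, x)
        hired_minority = hired_minority or (g == 2)
    return not hired_minority

pu = sum(lf_run(10, 2, seed=s) for s in range(20))
print(f"perpetual underestimation in {pu}/20 short runs")
```

Rerunning this with K₁ ∈ {2, 10, 30, 100} mimics the sweep in Figure 1; per Theorem 3, perpetual underestimation should become more frequent as K₁ grows.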
Figure 1: The number of perpetual underestimations among all runs under laissez-faire. The error bars are two-sigma binomial confidence intervals.

(a) Number of minority candidates hired. (b) Regret.
Figure 2: The comparison between the LF and UCB decision rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 are two-sigma confidence intervals.

Figure 1 exhibits the simulation result. Consistent with our theoretical analyses, we observe that (i) as indicated by Theorem 2, laissez-faire rarely results in perpetual underestimation if the population is balanced (i.e., K₁ is close to K₂ = 2), and (ii) as indicated by Theorem 3, perpetual underestimation becomes more frequent as the population of majority workers increases (i.e., as K₁ increases).

8.3 Laissez-Faire vs. the UCB Decision Rule

Figure 2a compares the number of minority workers hired by the laissez-faire (LF) and UCB decision rules. Figure 2b compares the regret under these two rules. The horizontal axis represents the round (where the total number of rounds is fixed to N = 1000), and the vertical axis represents the number of minority workers hired and the regret, respectively. The subsidy required by the UCB mechanism is shown later (in Figure 4).

As indicated by Theorem 3, our simulation shows that laissez-faire has a significant probability of underestimating the minority group. Consequently, we observe the following two facts. First, the number of minority workers hired on average is lower than under the first-best decision rule (which hires a minority worker with probability K₂/(K₁ + K₂) = 2/(10 + 2) ≈ 0.17 in each round). Second, laissez-faire sometimes causes perpetual underestimation, in which case the number of minority workers hired stays at zero and the regret grows linearly in n even in late rounds. Due to the possibility of perpetual underestimation, the spread of the sample paths (denoted by the red area) is very large, indicating that the performance of laissez-faire is highly uncertain.

In contrast, consistent with Theorem 4, the performance of the UCB decision rule is shown to be much more stable.
As the UCB rule avoids underexploration, it does not cause perpetual underestimation. Consequently, (i) the UCB rule's regret is lower than laissez-faire's on average, and (ii) the variance of the regret and of the number of minority workers hired is also small. Note that the UCB decision rule tends to hire more minority workers than the first-best decision rule. This outcome happens because society is typically less knowledgeable about the minority group (due to the uneven population ratio), and therefore, the confidence interval for minority workers is typically larger than that for majority workers.
(a) Number of minority candidates hired. (b) Regret.
Figure 3: The comparison between the UCB and hybrid decision rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 show the two-sigma confidence intervals of the expected regret.

We observe that the number of minority workers hired on average becomes closer to that of the first-best decision rule (Figure 3a). Furthermore, as expected from Theorems 4 and 8, in terms of efficiency (regret), the performance of these two decision rules grows at the same order. However, we find that the hybrid decision rule outperforms UCB in our simulation setting (Figure 3b). We conjecture that this happens because the hybrid decision rule stops overexploring the minority group at an early stage.

Figure 4 compares the total budgets required by (i) the UCB index subsidy rule (UCB), (ii) the hybrid index subsidy rule (Hybrid), (iii) the UCB cost-saving subsidy rule (CS-UCB), and (iv) the hybrid cost-saving subsidy rule (CS-Hybrid).

Figure 4a compares the index subsidy rules. As predicted by Theorems 5 and 9, the hybrid index subsidy rule requires a much smaller budget than the UCB index subsidy rule. Furthermore, the subsidy distributed by the UCB rule appears to be still growing even after all rounds are finished. This is also consistent with our theory, because the UCB rule requires an Õ(√N) subsidy (while the hybrid rule requires only an Õ(1) subsidy).

Figure 4b compares the subsidy amounts of the UCB cost-saving subsidy rule and the hybrid subsidy rules. The UCB index subsidy rule is excluded because it requires a much larger subsidy amount. We observe that (i) the two cost-saving subsidy rules require similar amounts of subsidy (while the hybrid cost-saving subsidy rule performs slightly better), and (ii) the cost-saving method is very effective, even when compared with the hybrid index rule.

Note that, although the subsidy amounts required by these two cost-saving rules are similar,
(a) The UCB index subsidy rule vs. the hybrid index subsidy rule. (b) The UCB cost-saving subsidy rule vs. the hybrid index subsidy rule and the hybrid cost-saving subsidy rule.

Figure 4: The comparison of the budgets required by the subsidy rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 show the two-sigma confidence intervals of the expected regret.

when we have more rounds, the hybrid cost-saving subsidy rule outperforms. Figure 5 illustrates this result. While the subsidy required by the hybrid cost-saving rule remains constant after a few (about 100) rounds, the subsidy required by the UCB cost-saving rule gradually grows. This result is also consistent with our theory: while the subsidy required by the hybrid rule is Õ(1) (Theorem 9), the subsidy required by the UCB cost-saving rule is Õ(√N) (Theorem 6).

This subsection describes the performance of the Rooney Rule compared with laissez-faire. Figure 6a depicts the relationship between the frequency of perpetual underestimation and the informativeness of the signal obtained at the second stage (measured by σ_η, the variance of η_i) under laissez-faire and the Rooney Rule.

For the Rooney Rule, we observe that when the second-stage signal η_i is more informative, perpetual underestimation occurs less often. This outcome happens because, even when the minority finalist is underestimated (the predicted skill ˆq_i is small while the true skill q_i is large), when σ_η is large, the minority finalist has a significant probability of overturning the situation. If this happens often enough, society can learn about the minority group, and statistical discrimination can be spontaneously resolved.
Figure 5: The UCB cost-saving subsidy rule vs. the hybrid index subsidy rule and the hybrid cost-saving subsidy rule with N = 10000. Each line is an average over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. Due to computational limitations, we performed fewer runs of this simulation. The error bars at N = 10000 show the two-sigma confidence intervals of the expected regret.

As for laissez-faire, we observe that it falls into perpetual underestimation with a significant probability for every σ_η adopted in the simulation. This outcome is consistent with our analysis (Theorem 10). Since minority workers are rarely chosen as finalists, they have no opportunity to be hired even when σ_η is large. These results imply that, even in the two-stage model, laissez-faire frequently results in statistical discrimination.

Figure 6b shows the constrained regret of the Rooney Rule. We observe that the constrained regret grows sublinearly in n, implying that the Rooney Rule resolves perpetual underestimation. Hence, under the Rooney Rule, society does not suffer from underexploration of the minority group.

However, this does not imply that the Rooney Rule comes without cost. As we discussed in Subsection 7.4, once the coefficient parameter θ is learned, the Rooney Rule may prevent society from making a fair and efficient decision. To test this, we also examine the growth of the unconstrained regret. Figure 7 exhibits the results of this simulation. We find that the performance of the Rooney Rule is worse than laissez-faire because the cost of underexploitation (under the Rooney Rule) exceeds the cost of underexploration (under laissez-faire).

As we indicated in Subsection 7.4, the performance of the Rooney Rule could be improved if we terminate it after "learning is completed." In this simulation, we also test such a rule, the Rooney-LF rule, which imposes the Rooney Rule on the first 100 firms and then turns to laissez-faire.
We find that the Rooney-LF rule avoids perpetual underestimation, and therefore, performs similarly to laissez-faire.

(a) The number of perpetual underestimations among all runs. The error bars show binomial confidence intervals. (b) The constrained two-stage regret, with σ_η fixed.

Figure 6: The Rooney Rule's performance for exploration.
Figure 7: The unconstrained two-stage regret under laissez-faire (LF), the Rooney Rule, and their hybrid (Rooney-LF).

This result indicates that, if we select the transition timing appropriately, then we can resolve statistical discrimination without compromising the quality of the finalists.
Kannan et al. (2018) show that when we have sufficiently large initial samples (i.e., N⁽⁰⁾ is large), the greedy algorithm (corresponding to laissez-faire in this paper) has sublinear regret. As stated in Remark 2, our analysis also indicates that the probability of perpetual underestimation is small

We also note that the number of initial samples required by the relevant theorem (n_min of Lemma 4.3 therein) is very large and cannot be satisfied in our simulation setting: letting R = σ_x √N, we have n_min ≥ R log(R dK/δ)/λ, which far exceeds our simulation horizon.

(a) The number of perpetual underestimations among all runs. (b) The total amount of subsidies; "Hybrid" denotes the hybrid cost-saving subsidy rule.

Figure 8: The comparison between the hybrid mechanism and uniform sampling. N0 (= N⁽⁰⁾) denotes the number of initial samples taken prior to laissez-faire. The error bars are two-sigma binomial confidence intervals.

when N⁽⁰⁾ is large (see Lemma 24 for full details).

One may think that this "warm-start" version of laissez-faire is efficient. However, the warm-start approach has several disadvantages. First, while we have thus far ignored the cost of acquiring initial samples for analytical tractability, we need to take into account the cost of acquiring uniform samples if we want a sufficiently long warm-start period. As uniform sampling ignores firms' incentives for hiring workers, we need a large budget to implement it in practice. Second, uniform sampling does not maximize any index. Accordingly, it cannot be implemented by any index policy.
Third, uniform sampling is inefficient in terms of information acquisition because it is not adaptive to the current estimated parameters.

We argue that our hybrid mechanism (Section 6) is a more sophisticated version of laissez-faire with a warm start: it initially samples the data adaptively and then switches to laissez-faire at an efficient timing. Hence, we can naturally expect the hybrid mechanism to outperform laissez-faire with initial uniform sampling.

Figure 8 exhibits the simulation results that compare the hybrid mechanism with laissez-faire with various initial samples. In this simulation, the number of initial samples for each group is proportional to the population ratio; i.e., $N_g^{(0)} = (K_g/K) \cdot N^{(0)}$.

Figure 8a measures the number of perpetual underestimations. As indicated by our theory, the larger the initial sample, the less frequently perpetual underestimation occurs. In addition, we observed no perpetual underestimation under the hybrid mechanism, as it solidly incentivizes hiring from an underexplored group.

Figure 8b depicts the subsidy amount required by the cost-saving subsidy rules (recall that uniform sampling cannot be implemented by any index subsidy rule). Here, we can observe that the hybrid cost-saving subsidy rule outperforms laissez-faire with uniform sampling. Laissez-faire requires a large number of initial samples to mitigate perpetual underestimation, and once $N^{(0)}$ is moderately large, the hybrid cost-saving subsidy rule requires a smaller budget than uniform sampling. This result indicates that the hybrid mechanism is more efficient in compensating firms.

Thus far, we have stated all the results in the terminology of the economics and statistical discrimination literature. However, this paper also makes several technical contributions to the literature on contextual bandit problems, which are of independent interest. In particular, we consider the non-discounted reward formulation (Robbins, 1952; Lai and Robbins, 1985).
Unlike other formulations, such as Gittins's (1979) discounted one (e.g., Sundaram, 2005; Bergemann and Välimäki, 2006), this formulation weights future rewards and the current reward equally. The greedy and UCB algorithms have been intensively studied in this literature, and we make several contributions to it. For the convenience of readers, we state our technical contributions using the bandit terminology.
Perpetual Underestimation
The greedy algorithm (which takes the optimal decision at each round based on plug-in parameter estimates) fails due to the randomness in finite samples. This concept originated in the "context-less" bandit, a traditional model that corresponds to the limit of $\sigma_x \to 0$. We prove that, when the context is fixed (or has very small variance), exploration is required to mitigate perpetual underestimation (Theorem 1).

Analysis of the Greedy Algorithm in a Disproportionate Model
Some previous studies (Bastani et al., 2020; Kannan et al., 2018) show that the greedy algorithm performs well if the context variation is sufficient. Our results (Theorem 3) indicate that, when multiple arms form a group (cluster) and share the coefficient parameter, the ratio of the group sizes is crucial for the performance of the greedy algorithm (laissez-faire). This is a novel finding in the contextual multi-armed bandit literature. When the contexts have limited variance, the greedy algorithm fails.
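This failure mode is easy to reproduce in a minimal illustration (arbitrary parameters, not the paper's simulation setup): in a context-less two-group instance, a greedy plug-in rule can abandon the better group after a single unlucky early draw.

```python
import random

def greedy_run(theta=(0.0, -0.5), horizon=500, seed=0):
    """One run of a greedy (laissez-faire-style) rule in a context-less
    two-group setting: each round, hire the group with the larger
    plug-in estimate sums[g] / (counts[g] + 1) (ridge with lambda = 1).
    theta, horizon, and the unit noise scale are illustrative choices."""
    rng = random.Random(seed)
    sums = [0.0, 0.0]   # cumulative observed skill per group
    counts = [0, 0]     # number of hires per group
    for _ in range(horizon):
        est = [sums[g] / (counts[g] + 1) for g in (0, 1)]
        g = 0 if est[0] >= est[1] else 1
        sums[g] += theta[g] + rng.gauss(0.0, 1.0)  # observed skill
        counts[g] += 1
    return counts

# Group 0 has the higher true mean (0.0 > -0.5), yet on some seeds an
# unlucky early draw locks the rule onto group 1 for the rest of the
# horizon: the better group stays perpetually underestimated.
locked_out = sum(1 for s in range(200) if greedy_run(seed=s)[0] < 10)
print(locked_out, "of 200 runs nearly abandoned the better group")
```

With no exploration bonus, a bad first estimate for the better group is never revised because that group is never sampled again.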
Development of the Hybrid Algorithm
Thus far, the contextual bandit literature (e.g., Chu et al., 2011) has studied regret in an "adversarial" setup where the contexts (characteristics) are chosen to maximize the regret, and the UCB algorithm was designed to solve such an adversarial bandit problem. By contrast, this paper assumes that the contexts are drawn from a fixed distribution. Our hybrid algorithm, which switches from a UCB algorithm to a greedy algorithm, takes advantage of the knowledge about the context distribution (more specifically, the information about $\sigma_x$) and selects an appropriate time for switching. Consequently, we obtain the proportionality result (Lemma 28), which is a crucial lemma for evaluating the performance of the hybrid algorithm. As shown theoretically (Theorem 9) and numerically (Subsection 8.6), the hybrid algorithm outperforms the UCB algorithm to a large extent in terms of the total budget required.

Analysis of the Rooney Rule
To our knowledge, this is the first multi-armed bandit study of the Rooney Rule. We show that the greedy algorithm underexplores some arms even when agents are unbiased and fully rational, and that the Rooney Rule can mitigate this underexploration (Theorem 11). The uncertainty in the first stage (the realization of $\eta_i$) helps to mitigate perpetual underestimation by implicitly encouraging exploration.
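The interview constraint itself is easy to state in code. The following is a stylized sketch: the scores, group labels, and swap policy are illustrative, not the paper's two-stage model.

```python
def rooney_shortlist(candidates, score, group, m):
    """Shortlist the top-m candidates by first-stage score, but if no
    minority (group 1) candidate makes the cut, swap the weakest
    finalist for the best minority candidate -- a stylized Rooney Rule."""
    ranked = sorted(candidates, key=lambda i: score[i], reverse=True)
    shortlist = ranked[:m]
    minority = [i for i in candidates if group[i] == 1]
    if minority and not any(group[i] == 1 for i in shortlist):
        shortlist[-1] = max(minority, key=lambda i: score[i])
    return shortlist

score = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
group = {"a": 0, "b": 0, "c": 0, "d": 1}
# Without the constraint the shortlist would be ["a", "b"]; the rule
# guarantees the minority candidate "d" is interviewed.
print(rooney_shortlist(list(score), score, group, m=2))
```

The constraint binds only when the unconstrained shortlist excludes the minority group, which is exactly the situation in which the data about that group would otherwise never accumulate.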
10 Conclusion
We studied statistical discrimination using a contextual multi-armed bandit model. Our dynamic model articulates that statistical discrimination can be caused by a failure of social learning. In our model, the insufficiency of the data about the minority group is endogenously generated. The lack of data prevents firms from estimating the candidate workers' skills accurately. Consequently, firms tend to prefer hiring a majority worker, which makes the data insufficiency persistent (perpetual underestimation). In our setting, this form of statistical discrimination is not only unfair but also inefficient. Our results indicate that temporary affirmative action would be the best solution to resolve statistical discrimination as a failure of social learning.

Prior to our work, Kleinberg and Raghavan (2018) study the Rooney Rule in the context of evaluation bias.
Appendix
A Lemmas
This section describes the technical lemmas that are used for deriving the theorems. The Hoeffding inequality, one of the most well-known concentration inequalities, provides a high-probability bound on the sum of bounded independent random variables.
Lemma 13 (Hoeffding inequality). Let $x_1, x_2, \ldots, x_n$ be i.i.d. random variables in $[0,1]$. Let $\bar{x} = (1/n)\sum_{t=1}^n x_t$. Then,
$$\Pr[\bar{x} - \mathbb{E}[\bar{x}] \ge k] \le e^{-2nk^2}.$$
The same argument yields $\Pr[\bar{x} - \mathbb{E}[\bar{x}] \le -k] \le e^{-2nk^2}$, and taking a union bound yields $\Pr[|\bar{x} - \mathbb{E}[\bar{x}]| \ge k] \le 2e^{-2nk^2}$.

The following is a version of the concentration inequality for a sum of squared normal variables.
Lemma 14 (Concentration Inequality for the Chi-squared Distribution). Let $Z_1, Z_2, \ldots, Z_n$ be independent standard normal variables. Then,
$$\Pr\left[\left|\frac{1}{n}\sum_{k=1}^n Z_k^2 - 1\right| \ge t\right] \le 2e^{-nt^2/8}.$$

Lemma 15 (Normal Tail Bound, Feller, 1968). Let $\phi(x) := e^{-x^2/2}/\sqrt{2\pi}$ be the probability density function (pdf) of a standard normal random variable. Let $\Phi^c(x) = \int_x^\infty \phi(x')\,dx'$. Then,
$$\left(\frac{1}{x} - \frac{1}{x^3}\right)\frac{e^{-x^2/2}}{\sqrt{2\pi}} \le \Phi^c(x) \le \frac{1}{x}\frac{e^{-x^2/2}}{\sqrt{2\pi}}.$$

Lemma 16 (Largest Context, Theorem 1.14 in Rigollet, 2015). Let $x_i \sim N(\boldsymbol{\mu}_x, \sigma_x^2 I_d)$ for each $i \in I(n)$. Let $\mu_x = \|\boldsymbol{\mu}_x\|$ and
$$L = L(\delta) := \mu_x + \sigma_x\sqrt{d\left(2\log(KN) + \log(1/\delta)\right)}.$$
Then, with probability at least $1 - \delta$, we have $\|x_i\| \le L(\delta)$ for all $i \in I(n)$, $n \in [N]$.

The following bounds the variance of a conditioned normal variable.
Lemma 17 (Conditioned Tail Deviation). Let $x \sim N(a, 1)$ be a scalar normal random variable with mean $a \in \mathbb{R}$ and unit variance. Then, for any $b \in \mathbb{R}$, $\mathrm{Var}(x \mid x \ge b) \ge 1/4$.

Proof of Lemma 17. Without loss of generality, we assume $b = 0$ because otherwise we can reparametrize $x' = x - b \sim N(a - b, 1)$. If $a \le 0$, the pdf of the conditioned variable $x \mid x \ge 0$ is $\psi(x)$ for $x \ge 0$, and manual evaluation of this distribution reveals that $\mathrm{Var}(x \mid x \ge 0) \ge 1/4$. Otherwise ($a > 0$), the pdf of $x \mid x \ge 0$ satisfies $p(x) \ge \psi(x - a)$ for $x \ge a$, which implies $\mathrm{Var}(x \mid x \ge 0) \ge \mathrm{Var}(z)$, where $z$ is a "half-normal" random variable with cumulative distribution function
$$P(z) = \begin{cases} \Phi(z) & \text{if } z > 0, \\ 1/2 & \text{if } z = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Manual evaluation of $\mathrm{Var}(z)$ also shows that $\mathrm{Var}(z) \ge 1/4$. (Half of the mass of this distribution lies in $z > 0$; the other half is at $z = 0$.)

The following diversity condition, which simplifies the original definition of Kannan et al. (2018), is used to lower-bound the expected minimum eigenvalue of $\bar{V}_g$.

Lemma 18 (Diversity of the Multivariate Normal Distribution). The context $x$ is $\lambda_0$-diverse for $\lambda_0 \in \mathbb{R}$ if, for any $\hat{b} \in \mathbb{R}$ and $\hat{\theta} \in \mathbb{R}^d$,
$$\lambda_{\min}\left(\mathbb{E}\left[xx' \mid x'\hat{\theta} \ge \hat{b}\right]\right) \ge \lambda_0.$$
Let $x \sim N(\boldsymbol{\mu}_x, \sigma_x^2 I_d)$. Then, the context $x$ is $\lambda_0$-diverse with $\lambda_0 = \sigma_x^2/4$.

Proof of Lemma 18.
$$\lambda_{\min}\left(\mathbb{E}\left[xx' \mid x'\hat{\theta} \ge \hat{b}\right]\right) = \min_{v: \|v\| = 1} \mathbb{E}\left[(v'x)^2 \mid x'\hat{\theta} \ge \hat{b}\right] \ge \min_{v: \|v\| = 1} \mathrm{Var}\left[v'x \mid x'\hat{\theta} \ge \hat{b}\right].$$
Let $e_1, e_2, \ldots, e_d$ be orthogonal bases. Without loss of generality, we assume $\hat{\theta} = \theta_1 e_1$ for some $\theta_1 \ge 0$ and $\boldsymbol{\mu}_x = u_1 e_1 + u_2 e_2$ for some $u_1, u_2 \in \mathbb{R}$. Let $x = x_1 e_1 + x_2 e_2 + \cdots + x_d e_d$. Due to the property of the normal distribution, the coordinates $x_l$ for $l \in [d]$ are independent of each other. We will show that
$$\mathrm{Var}\left[x_l \mid x'\hat{\theta} \ge \hat{b}\right] \ge \sigma_x^2/4, \qquad (10)$$
which suffices to prove Lemma 18.

- For the first dimension, we have $x_1 \sim N(u_1, \sigma_x^2)$ and $\mathrm{Var}[x_1 \mid x'\hat{\theta} \ge \hat{b}] = \mathrm{Var}[x_1 \mid x_1 \ge \hat{b}/\theta_1]$. Applying Lemma 17 to $x_1/\sigma_x$ with $a = u_1/\sigma_x$ and $b = \hat{b}/(\theta_1\sigma_x)$ yields $\mathrm{Var}[x_1 \mid x_1 \ge \hat{b}/\theta_1] \ge \sigma_x^2/4$.
- For the second dimension, we have $x_2 \sim N(u_2, \sigma_x^2)$ and $\mathrm{Var}[x_2 \mid x'\hat{\theta} \ge \hat{b}] = \mathrm{Var}[x_2] = \sigma_x^2 > \sigma_x^2/4$.
- $(x_3, x_4, \ldots, x_d) \sim N(0, \sigma_x^2 I_{d-2})$. These coordinates are unaffected by the conditioning, and thus $\mathrm{Var}(x_l) = \sigma_x^2 > \sigma_x^2/4$.

In summary, we have Eq. (10), which concludes the proof.
Lemma 19 (Martingale Inequality on Ridge Regression, Abbasi-Yadkori et al., 2011). Assume that $\|\theta_g\| \le S$. Take $\delta > 0$ arbitrarily. With probability at least $1 - \delta$, the true parameter $\theta_g$ is bounded as
$$\forall n, \quad \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \sigma_\epsilon\sqrt{2\log\left(\frac{\det(\bar{V}_g(n))^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2}S. \qquad (11)$$
Moreover, let $L = \max_{i,n}\|x_i(n)\|$ and
$$\beta_n(L, \delta) = \sigma_\epsilon\sqrt{d\log\left(\frac{1 + nL^2/\lambda}{\delta}\right)} + \lambda^{1/2}S.$$
Then, with probability at least $1 - \delta$,
$$\forall n, \quad \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \beta_n(L, \delta). \qquad (12)$$

The following lemma is used in deriving a regret bound.

Lemma 20 (Sum of Diminishing Contexts, Lemma 11 in Abbasi-Yadkori et al., 2011). Let $\lambda \ge 1$ and $L = \max_{n,i}\|x_i(n)\|$. Then, the following inequality holds for any group $g$:
$$\sum_{n: \iota(n) = g}\left\|x_{\iota(n)}\right\|^2_{(\bar{V}_g(n))^{-1}} \le 2L^2\log\left(\frac{\det(\bar{V}_g(N))}{\det(\lambda I_d)}\right).$$

The following inequality is used to bound the variation of the minimum eigenvalue of the sum of characteristics (contexts).

Lemma 21 (Matrix Azuma Inequality, Tropp, 2012). Let $X_1, X_2, \ldots, X_n$ be an adaptive sequence of $d \times d$ symmetric matrices such that $\mathbb{E}_{k-1}[X_k] = 0$ and $X_k^2 \preceq A_k^2$ almost surely, where $A \succeq B$ for two matrices denotes that $A - B$ is positive semidefinite. Let
$$\sigma_A^2 := \left\|\frac{1}{n}\sum_{k=1}^n A_k^2\right\|,$$
where the matrix norm is defined by the largest eigenvalue. Then, for all $t \ge 0$,
$$\Pr\left[\lambda_{\min}\left(\sum_k X_k\right) \le -t\right] \le d\exp\left(-\frac{t^2}{8n\sigma_A^2}\right).$$

Proof. The proof directly follows from Theorem 7.1 and Remark 3.10 in Tropp (2012).

The following lemma states that the selection bias makes the variance only slightly ($O(1/\log K)$ times) smaller than the original variance.

Lemma 22 (Variance of Maximum, Theorem 1.8 in Ding, Eldan, and Zhai, 2015). Let $x_1, \ldots, x_K \in \mathbb{R}$ be i.i.d. samples from $N(0,1)$. Let $I_{\max} = \arg\max_{i \in [K]} x_i$. Then, there exists a distribution-independent constant $C_{\mathrm{varmax}} > 0$ such that
$$\mathrm{Var}[x_{I_{\max}}] \ge \frac{C_{\mathrm{varmax}}}{\log(K)}.$$

B Proofs
This section is structured as follows. Section B.1 describes the common inequalities that we assume throughout the section. Proofs of the individual theorems follow.
B.1 Common Inequalities
In the proofs, we often ignore events that happen with probability $O(1/N)$. The expected regret per round is at most $\max_i x_i'\theta_{g(i)} - \min_i x_i'\theta_{g(i)}$, which is $O(1)$ in expectation. Hence, events that happen with probability $O(1/N)$ contribute to the regret by $O(1/N \times N) = O(1)$, which is insignificant in our analysis. In particular,
$$\Pr\left[\forall n \in [N], i \in I(n),\ \|x_i(n)\| \le L\left(\tfrac{1}{N}\right)\right] \ge 1 - \frac{1}{N} \quad \text{(by Lemma 16)}. \qquad (13)$$
Moreover,
$$\Pr\left[\forall n \in [N], g \in G,\ \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \beta_n\left(L, \tfrac{1}{N}\right)\right] \ge 1 - \frac{|G|}{N} \quad \text{(by Eq. (12) in Lemma 19)}, \qquad (14)$$
and throughout the proofs we ignore the cases where these events do not hold: all the contexts are bounded by $L(1/N) = O(\sqrt{\log N})$, and all the confidence bounds hold with $\beta_n(L(1/N), 1/N) \le \beta_N(L(1/N), 1/N) = O(\sqrt{\log N}) = \tilde{O}(1)$, which grows very slowly as $N$ grows large. (The exception is Theorem 1, which does not pose distributional assumptions.) We also denote $L = L(1/N)$ and $\beta_N = \beta_N(L, 1/N)$.

We next discuss the upper confidence bounds.

Remark 8 (Bound for $\tilde{\theta}_i$). Let $\tilde{\theta}_i = \arg\max_{\bar{\theta}_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i'\bar{\theta}_{g(i)}$. By definition of $\tilde{\theta}_i$, the following holds:
$$\forall n, \quad \left\|\tilde{\theta}_i - \hat{\theta}_{g(i)}(n)\right\|_{\bar{V}_g(n)} \le \beta_N, \qquad (15)$$
and Eq. (14) implies
$$\forall n, \quad x_i'\left(\tilde{\theta}_i - \theta_{g(i)}\right) \ge 0. \qquad (16)$$
Moreover, by using the triangle inequality, we have
$$\left\|\tilde{\theta}_i - \theta_g\right\|_{\bar{V}_g(n)} \le \left\|\tilde{\theta}_i - \hat{\theta}_g(n)\right\|_{\bar{V}_g(n)} + \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)},$$
and thus Eq. (14) implies
$$\forall n, \quad \left\|\tilde{\theta}_i - \theta_g\right\|_{\bar{V}_g(n)} \le 2\beta_N. \qquad (17)$$

We use calligraphic font to denote events.
For two events $\mathcal{A}, \mathcal{B}$, let $\mathcal{A}^c$ be the complementary event and $\{\mathcal{A}, \mathcal{B}\} := \{\mathcal{A} \cap \mathcal{B}\}$. We also use a prime to denote an event that is close to the original event: for example, event $\mathcal{A}'$ is different from event $\mathcal{A}$, but the two events are deeply linked.

Finally, we discuss the minimum eigenvalue. We denote $A \succeq B$ for two $d \times d$ matrices if $A - B$ is positive semidefinite, that is, $\lambda_{\min}(A - B) \ge 0$. Note that $\lambda_{\min}(A + B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$, that $\lambda_{\min}(A + B) \ge \lambda_{\min}(A)$ if $B \succeq 0$, and that $xx' \succeq 0$ for any vector $x \in \mathbb{R}^d$.

B.2 Proof of Theorem 1
Consider an environment where there are two groups, $G = \{1, 2\}$, and two workers arrive in each round, $K_1 = K_2 = 1$. We assume that the error terms follow a standard normal distribution, i.e., $\sigma_\epsilon = 1$, and we set the ridge regression parameter to $\lambda = 1$. We assume $N_1^{(0)} = 1$ and $N_2^{(0)} = 0$. We consider a "no context" setting: namely, $d = 1$ and $x_i = 1$ for all $i \in I$. We assume that $\theta_1 = 0$ and $\theta_2 = -b$ with $b > 0$; under this assumption, hiring a worker from group 2 incurs a regret of $b$. (The following derivation does not strongly depend on the specific values of these parameters nor on the number of groups.)

Let $R_g(n)$ be the sum of the skills of the workers who have been hired until round $n$ (i.e., have arrived from round 1 to $n-1$) and belong to group $g$. Then, (2) implies that $\hat{\theta}_g(n) = R_g(n)/(N_g(n) + \lambda) = R_g(n)/(N_g(n) + 1)$. Since $x_i$ is fixed to 1, firm $n$ chooses the group whose predicted expected skill $\hat{\theta}_g(n)$ is larger.

In round 1, the firm hires the group-1 worker: $\iota(1) = 1$. Let $b' > b$ be a constant that we specify later. Let
$$\mathcal{A} := \left\{\hat{\theta}_1(2) < -b'\right\} = \left\{\epsilon_{\iota(1)} < -(1 + \lambda)b'\right\}$$
be the event that the skill of the worker hired in round 1 is smaller than $-(1 + \lambda)b' = -2b'$. The probability that $\mathcal{A}$ occurs is $\Phi(-2b')$, where $\Phi(x)$ is the cumulative distribution function of a standard normal distribution. Let
$$\mathcal{B} := \bigcap_{n=1}^N\left\{\hat{\theta}_2(n) \ge -b'\right\}$$
be the event that $\hat{\theta}_2(n)$ never becomes smaller than $-b'$.

We evaluate the probability that $\mathcal{A} \cap \mathcal{B}$ occurs. When such an event happens, a group-1 worker is hired in round 1, and group-2 workers are hired in all the subsequent rounds (i.e., $\iota(n) = 2$ for any round $n > 1$). Accordingly, $N_2(n) = n - 2$ is the case for all $n \ge 2$. Since
$$\left\{\hat{\theta}_2(n) \ge -b'\right\} = \left\{\frac{R_2(n)}{N_2(n) + 1} \ge -b'\right\} \supseteq \left\{\frac{R_2(n)}{N_2(n)} \ge -b'\right\},$$
applying Hoeffding's inequality to the empirical average $R_2(n)/N_2(n)$, we have
$$P\left(\frac{R_2(n)}{N_2(n)} < -b'\right) \le \exp\left(-2(b' - b)^2(n - 2)\right).$$
Accordingly,
$$P(\mathcal{B}) \ge 1 - \sum_{n=3}^N\exp\left(-2(b' - b)^2(n - 2)\right).$$
Here,
$$\sum_{n=3}^N\exp\left(-2(b' - b)^2(n - 2)\right) \le \sum_{n=3}^\infty\exp\left(-2(b' - b)^2(n - 2)\right) = \frac{\exp\left(-2(b' - b)^2\right)}{1 - \exp\left(-2(b' - b)^2\right)} \le \frac{1}{2(b' - b)^2},$$
and thus $\mathcal{B}$ occurs with constant probability $1 - 1/(2(b' - b)^2) > 0$ for any $b' > b + 1/\sqrt{2}$. Remember that $\mathcal{A} \cap \mathcal{B}$ implies that arm 1 is never drawn after $n > 1$, and thus $\mathrm{Reg}(N) \ge b(N - 1)$. In conclusion, we have
$$\mathbb{E}[\mathrm{Reg}(N)] \ge \Phi(-2b') \cdot \left(1 - \frac{1}{2(b' - b)^2}\right) \cdot b \cdot (N - 1) = \Omega(N),$$
as desired.

B.3 Proof of Theorem 2
We first bound the regret per round, $\mathrm{reg}(n) := \mathrm{Reg}(n) - \mathrm{Reg}(n-1)$, in Lemma 23. Then, we prove Theorem 2.

Lemma 23 (Regret per Round). Under the laissez-faire decision rule, the regret per round is bounded as
$$\mathrm{reg}(n) \le 2\max_{i \in I(n)}\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}}.$$

Proof of Lemma 23.
We denote the first-best decision rule by $i^*(n) := \arg\max_{i \in I(n)} x_i'\theta_{g(i)}$. Then,
$$\begin{aligned}
\mathrm{reg}(n) &= x_{i^*}'\theta_{g(i^*)} - x_\iota'\theta_{g(\iota)} \\
&= x_{i^*}'\left(\hat{\theta}_{g(i^*)} + \theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right) - x_\iota'\left(\hat{\theta}_{g(\iota)} + \theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right) \\
&\le x_{i^*}'\left(\theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right) - x_\iota'\left(\theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right) && \text{(by the greedy choice of the firm)} \\
&\le \|x_{i^*}\|_{\bar{V}_{g(i^*)}^{-1}}\left\|\theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right\|_{\bar{V}_{g(i^*)}} + \|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\left\|\theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right\|_{\bar{V}_{g(\iota)}} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\max_{i \in I(n)}\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}}.
\end{aligned}$$

The following proves Theorem 2. For the ease of discussion, we assume $N^{(0)} = 0$; that is, there is no initial sampling phase (taking it into consideration is straightforward). We first show that, regardless of the estimated values $\hat{\theta}_1, \hat{\theta}_2$, the candidate of group 2 is drawn with constant probability. Let $\mu_x = \|\boldsymbol{\mu}_x\|$, and let
$$\mathcal{M}_1(n) = \left\{x_1'(n)\hat{\theta}_1 \le 0\right\}, \qquad \mathcal{M}_2(n) = \left\{x_2'(n)\hat{\theta}_2 > 0\right\}.$$
The sign of $x_g'\hat{\theta}_g(n)$ is solely determined by the component of $x_g(n)$ that is parallel to $\hat{\theta}_g(n)$. This component is drawn from $N(\mu_{x,\parallel}, \sigma_x^2)$, where $\mu_{x,\parallel}$ is the component of $\boldsymbol{\mu}_x$ that is parallel to $\hat{\theta}_g(n)$. Therefore, for any $\hat{\theta}_1$, we have
$$\Pr[\mathcal{M}_1(n)] \ge \Phi^c(\mu_x/\sigma_x). \qquad (18)$$
Likewise, for $\hat{\theta}_2 \ne 0$, we have
$$\Pr[\mathcal{M}_2(n)] \ge \Phi^c(\mu_x/\sigma_x). \qquad (19)$$
(Equality $\Pr[\mathcal{M}_1(n)] = \Phi^c(\mu_x/\sigma_x)$ holds when $\mu_{x,\parallel} = \mu_x$, namely, when the direction of $\boldsymbol{\mu}_x$ is exactly the same as $\hat{\theta}_1$. In the subsequent discussion, we ignore the point mass $\hat{\theta}_2 = 0$, which has measure zero for $N_2(n) > 0$.)

Let $\mathcal{X}_g(n) = \{g(\iota(n)) = g\}$ for $g \in \{1, 2\}$. By using Eq. (18) and (19),
$$\Pr[\mathcal{X}_2(n)] = \Pr\left[x_1'(n)\hat{\theta}_1 < x_2'(n)\hat{\theta}_2\right] \ge \Pr\left[x_1'(n)\hat{\theta}_1 \le 0 < x_2'(n)\hat{\theta}_2\right] = \Pr[\mathcal{M}_1(n), \mathcal{M}_2(n)] \ge \left(\Phi^c(\mu_x/\sigma_x)\right)^2. \qquad (20)$$
Let $N_2^{(M)}(n) = \sum_{n'=1}^n \mathbf{1}[\mathcal{M}_1(n'), \mathcal{X}_2(n')] \le N_2(n)$. Eq. (20) implies $\mathbb{E}[N_2^{(M)}(n)] \ge (\Phi^c(\mu_x/\sigma_x))^2 n$. By using the Hoeffding inequality, with probability at least $1 - 1/N^2$, we have
$$N_2^{(M)}(n) \ge n\left((\Phi^c(\mu_x/\sigma_x))^2 - k\right) \qquad (21)$$
for $k = \sqrt{\log(N)/n}$. Therefore, a union bound over $n = 1, 2, \ldots, N$ implies that Eq. (21) holds for all $n$ with probability at least $1 - \sum_n 1/N^2 = 1 - 1/N$.

In the following, we bound $\lambda_{\min}(\bar{V}_g)$. Note that the hiring of a worker under the events $\mathcal{M}_1(n), \mathcal{X}_2(n)$ satisfies the diversity condition (Lemma 18) with $\hat{b} = 0$, and we have $\lambda_{\min}(\mathbb{E}[x_\iota x_\iota' \mid \mathcal{M}_1(n), \mathcal{X}_2(n)]) \ge \lambda_0$ with $\lambda_0 = \sigma_x^2/4$. Using the matrix Azuma inequality (Lemma 21) for the subsequence $\{x_\iota x_\iota' : \mathcal{M}_1(n), \mathcal{X}_2(n)\}$ with $X = x_\iota x_\iota' - \mathbb{E}[x_\iota x_\iota']$ and $\sigma_A = 2L^2$, for $t = \sqrt{8N\sigma_A^2\log(dN)}$, with probability $1 - 1/N$,
$$\lambda_{\min}\left(\sum_{n: \iota(n) = 2} x_\iota x_\iota'\right) \ge N_2^{(M)}\lambda_0 - t. \qquad (22)$$
In summary, with probability $1 - O(1/N)$, Eq. (21) and (22) hold, and then we have
$$\lambda_{\min}(\bar{V}_2(n)) \ge N_2^{(M)}\lambda_0 - t \ge \left(n(\Phi^c(\mu_x/\sigma_x))^2 - k\right)\lambda_0 - \sqrt{8N\sigma_A^2\log(dN)} = n(\Phi^c(\mu_x/\sigma_x))^2\lambda_0 - \tilde{O}(\sqrt{n}). \qquad (23)$$
By the symmetry of the two groups, exactly the same result as Eq. (23) holds for group 1. In the following, we bound the regret as a function of $\min_g\lambda_{\min}(\bar{V}_g)$. Eq. (23) holds with probability $1 - O(1/N)$, and we ignore the events of probability $O(1/N)$, which do not affect the analysis.
The regret is bounded as
$$\begin{aligned}
\mathrm{Reg}(N) &\le \sum_n 2\max_i\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}} && \text{(by Lemma 23)} \qquad (24) \\
&\le \sum_n 2\max_i\|x_i\|_{\bar{V}_{g(i)}^{-1}}\beta_N && \text{(by Eq. (14))} \\
&\le \sum_n 2\max_i\frac{\|x_i\|}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}\beta_N && \text{(by the definition of eigenvalues)} \\
&\le \sum_n\frac{2L}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}\beta_N && \text{(by Eq. (13))} \\
&\le 2L\sum_n\min\left(\frac{1}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}, \frac{1}{\sqrt{\lambda}}\right)\beta_N && \text{(by } \lambda_{\min}(\bar{V}_{g(i)}) \ge \lambda\text{)} \\
&\le 2L\sum_n\min\left(\frac{1}{\sqrt{n(\Phi^c(\mu_x/\sigma_x))^2\lambda_0 - \tilde{O}(\sqrt{n})}}, \frac{1}{\sqrt{\lambda}}\right)\beta_N && \text{(by Eq. (23))} \\
&\le 4L\sqrt{\frac{N}{(\Phi^c(\mu_x/\sigma_x))^2\lambda_0}}\,\beta_N + \tilde{O}(1) && \left(\text{by } \sum_{n=C+1}^N\frac{1}{\sqrt{n - C\sqrt{n}}} \le 2\sqrt{N} + \tilde{O}(1) \text{ for } C = \tilde{O}(1)\right),
\end{aligned}$$
which completes the proof of Theorem 2.

B.4 Proof of Theorem 3
Since we consider the $d = 1$ case in this theorem, we remove the bold style from scalar variables. In this proof, we assume $\mu_x\theta_2 > 0$ and $\theta_1 > 0$; the proof for the case of $\mu_x\theta_2 < 0$ or $\theta_1 < 0$ is similar. Let $\hat{\theta}_{g,t}$ be the value of $\hat{\theta}_g$ when the group-$g$ candidate has been chosen $t$ times. With a slight abuse of notation, we use $i_2 = i_2(n)$ to denote the unique candidate of group 2 in each round $n$. We first define several events that characterize perpetual underestimation. Namely,
$$\mathcal{P} = \left\{\left|\hat{\theta}_{2, N_2^{(0)}}\right| < b\theta_2\right\}, \qquad \mathcal{P}'(n) = \left\{x_{i_2}(n)\,\hat{\theta}_{2, N_2^{(0)}} < \frac{\mu_x\theta_1}{2}\right\},$$
$$\mathcal{Q} = \left\{\forall t \ge N_1^{(0)},\ \hat{\theta}_{1,t} \ge \frac{\theta_1}{2}\right\}, \qquad \mathcal{Q}'(n) = \left\{\exists i \text{ s.t. } g(i) = 1,\ x_i\hat{\theta}_{1, N_1(n)} \ge \frac{\mu_x\theta_1}{2}\right\},$$
where $b$ is a small constant that we specify later (we will take $b = O(1/\log N)$). $\mathcal{P}$ and $\mathcal{P}'$ are about the minority, whereas $\mathcal{Q}$ and $\mathcal{Q}'$ are about the majority. Intuitively, event $\mathcal{P}$ states that $\hat{\theta}_2$ is largely underestimated, and $\mathcal{P}'$ states that the minority candidate is undervalued; $\mathcal{Q}$ states that the majority parameter $\hat{\theta}_1$ is consistently lower-bounded, and $\mathcal{Q}'$ states the stability of the best candidate of the majority after $n$ rounds. Under laissez-faire,
$$\bigcap_{n=1}^N\left(\mathcal{P}'(n) \cap \mathcal{Q}'(n)\right)$$
implies that a group-1 worker is hired in every round ($g(\iota(n)) = 1$ for all $n$), which is exactly the perpetual underestimation of Definition 2. Therefore, proving
$$\Pr\left[\bigcap_{n=1}^N\left(\mathcal{P}'(n) \cap \mathcal{Q}'(n)\right)\right] = \tilde{\Omega}(1) \qquad (25)$$
concludes the proof. We bound these events by the following lemmas and finally derive Eq. (25).

Lemma 24.
There exists a constant $C_1 > 0$ such that $\Pr[\mathcal{P}] \ge C_1 b$.

Proof of Lemma 24.
We write $x_{i_2,t}$ for the $t$-th sample of group 2 during the initial sampling phase, which is an i.i.d. sample from $N(\mu_x, \sigma_x^2)$. Likewise, we write $y_{i_2,t} = x_{i_2,t}\theta_2 + \epsilon_t$. Then,
$$\Pr[\mathcal{P}] = \Pr\left[\left|\frac{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}(x_{i_2,t}\theta_2 + \epsilon_t)}{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 + \lambda}\right| \le b\theta_2\right] = \Pr\left[-g(b) \le \sum_{t=1}^{N_2^{(0)}} x_{i_2,t}(x_{i_2,t}\theta_2 + \epsilon_t) \le g(b)\right],$$
where
$$g(b) = b\theta_2\left(\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 + \lambda\right).$$
Let $x_{i_2,t} = \mu_x + e_t$. Define an event $\mathcal{R}$ as
$$\mathcal{R} = \left\{\sum_{t=1}^{N_2^{(0)}} e_t^2 \le \frac{5}{2}\sigma_x^2 N_2^{(0)}\right\} \subseteq \left\{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 \le N_2^{(0)}\left(2\mu_x^2 + 5\sigma_x^2\right)\right\},$$
where we used $x_{i_2,t}^2 = (\mu_x + e_t)^2 \le 2(\mu_x^2 + e_t^2)$ in the last transformation. By using Lemma 14, we have $\Pr[\mathcal{R}^c] \le 2e^{-9N_2^{(0)}/32} \le 1/4$. Define
$$\mathcal{S} = \left\{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t} = \sum_{t=1}^{N_2^{(0)}}(\mu_x + e_t) \ge N_2^{(0)}\mu_x\right\}.$$
It is easy to confirm that $\Pr[\sum_t(\mu_x + e_t) \ge N_2^{(0)}\mu_x] \ge 1/2$, and thus
$$\Pr[\mathcal{R} \cap \mathcal{S}] \ge 1 - 1/4 - 1/2 = 1/4. \qquad (26)$$
Note that $\mathcal{S}$ implies, via $\sum_t x_{i_2,t}^2 \ge (\sum_t x_{i_2,t})^2/N_2^{(0)}$,
$$g(b) \ge b\theta_2\left(N_2^{(0)}\mu_x^2 + \lambda\right). \qquad (27)$$
Conditioned on $x_{i_2,t}$, we have $x_{i_2,t}\epsilon_t \sim N(0, x_{i_2,t}^2\sigma_\epsilon^2)$. Moreover, by using the property of sums of independent normal random variables,
$$\sum_t x_{i_2,t}\epsilon_t \sim N\left(0, \sum_t x_{i_2,t}^2\sigma_\epsilon^2\right). \qquad (28)$$
Letting
$$L_R = \frac{-g(b) - \sum_t x_{i_2,t}^2\theta_2}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}}, \qquad U_R = \frac{g(b) - \sum_t x_{i_2,t}^2\theta_2}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}}, \qquad M_R = \frac{L_R + U_R}{2} = -\frac{\sqrt{\sum_t x_{i_2,t}^2}\,\theta_2}{\sigma_\epsilon},$$
we have
$$\begin{aligned}
\Pr\left[-g(b) \le \sum_t\left(x_{i_2,t}^2\theta_2 + x_{i_2,t}\epsilon_t\right) \le g(b)\right] &\ge \Pr\left[-g(b) \le \sum_t\left(x_{i_2,t}^2\theta_2 + x_{i_2,t}\epsilon_t\right) \le g(b),\ \mathcal{R}, \mathcal{S}\right] \\
&\ge \Pr\left[-g(b) - \sum_t x_{i_2,t}^2\theta_2 \le \sum_t x_{i_2,t}\epsilon_t \le g(b) - \sum_t x_{i_2,t}^2\theta_2,\ \mathcal{R}, \mathcal{S}\right] \\
&\ge \Pr[\mathcal{R}, \mathcal{S}]\min_{e_n : \mathcal{R}, \mathcal{S}}\left[\int_{L_R}^{U_R}\phi(y)\,dy\right] && \text{(by Eq. (28))} \\
&\ge \frac{1}{4}\min_{e_n : \mathcal{R}, \mathcal{S}}\left[\int_{L_R}^{U_R}\phi(y)\,dy\right]. && \text{(by Eq. (26))} \qquad (29)
\end{aligned}$$
The following bounds Eq. (29). The integral's bandwidth is
$$U_R - L_R = \frac{2g(b)}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}} \ge \frac{2g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}} \quad \text{(by event } \mathcal{R}\text{)}.$$
The value of $\phi(y)$ within $[M_R, M_R + 1]$ is at least $\phi(M_R)/e^{1/2} \ge (1/2)\phi(M_R)$. Therefore,
$$\int_{L_R}^{U_R}\phi(y)\,dy \ge \min\left(1, \frac{g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}}\right)\times\frac{\phi(M_R)}{2}. \qquad (30)$$
Moreover,
$$\phi(M_R) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{M_R^2}{2}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta_2^2\sum_t x_{i_2,t}^2}{2\sigma_\epsilon^2}\right) \ge \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta_2^2 N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}{2\sigma_\epsilon^2}\right) \quad \text{(by event } \mathcal{R}\text{)}. \qquad (31)$$
By using these, we have
$$\int_{L_R}^{U_R}\phi(y)\,dy \ge \min\left(1, \frac{g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}}\right)\frac{\phi(M_R)}{2} = \Omega\left(b\sqrt{N_2^{(0)}}\exp\left(-\frac{\theta_2^2 N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}{2\sigma_\epsilon^2}\right)\right) \quad \text{(by Eq. (27), (31))}.$$
The exponent does not depend on $b$: given all the model parameters as constants, the probability of $\mathcal{P}$ is $\Omega(b)$, which concludes the proof.

The following Lemma 25 on $\mathcal{Q}$ is about the stability of the mean estimator, which is widely used to prove lower bounds in multi-armed bandit problems. Namely, for any $\Delta > 0$, a wide class of mean estimators $\hat{\theta}$ of $\theta$ satisfies
$$\Pr\left[\bigcap_{n=1}^\infty\left(\hat{\theta}(n) \ge \theta - \Delta\right)\right] \ge C \qquad (32)$$
for some constant $C = C(\theta, \Delta) > 0$. Lemma 25 is a version of Eq. (32) for our ridge estimator.

Lemma 25.
There exists a constant $N_1^{(0)}$ that is independent of $N$ such that, with a warm start of size $N_1^{(0)}$, $\Pr[\mathcal{Q}] \ge C_2$ holds for a constant $C_2 > 0$.

Proof of Lemma 25.
In this proof, we use $t \ge 1$ to index the estimator by the number of samples drawn; for example, $\bar{V}_{g,t} := \bar{V}_g(n)$ for $n$ such that $N_g(n-1) = t$. Note that we consider the $d = 1$ case, and $\bar{V}_{1,t} = \sum_{t'=1}^t x_{1,t'}^2 + \lambda$. By the martingale bound (Eq. (11)), with probability $1 - \delta$,
$$\forall t \ge 1, \quad \left|\hat{\theta}_{1,t} - \theta_1\right|\sqrt{\bar{V}_{1,t}} \le \sigma_\epsilon\sqrt{2\log\left(\frac{\bar{V}_{1,t}^{1/2}\lambda^{-1/2}}{\delta}\right)} + \lambda^{1/2}S. \qquad (33)$$
Let $\delta = 1/4$. It follows from $\sqrt{\log x} \le \sqrt{x}$ for any $x > 0$ that
$$\sqrt{\log\left(\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta\right)} \le \sqrt{\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta}. \qquad (34)$$
Therefore,
$$\left|\hat{\theta}_{1,t} - \theta_1\right| \le \frac{\sigma_\epsilon\sqrt{2\log\left(\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta\right)} + \lambda^{1/2}S}{\sqrt{\bar{V}_{1,t}}} \le \frac{\sigma_\epsilon\sqrt{2\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta} + \lambda^{1/2}S}{\sqrt{\bar{V}_{1,t}}} \quad \text{(by Eq. (33) and (34))},$$
and thus a sufficient condition for $\forall t \ge N_1^{(0)},\ |\hat{\theta}_{1,t} - \theta_1| \le \theta_1/2$ is that the initial sample size $N_1^{(0)}$ makes $\bar{V}_{1, N_1^{(0)}}$ exceed a threshold that depends only on $\theta_1$, $\sigma_\epsilon$, $\lambda$, and $S$.

Note that $\Pr[\bar{V}_{1, N_1^{(0)}} \ge \mu_x^2 N_1^{(0)}] \ge 1/2$. Letting the observation noise $\sigma_\epsilon$ and the regularizer $\lambda$ be constants, a constant-size warm start is enough to ensure this bound with probability $C_2 = (1 - \delta) \times 1/2 = 3/8$.

The following lemma states that, when $\hat{\theta}_2$ is very small, the estimated quality $x_{i_2}\hat{\theta}_2$ of the minority group is likely to be small.

Lemma 26.
There exist constants $C_3, C_4$ that are independent of $N$ such that
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - C_3\exp\left(-C_4/b\right) \qquad (35)$$
holds.

Proof of Lemma 26.
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - \Pr\left[x_{i_2}(n) \ge \frac{1}{2b}\right] \ge 1 - \Phi^c\left(\frac{1}{\sigma_x}\left(\frac{1}{2b} - \mu_x\right)\right) \ge 1 - \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_x^2}\left(\frac{1}{2b} - \mu_x\right)^2\right) \quad \text{(by Lemma 15)},$$
where we have assumed $(\frac{1}{2b} - \mu_x)/\sigma_x \ge 1$ in the last transformation (which holds for sufficiently small $b$). Eq. (35) then holds for constants $C_3$ and $C_4$ that depend only on $\mu_x$ and $\sigma_x$.

Lemma 27.
$\Pr[\mathcal{Q}'(n) \mid \mathcal{Q}] \ge 1 - (1/2)^{K_1}$.

$\mathcal{Q}'(n)$ states that at least one majority candidate's estimated quality $x_i\hat{\theta}_1$ is not below its mean. Lemma 27 states that the probability that $\mathcal{Q}'(n)$ fails is exponentially small in the number of candidates. The proof of Lemma 27 directly follows from the symmetry of the normal distribution and the independence of the characteristics $x_i$.

Proof of Theorem 3.
By using Lemmas 24–27, we have
$$\Pr[\mathcal{P}] \ge C_1 b, \qquad (36)$$
$$\Pr[\mathcal{Q}] \ge C_2, \qquad (37)$$
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - C_3\exp(-C_4/b), \qquad (38)$$
$$\Pr[\mathcal{Q}'(n) \mid \mathcal{Q}] \ge 1 - (1/2)^{K_1}.$$
From these equations, the probability of perpetual underestimation is bounded as
$$\begin{aligned}
\Pr\left[\bigcap_n\{g(\iota(n)) = 1\}\right] &\ge \Pr\left[\bigcap_n\{\mathcal{P}'(n), \mathcal{Q}'(n)\},\ \mathcal{P}, \mathcal{Q}\right] \\
&\ge \Pr[\mathcal{P}]\Pr[\mathcal{Q}]\Pr\left[\bigcap_n\{\mathcal{P}'(n), \mathcal{Q}'(n)\}\ \middle|\ \mathcal{P}, \mathcal{Q}\right] && \text{(by the independence of } \mathcal{P} \text{ and } \mathcal{Q}\text{)} \\
&\ge C_1 b \times C_2 \times \left(1 - NC_3\exp(-C_4/b)\right) \times \left(1 - N\left(\tfrac{1}{2}\right)^{K_1}\right), && \text{(by the union bound)} \qquad (39)
\end{aligned}$$
which, by letting $b = O(1/\log(N))$ and $K_1 > \log_2(N)$, is $\tilde{\Omega}(1)$.

B.5 Proof of Theorem 4
Proof.
Let $\mathrm{reg}(n) = \mathrm{Reg}(n) - \mathrm{Reg}(n-1)$. Notice that under the UCB decision rule,
$$\iota(n) = \arg\max_{i \in I(n)}\left(x_i'\tilde{\theta}_i(n)\right). \qquad (40)$$
By Lemma 19, with probability at least $1 - \delta$, the true parameter of group $g$ lies in $\mathcal{C}_g$, and thus
$$x_i'\tilde{\theta}_i(n) \ge x_i'\theta_g \qquad (41)$$
for each $i \in I(n)$.

Let $i^* = i^*(n) := \arg\max_{i \in I(n)} x_i'\theta_{g(i)}$ be the first-best worker, and let $g^* = g(i^*)$ be the group $i^*$ belongs to. The regret in round $n$ is bounded as
$$\begin{aligned}
\mathrm{reg}(n) &= x_{i^*}'\theta_{g^*} - x_\iota'\theta_{g(\iota)} \\
&\le x_{i^*}'\tilde{\theta}_{i^*} - x_\iota'\theta_{g(\iota)} && \text{(by Eq. (41))} \\
&\le x_\iota'\tilde{\theta}_\iota - x_\iota'\theta_{g(\iota)} && \text{(by Eq. (40))} \\
&\le \|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\left\|\theta_{g(\iota)} - \tilde{\theta}_\iota\right\|_{\bar{V}_{g(\iota)}} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\beta_N. && \text{(by Eq. (17))} \qquad (42)
\end{aligned}$$
The total regret is bounded as
$$\begin{aligned}
\mathrm{Reg}(N) &= \sum_n\mathrm{reg}(n) \le \sqrt{N\sum_n\mathrm{reg}^2(n)} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\beta_N\sqrt{N\sum_n\|x_\iota\|^2_{\bar{V}_{g(\iota)}^{-1}(n)}} \\
&\le 2\beta_N\sqrt{N \cdot 2L^2\sum_{g \in G}\log\left(\frac{\det(\bar{V}_g(N))}{\det(\lambda I_d)}\right)} && \text{(by Lemma 20)} \qquad (43) \\
&\le \tilde{O}\left(\sqrt{N|G|}\right),
\end{aligned}$$
where we have used the fact that $\log(\det(\bar{V}_g)) = O(\log(N)) = \tilde{O}(1)$.

B.6 Proof of Theorem 5
Proof of the first statement
Since $s_i(n) = \tilde q_i(n) - \hat q_i(n)$, we have $\hat q_i(n) + s_i(n) = \tilde q_i(n)$: the firm's subsidized payoff from hiring worker $i$ is exactly the UCB index. Hence, the firm's incentive is aligned with the UCB index, and the myopic firm follows the UCB decision rule, which maximizes the UCB index.

Proof of the second statement
For notational simplicity, we drop $n$, $X_g$, and $Y_g$ in this proof. Define a correspondence $U$ by
\[ U(\tilde q_i; s) := \big\{ u_i \in \mathbb{R} \mid \exists i, \exists x_i \text{ s.t. } \hat q_i(x_i) + s_i(x_i) = u_i,\ \tilde q_i = \tilde q_i(x_i) \big\}. \]
$U(\tilde q_i; s)$ represents the set of the firm's possible payoffs from a worker whose UCB index is $\tilde q_i$. Clearly, the subsidy rule $s$ implements the UCB decision rule $\iota$ if and only if, for all $i$,
\[ \tilde q_i' > \tilde q_i \ \text{implies}\ \min U(\tilde q_i'; s) > \max U(\tilde q_i; s). \tag{44} \]
Since $\min U(\cdot\,; s)$ is an increasing function, it is continuous at all but countably many points. Equivalently, $U(\tilde q_i; s)$ is a singleton for almost all values of $\tilde q_i$.

Now, suppose that $U(\tilde q_i^*; s)$ is not a singleton for some $\tilde q_i^*$. Define $\Delta := \max U(\tilde q_i^*; s) - \min U(\tilde q_i^*; s)$. Define another subsidy rule $s'$ by setting
\[ s_i'(x_i) = \begin{cases} s_i(x_i) & \text{if } \tilde q_i(x_i) < \tilde q_i^*, \\ \min U(\tilde q_i^*; s) - \hat q_i(x_i) & \text{if } \tilde q_i(x_i) = \tilde q_i^*, \\ s_i(x_i) - \Delta & \text{otherwise,} \end{cases} \]
for all $i$. Then, we have
\[ U(\tilde q_i; s') = \begin{cases} U(\tilde q_i; s) & \text{if } \tilde q_i < \tilde q_i^*, \\ \{\min U(\tilde q_i^*; s)\} & \text{if } \tilde q_i = \tilde q_i^*, \\ U(\tilde q_i; s) - \Delta & \text{otherwise,} \end{cases} \]
which implies that $U(\cdot\,; s')$ also satisfies (44); equivalently, $s'$ also implements the UCB rule $\iota$. Furthermore, $s_i'(x_i) \le s_i(x_i)$ for all $x_i$, with a strict inequality for some $x_i$. Accordingly, $s'$ needs a smaller budget than $s$.

By the argument above, whenever $U(\cdot\,; s)$ is not a singleton for some $\tilde q_i$, the subsidy amount can be reduced by filling the gap. From now on, we consider the case in which $U(\cdot\,; s)$ is a singleton for every $\tilde q_i$; i.e., $U$ reduces to a function. We write $u(\tilde q_i; s)$ for the firm's payoff when it hires a worker whose UCB index is $\tilde q_i$ (previously written as $U$ because it could take multiple values). Then, we have $s_i(x_i) = u(\tilde q_i(x_i); s) - \hat q_i(x_i)$ for all $x_i$.
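The alignment above can be illustrated with a minimal numeric sketch (the index values below are hypothetical, not from the paper): under the UCB index subsidy rule $s_i = \tilde q_i - \hat q_i$, the firm's subsidized payoff coincides with the UCB index, so the myopic argmax reproduces the UCB decision rule.

```python
# Illustrative sketch (hypothetical index values): under the UCB index
# subsidy rule s_i = tilde_q_i - hat_q_i, the firm's subsidized payoff
# hat_q_i + s_i equals the UCB index tilde_q_i, so the myopic argmax
# coincides with the UCB decision rule.

hat_q = [0.75, 0.50, 0.25]    # estimated qualities (assumed values)
tilde_q = [1.00, 1.25, 0.50]  # UCB indices, tilde_q_i >= hat_q_i (assumed)

s = [tq - hq for tq, hq in zip(tilde_q, hat_q)]   # per-worker subsidies
payoff = [hq + si for hq, si in zip(hat_q, s)]    # firm's subsidized payoff

assert all(si >= 0 for si in s)   # feasibility: subsidies are non-negative
assert payoff == tilde_q          # payoff is exactly the UCB index

# without the subsidy, the myopic firm would pick worker 0; with it,
# the UCB-optimal worker 1 is hired
assert max(range(3), key=hat_q.__getitem__) == 0
assert max(range(3), key=payoff.__getitem__) == 1
```

The quarter-valued numbers are chosen only so that the arithmetic is exact; any indices with $\tilde q_i \ge \hat q_i$ behave the same way.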
Since we require that $s_i(x_i) \ge 0$ for all $x_i$, we need $u(\tilde q_i(x_i); s) - \hat q_i(x_i) \ge 0$. After some history, $\hat q_i$ may become arbitrarily close to $\tilde q_i$, and the inequality must hold for all such $\tilde q_i$ and $\hat q_i$. Accordingly, $u$ must satisfy
\[ u(q; s) \ge q \tag{45} \]
for all $q$. The UCB index subsidy rule satisfies (45) with equality for all $q$: it sets $s_i = \tilde q_i - \hat q_i$, so that $u(\tilde q_i; s) = \tilde q_i$ for all $\tilde q_i$. Accordingly, it needs the minimum possible budget.

Proof of the third statement
We bound the total amount of subsidy $\mathrm{Sub}(N) := \sum_n x_{\iota(n)}'(\tilde\theta_\iota - \hat\theta_{g(\iota)})$. By Eq. (15), the subsidy paid in round $n$ satisfies
\[ x_{\iota(n)}'(\tilde\theta_\iota - \hat\theta_{g(\iota)}) \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\,\beta_N, \]
which is the same as Eq. (42), and thus the same bound as the regret applies.

B.7 Proofs of Theorems 8 and 9
Proof of Theorem 8.
We adopt a "slot" notation for each group: group $g$ is allocated $K_g$ slots, and at each round $n$, one candidate arrives for each slot. We use the index $i \in [K]$ to denote each slot: although $x_i$ at two different rounds $n, n'$ (i.e., $x_i(n)$ and $x_i(n')$) represents different candidates, they are from the identical group $g = g(i)$. In summary, with a slight abuse of notation, we use the index $i$ to represent the $i$-th slot, and we refer to the candidate in slot $i$ as candidate $i$. Note that this does not change any part of the model; the slot notation is introduced only for the sake of the analysis.

Under the hybrid decision rule, the firm at each round hires the candidate with the largest index, namely $\iota(n) = \arg\max_{i\in I(n)} \tilde q^H_i(n)$, where $\tilde q^H_i$ is defined in Eq. (6). We also denote $\tilde\iota(n) = \arg\max_{i\in I(n)} x_i'\tilde\theta_i$; that is, $\tilde\iota$ indicates the candidate who would have been hired under the standard UCB decision rule (Eq. (5)).

The following decomposes the regret into the estimation errors of $\tilde\iota$ and $\iota$:
\[ \mathrm{reg}(n) = x_{i^*}'\theta_{g^*} - x_\iota'\theta_{g(\iota)} \le x_{i^*}'\tilde\theta_{i^*} - x_\iota'\theta_{g(\iota)} \quad\text{(by Eq. (16))} \]
\[ \le x_{\tilde\iota}'\tilde\theta_{\tilde\iota} - x_\iota'\theta_{g(\iota)} \quad\text{(by definition of $\tilde\iota$)} \]
\[ = x_{\tilde\iota}'\tilde\theta_{\tilde\iota} - x_\iota'\tilde\theta_\iota + x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}) \]
\[ \le x_{\tilde\iota}'(\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}) + x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}). \quad\text{(by definition of $\iota$)} \tag{46} \]
Here,
\[ x_{\tilde\iota}'(\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}) \le \|x_{\tilde\iota}\|_{\bar V_{g(\tilde\iota)}^{-1}} \big\|\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}\big\|_{\bar V_{g(\tilde\iota)}} \quad\text{(by the Cauchy–Schwarz inequality)} \]
\[ \le \|x_{\tilde\iota}\|_{\bar V_{g(\tilde\iota)}^{-1}} \beta_N \quad\text{(by Eq. (15))} \le \frac{\|x_{\tilde\iota}\|}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\,\beta_N \quad\text{(by the definition of eigenvalues)} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\,\beta_N. \quad\text{(by Eq. (13))} \tag{47} \]
Moreover, the estimation error of candidate $\iota$ is bounded as
\[ x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}) \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\big\|\tilde\theta_\iota - \theta_{g(\iota)}\big\|_{\bar V_{g(\iota)}} \quad\text{(by the Cauchy–Schwarz inequality)} \]
\[ \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\,\beta_N \quad\text{(by Eq. (17))} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_{g(\iota)})}}\,\beta_N. \quad\text{(by Eq. (13))} \tag{48} \]
Based on the above bounds, the regret is bounded as follows:
\[ \mathrm{Reg}(N) = \sum_{n=1}^N \mathrm{reg}(n) \le \sum_{n=1}^N \left(\frac{1}{\sqrt{\lambda_{\min}(\bar V_{g(\iota)})}} + \frac{1}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\right) L\beta_N \]
\[ \le L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} + L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}. \tag{49} \]
Eq. (49) consists of two components. The first component is the estimation error of the hired candidate $\iota$. The second component is the estimation error of $\tilde\iota$, who would have been hired had we adopted the UCB decision rule. The hybrid decision rule $\iota$ can differ from the UCB decision rule $\tilde\iota$, which is the main challenge in deriving a regret bound for the hybrid decision rule.

We first define the following events:
\[ \mathcal{V}_i(n) := \big\{ x_i(n)'(\tilde\theta_i(n) - \hat\theta_{g(i)}(n)) \le a\sigma_x \|\hat\theta(n)\| \big\}, \qquad \mathcal{W}_i(n) := \{\tilde\iota(n) = i\}, \]
\[ \mathcal{X}_i(n) := \{\iota(n) = i\}, \qquad \mathcal{X}'_i(n) := \Big\{ x_i(n)'\hat\theta(n) \ge \max_{j\ne i} \tilde q^H_j \Big\} \subseteq \mathcal{X}_i. \]
Event $\mathcal{V}_i$ states that candidate $i$ is not subsidized. Event $\mathcal{W}_i$ states that $i$ would have been hired under the UCB decision rule. Event $\mathcal{X}_i$ states that $i$ is hired, and $\mathcal{X}'_i$ states that $i$ is hired regardless of the subsidy.

The following lemma is the crux of bounding the components in Eq. (49).

Lemma 28 (Proportionality). The following two inequalities hold:
\[ \Pr[\mathcal{X}'_i] \ge \exp(-a^2/2)\,\Pr[\mathcal{W}_i], \tag{50} \]
\[ \Pr[\mathcal{X}'_i] \ge \exp(-a^2/2)\,\Pr[\mathcal{X}_i]. \tag{51} \]

Proof of Lemma 28.
We first prove that, for any $c \in \mathbb{R}$ and $d > 0$,
\[ \Pr\big[ x_i'\hat\theta_{g(i)} \ge c \big] \ge \exp(-d^2/2)\,\Pr\big[ x_i'\hat\theta_{g(i)} \ge c - d\big(\sigma_x\|\hat\theta_{g(i)}\|\big) \big]. \tag{52} \]
Let $x_\parallel := (x_i'\hat\theta_{g(i)})/\|\hat\theta_{g(i)}\|$ be the projection of $x_i$ onto the direction of $\hat\theta_{g(i)}$. Then, $x_i'\hat\theta_{g(i)} = x_\parallel\|\hat\theta_{g(i)}\|$. By the symmetry of the normal distribution, $x_\parallel\|\hat\theta_{g(i)}\|$ is drawn from a normal distribution with variance $(\sigma_x\|\hat\theta_{g(i)}\|)^2$, from which Eq. (52) follows.

Eq. (50) follows by letting $c = \max_{j\ne i}\tilde q^H_j$ and $d = a$, because
\[ \mathcal{W}_i \subseteq \big\{ x_i'\hat\theta_{g(i)} \ge c - d\big(\sigma_x\|\hat\theta_{g(i)}\|\big)\big\}, \qquad \mathcal{X}'_i \supseteq \big\{ x_i'\hat\theta_{g(i)} \ge c \big\}. \]
Eq. (51) also follows by letting $c = \max_{j\ne i}\tilde q^H_j$ and $d = a$, because
\[ \mathcal{X}_i \subseteq \big\{ x_i'\tilde\theta_i \ge c \big\}, \qquad \mathcal{X}'_i \supseteq \big\{ x_i'\tilde\theta_i \ge c + d\big(\sigma_x\|\hat\theta_{g(i)}\|\big)\big\}, \]
and exactly the same argument as for Eq. (52) applies to
\[ \Pr\big[ x_i'\tilde\theta_i \ge c + d\big(\sigma_x\|\hat\theta_{g(i)}\|\big) \big] \ge \exp(-d^2/2)\,\Pr\big[ x_i'\tilde\theta_i \ge c \big]. \tag{53} \]
(Note that $x_i'\hat\theta_{g(i)}$ in Eq. (52) is replaced by $x_i'\tilde\theta_i$ in Eq. (53), which does not change the derivation at all.)

Lemma 28 can be understood intuitively as follows. Assume that candidate $i$ would have been hired under the UCB rule. The candidate may not be hired under the hybrid rule, because the hybrid rule can cut the subsidy for that candidate. However, with constant probability, a slightly better ("$a\sigma_x$-good") candidate appears in slot $i$, and such a candidate is hired under the hybrid rule.

The following two lemmas, which utilize Lemma 28, bound the two terms of Eq. (49).

Lemma 29. $\displaystyle \mathbb{E}\Bigg[\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}\Bigg] \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1)$.

Lemma 30. $\displaystyle \mathbb{E}\Bigg[\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}\Bigg] \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1)$.

With Lemmas 29 and 30, the regret is bounded as
\[ \mathrm{Reg}(N) \le L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} + L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \quad\text{(by Eq. (49))} \]
\[ \le 2L\beta_N K\,\frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + \tilde O(1), \quad\text{(by Lemmas 29 and 30)} \tag{54} \]
which completes the proof of Theorem 8.

Proof of Lemma 29.
Let $N_i(n)$ be the number of rounds before $n$ in which the worker in slot $i$ is selected. Let $\tau_t$ be the first round at which $N_i(n)$ reaches $t$, and let $N_{i,t} = \sum_{n\le\tau_t} \mathbb{1}[\mathcal{X}'_i(n)]$. Lemma 28 implies $\mathbb{E}[N_{i,t}] \ge e^{-a^2/2} t$, and applying the Hoeffding inequality to the binary random variables $(\mathbb{1}[\mathcal{X}'_i(\tau_1)], \mathbb{1}[\mathcal{X}'_i(\tau_2)], \dots, \mathbb{1}[\mathcal{X}'_i(\tau_t)])$ yields
\[ \Pr\Big[ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big] \le \frac{1}{N^2}. \tag{55} \]
By using this, we have
\[ \Pr\Big[\bigcup_{t=1}^N \Big\{ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big\}\Big] \le \sum_t \Pr\Big[ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big] \quad\text{(by the union bound)} \]
\[ \le \sum_t \frac{1}{N^2} \quad\text{(by Eq. (55))} \le \frac{1}{N}. \]
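The deviation choice in Eq. (55) can be checked numerically: with $\varepsilon = \sqrt{t\log N}$, Hoeffding's bound $\exp(-2\varepsilon^2/t)$ equals $N^{-2}$ for every $t$, so the union bound over $t \le N$ costs at most $1/N$ in total. A small sketch:

```python
# Numeric check of the deviation choice behind Eq. (55): with
# epsilon = sqrt(t * log N), Hoeffding's tail bound exp(-2 * epsilon^2 / t)
# equals N^(-2) for every horizon t, so summing over t = 1..N costs at
# most 1/N in total.
import math

N = 1000
for t in (10, 100, 1000):
    eps = math.sqrt(t * math.log(N))
    bound = math.exp(-2 * eps ** 2 / t)   # Hoeffding tail probability bound
    assert abs(bound - N ** -2) < 1e-12   # = 1/N^2, independent of t

total = N * N ** -2                        # union bound over t = 1..N
assert abs(total - 1 / N) < 1e-12
```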
In the following, we focus on the case
\[ N_{i,t} \ge e^{-a^2/2}t - \sqrt{(\log N)\,t}, \tag{56} \]
which occurs with probability at least $1 - 1/N$.

Let $\bar V_i(n) := \sum_{n'\le n} \mathbb{1}[\iota(n')=i]\,x_i x_i' \preceq \bar V_{g(i)}(n)$. The context $x_i$ conditioned on event $\mathcal{X}'_i$ satisfies the assumptions of Lemma 18 with $\hat\theta = \tilde\theta_i$ and $\hat b = \max_{j\ne i}\tilde q^H_j$. We have
\[ \sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \le \sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $\bar V_{g(i)} \succeq \bar V_i$)} \]
\[ \le \sum_{n=1}^N\sum_{t=1}^N \frac{\mathbb{1}[\iota=i,\,N_i(n)=t]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $N_i(N)\le N$)} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}}. \quad\text{($\{\iota=i,\,N_i(n)=t\}$ occurs at most once)} \]
In other words, lower-bounding $\lambda_{\min}(\bar V_i(\tau_t))$ suffices for the regret bound, which we demonstrate in the following. We have
\[ \mathbb{E}\big[\lambda_{\min}(\bar V_i(\tau_t))\big] \ge \lambda_{\min}\Big(\sum_n \mathbb{E}\big[\mathbb{1}[\mathcal{X}'_i(n)]\,x_i x_i'\big]\Big) \ge \lambda_0 N_{i,t}. \quad\text{(by Lemma 18)} \]
By using the matrix Azuma inequality (Lemma 21), with probability at least $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 N_{i,t} - \sqrt{N_{i,t}\,\sigma_A^2\log(dN)}, \tag{57} \]
where $\sigma_A = 2L^2$. By using Eqs. (56) and (57), we have
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 e^{-a^2/2}t - O(\sqrt t), \]
and thus
\[ \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1). \]

Proof of Lemma 30.
Let $N^{\mathcal{W}}_i(n) = \sum_{n'\le n}\mathbb{1}[\mathcal{W}_i(n')]$, let $\tau_t$ be the first round at which $N^{\mathcal{W}}_i(n)$ reaches $t$, and let $N_{i,t} = \sum_{n\le\tau_t}\mathbb{1}[\mathcal{X}'_i(n)]$. The following discussion closely parallels the proof of Lemma 29; we include it for completeness. We have
\[ \Pr\Big[\bigcup_{t=1}^N \Big\{ N_{i,t} < e^{-a^2/2}t - \sqrt{(\log N)\,t} \Big\}\Big] \le \sum_t \Pr\Big[ N_{i,t} < e^{-a^2/2}t - \sqrt{(\log N)\,t}\Big] \quad\text{(by the union bound)} \]
\[ \le \sum_t \frac{1}{N^2} \quad\text{(by Lemma 28 and the Hoeffding inequality)} \le \frac{1}{N}. \]
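The linear eigenvalue growth used in Eqs. (57) and (59) can also be sanity-checked by simulation. The sketch below assumes an isotropic Gaussian context distribution purely for illustration (the selection-biased conditional distribution in the proof is what Lemma 18 actually handles):

```python
# Monte Carlo sketch of the design-matrix eigenvalue growth behind
# Eqs. (57) and (59): for isotropic Gaussian contexts (an illustrative
# assumption, not the paper's selection-biased distribution), E[x x'] = I,
# so lambda_min of V_t = sum_{s <= t} x_s x_s' grows linearly in t,
# up to a deviation on the sqrt(t) scale.
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 2000
X = rng.standard_normal((t, d))
V = X.T @ X                                   # empirical design matrix V_t
lam_min = float(np.linalg.eigvalsh(V).min())  # smallest eigenvalue

# linear-in-t growth with an O(sqrt(t)) fluctuation, as the matrix Azuma
# inequality guarantees with high probability
assert 0.8 * t < lam_min < 1.2 * t
```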
In the following, we focus on the case
\[ N_{i,t} \ge e^{-a^2/2}t - \sqrt{(\log N)\,t}, \tag{58} \]
which occurs with probability at least $1 - 1/N$. We have
\[ \sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \le \sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $\bar V_{g(i)}\succeq\bar V_i$)} \]
\[ \le \sum_{n=1}^N\sum_{t=1}^N \frac{\mathbb{1}[\tilde\iota=i,\,N^{\mathcal{W}}_i(n)=t]}{\sqrt{\lambda_{\min}(\bar V_i)}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}}. \quad\text{($\{\tilde\iota=i\}$ increments $N^{\mathcal{W}}_i$)} \]
The following lower-bounds $\lambda_{\min}(\bar V_i(\tau_t))$. We have
\[ \mathbb{E}\big[\lambda_{\min}(\bar V_i(\tau_t))\big] \ge \lambda_{\min}\Big(\sum_n \mathbb{E}\big[\mathbb{1}[\mathcal{X}'_i(n)]\,x_i x_i'\big]\Big) \ge \lambda_0 N_{i,t}. \quad\text{(by Lemma 18)} \]
By using the matrix Azuma inequality (Lemma 21), with probability at least $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 N_{i,t} - \sqrt{N_{i,t}\,\sigma_A^2\log(dN)}, \tag{59} \]
where $\sigma_A = 2L^2$. By using Eqs. (58) and (59), we have $\lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 e^{-a^2/2}t - O(\sqrt t)$, and thus
\[ \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1). \]

Proof of Theorem 9.
We here bound the amount of the subsidy. Eqs. (47) and (48) imply
\[ x_i'\big(\tilde\theta_i - \hat\theta_{g(i)}\big) \le \frac{L\beta_N}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}, \qquad \big| x_i'\big(\hat\theta_{g(i)} - \theta_{g(i)}\big) \big| \le \frac{L\beta_N}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}, \]
and thus the subsidy $s^{\text{H-I}}_i(n) = 0$ holds once
\[ \lambda_{\min}(\bar V_{g(i)}) \ge \Big(\frac{L\beta_N}{\|\theta\|}\Big)^2 \max\Big(1, \frac{1}{a^2\sigma_x^2}\Big) =: C_s = \tilde O(1). \tag{60} \]
Hence, it follows that
\[ \mathrm{Sub}(N) = \sum_n s^{\text{H-I}}_{\iota}(n) \le \sum_n\sum_i \mathbb{1}[\mathcal{X}_i]\,s^{\text{H-I}}_i(n) \le L\beta_N \sum_i\sum_n \frac{\mathbb{1}\big[\lambda_{\min}(\bar V_{g(i)}) \le C_s\big]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \quad\text{(by Eq. (60))} \]
\[ \le L\beta_N \sum_i\sum_t \frac{\mathbb{1}\big[\lambda_0 e^{-a^2/2}t - O(\sqrt t) \le C_s\big]}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \quad\text{(by the same argument as in Lemma 29)} \]
\[ \le L\beta_N K \sum_t \frac{\mathbb{1}\big[\lambda_0 e^{-a^2/2}t \le C_s\big]}{\sqrt{\lambda_0 e^{-a^2/2}t}} + \tilde O(1) \le L\beta_N K\,\frac{e^{a^2/2}}{\lambda_0}\sqrt{\frac{C_s e^{a^2/2}}{\lambda_0}} + \tilde O(1) = \tilde O(1). \tag{61} \]
Note that $C_s$ diverges as $a \to +0$; the bound of Theorem 9 is meaningful for $a > 0$. If $a = 0$, the hybrid mechanism reduces to the UCB mechanism, and thus Theorem 5 for UCB applies.

B.8 Proof of Theorem 10
We modify the proof of Theorem 3; accordingly, we use the same notation as that proof unless explicitly mentioned otherwise. We define
\[ \mathcal{Q}''(n) = \Big\{ \exists i_A, i_B \text{ s.t. } g(i_A) = g(i_B) = 1,\ i_A \ne i_B,\ \text{and } x_i'\hat\theta_1(n) \ge \mu_x'\theta_1 \text{ for } i = i_A, i_B \Big\}. \]
When the event $\mathcal{Q}''(n)$ occurs, there are two majority workers whose predicted skill $\hat q_i(n)$ is larger than its mean.

Lemma 31.
\[ \Pr[\mathcal{Q}''(n) \mid \mathcal{Q}] \ge 1 - (K+1)\Big(\frac{1}{2}\Big)^{K}. \tag{62} \]
The complement of event $\mathcal{Q}''(n)$ states that the second-order statistic of $\{\hat q_i\}_{i: g(i)=1}$ is below the mean; Lemma 31 states that this event is exponentially unlikely in $K$. By the symmetry of the normal distribution and the independence of each characteristic $x_i$, each candidate falls below the mean with probability $1/2$, and the proof of Lemma 31 directly follows by counting the combinations in which at most one of the workers is above the mean.

When we have $\mathcal{P}'(n)$ and $\mathcal{Q}''(n)$ for all $n$, then in every round $n$ the top-2 workers in terms of predicted quality $\hat q_i(n)$ are from the majority group. In this case, the minority worker is not hired regardless of the additional signal $\eta_i$. Accordingly, this is a sufficient condition for perpetual underestimation.

Proof of Theorem 10.
By using Lemmas 24, 25, 26, and 31, we have (36), (37), (38), and (62). From these inequalities, the probability of perpetual underestimation is bounded as
\[ \Pr\Big[\bigcap_n \{\iota(n)=1\}\Big] \ge \Pr\Big[\bigcap_n \{\mathcal{P}'(n), \mathcal{Q}''(n)\},\ \mathcal{P}, \mathcal{Q}\Big] \ge \Pr[\mathcal{P}]\,\Pr[\mathcal{Q}]\,\Pr\Big[\bigcap_n \{\mathcal{P}'(n), \mathcal{Q}''(n)\}\,\Big|\,\mathcal{P},\mathcal{Q}\Big] \quad\text{(by the independence of $\mathcal{P}$ and $\mathcal{Q}$)} \]
\[ \ge C_1 b \times C_2 \times \big(1 - NC_3\exp(-C_4/b^2)\big) \times \Big(1 - N(K+1)\big(\tfrac12\big)^{K}\Big), \quad\text{(by the union bound)} \]
which, by letting $b = O(1/\log(N))$ and $K \ge \log_2(K+1) + \log_2(N^2)$, is $\tilde\Omega(1)$.

B.9 Proof of Theorem 11
Proof of Theorem 11.
We have
\[ \big|x_i'(\hat\theta_g - \theta_g)\big| \le \|x_i\|_{\bar V_g^{-1}}\big\|\hat\theta_g - \theta_g\big\|_{\bar V_g} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_g)}}\,\beta_n \quad\text{(by Eqs. (13) and (14))} \le \frac{L}{\sqrt{\lambda}}\,\beta_N \quad\text{(by $\bar V_g \succeq \lambda I_d$)} =: C_5 = \tilde O(1). \tag{63} \]
Let $i_0$ and $i_1$ be the finalists chosen from groups 0 and 1, respectively. Define the following event:
\[ \mathcal{J}(n) = \{ \eta_{i_0}(n) - \eta_{i_1}(n) > C_5 \}. \]
Under $\mathcal{J}$, the finalist of group 0 is chosen, because Eq. (63) implies that $|x_{i_0}'\hat\theta_{g_0} - x_{i_1}'\hat\theta_{g_1}| \le C_5$, and thus $x_{i_0}'\hat\theta_{g_0} + \eta_{i_0} - x_{i_1}'\hat\theta_{g_1} - \eta_{i_1} > 0$. Note that $\eta_{i_0} - \eta_{i_1}$ is drawn from $\mathcal{N}(0, 2\sigma_\eta^2)$. Let $C_6 = \Phi^c\big(C_5/(\sqrt2\,\sigma_\eta)\big)$. Then,
\[ \Pr[\mathcal{J}(n)] = C_6. \tag{64} \]
Let $N_\mathcal{J} = \sum_{n'=1}^{n-1}\mathbb{1}[g(\iota)=0,\,\mathcal{J}] \le N_0(n)$ be the number of hires from group 0 under event $\mathcal{J}$. By using the Hoeffding inequality, with probability $1 - 1/N^2$ we have
\[ N_\mathcal{J} \ge nC_6 - \sqrt{n\log(N)}. \tag{65} \]
By taking the union bound, Eq. (65) holds for all $n$ with probability $1 - \sum_n 1/N^2 \ge 1 - 1/N$. From now on, we evaluate $\lambda_{\min}(\bar V_0(n))$. It is easy to see that
\[ \bar V_0 := \sum_{n'\le n:\, g(\iota(n'))=0} x_{\iota}x_{\iota}' + \lambda I \succeq \sum_{n'\le n:\, g(\iota(n'))=0} x_{\iota}x_{\iota}' \succeq \sum_{n'\le n:\, \mathcal{J}} x_{\iota}x_{\iota}'. \]
In the following, we lower-bound the quantity
\[ \lambda_{\min}\big(\mathbb{E}[x_i x_i' \mid \mathcal{J}]\big) \ge \min_{v:\, \|v\|=1} \mathrm{Var}[v'x_i \mid \mathcal{J}]. \]
Note that $i_0 = \arg\max_{i:\, g(i)=0} x_i'\hat\theta_0$ is biased toward the direction of $\hat\theta_0$, and we cannot use the diversity condition (Lemma 18). Let $v_\parallel$ and $v_\perp$ be the components of $v$ that are parallel and perpendicular to $\hat\theta_0$, so that $\|v_\parallel\|^2 + \|v_\perp\|^2 = 1$. It is easy to confirm that $\mathrm{Var}[v_\perp' x_i] = \|v_\perp\|^2\sigma_x^2$, because the selection $\arg\max_i x_i'\hat\theta_0$ does not yield any bias in the perpendicular direction. Regarding $v_\parallel$, Lemma 22 characterizes how much smaller than the original variance the parallel component can be due to the biased selection. Namely,
\[ \min_{v:\,\|v\|=1}\mathrm{Var}[v'x_i \mid \mathcal{J}] \ge \sigma_x^2\left(\frac{C_{\mathrm{varmax}}}{\log(K)}\|v_\parallel\|^2 + \|v_\perp\|^2\right) \ge \frac{\sigma_x^2 C_{\mathrm{varmax}}}{\log(K)}. \]
By using the matrix Azuma inequality (Lemma 21) with $\sigma_A = 2L^2$, for $t = \sqrt{N_\mathcal{J}\,\sigma_A^2\log(dN)}$, with probability $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_0) \ge \frac{\sigma_x^2 C_{\mathrm{varmax}}}{\log(K)}\,N_\mathcal{J} - t. \tag{66} \]
Combining Eqs. (65) and (66), with probability at least $1 - 2/N$, we have
\[ \lambda_{\min}(\bar V_0(n)) \ge \frac{\sigma_x^2 C_p}{\log(K)}\,n - \tilde O(\sqrt n), \tag{67} \]
where $C_p = C_6 C_{\mathrm{varmax}} = \tilde O(1)$. By symmetry, exactly the same bound as Eq. (67) holds for group 1. Finally, by using similar transformations to Eq. (24), the regret is bounded as
\[ \mathbb{E}[\mathrm{Reg}(N)] \le \sum_{n=1}^N \max_{i\in[K]} \big|x_i(n)'(\hat\theta_g - \theta_g)\big| \le \sum_{n=1}^N \frac{L}{\sqrt{\lambda_{\min}(\bar V_g)}}\,\beta_N \quad\text{(by Eqs. (13), (14))} \]
\[ \le L\beta_N \sum_{n=1}^N \sqrt{\frac{\log(K)}{\sigma_x^2 C_p\,n - \tilde O(\sqrt n)}} \quad\text{(by Eq. (67))} \le L\beta_N\sqrt{\frac{N\log(K)}{\sigma_x^2 C_p}} + \tilde O(1) = \tilde O(\sqrt N), \tag{68} \]
which concludes the proof.