On Statistical Discrimination as a Failure of Social Learning: A Multi-Armed Bandit Approach∗

Junpei Komiyama†  Shunya Noda‡

First Draft: October 2, 2020. Current Version: October 29, 2020
Abstract
We analyze statistical discrimination using a multi-armed bandit model where myopic firms face candidate workers arriving with heterogeneous observable characteristics. The association between the worker's skill and characteristics is unknown ex ante; thus, firms need to learn it. In such an environment, laissez-faire may result in a highly unfair and inefficient outcome—myopic firms are reluctant to hire minority workers because the lack of data about minority workers prevents accurate estimation of their performance. Consequently, minority groups could be perpetually underestimated—they are never hired, and therefore, data about them are never accumulated. We prove that this problem becomes more serious when the population ratio is imbalanced, as is the case in many extant discrimination problems. We consider two affirmative-action policies for solving this dilemma: one is a subsidy rule based on the popular upper confidence bound (UCB) algorithm, and the other is the Rooney Rule, which requires firms to interview at least one minority worker for each hiring opportunity. Our results indicate that temporary affirmative actions are effective against statistical discrimination caused by data insufficiency.
JEL Codes:
C44, D82, D83, J71
Keywords:
Statistical Discrimination, Affirmative Action, Multi-Armed Bandit, Social Learning, Strategic Experimentation

∗ We are grateful to Itai Ashlagi, Tomohiro Hara, Yoko Okuyama, Masayuki Yagasaki, and all the participants of Happy Hour Seminar! for helpful comments. All remaining errors are our own.
† Leonard N. Stern School of Business, New York University, 44 West 4th Street, New York, NY 10012, United States. E-mail: [email protected].
‡ Vancouver School of Economics, University of British Columbia, 6000 Iona Dr, Vancouver, BC V6T 1L4, Canada. E-mail: [email protected]. Noda has been supported by the Social Sciences and Humanities Research Council of Canada.

Introduction
Statistical discrimination refers to discrimination against minority groups practiced by fully rational and non-prejudiced agents. In contrast to taste-based discrimination (Becker, 1957), which regards agents' preferences (e.g., racism, sexism) as the primary source of discrimination, the model of statistical discrimination does not assume a preference for hating discriminated groups. Previous studies have shown that, even in the absence of prejudice, discrimination can occur persistently for various reasons, such as the discouragement of human capital investment (Arrow, 1973; Foster and Vohra, 1992; Coate and Loury, 1993; Moro and Norman, 2004), information friction (Phelps, 1972; Cornell and Welch, 1996), and search friction (Mailath, Samuelson, and Shaked, 2000). The literature has proposed a variety of affirmative-action policies to address statistical discrimination, and many of them are implemented in practice.

The contribution of this paper is to articulate a new channel of statistical discrimination—underestimation of minority workers that appears as a consequence of social learning. Most of the extant literature focuses on the behavior of rational agents in an equilibrium where agents have a correct belief about the relationship between observable characteristics and unobservable skills. However, several empirical studies have shown that real-world people often hold biased beliefs about minority groups. The aim of this study is to endogenize the evolution of such biased beliefs and analyze their consequences. In our model, (i) all firms (decision makers) are fully rational and non-prejudiced (i.e., they attempt to hire the most productive worker), and (ii) all workers are ex ante symmetric.
We show that, even in such an environment, a biased belief can be generated endogenously and persist in the long run. For illustration, we use the terminology of hiring markets (while our model is applicable to other settings as well). However, the true statistical relationship between characteristics and actual skills is not observed directly; thus, firms need to learn it from data about past hiring cases. Firms tend to have insufficient data about minority groups because (i) minority groups are literally a "minority" (in terms of the population), and (ii) they have been discriminated against and not hired in the past. The lack of data makes it difficult to assess the skills of minority workers. Hence, it tends to be safer and more profitable to hire a majority candidate, whose skill is accurately estimable.

According to Moro's (2009) definition, statistical discrimination is a theory of inequality between demographic groups based on stereotypes that do not arise from prejudice or racial and gender bias. Although some previous studies of statistical discrimination consider the consequences of exogenously endowed biased beliefs (e.g., Bohren, Haggag, Imas, and Pope, 2019a; Bohren, Imas, and Rosenberg, 2019b; Monachou and Ashlagi, 2019), we additionally require that agents are fully rational. De Paola, Scoppa, and Lombardo (2010) analyze an Italian local administration record in which a gender quota was introduced for a short period (1993-1995); the quota increased the representation of women politicians even after it was terminated. Battaglini, Harris, and Patacchini (2020) show that increased professional exposure to women judges promotes the hiring of other women judges, and hypothesize that this is due to a reinforced belief in women's professional capabilities. Bohren et al. (2019a) show that there is a widespread misconception about the mathematical competence of American people.
This situation persists because no firm is willing to "experiment" with hiring minorities. Consequently, minority workers may never be hired, and firms may miss many skillful workers from the minority group. We also show that some temporary affirmative-action policies can effectively prevent this form of discrimination.

We develop a multi-armed bandit model of social learning, in which many myopic and short-lived firms sequentially make hiring decisions. In each round, a firm faces multiple candidate workers. Each firm wants to hire only one person. Each firm's utility is determined by the hired worker's skill, which cannot be observed directly until employment. However, as in the standard statistical discrimination model, each worker also has an observable characteristic that is associated with the worker's hidden skill. In the beginning, no one knows the precise way to interpret the worker's observable characteristic for predicting that skill. Hence, firms first need to learn the relationship between the characteristic and the skill, and then apply the statistical model to evaluate the predicted skill of workers. We assume that firms submit all the information about their hiring cases to a public database; therefore, each firm can observe all the past hiring cases (the characteristics and skills of all the workers actually hired in the past).

Observable characteristics can be very informative in one's career. Wang, Zhang, Posse, and Bhasin (2013) show that the CV is very informative for predicting whether a software engineer switches to another senior position within three years.

Each worker belongs to a group that represents the worker's gender, race, and ethnicity. We assume that the characteristics of workers who belong to different groups should be interpreted differently. This assumption is realistic. First, previous studies have revealed that underrepresented groups receive unfairly low evaluations in many places. When the observable characteristic
is an evaluation provided by an outside rater, the characteristic information itself could be biased because of the prejudice of the rater. Second, evaluations may also reflect differences in culture, living environment, and social systems. For example, firms must be familiar with the custom of writing recommendation letters to interpret letters correctly. Hence, observable characteristics (curriculum vitae, exam scores, grading reports, recommendation letters, teaching evaluations, etc.) may carry very different implications even when their appearances are similar. If firms are impartial and aware of these biases, they should adjust the way they interpret the characteristics by applying different statistical models for different groups.

Firms are typically less knowledgeable about minority workers. In many cases, discriminated groups are literally "minorities," and therefore the number of candidate workers itself tends to be smaller. Furthermore, even when discriminated groups are demographically a majority, they might not have been hired in the past for historical reasons. Hence, compared with majority workers, the data about minority workers are often insufficient.

The lack of data results in inaccurate predictions of minority workers' skills, and the inaccuracy discourages firms from hiring minorities. Many workers apply for each job opening. To get hired, a worker must have the highest predicted skill. Once the minority group is underestimated, it is difficult for a minority worker to appear to be the best candidate—even if the true skill is the highest, the firm will not be convinced. Underestimation rarely happens once society acquires a sufficiently rich data set. However, in an early stage of the game, the minority group may be underestimated due to bad realizations of the unpredictable component.

The structure described above causes perpetual underestimation. Firms tend to hire majority workers because of the imbalance of data richness. However, as long as firms only hire majority workers, society cannot learn about the minority group; thus, the imbalance remains even in the long run. Here, the minority group is perpetually underestimated: the lack of data prevents hiring, and therefore, minority workers are never hired. We prove that perpetual underestimation may occur with a positive probability: laissez-faire results in the underprovision of a public good—the information about minority groups. By enforcing or incentivizing early movers (firms) to review minority groups, late movers can refer to a more useful data set of hiring cases, leading to an improvement of social welfare. Note that the policy intervention need not be persistent: once sufficiently rich data are collected, the government can terminate the affirmative action and return to laissez-faire.

We analyze the equilibrium consequence of laissez-faire and study desirable policy interventions. Multi-armed bandit models are useful for quantifying the value of information as the width of confidence bounds. We use a linear contextual bandit model to study whether a policy can lead society to achieve "no regret" in the long run.

For example, Trix and Psenka (2003) study letters of recommendation for medical faculty and find that letters written for female applicants differ systematically from those written for male applicants. Hanna and Linden (2012) suggest that students who belong to a lower caste (in India) tend to receive unfairly lower exam scores. Conversely, as for teaching evaluations, MacNell, Driscoll, and Hunt (2015) and Mitchell and Martin (2018) demonstrate that students rated a male identity significantly higher than a female identity. Hannák, Wagner, Garcia, Mislove, Strohmaier, and Wilson (2017) study online freelance marketplaces and find that gender and race are significantly correlated with worker evaluations. Precht (1998) and Al-Ali (2004) report cross-cultural differences in letters of recommendation (that do not originate from discrimination). Through a randomized experiment, Williams and Ceci (2015) demonstrate that in STEM tenure-track hiring, female applicants are favored over male applicants. This result is consistent with our assumption here: if the observed characteristics are systematically biased, an impartial employer would debias the data before interpreting it. This may lead to reverse discrimination.
The regret is one of the most popular criteria for evaluating the performance of algorithms in multi-armed bandit problems. The regret measures the welfare loss compared with the first-best decision rule (which firms would follow if they had perfect information about the statistical model). When the regret grows sublinearly in N, firms make fair and efficient decisions after a certain point in time.

In our theoretical analyses of laissez-faire, we first prove that it achieves no regret in the long run: when the groups are ex ante symmetric and the population ratio is equal, the expected regret of laissez-faire is shown to be Õ(√N), where Õ is a Landau notation that ignores a logarithmic factor. In contrast, when the population ratio is imbalanced (i.e., the number of majority workers is larger than the number of minority workers), this result no longer holds. In such a case, the expected regret is proven to be Ω̃(N), which implies that efficiency is not attained even in the long run.

This paper studies two policy interventions toward fair and efficient social learning. The first policy is a subsidy rule based on the idea of the upper confidence bound (UCB) algorithm. UCB is an effective solution for balancing exploration and exploitation in the (standard single-agent) multi-armed bandit problem (Lai and Robbins, 1985; Auer, Cesa-Bianchi, and Fischer, 2002). By incentivizing firms to take actions that are consistent with the recommendation of UCB, we can lead social learning to no regret in the long run. We achieve this by providing subsidies to firms when they hire a worker who belongs to an underexplored group. The subsidy is adjusted to the degree of information externality; thus, its total amount shrinks as time goes by. Formally, we show that the UCB mechanism has expected regret of Õ(√N).
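To fix ideas, the exploration bonus behind such a subsidy can be sketched as follows. This is a minimal illustration under our own assumptions: the bonus takes the standard linear-UCB confidence-width form, `alpha` is a tuning constant we introduce, and the paper's formal subsidy rule is the one defined in Section 5.

```python
import numpy as np

def confidence_width(x, V_bar):
    """Width of the confidence interval for x'theta_g under the design matrix
    V_bar = X'X + lam*I; wider when group g's data are scarce in direction x."""
    return float(np.sqrt(x @ np.linalg.solve(V_bar, x)))

def ucb_subsidy(x, V_bar, alpha=1.0):
    """Subsidy proportional to the confidence width: underexplored groups
    receive larger subsidies, which shrink as data about them accumulate."""
    return alpha * confidence_width(x, V_bar)

d = 3
x = np.ones(d)
V_scarce = np.eye(d)                          # almost no observations for this group
V_rich = np.eye(d) + 100.0 * np.outer(x, x)   # many observations along direction x
# ucb_subsidy(x, V_rich) < ucb_subsidy(x, V_scarce): more data, smaller subsidy
```

The design choice mirrors the text: the subsidy tracks the information externality, so it automatically vanishes as the minority group's data set grows.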
The subsidy amount required for implementing the UCB mechanism is also Õ(√N).

This paper further proposes a hybrid mechanism, which terminates affirmative actions once a sufficiently rich data set is collected and returns to laissez-faire. In our setting, once firms obtain a certain amount of data, the diversity of workers' characteristics naturally promotes learning about the minority group. Hence, even if we terminate the policy intervention earlier than a standard UCB algorithm would, society rarely falls into perpetual underestimation. We prove that our hybrid mechanism achieves Õ(√N) regret with Õ(1) subsidy in N rounds. Furthermore, in our simulation, the hybrid mechanism achieved smaller regret than the UCB mechanism.

The second policy is the Rooney Rule. The Rooney Rule is a "soft" affirmative action (in that no hiring quota is required), and it does not require monetary compensation. Instead, the Rooney Rule requires each firm to select at least one minority candidate as a finalist for each job opening. In the final selection, firms obtain additional signals beyond the observable characteristics shown in the application documents. The Rooney Rule thus leaves minority workers an opportunity to be hired. Even when a firm underestimates a minority worker's skill in the beginning (due to prediction inaccuracy), the worker may turn out to be the most attractive candidate once the interview is done. As long as minority workers have a chance to be hired, perpetual underestimation does not occur. However, our analysis also shows that the Rooney Rule may hinder the hiring of skilled majority candidates, and therefore should not be adopted as a permanent policy.

The remainder of this paper is organized as follows. Section 2 reviews the literature. Section 3 introduces the model. Section 4 studies the equilibrium consequence of laissez-faire. Section 5 develops the upper-confidence-bound subsidy rules, and Section 6 improves upon them. Section 7 analyzes the Rooney Rule.

The Rooney Rule was originally introduced in the National Football League; the original version of the rule required league teams to interview ethnic-minority candidates for head coaching and senior football operation jobs. The rule is named after Dan Rooney, the former chairman of the league's diversity committee (Eddo-Lodge, 2017).
A survey by Fang and Moro (2011) classifies the literature on statistical discrimination broadly into two strands. The first strand, originating from Arrow (1973), assumes that groups are ex ante identical and analyzes how statistical discrimination occurs as an asymmetric equilibrium (e.g., Foster and Vohra, 1992; Coate and Loury, 1993; Mailath et al., 2000; Moro and Norman, 2003, 2004; Gu and Norman, 2020). This strand interprets statistical discrimination as a random selection among multiple equilibria and does not explain why demographic minorities tend to be discriminated against. The second strand, originating from Phelps (1972), studies discrimination triggered by unexplained exogenous differences between groups, coupled with incomplete information about workers' skills (e.g., Aigner and Cain, 1977; Lundberg and Startz, 1983; Cornell and Welch, 1996). A difference in the signal distribution of workers' skills is one of the most popular assumptions in this strand. This paper unifies these two strands in that we endogenize the difference in the signal distribution. We consider otherwise ex ante identical individuals from different groups. Using a social learning model, we demonstrate how the difference in the prediction of skills is generated and persists. We find that when the population ratio of a group is small, the group tends to be statistically discriminated against. Hence, in contrast to most papers in the first strand, our results indicate that a minority group tends to suffer as an inevitable consequence under laissez-faire. More recently, several works (e.g., Bohren et al., 2019a, 2019b; Monachou and Ashlagi, 2019) demonstrate how misspecified beliefs about groups result in discrimination.
Thus far, this literature has attributed belief misspecification to psychological bias (e.g., Judd and Park, 1993; Hilton and Von Hippel, 1996) and bounded rationality (e.g., Fryer and Jackson, 2008; Schwartzstein, 2014; Bordalo, Coffman, Gennaioli, and Shleifer, 2019). In contrast, we develop a model of fully rational agents and show that a misspecified belief persists (i.e., a minority group is perpetually underestimated) even in the long run. Our result supports a fundamental assumption of the belief-based literature. Bardhi et al. (2020) show that a small difference in the prior belief about each worker's type (associated with the group the worker belongs to) can generate a significant difference in workers' payoffs. In contrast, the focus of this paper is on how society endogenously acquires a persistent misspecified belief about the minority group.

The population imbalance is not an "unexplained" difference.

Some previous studies consider a linear contextual bandit problem and study the performance of a "greedy" algorithm, which myopically makes decisions in accordance with the current information (Bastani, Bayati, and Khosravi, 2020; Kannan, Morgenstern, Roth, Waggoner, and Wu, 2018). As firms take greedy actions under laissez-faire, their results are also relevant to our model. They show that the greedy algorithm can lead to no regret in the long run if (i) the contexts (corresponding to workers' characteristics in our model) are diverse enough, and (ii) the decision maker acquires sufficiently many uniform samples in the beginning. While their results suggest that laissez-faire performs well, uniform sampling is not adaptive, and thus does not adequately quantify the value of information. Subsection 8.6 provides a detailed analysis of this point: we show that our hybrid mechanism performs better than uniform sampling followed by laissez-faire.

Theoretical analyses of the Rooney Rule are relatively scarce.
Kleinberg and Raghavan (2018) show that, when the recruiter has an unconscious bias against the discriminated group, the Rooney Rule can improve not only the representation of the discriminated group but also the quality of the recruiter's own selection.

As the decision maker is a long-lived employer, Bardhi et al. (2020) belongs to the literature on dynamic employer learning (Farber and Gibbons, 1996; Altonji and Pierret, 2001), not to the social learning literature. Bardhi et al. (2020) also describe the effect of population imbalance in an independent section. De Paola et al. (2010) empirically show that a soft gender quota imposed on an Italian local administration broke down negative stereotypes toward women even after it was terminated; this quota can be regarded as a version of the Rooney Rule.
Basic Setting
We develop a linear contextual bandit problem with myopic agents (firms). We consider a situation where N firms (indexed by n = 1, ..., N) sequentially hire one worker each. In each round n, a set of workers I(n) arrives. Each worker i ∈ I(n) takes no action, and firm n selects one worker ι(n) ∈ I(n). We denote the set of all workers by I := ∪_{n=1}^{N} I(n). Both firms and workers are short-lived. Once round n is finished, firm n's payoff is finalized, and all the workers not hired leave the market.

Each worker i belongs to a group g ∈ G. We assume that the population ratio is fixed: for every round n, the number of arriving workers who belong to group g is K_g ∈ ℕ, and K = Σ_{g ∈ G} K_g. Slightly abusing notation, we denote the group worker i belongs to by g(i). Each worker i ∈ I also has an observable characteristic x_i ∈ ℝ^d, where d ∈ ℕ is its dimension. Finally, each worker i also has a skill y_i ∈ ℝ, which is not observable until worker i is hired. The characteristics and skills are random variables.

Because each firm's payoff is equal to the hired worker's skill y_i (plus the subsidy assigned to worker i as an affirmative action, if any), firms want to predict the skill y_i based on the characteristics x_i. We assume that the characteristics and skills are associated in the following way:

    y_i = x_i′ θ_{g(i)} + ε_i,

where θ_g ∈ ℝ^d is a coefficient parameter, and ε_i ~ N(0, σ_ε²) i.i.d. is an unpredictable error term. We assume ||θ_g|| ≤ S for some S ∈ ℝ_+, where ||·|| is the standard L2-norm. Since ε_i is unpredictable,

    q_i := x_i′ θ_{g(i)}    (1)

is the best predictor of worker i's skill y_i.

The coefficient parameters (θ_g)_{g ∈ G} are unknown in the beginning. Hence, unless firms share information about past hiring cases, firms are unable to predict each worker's skill y_i. We assume that all firms share information about hiring cases.
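As a concrete sketch of this data-generating process (the group labels, group sizes K_g, coefficients θ_g, and noise scale below are our own illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 2                                    # dimension of characteristics x_i
K = {"majority": 10, "minority": 2}      # arrivals per round, K_g (illustrative)
theta = {"majority": np.array([0.6, -0.2]),   # group-specific coefficients theta_g
         "minority": np.array([-0.3, 0.8])}
sigma_eps = 0.5                          # std. dev. of the unpredictable term eps_i

def draw_round():
    """Draw one round's candidate pool: (group, x_i, y_i) for each worker i.
    Skill follows y_i = x_i' theta_{g(i)} + eps_i; y_i is hidden until hiring."""
    workers = []
    for g, K_g in K.items():
        for _ in range(K_g):
            x = rng.normal(size=d)
            y = x @ theta[g] + sigma_eps * rng.normal()
            workers.append((g, x, y))
    return workers

pool = draw_round()   # 12 candidates per round: 10 majority, 2 minority
```

The imbalance K_majority > K_minority is deliberate: it is the configuration under which the paper's negative results for laissez-faire arise.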
Accordingly, when firm n makes a decision, besides the current workers' characteristics and groups (x_i, g(i))_{i ∈ I(n)}, firm n can observe all the past candidate workers' characteristics and groups, (x_i, g(i)) for all i ∈ ∪_{n′=1}^{n−1} I(n′), the past firms' decisions (ι(n′))_{n′=1}^{n−1}, and the past hired workers' skills (y_{ι(n′)})_{n′=1}^{n−1}. We refer to all of the realizations of these variables as the history in round n, and denote it by h(n). Formally, h(n) is given by

    h(n) = ( (x_i, g(i))_{i ∈ I(n)}, (x_i, g(i))_{i ∈ ∪_{n′=1}^{n−1} I(n′)}, (ι(n′))_{n′=1}^{n−1}, (y_{ι(n′)})_{n′=1}^{n−1} ).

Note that h(n) does not include information about (i) the worker hired by firm n, and (ii) that worker's actual skill. This is because h(n) represents the information set firm n faces when it makes a hiring decision. We define the set of all histories in round n as H(n), and the set of all histories as H := ∪_{n=1}^{N} H(n). The firm's decision rule for hiring and the government's subsidy rule will be defined as functions that map a history to a hiring decision and a subsidy amount (described later). For notational convenience, we often drop h(n).

Prediction
We assume that firms are not Bayesian but frequentist. Hence, firms have no prior distribution, but they estimate the true parameter θ using the available data set. We assume that each firm predicts skills by ridge regression (also known as regularized least squares) to stabilize small-sample inference. Let N_g(n) be the number of rounds at which group-g workers are hired by round n. Let X_g(n) ∈ ℝ^{N_g(n)×d} be the matrix that lists the characteristics of group-g workers hired by round n: each row of X_g(n) corresponds to an element of {x_{ι(n′)} : g(ι(n′)) = g}_{n′=1}^{n−1}. Likewise, let Y_g(n) ∈ ℝ^{N_g(n)} be the vector that lists the skills of group-g workers hired by round n: each element of Y_g(n) corresponds to an element of {y_{ι(n′)} : g(ι(n′)) = g}_{n′=1}^{n−1}. We define V_g(n) := (X_g(n))′ X_g(n). For a parameter λ > 0, we define V̄_g(n) = V_g(n) + λ I_d, where I_d denotes the d × d identity matrix. Firm n estimates the parameter as follows:

    θ̂_g(n) := (V̄_g(n))^{−1} (X_g(n))′ Y_g(n).    (2)

Unlike ordinary least squares (OLS), for λ > 0 the inverse (V̄_g(n))^{−1} is always well-defined. Firm n predicts worker i's skill by (1), substituting θ_g with θ̂_g(n):

    q̂_i(n) := x_i′ θ̂_{g(i)}(n).

Note that both q̂_i(n) and θ̂_g(n) depend on the history h(n). We often drop h(n) for notational simplicity.

Mechanism
Besides the (predicted) skill of workers, firms also take into consideration the subsidies provided as affirmative actions. We assume that firms' preferences are risk-neutral and quasi-linear. Hence, if firm n hires worker i, firm n's payoff (von Neumann–Morgenstern utility) is given by y_i + s_i, where s_i ∈ ℝ_+ denotes the amount of the subsidy assigned to worker i. At the beginning of the game, the government commits to a subsidy rule s_i(n, ·): H → ℝ_+, which maps a history to a subsidy amount. Hence, once a history h(n) is specified, firm n can identify the subsidy assigned to each worker i ∈ I(n). Firm n attempts to maximize

    E[y_i + s_i(n; h(n)) | h(n)] = q̂_i(n; h(n)) + s_i(n; h(n)).

Firm n's decision rule ι(n, ·): H(n) → I(n) specifies the worker firm n hires given a history h(n). We say that a decision rule ι is implemented by a subsidy rule s_i if for all n and all h(n), we have

    ι(n; h(n)) = arg max_{i ∈ I(n)} { q̂_i(n; h(n)) + s_i(n; h(n)) }.    (3)

We call a pair of a decision rule and a subsidy rule a mechanism. Throughout this paper, any ties are broken in an arbitrary way. Again, we often drop h(n) from the input of the decision rule ι when it does not cause confusion.

Remark 1 (Observability of the Past Hiring Data). While we assume for simplicity that firms share the entire history of past hiring data, in practice each firm may have limited access to the database. Even if we assume such limited access, our analysis and results would not require qualitative changes. Rational firms estimate θ_g based on the available data and use it to predict workers' skills. The smaller the sample size of the available data, the more severe the data insufficiency for minority workers.

Social Welfare
We measure social welfare by the smallness of the regret, which is the standard measure for evaluating the performance of algorithms in multi-armed bandit models. The regret is defined as follows:
    Reg(N) := Σ_{n=1}^{N} ( max_{i ∈ I(n)} q_i − q_{ι(n)} ).

Since ε_i is unpredictable, it is natural to evaluate the performance of an algorithm (or the equilibrium consequence of a policy intervention) by checking the value of the predictors q_i. If the parameters (θ_g)_{g ∈ G} were known, each firm could easily calculate q_i for each worker i and choose ι(n) = arg max_{i ∈ I(n)} q_i. In this case, the regret would be zero. However, since (θ_g)_{g ∈ G} is unknown, it is too demanding to aim at zero regret. The goal of the policy design is to set up a mechanism that minimizes the expected regret E[Reg(N)], where the expectation is taken over the random draw of the workers. This aim is equivalent to maximizing the sum of the skills of the hired workers.

Following the literature, we mainly evaluate performance by the limiting behavior (order) of the expected regret. One useful benchmark is whether the expected regret is linear (i.e., E[Reg(N)] = Ω(N)) or sublinear (i.e., E[Reg(N)] = o(N)). As described above, once (θ_g)_{g ∈ G} is known, firms can use the best predictor q_i to evaluate workers; after that point, the regret does not increase. Although (θ_g)_{g ∈ G} is unknown ex ante, firms can learn it from the data. A linear regret means that society fails to learn the underlying parameters (θ_g)_{g ∈ G}, and therefore firms are hiring less-skilled workers even in the long run. In our model, perpetual underestimation is often a consequence of statistical discrimination—typically, minority workers are more likely to be underexplored, and therefore, they are unfairly rejected.

Budget
Some of the policies we study incentivize exploration by subsidization. The total budget required by a subsidy rule is also an important policy concern. The total amount of the subsidy is given by

    Sub(N) := Σ_{n=1}^{N} s_{ι(n)}(n).

All the mechanisms proposed in this paper are deterministic; hence, there is no algorithmic randomness. In the literature on the multi-armed bandit problem, sublinear regret is also referred to as no regret, since the regret per round approaches zero as N → ∞.

Algorithm 1 Initial Sampling Phase
    {g_n}_{n=1}^{N(0)} is allocated such that Σ_{n=1}^{N(0)} 1[g_n = g] = N(0)_g.
    for n = 1, ..., N(0) do
        Hire ι(n; h(n)) = min{ i ∈ I(n) : g(i) = g_n }.    ▷ Firm n blindly hires a group-g_n candidate.
    end for

Initial Sampling Phase
For analytical tractability, we assume that for the first N(0) rounds, each firm n is forced to hire from a pre-specified group g_n. We refer to the first N(0) rounds as the initial sampling phase (Algorithm 1). Namely, for all n = 1, ..., N(0), firm n hires the group-g_n candidate who has the smallest agent number:

    ι(n; h(n)) = min{ i ∈ I(n) : g(i) = g_n }.    (4)

Choosing the agent with the smallest number is just a random choice: whenever agents belong to the same group, their characteristics and skill distributions are the same. Accordingly, (4) is equivalent to choosing a group-g_n worker blindly (i.e., uniformly at random, without looking at workers' predicted skills). We define N(0)_g := Σ_{n=1}^{N(0)} 1[g_n = g] as the data size of the initial sampling for group g. The initial sampling phase is exogenous and not regarded as part of the mechanism. Hence, we ignore the incentives and payoffs of firms hiring in the initial sampling phase.

This section analyzes the equilibrium under laissez-faire, that is, the consequence of social learning when policy intervention is absent. Subsection 4.1 introduces a basic fact: laissez-faire has linear regret in a general domain. However, a general domain is not suitable for the analysis of statistical discrimination. Hence, in Subsection 4.2, we define a symmetric and diverse environment, with which we can discuss how statistical discrimination grows. In Subsection 4.3, we formally define perpetual underestimation and discuss its implications. Subsection 4.4 describes the case where
(i) both groups have sufficient variation, and (ii) the population ratio is balanced. In this case, the underestimation of minority groups is spontaneously resolved, and therefore, laissez-faire performs well. However, as shown in Subsection 4.5, when the population ratio is imbalanced, laissez-faire tends to result in perpetual underestimation, and therefore, performs poorly.

Algorithm 2 Laissez-Faire
    Complete the initial sampling phase by running Algorithm 1.
    for n = N(0) + 1, ..., N do    ▷ Laissez-faire starts.
        Offer s_i(n) = 0 for all i ∈ I(n).    ▷ No subsidy is provided.
        Firm n hires ι(n) = arg max_i x_i′ θ̂_{g(i)}(n) as an equilibrium consequence.
    end for

We first define laissez-faire.
Definition 1 (Laissez-Faire). The laissez-faire decision rule always selects the worker who has the highest predicted skill, i.e., $\iota(n) = \arg\max_{i \in I(n)} \hat q_i(n)$.

Clearly, the laissez-faire decision rule is implemented by the laissez-faire subsidy rule, which provides no subsidy ($s_i = 0$) after any history (Algorithm 2).

Laissez-faire makes no intervention, and therefore, each firm hires the worker whose expected skill, predicted from the current data set, is the highest. In the multi-armed bandit literature, the laissez-faire decision rule is referred to as the greedy algorithm. The greedy algorithm often results in a catastrophic outcome due to insufficient exploration. Since information is a public good, its supply is inefficiently low if the government makes no policy intervention. This well-known result applies to our environment if no structure is assumed. We state this basic result as a benchmark.

Theorem 1 (Failure of Laissez-Faire in General Domain). Let
$\mathrm{Reg}^{\mathrm{LF}}$ be the regret under the laissez-faire decision rule. There exists an instance with which $\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] = \Omega(N)$.

Proof. See Appendix B.2.

The analysis in Appendix B.2 is essentially the same as the analysis of the greedy algorithm in the standard $K$-armed bandit problem, whose regret is well known to be $\Omega(N)$. We show Theorem 1 by constructing an instance explicitly. By assuming that the distribution of the characteristics $(x_i)$ is degenerate, our linear contextual bandit problem reduces to a basic $K$-armed bandit problem, where the expected skill (reward) of each group (arm) is fixed. We assume that one group is more productive than the other, and therefore, the first-best decision rule would always hire from the better group. With a constant probability, firms happen to underestimate the more productive group in the beginning. When the less productive group constantly performs better than the underestimated predicted skill of the better group, firms never want to investigate the better group further. Consequently, with a significant probability, a worker from the better group is never hired again, implying linear expected regret. Once an underestimation of the minority group occurs, it tends to persist: when the "context" of the majority group is fixed, there is a constant probability that the minority group is never chosen throughout all the rounds.

It would be too naive, however, to conclude from Theorem 1 alone that the laissez-faire decision rule causes statistical discrimination. First, the instance constructed in the proof of Theorem 1 assumes an unexplained exogenous difference (in expected skills) between groups, while our aim is to endogenize the difference. Second, we assumed that one group has a higher expected skill than the other. Under this assumption, it is efficient to always hire a worker from one group, so when social learning is successful, workers from the inferior group are never hired.
Third, we reduced the contextual bandit model to a $K$-armed bandit model by assuming that the distribution of characteristics is degenerate. (The example we consider in the proof fixes the context of each worker, so the problem boils down to the standard $K$-armed bandit problem without context.) However, in the real world, candidate workers have diverse characteristics, even when they belong to the same group.

To provide a better analysis of the laissez-faire decision rule, we make the following three assumptions. First, we focus on the case of two groups.

Assumption 1 (Two Groups). The population consists of two groups, $G = \{1, 2\}$.

We refer to group 1 as the majority (dominant) group and group 2 as the minority (discriminated) group. The two-group assumption helps us to elucidate how the minority group is discriminated against by the majority group.

Second, we assume that groups are symmetric.

Assumption 2 (Symmetric Groups). The characteristics of all groups are identically distributed, and the coefficient parameters are the same across all groups. Namely, there exists a probability distribution $F$ such that $x_i \sim F$ for all $i \in I$, and there exists $\theta \in \mathbb{R}^d$ such that $\theta_g = \theta$ for all $g \in G$.

Note that although we assume that groups are symmetric, firms do not see them as symmetric, and therefore, apply different statistical models to different groups. In other words, even though the true coefficients are identical ($\theta_g = \theta_{g'}$ for all $g, g' \in G$), firms estimate them separately; thus, the values of the estimated coefficients are typically different ($\hat\theta_g(n) \neq \hat\theta_{g'}(n)$ for $g \neq g'$).

Although Assumption 2 is unrealistic (as it is evident that the characteristics should be interpreted differently), it is useful for elucidating how laissez-faire nourishes statistical discrimination. Under Assumption 2, there is no ex ante difference between groups (as assumed in Arrow, 1973; Foster and Vohra, 1992; Coate and Loury, 1993; Moro and Norman, 2004, etc.).
Hence, all the differences we observe in the equilibrium consequence are purely due to the properties of the equilibrium learning process.

Under Assumption 2, statistical discrimination implies inefficiency: although the best candidate belongs to the minority group with substantial probability ($K_2/K$), that candidate is not hired due to underexploration. Hence, when the groups are symmetric, the resolution of statistical discrimination makes the hiring process not only fair but also efficient. By contrast, when there is exogenous asymmetry between groups, fairness and efficiency are often conflicting. For example, demographic parity is one of the most popular fairness notions studied in the machine learning (or supervised learning) literature. In our model, demographic parity requires that the probability of hiring from the minority group be equal to the population ratio, i.e., $K_2/K$. Clearly, when the groups are asymmetric, the "first-best decision rule" does not satisfy this condition: it hires more from the "more productive group," while it is arguable whether such a decision rule is socially desirable. As long as we assume group symmetry, our argument avoids this controversy: the first-best decision rule is fair and efficient. Thus, we should attempt to approximate it.

Third, we assume that characteristics are normally distributed, and therefore, the distribution is non-degenerate. This assumption captures the diversity of workers, which is the nature of the real-world labor market.

Assumption 3 (Normally Distributed Characteristics). For every candidate $i$, $x_i \sim \mathcal{N}(\mu_{x,g(i)}, \sigma_{x,g(i)}^2 I_d)$, where $\mu_{x,g} \in \mathbb{R}^d$ and $\sigma_{x,g} \in \mathbb{R}_{++}$ for every $g \in G$. We also denote $x_i = \mu_{x,g(i)} + e_{x,i}$ to highlight the noise term $e_{x,i}$.

Note that when both Assumptions 2 and 3 hold, there exist $\mu_x, \sigma_x$ such that $\mu_{x,g} = \mu_x$ and $\sigma_{x,g} = \sigma_x$ for all $g \in G$. Hence, $x_i \sim \mathcal{N}(\mu_x, \sigma_x^2 I_d)$ for all $i \in I$.
To determine whether social learning incurs linear expected regret or not, it is useful to checkwhether it results in perpetual underestimation with a significant probability.
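As a toy illustration (our own sketch, with hypothetical parameters), the following simulation runs the greedy rule on a degenerate-context instance in the spirit of Theorem 1's proof: two groups with fixed contexts, where group 0 is truly better. In a non-vanishing fraction of runs, a bad initial draw for group 0 keeps it from (almost) ever being rehired:

```python
import random

def greedy_two_groups(n_rounds, mu=(1.0, 0.0), sigma=1.0, seed=0):
    """Greedy play of a two-armed Gaussian bandit (degenerate contexts).
    After one initial sample per group, the group with the higher sample
    mean is always hired. Returns how often each group was hired."""
    rng = random.Random(seed)
    counts, sums = [0, 0], [0.0, 0.0]
    for g in (0, 1):  # initial sampling phase: one draw per group
        counts[g] += 1
        sums[g] += rng.gauss(mu[g], sigma)
    for _ in range(n_rounds - 2):
        g = 0 if sums[0] / counts[0] >= sums[1] / counts[1] else 1
        counts[g] += 1
        sums[g] += rng.gauss(mu[g], sigma)
    return counts

# Fraction of runs in which the truly better group 0 is (almost) abandoned:
p_abandon = sum(greedy_two_groups(2000, seed=s)[0] < 20
                for s in range(300)) / 300
```

Because `p_abandon` stays bounded away from zero as the horizon grows, the per-run regret of such runs is linear in the number of rounds, which is the constant-probability event behind the $\Omega(N)$ bound.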
Definition 2 (Perpetual Underestimation). A group $g$ is perpetually underestimated if, for all $n > N^{(0)}$, we have $g(\iota(n)) \neq g$.

Namely, when group $g$ is perpetually underestimated, no worker from group $g$ is hired after the initial sampling phase.

If social learning results in perpetual underestimation with a significant probability, then it often incurs linear expected regret. In particular, under Assumption 2, perpetual underestimation of any group $g \in G$ implies that firms fail to hire at least $(K_g/K)(N - N^{(0)})$ of the best candidate workers, which is linear in $N$. Hence, if the probability of perpetual underestimation is constant (independent of $N$), then we have linear expected regret.

In our model, perpetual underestimation is also closely related to statistical discrimination. When perpetual underestimation occurs, a candidate who belongs to the underestimated group is not hired, even though the groups are symmetric. This outcome happens because society cannot accurately predict the skills of minority workers due to the lack of data. Hence, in our model, perpetual underestimation can be regarded as a form of statistical discrimination.

This subsection analyzes the case where only one candidate arrives in each round for each group. In this case, the variation of contexts implicitly urges the firms to explore all the groups with some frequency. Consequently, laissez-faire has sublinear regret, implying that statistical discrimination is eventually resolved, spontaneously.
Theorem 2 (Sublinear Regret with Balanced Population). Suppose Assumptions 1, 2, and 3. Suppose also that $K_g = 1$ for $g = 1, 2$. Then, the expected regret is bounded as
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] \le C_{\mathrm{bal}} \sqrt{N},$$
where $C_{\mathrm{bal}}$ is a $\tilde O(1)$ factor that depends on the model parameters. Here, $\tilde O(1)$ is Landau notation that ignores polylogarithmic factors. Letting $\mu_x = \|\mu_x\|$, the factor $C_{\mathrm{bal}}$ is inversely proportional to $\Phi^c(\mu_x/\sigma_x)$, which scales approximately as $\exp(-(\mu_x/\sigma_x)^2/2)$.

Proof.
See Appendix B.3. The explicit form of $C_{\mathrm{bal}}$ is found at the end of Appendix B.3. (Namely, there exist $N_0 \in \mathbb{N}$ and a function $f(N)$ that is a finite-order polynomial of $\log N$ such that $\mathbb{E}[\mathrm{Reg}^{\mathrm{LF}}(N)] \le f(N)\sqrt{N}$ for all $N \ge N_0$. In this and subsequent theorems, we often ignore polylogarithmic factors, that is, factors that are finite-order polynomials of the logarithm of $N$, because they grow very slowly as $N$ grows large. We remark on the important dependence on model parameters and refer to the equations with the explicit formulae of each factor.)

Intuitively, regret is incurred when one of the groups is underestimated, which happens with some constant probability. The ratio $\mu_x/\sigma_x$ represents the stability of characteristics: the larger this value is, the more stable the skills of candidates are. If $\mu_x/\sigma_x$ is small, there is some probability that the skill of the candidate from the currently favored group is predicted to be bad. In such a case, the candidate from the underestimated group might be chosen, which updates the belief about that group and resolves the underestimation. As expected from the theory of least squares, the standard deviation of $\hat\theta_g(n)$ is proportional to $(\bar V_g(n))^{-1/2}$, and we show that its diameter, $(\lambda_{\min}(\bar V_g(n)))^{-1/2}$, shrinks as $\tilde O(1/\sqrt{n})$. The regret per error is governed by this quantity, and the total regret is $\tilde O(\sum_{n \le N} 1/\sqrt{n}) = \tilde O(\sqrt{N})$.

Theorem 2 shows that statistical discrimination is resolved spontaneously when the candidate variation is large. At a glance, this appears to contradict the widely known results stating that laissez-faire may lead to suboptimal outcomes in bandit problems due to underexploration. Since selfish firms do not want to experiment with underrepresented groups at their own risk, laissez-faire may perpetually underestimate the skill of the minority group (as demonstrated in Theorem 1).
However, the variation in characteristics naturally incentivizes selfish agents to explore the underestimated group, and therefore, under some additional conditions, we can bound the probability of perpetual underestimation.

Theorem 2 shares some intuition with previous results (Kannan et al., 2018; Bastani et al., 2020), which have shown that variation in contexts (characteristics) improves the performance of the greedy algorithm (laissez-faire) in contextual multi-armed bandit problems. Kannan et al. (2018) assume that there is a sufficiently long initial sampling phase, in which society can collect uniformly sampled data until the model parameters are stabilized. Theorem 1 in Bastani et al. (2020) corresponds to Theorem 2 in our paper, and we further attribute the performance to the stability $\mu_x/\sigma_x$ rather than the diameter of the characteristics.

More importantly, in the next subsection, we prove that these positive results are "special cases": we will show that, even when there is variation in characteristics, if the population ratio is imbalanced, the laissez-faire decision rule may cause perpetual underestimation with a substantial probability.

4.5 Large Regret with Imbalanced Population

While Theorem 2 implies that statistical discrimination might be spontaneously resolved in the long run (if we admit that workers' characteristics are diverse enough), it crucially relies on one unrealistic assumption: the balanced population ratio. In many real-world problems, the population ratio is imbalanced. The dominant group is often the majority of the population, and the discriminated group is the minority. Even when the population is demographically balanced, if we look at a specific labor market, the population ratio could be imbalanced due to an imbalanced wealth distribution or the discouragement of human capital investments.

We indeed find that the population ratio between groups plays a crucial role in the welfare under laissez-faire.
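The mechanics can be previewed with a small Monte Carlo computation (our own illustration, not from the paper; the numbers are hypothetical). The best of $K_1$ majority candidates has a context realization roughly $\sqrt{2 \log K_1}$ standard deviations above the mean, so a lone minority candidate evaluated with an underestimated coefficient rarely wins the round:

```python
import random

def mean_max_of_k(k, trials=20000, seed=1):
    """Monte Carlo estimate of E[max of k i.i.d. standard normal draws]:
    the 'best of many majority candidates' effect grows like sqrt(2 log k)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(k))
    return total / trials

# With one candidate the best draw averages ~0; with many, it is far above.
e1, e16 = mean_max_of_k(1), mean_max_of_k(16)
e256 = mean_max_of_k(256, trials=4000)
```

As the pool of majority candidates grows, the bar that a single (underestimated) minority candidate must clear in every round keeps rising, which is why imbalance makes perpetual underestimation more likely.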
In the following theorem, we assume that, in each round, only one minority worker arrives (i.e., $K_2 = 1$), while many majority workers ($K_1 > 1$) arrive. When the population is imbalanced, perpetual underestimation becomes more likely, and therefore, society suffers a large expected regret.

Theorem 3 (Large Regret with Imbalanced Population). Suppose Assumptions 1, 2, and 3. Suppose also that $K_2 = 1$ and $d = 1$. Let $K_1 > \log N$. Then, under the laissez-faire decision rule, group 2 is perpetually underestimated with probability at least $C_{\mathrm{imb}} = \tilde\Theta(1)$. Accordingly, the expected regret of the laissez-faire decision rule is
$$\mathbb{E}\left[\mathrm{Reg}^{\mathrm{LF}}(N)\right] \ge \frac{C_{\mathrm{imb}}\,(N - N^{(0)})}{K} = \tilde\Omega(N).$$

Proof.
See Appendix B.4. The explicit form of $C_{\mathrm{imb}}$ is found in Eq. (39).

In the proof of Theorem 3, we evaluate the probability of the following two events occurring: (i) the coefficient parameter for the minority candidates, $\theta_2$, is underestimated; and (ii) the characteristics and skills of the hired majority workers are consistently good throughout the rounds. The probability of (i) is constant (independent of $N$), and the probability of (ii) is constant if $K_1 > \log N$. When both (i) and (ii) occur, minority workers are perpetually underestimated, and therefore, we have a large regret.

Theorem 3 indicates that we should not be too optimistic about the consequences of laissez-faire. The imbalance in the population ratio naturally favors the majority group by helping society collect a richer data set about them, leading to statistical discrimination. This insight applies to many real-world problems because an imbalanced population is commonplace.

Remark 2.
In the proof of Theorem 3, we explicitly bound the probability that each event happens. Hence, the effect of the initial sample size is revealed. The probability of underestimating the minority group is exponentially small in the number of initial samples for minorities, $N^{(0)}_2$, which implies that a small number of initial samples from the minority group can prevent the underestimation from being perpetuated. Note also that this is consistent with the prior results of Kannan et al. (2018), which state that a sufficiently large initial sample prevents perpetual underestimation because it alleviates the underestimation of $\hat\theta_2$. In Subsection 8.6, we demonstrate that this solution is not desirable because uniform sampling is costly and difficult to implement. According to our simulations, the UCB-based subsidy rule (the hybrid mechanism, proposed in Section 6) outperforms uniform sampling followed by laissez-faire.

Remark 3.
In our framework, statistical discrimination is purely attributable to a failure of social learning and the resultant misinformation. Hence, even when perpetual underestimation occurs, the true skills of minority workers (the distribution of $y_i$) are not lowered. However, if we additionally incorporate the choice of education level and human capital investments (as in Foster and Vohra, 1992; Coate and Loury, 1993), the misinformation naturally discourages minority workers from improving their skills. Therefore, if education is endogenous, the welfare loss and inequality caused by social learning would be even more serious.

Section 4 has discussed the equilibrium consequence under laissez-faire. We observed that, when the population ratio is imbalanced (as in the real-world job market), there is a substantial probability that the underestimation is perpetuated. This result indicates that a policy intervention (affirmative action) is effective for improving the social welfare and fairness of the hiring market.

This section proposes a subsidy rule to resolve such perpetual underestimation. We use the idea of the upper confidence bound (UCB) algorithm, which is widely used in the multi-armed bandit literature. (See Lemma 24 in the Appendix for the full detail.) The UCB algorithm balances exploration and exploitation by allocating handicaps to less explored arms (groups), whose rewards (skills) cannot be predicted accurately. To achieve this balance, the UCB algorithm develops a confidence interval for the true reward and evaluates each arm's performance by its upper confidence bound. Although firms are not willing to follow the UCB's recommendation under laissez-faire, the government can provide a subsidy to promote a candidate worker who has the highest UCB. In this section, we establish a UCB-based subsidy rule and evaluate its performance.
To establish the UCB-based subsidy rule, we first define the hiring decision suggested by the UCB algorithm. After that, we construct a subsidy rule that incentivizes firms to hire workers based on the UCB. A challenge is that the adaptive selection of candidates based on history can induce bias, and the standard confidence bound no longer applies to our case. To overcome this issue, we use martingale inequalities (Peña, Lai, and Shao, 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori, Pál, and Szepesvári, 2011). We here introduce the confidence interval for the true coefficient parameters, $(\theta_g)_{g \in G}$.

Definition 3 (Confidence Interval, Abbasi-Yadkori et al., 2011). Given group $g$'s collected data matrix $\bar V_g(n)$, the confidence interval of group $g$'s coefficient parameter $\theta_g$ is given by
$$\mathcal{C}_g(n) = \left\{ \bar\theta_g \in \mathbb{R}^d : \left\| \bar\theta_g - \hat\theta_g(n) \right\|_{\bar V_g(n)} \le \sigma_\epsilon \sqrt{d \log\left( \frac{\det(\bar V_g(n))^{1/2} \det(\lambda I_d)^{-1/2}}{\delta} \right)} + \lambda^{1/2} S \right\},$$
where $\|v\|_A = \sqrt{v' A v}$ for a $d$-dimensional vector $v$ and a $d \times d$ matrix $A$.

The standard confidence interval, $\mathcal{C}_g(n)$, shrinks as firm $n$ has a richer set of data about group $g$. Abbasi-Yadkori et al. (2011) study the properties of this confidence interval, and they prove that the true parameter $\theta_g$ lies in $\mathcal{C}_g(n)$ with probability $1 - \delta$ (Lemma 19). If we choose a sufficiently small $\delta$, it is "safe" to assess that worker $i$'s skill is at most
$$\tilde q_i(n) := \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}.$$
We call $\tilde q_i(n)$ the upper confidence bound index (UCB index) of worker $i$'s skill. Intuitively, $\tilde q_i(n)$ is worker $i$'s skill in the most optimistic scenario. (The idea of UCB goes back to at least the 1980s. The seminal paper by Lai and Robbins (1985) analyzed a version of UCB. More recently, Auer et al. (2002) introduced UCB1, which is widely known in the machine learning literature.)
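The maximization defining $\tilde q_i(n)$ has a closed form: over the ellipsoid $\{\bar\theta : \|\bar\theta - \hat\theta_g(n)\|_{\bar V_g(n)} \le \beta\}$, the maximum of $x_i'\bar\theta$ equals $x_i'\hat\theta_g(n) + \beta \|x_i\|_{\bar V_g(n)^{-1}}$. A minimal sketch of our own (plain-Python linear algebra for small $d$; the scalar $\beta$ stands in for the right-hand side of Definition 3):

```python
import math

def ucb_index(x, theta_hat, V_bar_inv, beta):
    """UCB index q~ = x' theta_hat + beta * sqrt(x' V_bar^{-1} x),
    i.e., the maximum of x' theta over the confidence ellipsoid of
    radius beta around theta_hat (in the V_bar norm).
    x, theta_hat: length-d lists; V_bar_inv: d x d list of lists."""
    d = len(x)
    point_estimate = sum(x[i] * theta_hat[i] for i in range(d))
    width_sq = sum(x[i] * V_bar_inv[i][j] * x[j]
                   for i in range(d) for j in range(d))
    return point_estimate + beta * math.sqrt(width_sq)
```

With $\bar V_g(n) = I_2$, $\hat\theta = (0,0)$, $x = (3,4)$, and $\beta = 1$, the index is $0 + \|x\| = 5$; richer data (a larger $\bar V_g(n)$, hence a smaller inverse) shrinks the optimistic bonus toward the point estimate.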
The UCB decision rule makes a decision based on this UCB index.

Definition 4 (UCB Decision Rule). The UCB decision rule selects the worker who has the highest UCB index; i.e.,
$$\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n). \tag{5}$$

The UCB index $\tilde q_i(n)$ is close to the pointwise estimate $\hat q_i(n)$ when society has rich data about group $g(i)$, because $\mathcal{C}_{g(i)}(n)$ is small in such a case. However, when the information about group $g(i)$ is insufficient, $\tilde q_i(n)$ is much larger than $\hat q_i(n)$, because the firm is not sure about the true skill of worker $i$ and $\mathcal{C}_{g(i)}(n)$ is large. In this sense, the UCB decision rule offers affirmative action to underexplored groups. In contrast to the greedy algorithm (laissez-faire), the UCB algorithm appropriately balances exploration and exploitation, and therefore, it has sublinear expected regret in general environments.

The UCB decision rule recommends the exploration of majority candidates as well as minority candidates. The amount of the subsidy is proportional to the uncertainty about the candidate's characteristics, which is represented by the confidence interval $\mathcal{C}_g(n)$. The confidence interval $\mathcal{C}_g(n)$ is inversely proportional to $\bar V_g(n) = (X_g(n))' X_g(n) + \lambda I_d$. Hence, if the data $X_g(n)$ do not have large variation in a particular dimension of $x_i$, then the prediction along that dimension can be inaccurate. In such a case, the UCB decision rule recommends hiring a candidate who contributes to increasing the data variation in that dimension. For example, when a candidate has some skills that previous candidates do not have, that candidate's UCB index tends to be large.

As the UCB decision rule nicely balances exploration and exploitation, it has sublinear regret. (We typically choose $\delta = 1/N$ so that the confidence interval is asymptotically correct in the limit of $N \to \infty$.)
(The standard ordinary least squares estimator has a confidence bound of the form $\theta_g - \hat\theta_g(n) \sim \mathcal{N}(0, \sigma_\epsilon^2 V_g^{-1}(n))$, and thus $|\theta_g - \hat\theta_g(n)| \sim \sigma_\epsilon V_g^{-1/2}(n)$. The martingale confidence bound $\mathcal{C}_g(n)$ is larger than the OLS confidence bound by two factors because of the price of adaptivity, namely, (1) a $\sqrt{d}$ factor and (2) a $\sqrt{\log(\det(\bar V_g(n)))}$ factor. As discussed in Xu, Honda, and Sugiyama (2018), the first $\sqrt{d}$ factor unnecessarily overestimates the confidence bound in most cases.)

Theorem 4 (Sublinear Regret of UCB). Suppose Assumption 3. Let $\mathrm{Reg}^{\mathrm{UCB}}$ be the regret from the UCB decision rule. Let $\lambda \ge \max(1, L)$. Then, by choosing a sufficiently small $\delta$, the regret under the UCB decision rule is bounded as
$$\mathbb{E}[\mathrm{Reg}^{\mathrm{UCB}}(N)] \le C_{\mathrm{ucb}} \sqrt{N},$$
where $C_{\mathrm{ucb}}$ is a $\tilde O(1)$ factor with respect to $N$ that depends on the model parameters.

Proof.
See Appendix B.5. The explicit form of $C_{\mathrm{ucb}}$ is found in Eq. (43) therein.

Note that $\tilde O(\sqrt{N})$ regret is the optimal rate for these sequential optimization problems under partial feedback (Chu, Li, Reyzin, and Schapire, 2011). Hence, Theorem 4 states that the UCB decision rule effectively prevents perpetual underestimation and is asymptotically efficient. The analysis here does not depend on the size of the candidate pool $K$, and thus it is effective regardless of the population ratio.

Remark 4.
Although we have made several strong assumptions for the analysis of laissez-faire (e.g.,two groups, symmetry), Theorem 4 does not rely on them, and therefore, it is applicable to a verygeneral environment. The groups need not be symmetric. The normal characteristic assumption(Assumption 3) can be relaxed to a weaker condition that guarantees that the distributions arelight-tailed, or the characteristics can even be arbitrary as long as they are bounded with highprobability.
To implement the UCB decision rule, we need to satisfy the firms' obedience condition (3) along with the UCB decision rule (5). In this paper, we focus on two types of subsidy rules: one is the UCB index subsidy rule, and the other is the UCB cost-saving subsidy rule.

First, we formally define the UCB index subsidy rule. The UCB index subsidy rule induces firms to hire a candidate with the largest UCB index by aligning each firm's profit with the UCB index.

Algorithm 3
The UCB Index Subsidy Rule
Complete the initial sampling phase by running Algorithm 1.
for $n = N^{(0)} + 1, \dots, N$ do  ▷ The UCB index subsidy rule starts.
  for $i \in I(n)$ do
    Compute $\tilde q_i(n) = \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}$.  ▷ Obtain UCB indices.
  end for
  Offer $s_i = \tilde q_i(n) - \hat q_i(n)$ for all $i \in I(n)$.  ▷ Align firm $n$'s payoff with the UCB index.
  Firm $n$ hires $\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n)$ as an equilibrium consequence.
end for

Definition 5 (UCB Index Subsidy Rule). The
UCB index subsidy rule $s$ subsidizes worker $i$ who arrives in round $n$ by
$$s_i(n; h(n)) = \tilde q_i(n; h(n)) - \hat q_i(n; h(n)).$$

The formal algorithm is shown as Algorithm 3. The UCB index subsidy rule is named "index" because it belongs to the class of index policies (Gittins, 1979) in the terminology of the multi-armed bandit literature.
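The subsidy $s_i = \tilde q_i - \hat q_i$ is exactly the confidence width, which shrinks as data about group $g(i)$ accumulates. A back-of-the-envelope sketch of our own (assuming, as the paper's analysis suggests, a per-round width of order $1/\sqrt{n}$ with a hypothetical constant $c$) shows why the cumulative subsidy grows only like $\sqrt{N}$:

```python
import math

def total_index_subsidy(c, N):
    """Sum of per-round subsidies c / sqrt(n) over rounds n = 1..N.
    Since sum_{n<=N} 1/sqrt(n) is about 2*sqrt(N), the total subsidy
    grows like sqrt(N), even though it never hits exactly zero."""
    return sum(c / math.sqrt(n) for n in range(1, N + 1))
```

For example, `total_index_subsidy(1.0, 10000)` is close to $2\sqrt{10000} = 200$, while the average per-round subsidy, total divided by $N$, vanishes as $N$ grows.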
Definition 6 (Index Policy). A subsidy rule $s$ is an index policy if, for all $n$ and $i \in I(n)$, $s_i(n; \cdot)$ depends only on $X_{g(i)}(n)$, $Y_{g(i)}(n)$, and $x_i$.

To be more precise, our definition of an index rule is slightly weaker than the standard definition. The standard definition requires that the index of an arm depend only on the data generated by that arm. However, since we regard a set of arms as a group, it does not make sense to focus on the data generated by "an arm." Hence, we utilize all the data about group $g(i)$. Having said that, our definition requires that the subsidy for worker $i$ be independent of (i) the other agents' characteristics $x_j$ for any $j \in I(n) \setminus \{i\}$, and (ii) the data about other groups, $X_{g'}(n)$ for any $g' \neq g(i)$.

If a subsidy rule is an index policy, the government need not observe the characteristics of $I(n) \setminus \{i\}$ to determine the subsidy assigned to the employment of worker $i$. This is a practically desirable property: in many real-world problems, it is difficult for the government to observe the characteristics of candidate workers who are not hired.

The following theorem states the properties of the UCB index subsidy rule. Among all index subsidy rules that implement the UCB decision rule, the UCB index subsidy rule requires the minimum amount of the subsidy. Its expected amount is proven to be $\tilde O(\sqrt{N})$.

Theorem 5 (Sublinear Subsidy of the UCB Index Rule).
1. The UCB index subsidy rule implements the UCB decision rule.

2. The UCB index subsidy rule needs the minimum amount of subsidies among all subsidy rules that (i) implement the UCB decision rule, and (ii) are index policies. Formally, let $s^{\text{U-I}}$ be the UCB index subsidy rule and $s$ be an arbitrary subsidy rule that satisfies (i) and (ii). Then, for all $i$, $n$, and $h(n)$, we have $s^{\text{U-I}}_i(n; h(n)) \le s_i(n; h(n))$.
3. Under the same assumptions as Theorem 4, the amount of the subsidy required by the UCB index subsidy rule is bounded as
$$\mathbb{E}[\mathrm{Sub}^{\text{UCB-I}}(N)] \le C_{\mathrm{ucb}} \sqrt{N},$$
where $C_{\mathrm{ucb}}$ is the same $\tilde O(1)$ factor as in Theorem 4.

Proof.
See Appendix B.6.

The square-root subsidy implies that the government can eventually end the subsidy because $\mathrm{Sub}^{\text{UCB-I}}(N)/N \to 0$ as $N \to \infty$. Alternatively, Theorem 5 implies that society can terminate affirmative actions once a sufficiently rich data set about the minority groups is obtained.

If the mechanism does not have to be an index policy (i.e., the subsidy for worker $i \in I(n)$ may depend on $(x_j)_{j \in I(n)}$ of the other candidates), then we can save the budget without modifying the decision rule. To achieve this, we subsidize the minimum amount such that candidate $\iota$ is more profitable than the other candidates. Formally, the UCB cost-saving subsidy rule is defined as follows.

Algorithm 4 The UCB Cost-Saving Subsidy Rule
Complete the initial sampling phase by running Algorithm 1.
for $n = N^{(0)} + 1, \dots, N$ do  ▷ The UCB cost-saving subsidy rule starts.
  for $i \in I(n)$ do
    Compute $\hat q_i(n) = x_i' \hat\theta_{g(i)}(n)$.
    Compute $\tilde q_i(n) = \max_{\bar\theta_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i' \bar\theta_{g(i)}$.
  end for
  Compute $\iota(n) = \arg\max_{i \in I(n)} \tilde q_i(n)$.  ▷ $\iota(n)$ is the UCB winner.
  Offer $s_{\iota(n)}(n) = \max_{j \in I(n)} \hat q_j(n) - \hat q_{\iota(n)}(n)$.  ▷ Make $\iota(n)$ the most profitable candidate.
  Offer $s_j(n) = 0$ for all $j \in I(n) \setminus \{\iota(n)\}$.
  Firm $n$ hires $\iota(n)$ as an equilibrium consequence.
end for

Definition 7 (UCB Cost-Saving Subsidy Rule). For every round $n$, the UCB cost-saving subsidy rule chooses $s_i(n) = 0$ for every $i \in I(n) \setminus \{\iota(n)\}$, where $\iota(n)$ is the candidate worker selected by the UCB algorithm, (5). For $i = \iota(n)$, the subsidy $s_i$ is given by
$$s_i(n; h(n)) = \max_{j \in I(n)} \hat q_j(n; h(n)) - \hat q_i(n; h(n)).$$

The formal algorithm is shown as Algorithm 4.

The UCB cost-saving subsidy rule subsidizes only the targeted worker, $\iota(n)$. Hence, for the other workers $j \neq \iota(n)$, the payoff from the employment is $\hat q_j(n)$.
The UCB cost-saving subsidy rule sets the subsidy amount $s_{\iota(n)}$ in such a way that the payoff from hiring worker $\iota(n)$, which is $\hat q_{\iota(n)}(n) + s_{\iota(n)}$, is equal to (or slightly larger than) the payoff from hiring the worker who has the highest predicted skill, $\max_{j \in I(n)} \hat q_j(n)$.

Clearly, the UCB cost-saving subsidy rule is the subsidy rule that requires the minimum budget to implement the UCB decision rule. As fines (negative subsidies) are not allowed in our model, the government cannot further discourage the employment of the other candidate workers, $j \in I(n) \setminus \{\iota(n)\}$. Hence, the UCB cost-saving subsidy rule requires the smallest budget among all subsidy rules that implement the decision rule (5).

Combining this observation with Theorem 5, we obtain the following theorem.

Theorem 6 (Sublinear Subsidy of the UCB Cost-Saving Rule).
1. The UCB cost-saving subsidy rule implements the UCB decision rule.

2. The UCB cost-saving subsidy rule requires the minimum budget for implementing the UCB decision rule. Formally, let $s^{\text{U-CS}}$ be the UCB cost-saving subsidy rule and $s$ be an arbitrary subsidy rule that implements the UCB decision rule. Then, for all $i$, $n$, and $h(n)$, we have $s^{\text{U-CS}}_i(n; h(n)) \le s_i(n; h(n))$.
3. The amount of the subsidy required by the UCB cost-saving subsidy rule is bounded as
$$\mathbb{E}[\mathrm{Sub}^{\text{UCB-CS}}(N)] \le \mathbb{E}[\mathrm{Sub}^{\text{UCB-I}}(N)] \le C_{\mathrm{ucb}} \sqrt{N}.$$
Proof.
The first two statements straightforwardly follow from the argument above. The last statement follows from the first two and Theorem 5.

The cost-saving subsidy rule has some drawbacks. It depends on the characteristics of all the potential candidates. Hence, the government must have precise knowledge about the candidates who appeared in each round but were not hired by the firm. Still, as a theoretical benchmark, it is useful for studying the minimum subsidy amount incurred. In Subsection 8.4, we compare the index rule and the cost-saving rule numerically. Our simulation results indicate that the cost-saving rule outperforms the index rule by a large margin in terms of the total amount of subsidy.
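The difference between the two rules within a single round can be sketched as follows (our own illustration with made-up numbers): the index rule pays the hired worker the full confidence width $\tilde q - \hat q$, while the cost-saving rule pays only the gap to the best predicted skill, which is never larger, since $\max_j \hat q_j \le \max_j \tilde q_j = \tilde q_{\iota(n)}$.

```python
def round_subsidies(q_hat, q_ucb):
    """Per-round subsidies paid to the hired (UCB-winning) candidate.
    q_hat[i]: predicted skill; q_ucb[i]: UCB index (q_ucb[i] >= q_hat[i]).
    Index rule pays the winner q_ucb - q_hat; the cost-saving rule pays
    only max_j q_hat[j] - q_hat[winner]."""
    winner = max(range(len(q_ucb)), key=q_ucb.__getitem__)
    s_index = q_ucb[winner] - q_hat[winner]
    s_cost_saving = max(q_hat) - q_hat[winner]
    return winner, s_index, s_cost_saving
```

For instance, with predicted skills `[1.0, 0.4]` and UCB indices `[1.1, 1.3]` (the second candidate belongs to a data-poor group), candidate 1 wins; the index rule pays 0.9 while the cost-saving rule pays only 0.6.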
In the previous section, we showed that the UCB mechanism effectively prevents perpetual underestimation and achieves sublinear regret in general environments. However, the UCB mechanism has one drawback: it assigns subsidies forever. Although the confidence interval $\mathcal{C}_g(n)$ shrinks as $n$ grows large, it does not degenerate to a singleton for any finite $n$. Accordingly, even for a large $n$, there remains a gap between the expected skill $\hat q_i(n)$ and the UCB index $\tilde q_i(n)$ (though small in size). This feature is not desirable for the following reasons. First, introducing a permanent policy is often more politically difficult than introducing a temporary policy. If the government declares that the hiring of minority workers is permanently subsidized, the policy may look quite unfair to the majority group. The appearance of unfairness would cause significant opposition. Second, if we keep distributing subsidies over the long run, the required budget tends to grow. Third, besides the subsidy itself, the permanent allocation of the subsidy comes with (unmodeled) administration costs.

To overcome these limitations of the UCB mechanism, we propose the hybrid mechanism, which starts with the UCB mechanism and turns to laissez-faire by terminating the subsidy at some point. We terminate the UCB phase once the amount of data about the minority group is sufficient to induce spontaneous exploration. We prove that our hybrid mechanism has $\tilde O(\sqrt{N})$ regret (as the UCB mechanism does), and its expected total subsidy amount is $\tilde O(1)$ (as opposed to the $\tilde O(\sqrt{N})$ subsidy of UCB).

The construction of the hybrid mechanism is as follows. Let $s^{\text{U-I}}_i(n) = \tilde q_i(n) - \hat q_i(n)$ be the size of the confidence bound. Note that $s^{\text{U-I}}_i(n)$ corresponds to the amount of the subsidy allocated by the UCB index subsidy rule (Definition 5).
The hybrid index ˜q^H_i is defined as

˜q^H_i(n; h(n)) := ˜q_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and ˆq_i(n; h(n)) otherwise,   (6)

where a ≥ 0 is the mechanism's parameter.

The hybrid index is literally a "hybrid" of the predicted skill ˆq_i(n) and the UCB index ˜q_i(n). If the difference between the UCB index and the predicted skill is larger than the threshold (i.e., s^{U-I}_i(n) > aσ_x ||ˆθ_{g(i)}(n)||), the hybrid index is equal to the UCB index ˜q_i(n). The confidence bound |˜q_i(n) − ˆq_i(n)| is large while we have insufficient knowledge about group g(i); this is typically the case in an early stage of the game. Once this gap becomes smaller than the threshold (i.e., s^{U-I}_i(n) ≤ aσ_x ||ˆθ_{g(i)}(n)||), the hybrid index becomes equal to the predicted skill ˆq_i(n). Naturally, the hybrid decision rule is defined as the rule that hires the worker with the highest hybrid index.

Definition 8 (The Hybrid Decision Rule). The hybrid decision rule selects the worker who has the highest hybrid index; i.e., ι^H(n; h(n)) = arg max_{i ∈ I(n)} ˜q^H_i(n; h(n)).

As the hybrid decision rule is a hybrid of the UCB decision rule and the laissez-faire decision rule, it can be implemented by mixing the laissez-faire subsidy rule and either the UCB index subsidy rule or the UCB cost-saving subsidy rule.
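For concreteness, the switch embedded in the hybrid index can be sketched in a few lines of Python. This is an illustrative sketch with d = 1 (so that ||ˆθ_g|| = |ˆθ_g|); the function name and the symmetric-interval simplification s^{U-I}_i = |x_i| × width are our own assumptions, not the paper's code.

```python
def hybrid_subsidy(x, theta_hat, width, a=1.0, sigma_x=2.0):
    """Hybrid index subsidy for one worker (illustrative sketch, d = 1).

    x         : worker's observable characteristic x_i
    theta_hat : current estimate of the group coefficient (d = 1, so the
                norm ||theta_hat|| is just its absolute value)
    width     : confidence width on the coefficient, so the UCB gap is
                s^{U-I}_i = q~_i - q^_i = |x| * width (symmetric interval)
    a         : switching parameter of the hybrid mechanism (a >= 0)
    """
    s_ucb = abs(x) * width                     # gap between UCB index and predicted skill
    threshold = a * sigma_x * abs(theta_hat)   # switching threshold a * sigma_x * ||theta_hat||
    return s_ucb if s_ucb > threshold else 0.0 # UCB-phase subsidy, else laissez-faire
```

Early on, `width` is large, so the subsidy equals the full UCB gap; once the group's accumulated data shrink `width` below the threshold, the subsidy is zero and the rule coincides with laissez-faire.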
Definition 9 (The Hybrid Index Subsidy Rule). Let s^{U-I}_i be the UCB index subsidy rule. The hybrid index subsidy rule s^{H-I} is defined by

s^{H-I}_i(n; h(n)) := s^{U-I}_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and 0 otherwise.

Equivalently, the hybrid index subsidy rule can be written as s^{H-I}_i(n; h(n)) = ˜q^H_i(n; h(n)) − ˆq_i(n; h(n)).

Definition 10 (The Hybrid Cost-Saving Subsidy Rule). Let s^{U-CS}_i be the UCB cost-saving subsidy rule. The hybrid cost-saving subsidy rule s^{H-CS} is defined by

s^{H-CS}_i(n; h(n)) := s^{U-CS}_i(n; h(n)) if s^{U-I}_i(n; h(n)) > aσ_x ||ˆθ_{g(i)}(n; h(n))||, and 0 otherwise.

Theorem 7 (The Properties of the Hybrid Subsidy Rules).
1. The hybrid index subsidy rule s^{H-I} and the hybrid cost-saving subsidy rule s^{H-CS} implement the hybrid decision rule ι^H.
2. The hybrid index subsidy rule s^{H-I} requires the minimum subsidy among all index subsidy rules that implement ι^H.
3. The hybrid cost-saving subsidy rule s^{H-CS} requires the minimum subsidy among all subsidy rules that implement ι^H.

The proof of Theorem 7 is analogous to those of Theorems 5 and 6, and thus is omitted.

The hybrid index subsidy rule and its equilibrium consequence are stated as Algorithm 5. As it is straightforward to modify Algorithm 5 to construct a hybrid cost-saving subsidy rule, we omit the algorithm for the hybrid cost-saving subsidy rule here.

Algorithm 5: The Hybrid Index Subsidy Rule
  Complete the initial sampling phase by running Algorithm 1.
  for n = N^(0) + 1, ..., N do                ▷ The hybrid index subsidy rule starts.
    for each i ∈ I(n) do
      Compute ˆq_i(n) = x'_i ˆθ_{g(i)}(n).
      Compute ˜q_i(n) = max_{θ̄ ∈ C_{g(i)}(n)} x'_i θ̄.
      Compute s^{U-I}_i(n) = ˜q_i(n) − ˆq_i(n).
      Offer s_i(n) = 0 if s^{U-I}_i(n) ≤ aσ_x ||ˆθ_{g(i)}(n)||, and s_i(n) = s^{U-I}_i(n) otherwise.   ▷ The hybrid index subsidy.
    end for
    Firm n hires ι(n) = arg max_{i ∈ I(n)} { ˆq_i(n) + s_i(n) } as an equilibrium consequence.
  end for

The following two theorems characterize the regret and the amount of subsidy of the hybrid decision rule.
Theorem 8 (Regret Bound for the Hybrid Decision Rule). Suppose Assumptions 1, 2, and 3. Then, by choosing a sufficiently small δ, the regret under the hybrid decision rule ι^H is bounded as

E[Reg^H(N)] ≤ C_hyb √N,

where C_hyb is a factor that is Õ(1) with respect to N.

Theorem 9 (Subsidy Bound for the Hybrid Subsidy Rules). Suppose Assumptions 1, 2, and 3. By choosing a sufficiently small δ, for any a > 0, the total amounts of the subsidy under the hybrid index subsidy rule (Sub^{H-I}) and the hybrid cost-saving subsidy rule (Sub^{H-CS}) are bounded as

Sub^{H-CS}(N) ≤ Sub^{H-I}(N) ≤ C_hyb-sub,

where C_hyb-sub is a factor that is Õ(1) with respect to N.

Proof.
See Appendix B.7. The explicit form of C_hyb is found in Eq. (54), and the explicit form of C_hyb-sub in Eq. (61) therein.

Theorem 8 states that the order of the regret under the hybrid decision rule is Õ(√N), which is the same as under the original UCB decision rule. Theorem 9 states that the amount of the subsidy is polylogarithmic in N, which is a substantial improvement over the standard UCB mechanism, where an Õ(√N) subsidy is required.

The threshold for switching from the UCB mechanism to laissez-faire is crucial for guaranteeing the performance of the hybrid mechanism. Our threshold, aσ_x ||ˆθ(n)||, is determined in such a way that the hybrid decision rule ι^H satisfies proportionality, a new concept established in this paper. The formal statement appears in Lemma 28 in Appendix B.7, but it requires additional notation that does not appear in the main body of this paper. In what follows, we provide a high-level intuition for the concept of proportionality.

We evaluate the expected regret of the hybrid decision rule by comparing it with the expected regret of the UCB decision rule. However, since different decision rules generate different histories and data, neither decision rule dominates the other, which makes the comparison challenging. We overcome this problem by proving that the hybrid decision rule ι^H is proportional to ι^U in the sense that there exists a constant c > 0 such that, whenever the UCB rule ι^U hires worker i with probability p_i, the hybrid rule ι^H hires worker i with probability at least c·p_i given the same history. This property guarantees that the hybrid rule escapes from underexploring the minority group and secures an expected regret of Õ(√N).

The timing of switching to laissez-faire is crucial for proportionality. When the data about the minority group are insufficient, firms rarely hire minority workers under laissez-faire.
We prove that, when the threshold is set to aσ_x ||ˆθ_g(n)||, firms keep hiring minority workers with sufficiently high frequency, and therefore, statistical discrimination is eventually resolved.

Remark 5 (Dependence on Parameter a). There is a tradeoff between the regret and the subsidy. The constant in front of the regret (Theorem 8) is exp(a²/2), which is increasing in a. By contrast, the constant in front of the subsidy (Theorem 9) is exp(3a²/2)/a², which goes to infinity as a → 0. Theorem 9 guarantees that the subsidy is Õ(1) whenever a > 0. However, when a is small, the bound provided by Theorem 9 becomes large and may not be insightful. To balance the tradeoff, the government should select a "right-size" value for a. In our simulations (Section 8), we adopt a = 1. For small a, because the hybrid mechanism is close to the UCB mechanism, we can divert our analysis of the UCB mechanism (Theorem 5). When a is small and N is finite, the square-root subsidy bound established in Theorem 5 may provide a tighter characterization of the total subsidy.

Although the UCB-based subsidy rule is a powerful policy intervention to resolve statistical discrimination, the subsidy rule is sometimes difficult to implement in practice. This section articulates the advantages and disadvantages of the Rooney Rule, which requires each firm to invite at least one candidate of each group to an on-site interview. The Rooney Rule is relatively easy to implement because it requires neither a subsidy nor a hard hiring quota.

To incorporate the additional information the firms acquire through the interview, we make the following modification to the model. In the modified model, each round n consists of two stages. In the first stage, firm n observes the characteristics x_i of each arriving agent i ∈ I(n). Based on x_i, firm n selects a shortlist of finalists I^F(n) ⊆ I(n), where |I^F(n)| = K^F for some K^F ∈ ℕ.
In the second stage, by interviewing the finalists, firm n observes an additional signal η_i for each finalist i (as assumed in Kleinberg and Raghavan, 2018). Firm n predicts each finalist i's skill from the characteristics x_i and the additional signal η_i, and hires one worker from the set of finalists, ι(n) ∈ I^F(n). Firms are not allowed to hire a worker who was not selected as a finalist. After the firm makes a decision, the skill of the hired worker, y_{ι(n)}, is publicly disclosed.

We assume the following linear relationship between the skill y_i and the observable variables x_i and η_i:

y_i = x'_i θ_{g(i)} + η_i + ε_i.

The "noise" term comprises two variables: η_i and ε_i. The signal η_i is revealed when the firm chooses i as a finalist. However, ε_i remains unpredictable even after the hiring decision; firms only observe y_i after worker i is hired.

For analytical tractability, besides Assumptions 1, 2, and 3, we make the following two assumptions.

Assumption 4 (Two Finalists). Each firm can invite only two finalists; i.e., K^F = 2.

Assumption 4 generates a minimal environment in which to study the performance of the Rooney Rule.

Assumption 5 (Normal Additional Signals). The signal that a finalist reveals is an independent and identically distributed normal random variable: η_i ∼ N(0, σ²_η).

Remark 6. If σ_η = 0, then the two-stage model is the same as the one-stage model that we have considered in the previous sections.

This subsection analyzes the performance of laissez-faire in this two-stage setting. The result is analogous to the one-stage case (Theorem 3): laissez-faire often falls into perpetual underestimation, and therefore, has linear regret.

First, we formally define the regret. As in the one-stage model, the benchmark is the first-best decision rule, which is the rule firms would take if the coefficient parameter θ were known.
Clearly, the first-best decision rule would greedily invite the top-K^F workers in terms of q_i to the final interview. We denote the set of finalists chosen by the first-best decision rule in round n by Ī^F(n). Formally, Ī^F(n) is obtained by solving the following problem:

Ī^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} q_i   s.t. |I'| = K^F.

After that, the first-best decision rule observes the realization of η_i for i ∈ Ī^F(n) and then hires the worker i who has the highest skill predictor, q_i + η_i. The unconstrained two-stage regret is defined as the loss relative to this first-best decision rule. (This regret is named "unconstrained" because we introduce an alternative definition of regret later.)

Definition 11 (Unconstrained Two-Stage Regret). In the two-stage hiring model, the unconstrained two-stage regret
U2S-Reg of decision rule ι is defined as follows:

U2S-Reg(N) = Σ_{n=1}^{N} { max_{i ∈ Ī^F(n)} (q_i + η_i) − (q_{ι(n)} + η_{ι(n)}) }.

Under laissez-faire, firm n's optimal strategy is to greedily choose finalists based on its belief, i.e.,

I^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} ˆq_i(n)   s.t. |I'| = K^F.

After observing the realization of the additional signals η_i, firm n again selects the candidate with the highest predicted skill: ι(n) = arg max_{i ∈ I^F(n)} { ˆq_i(n) + η_i }.

Even in the two-stage model, laissez-faire has linear regret when the population ratio is imbalanced.
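The two-stage laissez-faire round just described can be sketched end to end: draw workers from the linear model y_i = x_i θ_{g(i)} + η_i + ε_i (with d = 1), shortlist by predicted skill, hire on ˆq_i + η_i, and score the round against the first best. All parameter values and names here are our own illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_worker(theta_g, mu_x=3.0, sigma_x=2.0, sigma_eta=1.0, sigma_eps=2.0):
    """One worker: returns (x, eta, y) with y = x * theta_g + eta + eps.

    x is public; eta is revealed only if the worker becomes a finalist;
    eps stays unpredictable and is observed only through y after hiring.
    """
    x = rng.normal(mu_x, sigma_x)
    eta = rng.normal(0.0, sigma_eta)
    eps = rng.normal(0.0, sigma_eps)
    return x, eta, x * theta_g + eta + eps

def lf_round(q_hat, q_true, eta, K_F=2):
    """One laissez-faire round; returns (hired, unconstrained round regret).

    q_hat, q_true, eta: dicts mapping worker id -> q^_i, q_i, eta_i.
    """
    # Stage 1: shortlist the top-K_F workers by predicted skill.
    finalists = sorted(q_hat, key=q_hat.get, reverse=True)[:K_F]
    # Stage 2: hire the finalist with the highest q^_i + eta_i.
    hired = max(finalists, key=lambda i: q_hat[i] + eta[i])
    # First best shortlists by *true* skill, then hires on q_i + eta_i.
    fb = sorted(q_true, key=q_true.get, reverse=True)[:K_F]
    fb_value = max(q_true[i] + eta[i] for i in fb)
    return hired, fb_value - (q_true[hired] + eta[hired])
```

For instance, with q_hat = {'a': 2, 'b': 1, 'c': 0}, q_true = {'a': 0, 'b': 1, 'c': 2}, and all-zero signals, laissez-faire hires 'a' while the first best would hire 'c', so the round regret is 2; the underestimated worker ('c' here) never even reaches the interview stage.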
Theorem 10 (Failure of Laissez-Faire in the Two-Stage Model). Suppose Assumptions 1, 2, 3, 4, and 5. Suppose also that K₂ = 1 and d = 1, and let K₁ − log(K₁ + 1) > log N. Then, under the laissez-faire decision rule, group 2 is perpetually underestimated with probability at least C_imb = Θ̃(1). Accordingly, the expected regret of the laissez-faire decision rule is E[U2S-Reg^LF] = Ω̃(N).

Proof.
See Appendix B.8.

The proof idea of Theorem 10 is as follows. Under laissez-faire, each firm n interviews the two candidates who have the highest expected skills ˆq_i(n). If both of these workers are majorities, then minority workers are never hired, no matter what the signal η_i of each finalist is. By evaluating the probability that both finalists are majorities, we derive the probability that perpetual underestimation occurs. Note that, to meet K₁ − log(K₁ + 1) ≥ log N, K₁ should be Ω(log N) = Ω̃(1). Hence, Theorems 3 and 10 require the same rate of imbalance in the population ratio.

To summarize, even in the two-stage setting, the laissez-faire decision rule has linear regret (when the population ratio is imbalanced). This is because the laissez-faire decision rule results in perpetual underestimation with a significant probability.

7.3 The Rooney Rule and Exploration

As laissez-faire does not perform well, we need to seek a desirable policy intervention. The Rooney Rule, which requires each firm to invite at least one minority finalist to the final interview, is one of the natural affirmative actions in this setting and is widely implemented in real-world problems.
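To get a feel for the shortlist bottleneck behind Theorem 10's proof idea above, consider a crude symmetric benchmark (our own illustration, not the paper's argument, which works with estimated skills): if the K^F finalist seats were filled exchangeably across all K₁ + K₂ candidates, the per-round chance that every seat goes to a majority worker is a simple counting ratio.

```python
from math import comb

def p_all_majority(K1, K2, K_F=2):
    """P(all K_F finalists are majority workers) when each K_F-subset of
    the K1 + K2 candidates is equally likely -- a symmetric benchmark,
    not the estimation-driven bound in Theorem 10."""
    return comb(K1, K_F) / comb(K1 + K2, K_F)
```

With (K₁, K₂) = (10, 2), this benchmark already gives 45/66 ≈ 0.68 per round; under laissez-faire, the estimated skills of an underexplored minority are additionally biased downward, which pushes the exclusion probability toward one and makes perpetual exclusion from the shortlist plausible.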
Definition 12 (The Rooney Rule). In the two-stage hiring model, the Rooney Rule requires each firm n to select at least one finalist from every group g ∈ G; i.e., for every n and every g ∈ G, I^F(n) must satisfy

|{ i ∈ I^F(n) | g(i) = g }| ≥ 1.   (7)

The Rooney Rule is relatively easy to implement because it imposes no hiring quota and gives no hiring preference to minorities. The Rooney Rule was originally introduced as a National Football League policy to promote the hiring of ethnic-minority candidates for head coaching positions, but variations of the Rooney Rule are now implemented in many industries. Although the Rooney Rule has been used in many places, its theoretical performance has not been studied intensively.

For example, in a Securities and Exchange Commission filing posted in 2018, Amazon declares that "The Amazon Board of Directors has adopted a policy that the Nominating and Corporate Governance Committee include a slate of diverse candidates, including women and minorities, for all director openings. This policy formalizes a practice already in place". In addition, according to O'Brien (2018), Facebook COO Sheryl Sandberg said that "The company's 'diverse slate approach' is a sort of 'Rooney Rule,' the National Football League policy that requires teams to consider minority candidates."

To understand how the Rooney Rule resolves statistical discrimination, we introduce an alternative (weaker) notion of regret, the constrained two-stage regret.

Definition 13 (Constrained Two-Stage Regret). In the two-stage hiring model, the constrained two-stage regret (C2S-Reg) of decision rule ι is defined as follows:

C2S-Reg(N) = Σ_{n=1}^{N} { max_{i ∈ Ĭ^F(n)} (q_i + η_i) − (q_{ι(n)} + η_{ι(n)}) },

where Ĭ^F(n) is given by

Ĭ^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} q_i   (8)
s.t. |I'| = K^F, and for all g ∈ G, |{ i ∈ Ĭ^F(n) | g(i) = g }| ≥ 1.

In plain words, Ĭ^F(n) is the best list of finalists satisfying the constraint (7). If (7) were imposed as an "exogenous constraint" (rather than a policy), the first-best decision rule would interview Ĭ^F(n) to maximize social welfare. Clearly, the unconstrained regret is larger than the constrained regret.

The constrained regret is useful in that it enables us to identify whether the Rooney Rule prevents perpetual underestimation: if perpetual underestimation occurs under the Rooney Rule, then the constrained regret is linear in N. To the contrary, if social learning is successful (i.e., ˆq_i is very close to q_i for all workers), the constrained regret would be close to zero.

Under the Rooney Rule, a myopic firm n greedily chooses candidates based on the estimator ˆq_i(n), subject to the constraints:

I^F(n) = arg max_{I' ⊆ I(n)} Σ_{i ∈ I'} ˆq_i(n)   (9)
s.t. |I'| = K^F, and for all g ∈ G, |{ i ∈ I^F(n) | g(i) = g }| ≥ 1,

and ι(n) = arg max_{i ∈ I^F(n)} { ˆq_i(n) + η_i }. Note that the only difference between Eqs. (8) and (9) is that q_i is replaced by ˆq_i(n).

The following theorem states that the Rooney Rule is able to resolve perpetual underestimation when the signal η_i is sufficiently revealing.

Theorem 11 (Sublinear Constrained Regret under the Rooney Rule). Suppose Assumptions 1, 2, 3, 4, and 5. Then, the constrained regret under the Rooney Rule is bounded as

E[C2S-Reg
Rooney ( N ) (cid:3) ≤ C √ N where C is ˜ O (1) to N . Proof.
See Appendix B.9. The explicit form of C is found in Eq. (68) therein. Note that C depends exponentially on the signal variance σ_η (see the definition of C in Eq. (64)), which implies that a sufficiently large value of σ_η is required to obtain a reasonable bound.

The proof idea is as follows. When a group is underrepresented, with a significant probability no candidate from that group is regarded as the most promising finalist. Hence, laissez-faire may result in perpetual underestimation. The Rooney Rule mitigates this problem by securing a finalist seat for each group. If the additional signal is informative enough (i.e., σ_η is large), there is some probability that the minority finalist beats the majority finalist and is hired. In other words, the additional signal naturally induces exploration for the minority group and prevents perpetual underestimation.

Remark 7.
The Rooney Rule is analogous to the ε-greedy algorithm, which is widely studied in the multi-armed bandit and reinforcement learning literature. The ε-greedy algorithm usually makes a decision based on the greedy algorithm (equivalent to laissez-faire in our model), but with a small probability ε it chooses a worker uniformly at random. In the bandit literature, the ε-greedy algorithm is known to be robust to the choice of the exploration probability ε: in fact, one can prove that the regret of the ε-greedy algorithm is sublinear for any value ε > 0. In our model, the Rooney Rule successfully resolves underexploration because the randomness in the additional signal η_i induces ε-greedy-style experiments.

This subsection shows that, although the Rooney Rule successfully prevents statistical discrimination, it may worsen social welfare as evaluated by the original unconstrained regret. When the population ratio is imbalanced (i.e., K₁/K₂ is large), there is a significant probability that more than one majority worker has high skills. In that case, the true skill of the second-best majority worker (q_i) is likely to be higher than that of the minority champion. This feature raises a constant regret per round: when η_i is normally distributed, any finalist has a positive probability of being hired. Hence, the skills of all candidates matter, and therefore, firms want to interview the top-K^F candidates who have the highest skills. The Rooney Rule prevents this outcome. This effect would be present even if firms had perfect information about the coefficients θ. Furthermore, the loss from the constraint (7) is constant per round, and therefore, results in an unconstrained regret of Ω(N) in total.

Theorem 12 (Linear Unconstrained Regret under the Rooney Rule). Suppose Assumptions 1, 2, 3, 4, and 5. Then, the regret under the Rooney Rule is bounded as

E[U2S-Reg
^Rooney(N)] = Ω(N).

The proof is straightforward from the argument above, and therefore, is omitted.

In summary, both laissez-faire and the Rooney Rule have linear unconstrained regret. However, the structures behind these results are different: laissez-faire has linear regret due to underexploration, whereas the Rooney Rule has linear regret due to underexploitation.

One way to resolve this trade-off is to mix the Rooney Rule and laissez-faire (as the hybrid mechanism does). By starting with the Rooney Rule and abolishing it after sufficiently rich data are obtained, we can mitigate the disadvantage of the Rooney Rule. In Section 8, we also test the performance of such a mechanism.
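To see why a more informative interview signal induces exploration (the mechanism behind Theorem 11's proof idea), consider a Rooney shortlist of two finalists in which the minority finalist trails the majority finalist by a gap Δ in predicted skill. With independent N(0, σ_η²) signals and a hire-the-highest rule on ˆq_i + η_i, the minority is hired whenever its signal advantage exceeds Δ, which happens with probability 1 − Φ(Δ/(√2 σ_η)). A quick check of this comparative static (our own illustration, not a result from the paper):

```python
from math import erf, sqrt

def p_minority_hired(gap, sigma_eta):
    """P(minority finalist wins on q^ + eta) when the minority trails by
    `gap` in predicted skill and each finalist draws an independent
    N(0, sigma_eta^2) interview signal."""
    if sigma_eta == 0.0:
        return 0.0 if gap > 0 else 1.0
    z = gap / (sqrt(2.0) * sigma_eta)        # eta_min - eta_maj ~ N(0, 2 sigma_eta^2)
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))  # = 1 - Phi(z)
```

For a fixed gap, this exploration probability rises from essentially zero (uninformative signal) toward 1/2 (very informative signal), consistent with the comparative static in σ_η reported in Section 8.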
This section reports the results of the simulations that we ran to support our theoretical findings. Unless specified otherwise, the model parameters are set as follows: d = 1, μ_x = 3, σ_x = 2, σ_ε = 2. The regularizer of the regression is set to λ = 1. The group sizes are set to (K₁, K₂) = (10, 2). The initial sample size is N⁽⁰⁾ = K₁ + K₂, and the sample size for each group is equal to its population ratio: N₁⁽⁰⁾ = K₁, N₂⁽⁰⁾ = K₂. All the results are averaged over independent runs. The value of δ in the confidence bound is held fixed across all simulations.

We first test the effect of the population ratio on the frequency of perpetual underestimation (i.e., the event that group 2 is never hired after the initial sampling phase). The decision rule is fixed to laissez-faire (LF). We fix the number of minority candidates in each round to two (i.e., K₂ = 2) and vary the number of majority candidates (K₁ = 2, 10, 30, 100).

The source code of the simulations is available at https://github.com/jkomiyama/FairSocialLearning/
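A much-simplified replication of this first experiment can be sketched as follows (one-stage model, d = 1, ridge-regularized per-group least squares; this is our own sketch with placeholder coefficients, not the code in the repository linked above):

```python
import numpy as np

def lf_run(K1, K2, N=200, mu_x=3.0, sigma_x=2.0, sigma_eps=2.0,
           theta=(1.0, 1.0), lam=1.0, seed=0):
    """One laissez-faire run; returns True if group 2 is perpetually
    underestimated (never hired after the initial sampling phase)."""
    rng = np.random.default_rng(seed)
    sxx = {1: lam, 2: lam}   # ridge term folded into the Gram "matrix" (d = 1)
    sxy = {1: 0.0, 2: 0.0}

    def observe(g, x):       # record one (x, y) pair for group g
        y = x * theta[g - 1] + rng.normal(0.0, sigma_eps)
        sxx[g] += x * x
        sxy[g] += x * y

    # Initial sampling phase: K_g observations per group, as in the setup above.
    for g, n0 in ((1, K1), (2, K2)):
        for _ in range(n0):
            observe(g, rng.normal(mu_x, sigma_x))

    hired_minority = False
    for _ in range(N):
        cand = [(g, rng.normal(mu_x, sigma_x)) for g in [1] * K1 + [2] * K2]
        # Myopic firm: hire the candidate with the highest predicted skill.
        g, x = max(cand, key=lambda c: c[1] * sxy[c[0]] / sxx[c[0]])
        observe(g, x)
        hired_minority = hired_minority or (g == 2)
    return not hired_minority

pu = sum(lf_run(10, 2, seed=s) for s in range(20))
print(f"perpetual underestimation in {pu}/20 short runs")
```

Rerunning this with K₁ ∈ {2, 10, 30, 100} mimics the sweep in Figure 1; per Theorem 3, perpetual underestimation should become more frequent as K₁ grows.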
Figure 1: The number of perpetual underestimations among all runs under laissez-faire. The error bars are two-sigma binomial confidence intervals.

(a) Number of minority candidates hired. (b) Regret.
Figure 2: The comparison between the LF and UCB decision rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 are two-sigma confidence intervals.

Figure 1 exhibits the simulation result. Consistent with our theoretical analyses, we observe that (i) as indicated by Theorem 2, laissez-faire rarely results in perpetual underestimation if the population is balanced (i.e., K₁ is close to K₂ = 2), and (ii) as indicated by Theorem 3, perpetual underestimation becomes more frequent as the population of majority workers increases (i.e., as K₁ increases).

8.3 Laissez-Faire vs. the UCB Decision Rule

Figure 2a compares the number of minority workers hired by the laissez-faire (LF) and UCB decision rules. Figure 2b compares the regret under these two rules. The horizontal axis represents the round (where the total number of rounds is fixed to N = 1000), and the vertical axis represents the number of minority workers hired and the regret, respectively. The subsidy required by the UCB mechanism is shown later (in Figure 4).

As indicated by Theorem 3, our simulation shows that laissez-faire has a significant probability of underestimating the minority group. Consequently, we observe the following two facts. First, the number of minority workers hired on average is lower than under the first-best decision rule (which hires a minority worker with probability K₂/(K₁ + K₂) = 2/(10 + 2) ≈ 0.17 in each round). Second, laissez-faire sometimes causes perpetual underestimation, in which case the number of minority workers hired stays at zero and the regret grows linearly in n even in late rounds. Due to the possibility of perpetual underestimation, the spread of the sample paths (denoted by the red area) is very large, indicating that the performance of laissez-faire is highly uncertain.

In contrast, consistent with Theorem 4, the performance of the UCB decision rule is shown to be much more stable.
As the UCB rule avoids underexploration, it does not cause perpetual underestimation. Consequently, (i) the UCB rule's regret is lower than laissez-faire's on average, and (ii) the variance of the regret and of the number of minority workers hired is also small. Note that the UCB decision rule tends to hire more minority workers than the first-best decision rule. This outcome happens because society is typically less knowledgeable about the minority group (due to the uneven population ratio), and therefore, the confidence interval for minority workers is typically larger than that for majority workers.
(a) Number of minority candidates hired. (b) Regret.
Figure 3: The comparison between the UCB and hybrid decision rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 show the two-sigma confidence intervals of the expected regret.

We observe that the number of minority workers hired on average becomes closer to that of the first-best decision rule (Figure 3a). Furthermore, as expected from Theorems 4 and 8, in terms of efficiency (regret), the performance of these two decision rules grows at the same order. However, we find that the hybrid decision rule outperforms UCB in our simulation setting (Figure 3b). We conjecture that this happens because the hybrid decision rule stops overexploring the minority group at an early stage.

Figure 4 compares the total budgets required by (i) the UCB index subsidy rule (UCB), (ii) the hybrid index subsidy rule (Hybrid), (iii) the UCB cost-saving subsidy rule (CS-UCB), and (iv) the hybrid cost-saving subsidy rule (CS-Hybrid).

Figure 4a compares the index subsidy rules. As predicted by Theorems 5 and 9, the hybrid index subsidy rule requires a much smaller budget than the UCB index subsidy rule. Furthermore, the subsidy distributed by the UCB rule appears to be still growing even after all rounds are finished. This is also consistent with our theory, because the UCB rule requires an Õ(√N) subsidy (while the hybrid rule requires only an Õ(1) subsidy).

Figure 4b compares the subsidy amounts of the UCB cost-saving subsidy rule and the hybrid subsidy rules. The UCB index subsidy rule is excluded because it requires a much larger subsidy amount. We observe that (i) the two cost-saving subsidy rules require similar amounts of subsidy (while the hybrid cost-saving subsidy rule performs slightly better), and (ii) the cost-saving method is very effective, even when compared with the hybrid index rule.

Note that, although the subsidy amounts required by these two cost-saving rules are similar,
(a) The UCB index subsidy rule vs. the hybrid index subsidy rule. (b) The UCB cost-saving subsidy rule vs. the hybrid index subsidy rule and the hybrid cost-saving subsidy rule.

Figure 4: The comparison of the budgets required by the subsidy rules. The lines are averages over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. The error bars at N = 1000 show the two-sigma confidence intervals of the expected regret.

when we have more rounds, the hybrid cost-saving subsidy rule outperforms. Figure 5 illustrates this result. While the subsidy required by the hybrid cost-saving rule remains constant after a few (about 100) rounds, the subsidy required by the UCB cost-saving rule gradually grows. This result is also consistent with our theory: while the subsidy required by the hybrid rule is Õ(1) (Theorem 9), the subsidy required by the UCB cost-saving rule is Õ(√N) (Theorem 6).

This subsection describes the performance of the Rooney Rule compared with laissez-faire. Figure 6a depicts the relationship between the frequency of perpetual underestimation and the informativeness of the signal obtained at the second stage (measured by σ_η, the variance of η_i) under laissez-faire and the Rooney Rule.

For the Rooney Rule, we observe that when the second-stage signal η_i is more informative, perpetual underestimation occurs less often. This outcome happens because, even when the minority finalist is underestimated (the predicted skill ˆq_i is small while the true skill q_i is large), when σ_η is large, the minority finalist has a significant probability of overturning the situation. If this happens often enough, society can learn about the minority group, and statistical discrimination can be spontaneously resolved.
Figure 5: The UCB cost-saving subsidy rule vs. the hybrid index subsidy rule and the hybrid cost-saving subsidy rule with N = 10000. Each line is an average over sample paths, and the shaded areas cover the 5th to 95th percentiles of runs. Due to computational limitations, we performed fewer runs of this simulation. The error bars at N = 10000 show the two-sigma confidence intervals of the expected regret.

As for laissez-faire, we observe that it falls into perpetual underestimation with a significant probability for every σ_η adopted in the simulation. This outcome is consistent with our analysis (Theorem 10). Since minority workers are rarely chosen as finalists, they have no opportunity to be hired even when σ_η is large. These results imply that, even in the two-stage model, laissez-faire frequently results in statistical discrimination.

Figure 6b shows the constrained regret of the Rooney Rule. We observe that the constrained regret grows sublinearly in n, implying that the Rooney Rule resolves perpetual underestimation. Hence, under the Rooney Rule, society does not suffer from underexploration of the minority group.

However, this does not imply that the Rooney Rule comes without cost. As we discussed in Subsection 7.4, once the coefficient parameter θ is learned, the Rooney Rule may prevent society from making a fair and efficient decision. To test this, we also examine the growth of the unconstrained regret. Figure 7 exhibits the results of this simulation. We find that the performance of the Rooney Rule is worse than laissez-faire because the cost of underexploitation (under the Rooney Rule) exceeds the cost of underexploration (under laissez-faire).

As we indicated in Subsection 7.4, the performance of the Rooney Rule could be improved if we terminate it after "learning is completed." In this simulation, we also test such a rule, the Rooney-LF rule, which imposes the Rooney Rule on the first 100 firms and then turns to laissez-faire.
We find that the Rooney-LF rule avoids perpetual underestimation, and therefore, performs similarly to laissez-faire.

(a) The number of perpetual underestimations among all runs. The error bars show binomial confidence intervals. (b) The constrained two-stage regret, with σ_η fixed.

Figure 6: The Rooney Rule's performance for exploration.
Figure 7: The unconstrained two-stage regret under laissez-faire (LF), the Rooney Rule, and their hybrid (Rooney-LF).

This result indicates that, if we select the transition timing appropriately, then we can resolve statistical discrimination without compromising the quality of the finalists.
Kannan et al. (2018) show that when we have sufficiently large initial samples (i.e., N⁽⁰⁾ is large), the greedy algorithm (corresponding to laissez-faire in this paper) has sublinear regret. As stated in Remark 2, our analysis also indicates that the probability of perpetual underestimation is small

We also note that the number of initial samples required by the relevant theorem (n_min of Lemma 4.3 therein) is very large and cannot be satisfied in our simulation setting: letting R = σ_x √N, we have n_min ≥ R log(R dK/δ)/λ, which far exceeds our simulation horizon.

(a) The number of perpetual underestimations among all runs. (b) The total amount of subsidies; "Hybrid" denotes the hybrid cost-saving subsidy rule.

Figure 8: The comparison between the hybrid mechanism and uniform sampling. N0 (= N⁽⁰⁾) denotes the number of initial samples taken prior to laissez-faire. The error bars are two-sigma binomial confidence intervals.

when N⁽⁰⁾ is large (see Lemma 24 for full details).

One may think that this "warm-start" version of laissez-faire is efficient. However, the warm-start approach has several disadvantages. First, while we have thus far ignored the cost of acquiring initial samples for analytical tractability, we need to take into account the cost of acquiring uniform samples if we want a sufficiently long warm-start period. As uniform sampling ignores firms' incentives for hiring workers, we need a large budget to implement it in practice. Second, uniform sampling does not maximize any index. Accordingly, it cannot be implemented by any index policy.
Third, uniform sampling is inefficient in terms of information acquisition because it is not adaptive to the current estimated parameters.

We argue that our hybrid mechanism (Section 6) is a more sophisticated version of laissez-faire with a warm start: it initially samples the data adaptively and then switches to laissez-faire at an efficient timing. Hence, we can naturally expect the hybrid mechanism to outperform laissez-faire with initial uniform sampling.

Figure 8 exhibits the simulation results that compare the hybrid mechanism with laissez-faire with various initial samples. In this simulation, the number of initial samples for each group is proportional to the population ratio; i.e., $N_g^{(0)} = (K_g/K) \cdot N^{(0)}$.

Figure 8a measures the number of perpetual underestimations. As indicated by our theory, the larger the initial sample, the less frequently perpetual underestimation occurs. In addition, we observed no perpetual underestimation under the hybrid mechanism, as it solidly incentivizes hiring from an underexplored group.

Figure 8b depicts the subsidy amount required by the cost-saving subsidy rules (recall that uniform sampling cannot be implemented by any index subsidy rule). Here, we can observe that the hybrid cost-saving subsidy rule outperforms laissez-faire with uniform sampling. Laissez-faire requires a large number of initial samples to mitigate perpetual underestimation, and once $N^{(0)}$ is moderately large, the hybrid cost-saving subsidy rule requires a smaller budget than uniform sampling. This result indicates that the hybrid mechanism is more efficient in compensating firms.

Thus far, we have stated all the results in the terminology of the economics and statistical discrimination literature. However, this paper also makes several technical contributions to the literature on contextual bandit problems, which are of independent interest. In particular, we consider the non-discounted reward formulation (Robbins, 1952; Lai and Robbins, 1985).
Unlike other formulations, such as Gittins's (1979) discounted one (e.g., Sundaram, 2005; Bergemann and Välimäki, 2006), this formulation weights future rewards and the current reward equally. The greedy and UCB algorithms have been intensively studied in this literature, and we make several contributions to it. For the convenience of readers, we state our technical contributions using the bandit terminology.
Perpetual Underestimation
The greedy algorithm (which takes the optimal decision at each round based on plug-in parameter estimates) fails due to the randomness in finite samples. This concept originated in the "context-less" bandit, a traditional model that corresponds to the limit of $\sigma_x \to 0$. We prove that, when the context is fixed (or has very small variance), exploration is required to mitigate perpetual underestimation (Theorem 1).

Analysis of the Greedy Algorithm in a Disproportionate Model
Some previous studies (Bastani et al., 2020; Kannan et al., 2018) show that the greedy algorithm performs well if the context variation is sufficient. Our results (Theorem 3) indicate that, when multiple arms form a group (cluster) and share the coefficient parameter, the ratio of the group sizes is crucial for the performance of the greedy algorithm (laissez-faire). This is a novel finding in the contextual multi-armed bandit literature. When the contexts have limited variance, the greedy algorithm fails.
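This failure mode is easy to reproduce in a minimal illustration (arbitrary parameters, not the paper's simulation setup): in a context-less two-group instance, a greedy plug-in rule can abandon the better group after a single unlucky early draw.

```python
import random

def greedy_run(theta=(0.0, -0.5), horizon=500, seed=0):
    """One run of a greedy (laissez-faire-style) rule in a context-less
    two-group setting: each round, hire the group with the larger
    plug-in estimate sums[g] / (counts[g] + 1) (ridge with lambda = 1).
    theta, horizon, and the unit noise scale are illustrative choices."""
    rng = random.Random(seed)
    sums = [0.0, 0.0]   # cumulative observed skill per group
    counts = [0, 0]     # number of hires per group
    for _ in range(horizon):
        est = [sums[g] / (counts[g] + 1) for g in (0, 1)]
        g = 0 if est[0] >= est[1] else 1
        sums[g] += theta[g] + rng.gauss(0.0, 1.0)  # observed skill
        counts[g] += 1
    return counts

# Group 0 has the higher true mean (0.0 > -0.5), yet on some seeds an
# unlucky early draw locks the rule onto group 1 for the rest of the
# horizon: the better group stays perpetually underestimated.
locked_out = sum(1 for s in range(200) if greedy_run(seed=s)[0] < 10)
print(locked_out, "of 200 runs nearly abandoned the better group")
```

With no exploration bonus, a bad first estimate for the better group is never revised because that group is never sampled again.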
Development of the Hybrid Algorithm
Thus far, the contextual bandit literature (e.g., Chu et al., 2011) has studied regret in an "adversarial" setup where the contexts (characteristics) are chosen to maximize the regret, and the UCB algorithm was designed to solve such an adversarial bandit problem. By contrast, this paper assumes that the contexts are drawn from a fixed distribution. Our hybrid algorithm, which switches from a UCB algorithm to a greedy algorithm, takes advantage of the knowledge about the context distribution (more specifically, the information about $\sigma_x$) and selects an appropriate time for switching. Consequently, we obtain the proportionality result (Lemma 28), which is a crucial lemma for evaluating the performance of the hybrid algorithm. As shown theoretically (Theorem 9) and numerically (Subsection 8.6), the hybrid algorithm outperforms the UCB algorithm to a large extent in terms of the total budget required.

Analysis of the Rooney Rule
To our knowledge, this is the first multi-armed bandit study of the Rooney Rule. We show that the greedy algorithm underexplores some arms even when agents are unbiased and fully rational, and that the Rooney Rule can mitigate this underexploration (Theorem 11). The uncertainty in the first stage (the realization of $\eta_i$) helps to mitigate perpetual underestimation by implicitly encouraging exploration.
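The interview constraint itself is easy to state in code. The following is a stylized sketch: the scores, group labels, and swap policy are illustrative, not the paper's two-stage model.

```python
def rooney_shortlist(candidates, score, group, m):
    """Shortlist the top-m candidates by first-stage score, but if no
    minority (group 1) candidate makes the cut, swap the weakest
    finalist for the best minority candidate -- a stylized Rooney Rule."""
    ranked = sorted(candidates, key=lambda i: score[i], reverse=True)
    shortlist = ranked[:m]
    minority = [i for i in candidates if group[i] == 1]
    if minority and not any(group[i] == 1 for i in shortlist):
        shortlist[-1] = max(minority, key=lambda i: score[i])
    return shortlist

score = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
group = {"a": 0, "b": 0, "c": 0, "d": 1}
# Without the constraint the shortlist would be ["a", "b"]; the rule
# guarantees the minority candidate "d" is interviewed.
print(rooney_shortlist(list(score), score, group, m=2))
```

The constraint binds only when the unconstrained shortlist excludes the minority group, which is exactly the situation in which the data about that group would otherwise never accumulate.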
10 Conclusion
We studied statistical discrimination using a contextual multi-armed bandit model. Our dynamic model articulates that statistical discrimination can be caused by a failure of social learning. In our model, the insufficiency of the data about the minority group is endogenously generated. The lack of data prevents firms from estimating the candidate workers' skills accurately. Consequently, firms tend to prefer hiring a majority worker, which makes the data insufficiency persistent (perpetual underestimation). In our setting, this form of statistical discrimination is not only unfair but also inefficient. Our results indicate that temporary affirmative action would be the best solution to resolve statistical discrimination as a failure of social learning.

Prior to our work, Kleinberg and Raghavan (2018) study the Rooney Rule in the context of evaluation bias.
Appendix
A Lemmas
This section describes the technical lemmas that are used for deriving the theorems. The Hoeffding inequality, one of the most well-known concentration inequalities, provides a high-probability bound on the sum of bounded independent random variables.
Lemma 13 (Hoeffding inequality). Let $x_1, x_2, \ldots, x_n$ be i.i.d. random variables in $[0,1]$. Let $\bar{x} = (1/n)\sum_{t=1}^n x_t$. Then,
$$\Pr[\bar{x} - \mathbb{E}[\bar{x}] \ge k] \le e^{-2nk^2}.$$
The same argument yields $\Pr[\bar{x} - \mathbb{E}[\bar{x}] \le -k] \le e^{-2nk^2}$, and taking a union bound yields $\Pr[|\bar{x} - \mathbb{E}[\bar{x}]| \ge k] \le 2e^{-2nk^2}$.

The following is a version of the concentration inequality for a sum of squared normal variables.
Lemma 14 (Concentration Inequality for the Chi-squared Distribution). Let $Z_1, Z_2, \ldots, Z_n$ be independent standard normal variables. Then,
$$\Pr\left[\left|\frac{1}{n}\sum_{k=1}^n Z_k^2 - 1\right| \ge t\right] \le 2e^{-nt^2/8}.$$

Lemma 15 (Normal Tail Bound, Feller, 1968). Let $\phi(x) := e^{-x^2/2}/\sqrt{2\pi}$ be the probability density function (pdf) of a standard normal random variable. Let $\Phi^c(x) = \int_x^\infty \phi(x')\,dx'$. Then,
$$\left(\frac{1}{x} - \frac{1}{x^3}\right)\frac{e^{-x^2/2}}{\sqrt{2\pi}} \le \Phi^c(x) \le \frac{1}{x}\frac{e^{-x^2/2}}{\sqrt{2\pi}}.$$

Lemma 16 (Largest Context, Theorem 1.14 in Rigollet, 2015). Let $x_i \sim N(\boldsymbol{\mu}_x, \sigma_x^2 I_d)$ for each $i \in I(n)$. Let $\mu_x = \|\boldsymbol{\mu}_x\|$ and
$$L = L(\delta) := \mu_x + \sigma_x\sqrt{d\left(2\log(KN) + \log(1/\delta)\right)}.$$
Then, with probability at least $1 - \delta$, we have $\|x_i\| \le L(\delta)$ for all $i \in I(n)$, $n \in [N]$.

The following bounds the variance of a conditioned normal variable.
Lemma 17 (Conditioned Tail Deviation). Let $x \sim N(a, 1)$ be a scalar normal random variable with mean $a \in \mathbb{R}$ and unit variance. Then, for any $b \in \mathbb{R}$, $\mathrm{Var}(x \mid x \ge b) \ge 1/4$.

Proof of Lemma 17. Without loss of generality, we assume $b = 0$ because otherwise we can reparametrize $x' = x - b \sim N(a - b, 1)$. If $a \le 0$, the pdf of the conditioned variable $x \mid x \ge 0$ is $\psi(x)$ for $x \ge 0$, and manual evaluation of this distribution reveals that $\mathrm{Var}(x \mid x \ge 0) \ge 1/4$. Otherwise ($a > 0$), the pdf of $x \mid x \ge 0$ satisfies $p(x) \ge \psi(x - a)$ for $x \ge a$, which implies $\mathrm{Var}(x \mid x \ge 0) \ge \mathrm{Var}(z)$, where $z$ is a "half-normal" random variable with cumulative distribution function
$$P(z) = \begin{cases} \Phi(z) & \text{if } z > 0, \\ 1/2 & \text{if } z = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Manual evaluation of $\mathrm{Var}(z)$ also shows that $\mathrm{Var}(z) \ge 1/4$. (Half of the mass of this distribution lies in $z > 0$; the other half is at $z = 0$.)

The following diversity condition, which simplifies the original definition of Kannan et al. (2018), is used to lower-bound the expected minimum eigenvalue of $\bar{V}_g$.

Lemma 18 (Diversity of the Multivariate Normal Distribution). The context $x$ is $\lambda_0$-diverse for $\lambda_0 \in \mathbb{R}$ if, for any $\hat{b} \in \mathbb{R}$ and $\hat{\theta} \in \mathbb{R}^d$,
$$\lambda_{\min}\left(\mathbb{E}\left[xx' \mid x'\hat{\theta} \ge \hat{b}\right]\right) \ge \lambda_0.$$
Let $x \sim N(\boldsymbol{\mu}_x, \sigma_x^2 I_d)$. Then, the context $x$ is $\lambda_0$-diverse with $\lambda_0 = \sigma_x^2/4$.

Proof of Lemma 18.
$$\lambda_{\min}\left(\mathbb{E}\left[xx' \mid x'\hat{\theta} \ge \hat{b}\right]\right) = \min_{v: \|v\| = 1} \mathbb{E}\left[(v'x)^2 \mid x'\hat{\theta} \ge \hat{b}\right] \ge \min_{v: \|v\| = 1} \mathrm{Var}\left[v'x \mid x'\hat{\theta} \ge \hat{b}\right].$$
Let $e_1, e_2, \ldots, e_d$ be orthogonal bases. Without loss of generality, we assume $\hat{\theta} = \theta_1 e_1$ for some $\theta_1 \ge 0$ and $\boldsymbol{\mu}_x = u_1 e_1 + u_2 e_2$ for some $u_1, u_2 \in \mathbb{R}$. Let $x = x_1 e_1 + x_2 e_2 + \cdots + x_d e_d$. Due to the property of the normal distribution, the coordinates $x_l$ for $l \in [d]$ are independent of each other. We will show that
$$\mathrm{Var}\left[x_l \mid x'\hat{\theta} \ge \hat{b}\right] \ge \sigma_x^2/4, \qquad (10)$$
which suffices to prove Lemma 18.

- For the first dimension, we have $x_1 \sim N(u_1, \sigma_x^2)$ and $\mathrm{Var}[x_1 \mid x'\hat{\theta} \ge \hat{b}] = \mathrm{Var}[x_1 \mid x_1 \ge \hat{b}/\theta_1]$. Applying Lemma 17 to $x_1/\sigma_x$ with $a = u_1/\sigma_x$ and $b = \hat{b}/(\theta_1\sigma_x)$ yields $\mathrm{Var}[x_1 \mid x_1 \ge \hat{b}/\theta_1] \ge \sigma_x^2/4$.
- For the second dimension, we have $x_2 \sim N(u_2, \sigma_x^2)$ and $\mathrm{Var}[x_2 \mid x'\hat{\theta} \ge \hat{b}] = \mathrm{Var}[x_2] = \sigma_x^2 > \sigma_x^2/4$.
- $(x_3, x_4, \ldots, x_d) \sim N(0, \sigma_x^2 I_{d-2})$. These coordinates are unaffected by the conditioning, and thus $\mathrm{Var}(x_l) = \sigma_x^2 > \sigma_x^2/4$.

In summary, we have Eq. (10), which concludes the proof.
Lemma 19 (Martingale Inequality on Ridge Regression, Abbasi-Yadkori et al., 2011). Assume that $\|\theta_g\| \le S$. Take $\delta > 0$ arbitrarily. With probability at least $1 - \delta$, the true parameter $\theta_g$ is bounded as
$$\forall n, \quad \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \sigma_\epsilon\sqrt{2\log\left(\frac{\det(\bar{V}_g(n))^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2}S. \qquad (11)$$
Moreover, let $L = \max_{i,n}\|x_i(n)\|$ and
$$\beta_n(L, \delta) = \sigma_\epsilon\sqrt{d\log\left(\frac{1 + nL^2/\lambda}{\delta}\right)} + \lambda^{1/2}S.$$
Then, with probability at least $1 - \delta$,
$$\forall n, \quad \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \beta_n(L, \delta). \qquad (12)$$

The following lemma is used in deriving a regret bound.

Lemma 20 (Sum of Diminishing Contexts, Lemma 11 in Abbasi-Yadkori et al., 2011). Let $\lambda \ge 1$ and $L = \max_{n,i}\|x_i(n)\|$. Then, the following inequality holds for any group $g$:
$$\sum_{n: \iota(n) = g}\left\|x_{\iota(n)}\right\|^2_{(\bar{V}_g(n))^{-1}} \le 2L^2\log\left(\frac{\det(\bar{V}_g(N))}{\det(\lambda I_d)}\right).$$

The following inequality is used to bound the variation of the minimum eigenvalue of the sum of characteristics (contexts).

Lemma 21 (Matrix Azuma Inequality, Tropp, 2012). Let $X_1, X_2, \ldots, X_n$ be an adaptive sequence of $d \times d$ symmetric matrices such that $\mathbb{E}_{k-1}[X_k] = 0$ and $X_k^2 \preceq A_k^2$ almost surely, where $A \succeq B$ for two matrices denotes that $A - B$ is positive semidefinite. Let
$$\sigma_A^2 := \left\|\frac{1}{n}\sum_{k=1}^n A_k^2\right\|,$$
where the matrix norm is defined by the largest eigenvalue. Then, for all $t \ge 0$,
$$\Pr\left[\lambda_{\min}\left(\sum_k X_k\right) \le -t\right] \le d\exp\left(-\frac{t^2}{8n\sigma_A^2}\right).$$

Proof. The proof directly follows from Theorem 7.1 and Remark 3.10 in Tropp (2012).

The following lemma states that the selection bias makes the variance only slightly ($O(1/\log K)$ times) smaller than the original variance.

Lemma 22 (Variance of Maximum, Theorem 1.8 in Ding, Eldan, and Zhai, 2015). Let $x_1, \ldots, x_K \in \mathbb{R}$ be i.i.d. samples from $N(0,1)$. Let $I_{\max} = \arg\max_{i \in [K]} x_i$. Then, there exists a distribution-independent constant $C_{\mathrm{varmax}} > 0$ such that
$$\mathrm{Var}[x_{I_{\max}}] \ge \frac{C_{\mathrm{varmax}}}{\log(K)}.$$

B Proofs
This section is structured as follows. Section B.1 describes the common inequalities that we assume throughout the section. Proofs of the individual theorems follow.
B.1 Common Inequalities
In the proofs, we often ignore events that happen with probability $O(1/N)$. The expected regret per round is at most $\max_i x_i'\theta_{g(i)} - \min_i x_i'\theta_{g(i)}$, which is $O(1)$ in expectation. Hence, events that happen with probability $O(1/N)$ contribute to the regret by $O(1/N \times N) = O(1)$, which is insignificant in our analysis. In particular,
$$\Pr\left[\forall n \in [N], i \in I(n),\ \|x_i(n)\| \le L\left(\tfrac{1}{N}\right)\right] \ge 1 - \frac{1}{N} \quad \text{(by Lemma 16)}. \qquad (13)$$
Moreover,
$$\Pr\left[\forall n \in [N], g \in G,\ \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)} \le \beta_n\left(L, \tfrac{1}{N}\right)\right] \ge 1 - \frac{|G|}{N} \quad \text{(by Eq. (12) in Lemma 19)}, \qquad (14)$$
and throughout the proofs we ignore the cases where these events do not hold: all the contexts are bounded by $L(1/N) = O(\sqrt{\log N})$, and all the confidence bounds hold with $\beta_n(L(1/N), 1/N) \le \beta_N(L(1/N), 1/N) = O(\sqrt{\log N}) = \tilde{O}(1)$, which grows very slowly as $N$ grows large. (The exception is Theorem 1, which does not pose distributional assumptions.) We also denote $L = L(1/N)$ and $\beta_N = \beta_N(L, 1/N)$.

We next discuss the upper confidence bounds.

Remark 8 (Bound for $\tilde{\theta}_i$). Let $\tilde{\theta}_i = \arg\max_{\bar{\theta}_{g(i)} \in \mathcal{C}_{g(i)}(n)} x_i'\bar{\theta}_{g(i)}$. By definition of $\tilde{\theta}_i$, the following holds:
$$\forall n, \quad \left\|\tilde{\theta}_i - \hat{\theta}_{g(i)}(n)\right\|_{\bar{V}_g(n)} \le \beta_N, \qquad (15)$$
and Eq. (14) implies
$$\forall n, \quad x_i'\left(\tilde{\theta}_i - \theta_{g(i)}\right) \ge 0. \qquad (16)$$
Moreover, by using the triangle inequality, we have
$$\left\|\tilde{\theta}_i - \theta_g\right\|_{\bar{V}_g(n)} \le \left\|\tilde{\theta}_i - \hat{\theta}_g(n)\right\|_{\bar{V}_g(n)} + \left\|\hat{\theta}_g(n) - \theta_g\right\|_{\bar{V}_g(n)},$$
and thus Eq. (14) implies
$$\forall n, \quad \left\|\tilde{\theta}_i - \theta_g\right\|_{\bar{V}_g(n)} \le 2\beta_N. \qquad (17)$$

We use calligraphic font to denote events.
For two events $\mathcal{A}, \mathcal{B}$, let $\mathcal{A}^c$ be the complementary event and $\{\mathcal{A}, \mathcal{B}\} := \{\mathcal{A} \cap \mathcal{B}\}$. We also use a prime to denote an event that is close to the original event: for example, event $\mathcal{A}'$ is different from event $\mathcal{A}$, but the two events are deeply linked.

Finally, we discuss the minimum eigenvalue. We denote $A \succeq B$ for two $d \times d$ matrices if $A - B$ is positive semidefinite, that is, $\lambda_{\min}(A - B) \ge 0$. Note that $\lambda_{\min}(A + B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$, that $\lambda_{\min}(A + B) \ge \lambda_{\min}(A)$ if $B \succeq 0$, and that $xx' \succeq 0$ for any vector $x \in \mathbb{R}^d$.

B.2 Proof of Theorem 1
Consider an environment where there are two groups, $G = \{1, 2\}$, and two workers arrive in each round, $K_1 = K_2 = 1$. We assume that the error terms follow a standard normal distribution, i.e., $\sigma_\epsilon = 1$, and we set the ridge regression parameter to $\lambda = 1$. We assume $N_1^{(0)} = 1$ and $N_2^{(0)} = 0$. We consider a "no context" setting: namely, $d = 1$ and $x_i = 1$ for all $i \in I$. We assume that $\theta_1 = 0$ and $\theta_2 = -b$ with $b > 0$; under this assumption, hiring a worker from group 2 incurs a regret of $b$. (The following derivation does not strongly depend on the specific values of these parameters nor on the number of groups.)

Let $R_g(n)$ be the sum of the skills of the workers who have been hired until round $n$ (i.e., have arrived from round 1 to $n-1$) and belong to group $g$. Then, (2) implies that $\hat{\theta}_g(n) = R_g(n)/(N_g(n) + \lambda) = R_g(n)/(N_g(n) + 1)$. Since $x_i$ is fixed to 1, firm $n$ chooses the group whose predicted expected skill $\hat{\theta}_g(n)$ is larger.

In round 1, the firm hires the group-1 worker: $\iota(1) = 1$. Let $b' > b$ be a constant that we specify later. Let
$$\mathcal{A} := \left\{\hat{\theta}_1(2) < -b'\right\} = \left\{\epsilon_{\iota(1)} < -(1 + \lambda)b'\right\}$$
be the event that the skill of the worker hired in round 1 is smaller than $-(1 + \lambda)b' = -2b'$. The probability that $\mathcal{A}$ occurs is $\Phi(-2b')$, where $\Phi(x)$ is the cumulative distribution function of a standard normal distribution. Let
$$\mathcal{B} := \bigcap_{n=1}^N\left\{\hat{\theta}_2(n) \ge -b'\right\}$$
be the event that $\hat{\theta}_2(n)$ never becomes smaller than $-b'$.

We evaluate the probability that $\mathcal{A} \cap \mathcal{B}$ occurs. When such an event happens, a group-1 worker is hired in round 1, and group-2 workers are hired in all the subsequent rounds (i.e., $\iota(n) = 2$ for any round $n > 1$). Accordingly, $N_2(n) = n - 2$ is the case for all $n \ge 2$. Since
$$\left\{\hat{\theta}_2(n) \ge -b'\right\} = \left\{\frac{R_2(n)}{N_2(n) + 1} \ge -b'\right\} \supseteq \left\{\frac{R_2(n)}{N_2(n)} \ge -b'\right\},$$
applying Hoeffding's inequality to the empirical average $R_2(n)/N_2(n)$, we have
$$P\left(\frac{R_2(n)}{N_2(n)} < -b'\right) \le \exp\left(-2(b' - b)^2(n - 2)\right).$$
Accordingly,
$$P(\mathcal{B}) \ge 1 - \sum_{n=3}^N\exp\left(-2(b' - b)^2(n - 2)\right).$$
Here,
$$\sum_{n=3}^N\exp\left(-2(b' - b)^2(n - 2)\right) \le \sum_{n=3}^\infty\exp\left(-2(b' - b)^2(n - 2)\right) = \frac{\exp\left(-2(b' - b)^2\right)}{1 - \exp\left(-2(b' - b)^2\right)} \le \frac{1}{2(b' - b)^2},$$
and thus $\mathcal{B}$ occurs with constant probability $1 - 1/(2(b' - b)^2) > 0$ for any $b' > b + 1/\sqrt{2}$. Remember that $\mathcal{A} \cap \mathcal{B}$ implies that arm 1 is never drawn after $n > 1$, and thus $\mathrm{Reg}(N) \ge b(N - 1)$. In conclusion, we have
$$\mathbb{E}[\mathrm{Reg}(N)] \ge \Phi(-2b') \cdot \left(1 - \frac{1}{2(b' - b)^2}\right) \cdot b \cdot (N - 1) = \Omega(N),$$
as desired.

B.3 Proof of Theorem 2
We first bound the regret per round, $\mathrm{reg}(n) := \mathrm{Reg}(n) - \mathrm{Reg}(n-1)$, in Lemma 23. Then, we prove Theorem 2.

Lemma 23 (Regret per Round). Under the laissez-faire decision rule, the regret per round is bounded as
$$\mathrm{reg}(n) \le 2\max_{i \in I(n)}\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}}.$$

Proof of Lemma 23.
We denote the first-best decision rule by $i^*(n) := \arg\max_{i \in I(n)} x_i'\theta_{g(i)}$. Then,
$$\begin{aligned}
\mathrm{reg}(n) &= x_{i^*}'\theta_{g(i^*)} - x_\iota'\theta_{g(\iota)} \\
&= x_{i^*}'\left(\hat{\theta}_{g(i^*)} + \theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right) - x_\iota'\left(\hat{\theta}_{g(\iota)} + \theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right) \\
&\le x_{i^*}'\left(\theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right) - x_\iota'\left(\theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right) && \text{(by the greedy choice of the firm)} \\
&\le \|x_{i^*}\|_{\bar{V}_{g(i^*)}^{-1}}\left\|\theta_{g(i^*)} - \hat{\theta}_{g(i^*)}\right\|_{\bar{V}_{g(i^*)}} + \|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\left\|\theta_{g(\iota)} - \hat{\theta}_{g(\iota)}\right\|_{\bar{V}_{g(\iota)}} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\max_{i \in I(n)}\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}}.
\end{aligned}$$

The following proves Theorem 2. For the ease of discussion, we assume $N^{(0)} = 0$; that is, there is no initial sampling phase (taking it into consideration is straightforward). We first show that, regardless of the estimated values $\hat{\theta}_1, \hat{\theta}_2$, the candidate of group 2 is drawn with constant probability. Let $\mu_x = \|\boldsymbol{\mu}_x\|$, and let
$$\mathcal{M}_1(n) = \left\{x_1'(n)\hat{\theta}_1 \le 0\right\}, \qquad \mathcal{M}_2(n) = \left\{x_2'(n)\hat{\theta}_2 > 0\right\}.$$
The sign of $x_g'\hat{\theta}_g(n)$ is solely determined by the component of $x_g(n)$ that is parallel to $\hat{\theta}_g(n)$. This component is drawn from $N(\mu_{x,\parallel}, \sigma_x^2)$, where $\mu_{x,\parallel}$ is the component of $\boldsymbol{\mu}_x$ that is parallel to $\hat{\theta}_g(n)$. Therefore, for any $\hat{\theta}_1$, we have
$$\Pr[\mathcal{M}_1(n)] \ge \Phi^c(\mu_x/\sigma_x). \qquad (18)$$
Likewise, for $\hat{\theta}_2 \ne 0$, we have
$$\Pr[\mathcal{M}_2(n)] \ge \Phi^c(\mu_x/\sigma_x). \qquad (19)$$
(Equality $\Pr[\mathcal{M}_1(n)] = \Phi^c(\mu_x/\sigma_x)$ holds when $\mu_{x,\parallel} = \mu_x$, namely, when the direction of $\boldsymbol{\mu}_x$ is exactly the same as $\hat{\theta}_1$. In the subsequent discussion, we ignore the point mass $\hat{\theta}_2 = 0$, which has measure zero for $N_2(n) > 0$.)

Let $\mathcal{X}_g(n) = \{g(\iota(n)) = g\}$ for $g \in \{1, 2\}$. By using Eq. (18) and (19),
$$\Pr[\mathcal{X}_2(n)] = \Pr\left[x_1'(n)\hat{\theta}_1 < x_2'(n)\hat{\theta}_2\right] \ge \Pr\left[x_1'(n)\hat{\theta}_1 \le 0 < x_2'(n)\hat{\theta}_2\right] = \Pr[\mathcal{M}_1(n), \mathcal{M}_2(n)] \ge \left(\Phi^c(\mu_x/\sigma_x)\right)^2. \qquad (20)$$
Let $N_2^{(M)}(n) = \sum_{n'=1}^n \mathbf{1}[\mathcal{M}_1(n'), \mathcal{X}_2(n')] \le N_2(n)$. Eq. (20) implies $\mathbb{E}[N_2^{(M)}(n)] \ge (\Phi^c(\mu_x/\sigma_x))^2 n$. By using the Hoeffding inequality, with probability at least $1 - 1/N^2$, we have
$$N_2^{(M)}(n) \ge n\left((\Phi^c(\mu_x/\sigma_x))^2 - k\right) \qquad (21)$$
for $k = \sqrt{\log(N)/n}$. Therefore, a union bound over $n = 1, 2, \ldots, N$ implies that Eq. (21) holds for all $n$ with probability at least $1 - \sum_n 1/N^2 = 1 - 1/N$.

In the following, we bound $\lambda_{\min}(\bar{V}_g)$. Note that the hiring of a worker under the events $\mathcal{M}_1(n), \mathcal{X}_2(n)$ satisfies the diversity condition (Lemma 18) with $\hat{b} = 0$, and we have $\lambda_{\min}(\mathbb{E}[x_\iota x_\iota' \mid \mathcal{M}_1(n), \mathcal{X}_2(n)]) \ge \lambda_0$ with $\lambda_0 = \sigma_x^2/4$. Using the matrix Azuma inequality (Lemma 21) for the subsequence $\{x_\iota x_\iota' : \mathcal{M}_1(n), \mathcal{X}_2(n)\}$ with $X = x_\iota x_\iota' - \mathbb{E}[x_\iota x_\iota']$ and $\sigma_A = 2L^2$, for $t = \sqrt{8N\sigma_A^2\log(dN)}$, with probability $1 - 1/N$,
$$\lambda_{\min}\left(\sum_{n: \iota(n) = 2} x_\iota x_\iota'\right) \ge N_2^{(M)}\lambda_0 - t. \qquad (22)$$
In summary, with probability $1 - O(1/N)$, Eq. (21) and (22) hold, and then we have
$$\lambda_{\min}(\bar{V}_2(n)) \ge N_2^{(M)}\lambda_0 - t \ge \left(n(\Phi^c(\mu_x/\sigma_x))^2 - k\right)\lambda_0 - \sqrt{8N\sigma_A^2\log(dN)} = n(\Phi^c(\mu_x/\sigma_x))^2\lambda_0 - \tilde{O}(\sqrt{n}). \qquad (23)$$
By the symmetry of the two groups, exactly the same result as Eq. (23) holds for group 1. In the following, we bound the regret as a function of $\min_g\lambda_{\min}(\bar{V}_g)$. Eq. (23) holds with probability $1 - O(1/N)$, and we ignore the events of probability $O(1/N)$, which do not affect the analysis.
The regret is bounded as
$$\begin{aligned}
\mathrm{Reg}(N) &\le \sum_n 2\max_i\|x_i\|_{\bar{V}_{g(i)}^{-1}}\left\|\theta_{g(i)} - \hat{\theta}_{g(i)}\right\|_{\bar{V}_{g(i)}} && \text{(by Lemma 23)} \qquad (24) \\
&\le \sum_n 2\max_i\|x_i\|_{\bar{V}_{g(i)}^{-1}}\beta_N && \text{(by Eq. (14))} \\
&\le \sum_n 2\max_i\frac{\|x_i\|}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}\beta_N && \text{(by the definition of eigenvalues)} \\
&\le \sum_n\frac{2L}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}\beta_N && \text{(by Eq. (13))} \\
&\le 2L\sum_n\min\left(\frac{1}{\sqrt{\lambda_{\min}(\bar{V}_{g(i)})}}, \frac{1}{\sqrt{\lambda}}\right)\beta_N && \text{(by } \lambda_{\min}(\bar{V}_{g(i)}) \ge \lambda\text{)} \\
&\le 2L\sum_n\min\left(\frac{1}{\sqrt{n(\Phi^c(\mu_x/\sigma_x))^2\lambda_0 - \tilde{O}(\sqrt{n})}}, \frac{1}{\sqrt{\lambda}}\right)\beta_N && \text{(by Eq. (23))} \\
&\le 4L\sqrt{\frac{N}{(\Phi^c(\mu_x/\sigma_x))^2\lambda_0}}\,\beta_N + \tilde{O}(1) && \left(\text{by } \sum_{n=C+1}^N\frac{1}{\sqrt{n - C\sqrt{n}}} \le 2\sqrt{N} + \tilde{O}(1) \text{ for } C = \tilde{O}(1)\right),
\end{aligned}$$
which completes the proof of Theorem 2.

B.4 Proof of Theorem 3
Since we consider the $d = 1$ case in this theorem, we remove the bold style from scalar variables. In this proof, we assume $\mu_x\theta_2 > 0$ and $\theta_1 > 0$; the proof for the case of $\mu_x\theta_2 < 0$ or $\theta_1 < 0$ is similar. Let $\hat{\theta}_{g,t}$ be the value of $\hat{\theta}_g$ when the group-$g$ candidate has been chosen $t$ times. With a slight abuse of notation, we use $i_2 = i_2(n)$ to denote the unique candidate of group 2 in each round $n$. We first define several events that characterize perpetual underestimation. Namely,
$$\mathcal{P} = \left\{\left|\hat{\theta}_{2, N_2^{(0)}}\right| < b\theta_2\right\}, \qquad \mathcal{P}'(n) = \left\{x_{i_2}(n)\,\hat{\theta}_{2, N_2^{(0)}} < \frac{\mu_x\theta_1}{2}\right\},$$
$$\mathcal{Q} = \left\{\forall t \ge N_1^{(0)},\ \hat{\theta}_{1,t} \ge \frac{\theta_1}{2}\right\}, \qquad \mathcal{Q}'(n) = \left\{\exists i \text{ s.t. } g(i) = 1,\ x_i\hat{\theta}_{1, N_1(n)} \ge \frac{\mu_x\theta_1}{2}\right\},$$
where $b$ is a small constant that we specify later (we will take $b = O(1/\log N)$). $\mathcal{P}$ and $\mathcal{P}'$ are about the minority, whereas $\mathcal{Q}$ and $\mathcal{Q}'$ are about the majority. Intuitively, event $\mathcal{P}$ states that $\hat{\theta}_2$ is largely underestimated, and $\mathcal{P}'$ states that the minority candidate is undervalued; $\mathcal{Q}$ states that the majority parameter $\hat{\theta}_1$ is consistently lower-bounded, and $\mathcal{Q}'$ states the stability of the best candidate of the majority after $n$ rounds. Under laissez-faire,
$$\bigcap_{n=1}^N\left(\mathcal{P}'(n) \cap \mathcal{Q}'(n)\right)$$
implies that a group-1 worker is hired in every round ($g(\iota(n)) = 1$ for all $n$), which is exactly the perpetual underestimation of Definition 2. Therefore, proving
$$\Pr\left[\bigcap_{n=1}^N\left(\mathcal{P}'(n) \cap \mathcal{Q}'(n)\right)\right] = \tilde{\Omega}(1) \qquad (25)$$
concludes the proof. We bound these events by the following lemmas and finally derive Eq. (25).

Lemma 24.
There exists a constant $C_1 > 0$ such that $\Pr[\mathcal{P}] \ge C_1 b$.

Proof of Lemma 24.
We write $x_{i_2,t}$ for the $t$-th sample of group 2 during the initial sampling phase, which is an i.i.d. sample from $N(\mu_x, \sigma_x^2)$. Likewise, we write $y_{i_2,t} = x_{i_2,t}\theta_2 + \epsilon_t$. Then,
$$\Pr[\mathcal{P}] = \Pr\left[\left|\frac{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}(x_{i_2,t}\theta_2 + \epsilon_t)}{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 + \lambda}\right| \le b\theta_2\right] = \Pr\left[-g(b) \le \sum_{t=1}^{N_2^{(0)}} x_{i_2,t}(x_{i_2,t}\theta_2 + \epsilon_t) \le g(b)\right],$$
where
$$g(b) = b\theta_2\left(\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 + \lambda\right).$$
Let $x_{i_2,t} = \mu_x + e_t$. Define an event $\mathcal{R}$ as
$$\mathcal{R} = \left\{\sum_{t=1}^{N_2^{(0)}} e_t^2 \le \frac{5}{2}\sigma_x^2 N_2^{(0)}\right\} \subseteq \left\{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t}^2 \le N_2^{(0)}\left(2\mu_x^2 + 5\sigma_x^2\right)\right\},$$
where we used $x_{i_2,t}^2 = (\mu_x + e_t)^2 \le 2(\mu_x^2 + e_t^2)$ in the last transformation. By using Lemma 14, we have $\Pr[\mathcal{R}^c] \le 2e^{-9N_2^{(0)}/32} \le 1/4$. Define
$$\mathcal{S} = \left\{\sum_{t=1}^{N_2^{(0)}} x_{i_2,t} = \sum_{t=1}^{N_2^{(0)}}(\mu_x + e_t) \ge N_2^{(0)}\mu_x\right\}.$$
It is easy to confirm that $\Pr[\sum_t(\mu_x + e_t) \ge N_2^{(0)}\mu_x] \ge 1/2$, and thus
$$\Pr[\mathcal{R} \cap \mathcal{S}] \ge 1 - 1/4 - 1/2 = 1/4. \qquad (26)$$
Note that $\mathcal{S}$ implies, via $\sum_t x_{i_2,t}^2 \ge (\sum_t x_{i_2,t})^2/N_2^{(0)}$,
$$g(b) \ge b\theta_2\left(N_2^{(0)}\mu_x^2 + \lambda\right). \qquad (27)$$
Conditioned on $x_{i_2,t}$, we have $x_{i_2,t}\epsilon_t \sim N(0, x_{i_2,t}^2\sigma_\epsilon^2)$. Moreover, by using the property of sums of independent normal random variables,
$$\sum_t x_{i_2,t}\epsilon_t \sim N\left(0, \sum_t x_{i_2,t}^2\sigma_\epsilon^2\right). \qquad (28)$$
Letting
$$L_R = \frac{-g(b) - \sum_t x_{i_2,t}^2\theta_2}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}}, \qquad U_R = \frac{g(b) - \sum_t x_{i_2,t}^2\theta_2}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}}, \qquad M_R = \frac{L_R + U_R}{2} = -\frac{\sqrt{\sum_t x_{i_2,t}^2}\,\theta_2}{\sigma_\epsilon},$$
we have
$$\begin{aligned}
\Pr\left[-g(b) \le \sum_t\left(x_{i_2,t}^2\theta_2 + x_{i_2,t}\epsilon_t\right) \le g(b)\right] &\ge \Pr\left[-g(b) \le \sum_t\left(x_{i_2,t}^2\theta_2 + x_{i_2,t}\epsilon_t\right) \le g(b),\ \mathcal{R}, \mathcal{S}\right] \\
&\ge \Pr\left[-g(b) - \sum_t x_{i_2,t}^2\theta_2 \le \sum_t x_{i_2,t}\epsilon_t \le g(b) - \sum_t x_{i_2,t}^2\theta_2,\ \mathcal{R}, \mathcal{S}\right] \\
&\ge \Pr[\mathcal{R}, \mathcal{S}]\min_{e_n : \mathcal{R}, \mathcal{S}}\left[\int_{L_R}^{U_R}\phi(y)\,dy\right] && \text{(by Eq. (28))} \\
&\ge \frac{1}{4}\min_{e_n : \mathcal{R}, \mathcal{S}}\left[\int_{L_R}^{U_R}\phi(y)\,dy\right]. && \text{(by Eq. (26))} \qquad (29)
\end{aligned}$$
The following bounds Eq. (29). The integral's bandwidth is
$$U_R - L_R = \frac{2g(b)}{\sigma_\epsilon\sqrt{\sum_t x_{i_2,t}^2}} \ge \frac{2g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}} \quad \text{(by event } \mathcal{R}\text{)}.$$
The value of $\phi(y)$ within $[M_R, M_R + 1]$ is at least $\phi(M_R)/e^{1/2} \ge (1/2)\phi(M_R)$. Therefore,
$$\int_{L_R}^{U_R}\phi(y)\,dy \ge \min\left(1, \frac{g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}}\right)\times\frac{\phi(M_R)}{2}. \qquad (30)$$
Moreover,
$$\phi(M_R) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{M_R^2}{2}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta_2^2\sum_t x_{i_2,t}^2}{2\sigma_\epsilon^2}\right) \ge \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\theta_2^2 N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}{2\sigma_\epsilon^2}\right) \quad \text{(by event } \mathcal{R}\text{)}. \qquad (31)$$
By using these, we have
$$\int_{L_R}^{U_R}\phi(y)\,dy \ge \min\left(1, \frac{g(b)}{\sigma_\epsilon\sqrt{N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}}\right)\frac{\phi(M_R)}{2} = \Omega\left(b\sqrt{N_2^{(0)}}\exp\left(-\frac{\theta_2^2 N_2^{(0)}(2\mu_x^2 + 5\sigma_x^2)}{2\sigma_\epsilon^2}\right)\right) \quad \text{(by Eq. (27), (31))}.$$
The exponent does not depend on $b$: given all the model parameters as constants, the probability of $\mathcal{P}$ is $\Omega(b)$, which concludes the proof.

The following Lemma 25 on $\mathcal{Q}$ is about the stability of the mean estimator, which is widely used to prove lower bounds in multi-armed bandit problems. Namely, for any $\Delta > 0$, a wide class of mean estimators $\hat{\theta}$ of $\theta$ satisfies
$$\Pr\left[\bigcap_{n=1}^\infty\left(\hat{\theta}(n) \ge \theta - \Delta\right)\right] \ge C \qquad (32)$$
for some constant $C = C(\theta, \Delta) > 0$. Lemma 25 is a version of Eq. (32) for our ridge estimator.

Lemma 25.
There exists a constant $N_1^{(0)}$ that is independent of $N$ such that, with a warm start of size $N_1^{(0)}$, $\Pr[\mathcal{Q}] \ge C_2$ holds for a constant $C_2 > 0$.

Proof of Lemma 25.
In this proof, we use $t \ge 1$ to index the estimator by the number of samples drawn; for example, $\bar{V}_{g,t} := \bar{V}_g(n)$ for $n$ such that $N_g(n-1) = t$. Note that we consider the $d = 1$ case, and $\bar{V}_{1,t} = \sum_{t'=1}^t x_{1,t'}^2 + \lambda$. By the martingale bound (Eq. (11)), with probability $1 - \delta$,
$$\forall t \ge 1, \quad \left|\hat{\theta}_{1,t} - \theta_1\right|\sqrt{\bar{V}_{1,t}} \le \sigma_\epsilon\sqrt{2\log\left(\frac{\bar{V}_{1,t}^{1/2}\lambda^{-1/2}}{\delta}\right)} + \lambda^{1/2}S. \qquad (33)$$
Let $\delta = 1/4$. It follows from $\sqrt{\log x} \le \sqrt{x}$ for any $x > 0$ that
$$\sqrt{\log\left(\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta\right)} \le \sqrt{\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta}. \qquad (34)$$
Therefore,
$$\left|\hat{\theta}_{1,t} - \theta_1\right| \le \frac{\sigma_\epsilon\sqrt{2\log\left(\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta\right)} + \lambda^{1/2}S}{\sqrt{\bar{V}_{1,t}}} \le \frac{\sigma_\epsilon\sqrt{2\bar{V}_{1,t}^{1/2}\lambda^{-1/2}/\delta} + \lambda^{1/2}S}{\sqrt{\bar{V}_{1,t}}} \quad \text{(by Eq. (33) and (34))},$$
and thus a sufficient condition for $\forall t \ge N_1^{(0)},\ |\hat{\theta}_{1,t} - \theta_1| \le \theta_1/2$ is that the initial sample size $N_1^{(0)}$ makes $\bar{V}_{1, N_1^{(0)}}$ exceed a threshold that depends only on $\theta_1$, $\sigma_\epsilon$, $\lambda$, and $S$.

Note that $\Pr[\bar{V}_{1, N_1^{(0)}} \ge \mu_x^2 N_1^{(0)}] \ge 1/2$. Letting the observation noise $\sigma_\epsilon$ and the regularizer $\lambda$ be constants, a constant-size warm start is enough to ensure this bound with probability $C_2 = (1 - \delta) \times 1/2 = 3/8$.

The following lemma states that, when $\hat{\theta}_2$ is very small, the estimated quality $x_{i_2}\hat{\theta}_2$ of the minority group is likely to be small.

Lemma 26.
There exist constants $C_3, C_4$ that are independent of $N$ such that
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - C_3\exp\left(-C_4/b\right) \qquad (35)$$
holds.

Proof of Lemma 26.
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - \Pr\left[x_{i_2}(n) \ge \frac{1}{2b}\right] \ge 1 - \Phi^c\left(\frac{1}{\sigma_x}\left(\frac{1}{2b} - \mu_x\right)\right) \ge 1 - \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma_x^2}\left(\frac{1}{2b} - \mu_x\right)^2\right) \quad \text{(by Lemma 15)},$$
where we have assumed $(\frac{1}{2b} - \mu_x)/\sigma_x \ge 1$ in the last transformation (which holds for sufficiently small $b$). Eq. (35) then holds for constants $C_3$ and $C_4$ that depend only on $\mu_x$ and $\sigma_x$.

Lemma 27.
$\Pr[\mathcal{Q}'(n) \mid \mathcal{Q}] \ge 1 - (1/2)^{K_1}$.

$\mathcal{Q}'(n)$ states that at least one majority candidate's estimated quality $x_i\hat{\theta}_1$ is not below its mean. Lemma 27 states that the probability that $\mathcal{Q}'(n)$ fails is exponentially small in the number of candidates. The proof of Lemma 27 directly follows from the symmetry of the normal distribution and the independence of the characteristics $x_i$.

Proof of Theorem 3.
By using Lemmas 24–27, we have
$$\Pr[\mathcal{P}] \ge C_1 b, \qquad (36)$$
$$\Pr[\mathcal{Q}] \ge C_2, \qquad (37)$$
$$\Pr[\mathcal{P}'(n) \mid \mathcal{P}] \ge 1 - C_3\exp(-C_4/b), \qquad (38)$$
$$\Pr[\mathcal{Q}'(n) \mid \mathcal{Q}] \ge 1 - (1/2)^{K_1}.$$
From these equations, the probability of perpetual underestimation is bounded as
$$\begin{aligned}
\Pr\left[\bigcap_n\{g(\iota(n)) = 1\}\right] &\ge \Pr\left[\bigcap_n\{\mathcal{P}'(n), \mathcal{Q}'(n)\},\ \mathcal{P}, \mathcal{Q}\right] \\
&\ge \Pr[\mathcal{P}]\Pr[\mathcal{Q}]\Pr\left[\bigcap_n\{\mathcal{P}'(n), \mathcal{Q}'(n)\}\ \middle|\ \mathcal{P}, \mathcal{Q}\right] && \text{(by the independence of } \mathcal{P} \text{ and } \mathcal{Q}\text{)} \\
&\ge C_1 b \times C_2 \times \left(1 - NC_3\exp(-C_4/b)\right) \times \left(1 - N\left(\tfrac{1}{2}\right)^{K_1}\right), && \text{(by the union bound)} \qquad (39)
\end{aligned}$$
which, by letting $b = O(1/\log(N))$ and $K_1 > \log_2(N)$, is $\tilde{\Omega}(1)$.

B.5 Proof of Theorem 4
Proof.
Let $\mathrm{reg}(n) = \mathrm{Reg}(n) - \mathrm{Reg}(n-1)$. Notice that under the UCB decision rule,
$$\iota(n) = \arg\max_{i \in I(n)}\left(x_i'\tilde{\theta}_i(n)\right). \qquad (40)$$
By Lemma 19, with probability at least $1 - \delta$, the true parameter of group $g$ lies in $\mathcal{C}_g$, and thus
$$x_i'\tilde{\theta}_i(n) \ge x_i'\theta_g \qquad (41)$$
for each $i \in I(n)$.

Let $i^* = i^*(n) := \arg\max_{i \in I(n)} x_i'\theta_{g(i)}$ be the first-best worker, and let $g^* = g(i^*)$ be the group $i^*$ belongs to. The regret in round $n$ is bounded as
$$\begin{aligned}
\mathrm{reg}(n) &= x_{i^*}'\theta_{g^*} - x_\iota'\theta_{g(\iota)} \\
&\le x_{i^*}'\tilde{\theta}_{i^*} - x_\iota'\theta_{g(\iota)} && \text{(by Eq. (41))} \\
&\le x_\iota'\tilde{\theta}_\iota - x_\iota'\theta_{g(\iota)} && \text{(by Eq. (40))} \\
&\le \|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\left\|\theta_{g(\iota)} - \tilde{\theta}_\iota\right\|_{\bar{V}_{g(\iota)}} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\|x_\iota\|_{\bar{V}_{g(\iota)}^{-1}}\beta_N. && \text{(by Eq. (17))} \qquad (42)
\end{aligned}$$
The total regret is bounded as
$$\begin{aligned}
\mathrm{Reg}(N) &= \sum_n\mathrm{reg}(n) \le \sqrt{N\sum_n\mathrm{reg}^2(n)} && \text{(by the Cauchy–Schwarz inequality)} \\
&\le 2\beta_N\sqrt{N\sum_n\|x_\iota\|^2_{\bar{V}_{g(\iota)}^{-1}(n)}} \\
&\le 2\beta_N\sqrt{N \cdot 2L^2\sum_{g \in G}\log\left(\frac{\det(\bar{V}_g(N))}{\det(\lambda I_d)}\right)} && \text{(by Lemma 20)} \qquad (43) \\
&\le \tilde{O}\left(\sqrt{N|G|}\right),
\end{aligned}$$
where we have used the fact that $\log(\det(\bar{V}_g)) = O(\log(N)) = \tilde{O}(1)$.

B.6 Proof of Theorem 5
Proof of the first statement
Since $s_i(n) = \tilde q_i(n) - \hat q_i(n)$, we have $\hat q_i(n) + s_i(n) = \tilde q_i(n)$: the firm's subsidized payoff from hiring worker $i$ is exactly the UCB index. Hence, the firm's incentive is aligned with the UCB index, and the myopic firm follows the UCB decision rule, which maximizes the UCB index.

Proof of the second statement
For notational simplicity, we drop $n$, $X_g$, and $Y_g$ in this proof. Define a correspondence $U$ by
\[ U(\tilde q_i; s) := \big\{ u_i \in \mathbb{R} \mid \exists i, \exists x_i \text{ s.t. } \hat q_i(x_i) + s_i(x_i) = u_i,\ \tilde q_i = \tilde q_i(x_i) \big\}. \]
$U(\tilde q_i; s)$ represents the set of the firm's possible payoffs from a worker whose UCB index is $\tilde q_i$. Clearly, the subsidy rule $s$ implements the UCB decision rule $\iota$ if and only if, for all $i$,
\[ \tilde q_i' > \tilde q_i \ \text{implies}\ \min U(\tilde q_i'; s) > \max U(\tilde q_i; s). \tag{44} \]
Since $\min U(\cdot\,; s)$ is an increasing function, it is continuous at all but countably many points. Equivalently, $U(\tilde q_i; s)$ is a singleton for almost all values of $\tilde q_i$.

Now, suppose that $U(\tilde q_i^*; s)$ is not a singleton for some $\tilde q_i^*$. Define $\Delta := \max U(\tilde q_i^*; s) - \min U(\tilde q_i^*; s)$. Define another subsidy rule $s'$ by setting
\[ s_i'(x_i) = \begin{cases} s_i(x_i) & \text{if } \tilde q_i(x_i) < \tilde q_i^*, \\ \min U(\tilde q_i^*; s) - \hat q_i(x_i) & \text{if } \tilde q_i(x_i) = \tilde q_i^*, \\ s_i(x_i) - \Delta & \text{otherwise,} \end{cases} \]
for all $i$. Then, we have
\[ U(\tilde q_i; s') = \begin{cases} U(\tilde q_i; s) & \text{if } \tilde q_i < \tilde q_i^*, \\ \{\min U(\tilde q_i^*; s)\} & \text{if } \tilde q_i = \tilde q_i^*, \\ U(\tilde q_i; s) - \Delta & \text{otherwise,} \end{cases} \]
which implies that $U(\cdot\,; s')$ also satisfies (44); equivalently, $s'$ also implements the UCB rule $\iota$. Furthermore, $s_i'(x_i) \le s_i(x_i)$ for all $x_i$, with a strict inequality for some $x_i$. Accordingly, $s'$ needs a smaller budget than $s$.

By the argument above, whenever $U(\cdot\,; s)$ is not a singleton for some $\tilde q_i$, the subsidy amount can be reduced by filling the gap. From now on, we consider the case in which $U(\cdot\,; s)$ is a singleton for every $\tilde q_i$; i.e., $U$ reduces to a function. We write $u(\tilde q_i; s)$ for the firm's payoff when it hires a worker whose UCB index is $\tilde q_i$ (previously written as $U$ because it could take multiple values). Then, we have $s_i(x_i) = u(\tilde q_i(x_i); s) - \hat q_i(x_i)$ for all $x_i$.
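The alignment above can be illustrated with a minimal numeric sketch (the index values below are hypothetical, not from the paper): under the UCB index subsidy rule $s_i = \tilde q_i - \hat q_i$, the firm's subsidized payoff coincides with the UCB index, so the myopic argmax reproduces the UCB decision rule.

```python
# Illustrative sketch (hypothetical index values): under the UCB index
# subsidy rule s_i = tilde_q_i - hat_q_i, the firm's subsidized payoff
# hat_q_i + s_i equals the UCB index tilde_q_i, so the myopic argmax
# coincides with the UCB decision rule.

hat_q = [0.75, 0.50, 0.25]    # estimated qualities (assumed values)
tilde_q = [1.00, 1.25, 0.50]  # UCB indices, tilde_q_i >= hat_q_i (assumed)

s = [tq - hq for tq, hq in zip(tilde_q, hat_q)]   # per-worker subsidies
payoff = [hq + si for hq, si in zip(hat_q, s)]    # firm's subsidized payoff

assert all(si >= 0 for si in s)   # feasibility: subsidies are non-negative
assert payoff == tilde_q          # payoff is exactly the UCB index

# without the subsidy, the myopic firm would pick worker 0; with it,
# the UCB-optimal worker 1 is hired
assert max(range(3), key=hat_q.__getitem__) == 0
assert max(range(3), key=payoff.__getitem__) == 1
```

The quarter-valued numbers are chosen only so that the arithmetic is exact; any indices with $\tilde q_i \ge \hat q_i$ behave the same way.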
Since we require that $s_i(x_i) \ge 0$ for all $x_i$, we need $u(\tilde q_i(x_i); s) - \hat q_i(x_i) \ge 0$. After some history, $\hat q_i$ may become arbitrarily close to $\tilde q_i$, and the inequality must hold for all such $\tilde q_i$ and $\hat q_i$. Accordingly, $u$ must satisfy
\[ u(q; s) \ge q \tag{45} \]
for all $q$. The UCB index subsidy rule satisfies (45) with equality for all $q$: it sets $s_i = \tilde q_i - \hat q_i$, so that $u(\tilde q_i; s) = \tilde q_i$ for all $\tilde q_i$. Accordingly, it needs the minimum possible budget.

Proof of the third statement
We bound the total amount of subsidy $\mathrm{Sub}(N) := \sum_n x_{\iota(n)}'(\tilde\theta_\iota - \hat\theta_{g(\iota)})$. By Eq. (15), the subsidy paid in round $n$ satisfies
\[ x_{\iota(n)}'(\tilde\theta_\iota - \hat\theta_{g(\iota)}) \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\,\beta_N, \]
which is the same as Eq. (42), and thus the same bound as the regret applies.

B.7 Proofs of Theorems 8 and 9
Proof of Theorem 8.
We adopt a "slot" notation for each group: group $g$ is allocated $K_g$ slots, and at each round $n$, one candidate arrives for each slot. We use the index $i \in [K]$ to denote each slot: although $x_i$ at two different rounds $n, n'$ (i.e., $x_i(n)$ and $x_i(n')$) represents different candidates, they are from the identical group $g = g(i)$. In summary, with a slight abuse of notation, we use the index $i$ to represent the $i$-th slot, and we refer to the candidate in slot $i$ as candidate $i$. Note that this does not change any part of the model; the slot notation is introduced only for the sake of the analysis.

Under the hybrid decision rule, the firm at each round hires the candidate with the largest index, namely $\iota(n) = \arg\max_{i\in I(n)} \tilde q^H_i(n)$, where $\tilde q^H_i$ is defined in Eq. (6). We also denote $\tilde\iota(n) = \arg\max_{i\in I(n)} x_i'\tilde\theta_i$; that is, $\tilde\iota$ indicates the candidate who would have been hired under the standard UCB decision rule (Eq. (5)).

The following decomposes the regret into the estimation errors of $\tilde\iota$ and $\iota$:
\[ \mathrm{reg}(n) = x_{i^*}'\theta_{g^*} - x_\iota'\theta_{g(\iota)} \le x_{i^*}'\tilde\theta_{i^*} - x_\iota'\theta_{g(\iota)} \quad\text{(by Eq. (16))} \]
\[ \le x_{\tilde\iota}'\tilde\theta_{\tilde\iota} - x_\iota'\theta_{g(\iota)} \quad\text{(by definition of $\tilde\iota$)} \]
\[ = x_{\tilde\iota}'\tilde\theta_{\tilde\iota} - x_\iota'\tilde\theta_\iota + x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}) \]
\[ \le x_{\tilde\iota}'(\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}) + x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}). \quad\text{(by definition of $\iota$)} \tag{46} \]
Here,
\[ x_{\tilde\iota}'(\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}) \le \|x_{\tilde\iota}\|_{\bar V_{g(\tilde\iota)}^{-1}} \big\|\tilde\theta_{\tilde\iota} - \hat\theta_{g(\tilde\iota)}\big\|_{\bar V_{g(\tilde\iota)}} \quad\text{(by the Cauchy–Schwarz inequality)} \]
\[ \le \|x_{\tilde\iota}\|_{\bar V_{g(\tilde\iota)}^{-1}} \beta_N \quad\text{(by Eq. (15))} \le \frac{\|x_{\tilde\iota}\|}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\,\beta_N \quad\text{(by the definition of eigenvalues)} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\,\beta_N. \quad\text{(by Eq. (13))} \tag{47} \]
Moreover, the estimation error of candidate $\iota$ is bounded as
\[ x_\iota'(\tilde\theta_\iota - \theta_{g(\iota)}) \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\big\|\tilde\theta_\iota - \theta_{g(\iota)}\big\|_{\bar V_{g(\iota)}} \quad\text{(by the Cauchy–Schwarz inequality)} \]
\[ \le \|x_\iota\|_{\bar V_{g(\iota)}^{-1}}\,\beta_N \quad\text{(by Eq. (17))} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_{g(\iota)})}}\,\beta_N. \quad\text{(by Eq. (13))} \tag{48} \]
Based on the above bounds, the regret is bounded as follows:
\[ \mathrm{Reg}(N) = \sum_{n=1}^N \mathrm{reg}(n) \le \sum_{n=1}^N \left(\frac{1}{\sqrt{\lambda_{\min}(\bar V_{g(\iota)})}} + \frac{1}{\sqrt{\lambda_{\min}(\bar V_{g(\tilde\iota)})}}\right) L\beta_N \]
\[ \le L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} + L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}. \tag{49} \]
Eq. (49) consists of two components. The first component is the estimation error of the hired candidate $\iota$. The second component is the estimation error of $\tilde\iota$, who would have been hired had we adopted the UCB decision rule. The hybrid decision rule $\iota$ can differ from the UCB decision rule $\tilde\iota$, which is the main challenge in deriving a regret bound for the hybrid decision rule.

We first define the following events:
\[ \mathcal{V}_i(n) := \big\{ x_i(n)'(\tilde\theta_i(n) - \hat\theta_{g(i)}(n)) \le a\sigma_x \|\hat\theta(n)\| \big\}, \qquad \mathcal{W}_i(n) := \{\tilde\iota(n) = i\}, \]
\[ \mathcal{X}_i(n) := \{\iota(n) = i\}, \qquad \mathcal{X}'_i(n) := \Big\{ x_i(n)'\hat\theta(n) \ge \max_{j\ne i} \tilde q^H_j \Big\} \subseteq \mathcal{X}_i. \]
Event $\mathcal{V}_i$ states that candidate $i$ is not subsidized. Event $\mathcal{W}_i$ states that $i$ would have been hired under the UCB decision rule. Event $\mathcal{X}_i$ states that $i$ is hired, and $\mathcal{X}'_i$ states that $i$ is hired regardless of the subsidy.

The following lemma is the crux of bounding the components in Eq. (49).

Lemma 28 (Proportionality). The following two inequalities hold:
\[ \Pr[\mathcal{X}'_i] \ge \exp(-a^2/2)\,\Pr[\mathcal{W}_i], \tag{50} \]
\[ \Pr[\mathcal{X}'_i] \ge \exp(-a^2/2)\,\Pr[\mathcal{X}_i]. \tag{51} \]

Proof of Lemma 28.
We first prove that, for any $c \in \mathbb{R}$ and $d > 0$,
\[ \Pr\big[ x_i'\hat\theta_{g(i)} \ge c \big] \ge \exp(-d^2/2)\,\Pr\big[ x_i'\hat\theta_{g(i)} \ge c - d\big(\sigma_x\|\hat\theta_{g(i)}\|\big) \big]. \tag{52} \]
Let $x_\parallel := (x_i'\hat\theta_{g(i)})/\|\hat\theta_{g(i)}\|$ be the projection of $x_i$ onto the direction of $\hat\theta_{g(i)}$. Then, $x_i'\hat\theta_{g(i)} = x_\parallel\|\hat\theta_{g(i)}\|$. By the symmetry of the normal distribution, $x_\parallel\|\hat\theta_{g(i)}\|$ is drawn from a normal distribution with variance $(\sigma_x\|\hat\theta_{g(i)}\|)^2$, from which Eq. (52) follows.

Eq. (50) follows by letting $c = \max_{j\ne i}\tilde q^H_j$ and $d = a$, because
\[ \mathcal{W}_i \subseteq \big\{ x_i'\hat\theta_{g(i)} \ge c - d\big(\sigma_x\|\hat\theta_{g(i)}\|\big)\big\}, \qquad \mathcal{X}'_i \supseteq \big\{ x_i'\hat\theta_{g(i)} \ge c \big\}. \]
Eq. (51) also follows by letting $c = \max_{j\ne i}\tilde q^H_j$ and $d = a$, because
\[ \mathcal{X}_i \subseteq \big\{ x_i'\tilde\theta_i \ge c \big\}, \qquad \mathcal{X}'_i \supseteq \big\{ x_i'\tilde\theta_i \ge c + d\big(\sigma_x\|\hat\theta_{g(i)}\|\big)\big\}, \]
and exactly the same argument as for Eq. (52) applies to
\[ \Pr\big[ x_i'\tilde\theta_i \ge c + d\big(\sigma_x\|\hat\theta_{g(i)}\|\big) \big] \ge \exp(-d^2/2)\,\Pr\big[ x_i'\tilde\theta_i \ge c \big]. \tag{53} \]
(Note that $x_i'\hat\theta_{g(i)}$ in Eq. (52) is replaced by $x_i'\tilde\theta_i$ in Eq. (53), which does not change the derivation at all.)

Lemma 28 can be understood intuitively as follows. Assume that candidate $i$ would have been hired under the UCB rule. The candidate may not be hired under the hybrid rule, because the hybrid rule can cut the subsidy for that candidate. However, with constant probability, a slightly better ("$a\sigma_x$-good") candidate appears in slot $i$, and such a candidate is hired under the hybrid rule.

The following two lemmas, which utilize Lemma 28, bound the two terms of Eq. (49).

Lemma 29. $\displaystyle \mathbb{E}\Bigg[\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}\Bigg] \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1)$.

Lemma 30. $\displaystyle \mathbb{E}\Bigg[\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}\Bigg] \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1)$.

With Lemmas 29 and 30, the regret is bounded as
\[ \mathrm{Reg}(N) \le L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} + L\beta_N \sum_{i\in[K]}\sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \quad\text{(by Eq. (49))} \]
\[ \le 2L\beta_N K\,\frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + \tilde O(1), \quad\text{(by Lemmas 29 and 30)} \tag{54} \]
which completes the proof of Theorem 8.

Proof of Lemma 29.
Let $N_i(n)$ be the number of rounds before $n$ in which the worker in slot $i$ is selected. Let $\tau_t$ be the first round at which $N_i(n)$ reaches $t$, and let $N_{i,t} = \sum_{n\le\tau_t} \mathbb{1}[\mathcal{X}'_i(n)]$. Lemma 28 implies $\mathbb{E}[N_{i,t}] \ge e^{-a^2/2} t$, and applying the Hoeffding inequality to the binary random variables $(\mathbb{1}[\mathcal{X}'_i(\tau_1)], \mathbb{1}[\mathcal{X}'_i(\tau_2)], \dots, \mathbb{1}[\mathcal{X}'_i(\tau_t)])$ yields
\[ \Pr\Big[ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big] \le \frac{1}{N^2}. \tag{55} \]
By using this, we have
\[ \Pr\Big[\bigcup_{t=1}^N \Big\{ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big\}\Big] \le \sum_t \Pr\Big[ N_{i,t} < e^{-a^2/2} t - \sqrt{(\log N)\,t} \Big] \quad\text{(by the union bound)} \]
\[ \le \sum_t \frac{1}{N^2} \quad\text{(by Eq. (55))} \le \frac{1}{N}. \]
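The deviation choice in Eq. (55) can be checked numerically: with $\varepsilon = \sqrt{t\log N}$, Hoeffding's bound $\exp(-2\varepsilon^2/t)$ equals $N^{-2}$ for every $t$, so the union bound over $t \le N$ costs at most $1/N$ in total. A small sketch:

```python
# Numeric check of the deviation choice behind Eq. (55): with
# epsilon = sqrt(t * log N), Hoeffding's tail bound exp(-2 * epsilon^2 / t)
# equals N^(-2) for every horizon t, so summing over t = 1..N costs at
# most 1/N in total.
import math

N = 1000
for t in (10, 100, 1000):
    eps = math.sqrt(t * math.log(N))
    bound = math.exp(-2 * eps ** 2 / t)   # Hoeffding tail probability bound
    assert abs(bound - N ** -2) < 1e-12   # = 1/N^2, independent of t

total = N * N ** -2                        # union bound over t = 1..N
assert abs(total - 1 / N) < 1e-12
```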
In the following, we focus on the case
\[ N_{i,t} \ge e^{-a^2/2}t - \sqrt{(\log N)\,t}, \tag{56} \]
which occurs with probability at least $1 - 1/N$.

Let $\bar V_i(n) := \sum_{n'\le n} \mathbb{1}[\iota(n')=i]\,x_i x_i' \preceq \bar V_{g(i)}(n)$. The context $x_i$ conditioned on event $\mathcal{X}'_i$ satisfies the assumptions of Lemma 18 with $\hat\theta = \tilde\theta_i$ and $\hat b = \max_{j\ne i}\tilde q^H_j$. We have
\[ \sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \le \sum_{n=1}^N \frac{\mathbb{1}[\iota=i]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $\bar V_{g(i)} \succeq \bar V_i$)} \]
\[ \le \sum_{n=1}^N\sum_{t=1}^N \frac{\mathbb{1}[\iota=i,\,N_i(n)=t]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $N_i(N)\le N$)} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}}. \quad\text{($\{\iota=i,\,N_i(n)=t\}$ occurs at most once)} \]
In other words, lower-bounding $\lambda_{\min}(\bar V_i(\tau_t))$ suffices for the regret bound, which we demonstrate in the following. We have
\[ \mathbb{E}\big[\lambda_{\min}(\bar V_i(\tau_t))\big] \ge \lambda_{\min}\Big(\sum_n \mathbb{E}\big[\mathbb{1}[\mathcal{X}'_i(n)]\,x_i x_i'\big]\Big) \ge \lambda_0 N_{i,t}. \quad\text{(by Lemma 18)} \]
By using the matrix Azuma inequality (Lemma 21), with probability at least $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 N_{i,t} - \sqrt{N_{i,t}\,\sigma_A^2\log(dN)}, \tag{57} \]
where $\sigma_A = 2L^2$. By using Eqs. (56) and (57), we have
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 e^{-a^2/2}t - O(\sqrt t), \]
and thus
\[ \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1). \]

Proof of Lemma 30.
Let $N^{\mathcal{W}}_i(n) = \sum_{n'\le n}\mathbb{1}[\mathcal{W}_i(n')]$, let $\tau_t$ be the first round at which $N^{\mathcal{W}}_i(n)$ reaches $t$, and let $N_{i,t} = \sum_{n\le\tau_t}\mathbb{1}[\mathcal{X}'_i(n)]$. The following discussion closely parallels the proof of Lemma 29; we include it for completeness. We have
\[ \Pr\Big[\bigcup_{t=1}^N \Big\{ N_{i,t} < e^{-a^2/2}t - \sqrt{(\log N)\,t} \Big\}\Big] \le \sum_t \Pr\Big[ N_{i,t} < e^{-a^2/2}t - \sqrt{(\log N)\,t}\Big] \quad\text{(by the union bound)} \]
\[ \le \sum_t \frac{1}{N^2} \quad\text{(by Lemma 28 and the Hoeffding inequality)} \le \frac{1}{N}. \]
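The linear eigenvalue growth used in Eqs. (57) and (59) can also be sanity-checked by simulation. The sketch below assumes an isotropic Gaussian context distribution purely for illustration (the selection-biased conditional distribution in the proof is what Lemma 18 actually handles):

```python
# Monte Carlo sketch of the design-matrix eigenvalue growth behind
# Eqs. (57) and (59): for isotropic Gaussian contexts (an illustrative
# assumption, not the paper's selection-biased distribution), E[x x'] = I,
# so lambda_min of V_t = sum_{s <= t} x_s x_s' grows linearly in t,
# up to a deviation on the sqrt(t) scale.
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 2000
X = rng.standard_normal((t, d))
V = X.T @ X                                   # empirical design matrix V_t
lam_min = float(np.linalg.eigvalsh(V).min())  # smallest eigenvalue

# linear-in-t growth with an O(sqrt(t)) fluctuation, as the matrix Azuma
# inequality guarantees with high probability
assert 0.8 * t < lam_min < 1.2 * t
```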
In the following, we focus on the case
\[ N_{i,t} \ge e^{-a^2/2}t - \sqrt{(\log N)\,t}, \tag{58} \]
which occurs with probability at least $1 - 1/N$. We have
\[ \sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \le \sum_{n=1}^N \frac{\mathbb{1}[\tilde\iota=i]}{\sqrt{\lambda_{\min}(\bar V_i)}} \quad\text{(by $\bar V_{g(i)}\succeq\bar V_i$)} \]
\[ \le \sum_{n=1}^N\sum_{t=1}^N \frac{\mathbb{1}[\tilde\iota=i,\,N^{\mathcal{W}}_i(n)=t]}{\sqrt{\lambda_{\min}(\bar V_i)}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}}. \quad\text{($\{\tilde\iota=i\}$ increments $N^{\mathcal{W}}_i$)} \]
The following lower-bounds $\lambda_{\min}(\bar V_i(\tau_t))$. We have
\[ \mathbb{E}\big[\lambda_{\min}(\bar V_i(\tau_t))\big] \ge \lambda_{\min}\Big(\sum_n \mathbb{E}\big[\mathbb{1}[\mathcal{X}'_i(n)]\,x_i x_i'\big]\Big) \ge \lambda_0 N_{i,t}. \quad\text{(by Lemma 18)} \]
By using the matrix Azuma inequality (Lemma 21), with probability at least $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 N_{i,t} - \sqrt{N_{i,t}\,\sigma_A^2\log(dN)}, \tag{59} \]
where $\sigma_A = 2L^2$. By using Eqs. (58) and (59), we have $\lambda_{\min}(\bar V_i(\tau_t)) \ge \lambda_0 e^{-a^2/2}t - O(\sqrt t)$, and thus
\[ \sum_{t=1}^N \frac{1}{\sqrt{\lambda_{\min}(\bar V_i(\tau_t))}} \le \sum_{t=1}^N \frac{1}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \le \frac{e^{a^2/2}}{\lambda_0}\sqrt{N} + O(1). \]

Proof of Theorem 9.
We here bound the amount of the subsidy. Eqs. (47) and (48) imply
\[ x_i'\big(\tilde\theta_i - \hat\theta_{g(i)}\big) \le \frac{L\beta_N}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}, \qquad \big| x_i'\big(\hat\theta_{g(i)} - \theta_{g(i)}\big) \big| \le \frac{L\beta_N}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}}, \]
and thus the subsidy $s^{\text{H-I}}_i(n) = 0$ holds once
\[ \lambda_{\min}(\bar V_{g(i)}) \ge \Big(\frac{L\beta_N}{\|\theta\|}\Big)^2 \max\Big(1, \frac{1}{a^2\sigma_x^2}\Big) =: C_s = \tilde O(1). \tag{60} \]
Hence, it follows that
\[ \mathrm{Sub}(N) = \sum_n s^{\text{H-I}}_{\iota}(n) \le \sum_n\sum_i \mathbb{1}[\mathcal{X}_i]\,s^{\text{H-I}}_i(n) \le L\beta_N \sum_i\sum_n \frac{\mathbb{1}\big[\lambda_{\min}(\bar V_{g(i)}) \le C_s\big]}{\sqrt{\lambda_{\min}(\bar V_{g(i)})}} \quad\text{(by Eq. (60))} \]
\[ \le L\beta_N \sum_i\sum_t \frac{\mathbb{1}\big[\lambda_0 e^{-a^2/2}t - O(\sqrt t) \le C_s\big]}{\sqrt{\lambda_0 e^{-a^2/2}t - O(\sqrt t)}} \quad\text{(by the same argument as in Lemma 29)} \]
\[ \le L\beta_N K \sum_t \frac{\mathbb{1}\big[\lambda_0 e^{-a^2/2}t \le C_s\big]}{\sqrt{\lambda_0 e^{-a^2/2}t}} + \tilde O(1) \le L\beta_N K\,\frac{e^{a^2/2}}{\lambda_0}\sqrt{\frac{C_s e^{a^2/2}}{\lambda_0}} + \tilde O(1) = \tilde O(1). \tag{61} \]
Note that $C_s$ diverges as $a \to +0$; the bound of Theorem 9 is meaningful for $a > 0$. If $a = 0$, the hybrid mechanism reduces to the UCB mechanism, and thus Theorem 5 for UCB applies.

B.8 Proof of Theorem 10
We modify the proof of Theorem 3; accordingly, we use the same notation as that proof unless explicitly mentioned otherwise. We define
\[ \mathcal{Q}''(n) = \Big\{ \exists i_A, i_B \text{ s.t. } g(i_A) = g(i_B) = 1,\ i_A \ne i_B,\ \text{and } x_i'\hat\theta_1(n) \ge \mu_x'\theta_1 \text{ for } i = i_A, i_B \Big\}. \]
When the event $\mathcal{Q}''(n)$ occurs, there are two majority workers whose predicted skill $\hat q_i(n)$ is larger than its mean.

Lemma 31.
\[ \Pr[\mathcal{Q}''(n) \mid \mathcal{Q}] \ge 1 - (K+1)\Big(\frac{1}{2}\Big)^{K}. \tag{62} \]
The complement of event $\mathcal{Q}''(n)$ states that the second-order statistic of $\{\hat q_i\}_{i: g(i)=1}$ is below the mean; Lemma 31 states that this event is exponentially unlikely in $K$. By the symmetry of the normal distribution and the independence of each characteristic $x_i$, each candidate falls below the mean with probability $1/2$, and the proof of Lemma 31 directly follows by counting the combinations in which at most one of the workers is above the mean.

When we have $\mathcal{P}'(n)$ and $\mathcal{Q}''(n)$ for all $n$, then in every round $n$ the top-2 workers in terms of predicted quality $\hat q_i(n)$ are from the majority group. In this case, the minority worker is not hired regardless of the additional signal $\eta_i$. Accordingly, this is a sufficient condition for perpetual underestimation.

Proof of Theorem 10.
By using Lemmas 24, 25, 26, and 31, we have (36), (37), (38), and (62). From these inequalities, the probability of perpetual underestimation is bounded as
\[ \Pr\Big[\bigcap_n \{\iota(n)=1\}\Big] \ge \Pr\Big[\bigcap_n \{\mathcal{P}'(n), \mathcal{Q}''(n)\},\ \mathcal{P}, \mathcal{Q}\Big] \ge \Pr[\mathcal{P}]\,\Pr[\mathcal{Q}]\,\Pr\Big[\bigcap_n \{\mathcal{P}'(n), \mathcal{Q}''(n)\}\,\Big|\,\mathcal{P},\mathcal{Q}\Big] \quad\text{(by the independence of $\mathcal{P}$ and $\mathcal{Q}$)} \]
\[ \ge C_1 b \times C_2 \times \big(1 - NC_3\exp(-C_4/b^2)\big) \times \Big(1 - N(K+1)\big(\tfrac12\big)^{K}\Big), \quad\text{(by the union bound)} \]
which, by letting $b = O(1/\log(N))$ and $K \ge \log_2(K+1) + \log_2(N^2)$, is $\tilde\Omega(1)$.

B.9 Proof of Theorem 11
Proof of Theorem 11.
We have
\[ \big|x_i'(\hat\theta_g - \theta_g)\big| \le \|x_i\|_{\bar V_g^{-1}}\big\|\hat\theta_g - \theta_g\big\|_{\bar V_g} \le \frac{L}{\sqrt{\lambda_{\min}(\bar V_g)}}\,\beta_n \quad\text{(by Eqs. (13) and (14))} \le \frac{L}{\sqrt{\lambda}}\,\beta_N \quad\text{(by $\bar V_g \succeq \lambda I_d$)} =: C_5 = \tilde O(1). \tag{63} \]
Let $i_0$ and $i_1$ be the finalists chosen from groups 0 and 1, respectively. Define the following event:
\[ \mathcal{J}(n) = \{ \eta_{i_0}(n) - \eta_{i_1}(n) > C_5 \}. \]
Under $\mathcal{J}$, the finalist of group 0 is chosen, because Eq. (63) implies that $|x_{i_0}'\hat\theta_{g_0} - x_{i_1}'\hat\theta_{g_1}| \le C_5$, and thus $x_{i_0}'\hat\theta_{g_0} + \eta_{i_0} - x_{i_1}'\hat\theta_{g_1} - \eta_{i_1} > 0$. Note that $\eta_{i_0} - \eta_{i_1}$ is drawn from $\mathcal{N}(0, 2\sigma_\eta^2)$. Let $C_6 = \Phi^c\big(C_5/(\sqrt2\,\sigma_\eta)\big)$. Then,
\[ \Pr[\mathcal{J}(n)] = C_6. \tag{64} \]
Let $N_\mathcal{J} = \sum_{n'=1}^{n-1}\mathbb{1}[g(\iota)=0,\,\mathcal{J}] \le N_0(n)$ be the number of hires from group 0 under event $\mathcal{J}$. By using the Hoeffding inequality, with probability $1 - 1/N^2$ we have
\[ N_\mathcal{J} \ge nC_6 - \sqrt{n\log(N)}. \tag{65} \]
By taking the union bound, Eq. (65) holds for all $n$ with probability $1 - \sum_n 1/N^2 \ge 1 - 1/N$. From now on, we evaluate $\lambda_{\min}(\bar V_0(n))$. It is easy to see that
\[ \bar V_0 := \sum_{n'\le n:\, g(\iota(n'))=0} x_{\iota}x_{\iota}' + \lambda I \succeq \sum_{n'\le n:\, g(\iota(n'))=0} x_{\iota}x_{\iota}' \succeq \sum_{n'\le n:\, \mathcal{J}} x_{\iota}x_{\iota}'. \]
In the following, we lower-bound the quantity
\[ \lambda_{\min}\big(\mathbb{E}[x_i x_i' \mid \mathcal{J}]\big) \ge \min_{v:\, \|v\|=1} \mathrm{Var}[v'x_i \mid \mathcal{J}]. \]
Note that $i_0 = \arg\max_{i:\, g(i)=0} x_i'\hat\theta_0$ is biased toward the direction of $\hat\theta_0$, and we cannot use the diversity condition (Lemma 18). Let $v_\parallel$ and $v_\perp$ be the components of $v$ that are parallel and perpendicular to $\hat\theta_0$, so that $\|v_\parallel\|^2 + \|v_\perp\|^2 = 1$. It is easy to confirm that $\mathrm{Var}[v_\perp' x_i] = \|v_\perp\|^2\sigma_x^2$, because the selection $\arg\max_i x_i'\hat\theta_0$ does not yield any bias in the perpendicular direction. Regarding $v_\parallel$, Lemma 22 characterizes how much smaller than the original variance the parallel component can be due to the biased selection. Namely,
\[ \min_{v:\,\|v\|=1}\mathrm{Var}[v'x_i \mid \mathcal{J}] \ge \sigma_x^2\left(\frac{C_{\mathrm{varmax}}}{\log(K)}\|v_\parallel\|^2 + \|v_\perp\|^2\right) \ge \frac{\sigma_x^2 C_{\mathrm{varmax}}}{\log(K)}. \]
By using the matrix Azuma inequality (Lemma 21) with $\sigma_A = 2L^2$, for $t = \sqrt{N_\mathcal{J}\,\sigma_A^2\log(dN)}$, with probability $1 - 1/N^2$,
\[ \lambda_{\min}(\bar V_0) \ge \frac{\sigma_x^2 C_{\mathrm{varmax}}}{\log(K)}\,N_\mathcal{J} - t. \tag{66} \]
Combining Eqs. (65) and (66), with probability at least $1 - 2/N$, we have
\[ \lambda_{\min}(\bar V_0(n)) \ge \frac{\sigma_x^2 C_p}{\log(K)}\,n - \tilde O(\sqrt n), \tag{67} \]
where $C_p = C_6 C_{\mathrm{varmax}} = \tilde O(1)$. By symmetry, exactly the same bound as Eq. (67) holds for group 1. Finally, by using similar transformations to Eq. (24), the regret is bounded as
\[ \mathbb{E}[\mathrm{Reg}(N)] \le \sum_{n=1}^N \max_{i\in[K]} \big|x_i(n)'(\hat\theta_g - \theta_g)\big| \le \sum_{n=1}^N \frac{L}{\sqrt{\lambda_{\min}(\bar V_g)}}\,\beta_N \quad\text{(by Eqs. (13), (14))} \]
\[ \le L\beta_N \sum_{n=1}^N \sqrt{\frac{\log(K)}{\sigma_x^2 C_p\,n - \tilde O(\sqrt n)}} \quad\text{(by Eq. (67))} \le L\beta_N\sqrt{\frac{N\log(K)}{\sigma_x^2 C_p}} + \tilde O(1) = \tilde O(\sqrt N), \tag{68} \]
which concludes the proof.