On Meritocracy in Optimal Set Selection

Abstract

We consider the problem of selecting a set of individuals from a candidate population in order to maximise utility. When the utility function is defined over sets, this raises the question of how to define meritocracy. We define and analyse an appropriate notion of meritocracy derived from the utility function. We introduce the notion of expected marginal contributions of individuals and analyse its links to the underlying optimisation problem, our notion of meritocracy, and other notions of fairness such as the Shapley value. We also experimentally analyse the effect of different policy structures on the utility and meritocracy in a simulated college admission setting including constraints on statistical parity.


Fair Set Selection: Meritocracy and Social Welfare ∗

Thomas Kleine Büning †, Meirav Segal †, Debabrota Basu ‡, Christos Dimitrakakis †

Abstract

In this paper, we formulate the problem of selecting a set of individuals from a candidate population as a utility maximisation problem. From the decision maker’s perspective, it is equivalent to finding a selection policy that maximises expected utility. Our framework leads to the notion of expected marginal contribution (EMC) of an individual with respect to a selection policy as a measure of deviation from meritocracy. In order to solve the maximisation problem, we propose to use a policy gradient algorithm. For certain policy structures, the policy gradients are proportional to EMCs of individuals. Consequently, the policy gradient algorithm leads to a locally optimal solution that has zero EMC, and satisfies meritocracy. For uniform policies, EMC reduces to the Shapley value. EMC also generalises the fair selection properties of Shapley value for general selection policies. We experimentally analyse the effect of different policy structures in a simulated college admission setting and compare with ranking and greedy algorithms. Our results verify that separable linear policies achieve high utility while minimising EMCs. We also show that we can design utility functions that successfully promote notions of group fairness, such as diversity.

Machine learning is now pervasively used to assist high-stake decision making, with significant consequences for individuals and organisations. We are especially interested in the case where the decision maker (DM) must select a set of individuals from a population so as to maximise expected utility. This case is especially interesting as maximising set utility can induce diversity without additional constraints for certain domains. For instance, according to Díaz-García et al. [2013], if we wish to maximise innovative ideas, we would prefer teams with gender diversity. We can thus assume that expected utility represents an appropriate notion of social welfare. In addition to maximising utility, the DM may also like their selection policy to be meritocratic. The first question that arises is how to define meritocracy for general utility functions. Here we suggest using a notion of expected individual contribution to social welfare to achieve meritocracy.

In particular, we consider the case where the DM maximises a utility function representing social welfare. This means that a policy maximising utility would be fair in terms of its aggregate benefit to society. However, this may conflict with notions of fairness relating to individuals or groups, and in particular egalitarianism and meritocracy.

Egalitarianism is a notion of fairness where individuals’ chances of obtaining a reward are independent of personal characteristics. In the simplest case, we wish our decisions to be independent of any personal characteristics, which is achievable through a uniformly random lottery. More typically, we wish our decisions or the outcomes of our decisions to be independent of sensitive characteristics such as gender or ethnicity.

Meritocracy suggests that individuals with the highest “worth” obtain higher rewards. As long as there is an inherent, static measure of worth this is a well-defined notion. However, in many settings, including ours, worth is relative and dependent on circumstances. In particular, the relative contribution of an individual to utility depends on who else has been selected (and more generally the DM’s policy) so it is a constantly shifting quantity.

∗ Submission to the archival-option track of FORC 2021.
† University of Oslo
‡ Scool, Inria Lille-Nord Europe

Contribution.

We define a notion of meritocracy for set selection problems and arbitrary utility functions relating to individual expected marginal contributions (EMC). This is analogous to marginal utilities for a basket of goods in economics, which must necessarily depend on both the population under consideration and the DM’s policy for selection. This is in contrast to works on set selection based on ranking, which implicitly define meritocracy through a static worth for every individual. In particular, when the DM’s policy is egalitarian, the EMC is identical to the Shapley value [Shapley, 1951]. A natural way to shift the policy towards meritocracy would then be to increase the probability of selecting individuals proportionally to their EMC. If we apply this algorithmic idea to any initial DM policy, we show that we arrive at a policy gradient algorithm for policies that are separably parameterised over the population. For these, maximising utility also achieves meritocracy, as the gradient is a linear transform of the EMC. For other policies, e.g. those relying on a threshold, this does not hold and meritocratic outcomes cannot be guaranteed when maximising utility. Then, we can measure the deviation from meritocracy in terms of the residual EMC.
Experimentally, we analyse different policy structures in a college admission setting and illustrate the relation of meritocracy with respect to EMCs to policy structures and utility maximisation. We experimentally instantiate the policy gradient framework on linear and log-linear utility, and separable linear and threshold policies. We measure the expected utility and diversity of the solution to quantify the solution quality. We show that separable policies achieve expected utility, residual EMC, and diversity index that are close to optimal, whereas policies based on thresholding and ranking are significantly sub-optimal.

Ranking with pre-determined measures of “worth”.

Most work on fairness in set selection has focused on policies that rank individuals according to some pre-determined criterion [Kearns et al., 2017, Zehlike et al., 2017, Celis et al., 2017, Biega et al., 2018]. This approach satisfies meritocracy, since “better” individuals should be ranked higher, and higher ranked individuals are preferred over lower ranked ones. In particular, Kearns et al. [2017] consider a probabilistic ordering, generalising Dwork et al.’s notion of similar treatment to selection over multiple groups. More precisely, a person $i$ in group $j$ is preferred to a person $i'$ in group $j'$ only if their relative percentile ranking is higher. They extend this basic definition to different amounts of information available to the decision maker, ranging from ex ante to ex post fairness. In a similar vein, Singh and Joachims [2019] propose a fair ranking approach for Plackett-Luce ranking models. We instead approach this problem from a utility maximisation perspective, where meritocracy is defined as rewarding individuals according to their contribution to social welfare. This is in contrast to ranking methods such as the above, which implicitly assume a fundamental worth for individuals. In our setting, the contribution of each individual to the utility instead depends on who else is selected.

Several other papers assume the existence of latent individual worth with respect to which meritocracy can be measured. Kleinberg and Raghavan [2018] analyse a stylised parametric model of individual potential and Celis et al. [2020] consider interventions for ranking, where each individual has a latent utility they would generate if hired. Emelianov et al. [2020] also examine latent worth with variance depending on group (e.g. gender) membership.

Fair selection as utility maximisation.

Our work is not the only one in the fairness literature seeking to select sets of individuals in a way that maximises utility. Recently, Kusner et al. [2018] considered the setting where the utility is linear and the performance of individuals in the selected set depends on who else is selected. They investigate how these interactions affect group fairness. While this is also possible in our setting if we use an appropriate model, we instead focus on the problem of defining meritocracy when there is no unique ranking of individuals. This is generally the case when the DM’s utility is non-linear. While the problem of fair package assignment does consider non-linear utilities and targets efficiency and envy-freeness [Lahaie and Parkes, 2009], assignment problems do not readily transfer to the framework discussed in the paper.

Stoyanovich et al. [2018] consider diversity in terms of recommendations as a constraint in decision making. These methods rely on hard-coded thresholds on diversity, which are hardly available in reality and are too application-specific. Recently, Kilbertus et al. [2020] prove an impossibility result showing that decision making with deterministic thresholds (ref. Proposition 2) may not improve either fairness or utility, but learning a stochastic fair policy can circumvent this. This result motivates us to consider stochastic policies that optimise a welfare-based utility.

Fair set selection.

Fair set selection problems can also naturally be found in social choice. For example, in participatory budgeting, we aim to select a set of projects all of which have different costs, while respecting a general budget constraint [Aziz and Shah, 2020]. In elections or tournaments, we want to identify a set of winners [Moulin, 2016]. More specifically, in committee voting, we elect a committee of a given size, i.e. a subset of all candidates [Elkind et al., 2017, Lackner and Skowron, 2020]. All of these set selection problems typically consider a set of candidates that have no features per se. It is assumed that there is a set of voters expressing preferences over the candidates. We can interpret those voters’ preferences as features in analogy to our framework. Fair set selection is obtained through mechanisms that guarantee fairness axioms such as anonymity and justified representation. These axioms are often binary, i.e. are either satisfied or not. They also often deal with fairness with respect to the voters, while we concentrate on fairness towards the individual candidates.

We formulate the problem of selecting a set of individuals from a population from a general decision theoretic perspective, where the Decision Maker (DM) aims to select a subset of individuals maximising expected utility. Crucially, the utility function may be non-linear and depend on the set of selected individuals as a whole. Thus, the utility may not be decomposable into terms relating to each individual.

Specifically, we consider a population of $N$ candidates $\mathcal{N}$. The DM observes the features of the population $x \in \mathcal{X}$, where $x \triangleq (x_1, \ldots, x_N)$, and then makes a decision $a \in \mathcal{A}$ about the population using a (stochastic) policy $\pi(a \mid x)$. We focus on the case where $\mathcal{A} = \{0, 1\}^N$ and interpret a decision $a = (a_1, \ldots, a_N)$ such that $a_i = 1$ stands for selecting the $i$-th individual and $a_i = 0$ stands for rejecting individual $i$. After the DM makes a choice, she observes an outcome $y \in \mathcal{Y}$. Typically, $\mathcal{Y}$ is a product space encoding outcomes for every individual in the population.

The utility of the DM is a function over subsets and outcomes $u : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$. The utility function $u(a, y)$ represents social welfare for every possible combination of selected individuals and outcomes. Since we only have access to the observed features $x$, the expected utility of an action $a$ can be calculated by marginalising over outcomes $y$:

$$U(a, x) \triangleq \mathbb{E}[u \mid a, x] = \sum_{y \in \mathcal{Y}} \mathbb{P}(y \mid a, x)\, u(a, y). \tag{1}$$

Here, $\mathbb{P}(y \mid a, x)$ is a predictive model used by the DM to estimate the outcomes. For simplicity, in this paper we use a point estimate of the true conditional distribution. The expected utility of a policy $\pi$ given population $x$ is then

$$U(\pi, x) \triangleq \mathbb{E}_\pi[u \mid x] = \sum_{a \in \mathcal{A}} \pi(a \mid x)\, U(a, x). \tag{2}$$

While $U(a, x)$ naturally induces a ranking over different subsets for a given population $x$, it does not necessarily provide a ranking over individuals as each individual’s utility may depend on the group selected alongside it.

The goal of the DM is to find a parameterised policy in the policy space $\Pi = \{\pi_\theta \mid \theta \in \Theta\}$ maximising expected utility. We distinguish two cases: in the first, the DM chooses a policy after observing the population, and in the second the policy is fixed before the population is seen.

Population-dependent policies.

For a given population $x$, the goal is to maximise the expected utility:

$$\theta^*(x) = \arg\max_{\theta \in \Theta} U(\pi_\theta, x). \tag{3}$$

This is the typical setting where the selected set is chosen so as to take into account all the information we have about the current population, and it will be the focus of this paper.

Population-independent policies.

In some cases, e.g. when there have to be fixed rules for selecting individuals from a population, the DM may need to choose a policy before seeing a particular population. In this scenario, we can assume that the DM has access to some distribution $\beta$ over possible populations, and the problem becomes maximising expected utility under this distribution:

$$\theta^*(\beta) = \arg\max_{\theta \in \Theta} \int_{\mathcal{X}} U(\pi_\theta, x)\, d\beta(x). \tag{4}$$

Another advantage of such policies in terms of fairness is that, under some specific policy structures, each individual can be judged without taking into account who else is in the current population. Instead, individuals will be judged relative to their expected contribution over all possible populations.

While maximising utility is appropriate from the point of view of social welfare, it is not necessarily meritocratic. In the following, we will link fairness to the expected marginal contribution of individuals. For certain policy structures, this is a linear transformation of the gradient of the expected utility with respect to the policy parameters. In consequence, at any local maximum, the gradient and so the expected marginal contribution is zero. This can be interpreted as saying that social welfare would be reduced by changing the current allocation.
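As a minimal sketch of the expected utility of a policy in (2), the snippet below enumerates all decisions in $\{0, 1\}^N$ for a tiny population. The concave utility `set_utility`, the predictions `y_hat`, and the cost value are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import itertools
import math


def actions(n):
    """Enumerate every decision a in {0, 1}^n as a tuple."""
    return list(itertools.product((0, 1), repeat=n))


def set_utility(a, y_hat, cost):
    """Hypothetical non-linear U(a, x): log(1 + sum of selected predictions) - cost * |a|."""
    total = sum(y for a_i, y in zip(a, y_hat) if a_i == 1)
    return math.log1p(total) - cost * sum(a)


def expected_utility(policy, y_hat, cost):
    """U(pi, x) = sum over a of pi(a | x) * U(a, x), as in equation (2)."""
    return sum(policy(a) * set_utility(a, y_hat, cost) for a in actions(len(y_hat)))


def uniform_policy(n):
    """Policy that gives the same probability to each of the 2^n decisions."""
    prob = 0.5 ** n
    return lambda a: prob


y_hat = [0.9, 0.6, 0.2]  # assumed point predictions for three candidates
cost = 0.1               # assumed per-selection cost

u_uniform = expected_utility(uniform_policy(3), y_hat, cost)
best_set = max(actions(3), key=lambda a: set_utility(a, y_hat, cost))
u_best = set_utility(best_set, y_hat, cost)
```

With these assumed numbers the best set leaves out the third candidate: selected alone, that candidate would add about 0.18 in utility for a cost of 0.1, but next to the other two the gain drops to about 0.08. This is the sense in which a ranking over sets need not induce a ranking over individuals.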

We are now going to formally introduce the expected marginal contribution of an individual under a given policy $\pi$. The marginal contribution of an individual $j$ to a set $a$ is given by $U(a + j) - U(a)$, where we mean by $a + j$ that individual $j$ is being selected, i.e. $a + j = (a_1, \ldots, 1, \ldots, a_N)$ for any $a \in \mathcal{A}$. In our case, we do not have a fixed set $a$ with which we could compare marginal contributions of two individuals. We therefore define the expected marginal contribution (EMC) of an individual $j$ under a policy $\pi$ as

$$\Delta_j U(\pi, x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \big[ U(a + j, x) - U(a, x) \big]. \tag{5}$$

Informally, the EMC of individual $j$ corresponds to the gain (or loss) in expected utility when a policy is modified so as to always pick individual $j$. The concept of marginal contributions has been studied in cooperative game theory, where the celebrated Shapley value [Shapley, 1951] constitutes a fair resource allocation based on marginal contributions. In particular, the Shapley value is characterised by four axioms of fair division, namely efficiency, symmetry, linearity, and the treatment of null players. As we show below, the EMC satisfies analogous axioms.

Properties.

The EMC satisfies the following axioms of fair division, where we omit $x$ for brevity.

1. Symmetry: If $U(a + i) = U(a + j)$ for all $a \in \mathcal{A}$, then $\Delta_i U(\pi) = \Delta_j U(\pi)$ for any policy $\pi$.
2. Linearity: If $U_1$ and $U_2$ are two utility functions, then $\Delta_j (\alpha U_1 + \beta U_2)(\pi) = \alpha \Delta_j U_1(\pi) + \beta \Delta_j U_2(\pi)$ for any policy $\pi$.
3. Null player: If an individual does not contribute to any of the subsets, i.e. $U(a + j) = U(a)$ for all $a \in \mathcal{A}$, then $\Delta_j U(\pi) = 0$ for any policy $\pi$.

If we restrict ourselves to policies $\pi$ with support equal to $\mathcal{A}$, then property 3 indeed becomes an equivalence. That is, if $\Delta_j U(\pi) = 0$ for every policy $\pi$ with support $\mathcal{A}$, then $U(a + j) = U(a)$ for all $a \in \mathcal{A}$.

These properties are, apart from efficiency which does not apply to the EMC, analogous to the axioms of fair division characterising the Shapley value. In fact, we obtain the Shapley value when assuming an egalitarian policy.

Remark 1.

When the selection policy $\pi$ is egalitarian, i.e. $\pi(a) = \frac{1}{(k+1)\binom{N}{k+1}}$ when $k = \sum_{i=1}^{N} a_i$, the EMC $(\Delta_i U(\pi_{\mathrm{egal}}))_{1 \le i \le N}$ is equal to the Shapley value.

Policy structures

We now demonstrate that the EMC defined in equation (5) is related to the policy gradient $\nabla_\theta U(\pi_\theta)$ for specific types of policy parameterisations. Specifically, we consider policies that are separably parameterised over the population and show that the policy gradient is a linear transform of the EMC in specific cases. In the following, we will omit $x$ and write $\pi(a)$ for $\pi(a \mid x)$ for brevity.

We say that a parameterised policy is separable over the population $\mathcal{N}$ if $\theta = (\theta_1, \ldots, \theta_N)$ and there exists $Z(\theta)$ such that for every $i \in \{1, \ldots, N\}$ the probability of selecting individual $i$ takes the form

$$\pi_\theta(a_i) = \frac{g(a_i, \theta_i)}{Z(\theta)} \tag{6}$$

for some function $g$, and $\pi_\theta(a) = \prod_{i=1}^{N} \pi_\theta(a_i)$. In slightly imprecise terms this means that the probability of selecting an individual $i$ only depends on the $i$-th parameter $\theta_i$ apart from a common normalisation.

Softmax policies.

A natural choice for policies that take the form in (6) are softmax policies of the following kind. For $\beta \ge 0$ and $\theta = (\theta_1, \ldots, \theta_N)$, we define

$$\pi_\theta(a) = \frac{e^{\beta \theta^\top a}}{\sum_{a' \in \mathcal{A}} e^{\beta \theta^\top a'}}.$$

Here, $\beta \ge 0$ …

Lemma 1.

The gradient is a linear transform of the EMC. More precisely, for every $j \in \{1, \ldots, N\}$:

$$\nabla_{\theta_j} U(\pi_\theta) = \beta\, \pi_\theta(a_j = 1)\, \Delta_j U(\pi_\theta),$$

where $\pi_\theta(a_j = 1) \triangleq \sum_{a : a_j = 1} \pi_\theta(a)$ is the probability of selecting a set containing individual $j$ under policy $\pi_\theta$.

Note that the EMC constitutes the expected contribution of always adding individual $j$ under the current policy, so if an individual is always added, then $\Delta_j U(\pi_\theta)$ is zero. More generally, the EMC of an individual $j$ is decreasing as the probability of selecting individual $j$ is increasing. Consequently, we can understand the factor $\pi_\theta(a_j = 1)$ in Lemma 1 as a normalisation of the EMC opposing this effect. The relation of the EMC and the gradient is again illustrated with an example in Figure 1.

Linear policies.

From a computational point of view, it is appealing to consider separable policies so that the probability of selecting individual $i$ truly depends on the parameter $\theta_i$ only, i.e. $Z(\theta) = \text{constant}$. Then, the probability of a decision about individual $i$ is simply given by $\pi_\theta(a_i) = \pi_{\theta_i}(a_i)$. Note that using such policies in Algorithm 1 does not render decisions about individual $i$ independent from those about individual $j$, as the parameters $\theta^*$ chosen by (3) depend on the whole population represented by the feature vector $x$.

We consider separable linear policies as these have a particularly intuitive structure. These select individual $i$ with probability $\theta_i$, i.e.

$$\pi_{\theta_i}(a_i) = \theta_i\, \mathbb{I}\{a_i = 1\} + (1 - \theta_i)\, \mathbb{I}\{a_i = 0\},$$

where $\theta = (\theta_1, \ldots, \theta_N)$. Consequently, a separable linear policy assigns a decision $a \in \mathcal{A}$ the probability

$$\pi_\theta(a) = \prod_{i=1}^{N} \big( \theta_i\, \mathbb{I}\{a_i = 1\} + (1 - \theta_i)\, \mathbb{I}\{a_i = 0\} \big).$$

We again observe that there is a natural link between the policy gradient and the EMC.

[Figure 1 panels: (a) Policy Gradient, (b) EMC, (c) Shapley value; one axis per individual]

Figure 1: Gradient, EMC, and Shapley value for two individuals, a log-linear utility, a linear separable policy, and a selection cost $c = 0.3$. Light colours denote higher utility, dark colours lower utility. The maximum is indicated by a red cross.
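The relation stated in Lemma 1 can be checked numerically for two individuals. The sketch below computes the EMC of equation (5) by enumeration under a softmax policy and compares $\beta\, \pi_\theta(a_j = 1)\, \Delta_j U(\pi_\theta)$ with a finite-difference estimate of the gradient. The utility, the predictions `y_hat`, the cost, and the values of `theta` and `beta` are assumptions chosen for illustration, not the settings behind Figure 1.

```python
import itertools
import math


def actions(n):
    """Enumerate every decision a in {0, 1}^n as a tuple."""
    return list(itertools.product((0, 1), repeat=n))


def utility(a, y_hat=(0.9, 0.5), cost=0.3):
    """Hypothetical set utility: log(1 + sum of selected predictions) - cost * |a|."""
    total = sum(y for a_i, y in zip(a, y_hat) if a_i == 1)
    return math.log1p(total) - cost * sum(a)


def softmax_policy(theta, beta):
    """pi_theta(a) proportional to exp(beta * theta^T a), returned as a dict over decisions."""
    acts = actions(len(theta))
    weights = [math.exp(beta * sum(t * a_i for t, a_i in zip(theta, a))) for a in acts]
    z = sum(weights)
    return {a: w / z for a, w in zip(acts, weights)}


def expected_utility(pi):
    """U(pi) = sum over a of pi(a) * U(a)."""
    return sum(p * utility(a) for a, p in pi.items())


def emc(pi, j):
    """Expected marginal contribution of individual j, equation (5)."""
    total = 0.0
    for a, p in pi.items():
        a_plus_j = a[:j] + (1,) + a[j + 1:]
        total += p * (utility(a_plus_j) - utility(a))
    return total


def numerical_gradient(theta, beta, j, eps=1e-6):
    """Central finite difference of U(pi_theta) with respect to theta_j."""
    up, down = list(theta), list(theta)
    up[j] += eps
    down[j] -= eps
    diff = expected_utility(softmax_policy(up, beta)) - expected_utility(softmax_policy(down, beta))
    return diff / (2 * eps)


theta, beta = [0.2, -0.4], 1.5
pi = softmax_policy(theta, beta)
gaps = []
for j in range(len(theta)):
    p_select = sum(p for a, p in pi.items() if a[j] == 1)
    lemma_rhs = beta * p_select * emc(pi, j)
    gaps.append(abs(numerical_gradient(theta, beta, j) - lemma_rhs))
```

Each entry of `gaps` should be at the level of numerical noise, which is consistent with the gradient being a rescaled EMC for this policy class.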

Lemma 2.

For separable linear policies, if $\pi_{\theta_j}(a_j = 0) > 0$, then

$$\nabla_{\theta_j} U(\pi_\theta) = \frac{\Delta_j U(\pi_\theta)}{\pi_{\theta_j}(a_j = 0)}.$$

We observe that the gradient takes a similar form as in Lemma 1, where in this case the normalising factor is the reciprocal of the probability of rejecting individual $j$ under policy $\pi_\theta$. In particular, note that under any uniform policy, the gradient is equal to the EMC after uniform scaling.

A different class of policies emerges when instead of separably parameterising over the population, we parameterise over the feature space $\mathcal{X}$. One natural example is a policy of logistic type when $\theta = (\theta_1, \ldots, \theta_{|\mathcal{X}|})$ and the probability of selecting individual $i$ is

$$\pi_\theta(a_i = 1 \mid x_i) = \frac{e^{\theta^\top x_i}}{1 + e^{\theta^\top x_i}}.$$

These policies can be viewed as threshold policies since an individual $i$ with feature vector $x_i$ such that $\theta^\top x_i > 0$ … $i$ is being selected only if $\theta^\top x_i > 0$.

In order to find the optimal policy for a fixed population, i.e. to solve (3), we propose to use a policy gradient algorithm. The general pseudo code is given in Algorithm 1.

Algorithm 1 Policy gradient algorithm

    Input: A population $\mathcal{N}$ with features $x$ and a utility function $u$.
    Initialise $\theta_0$, threshold $\delta > 0$, learning rate $\eta > 0$
    while $\|\theta_{i+1} - \theta_i\| > \delta$ do
        Evaluate $\nabla U(\pi_\theta, x)$ using data $x$
        $\theta_{i+1} \leftarrow \theta_i + \eta\, [\nabla U(\pi_\theta, x)]_{\theta = \theta_i}$
        $i \leftarrow i + 1$
    end while
    return $\pi_{\theta_{i+1}}$

From Figure 1, we observe that the policy gradients and the EMCs point in nearly the same direction except at the edges, where the probability of selecting an individual is almost 1. This illustrates the effect of the ‘normalising’ factors emerging in Lemmas 1 and 2. We also observe that the Shapley value and the gradients are similar near the point (0.…, 0.…).

Deviation from meritocracy as the residual EMCs

We do not state meritocracy in our setting as an optimisation constraint, but we are interested in the natural link between utility maximisation and meritocratic fair decisions. For this reason, we wish to quantify the deviation from meritocracy of policies in terms of the EMC. We suggest using the cumulative positive EMC under a policy $\pi$:

$$\mathrm{Res}(\pi) = \sum_{i=1}^{N} \max(0, \Delta_i U(\pi)).$$

Here, large values for $\mathrm{Res}(\pi)$ indicate strong deviation from fully meritocratic decisions. Note that we decide to only account for positive EMCs, as a negative EMC indicates a negative potential contribution of an individual. From an individual’s point of view, the EMC of an individual under a given policy represents her potential that remains unrecognized by the DM.

From Figure 1b, we observe that as the policy reaches the optimal solution, i.e. the probability of selecting the individual with higher contribution (individual 1) becomes higher, the residual EMCs are getting close to zero. This is demonstrated by the arrow pointing downwards in the bottom right corner, showing that individual 2 has negative expected contribution under a policy selecting individual 1 with high probability. We also observe that residual EMCs are always close to zero (decreasing arrow size) when both individuals are being selected with probability close to 1.
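The loop of Algorithm 1 and the residual $\mathrm{Res}(\pi)$ can be sketched for a separable linear policy on a tiny population. The sketch rests on several assumptions that are not in the paper: the utility and numbers are hypothetical, the gradient is computed exactly by enumeration (for $\theta_j < 1$ it coincides with the expression in Lemma 2), the parameters are clipped to $[0, 1]$ after each step so that they remain probabilities, and a cap on the number of iterations is added as a safeguard.

```python
import itertools
import math


def actions(n):
    """Enumerate every decision a in {0, 1}^n as a tuple."""
    return list(itertools.product((0, 1), repeat=n))


def set_utility(a, y_hat, cost):
    """Hypothetical set utility: log(1 + sum of selected predictions) - cost * |a|."""
    total = sum(y for a_i, y in zip(a, y_hat) if a_i == 1)
    return math.log1p(total) - cost * sum(a)


def policy_prob(a, theta):
    """Separable linear policy: individual i is selected independently with probability theta_i."""
    prob = 1.0
    for a_i, t in zip(a, theta):
        prob *= t if a_i == 1 else 1.0 - t
    return prob


def expected_utility(theta, y_hat, cost):
    return sum(policy_prob(a, theta) * set_utility(a, y_hat, cost)
               for a in actions(len(theta)))


def emc(theta, j, y_hat, cost):
    """Expected marginal contribution of individual j under pi_theta, equation (5)."""
    total = 0.0
    for a in actions(len(theta)):
        a_plus_j = a[:j] + (1,) + a[j + 1:]
        total += policy_prob(a, theta) * (
            set_utility(a_plus_j, y_hat, cost) - set_utility(a, y_hat, cost))
    return total


def gradient(theta, y_hat, cost):
    """U is linear in each theta_j, so dU/dtheta_j = U(theta_j = 1) - U(theta_j = 0)."""
    grad = []
    for j in range(len(theta)):
        with_j = theta[:j] + [1.0] + theta[j + 1:]
        without_j = theta[:j] + [0.0] + theta[j + 1:]
        grad.append(expected_utility(with_j, y_hat, cost)
                    - expected_utility(without_j, y_hat, cost))
    return grad


def policy_gradient(y_hat, cost, eta=0.5, delta=1e-6, max_iters=1000):
    """Gradient ascent in the spirit of Algorithm 1, with clipping to [0, 1] (an added assumption)."""
    theta = [0.5] * len(y_hat)
    for _ in range(max_iters):
        grad = gradient(theta, y_hat, cost)
        new_theta = [min(1.0, max(0.0, t + eta * g)) for t, g in zip(theta, grad)]
        step = math.sqrt(sum((n - t) ** 2 for n, t in zip(new_theta, theta)))
        theta = new_theta
        if step <= delta:
            break
    return theta


def residual(theta, y_hat, cost):
    """Res(pi) = sum over i of max(0, EMC_i)."""
    return sum(max(0.0, emc(theta, i, y_hat, cost)) for i in range(len(theta)))


y_hat, cost = [0.9, 0.6, 0.2], 0.1  # assumed predictions and selection cost
theta_star = policy_gradient(y_hat, cost)
res_start = residual([0.5, 0.5, 0.5], y_hat, cost)
res_star = residual(theta_star, y_hat, cost)
```

In this toy instance the iterates move to a deterministic policy that selects the first two candidates, and the residual drops from a positive value under the uniform starting policy to zero, since the only remaining EMC is negative.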
In this section, we perform an empirical case-study on fair set selection of individuals in a college admission system to test different algorithms with regard to achieved utility and our notion of meritocracy. The goal of the experiments is primarily to analyse the effect of different policy structures and secondarily to demonstrate that utility maximisation is sufficient for promoting group fairness notions (e.g. diversity or statistical parity).

Data generation.

Data is generated through simulation. We first generate records for 4,000 students, which represent the students admitted to the faculty in previous years. Each student is represented by a feature vector including a high school GPA for three specific fields (e.g. humanities or science), gender and high school. We assume that students come from 10 different schools and that the school to some extent depends on the individual’s ethnicity (being majority, large minority, or small minority). In addition, each student is provided with a graduation result (numerical between 0 and 1), which we use to train a prediction model. The graduation grade is based on two latent variables: 1) talent: an innate quality, normally distributed across the population, and 2) skill: based on talent, but can be developed or hindered. Skill development is affected by school quality and societal gender biases. For evaluation, we generate another 500 students from the simulator from whom we wish to select a subset. A detailed description of the simulator is given in Appendix A.

Modelling graduation using regression.

As we wish to estimate the success for each student, we train a linear regression model on the data $\mathcal{D}$ from the 4,000 students to predict the graduation results of the students observed in the decision phase. We refer to these predictions as predicted outcomes. The trained predictor $f$ takes the data $x$ of students from this year as input and predicts the outcome $y_i \in [0, 1]$ of the $i$-th individual: $f(i) \triangleq \mathbb{P}(y_i \mid x, \mathcal{D})$.

Algorithms for set selection.

We examine the behaviour of separable linear and threshold policies. For threshold policies, we use the high school GPA as a numerical three dimensional feature and add the gender and school as one-hot encoded categorical features. For both policy structures, we allow the policy gradient algorithms 250 updates to converge. In addition, we consider uniform set selection as a trivial lower baseline and, for a robust benchmark, implement stochastic greedy set selection (see e.g. Qian et al. [2018]) which essentially performs dynamic ranking of individuals according to marginal contributions.

Even though we believe that ranking is not particularly suitable for this problem, we can use predicted outcomes of an individual to rank them. We implement the cross-population noisy-rank algorithm of Kearns et al. [2017], which produces a probabilistic ordering over the population based on individuals’ relative percentiles. In its original form, the noisy-rank algorithm requires us to only select an individual $i$ if all higher ranked individuals are being selected as well. To combine this with a utility function, we select individuals in the order induced by the ranking until an individual has negative contribution. We will refer to this in our experiments as noisy-rank.

From a utilitarian social-welfare point of view, the DM is interested in the outcomes of her decisions. This requires having a utility function that depends on the actual outcomes; in the university scenario, for instance, the graduation result encoded in some numerical value.
In addition, the DM might have a cost associated with the selection of a student, so that it might only be favourable for the DM to select a student if the gain in utility from adding the student to the set exceeds the cost $c$ of selecting her. In the case of linear utility, the DM’s utility function $u : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ then takes the form

$$u(a, y) = \sum_{i \in a} y_i - c \cdot |a|,$$

where we slightly abuse notation when writing $i \in a$ to mean that $i$ is being selected under decision $a$, i.e. $i \in \{k : a_k = 1\}$.

The DM might also be interested in promoting diversity in the student body, e.g. by admitting candidates from different backgrounds or demographics. For instance, the DM may wish to admit approximately the same number of female and male students. In our case, we can define group types, $\mathcal{T} = \{\text{gender}, \text{schools}\}$, and groups of such type, e.g. $\mathcal{G}_{\text{gender}} = \{\text{male}, \text{female}\}$. This allows us to define utilities accounting for the demographic background of applicants. A natural choice for a diversity promoting utility is the log-linear utility over groups:

$$u(a, y) = \sum_{T \in \mathcal{T}} \sum_{G \in \mathcal{G}_T} \log\Big( \sum_{i \in a \cap G} y_i \Big) - c \cdot |a|,$$

as it naturally promotes egalitarian selection among groups.

In practice, we do not have access to the actual outcomes $y$ of individuals and therefore use the trained predictor $f$ to estimate the utility of a set by $U(a, x) = \mathbb{E}_{y \sim f}[u(a, y)]$ (cf. Section 3).

In ecology, economics, and political science, quantifying diversity in the form of an index has long been studied [Simpson, 1949, Hill, 1973, Laakso and Taagepera, 1979, Jost, 2006, Chao et al., 2016]. In general, if individuals are chosen from $C$ communities (or groups) according to some demographic feature, the expected diversity of order $l \in \mathbb{R}_+$ is defined as

$$\mathrm{Div}_l \triangleq \Big( \sum_{i=1}^{C} p_i^l \Big)^{1/(1-l)},$$

where $p_i$ is the probability of selecting an individual from the $i$-th community. These are called Hill numbers in ecology [Hill, 1973].
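The two utilities and the diversity index defined above can be written down directly. The following is a minimal sketch: the data layout (`selected` as a set of indices, `groups_by_type` as nested dictionaries), the example numbers, and the convention of returning negative infinity when a group has no selected member (the value of $\log 0$ in the limit) are all assumptions made for illustration.

```python
import math


def linear_utility(selected, y, cost):
    """u(a, y) = sum of outcomes of selected individuals - cost * |a|."""
    return sum(y[i] for i in selected) - cost * len(selected)


def log_linear_utility(selected, y, groups_by_type, cost):
    """u(a, y) = sum over group types and groups of log(sum of selected outcomes in the group) - cost * |a|."""
    total = 0.0
    for groups in groups_by_type.values():
        for members in groups.values():
            group_sum = sum(y[i] for i in selected if i in members)
            if group_sum <= 0.0:
                # Assumed convention: an unrepresented group gives log(0) = -infinity.
                return -math.inf
            total += math.log(group_sum)
    return total - cost * len(selected)


def hill_number(probabilities, order=2.0):
    """Diversity of the given order; order 2 is the inverse Simpson index."""
    if order == 1.0:
        raise ValueError("order 1 is only defined as a limit and is not handled here")
    return sum(p ** order for p in probabilities) ** (1.0 / (1.0 - order))


y = [0.8, 0.7, 0.6, 0.5]  # hypothetical outcomes for four applicants
groups_by_type = {"gender": {"male": {0, 1}, "female": {2, 3}}}
cost = 0.1

single_group = {0, 1}  # two best applicants, both from the same group
mixed = {0, 2}         # one applicant from each group

u_lin_single = linear_utility(single_group, y, cost)
u_lin_mixed = linear_utility(mixed, y, cost)
u_log_single = log_linear_utility(single_group, y, groups_by_type, cost)
u_log_mixed = log_linear_utility(mixed, y, groups_by_type, cost)
div_balanced = hill_number([0.5, 0.5])
div_skewed = hill_number([0.9, 0.1])
```

In this toy example the linear utility prefers the single-group set while the log-linear utility prefers the mixed one, and the index equals 2 for two equally likely groups and falls towards 1 as the selection becomes skewed.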
Here, we use diversity of order $l = 2$, which is also called the inverse Simpson’s index [Simpson, 1949], where higher values of $\mathrm{Div}_l$ represent higher diversity. This specific quantification of diversity is used to represent the effective number of parties in an election process [Laakso and Taagepera, 1979] or to represent the effective number of different species in an ecosystem [Chao et al., 2016]. We adopt this well studied diversity measure to quantify the expected number of groups having representatives in the selected set of individuals. As we have different types of groups in our situation, we take the sum of Div over all group types (i.e. gender, school).

Performance Analysis

We report the results of the experiments for linear and log-linear utility in Tables 1 and 2, respectively. The experiments are conducted with a population of 500 students, and a selection cost per individual of 0.… … true outcomes of individuals generated by the simulator which are unavailable to the algorithms.

Table 1: Expected utility and residual EMCs w.r.t. actual outcomes for linear utility ($c = 0.…$).

    Policy        U(π)     Res(π)   Div
    Linear        24.53    2.00     10.01
    Threshold     13.03    8.83     10.00
    Greedy        25.07    1.62     10.14
    Noisy-rank    1.25     26.80    6.47
    Uniform       -4.10    14.10    10.65

Table 2: Expected utility and residual EMCs w.r.t. actual outcomes for log-linear utility ($c = 0.…$).

    Policy        U(π)     Res(π)   Div
    Linear        16.64    0.59     11.42
    Threshold     12.34    1.69     10.24
    Greedy        17.03    0.06     11.73
    Noisy-rank    11.83    37.04    9.75
    Uniform       7.63     8.91     10.71

From Tables 1 and 2, we observe that closeness to optimal expected utility corresponds to low residual EMCs. The residuals for the separable linear policy and the stochastic greedy algorithm are non-zero due to prediction errors of the trained model. With respect to the predicted outcomes, both algorithms achieve almost zero residuals for both utility types (cf. further results in Tables 4 and 5 in Appendix D).
Thus, the residuals of the separable linear policy and the greedy algorithm can be seen as a deviation from meritocracy caused by the DM’s imperfect predictions.

The threshold policy remains highly unstable as we do not project the policy to the next closest vertex in the simplex, which results in relatively low utility and high residuals. The aim of a threshold policy is to provide simple criteria for admission as these are attractive from the point of view of transparency and explainability. In order to guarantee such clarity and explainability, we want our policy to make decisions about individuals in the original feature space without mapping to higher dimensions. As a result, the threshold policy struggles whenever individuals are in the optimal set (i.e. have positive contribution), but are surrounded by sub-optimal individuals with similar features. This is visualised in Figure 2, where we conducted the experiments with a simplified simulator having only two grade features (apart from gender and school) for illustrative purposes.

Tables 1 and 2 show that the meritocratic fairness notion of Kearns et al. [2017] is too restrictive in our setting as selection is stopped too early. Consequently, the residual EMCs are even higher than for the uniform policy as the majority of the population could have positively contributed but were not considered. The comparison between noisy-rank and the uniform selection also yields the observation that while close to optimal utility generally implies low residual EMCs, low utility does not necessarily imply high residuals. For instance, the policy that is selecting the whole population always achieves zero residuals while achieving low utility in general.

When comparing the inverse Simpson indices Div for both utilities, we see that the log-linear utility successfully increases the diversity in the selected set compared to the linear utility.
Here, the stochastic greedy algorithm and the separable linear policy achieve almost the maximal diversity of 12, which is attained when every group has the same probability of selection. This is representative of the possibilities of utility maximisation as a framework for group-fair set selection, as we could design utilities promoting other group fairness notions such as statistical parity. In particular, when applying the log-linear utility and choosing students according to their marginal contribution, we encode their demographic background implicitly into their "merit". This approach of designing the utility to promote affirmative action may be more appealing to the DM than designing different admission criteria for different groups.

[Figure 2 panels: selection of the separable policy under (a) linear utility and (b) log-linear utility; axes: Feature 1 (x) and a second grade feature (y).]
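The diversity ceiling of 12 quoted in this section can be reproduced with a small Python sketch. It assumes that the inverse Simpson's index is computed from the empirical group proportions of a selected set, and that there are two genders and ten schools; the paper's exact definition of $\mathrm{Div}_l$ is given earlier and may instead be based on selection probabilities. The function names and the toy selections are hypothetical.

```python
from collections import Counter


def inverse_simpson(labels):
    """Diversity of order 2: 1 / sum_g p_g**2 over the group proportions p_g."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention chosen here for an empty selection (assumption)
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


def total_diversity(selected, group_types):
    """Sum the index over group types, as the text does for gender and school."""
    return sum(inverse_simpson([s[g] for s in selected]) for g in group_types)


# Hypothetical selection containing every (gender, school) pair exactly once.
balanced_set = [{"gender": g, "school": s} for g in range(2) for s in range(10)]
balanced = total_diversity(balanced_set, ["gender", "school"])

# Hypothetical selection drawn from a single gender and a single school.
skewed_set = [{"gender": 0, "school": 3}] * 5
skewed = total_diversity(skewed_set, ["gender", "school"])

print(f"balanced: {balanced:.2f}, skewed: {skewed:.2f}")
```

Under these assumptions the balanced selection reaches 2 + 10 = 12 (up to floating-point rounding), while the single-group selection collapses to 1 + 1 = 2.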

Figure 2: The selection of the separable linear policy (red circles) and the soft decision boundary $\theta^T x$ determined by the threshold policy (blue line), illustrated in the feature space of 100 students generated with two numerical grade features (x-axis and y-axis) as well as gender and school. Each circle represents a student by means of her feature vector. Every student above the line has > .

We have provided a first look into how one can define meritocracy in a general set selection scenario, where individuals contribute different amounts depending on the DM's policy. We have arrived at a natural definition relying on driving the expected marginal contribution of individuals to zero. Any policy satisfying this property cannot be improved locally by adding or removing individuals from the set. For separable policy structures, such policies can be obtained easily through greedy maximisation or gradient ascent.

However, in some cases we are interested in policies which make decisions based only upon individual characteristics, without making an explicit comparison between individuals. It is possible to use policy gradient methods to find policies that maximise utility, but we have shown experimentally that these (at least for the case of a linear parametrisation) lead to inferior outcomes both in terms of meritocracy and utility.

For completeness, we have also compared our methods experimentally with an algorithm from the fair ranking literature, noisy-rank, in a college admission simulation. This method seeks not to admit "worse" individuals before "better" ones, and we have adapted it to our setting by admitting everyone in the order ranked by the algorithm until the DM's utility could not be further improved. Perhaps unsurprisingly, this method performed poorly, even though it used the same predictive model as the other approaches.

In future work, it will be interesting to examine policies that select fixed admission rules before actually seeing the population. In some cases, e.g. when there have to be fixed rules for selecting individuals from a population at the beginning of the process, the DM may need to choose a policy before seeing a particular population. Then we can assume that the DM has access to some distribution $\beta$ over possible populations, and the problem becomes maximising expected utility under this distribution.

Another advantage of such policies in terms of fairness is that, under some specific policy structures, each individual can be judged without taking into account who else is in the current population. Instead, individuals will be judged relative to their expected contribution over all possible populations, and this will provide a plausible method for extracting an individual's worth, which could be useful in some scenarios.

In many real-world applications, however, including the university admission scenario, the problem is more complicated, as there are multiple universities and students. The setting then becomes a matching problem, for which other concepts of fairness, such as envy-free allocation, may well be applicable. We leave this question for future work.

Acknowledgements

Many thanks to Rachel Cao, Yang Liu, David Parkes and Goran Radanovic for long discussions about fairness in the set selection setting that partly served to inspire this line of work. We would also like to thank Anne-Marie George for discussions on the relation of this work to social choice and envy-free allocation more generally, as well as Mauricio Byrd for contributing to early validation experiments.

References

Haris Aziz and Nisarg Shah. Participatory budgeting: Models and approaches, 2020.

Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 405–414, 2018.

L Elisa Celis, Damian Straszak, and Nisheeth K Vishnoi. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.

L Elisa Celis, Anay Mehrotra, and Nisheeth K Vishnoi. Interventions for ranking in the presence of implicit bias, 2020.

Anne Chao, Chun-Huo Chiu, and Lou Jost. Phylogenetic diversity measures and their decomposition: a framework based on Hill numbers. Biodiversity Conservation and Phylogenetic Systematics, page 141, 2016.

Cristina Díaz-García, Angela González-Moreno, and Francisco Jose Saez-Martinez. Gender diversity within R&D teams: Its impact on radicalness of innovation. Innovation, 15(2):149–160, 2013.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.

Edith Elkind, Piotr Faliszewski, Piotr Skowron, and Arkadii Slinko. Properties of multiwinner voting rules. Social Choice and Welfare, 48(3):599–632, 2017.

Vitalii Emelianov, Nicolas Gast, Krishna P Gummadi, and Patrick Loiseau. On fair selection in the presence of implicit variance. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 649–675, 2020.

Mark O Hill. Diversity and evenness: a unifying notation and its consequences. Ecology, 54(2):427–432, 1973.

Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006.

Michael Kearns, Aaron Roth, and Zhiwei Steven Wu. Meritocratic fairness for cross-population selection. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1828–1836. JMLR.org, 2017.

Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277–287, 2020.

Jon M Kleinberg and Manish Raghavan. Selection problems in the presence of implicit bias. CoRR, abs/1801.03533, 2018. URL http://arxiv.org/abs/1801.03533.

Matt J Kusner, Chris Russell, Joshua R Loftus, and Ricardo Silva. Causal interventions for fairness, 2018.

Markku Laakso and Rein Taagepera. "Effective" number of parties: a measure with application to West Europe. Comparative Political Studies, 12(1):3–27, 1979.

Martin Lackner and Piotr Skowron. Approval-based committee voting: Axioms, algorithms, and applications, 2020.

Sébastien Lahaie and David C Parkes. Fair package assignment. Auctions, Market Mechanisms and Their Applications, 2009.

Hervé Moulin. Handbook of Computational Social Choice. Cambridge University Press, 2016. doi: 10.1017/CBO9781107446984.

Chao Qian, Yang Yu, and Ke Tang. Approximation guarantees of stochastic greedy algorithms for subset selection. In IJCAI, pages 1478–1484, 2018.

Lloyd S Shapley. Notes on the n-person game, II: The value of an n-person game. 1951.

Edward H Simpson. Measurement of diversity. Nature, 163(4148):688–688, 1949.

Ashudeep Singh and Thorsten Joachims. Policy learning for fairness in ranking. In Advances in Neural Information Processing Systems, pages 5427–5437, 2019.

Julia Stoyanovich, Ke Yang, and HV Jagadish. Online set selection with fairness and diversity constraints. In Proceedings of the EDBT Conference, 2018.

Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1569–1578, 2017.

A Description of the Simulator

[Figure 3 diagram: graph over the nodes $z_{eth}$, $x_{sch}$, $x_{sk}$, $x_{gpa}$, $z_{gen}$, $z_{tal}$, $a$, $\pi$, $y$.]

Figure 3: A simulator for the college admission and graduation with different schools, skills, gender and a DM policy. Dashed lines mark relations that are only included in the prediction model and are not part of the historical decision process.

Students are admitted depending on the faculty's admission policy, which can use their grade information as well as which school they came from if necessary. We simulate a faculty with a specific admission profile. The faculty prefers candidates whose grades match its profile, measured in practice by the inner product between the grades and the faculty profile. Uniform noise is added to this score to account for other admission factors such as a personal statement. The default admission process is that the top k matches are selected. The graduation grade of an admitted student, which lies in $[0, 1]$, depends on the faculty admission profile, their skills at the time of application and their innate talent in each topic. A parameter $\gamma$ determines the ratio between the talent and skill contributions to the graduation grade.

The ethnicity affects which high school a student will attend, with different high schools having different qualities: high schools develop different skills to different amounts according to their profile.

The gender has two effects. First, different schools develop skills to a different extent depending on gender. Second, a general gender bias from society causes a positive, negative or neutral effect on skill development for different topics. This effect is the same for all schools. At the end of high school, students obtain a grade based on their skill.

The notations in Figure 3 are as follows:
• $z_{eth}$ (Ethnicity), $z_{gen}$ (Gender), $z_{tal}$ (Talent): independent variables
• $x_{sch}$ (School): determined by Ethnicity (to simulate schools in geographical areas with different ethnicity distributions)
• $x_{sk}$ (Skill): determined by Talent, School and Gender
• $x_{gpa}$ (Grade): determined by School and Skill
• $a$ (Admission): determined by Grade, School, and policy
• $y$ (Graduation): determined by Skill and Talent (for admitted students)
• $\pi$ (Admissions Policy): chosen by the DM

The utility of the DM $u(a, y)$ should be related to the action $a$ and the outcome $y$.

B Modelling of the simulation

Though we generate the data using the simulator of Figure 3, the policies do not have access to the simulator. Instead, they can use a model built on data from the simulator that has been collected using a simple threshold policy as described in Section A. Note that we assume there are no biases affecting the performance of individuals once they are admitted. Moreover, we only have labels (graduation grades) for individuals who have been admitted.

Linear Regression                   0.0050686
Bayesian Regression                 0.0050690
Ridge Regression                    0.0050737
Multi-layer Perceptron Regressor    0.0050967
Random Forest Regressor             0.0053579
Lasso Regression                    0.0304602

As shown in Fig. 4, the prediction error of the model increases with $\gamma$ (talent-skill ratio). This is due to the indirect relation between the latent variable talent and the observed features (school grades).

Figure 4: Mean squared error of the linear regression model, on both the train set and the test set, for varying $\gamma$ (talent-skill ratio). For $\gamma = 0$, the final graduation grade is based on skill alone, while for $\gamma = 1$ it is entirely based on the talent. These scores are averaged over 5 repeated experiments with a training set of 10000 samples and a test set of 500 samples.

B.1 Details of Dataset Generation

For each student the school comes from a discrete distribution returning one of ten schools when sampled. There is a different school distribution depending on the student's ethnicity. Ethnicity is directly sampled from a discrete distribution with probabilities [0 . , . , . p = 0.5. Talent has three dimensions (for three topics), each sampled from a normal distribution with predefined mean and standard deviation. The school and gender define biases for each topic. These biases are added to the talent, and the sum serves as the mean of a normal distribution from which the skill is sampled. The school grades are determined by the skill level combined with uniform noise in a specific range.
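The generation steps described in this subsection can be sketched in Python as follows. This is only an illustration of the sampling order: the number of ethnicity groups, all probabilities, biases, means, standard deviations and the noise range are hypothetical placeholders (the corresponding values are not recoverable from the text), gender is assumed to be a fair coin flip, the school and gender biases are assumed to be additive, and the name `sample_student` is invented for this example.

```python
import random

N_SCHOOLS, N_TOPICS = 10, 3  # ten schools and three topics, as stated in the text

# Every numeric value below is a hypothetical placeholder, not the paper's setting.
ETHNICITY_PROBS = [0.5, 0.3, 0.2]
SCHOOL_WEIGHTS = [[1.0 + ((s + e) % 3) for s in range(N_SCHOOLS)]
                  for e in range(len(ETHNICITY_PROBS))]
SCHOOL_BIAS = [[0.02 * ((s + k) % 5 - 2) for k in range(N_TOPICS)]
               for s in range(N_SCHOOLS)]
GENDER_BIAS = [[0.05, 0.0, -0.05], [-0.05, 0.0, 0.05]]
TALENT_MEAN, TALENT_STD, SKILL_STD, GRADE_NOISE = 0.5, 0.1, 0.05, 0.05


def sample_student(rng):
    """Draw one student following the generation order described in the text."""
    # Ethnicity is sampled directly; the school distribution depends on it.
    eth = rng.choices(range(len(ETHNICITY_PROBS)), weights=ETHNICITY_PROBS)[0]
    school = rng.choices(range(N_SCHOOLS), weights=SCHOOL_WEIGHTS[eth])[0]
    gender = int(rng.random() < 0.5)  # assumed Bernoulli(0.5)
    # One talent value per topic, drawn from a normal distribution.
    talent = [rng.gauss(TALENT_MEAN, TALENT_STD) for _ in range(N_TOPICS)]
    # Skill is normal with mean = talent + school bias + gender bias.
    skill = [rng.gauss(talent[k] + SCHOOL_BIAS[school][k] + GENDER_BIAS[gender][k],
                       SKILL_STD)
             for k in range(N_TOPICS)]
    # School grades are the skill level plus uniform noise in a fixed range.
    grades = [s + rng.uniform(-GRADE_NOISE, GRADE_NOISE) for s in skill]
    return {"ethnicity": eth, "school": school, "gender": gender,
            "talent": talent, "skill": skill, "grades": grades}


rng = random.Random(0)
population = [sample_student(rng) for _ in range(500)]
```

Seeding a dedicated `random.Random` instance keeps the generated population reproducible without touching the global random state.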

C Proofs

C.1 Proof of Lemma 2
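The identity that this proof establishes for separable policies, $\frac{\partial}{\partial\theta_j} U(\pi_\theta) = \Delta_j U(\pi_\theta) / \pi_{\theta_j}(a_j = 0)$, can be sanity-checked by brute force on a small population. The Python sketch below assumes $\pi_{\theta_i}(a_i = 1) = \theta_i$, which is consistent with the derivative $\mathbb{I}\{a_j = 1\} - \mathbb{I}\{a_j = 0\}$ used in the proof; the random utility table and all function names are hypothetical.

```python
import itertools
import random


def policy_prob(theta, a):
    """Separable policy: product of theta_i if a_i = 1, else (1 - theta_i)."""
    p = 1.0
    for t, ai in zip(theta, a):
        p *= t if ai else 1.0 - t
    return p


def expected_utility(theta, U):
    """U(pi_theta) = sum over all decisions a of pi_theta(a) * U(a)."""
    return sum(policy_prob(theta, a) * u for a, u in U.items())


def emc(theta, U, j):
    """Expected marginal contribution: sum_a pi(a) * [U(a + j) - U(a)]."""
    total = 0.0
    for a, u in U.items():
        a_plus_j = a[:j] + (1,) + a[j + 1:]  # decision a with individual j added
        total += policy_prob(theta, a) * (U[a_plus_j] - u)
    return total


def grad_fd(theta, U, j, h=1e-6):
    """Central finite-difference estimate of dU/dtheta_j."""
    up, down = list(theta), list(theta)
    up[j] += h
    down[j] -= h
    return (expected_utility(up, U) - expected_utility(down, U)) / (2 * h)


rng = random.Random(0)
n = 4
# Arbitrary (hypothetical) set utility over all 2^n decisions.
U = {a: rng.uniform(-1.0, 1.0) for a in itertools.product((0, 1), repeat=n)}
theta = [rng.uniform(0.1, 0.9) for _ in range(n)]

max_gap = max(abs(grad_fd(theta, U, j) - emc(theta, U, j) / (1.0 - theta[j]))
              for j in range(n))
print(f"largest deviation from the identity: {max_gap:.2e}")
```

Because $U(\pi_\theta)$ is linear in each $\theta_j$ under this parametrisation, the central difference is exact up to rounding, so the reported deviation should be at the level of floating-point noise.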

Denote the set of decisions rejecting individual $j$ by $\mathcal{A}_{-j} \triangleq \{ a : a_j = 0,\ a_i \in \{0, 1\} \text{ for } i \neq j \}$. We omit writing $x$ here and calculate the gradient to be

\begin{align*}
\frac{\partial}{\partial\theta_j} U(\pi_\theta)
&= \frac{\partial}{\partial\theta_j} \sum_{a \in \mathcal{A}} \pi_\theta(a)\, U(a) \\
&= \sum_{a \in \mathcal{A}} \frac{\partial}{\partial\theta_j} \pi_{\theta_j}(a_j) \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a) \\
&= \sum_{a \in \mathcal{A}} \big( \mathbb{I}\{a_j = 1\} - \mathbb{I}\{a_j = 0\} \big) \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a) \\
&= \sum_{a \in \mathcal{A}_{-j}} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a + j) - \sum_{a \in \mathcal{A}_{-j}} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a) \\
&= \sum_{a \in \mathcal{A}_{-j}} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, [U(a + j) - U(a)] \\
&= \frac{1}{\pi_{\theta_j}(a_j = 0)} \sum_{a \in \mathcal{A}_{-j}} \pi_\theta(a)\, [U(a + j) - U(a)] \\
&= \frac{1}{\pi_{\theta_j}(a_j = 0)} \sum_{a \in \mathcal{A}} \pi_\theta(a)\, [U(a + j) - U(a)] \qquad \text{(for $a$ with $a_j = 1$ the difference is zero)} \\
&= \frac{\Delta_j U(\pi_\theta)}{\pi_{\theta_j}(a_j = 0)}.
\end{align*}

We used the equality
$$\sum_{a \in \mathcal{A}} \mathbb{I}\{a_j = 1\} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a) = \sum_{a \in \mathcal{A}_{-j}} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a + j)$$
in the fourth equality of the proof and, for completeness, prove it rigorously here. The LHS is equal to
$$\sum_{a \in \mathcal{A} : a_j = 1} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a) = \sum_{a \in \mathcal{A} : a_j = 1} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a + j).$$
Now, let $a$ and $a'$ differ only in the $j$-th element, namely $a_j = 1$ and $a'_j = 0$. Clearly,
$$\prod_{i \neq j} \pi_{\theta_i}(a_i) = \prod_{i \neq j} \pi_{\theta_i}(a'_i) \quad \text{and} \quad U(a + j) = U(a' + j),$$
which yields
$$\sum_{a \in \mathcal{A} : a_j = 1} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a + j) = \sum_{a \in \mathcal{A}_{-j}} \prod_{i \neq j} \pi_{\theta_i}(a_i)\, U(a + j).$$

C.2 Proof of Lemma 1
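The identity derived in this section for the softmax policy $\pi_\theta(a) \propto e^{\beta \theta^T a}$, namely $\frac{\partial}{\partial\theta_j} U(\pi_\theta) = \beta\, \pi_\theta(a \text{ with } a_j = 1)\, \Delta_j U(\pi_\theta)$ once the constant factor $\beta$ is restored, can likewise be checked numerically. The Python sketch below enumerates all decisions for a small population; the random utility table, the value of $\beta$ and the function names are hypothetical.

```python
import itertools
import math
import random


def softmax_policy(theta, beta, actions):
    """pi_theta(a) = exp(beta * theta^T a) / Z(theta) over all listed decisions."""
    weights = [math.exp(beta * sum(t * ai for t, ai in zip(theta, a)))
               for a in actions]
    z = sum(weights)
    return {a: w / z for a, w in zip(actions, weights)}


def expected_utility(theta, beta, U):
    pi = softmax_policy(theta, beta, list(U))
    return sum(pi[a] * U[a] for a in U)


def lemma_rhs(theta, beta, U, j):
    """beta * pi(a with a_j = 1) * Delta_j U(pi_theta)."""
    pi = softmax_policy(theta, beta, list(U))
    p_selected = sum(p for a, p in pi.items() if a[j] == 1)
    emc = sum(p * (U[a[:j] + (1,) + a[j + 1:]] - U[a]) for a, p in pi.items())
    return beta * p_selected * emc


def grad_fd(theta, beta, U, j, h=1e-5):
    """Central finite-difference estimate of dU/dtheta_j."""
    up, down = list(theta), list(theta)
    up[j] += h
    down[j] -= h
    return (expected_utility(up, beta, U) - expected_utility(down, beta, U)) / (2 * h)


rng = random.Random(1)
n, beta = 4, 2.0
# Arbitrary (hypothetical) set utility over all 2^n decisions.
U = {a: rng.uniform(-1.0, 1.0) for a in itertools.product((0, 1), repeat=n)}
theta = [rng.uniform(-1.0, 1.0) for _ in range(n)]

max_gap = max(abs(grad_fd(theta, beta, U, j) - lemma_rhs(theta, beta, U, j))
              for j in range(n))
print(f"largest deviation from the identity: {max_gap:.2e}")
```

Here the finite-difference gradient is only approximate, so the deviation is expected to be small rather than exactly zero.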

Define $Z(\theta) = \sum_{a \in \mathcal{A}} e^{\beta \theta^T a}$. Note that
$$\frac{\partial}{\partial\theta_j} \log Z(\theta) = \frac{1}{Z(\theta)} \sum_{a : a_j = 1} \beta\, e^{\beta \theta^T a} = \beta \sum_{a : a_j = 1} \pi_\theta(a),$$
and define $\pi_\theta(a \text{ with } a_j = i) \triangleq \sum_{a : a_j = i} \pi_\theta(a)$ for $i \in \{0, 1\}$. In the following, we omit the factor $\beta$ as it will be nothing but a constant factor in the calculations. Straightforward calculation now yields

\begin{align*}
\frac{\partial}{\partial\theta_j} U(\pi_\theta)
&= \sum_{a \in \mathcal{A}} \frac{\partial}{\partial\theta_j} \pi_\theta(a)\, U(a) \\
&= \sum_{a \in \mathcal{A}} \pi_\theta(a)\, \frac{\partial}{\partial\theta_j} \log(\pi_\theta(a))\, U(a) \\
&= \sum_{a \in \mathcal{A}} \pi_\theta(a)\, \frac{\partial}{\partial\theta_j} \big( \theta^T a - \log Z(\theta) \big)\, U(a) \\
&= \sum_{a \in \mathcal{A}} \pi_\theta(a) \Big( a_j - \sum_{a' : a'_j = 1} \pi_\theta(a') \Big) U(a) \\
&= \sum_{a : a_j = 1} \pi_\theta(a)\, U(a) - \pi_\theta(a \text{ with } a_j = 1)\, U(\pi_\theta) \\
&= \big( 1 - \pi_\theta(a \text{ with } a_j = 1) \big) \sum_{a : a_j = 1} \pi_\theta(a)\, U(a + j) - \pi_\theta(a \text{ with } a_j = 1) \sum_{a : a_j = 0} \pi_\theta(a)\, U(a) \\
&= \pi_\theta(a \text{ with } a_j = 0)\, e^{\theta_j} \sum_{a : a_j = 0} \pi_\theta(a)\, U(a + j) - \pi_\theta(a \text{ with } a_j = 1) \sum_{a : a_j = 0} \pi_\theta(a)\, U(a) \\
&= \pi_\theta(a \text{ with } a_j = 1) \sum_{a : a_j = 0} \pi_\theta(a)\, U(a + j) - \pi_\theta(a \text{ with } a_j = 1) \sum_{a : a_j = 0} \pi_\theta(a)\, U(a) \\
&= \pi_\theta(a \text{ with } a_j = 1) \sum_{a : a_j = 0} \pi_\theta(a)\, [U(a + j) - U(a)] \\
&= \pi_\theta(a \text{ with } a_j = 1) \sum_{a \in \mathcal{A}} \pi_\theta(a)\, [U(a + j) - U(a)] \\
&= \pi_\theta(a \text{ with } a_j = 1)\, \Delta_j U(\pi_\theta).
\end{align*}

D Additional results from experiments

In addition to Tables 1 and 2, we include comprehensive tables from our experiments comprising the performance of the algorithms with respect to actual outcomes as well as predicted outcomes:

Table 4: Expected utility U(π) and residual EMCs Res(π) w.r.t. actual outcomes, expected utility U_f(π) and residual EMCs Res_f(π) w.r.t. predicted outcomes for linear utility, as well as inverse Simpson's index.

Algorithm     U(π)    Res(π)   U_f(π)   Res_f(π)    Div
Linear        24.53     2.00    23.70       0.31   10.01
Threshold     13.03     8.83    10.41       7.21   10.00
Greedy        25.07     1.62    23.98       0      10.14
Noisy-rank     1.25    26.80     1.01      23.00    6.47
Uniform       -4.10    14.10   -11.47      12.00   10.65

Table 5: Expected utility U(π) and residual EMCs Res(π) w.r.t. actual outcomes, expected utility U_f(π) and residual EMCs Res_f(π) w.r.t. predicted outcomes for log-linear utility over groups, as well as inverse Simpson's index.

Algorithm     U(π)    Res(π)   U_f(π)   Res_f(π)    Div2