Fair Policy Targeting ∗
Davide Viviano † Jelena Bradic ‡
May 27, 2020
Abstract
One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities on sensitive attributes such as age, gender, or race. This paper addresses the question of the design of fair and efficient treatment allocation rules. We adopt the non-maleficence perspective of “first do no harm”: we propose to select the fairest allocation within the Pareto frontier. We provide envy-freeness justifications to novel counterfactual notions of fairness. We discuss easy-to-implement estimators of the policy function, by casting the optimization into a mixed-integer linear program formulation. We derive regret bounds on the unfairness of the estimated policy function, and small sample guarantees on the Pareto frontier. Finally, we illustrate our method using an application from education economics.
Keywords:
Causal Inference, Welfare Programs, Pareto Optimal, Treatment Rules.

∗ We thank James Fowler, Yixiao Sun and Kaspar Wüthrich for comments and discussion. All mistakes are our own.
† Department of Economics, University of California at San Diego, La Jolla, CA, 92093. Email: [email protected].
‡ Department of Mathematics and Halıcıoğlu Data Science Institute, University of California at San Diego, La Jolla, CA, 92093. Email: [email protected].

arXiv [econ.EM]

Introduction
Heterogeneity in treatment effects, widely documented in the social sciences, motivates treatment allocation rules that assign treatments to individuals differently, based on observable characteristics (Manski, 2004; Murphy, 2003). However, targeting individuals may induce “disparities” across sensitive attributes, such as age, gender, or race. Motivated by evidence for preferences of policymakers towards non-discriminatory actions (Cowgill and Tucker, 2019), this paper designs fair and efficient targeting rules for applications in social welfare programs. We construct treatment allocation rules using data from experiments or quasi-experiments, and we develop new notions of optimality and counterfactual fairness for the construction of fair policies.

“Fair targeting” is a controversial task, due to the lack of consensus on the formulation of the decision problem and the definition of fairness. Conventional approaches consist of designing algorithmic decisions that maximize the expected utility across all individuals while imposing some form of axiomatic “fairness constraints” on the decision space of the policymaker; these constraints are usually imposed on the prediction of the algorithm. However, most of these approaches ignore the welfare effects of such constraints (Kleinberg et al., 2018). In particular, fairness constraints imposed on the decision space of the policymaker may ultimately lead to sub-optimal welfare for both sensitive groups. This is a crucial limitation when policymakers are concerned with the effects of their decisions on the utilities of each individual: we may not want to impose unnecessary constraints on the policy if such constraints are harmful for some or all individuals.

Motivated by these considerations, this paper advocates for Pareto optimal treatment rules: decision-makers must prefer allocations for which we cannot find any other policy that strictly improves welfare for one of the two sensitive groups without decreasing welfare for the opposite group.
We propose to select the fairest allocation within the Pareto frontier. Our approach is motivated by the intuitive notion of “first do no harm”: instead of imposing possibly harmful fairness constraints on the decision space, we restrict the set of admissible solutions to the Pareto optimal set and, among these, we choose the fairest one. We allow for general notions of fairness, and we also propose a notion of fairness for treatment allocation rules based on envy-freeness (Foley, 1967; Varian, 1976), making a clear connection between the classical microeconomic literature and the counterfactual interpretations of fairness. Our framework encompasses several applications in economics, including development programs (De Ree et al., 2017), educational programs (Muralidharan et al., 2019), health programs (Dupas, 2014), and cash-transfer programs (Egger et al., 2019), among others. We name our method Fair Targeting.

[Footnote: For a review, the reader may refer to Corbett-Davies and Goel (2018). Further discussion on the related literature is contained in Section 1.2.]

[Footnote: Pareto optimality has often been associated with the notion of fairness in economic theory (Varian, 1976), and it finds intuitive justifications under the rational preferences of the decision-maker (Noghin, 2006).]

[...] nested classes of linear decision rules in Figure 1. Details on the definition of welfare and the data can be found in Section 5. In the figure, the red line refers to the unconstrained linear decision rule, whereas the blue line constrains the welfare of female students to be larger than the welfare of male students and does not allow the gender attribute to be used in the decision process. Points denote the Pareto optimal allocations. The figure shows that imposing constraints directly on the decision rules leads to allocations that are always dominated by some other treatment allocation in an unconstrained environment.
The blue allocation is harmful for both types of individuals when compared to the three red dots in the figure. The proposed method instead estimates the Pareto optimal allocations in the least constrained environment (e.g., the red line) and, among these, it chooses the “fairest” one.

In summary, this paper contributes to the statistical treatment choice literature in two directions: (i) we introduce the notion of Pareto optimal treatment allocation rules under general notions of fairness, propose estimation procedures, and study their properties; we justify such an approach within a decision-theoretical framework; (ii) we adapt notions of envy-freeness to treatment rules and study statistical properties of the estimated policy. We now discuss each of these contributions in detail.

The decision problem consists of lexicographic preferences of the policymaker of the following form: (i) each Pareto optimal allocation strictly dominates allocations that are not Pareto optimal; (ii) a Pareto optimal allocation is strictly preferred to any other allocation if it satisfies additional fairness requirements. We identify the Pareto frontier as the set of maximizers over any weighted average of the welfares of each group. Therefore, such an approach embeds as a special case maximizing a weighted combination of welfares of each sensitive group (Kitagawa and Tetenov, 2018; Rambachan et al., 2020). However, the weights are not chosen a priori; instead, they are part of the decision problem and are directly selected to maximize fairness. This has important practical implications: the procedure is solely based on the notion of fairness adopted by the social planner, and it does not require specifying importance weights assigned to each sensitive group. These weights would be extremely hard to justify to the general public.

Estimation of the set of Pareto optimal allocations represents a key challenge since infinitely many policies may populate it.
In fact, (i) the set consists of maximizers over a continuum of weights between zero and one; (ii) each maximizer of the welfare (or a weighted combination of welfares) is often not unique (Elliott and Lieli, 2013). To overcome these issues, we show that the Pareto frontier can be approximated using simple linear constraints on the policy space. Such a result drastically simplifies the optimization algorithm: instead of estimating the entire set of Pareto allocations, we propose to maximize fairness of the policy function under easy-to-implement linear constraints. Our approach consists of a novel a priori multi-objective formulation (Deb, 2014). Differently from previous literature, we use a discretization argument, and we evaluate weighted combinations of the objective functions separately to construct a polyhedron that contains Pareto allocations. We provide theoretical guarantees on our approach, and we show that the distance between the Pareto frontier obtained via linear constraints and its population counterpart converges uniformly to zero at rate $1/\sqrt{n}$.

The choice of the allocation within the Pareto set depends on the notion of fairness adopted by the social planner. Inspired by the economic concept of envy-freeness, we propose a novel definition of fairness, which may be of independent interest. We interpret the policy assignment through the lens of the classical notion of “bundles” defined in microeconomics. For each policy function, we compare expected utilities that depend on its distribution under the same and the opposite sensitive attribute.
The proposed notion of fairness makes a connection to previous notions of counterfactual fairness (Chiappa, 2019; Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi and Shpitser, 2018). Differently from these references, (i) we provide a formal justification of fairness using an envy-freeness argument; (ii) we construct the definition of fairness based on the distributional impact of the treatment allocation rule on the welfare.

In our theoretical analysis, we also study “regret unfairness”. This denotes the differ- [...]

[Footnote: See for example also the discussion on relative importance weights contained in Kasy and Abebe (2020).]
This paper relates to a growing literature on statistical treatment rules (Armstrong and Shen, 2015; Athey and Wager, 2017; Bhattacharya and Dupas, 2012; Dehejia, 2005; Hirano and Porter, 2009; Kitagawa and Tetenov, 2018, 2019; Mbakop and Tabord-Meehan, 2016; Stoye, 2012; Tetenov, 2012; Viviano, 2019; Zhou et al., 2018). Further research on optimal treatment allocations also includes estimation of individualized optimal treatments via residuals weighting (Zhou et al., 2017), penalized methods (Qian and Murphy, 2011), inference on the welfare for optimal treatment strategies (Andrews et al., 2019; Labe et al., 2014; Luedtke and Van Der Laan, 2016; Rai, 2018), doubly robust methods for treatment allocations (Dudik et al., 2011; Zhang et al., 2012), reinforcement learning (Kallus, 2017; Lu et al., 2018), and dynamic treatment regimes (Murphy, 2003; Nie et al., 2019). Further connections are also related to the literature on classification, which includes Boucheron et al. (2005) and Elliott and Lieli (2013), among others. However, none of these discuss the design of fair decisions.

Fairness is a rising concern in economic applications; see, e.g., Cowgill and Tucker (2019), Kleinberg et al. (2018), Rambachan et al. (2020), and Rambachan and Roth (2019). These authors provide fundamental economic insights on the characteristics of optimal decision rules in the presence of discrimination bias. Here we answer the different question of the design and estimation of the optimal targeting rule, and we propose a statistical framework and derive properties of the method. In addition, the decision problem also differs, since here we advocate for a multi-objective, instead of a single-objective, utility function as in previous references.
Finally, Kasy and Abebe (2020) provide comparative statics on the impact of fairness on the welfare of individuals, and they discuss the analysis of algorithms for equity considerations.

In contrast, in computer science, Pareto optimality has been considered in the context of binary predictions by Balashankar et al. (2019) and Martinez et al. (2019), where the authors propose semi-heuristic and computationally intensive procedures for estimating Pareto efficient classifiers. Finally, Xiao et al. (2017) discuss the different problem of estimation of a Pareto allocation that trades off fairness and individual utilities for recommender systems, where the relative importance weights of the different objectives are selected a priori. The above references do not address the problem of policy targeting discussed in the current paper.

Additional references in computer science on fairness include Chouldechova (2017), Corbett-Davies et al. (2017), Dwork et al. (2012), Hardt et al. (2016), Feldman et al. (2015), and Kleinberg et al. (2016), among others. For a review and comparisons, the reader may refer to Corbett-Davies and Goel (2018) and Mitchell et al. (2018). The notion of fairness in previous references differs in the lack of a valid causal framework. For instance, Zemel et al. (2013) define fairness based on statistical independence of a classifier with respect to the sensitive attribute, whereas Donini et al. (2018) define a classifier as fair if its error is similar across the sensitive groups. Hossain et al. (2020) and references therein propose envy-freeness notions of fairness for applications in classification. However, such notions do not account for covariates across different sensitive groups being drawn from potentially different distributions, which instead justifies counterfactual notions of fairness. Fair classification has also been studied from different angles in Liu et al. (2017), who discuss stochastic fair bandits, and Ustun et al.
(2019), who propose decoupled estimation of tree classifiers. A related strand of literature discusses estimation of decision rules under fairness constraints within a causal framework (Chiappa, 2019; Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi et al., 2019; Nabi and Shpitser, 2018). All such references discuss the problem of maximizing some given objective function under fairness constraints. This may result in Pareto dominated allocations and, therefore, possibly harmful policies for both disadvantaged and advantaged groups (see Figure 1).
In this section we discuss the main framework and notation, the decision problem under consideration, and counterfactual notions of fairness.
For each unit, we denote by $S_i \in \mathcal{S}$ a sensitive or protected attribute that a social planner wishes to be cautious about. For expositional convenience only, we let $\mathcal{S} = \{0, 1\}$, where $S_i = 1$ is considered to be a “disadvantaged” or minority group. Extensions to multiple groups follow similarly to what is discussed in the current section. Each individual is also characterized by a vector of characteristics, $X_i \in \mathcal{X} \subseteq \mathbb{R}^p$. Such characteristics are assumed to be drawn conditional on the realization of the sensitive attribute, i.e.,

$$X_i = S_i X_i(1) + (1 - S_i) X_i(0), \tag{1}$$

where $X_i(0), X_i(1)$ denote the potential covariates, i.e., the covariates in the state where the attribute of individual $i$ is equal to $s$. Note that for each individual we only observe either $X_i(1)$ or $X_i(0)$ and never both. The participants of the study have either been enrolled in the social welfare program or not. We denote their treatment status with $D_i \in \{0, 1\}$, where $D_i = 1$ denotes that individual $i$ has been enrolled in the social program. We denote their post-treatment outcomes with $Y_i \in \mathcal{Y} \subseteq \mathbb{R}$, realized only once the sensitive attribute, covariates and the treatment assignment are realized.

Following the Neyman-Rubin potential outcome framework (Imbens and Rubin, 2015), we define $Y_i(d, s)$, $d \in \{0, 1\}$, i.e., $Y_i(0, 0), Y_i(0, 1), Y_i(1, 0), Y_i(1, 1)$, the potential outcomes that would have been observed respectively under treatment zero and one, for the protected attribute or the opposite one. Note that we only get to observe one of the four potential outcomes. Throughout the paper, the observed $Y_i$ satisfies the Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1990), with

$$Y_i = D_i S_i Y_i(1, 1) + D_i (1 - S_i) Y_i(1, 0) + (1 - D_i) S_i Y_i(0, 1) + (1 - D_i)(1 - S_i) Y_i(0, 0). \tag{2}$$

We assume that the vectors of outcomes, covariates, sensitive attributes and treatment assignments, $(Y_i, D_i, S_i, X_i)$, are independently and identically distributed (i.i.d.). We denote with $\mathbb{E}_{X(s)}$ the expectation with respect to the distribution of the potential covariates $X(s)$ only.

Next, we discuss modeling assumptions. The first condition is the unconfoundedness assumption.

Assumption 2.1 (Unconfoundedness). Let the following two conditions hold for all $s \in \{0, 1\}$, $d \in \{0, 1\}$: (A) $Y_i(d, s) \perp (D_i, S_i) \mid X_i(s)$; (B) $X_i(s) \perp S_i$.

Condition (A) states that the treatments are randomized based on covariates, independently of potential outcomes (Imbens, 2004). Similarly, Condition (B) assumes that the sensitive attribute is randomized independently of $X_i(s)$. We remark that such a condition allows for the dependence of the observed covariates and outcomes on the sensitive attribute. This condition holds when, for instance, $S_i$ is either randomized before everything else is observed, or $S_i$ denotes some immutable characteristic that cannot change given intervention on the other variables. The randomization assumption on $S_i$ is common in mediation analysis under the name of conditional ignorability (Imai et al., 2010). Figure 2 displays a directed acyclic graph under which Assumption 2.1 holds. In the figure, the variable $X_i$ acts as a mediator of the effect of changing the sensitive attribute.

[Footnote: Throughout our discussion we implicitly assume that the support of $X_i(1)$ and $X_i(0)$ is the same, namely $\mathcal{X}$.]

[Footnote: i.e., do-operation in the notation of Pearl (2009).]

[Figure 2: Example of a directed acyclic graph, with nodes $S_i$, $X_i$, $D_i$, $Y_i$, under which Assumption 2.1 holds.]

Assumption 2.1 can be equivalently stated after also conditioning on additional baseline characteristics that do not depend on the sensitive attribute $s$, such as, for instance, age.
Namely, we may consider an additional set of covariates $Z_i$ and assume that $X_i(s) \perp S_i \mid Z_i$. We omit such an extension for the sake of clarity only. Finally, we define

$$e(x, s) = P(D_i = 1 \mid X_i = x, S_i = s), \qquad p_s = P(S_i = s)$$

as the propensity score and the probability that individual $i$ has sensitive attribute $s$, respectively. The following condition is the overlap assumption, which is common in the causal inference literature.

Assumption 2.2 (Overlap). Let $e(X_i, s), p_s \in (\delta, 1 - \delta)$ almost surely, for some $\delta \in (0, 1)$ and all $s \in \{0, 1\}$.

We design a policy function to maximize social welfare. This paper takes a utilitarian perspective on social welfare (Manski, 2004); however, with fairness in mind, our focus shifts from an aggregate perspective to a “group perspective”, with special focus on the “protected” group.

We now formalize the problem. Given observables $(Y_i, X_i, D_i, S_i)$, we seek to design a policy function, or equivalently a treatment assignment,

$$\pi : \mathcal{X} \times \mathcal{S} \mapsto \{0, 1\}, \qquad \pi \in \Pi,$$

that depends on the individual characteristics and protected attribute alone. That is, a policy is a deterministic treatment assignment, which depends on covariates $X_i$ and on status $S_i$ only. Here, $\Pi$ denotes the function class of interest for the treatment assignment $\pi$. Such a class may be unconstrained, or it may reflect economic or legal constraints that restrict the decision space of the policy-maker.

For a given policy $\pi$ and a sensitive attribute $s$, the social welfare for such a group of individuals is defined as follows:

$$W_s(\pi) = \mathbb{E}\Big[ Y_i(1, s) \pi(X_i(s), s) + Y_i(0, s) \big(1 - \pi(X_i(s), s)\big) \Big]. \tag{3}$$

The above expression denotes the welfare within the group of individuals with sensitive attribute $s$. Under the standard properties of the propensity score (Imbens, 2000), the social welfare within each group is identified and can be expressed using inverse-probability-weighting estimators (Horvitz and Thompson, 1952).
The reader may refer to the Appendix for further discussion.

Under the standard utilitarian perspective (Manski, 2004), where welfare is aggregated across all the individuals, the estimand of interest solves

$$\arg\max_{\pi \in \Pi} \Big\{ p_1 W_1(\pi) + (1 - p_1) W_0(\pi) \Big\},$$

where $p_1$ is the fraction of the population for which $S_i = 1$. However, we observe that whenever the sensitive group is a minority group, such an approach assigns a small weight to the welfare of such individuals, disproportionately favoring the majority group.

An alternative approach is to maximize the welfare for each possible sensitive group (Ustun et al., 2019). However, in such a case, the resulting policy function may violate the constraints in $\Pi$, e.g., by violating anti-discrimination laws. A simple example is when the policy $\pi(x, s)$ is constrained to be constant in the sensitive attribute $s$.

Motivated by this consideration, we move away from a single-objective function to a multi-objective decision problem: the social planner aims to maximize the welfare of each sensitive group under the legal or economic constraints. We propose Pareto optimality as a suitable notion of efficiency, as it formalizes the trade-off between different, possibly contradicting, objectives. The planner then chooses within the set of Pareto allocations the least unfair policy.
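The effect of the utilitarian weights can be seen in a small numerical sketch. The two policies and their group welfares below are hypothetical numbers chosen for illustration only; the code simply evaluates $p_1 W_1(\pi) + (1 - p_1) W_0(\pi)$ for two values of $p_1$.

```python
# Hypothetical group welfares (W_1, W_0) for two candidate policies.
welfare = {
    "favor_majority": (0.10, 0.50),
    "favor_minority": (0.40, 0.45),
}


def utilitarian(w, p1):
    """Aggregate welfare p1 * W_1 + (1 - p1) * W_0."""
    return p1 * w[0] + (1 - p1) * w[1]


def utilitarian_choice(welfare, p1):
    """Name of the policy maximizing the utilitarian objective."""
    return max(welfare, key=lambda name: utilitarian(welfare[name], p1))


choice_minority_small = utilitarian_choice(welfare, p1=0.1)
choice_equal_shares = utilitarian_choice(welfare, p1=0.5)
```

With a minority share of 0.1 the objective selects the policy that is better for the majority, even though the other policy raises the welfare of group 1 substantially at a small cost to group 0; with equal shares the choice flips.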
Pareto optimality is driven by the idea that allocations are efficient when no-one can bemade better off without making somebody else worse off. Therefore, the notion of Paretooptimal policies is particularly suitable for fairness considerations (Varian, 1976).
Definition 2.1 (Pareto optimality). A policy function $\pi \in \Pi$ is called Pareto optimal with respect to the protected group $s$ if there is no other policy $\pi' \in \Pi$ such that $W_s(\pi) \le W_s(\pi')$ for all $s \in \{0, 1\}$ and $W_s(\pi) < W_s(\pi')$ for some $s \in \{0, 1\}$. We denote with $\Pi_o$ the set of all $\pi \in \Pi$ that are Pareto optimal.

[Footnote: A simple example where contradictory objectives take place is when treatment effects are heterogeneous in covariates $X_i$, with a positive effect over such covariates for one group and a negative effect for the other.]

The set of Pareto optimal choices contains all allocations for which the welfare of one of the two groups cannot be improved without reducing the welfare of the opposite group. The following lemma characterizes the set of Pareto optimal allocations.
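On a finite set of candidate policies, Definition 2.1 can be checked by direct pairwise comparison. The sketch below is illustrative only: the policy names and welfare pairs are hypothetical, and the paper's function class $\Pi$ is generally not finite.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if welfare pair a = (W_1, W_0) Pareto dominates b:
    weakly larger for both groups and strictly larger for at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_optimal(welfare: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of the candidate policies that no other candidate dominates."""
    return [
        name
        for name, w in welfare.items()
        if not any(dominates(v, w) for other, v in welfare.items() if other != name)
    ]


# Hypothetical candidates with (W_1, W_0) welfare pairs.
candidates = {
    "pi_a": (0.30, 0.10),
    "pi_b": (0.20, 0.25),
    "pi_c": (0.10, 0.30),
    "pi_d": (0.15, 0.20),  # worse than pi_b for both groups
}
frontier = pareto_optimal(candidates)
```

Here `pi_d` is excluded because `pi_b` is strictly better for both groups, while the remaining three candidates trade off the welfare of one group against the other.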
Lemma 2.1 (Pareto Frontier). The set $\Pi_o \subseteq \Pi$ is such that

$$\Pi_o = \Big\{ \pi_\alpha : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1 - \alpha) W_0(\pi), \ \alpha \in (0, 1) \Big\}. \tag{4}$$

The lemma is a direct application of the results of Negishi (1960), applied to the functional of interest. For the sake of completeness, the proof is included in the Appendix. Lemma 2.1 provides an intuitive interpretation: Pareto optimal allocations consist of policies that maximize a weighted combination of the welfares of the protected groups. Throughout our discussion we define

$$\bar{W}_\alpha = \sup_{\pi \in \Pi} \alpha W_1(\pi) + (1 - \alpha) W_0(\pi),$$

the largest value of the objective after weighting the welfares of each group by $\alpha$ and $1 - \alpha$.

The set of Pareto allocations generalizes notions of optimal treatment rules discussed in previous literature. Simple examples illustrate this point.

Example 2.1 (Utilitarian Welfare Maximization). Notice that the population equivalent of empirical welfare maximization (EWM) belongs to the Pareto frontier. Namely,

$$\arg\max_{\pi \in \Pi} \Big\{ p_1 W_1(\pi) + (1 - p_1) W_0(\pi) \Big\} \subseteq \Pi_o,$$

since the above expression maximizes a weighted average of welfares with weight $\alpha = p_1$. An alternative approach consists of choosing the weights in the above expression to reflect the relative importance of each group to the social planner. For instance, we may consider maximizing (Rambachan et al., 2020)

$$\arg\max_{\pi \in \Pi} \Big\{ \omega W_1(\pi) + (1 - \omega) W_0(\pi) \Big\} \subseteq \Pi_o$$

for some specific weight $\omega$. Clearly such an allocation belongs to the Pareto frontier.

Example 2.2 (Unrestricted Policy Function Class). Whenever the set $\Pi$ is unrestricted, the set of Pareto optimal allocations equals the set of policies that lead to the largest possible welfare within each group. Namely, such a set contains all the policies whose welfare equals the welfare under the first-best policy

$$\pi^{fb}(x, s) = 1\Big\{ m_{1,s}(x) - m_{0,s}(x) \ge 0 \Big\}.$$
Formally, the set reads as follows:

$$\Big\{ \pi : W_s(\pi) = \mathbb{E}\Big[ \big( Y_i(1, s) - Y_i(0, s) \big) \pi^{fb}(X_i(s), s) \Big], \ s \in \{0, 1\} \Big\}.$$

Remark 1 (Ancillary Fairness Restrictions on $\Pi$). Complementary restrictions may also be considered within the above framework. Following the discussion in Section 1, one such example of a constraint is predictive parity (Kasy and Abebe, 2020), which in this setting is interpreted as follows:

$$\pi : \mathbb{E}\Big[ Y_i(1, 1) - Y_i(0, 1) \,\Big|\, \pi(X_i, S_i) = 1, S_i = 1 \Big] = \mathbb{E}\Big[ Y_i(1, 0) - Y_i(0, 0) \,\Big|\, \pi(X_i, S_i) = 1, S_i = 0 \Big].$$

These additional restrictions can be directly incorporated in the function class $\Pi$. However, whereas we see legal or economic restrictions as often necessary in the construction of $\Pi$, additional restrictions imposed directly by the researcher and not by the policy-maker may induce sub-optimal allocations for both of the sensitive groups. In fact, observe that for two function classes $\Pi, \Pi'$, where $\Pi' \subseteq \Pi$, the following is true:

$$\sup_{\pi \in \Pi'} \Big\{ \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \Big\} \le \sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \Big\}. \tag{5}$$

Therefore, the Pareto frontier of a more restricted function class $\Pi'$ is always weakly dominated by the Pareto frontier under weaker constraints.

Pareto optimal allocations are often non-unique, allowing for flexibility in the choice of efficient policies. If there is a range of alternative policies that assign different weights to different groups, the social planner must appeal to some principle of preferential ranking.

We discuss here the decision problem of the social planner. In full generality we define

$$\mathrm{UnFairness} : \Pi \mapsto \mathbb{R},$$

a functional that maps a given policy function to the real line and that quantifies the level of unfairness of a given policy.
We leave such a functional unspecified and provide examples in the following paragraphs. Given a set of available policy functions $\Pi$, we define $C(\Pi)$ as the choice set of the decision-maker (Fudenberg and Tirole, 1991). The function $C$ denotes a choice function, where $C(\{\pi_1, \pi_2\}) = \pi_1$ is interpreted as $\pi_1$ being strictly preferred to $\pi_2$.

Definition 2.2 (Planner's Preferences). The planner's preferences satisfy the following three conditions:
(i) $C(\{\pi_1, \pi_2\}) = \pi_1$ if $W_1(\pi_1) \ge W_1(\pi_2)$ and $W_0(\pi_1) \ge W_0(\pi_2)$ and either (or both) of the two inequalities hold strictly;
(ii) for $\pi_1, \pi_2 \in \Pi_o$, $C(\{\pi_1, \pi_2\}) = \pi_1$ if $\mathrm{UnFairness}(\pi_1) < \mathrm{UnFairness}(\pi_2)$;
(iii) $C(\{\pi_1, \pi_2\}) = \{\pi_1, \pi_2\}$ if $\pi_1, \pi_2 \in \Pi_o$ and $\mathrm{UnFairness}(\pi_1) = \mathrm{UnFairness}(\pi_2)$.

[Footnote: Note that this is not the only definition that can be attributed to predictive parity, and alternative notions may also be considered.]

Definition 2.2 postulates that the set of optimal actions of the decision-maker is a subset of the Pareto optimal set. The above definition also states that within the set of Pareto optimal allocations, the planner strictly prefers a given policy based on fairness considerations. Intuitively, the above definition models preferences lexicographically: (i) each allocation is strictly preferred to the ones for which welfare on both groups is strictly lower; (ii) allocations are then ranked based on unfairness considerations. Under Definition 2.2, the decision problem takes the following form.

Lemma 2.2 (Social Planner's Decision Problem). Under rational preferences, with preferences as discussed in Definition 2.2, the following holds: each $\pi^\star \in C(\Pi)$ solves

$$\pi^\star \in \arg\inf_{\pi \in \Pi} \mathrm{UnFairness}(\pi) \quad \text{subject to} \quad \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \ge \bar{W}_\alpha \ \text{ for some } \alpha \in (0, 1). \tag{6}$$

Lemma 2.2 provides a formal characterization of the social planner's decision problem, and the proof is contained in the Appendix.
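The structure of the problem in Equation (6) can be sketched on a finite candidate set: a policy is feasible if it attains $\bar{W}_\alpha$ for at least one weight $\alpha$, and the least unfair feasible policy is selected. The candidates, welfare pairs, unfairness values, and the finite list of weights below are all hypothetical and serve only as an illustration; the paper works with a general class $\Pi$ and a continuum of weights.

```python
from typing import Dict, Iterable, Set, Tuple


def fairest_feasible(
    welfare: Dict[str, Tuple[float, float]],
    unfairness: Dict[str, float],
    alphas: Iterable[float],
    tol: float = 1e-12,
) -> Tuple[str, Set[str]]:
    """Return the least unfair policy among those attaining the largest
    weighted welfare alpha * W_1 + (1 - alpha) * W_0 for some alpha."""
    feasible: Set[str] = set()
    for a in alphas:
        scores = {k: a * w1 + (1 - a) * w0 for k, (w1, w0) in welfare.items()}
        w_bar = max(scores.values())  # plays the role of W-bar_alpha
        feasible |= {k for k, sc in scores.items() if sc >= w_bar - tol}
    best = min(feasible, key=lambda k: unfairness[k])
    return best, feasible


# Hypothetical candidates: (W_1, W_0) pairs and unfairness values.
candidates = {
    "pi_a": (0.30, 0.10),
    "pi_b": (0.20, 0.25),
    "pi_c": (0.10, 0.30),
    "pi_d": (0.15, 0.20),
}
unfair = {"pi_a": 0.20, "pi_b": 0.05, "pi_c": 0.12, "pi_d": 0.00}

chosen, feasible_set = fairest_feasible(candidates, unfair, [0.1, 0.3, 0.5, 0.7, 0.9])
```

In this toy example `pi_d` has the lowest unfairness but is never selected, because it does not attain the largest weighted welfare for any weight; the procedure returns `pi_b`, the least unfair of the feasible candidates.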
The problem consists of minimizing the unfairness criterion of the policy function, under the condition that the policy is Pareto optimal. The social planner does not maximize a weighted combination of welfares with some pre-specified and hard-to-justify weighting scheme. Instead, the relative importance of each group (i.e., $\alpha$) is implicitly chosen within the optimization problem to minimize the unfairness criterion. Such an approach allows for a transparent choice of the policy based on the definition of fairness adopted by the social planner.

Remark 2 (Allowing for Exploration). One key question is whether we may loosen the constraint on Pareto optimality to substantially decrease the unfairness of the estimated policy. For instance, the policy-maker may be willing to find a policy that is only “approximately” Pareto optimal, i.e., whose resulting welfares are within the orange regions, as long as it minimizes the unfairness criterion. Such an approach can be easily implemented by imposing slacker constraints on social welfare than the ones in Equation (6). Simple examples are constraints of the form $\alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \ge \bar{W}_\alpha - \varepsilon$ for some $\varepsilon > 0$.

Next, we propose a notion of fairness. However, we remark that alternative notions of fairness that the policy-maker may prefer to adopt can also be used with the proposed method.

To define fairness, we make a connection between counterfactual notions of fairness and the economic notion of envy (Foley, 1967). Envy refers to the concept that “an allocation is equitable if and only if no agent prefers another agent's bundle to his own” (Varian, 1976).
We build on such a notion to construct the measure of unfairness, where, in the context under consideration, “bundles” assigned to individuals refer to the treatment assignment $\pi(X_i, S_i)$, and utilities depend on the distribution of such assignments.

We start our discussion with the following definition:

$$V_{\pi(x, s_1)}(x, s_2) = \mathbb{E}\Big[ \pi(x, s_1) Y_i(1, s_2) + \big(1 - \pi(x, s_1)\big) Y_i(0, s_2) \,\Big|\, X_i(s_2) = x \Big]. \tag{7}$$

The above expression denotes the conditional welfare for the policy function being assigned to the opposite attribute. Namely, it defines the effect generated by $\pi(x, s_1)$ for type $s_2$, conditional on the value of the covariates.

We say that the agent with attribute $s_2$ envies the agent with attribute $s_1$ if

$$\mathbb{E}_{X(s_1)}\Big[ V_{\pi(X(s_1), s_1)}\big( X(s_1), s_2 \big) \Big] > W_{s_2}(\pi). \tag{8}$$

Intuitively, the agent with attribute $s_2$ envies the one with the opposite attribute if, after changing her distribution of covariates and treatment assignment to be those of the opposite attribute, the resulting welfare is strictly larger than her current welfare.

Motivated by the above definition, we measure the unfairness towards an individual with attribute $s_2$ as

$$\mathcal{A}(s_1, s_2; \pi) = \mathbb{E}\Big[ V_{\pi(X(s_1), s_1)}\big( X(s_1), s_2 \big) \Big] - W_{s_2}(\pi). \tag{9}$$

Here $W_{s_2}(\pi)$ is as defined in Equation (3). Observe that the above measure denotes the difference between the welfare of group $s_2$ under, respectively, the opposite and the same distribution of covariates and treatment assignments. The above notion of fairness showcases the advantage of embedding the problem within a causal inference framework: we compare utilities by letting covariates and treatment assignments be drawn from the distribution under the opposite sensitive attribute.

Since we aim not to discriminate in either direction (women with respect to men and vice versa), we define unfairness by taking the sum of the effects $\mathcal{A}(s_1, s_2; \pi)$ and $\mathcal{A}(s_2, s_1; \pi)$.
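The quantities in Equations (7) to (9) can be computed exactly in a toy population. The sketch below assumes a single binary covariate, known conditional means $m_{d,s}(x) = \mathbb{E}[Y_i(d, s) \mid X_i(s) = x]$, and known distributions of the potential covariates; all numbers and the policy `treat_if_x1` are hypothetical.

```python
# px[s][x] = P(X(s) = x): distribution of the potential covariate in group s.
px = {1: {0: 0.7, 1: 0.3}, 0: {0: 0.3, 1: 0.7}}

# m[(d, s)][x] = E[Y(d, s) | X(s) = x]: conditional mean outcomes.
m = {
    (1, 1): {0: 1.0, 1: 2.0},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 1.0, 1: 3.0},
    (0, 0): {0: 1.0, 1: 1.0},
}


def cross_welfare(pi, s1, s2):
    """E_{X(s1)}[V_{pi(X(s1), s1)}(X(s1), s2)]: covariates and assignments
    drawn as in group s1, outcomes generated as in group s2."""
    return sum(
        p * (pi(x, s1) * m[(1, s2)][x] + (1 - pi(x, s1)) * m[(0, s2)][x])
        for x, p in px[s1].items()
    )


def welfare(pi, s):
    """W_s(pi) as in Equation (3)."""
    return cross_welfare(pi, s, s)


def envy(pi, s1, s2):
    """A(s1, s2; pi) as in Equation (9): unfairness towards group s2."""
    return cross_welfare(pi, s1, s2) - welfare(pi, s2)


def treat_if_x1(x, s):
    """Hypothetical policy: treat whenever x = 1, regardless of s."""
    return int(x == 1)


a_towards_0 = envy(treat_if_x1, 1, 0)  # group 0 compared with group 1's bundle
a_towards_1 = envy(treat_if_x1, 0, 1)  # group 1 compared with group 0's bundle
```

In this setup the measure towards group 1 is positive (about 0.6), so group 1 envies group 0 in the sense of Equation (8), while the measure towards group 0 is negative (about -0.8).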
Such an approach builds on the notion of “social envy” discussed in Feldman and Kirman (1974), where envy is defined as a weighted combination of the envy of each group. Namely, we propose the following notion of unfairness:

$$\mathrm{UnFairness}(\pi) = \mathcal{A}(1, 0; \pi) + \mathcal{A}(0, 1; \pi). \tag{10}$$

Note that the above definition induces a partial ordering on the set of Pareto optimal allocations.

Motivated by the above definition, we propose an estimand which finds the fairest allocation within the set of Pareto optimal policies.

Fair Targeting
In this section, we discuss the estimation and optimization problem; namely, we aim to construct an estimator of $\pi^\star$ in Equation (6). We design a plug-in estimator, where the population quantities are replaced with suitably designed sample equivalents. Letting $\mathcal{A}_n$ and $\hat{\Pi}_o$ denote the sample approximations of the population quantities $\mathcal{A}$ and $\Pi_o$, which are discussed in the following paragraphs, our Fair Targeting method solves

$$\hat{\pi} \in \arg\inf_{\pi \in \hat{\Pi}_o} \big\{ \mathcal{A}_n(1, \pi) + \mathcal{A}_n(0, \pi) \big\}. \tag{11}$$

One of the key challenges induced by the estimation procedure is the presence of a possibly infinite-dimensional set of Pareto optimal allocations. To overcome this issue, we represent the Pareto frontier using a set of linear constraints on the policy function, without directly constructing the entire set of Pareto optimal policies. Such an approach permits us to find efficient solutions without the need to exhaustively search over the entire set of Pareto allocations.

We start by introducing some necessary notation. We let

$$m_{d,s}(x) = E\big[ Y_i(d,s) \,\big|\, X_i(s) = x \big] = E\big[ Y_i(d,s) \,\big|\, X_i = x, S_i = s \big]$$

be the conditional mean of the group $s$ under treatment status $d$. We denote the doubly robust score (Robins et al., 1994) as

$$\Gamma_{d,s,i} = \frac{1\{S_i = s\}\, 1\{D_i = d\}}{p_s P(D_i = d \mid X_i, S_i)} \big( Y_i - m_{d,s}(X_i) \big) + m_{d,s}(X_i) \tag{12}$$

and $\hat{\Gamma}_{d,s,i}$ its estimated counterpart. We estimate the welfare as

$$\hat{W}_s(\pi) = \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_{1,s,i}\, \pi(X_i, s) + \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_{0,s,i}\, \big(1 - \pi(X_i, s)\big). \tag{13}$$

Finally, we estimate $\mathcal{A}(\cdot)$ in Equation (9) as follows:

$$\mathcal{A}_n(s, s'; \pi) = \frac{1}{n \hat{p}_s} \sum_{i : S_i = s} \Big\{ \hat{m}_{1,s'}(X_i)\, \pi(X_i, s) + \hat{m}_{0,s'}(X_i)\, \big(1 - \pi(X_i, s)\big) \Big\} - \hat{W}_{s'}(\pi). \tag{14}$$

The estimators of the nuisances $\hat{m}_{d,s}(\cdot)$, $\hat{e}(\cdot)$, $\hat{p}_s$ are built upon the causal inference and semi-parametric literature (Bang and Robins, 2005; Hahn, 1998; Newey, 1990, 1994; Robins and Rotnitzky, 1995). Each of these components is estimated via cross-fitting (Chernozhukov et al., 2018; Chiang et al., 2019; Schick, 1986). One needs to carefully consider the protected attribute in this procedure. Two alternative cross-fitting procedures are available to the researcher. The first one consists in dividing the sample into $K$ folds and estimating the conditional mean $\hat{m}^{(-k)}_{d,s}(X_i)$ using observations for which $S = s$ only, after excluding the fold $k$ corresponding to unit $i$. Such an approach does not impose parametric restrictions on the dependence of $m_{d,s}$ on the attribute $s$, at the expense of shrinking the effective sample size used for estimation. The second approach consists in imposing additional parametric restrictions on the dependence of $m_{d,s}$ on $s$ and using all observations in all folds except $k$ for estimating $\hat{m}^{(-k)}_{d,s}(X_i)$. A graphical representation is depicted in Figure 3.

Formally, for the first approach, let $i \in I_k \cap \mathcal{S}$, where $I_k$ is the $k$-th fold of the data and $\mathcal{S} = \{i : S_i = s\}$. Let $\hat{m}^{(-k)}_{d,s}$ be an estimator obtained using samples with $S_j = s$ that are not in the fold $k$, that is, $j \in I_k^c \cap \mathcal{S}$; for example, by a random forest or linear regression of $Y_j$ onto $X_j$ for $S_j = s$ and $j \notin I_k$.

Figure 3: Graphical representation of cross-fitting under two alternative model formulations. The light gray area is the training set, used to construct an estimator of $\hat{m}_{d,s=1}$, whereas the darker gray area is the evaluation set, the area in which a prediction of $\hat{m}_{d,s=1}$ is computed. The panel on the left refers to the case where no parametric assumptions are imposed on the dependence of $m_{d,s}$ on $s$. The panel on the right uses both groups' information to estimate the conditional mean function by imposing additional parametric assumptions on the dependence of $m_{d,s}$ on $s$. (Panel labels: $S = 1$, $S = 0$, $I_k$.)

Next, we characterize the Pareto frontier using linear inequalities. We define the empirical counterpart of Equation (4) as follows:

$$\Pi_{o,n} = \Big\{ \pi_\alpha \in \Pi : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \big\{ \alpha \hat{W}_1(\pi) + (1 - \alpha) \hat{W}_0(\pi) \big\}, \ \text{s.t. } \alpha \in (0, 1) \Big\}. \tag{15}$$

Clearly, such a set cannot be directly estimated since it contains solutions to uncountably many optimization problems. Therefore, as a first step, we discretize such a set, and we construct a grid of equally spaced values $\alpha_j \in (0, 1)$, $j \in \{1, \dots, N\}$, denoted $\{\alpha_1, \dots, \alpha_N\}$, with $N = \sqrt{n}$. We approximate $\Pi_{o,n}$ using the discretized set

$$\hat{\Pi}_o = \Big\{ \pi_\alpha \in \Pi : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \big\{ \alpha \hat{W}_1(\pi) + (1 - \alpha) \hat{W}_0(\pi) \big\}, \ \text{s.t. } \alpha \in \{\alpha_1, \dots, \alpha_N\} \Big\}. \tag{16}$$

The choice of the grid is arbitrary, as long as each value is far from the next by no more than a small $\varepsilon$-value. A natural choice for the construction of the grid consists in starting with $\alpha_j = \hat{p}_1$ and then choosing the remaining $N - 1$ values with spacing $\varepsilon$. Such an approach guarantees, for instance, that the solution of the EWM method is a feasible solution under Fair Targeting.

Unfortunately, the set $\hat{\Pi}_o$ may be hard, if not impossible, to directly estimate. In fact, for each $\alpha$ we may have uncountably many solutions (Elliott and Lieli, 2013; Manski and Thompson, 1989). Therefore, instead of directly estimating such a set, we characterize $\hat{\Pi}_o$ using simple linear constraints. The key idea consists in imposing that, for some weight $\alpha_j$, the weighted combination of empirical welfares is “large enough” to guarantee Pareto optimality. We formalize this approach in the following lines.

The first step consists in finding the maximum empirical welfare achieved on the “discretized” Pareto frontier by finding

$$\bar{W}_{j,n} = \sup_{\pi \in \Pi} \big\{ \alpha_j \hat{W}_1(\pi) + (1 - \alpha_j) \hat{W}_0(\pi) \big\}, \tag{17}$$

for each $j \in \{1, \dots, N\}$. This can be solved using exact methods such as mixed-integer programs (Kitagawa and Tetenov, 2018) or recursive search algorithms (Zhou et al., 2018). Notice that these references do not discuss estimation of the Pareto frontier, and we use their results as a tool for computing $\bar{W}_{j,n}$ only. To guarantee Pareto optimality it is sufficient to impose that a weighted combination of the welfares is at least as large as $\bar{W}_{j,n}$, for some $j \in \{1, \dots, N\}$.

Observe now that the set $\hat{\Pi}_o$ in (16) can be characterized using the following representation:

$$\forall \pi \in \hat{\Pi}_o \ \ \exists j \in \{1, \dots, N\} \ \text{such that} \ \alpha_j \hat{W}_{1,n}(\pi) + (1 - \alpha_j) \hat{W}_{0,n}(\pi) \ge \bar{W}_{j,n}. \tag{18}$$

That is, $\hat{\Pi}_o$ contains all policies $\pi$ that achieve the maximum value of weighted welfares, $\bar{W}_{j,n}$, for some $j$. Therefore, whereas we may not be able to find all the policies in the set $\hat{\Pi}_o$, we can still represent such a set using simple linear constraints on the estimated policy function.

Given the set of linear inequalities, we aim to formulate the program as a mixed-integer linear program, for a general class of policy functions admitting linear program formulations (Bertsimas and Dunn, 2017; Florios and Skouras, 2008). To do so, we introduce the variables

$$z_s = (z_{s,1}, \cdots, z_{s,n}), \quad z_{s,i} = \pi(X_i, s), \quad \pi \in \Pi. \tag{19}$$

These variables have a simple representation for general classes of policy functions, such as linear rules (see for example Chen and Lee (2018); Kitagawa and Tetenov (2018)). In addition, we define a vector

$$u = (u_1, \dots, u_N) \in \{0, 1\}^N \tag{20}$$

that encodes the locations on the grid of $\alpha$ at which the supremum in (18) is reached; $u_j = 1$ whenever the constraint in Equation (18) holds for $\alpha_j$. The chosen policy must be Pareto optimal, i.e., $u_j$ must be equal to one for at least one $j$. To ensure this, we introduce the simple constraint $\sum_{j=1}^{N} u_j \ge 1$, together with, for each $j$,

$$u_j \alpha_j \big\langle \hat{\Gamma}_{1,1} - \hat{\Gamma}_{0,1}, z_1 \big\rangle + u_j (1 - \alpha_j) \big\langle \hat{\Gamma}_{1,0} - \hat{\Gamma}_{0,0}, z_0 \big\rangle + \big\langle \hat{\Gamma}_{0,1} + \hat{\Gamma}_{0,0}, \mathbf{1} \big\rangle - u_j n \bar{W}_{j,n} \ge 0, \tag{21}$$

where $\hat{\Gamma}_{d,s} = (\hat{\Gamma}_{d,s,1}, \dots, \hat{\Gamma}_{d,s,n})$, and $\langle \cdot, \cdot \rangle$ denotes the inner product. The above constraint, together with (20) and $\sum_{j=1}^{N} u_j \ge 1$, ensures that the assignment $z_{1,i}, z_{0,i}$ is Pareto optimal.

Combining our results, we obtain the following quadratic program formulation:

$$\min_{z_1, z_0, u, \pi} \ \mathcal{A}_n(1, \pi) + \mathcal{A}_n(0, \pi) \tag{22}$$

subject to

(A) $z_{s,i} = \pi(X_i, s)$
(B) $u_j \alpha_j \langle \hat{\Gamma}_{1,1} - \hat{\Gamma}_{0,1}, z_1 \rangle + u_j (1 - \alpha_j) \langle \hat{\Gamma}_{1,0} - \hat{\Gamma}_{0,0}, z_0 \rangle + \langle \hat{\Gamma}_{0,1} + \hat{\Gamma}_{0,0}, \mathbf{1} \rangle \ge u_j n \bar{W}_{j,n}$
(C) $\langle \mathbf{1}, u \rangle \ge 1$
(D) $\pi \in \Pi$
(E) $z_{1,i}, z_{0,i}, u_j \in \{0, 1\}$, $1 \le i \le n$, $1 \le j \le N$.

Remark 3 (Existence of Feasible Solutions). The optimization problem always admits a feasible solution. In fact, observe that for each $\alpha_j$, there exists by definition a policy $\pi_j$ such that $\alpha_j \hat{W}_{1,n} + (1 - \alpha_j) \hat{W}_{0,n} = \bar{W}_{j,n}$. Such a policy satisfies the constraints in the optimization with $u_j = 1$.

We formalize the validity of the above formulation in the following theorem.

Theorem 3.1. The solution sets of Equation (22) and Equation (11) are equal.
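To make the objects entering Equations (13), (17), (18) and (11) concrete, the following is a minimal Python sketch. It is not the mixed-integer program above: it assumes a small finite list of candidate policies (a hypothetical simplification of $\Pi$) so that the suprema in Equation (17) can be found by enumeration, and it takes the estimated scores $\hat{\Gamma}_{d,s,i}$ and the unfairness values $\mathcal{A}_n(1,\pi) + \mathcal{A}_n(0,\pi)$ as given inputs. The function names (`welfare`, `alpha_grid`, `fair_targeting`) and all numbers are illustrative assumptions.

```python
def welfare(gamma1, gamma0, z):
    """Eq. (13): average of Gamma_{1,s,i} * z_i + Gamma_{0,s,i} * (1 - z_i)."""
    n = len(z)
    return sum(g1 * zi + g0 * (1 - zi)
               for g1, g0, zi in zip(gamma1, gamma0, z)) / n


def alpha_grid(n_points):
    """Equally spaced weights strictly inside (0, 1)."""
    return [(j + 1) / (n_points + 1) for j in range(n_points)]


def fair_targeting(gamma, policies, unfairness, alphas, tol=1e-12):
    """Pick the least unfair policy among weighted-welfare maximizers.

    gamma[(d, s)] : list of estimated doubly robust scores (assumed given)
    policies      : finite list of (z1, z0), with z_s[i] = pi(X_i, s)
    unfairness    : unfairness value of each candidate (assumed given)
    """
    # Group welfares (W_1, W_0) of every candidate policy.
    w = [(welfare(gamma[(1, 1)], gamma[(0, 1)], z1),
          welfare(gamma[(1, 0)], gamma[(0, 0)], z0))
         for z1, z0 in policies]
    pareto = set()
    for a in alphas:
        vals = [a * w1 + (1 - a) * w0 for w1, w0 in w]
        w_bar = max(vals)  # Eq. (17), by enumeration
        # Eq. (18): keep every policy attaining the maximum for this alpha_j.
        pareto.update(k for k, v in enumerate(vals) if v >= w_bar - tol)
    # Eq. (11): minimize unfairness over the retained set.
    best = min(pareto, key=lambda k: unfairness[k])
    return best, sorted(pareto), w


# Toy inputs (hypothetical): two units, rules that ignore the group label.
gamma = {(1, 1): [1.0, 0.5], (0, 1): [0.0, 0.0],
         (1, 0): [0.0, 0.0], (0, 0): [1.0, 0.25]}
policies = [(z, z) for z in ([0, 0], [1, 0], [0, 1], [1, 1])]
unfairness = [0.3, 0.0, 0.1, 0.2]

best, pareto, w = fair_targeting(gamma, policies, unfairness, alpha_grid(3))
```

In this toy input, the candidate with the lowest unfairness overall (index 1) never maximizes the weighted welfare on the grid, so it is excluded, and the sketch returns the least unfair candidate among those retained. With a rich policy class, enumeration is infeasible, which is what motivates the mixed-integer formulation in Equation (22).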
Theorem 3.1 shows that the optimization problem can be written as a mixed-integer program with constraints that depend on the function class of interest. Such constraints read as mixed-integer linear constraints for several function classes such as optimal trees and maximum score methods (Bertsimas and Dunn, 2017; Florios and Skouras, 2008). In addition, the objective function is linear in the decision variables. Constraints (A), (C), (E) are mixed-integer linear constraints, whereas Constraint (B) is quadratic. However, notice that we can further rewrite (B) as a set of linear constraints at the expense of introducing additional $Nn$ binary variables and $2Nn$ additional constraints (e.g., see Viviano (2019); Wolsey and Nemhauser (1999)). We omit such an extension for the sake of brevity only.
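As an illustration of how a product of binary variables such as $u_j z_{s,i}$ in Constraint (B) can be made linear, the sketch below checks the textbook linearization by enumeration: a new binary variable $w$ is constrained by $w \le u$, $w \le z$ and $w \ge u + z - 1$. This is a standard device and not necessarily the construction used in the references cited above, so the numbers of variables and constraints it produces need not match those quoted in the text. The name `feasible_w` is hypothetical.

```python
from itertools import product


def feasible_w(u, z):
    """Binary values w allowed by the three linear inequalities, given u and z."""
    return [w for w in (0, 1) if w <= u and w <= z and w >= u + z - 1]


# For every binary pair (u, z), list the values of w the inequalities allow.
checks = {(u, z): feasible_w(u, z) for u, z in product((0, 1), repeat=2)}
```

For each of the four binary pairs, the only feasible value is $w = u z$, so replacing the product by $w$ and adding the three inequalities leaves the feasible set unchanged.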
Next, we formalize the finite sample theoretical properties of the solution in Equation (22). The following assumption is imposed throughout our discussion.

Assumption 4.1. Suppose that the following conditions hold.
(A) $Y_i(d, s) \in [-M, M]$, for some $M < \infty$, for all $d \in \{0, 1\}$, $s \in \{0, 1\}$;
(B) $\Pi$ has finite VC-dimension, denoted as $v$;
(C) $\Pi$ is pointwise measurable.

Condition (A) states that the outcome is uniformly bounded by a universal constant $M < \infty$. Importantly, the implementation of Fair Targeting does not require knowledge of such a constant. Condition (B) imposes a restriction on the function class of the policy function. The finite VC-dimension condition is common in many areas of research (see Devroye et al. (2013) for further reference). The VC dimension denotes the cardinality of the largest set of points that the function class $\Pi$ can shatter, and it is commonly used to measure the complexity of a class. General examples where such an assumption holds for Empirical Welfare Maximization can be found in Mbakop and Tabord-Meehan (2016), among others. Condition (C) ensures the measurability of the supremum of the empirical process of interest, similarly, for instance, to Rai (2018). A pointwise measurable class satisfies the following condition: for each $\pi \in \Pi$, there exists a countable subset $\mathcal{G}$ such that there exists a sequence $\pi_m \in \mathcal{G}$ with $\pi_m \to \pi$ uniformly as $m \to \infty$ (Chernozhukov et al., 2014). Simple examples where the finite VC-dimension condition holds are linear decision rules (Manski, 1975) and decision trees (Loh, 2011). In the former case, the VC-dimension is bounded by the number of covariates, whereas in the latter case it is bounded by the exponential of the number of layers in the tree (Athey and Wager, 2017; Zhou et al., 2018).

4.1 Guarantees on the Pareto Frontier

In the following lines, we impose consistency conditions on the estimated nuisance parameters.
Assumption 4.2.
Assume that for some ξ ≥ / , ξ ≥ /
4, the following holds: E (cid:104)(cid:16) ˆ m d,s ( X i ) − m d,s ( X i ) (cid:17) (cid:105) = O ( n − ξ ) , E (cid:104)(cid:16) (cid:46) ˆ p s ˆ e ( X i , s ) − (cid:46) p s e ( X i , s ) (cid:17) (cid:105) = O ( n − ξ ) . (23)for all s ∈ { , } , d ∈ { , } , where X i is out-of-sample. Assume in addition that for some H , and n ≥ H , with probability one,sup x ∈X ,s ∈S ,d ∈{ , } | ˆ m d,s ( x ) − m d,s ( x ) | ≤ M, sup x ∈X ,s ∈S (cid:12)(cid:12)(cid:12) p s e ( x, s ) − p s ˆ e ( x, s ) (cid:12)(cid:12)(cid:12) ≤ /δ . (24)Assumption 4.2 assumes that the product of the mean-squared error of the propensityscore and conditional mean function converges at the parametric rate n − . Such a conditionis standard in the doubly-robust literature (Chernozhukov et al., 2018; Farrell, 2015; Newey,1994; Robins and Rotnitzky, 1995). Since we only consider rates (and not constants whichdepend on the variance component) in the upper bound, we find that uniform consistencyis not necessary for the derivation of the regret bound. Instead, we state the stabilitystatement as a finite-sample condition, which allows the characterization of small sampleproperties on the regret. The condition can be alternatively stated asymptotically (Atheyand Wager, 2017), whereas in such a case, all the remaining results should be interpretedin the asymptotic sense only.As the first step for our results, we derive uniform convergence of the estimated welfarewith respect to its population counterpart, weighted by α -weights. Theorem 4.1.
Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least − / √ n − γ , for some H > , and n ≥ H , γ > α ∈ (0 , sup π ∈ Π (cid:12)(cid:12)(cid:12) αW ( π ) + (1 − α ) W ( π ) − max α j ∈{ α ,...,α N } α j ˆ W ( π ) − (1 − α j ) ˆ W ( π ) (cid:12)(cid:12)(cid:12) ≤ ¯ C Mδ (cid:16)(cid:112) v/n + (cid:112) log(2 /γ ) /n + 1 /N (cid:17) for a universal constant ¯ C < ∞ . Theorem 4.1 guarantees uniform convergence rate of the empirical Pareto frontier con-structed via discretization to its population counterpart at the optimal 1 / √ n rate. Theproof is contained in the Appendix.Next, we characterize the distance between the estimated set ˆΠ o in Equation (18), ob-tained using a discretization argument and taking into account estimation of the empirical19elfares, and the set of Pareto optimal policies Π o . To quantify such a difference, weintroduce the following measure: d (Π o , ˆΠ o ; α ) = sup π ∈ Π o (cid:110) αW ( π ) + (1 − α ) W ( π ) (cid:111) − sup π ∈ ˆΠ o (cid:110) αW ( π ) + (1 − α ) W ( π ) (cid:111) . (25)Equation (25) quantifies the distance between the population Pareto frontier and the empirical Pareto frontier, obtained via discretization . Under the above conditions, we canstate the following theorem.
Theorem 4.2.
Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least − / √ n − γ , for some H > , and n ≥ H , γ > , for some universal constants c, c (cid:48) < ∞ , sup α ∈ (0 , d (Π o , ˆΠ o ; α ) ≤ cMδ (cid:114) v log( c (cid:48) /γ ) n + c Mδ N . (26)The proof is in the Appendix.Theorem 4.2 ensures that the empirical
Pareto frontier $\hat{\Pi}_o$, estimated using the discretization step, is close in terms of welfare effects to the population one, up to a small error factor which scales at the rate $\sqrt{1/n} + 1/N$; here $N$ denotes the number of elements in the discretized Pareto frontier. Such a result guarantees that the set $\hat{\Pi}_o$, obtained via the discretization argument and estimated using the empirical welfares, “well-approximates” the Pareto optimal set $\Pi_o$ in Equation (4) when the sample size is large enough. To the best of our knowledge, this is the first result of this type for Pareto optimal policies. In the following theorem, we characterize the rate for an explicit choice of $N$.

Theorem 4.3.
Suppose that we consider a grid with N = √ n many elements. Then underAssumptions 2.1, 2.2, 4.1, 4.2, the following hold: for a universal constant ¯ C < ∞ P (cid:16) sup α ∈ (0 , d (Π o , ˆΠ o ; α ) ≤ ¯ CMδ (cid:114) v log( c (cid:48) n ) n (cid:17) ≥ − / √ n. (27) Theorem 4.2 provides guarantees on the Pareto set, whereas, in this section, we discussguarantees on the unfairness of the estimated policy function, with respect to the least un-fair policy within the set of truly Pareto optimal policy. The following additional conditionsare required for the derivation of the theorem.
Assumption 4.3.
For each $\alpha \in (0, 1)$, if $\pi_\alpha, \pi'_\alpha \in \arg\sup_{\pi \in \Pi} \big\{ \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \big\}$, then $\pi_\alpha(x, s) = \pi'_\alpha(x, s)$ for all $x \in \mathcal{X}$, $s \in \{0, 1\}$.

The above condition imposes that, for each element in the Pareto frontier, the supremum at the population level is attained by a unique policy. We remark that such a condition is not imposed on the maximum of the empirical counterpart, which instead may not satisfy the above condition. The second condition is the consistency assumption on the conditional mean function, given the opposite sensitive attribute.
Assumption 4.4.
Assume that for some η >
0, the following holds: E (cid:104)(cid:16) ˆ m d,s ( X i ( s )) − m d,s ( X i ( s )) (cid:17) (cid:105) = O ( n − η ) , ∀ s , s ∈ { , } , d ∈ { , } . Assumption 4.4 states that the estimator of the conditional mean function for each sen-sitive attribute s ∈ { , } and treatment status d ∈ { , } , must consistently converge to thetrue conditional mean function in MSE at some arbitrary rate 2 η >
0. One important as-pect to remark is that differently, for instance, from previous literature on semi-parametricestimation, we require convergence in l for a given sensitive attribute conditional on the opposite sensitive attribute. Such a condition is motivated by the counterfactual notionof fairness: to estimate fairness, we are required to compute the empirical average of theconditional mean function for a given group of individuals with respect to the distributionunder opposite group. Therefore, standard doubly-robust procedures cannot be employedfor estimation of the counterfactual envy-free fairness, but instead, its estimation must relyon the estimation of the conditional mean function. Remark 4 (Sufficient Conditions for Assumption 4.4) . A sufficient condition consists inimposing that almost surely,sup x ∈X (cid:12)(cid:12)(cid:12) ˆ m d,s ( x ) − m d,s ( x ) (cid:12)(cid:12)(cid:12) = O ( n − η ) (28)Examples include (i) linear regression models, of the form m d,s ( x ) = xβ s , with β s being po-tentially high-dimensional, and X i (1) , X i (0) ∈ [ − M, M ] p , for M < ∞ ; (ii) local polynomialestimators of the form ˆ m d,s ( x ) = 1 n n (cid:88) i =1 { S i = s } p s Y i K (cid:16) x − X i h (cid:17) (29)where K ( u ) is a kernel and h is the bandwidth, whose uniform convergence rates arediscussed, for instance, in Fan and Gijbels (1996); Hansen (2008).Under the above conditions we state our second theorem, where we implicitely assumethat N = √ n . Conditions on the uniqueness of the optimal welfare maximization rule can also be found in Andrewset al. (2019). heorem 4.4. (Optimal Fairness) Let Assumptions 2.1, 2.2, 4.1-4.4 hold. Then for some ˜ N M,δ,v > , which depends on M, δ, v , and n > max { ˜ N M,δ,v , H } , with probability at least − γ − / √ n , for universal constants ¯ C, c < ∞ UnFairness(ˆ π ) − inf π ∈ Π o UnFairness( π ) ≤ Mδ ¯ C (cid:114) v log(1 /γ ) n + cδ n − η . 
(30)

The proof is contained in the Appendix. Theorem 4.4 shows that the policy constructed using the empirical and discretized Pareto frontier achieves the same level of fairness as the fairest policy within the population Pareto efficient policies, up to an error that vanishes as the sample size grows. To the best of our knowledge, this is the first result of this type for fair policies.

Fairness with respect to gender is a major concern in education and in industry. Motivated by this concern, in this section we design a policy that assigns students to entrepreneurial programs, imposing fairness on gender. We use data from Lyons and Zhang (2017). That paper studies the effect of an entrepreneurship training and incubation program for undergraduate students in North America on subsequent entrepreneurial activity. We have in total 335 observations, of which 53% were treated and the remaining were under control, and 26% of applicants are women.

The population of interest is the pool of final applicants. We construct a targeting rule which, based on observable characteristics of the applicant, assigns the award to the finalist. We maximize subsequent entrepreneurial activity, which is captured using a dummy variable indicating whether the participant worked in the startup once the program ended. We impose fairness with respect to the gender attribute. Whereas potential outcomes and potential covariates may depend on such an attribute, it is plausible that the gender attribute is independent of such potentials and is randomized before covariates and outcomes are realized.

We estimate the probability of treatment using a penalized logistic regression, where we condition on the non-Caucasian attribute, gender, the average score, years to graduation, whether the individual had previously had entrepreneurship activities, the startup region (which is a dummy since only two regions are considered), the degree (either engineering or business) and the school rank.
We estimate the outcome using a logistic regression, after conditioning on the above covariates and any interaction term between gender, treatment assignment, and a vector of covariates, which includes years to graduation, prior entrepreneurship, startup region, and the school rank. We estimate treatment effects using a doubly robust estimator. We use cross-fitting with five folds in our estimation.

We consider a function class of linear decision rules, given their wide use in economic applications (Manski, 1975). Namely, we consider the following class of policy functions:

$$\Pi = \Big\{ \pi(x, \mathrm{fem}) = 1\big\{ \beta_0 + \beta_1 \mathrm{fem} + x^\top \phi \ge 0 \big\}, \ (\beta_0, \beta_1, \phi) \in \mathcal{B} \Big\}. \tag{31}$$

We allow covariates $x$ to be either (i) the years to graduation, years of entrepreneurship, the region of the startup, the major (either computer science or business), and the school rank, or (ii) the score assigned to the candidate by the interviewer and the school rank. We refer to these two cases respectively as Case 1 and Case 2. We consider the problem where approximately half of the sample, namely 150 individuals, are selected for the treatment.

We consider two notions of unfairness: (i) inequity across groups, which is captured by $|W_1(\pi) - W_0(\pi)|$, and (ii) counterfactual envy as discussed in Section 2.5. Observe that these are two different definitions of fairness. The first, inequity, denotes the difference in welfare across groups. The second, counterfactual envy, instead depends on the sum of the difference in welfares of two individuals in the same group, where one individual has covariates and the treatment assignment drawn under the distribution corresponding to the opposite sensitive attribute.

Using the exact mixed-integer program formulation, we compute the optimal policy rule that minimizes either envy or inequity over the function class $\Pi$, with a grid of size $N = 36$.
Results are robust to different grid choices.

We compare the proposed methodology to the Empirical Welfare Maximization (EWM) method (Athey and Wager, 2017; Kitagawa and Tetenov, 2018), with nuisance functions being estimated as above. We consider three nested function classes for the EWM method. The first does not impose any restriction except for the functional form in Equation (31). We then consider classes under additional “fairness constraints”. In particular, the second imposes that $\beta_1 = 0$, i.e., anti-classification parity. The third class imposes that $\beta_1 = 0$ and that the average marginal effect of the policy on females is at least as large as the one on males. The three function classes are formally described below.

$$\Pi_1 = \Pi, \qquad \Pi_2 = \Big\{ \pi(x) = 1\big\{ \beta_0 + x^\top \phi \ge 0 \big\} \Big\},$$
$$\Pi_3 = \Big\{ \pi(x) = 1\big\{ \beta_0 + x^\top \phi \ge 0 \big\}, \ E_n\big[ (Y_i(1,1) - Y_i(0,1)) \pi(X_i(1)) \big] \ge E_n\big[ (Y_i(1,0) - Y_i(0,0)) \pi(X_i(0)) \big] \Big\}, \tag{32}$$

where $E_n[\cdot]$ denotes the empirical expectation, estimated using the doubly-robust method. We refer to each policy space respectively as Type 1, Type 2, Type 3.

A simple example may illustrate the difference between inequity and counterfactual envy. Consider the policy $\pi$ corresponding to “always treat”. Suppose that $X(1) \sim N(\mu_1, \sigma)$ and $X(0) \sim N(\mu_0, \sigma)$. Suppose that $Y(d, s) \mid X(s) \sim N(\beta_d + X(s), \eta)$. Then inequity corresponding to the policy “always treat” reads as follows: $|\beta_1 + \mu_1 - \beta_1 - \mu_0|$. However, counterfactual envy reads as follows: $\beta_1 - \beta_1 + \beta_1 - \beta_1 = 0$.

5.1 Results

In Figure 4 we plot the Pareto frontier over each function class. The figure shows that restricting the function class of interest leads to Pareto dominated allocations. This outlines the limitations of maximizing welfare under “fairness constraints”: such constraints can be harmful for both types of individuals.
Instead, the proposed method considers the red line (i.e., the least constrained environment), and selects the policy based on fairness considerations. In Figure 4, we also label allocations on the Pareto frontier corresponding to the lowest envy and to the lowest inequity. We observe that different definitions of fairness lead to different treatment allocations and different importance weights assigned to each of the two groups. The proposed method finds such weights solely based on the notion of fairness provided by the social planner, without requiring any prior specification of relative weights assigned to each group. We see this as a crucial advantage, since relative importance weights may be hard to justify to the general public.

In Figure 5, we report the value of counterfactual envy and inequity as a function of the relative weight assigned to female students. We use a dotted line to refer to the weight corresponding to the EWM allocation. The figure shows that the EWM allocation does not attain the lowest unfairness in three of the four panels. Only in Case 1 do the EWM allocation and the Fair Targeting rule with the lowest inequity coincide.

Finally, in Table 1, we report the welfare of female students, male students, and their weighted combination, for different targeting rules. Similarly to what is shown in Figure 4, we observe that different notions of fairness correspond to different group weights minimizing the objective function. As we may observe from Table 1, the Fair Targeting rules are not Pareto dominated by any other competitor. Instead, they correspond to the optimal allocations under definitions of fairness based respectively on envy-freeness and equity. Here is why: the envy-freeness Fair Targeting leads to a better allocation than unconstrained EWM since it leads to lower envy while both allocations are Pareto optimal (e.g., see Figure 5).
It leads to a better allocation than constrained EWM (EWM2 and EWM3) since, as shown in Figure 4, for each constrained allocation, we can find an unconstrained allocation that Pareto dominates it. Such an allocation is strictly preferred to the constrained EWM method. However, such an allocation has higher envy than the optimal envy-freeness allocation rule reported in the table. Similar reasoning also applies when fairness is defined in terms of equity across groups.

The value functions over the Pareto frontier in Figure 4 can be exactly recovered as follows: we solve 2 optimization problems for each $\alpha_j$, $j \in \{1, \dots, N\}$. For each of these problems, we impose the constraint that the welfare of one of the two groups is larger than the other, and vice versa; we then select the subset of solutions that are not Pareto dominated by the others, and we plot the corresponding welfares in the figure.

Figure 4: Pareto frontier over each function class. Type 1 refers to the function class $\Pi_1$, Type 2 to the function class $\Pi_2$ and Type 3 to the function class $\Pi_3$. Notice that the blue and green lines denote the Pareto frontier under fairness constraints, which are strictly dominated by the Pareto frontier in an unconstrained environment (red line). From the figure, we can observe that imposing fairness constraints may result in harmful allocations for both groups of individuals.

Figure 5: Envy (left panels) and inequity (right panels) under Case 1 (top panels) and Case 2 (bottom panels). The x-axis corresponds to different importance weights assigned to the welfare of female students, and the y-axis reports the level of unfairness of the procedure that maximizes the weighted combination of welfares of female and male students. Dotted lines denote the importance weight assigned to female students by the EWM method. Notice that the Pareto allocation corresponding to the lowest envy and the one corresponding to the lowest inequity assign different importance weights to female students.
From the figure, we can observe that importance weights may have an a-priori unknown dependence on the notion of fairness adopted by the social planner.

Table 1: The panel at the top refers to the case where the covariates used in the decision rule are years to graduation, years of entrepreneurship, region of the startup, the major, and the school rank. The panel at the bottom collects results when the covariates used in the decision rule are the score assigned to the applicant and the school rank. Fair Envy refers to the Fair Targeting rule that minimizes envy-freeness unfairness; Fair Equity is the equity-based Fair Targeting method; EWM1, EWM2, EWM3 refer to the EWM method with function class being respectively $\Pi_1$, $\Pi_2$, $\Pi_3$. The second column reports the welfare of female students and the third column the welfare of male students; the fourth column reports a weighted combination of the two welfares. The last column reports the importance weight assigned by the method to the welfare of female students.

Case 1         Welf Fem   Welf Mal   Weighted Welfare   Weight
Fair Envy      0.376      0.272      0.312
Fair Equity    0.297      0.300      0.299
EWM1           0.297      0.300      0.299
EWM2           0.288      0.285      0.286
EWM3           0.331      0.265      0.282

Case 2         Welf Fem   Welf Mal   Weighted Welfare   Weight
Fair Envy      0.372      0.195      0.281
Fair Equity    0.203      0.253      0.250
EWM1           0.351      0.235      0.266
EWM2           0.307      0.238      0.257
EWM3           0.307      0.238      0.257

6 Conclusion
In this paper, we have introduced a novel method for estimating fair and optimal treatment allocation rules. We proposed a multi-objective decision problem, where the policymaker aims to select the least unfair policy in the set of Pareto optimal allocations. We provided counterfactual definitions of fairness, which are justified from an envy-freeness perspective. We derived strong theoretical guarantees on the properties of the Pareto frontier and unfairness regret bounds for the proposed policy function.

We employ the Neyman-Rubin potential outcome framework, indexing potential outcomes and covariates by the sensitive attribute. We assume that the sensitive attribute is independent of such potentials. Such an assumption may hold in many applications involving attributes such as age or gender, whereas it may fail in applications where the sensitive attribute is confounded by unobserved characteristics, as in the case of the race attribute. We leave for future research extensions of this work which allow for weaker notions of exogeneity of the sensitive attribute, and the use of possible instruments in such settings.

From a theoretical perspective, the validity of the regret bound relies on the convergence rate of the proposed estimator in MSE, evaluated on covariates drawn from a distribution different from that of the training sample. Whereas examples are discussed, such notions open new questions on the theoretical guarantees of classical estimators when used to predict on samples drawn from different distributions.

Finally, this paper sheds some light on the connection between counterfactual definitions of fairness and the statistical treatment allocation problem. A comprehensive discussion of such a connection in a decision-theoretical framework remains an open research question.

References
Andrews, I., T. Kitagawa, and A. McCloskey (2019). Inference on winners. Technical report, National Bureau of Economic Research.
Armstrong, T. and S. Shen (2015). Inference on optimal treatment assignments. Available at SSRN 2592479.
Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
Balashankar, A., A. Lees, C. Welty, and L. Subramanian (2019). What is fair? exploring pareto-efficiency for fairness constrained classifiers. arXiv preprint arXiv:1910.14120.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), 962–973.
Bertsimas, D. and J. Dunn (2017). Optimal classification trees. Machine Learning 106(7), 1039–1082.
Bhattacharya, D. and P. Dupas (2012). Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics 167(1), 168–196.
Boucheron, S., O. Bousquet, and G. Lugosi (2005). Theory of classification: A survey of some recent advances. ESAIM: probability and statistics 9, 323–375.
Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
Chen, L.-Y. and S. Lee (2018). Best subset binary prediction. Journal of Econometrics 206(1), 39–56.
Chernozhukov, V., D. Chetverikov, K. Kato, et al. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics 42(4), 1564–1597.
Chernozhukov, V., W. K. Newey, and J. Robins (2018). Double/de-biased machine learning using regularized riesz representers. Technical report, cemmap working paper.
Chiang, H. D., K. Kato, Y. Ma, and Y. Sasaki (2019). Multiway cluster robust double/debiased machine learning. arXiv preprint arXiv:1909.03489.
Chiappa, S. (2019). Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, Volume 33, pp. 7801–7808.
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data 5(2), 153–163.
Corbett-Davies, S. and S. Goel (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806.
Coston, A., A. Mishler, E. H. Kennedy, and A. Chouldechova (2020). Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 582–593.
Cowgill, B. and C. E. Tucker (2019). Economics, fairness and algorithmic bias. preparation for: Journal of Economic Perspectives.
De Ree, J., K. Muralidharan, M. Pradhan, and H. Rogers (2017). Double for nothing? experimental evidence on an unconditional teacher salary increase in indonesia. The World Bank.
Deb, K. (2014). Multi-objective optimization. In Search methodologies, pp. 403–449. Springer.
Dehejia, R. H. (2005). Program evaluation as a decision problem. Journal of Econometrics 125(1-2), 141–173.
Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition, Volume 31. Springer Science & Business Media.
Donini, M., L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil (2018). Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pp. 2791–2801.
Dudik, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
Dupas, P. (2014). Short-run subsidies and long-run adoption of new health products: Evidence from a field experiment. Econometrica 82(1), 197–228.
Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226.
Egger, D., J. Haushofer, E. Miguel, P. Niehaus, and M. W. Walker (2019). General equilibrium effects of cash transfers: experimental evidence from kenya. Technical report, National Bureau of Economic Research.
Elliott, G. and R. P. Lieli (2013). Predicting binary outcomes. Journal of Econometrics 174(1), 15–26.
Fan, J. and I. Gijbels (1996). Local polynomial modelling and its applications: monographs on statistics and applied probability 66, Volume 66. CRC Press.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations.
Journal of Econometrics 189 (1), 1–23.Feldman, A. and A. Kirman (1974). Fairness and envy.
The American Economic Review ,995–1005.Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015).Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDDinternational conference on knowledge discovery and data mining , pp. 259–268.Florios, K. and S. Skouras (2008). Exact computation of max weighted score estimators.
Journal of Econometrics 146 (1), 86–91.Foley, D. K. (1967). Resource allocation and the public sector.Fudenberg, D. and J. Tirole (1991). Game theory.Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimationof average treatment effects.
Econometrica , 315–331.Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependentdata.
Econometric Theory 24 (3), 726–748.Hardt, M., E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning.In
Advances in neural information processing systems , pp. 3315–3323.Hirano, K. and J. R. Porter (2009). Asymptotics for statistical treatment rules.
Econo-metrica 77 (5), 1683–1701.Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replace-ment from a finite universe.
Journal of the American statistical Association 47 (260),663–685.Hossain, S., A. Mladenovic, and N. Shah (2020). Designing fairly fair classifiers via eco-nomic fairness notions. In
Proceedings of The Web Conference 2020 , pp. 1559–1569.Imai, K., L. Keele, and D. Tingley (2010). A general approach to causal mediation analysis.
Psychological methods 15 (4), 309.Imbens, G. W. (2000). The role of the propensity score in estimating dose-response func-tions.
Biometrika 87 (3), 706–710. 31mbens, G. W. (2004). Nonparametric estimation of average treatment effects under exo-geneity: A review.
Review of Economics and statistics 86 (1), 4–29.Imbens, G. W. and D. B. Rubin (2015).
Causal inference in statistics, social, and biomedicalsciences . Cambridge University Press.Kallus, N. (2017). Recursive partitioning for personalization using observational data. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pp.1789–1798. JMLR. org.Kasy, M. and R. Abebe (2020). Fairness, equality, and power in algorithmic decisionmaking.
Working paper .Kilbertus, N., M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Sch¨olkopf(2017). Avoiding discrimination through causal reasoning. In
Advances in Neural Infor-mation Processing Systems , pp. 656–666.Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maxi-mization methods for treatment choice.
Econometrica 86 (2), 591–616.Kitagawa, T. and A. Tetenov (2019). Equality-minded treatment choice.
Journal of Busi-ness & Economic Statistics , 1–14.Kleinberg, J., J. Ludwig, S. Mullainathan, and A. Rambachan (2018). Algorithmic fairness.In
Aea papers and proceedings , Volume 108, pp. 22–27.Kleinberg, J., S. Mullainathan, and M. Raghavan (2016). Inherent trade-offs in the fairdetermination of risk scores. arXiv preprint arXiv:1609.05807 .Kusner, M., C. Russell, J. Loftus, and R. Silva (2019). Making decisions that reducediscriminatory impacts. In
International Conference on Machine Learning , pp. 3591–3600.Labe, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamictreatment regimes: Technical challenges and applications.
Electronic journal of statis-tics 8 (1), 1225.Liu, Y., G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes (2017). Calibratedfairness in bandits. arXiv preprint arXiv:1707.01875 .Loh, W.-Y. (2011). Classification and regression trees.
Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery 1 (1), 14–23.Lu, C., B. Sch¨olkopf, and J. M. Hern´andez-Lobato (2018). Deconfounding reinforcementlearning in observational settings. arXiv preprint arXiv:1812.10576 .32uedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcomeunder a possibly non-unique optimal treatment strategy.
Annals of statistics 44 (2), 713.Lyons, E. and L. Zhang (2017). The impact of entrepreneurship programs on minorities.
American Economic Review 107 (5), 303–07.Manski (2004). Statistical treatment rules for heterogeneous populations.
Economet-rica 72 (4), 1221–1246.Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice.
Journal of econometrics 3 (3), 205–228.Manski, C. F. and T. S. Thompson (1989). Estimation of best predictors of binary response.
Journal of Econometrics 40 (1), 97–123.Martinez, N., M. Bertran, and G. Sapiro (2019). Fairness with minimal harm: A pareto-optimal approach for healthcare. arXiv preprint arXiv:1911.06935 .Mas-Colell, A., M. D. Whinston, J. R. Green, et al. (1995).
Microeconomic theory , Vol-ume 1. Oxford university press New York.Mbakop, E. and M. Tabord-Meehan (2016). Model selection for treatment choice: Penalizedwelfare maximization. arXiv preprint arXiv:1609.03167 .Mitchell, S., E. Potash, S. Barocas, A. D’Amour, and K. Lum (2018). Prediction-based de-cisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprintarXiv:1811.07867 .Muralidharan, K., A. Singh, and A. J. Ganimian (2019). Disrupting education? ex-perimental evidence on technology-aided instruction in india.
American Economic Re-view 109 (4), 1426–60.Murphy, S. A. (2003). Optimal dynamic treatment regimes.
Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) 65 (2), 331–355.Nabi, R., D. Malinsky, and I. Shpitser (2019). Learning optimal fair policies.
Proceedingsof machine learning research 97 , 4674.Nabi, R. and I. Shpitser (2018). Fair inference on outcomes. In
Thirty-Second AAAIConference on Artificial Intelligence .Negishi, T. (1960). Welfare economics and existence of an equilibrium for a competitiveeconomy.
Metroeconomica 12 (2-3), 92–97.Newey, W. K. (1990). Semiparametric efficiency bounds.
Journal of applied economet-rics 5 (2), 99–135. 33ewey, W. K. (1994). The asymptotic variance of semiparametric estimators.
Economet-rica: Journal of the Econometric Society , 1349–1382.Nie, X., E. Brunskill, and S. Wager (2019). Learning when-to-treat policies. arXiv preprintarXiv:1905.09751 .Noghin, V. D. (2006). An axiomatization of the generalized edgeworth–pareto principle interms of choice functions.
Mathematical Social Sciences 52 (2), 210–216.Pearl, J. (2009).
Causality . Cambridge university press.Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatmentrules.
Annals of statistics 39 (2), 1180.Rai, Y. (2018). Statistical inference for treatment assignment policies.
UnpublishedManuscript .Rambachan, A., J. Kleinberg, J. Ludwig, and S. Mullainathan (2020). An economic ap-proach to regulating algorithms.Rambachan, A. and J. Roth (2019). Bias in, bias out? evaluating the folk wisdom. arXivpreprint arXiv:1909.08518 .Robins, J. M. and A. Rotnitzky (1995). Semiparametric efficiency in multivariate regressionmodels with missing data.
Journal of the American Statistical Association 90 (429), 122–129.Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficientswhen some regressors are not always observed.
Journal of the American statistical As-sociation 89 (427), 846–866.Rubin, D. B. (1990). Formal mode of statistical inference for causal effects.
Journal ofstatistical planning and inference 25 (3), 279–292.Schick, A. (1986). On asymptotically efficient estimation in semiparametric models.
TheAnnals of Statistics , 1139–1151.Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validityof experiments.
Journal of Econometrics 166 (1), 138–156.Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regretcriteria.
Journal of Econometrics 166 (1), 157–165.Ustun, B., Y. Liu, and D. Parkes (2019). Fairness without harm: Decoupled classifiers withpreference guarantees. In
International Conference on Machine Learning , pp. 6373–6382.34an Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In
Weak convergenceand empirical processes , pp. 16–28. Springer.Varian, H. R. (1976). Two problems in the theory of fairness.
Journal of Public Eco-nomics 5 (3-4), 249–260.Viviano, D. (2019). Policy targeting under network interference. arXiv preprintarXiv:1906.10258 .Wainwright, M. J. (2019).
High-dimensional statistics: A non-asymptotic viewpoint , Vol-ume 48. Cambridge University Press.Wolsey, L. A. and G. L. Nemhauser (1999).
Integer and combinatorial optimization , Vol-ume 55. John Wiley & Sons.Xiao, L., Z. Min, Z. Yongfeng, G. Zhaoquan, L. Yiqun, and M. Shaoping (2017). Fairness-aware group recommendation with pareto-efficiency. In
Proceedings of the Eleventh ACMConference on Recommender Systems , pp. 107–115.Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013). Learning fair representa-tions. In
International Conference on Machine Learning , pp. 325–333.Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2012). A robust method forestimating optimal treatment regimes.
Biometrics 68 (4), 1010–1018.Zhou, X., N. Mayer-Hamblett, U. Khan, and M. R. Kosorok (2017). Residual weightedlearning for estimating individualized treatment rules.
Journal of the American Statis-tical Association 112 (517), 169–187.Zhou, Z., S. Athey, and S. Wager (2018). Offline multi-action policy learning: Generaliza-tion and optimization. arXiv preprint arXiv:1810.04778 .35
Proofs
A.1 Auxiliary Lemmas
Lemma A.1.
Under Assumptions 2.1 and 2.2, for any sensitive attribute $s \in \{0, 1\}$,
\[
W_s(\pi) = E\Big[ \frac{1\{S_i = s\}}{p_s} \frac{Y_i D_i}{e(X_i, s)} \pi(X_i, s) + \frac{1\{S_i = s\}}{p_s} \frac{Y_i (1 - D_i)}{1 - e(X_i, s)} \big(1 - \pi(X_i, s)\big) \Big]. \tag{33}
\]
Proof of Lemma A.1.
Assumption 2.2 guarantees existence of the expectation. By the law of iterated expectations and Equation (2), we obtain
\[
E\Big[ \frac{1\{S_i = s\}}{p_s} \Big( \frac{Y_i D_i}{e(X_i, s)} - \frac{Y_i (1 - D_i)}{1 - e(X_i, s)} \Big) \pi(X_i, s) \Big]
= E\Big[ \frac{1\{S_i = s\}}{p_s} E\Big[ \frac{Y_i(1, s) D_i}{e(X_i, s)} - \frac{Y_i(0, s)(1 - D_i)}{1 - e(X_i, s)} \Big| S_i = s, X_i \Big] \pi(X_i, s) \Big]. \tag{34}
\]
By the first condition in Assumption 2.1, we have
\[
E\Big[ \frac{Y_i(1, s) D_i}{e(X_i, s)} - \frac{Y_i(0, s)(1 - D_i)}{1 - e(X_i, s)} \Big| S_i = s, X_i \Big] = E\big[ Y_i(1, s) - Y_i(0, s) \big| S_i = s, X_i \big]. \tag{35}
\]
Observe now that
\[
E\big[ Y_i(1, s) - Y_i(0, s) \big| S_i = s, X_i \big] = E\big[ Y_i(1, s) - Y_i(0, s) \big| X_i(s) \big]. \tag{36}
\]
Therefore, we can write by consistency of potential covariates
\[
E\Big[ \frac{1\{S_i = s\}}{p_s} E\big[ Y_i(1, s) - Y_i(0, s) \big| S_i = s, X_i \big] \pi(X_i(s), s) \Big]
= E\Big[ \frac{1\{S_i = s\}}{p_s} E\big[ Y_i(1, s) - Y_i(0, s) \big| X_i(s) \big] \pi(X_i(s), s) \Big]. \tag{37}
\]
Under Condition (B):
\[
E\Big[ \frac{1\{S_i = s\}}{p_s} E\big[ Y_i(1, s) - Y_i(0, s) \big| X_i(s) \big] \pi(X_i(s), s) \Big]
= E\Big[ \frac{1\{S_i = s\}}{p_s} \Big] \times E\Big[ E\big[ Y_i(1, s) - Y_i(0, s) \big| X_i(s) \big] \pi(X_i(s), s) \Big]
= E\Big[ \big( Y_i(1, s) - Y_i(0, s) \big) \pi(X_i(s), s) \Big]. \tag{38}
\]
Finally, observe that using similar arguments
\[
E\Big[ \frac{1\{S_i = s\}}{p_s} \frac{Y_i (1 - D_i)}{1 - e(X_i, s)} \Big] = E[Y_i(0, s)],
\]
which concludes the proof.

Lemma A.2.
Let $W_{s,n}(\pi) = \frac{1}{n} \sum_{i=1}^n \big[ \Gamma_{1,s,i} \pi(X_i, s) + \Gamma_{0,s,i} (1 - \pi(X_i, s)) \big]$, where $\Gamma_{d,s,i}$ is defined as in Equation (12). Let Assumptions 2.1, 2.2, 4.1 hold. Then with probability at least $1 - \gamma$,
\[
\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big| \le \bar{C} \frac{M}{\delta^2} \sqrt{v/n} + \bar{C} \frac{M}{\delta^2} \sqrt{\log(2/\gamma)/n} \tag{39}
\]
for a universal constant $\bar{C} < \infty$. Similarly,
\[
E\Big[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big| \Big] \le \bar{C} \frac{M}{\delta^2} \sqrt{v/n}. \tag{40}
\]
Proof of Lemma A.2.
Throughout the proof we refer to $\bar{C} < \infty$ as a universal constant. Observe first that under Assumption 2.2 and Assumption 4.1,
\[
\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big| \tag{41}
\]
satisfies the bounded difference assumption (Boucheron et al., 2013) with constant $\frac{M}{\delta^2 n}$. See for instance Boucheron et al. (2005). By the bounded difference inequality, with probability at least $1 - \gamma$,
\[
\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big|
\le E\Big[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big| \Big] + \bar{C} \frac{M}{\delta^2} \sqrt{\log(2/\gamma)/n}. \tag{42}
\]
We now move to bound the expectation on the right-hand side of Equation (42). Under Assumption 2.1, we obtain by Lemma A.1 and trivial rearrangements that
\[
E\Big[ \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big] = 0. \tag{43}
\]
Using the symmetrization argument (Van Der Vaart and Wellner, 1996), we can now bound the above supremum with the Rademacher complexity of the function class of interest (e.g., Athey and Wager (2017), Viviano (2019)), which combined with the triangle inequality reads as follows:
\[
\begin{aligned}
& E\Big[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big| \Big] \\
& \le E\Big[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \big| \alpha W_1(\pi) - \alpha W_{1,n}(\pi) \big| \Big] + E\Big[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \big| (1 - \alpha) W_0(\pi) - (1 - \alpha) W_{0,n}(\pi) \big| \Big] \\
& \le E\Big[ \sup_{\pi \in \Pi} \big| W_1(\pi) - W_{1,n}(\pi) \big| \Big] + E\Big[ \sup_{\pi \in \Pi} \big| W_0(\pi) - W_{0,n}(\pi) \big| \Big] \\
& \le 2 E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \pi(X_i(1), 1) \Gamma_{1,1,i} \Big| \Big] + 2 E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i (1 - \pi(X_i(1), 1)) \Gamma_{0,1,i} \Big| \Big] \\
& \quad + 2 E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \pi(X_i(0), 0) \Gamma_{1,0,i} \Big| \Big] + 2 E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i (1 - \pi(X_i(0), 0)) \Gamma_{0,0,i} \Big| \Big],
\end{aligned} \tag{44}
\]
where the $\sigma_i$ are independent Rademacher random variables. We can study each component of the above expression separately. By Dudley's entropy integral bound, since the VC-dimension of the function class $\Pi$ is bounded by Assumption 4.1, and since each $\Gamma_{d,s,i}$ is bounded, we obtain (see for instance Wainwright (2019)), under Assumption 4.1 (A) and (B), with trivial rearrangement,
\[
E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \pi(X_i(s), s) \Gamma_{d,s,i} \Big| \Big] \le \bar{C} \frac{M}{\delta^2} \sqrt{v/n} \tag{45}
\]
for each $d, s$.
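To make the quantity bounded in Equation (45) concrete, the following sketch estimates an empirical Rademacher complexity by Monte Carlo for one simple policy class. It is an illustration under assumptions that are not part of the paper: the class consists of one-dimensional threshold rules $\pi_t(x) = 1\{x \ge t\}$, the weights `gammas` are hypothetical bounded stand-ins for $\Gamma_{d,s,i}$, and all function names are invented for this example.

```python
import math
import random


def rademacher_complexity(xs, gammas, n_draws=200, seed=0):
    """Monte Carlo estimate of E_sigma sup_t |(1/n) sum_i sigma_i 1{x_i >= t} gamma_i|.

    Assumes the x_i are distinct, so that every threshold rule treats exactly
    the k units with the largest x for some k in {0, ..., n}.
    """
    rng = random.Random(seed)
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i], reverse=True)  # largest x first
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        partial, best = 0.0, 0.0  # k = 0: nobody is treated, the sum is zero
        for i in order:
            partial += sigma[i] * gammas[i]
            best = max(best, abs(partial))
        total += best / n
    return total / n_draws


def simulate(n, seed=1):
    """Draw hypothetical covariates and weights with |gamma_i| <= 1, then estimate."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    gammas = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return rademacher_complexity(xs, gammas, seed=seed)


if __name__ == "__main__":
    for n in (100, 400, 1600):
        # Print the estimate next to the reference rate 1 / sqrt(n).
        print(n, simulate(n), 1.0 / math.sqrt(n))
```

Under these assumptions the estimate should shrink roughly in proportion to $1/\sqrt{n}$, in line with the $\sqrt{v/n}$ rate for a class of fixed VC-dimension.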
Observe now that the VC-dimension of $\pi$ equals the VC-dimension of $1 - \pi$ (Devroye et al., 2013). This completes the proof.

Lemma A.3.
Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then with probability at least $1 - \gamma$, and $n > H$,
\[
\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha \hat{W}_1(\pi) + (1 - \alpha) \hat{W}_0(\pi) \big) \Big| \le \bar{C} \frac{M}{\delta^2} \sqrt{v/n} + \bar{C} \frac{M}{\delta^2} \sqrt{\log(2/\gamma)/n} \tag{46}
\]
for a universal constant $\bar{C} < \infty$.

Proof of Lemma A.3. First observe that we can bound the above expression as
\[
\begin{aligned}
& \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha \hat{W}_1(\pi) + (1 - \alpha) \hat{W}_0(\pi) \big) \Big| \\
& \le \underbrace{\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) - \big( \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) \big) \Big|}_{(I)} \\
& \quad + \underbrace{\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_{1,n}(\pi) + (1 - \alpha) W_{0,n}(\pi) - \big( \alpha \hat{W}_1(\pi) + (1 - \alpha) \hat{W}_0(\pi) \big) \Big|}_{(II)}.
\end{aligned} \tag{47}
\]
Here $W_{s,n}$ is as defined in Lemma A.2. The term (I) is bounded as in Lemma A.2. Therefore, we are only left to discuss (II). Using the triangle inequality, we only need to bound
\[
\sup_{\pi \in \Pi} \big| W_{1,n}(\pi) - \hat{W}_{1,n}(\pi) \big| + \sup_{\pi \in \Pi} \big| W_{0,n}(\pi) - \hat{W}_{0,n}(\pi) \big|. \tag{48}
\]
We bound the first term while the second term follows similarly.
We write
\[
\begin{aligned}
\sup_{\pi \in \Pi} \big| W_{s,n}(\pi) - \hat{W}_s(\pi) \big|
& \le \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1\{S_i = s\}}{p_s} \frac{D_i (Y_i - m_{1,s}(X_i))}{e(X_i, s)} + m_{1,s}(X_i) \Big] \pi(X_i, s) \\
& \qquad\quad - \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1\{S_i = s\}}{\hat{p}_s} \frac{D_i (Y_i - \hat{m}_{1,s}(X_i))}{\hat{e}(X_i, s)} + \hat{m}_{1,s}(X_i) \Big] \pi(X_i, s) \Big| \\
& \quad + \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1\{S_i = s\}}{p_s} \frac{(1 - D_i)(Y_i - m_{0,s}(X_i))}{1 - e(X_i, s)} + m_{0,s}(X_i) \Big] (1 - \pi(X_i, s)) \\
& \qquad\quad - \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1\{S_i = s\}}{\hat{p}_s} \frac{(1 - D_i)(Y_i - \hat{m}_{0,s}(X_i))}{1 - \hat{e}(X_i, s)} + \hat{m}_{0,s}(X_i) \Big] (1 - \pi(X_i, s)) \Big|.
\end{aligned} \tag{49}
\]
We discuss the first component while the second follows similarly. With trivial rearrangement, using the triangle inequality, the first component on the right-hand side of Equation (49) is bounded by
\[
\underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s \hat{e}(X_i, s)} \Big) \pi(X_i, s) \Big|}_{(i)}
+ \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{1\{S_i = s\} D_i}{\hat{p}_s \hat{e}(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i)) \pi(X_i, s) \Big|}_{(ii)}. \tag{50}
\]
We study (i) and (ii) separately. We start from (i). Recall that, by cross-fitting, $\hat{e}(X_i, s) = \hat{e}^{(-k(i))}(X_i, s)$, where $k(i)$ is the fold containing unit $i$.
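The cross-fitting convention recalled above can be sketched in a few lines. The example is a simplified illustration, not the paper's estimator: the propensity score is taken to be a constant estimated by the treated share, and `assign_folds` and `cross_fit_propensity` are hypothetical helper names. What it shows is that the out-of-fold estimate attached to unit $i$ never uses unit $i$'s own data, which is the property the conditioning argument relies on.

```python
import random


def assign_folds(n, K, seed=0):
    """Randomly split indices 0..n-1 into K folds of near-equal size.

    Returns a list `fold` with fold[i] = k(i), the fold containing unit i.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    fold = [0] * n
    for pos, i in enumerate(idx):
        fold[i] = pos % K
    return fold


def cross_fit_propensity(D, fold, K):
    """Out-of-fold propensity estimates (assumes K >= 2 and a constant propensity).

    For each unit i, the estimate is the treated share computed on every fold
    except the one containing i.
    """
    n = len(D)
    e_hat = [0.0] * n
    for k in range(K):
        train = [i for i in range(n) if fold[i] != k]
        e_minus_k = sum(D[i] for i in train) / len(train)
        for i in range(n):
            if fold[i] == k:
                e_hat[i] = e_minus_k
    return e_hat


if __name__ == "__main__":
    D = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # hypothetical treatment indicators
    fold = assign_folds(len(D), K=3)
    e_hat = cross_fit_propensity(D, fold, K=3)
    print(list(zip(fold, e_hat)))
```

Flipping a single unit's treatment indicator leaves that unit's own estimate unchanged, while the estimates of units in other folds can move.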
Therefore, observe that given the $K$ folds for cross-fitting, we have
\[
\Big| \frac{1}{n} \sum_{i=1}^n 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s \hat{e}(X_i, s)} \Big) \pi(X_i, s) \Big|
\le \sum_{k \in \{1, \dots, K\}} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big|. \tag{51}
\]
In addition, we have that
\[
\begin{aligned}
& E\Big[ \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big] \\
& = E\Big[ E\Big[ \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big| \hat{p}^{(-k)}, \hat{e}^{(-k)} \Big] \Big] = 0,
\end{aligned} \tag{52}
\]
by cross-fitting. By Assumption 4.2, we know that for $n > H$,
\[
\sup_{x \in \mathcal{X}, s \in \mathcal{S}} \Big| \frac{1}{p_s e(x, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(x, s)} \Big| \le 2/\delta^2 \tag{53}
\]
and therefore each summand in Equation (51) is bounded by a finite constant $2/\delta^2$. We now obtain that, for $n > H$, using the symmetrization argument (Van Der Vaart and Wellner, 1996) and Dudley's entropy integral (Wainwright, 2019),
\[
E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big| \, \Big| \hat{p}^{(-k)}, \hat{e}^{(-k)} \Big] \lesssim \frac{M}{\delta^2} \sqrt{v/n}. \tag{54}
\]
In addition, by the bounded difference inequality (Boucheron et al., 2005), we obtain that for $n > H$, with probability at least $1 - \gamma$, for a universal constant $c < \infty$,
\[
\begin{aligned}
& \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big| \\
& \le E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} 1\{S_i = s\} D_i (Y_i - m_{1,s}(X_i)) \Big( \frac{1}{p_s e(X_i, s)} - \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} \Big) \pi(X_i, s) \Big| \, \Big| \hat{p}^{(-k)}, \hat{e}^{(-k)} \Big] + c \frac{M}{\delta^2} \sqrt{\frac{\log(2/\gamma)}{n}}.
\end{aligned} \tag{55}
\]
We now consider the term (ii). Observe that we can write
\[
\begin{aligned}
(ii) & \le \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{D_i 1\{S_i = s\}}{\hat{p}_s \hat{e}(X_i, s)} - \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i)) \pi(X_i, s) \Big|}_{(j)} \\
& \quad + \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i)) \pi(X_i, s) \Big|}_{(jj)}.
\end{aligned} \tag{56}
\]
We consider each term separately. Consider (jj) first. Using the cross-fitting argument we obtain
\[
\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i)) \pi(X_i, s) \Big|
\le \sum_{k \in \{1, \dots, K\}} \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} \Big( \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}^{(-k)}(X_i)) \pi(X_i, s) \Big|. \tag{57}
\]
Observe now that
\[
E\Big[ \Big( \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}^{(-k)}(X_i)) \pi(X_i, s) \Big| \hat{m}_{1,s}^{(-k)} \Big] = 0. \tag{58}
\]
Therefore, following the same argument discussed before, we obtain that for $n > H$ large enough, with probability at least $1 - \gamma$,
\[
\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i \in \mathcal{I}_k} \Big( \frac{D_i 1\{S_i = s\}}{p_s e(X_i, s)} - 1 \Big) (m_{1,s}(X_i) - \hat{m}_{1,s}^{(-k)}(X_i)) \pi(X_i, s) \Big| \lesssim \frac{M}{\delta^2} \sqrt{\frac{v}{n}} + \frac{M}{\delta^2} \sqrt{\frac{\log(1/\gamma)}{n}}. \tag{59}
\]
We are now left to bound (j). We obtain that
\[
(j) \le \sqrt{\frac{1}{n} \sum_{i=1}^n \Big( \frac{1}{\hat{p}_s \hat{e}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2} \sqrt{\frac{1}{n} \sum_{i=1}^n (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i))^2}. \tag{60}
\]
Such a bound does not depend on $\pi$. Observe now that we can write
\[
\begin{aligned}
& \sqrt{\frac{1}{n} \sum_{i=1}^n \Big( \frac{1}{\hat{p}_s \hat{e}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2} \sqrt{\frac{1}{n} \sum_{i=1}^n (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i))^2} \\
& = \sqrt{\sum_{k \in \{1, \dots, K\}} \frac{1}{n} \sum_{i \in \mathcal{I}_k} \Big( \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2} \sqrt{\sum_{k \in \{1, \dots, K\}} \frac{1}{n} \sum_{i \in \mathcal{I}_k} (m_{1,s}(X_i) - \hat{m}_{1,s}^{(-k)}(X_i))^2}.
\end{aligned} \tag{61}
\]
By the bounded difference inequality and the union bound, we obtain that the following holds:
\[
\begin{aligned}
& \sqrt{\sum_{k \in \{1, \dots, K\}} \frac{1}{n} \sum_{i \in \mathcal{I}_k} \Big( \frac{1}{\hat{p}_s^{(-k)} \hat{e}^{(-k)}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2} \sqrt{\sum_{k \in \{1, \dots, K\}} \frac{1}{n} \sum_{i \in \mathcal{I}_k} (m_{1,s}(X_i) - \hat{m}_{1,s}^{(-k)}(X_i))^2} \\
& \le K \sqrt{E\Big[ \Big( \frac{1}{\hat{p}_s \hat{e}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2 \Big]} \sqrt{E\Big[ (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i))^2 \Big]} \\
& \quad + 2 \sqrt{\log(2K/\gamma)/n} \, \sqrt{E\Big[ \Big( \frac{1}{\hat{p}_s \hat{e}(X_i, s)} - \frac{1}{p_s e(X_i, s)} \Big)^2 \Big]} \\
& \quad + 2 \sqrt{\log(2K/\gamma)/n} \, \sqrt{E\Big[ (m_{1,s}(X_i) - \hat{m}_{1,s}(X_i))^2 \Big]} + 2 \sqrt{\log(2K/\gamma)/n},
\end{aligned} \tag{62}
\]
with probability at least $1 - \gamma$.
Under Assumption 4.2 and the union bound, the result follows.

Lemma A.4.
Under Assumptions 2.1, 2.2, 4.1, 4.2, 4.4, the following holds: with probability at least $1 - \gamma$, for $n \ge H$,
\[
\sup_{\pi \in \Pi} \big| A(s, s'; \pi) - A_n(s, s'; \pi) \big| \le c \frac{M}{\delta^2} \sqrt{\frac{v \log(1/\gamma)}{n}} + \frac{c}{\delta} n^{-\eta} \tag{63}
\]
for a universal constant $c < \infty$.

Proof of Lemma A.4. We consider the case where $s' \neq s$, whereas $s' = s$ follows trivially. Observe that we can write
\[
\begin{aligned}
\sup_{\pi \in \Pi} \big| A(s, s'; \pi) - A_n(s, s'; \pi) \big|
& \le \underbrace{\sup_{\pi \in \Pi} \Big| E_{X(s)}\big[ V_{\pi(X(s), s)}(X(s), s') \big] - \frac{1}{n} \sum_{i=1}^n \Big( \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{1,s'}(X_i) \pi(X_i, s) + \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{0,s'}(X_i) (1 - \pi(X_i, s)) \Big) \Big|}_{(A)} \\
& \quad + \underbrace{\sup_{\pi \in \Pi} \big| \hat{W}_{s'}(\pi) - W_{s'}(\pi) \big|}_{(B)}.
\end{aligned} \tag{64}
\]
The term (B) is bounded as discussed in Lemma A.3. Therefore, we are only left to discuss bounds on (A).
To derive bounds in such a scenario, we first observe that we can write
\[
\begin{aligned}
& \sup_{\pi \in \Pi} \Big| E_{X(s)}\big[ V_{\pi(X(s), s)}(X(s), s') \big] - \frac{1}{n} \sum_{i=1}^n \Big( \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{1,s'}(X_i) \pi(X_i, s) + \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{0,s'}(X_i) (1 - \pi(X_i, s)) \Big) \Big| \\
& \le \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) - E\Big[ \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big] \Big|}_{(I)} \\
& \quad + \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} m_{0,s'}(X_i) (1 - \pi(X_i, s)) - E\Big[ \frac{1\{S_i = s\}}{p_s} m_{0,s'}(X_i) (1 - \pi(X_i, s)) \Big] \Big|}_{(II)} \\
& \quad + \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{1,s'}(X_i) - \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \Big) \pi(X_i, s) \Big|}_{(III)} \\
& \quad + \underbrace{\sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \Big( \frac{1\{S_i = s\}}{\hat{p}_s} \hat{m}_{0,s'}(X_i) - \frac{1\{S_i = s\}}{p_s} m_{0,s'}(X_i) \Big) (1 - \pi(X_i, s)) \Big|}_{(IV)}.
\end{aligned}
\]
We discuss (I) and (III), whereas (II) and (IV) follow similarly.
Observe first that by Assumption 4.1 and the bounded difference inequality, with probability $1 - \gamma$,
\[
\begin{aligned}
& \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) - E\Big[ \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big] \Big| \\
& \le E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) - E\Big[ \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big] \Big| \Big] + \bar{C} \frac{M}{\delta^2} \sqrt{\log(2/\gamma)/n}
\end{aligned}
\]
for a universal constant $\bar{C} < \infty$. Using the symmetrization argument (Van Der Vaart and Wellner, 1996), we have
\[
E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) - E\Big[ \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big] \Big| \Big]
\le 2 E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big| \Big],
\]
where the $\sigma_i$ are i.i.d. Rademacher random variables. Since $m_{1,s'}$ is uniformly bounded and similarly $1/p_s$ is bounded, and by Assumption 4.1, we obtain by the properties of Dudley's entropy integral (Wainwright, 2019),
\[
E\Big[ \sup_{\pi \in \Pi} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i \frac{1\{S_i = s\}}{p_s} m_{1,s'}(X_i) \pi(X_i, s) \Big| \Big] \le \bar{C} \frac{M}{\delta^2} \sqrt{v/n}
\]
for a universal constant $\bar{C} < \infty$. We now move to bound (III). Using the triangle inequality and Hölder's inequality, we obtain
\[
(III) \le \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} \big| m_{1,s'}(X_i) - \hat{m}_{1,s'}(X_i) \big|. \tag{65}
\]
The above bound does not depend on $\pi$. Observe now that by Equation (1)
\[
\frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} \big| m_{1,s'}(X_i) - \hat{m}_{1,s'}(X_i) \big|
= \frac{1}{n} \sum_{i=1}^n \frac{1\{S_i = s\}}{p_s} \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}(X_i(s)) \big|
\le \frac{1}{n\delta} \sum_{i=1}^n \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}(X_i(s)) \big|. \tag{66}
\]
We now separate the contribution of each of the $K$ folds used in the cross-fitting algorithm. Namely, we write
\[
\frac{1}{n\delta} \sum_{i=1}^n \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}(X_i(s)) \big| \le \sum_{k \in \{1, \dots, K\}} \frac{1}{n\delta} \sum_{i \in \mathcal{I}_k} \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}^{(-k)}(X_i(s)) \big|, \tag{67}
\]
where $\mathcal{I}_k$ denotes the set of indexes in fold $k$, and $\hat{m}_{1,s'}^{(-k)}$ denotes the estimator obtained from all folds except $k$. Next, we bound the following term using Lyapunov's inequality:
\[
\frac{1}{n} \sum_{i \in \mathcal{I}_k} E\Big[ \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}^{(-k)}(X_i(s)) \big| \Big| \hat{m}^{(-k)} \Big]
\le E\Big[ \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}^{(-k)}(X_i(s)) \big| \Big| \hat{m}^{(-k)} \Big]
\le \sqrt{E\Big[ \big| m_{1,s'}(X_i(s)) - \hat{m}_{1,s'}^{(-k)}(X_i(s)) \big|^2 \Big| \hat{m}^{(-k)} \Big]}
\le c n^{-\eta}. \tag{68}
\]
The last inequality follows by Assumption 4.4, for a universal constant $c < \infty$. Finally, we discuss exponential concentration of the empirical counterpart.
For n ≥ H ,sup x ∈X (cid:12)(cid:12) m d,s (cid:48) ( x ) − ˆ m − kd,s (cid:48) ( x ) (cid:12)(cid:12)(cid:12) ≤ M. (69)By the bounded difference inequality, with probability at least 1 − γ , for n > H n (cid:88) i ∈I k (cid:12)(cid:12)(cid:12) m ,s (cid:48) ( X i ( s )) − ˆ m ,s (cid:48) ( X i ( s )) (cid:12)(cid:12)(cid:12) ≤ E (cid:104)(cid:12)(cid:12)(cid:12) m ,s (cid:48) ( X i ( s )) − ˆ m ,s (cid:48) ( X i ( s )) (cid:12)(cid:12)(cid:12)(cid:105) + 4 M (cid:112) log(2 /γ ) /n. (70)Combining the above bounds, the proof completes. Lemma A.5.
Let
$$
G(\alpha) = \sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\}. \tag{71}
$$
Define $\mathcal{G} = \{ G(\alpha), \alpha \in (0,1) \}$. Under Assumption 4.1, for any $\varepsilon > 0$, there exists a set $\{\alpha_1, \dots, \alpha_{N(\varepsilon)}\}$ such that for all $\alpha \in (0,1)$,
$$
\Big| G(\alpha) - \max_{j \in \{1,\dots,N(\varepsilon)\}} G(\alpha_j) \Big| \le 4 \varepsilon M, \quad \text{a.s.} \tag{72}
$$
and $N(\varepsilon) \le 1/\varepsilon$.

Proof of Lemma A.5. We denote by $\{\alpha_1, \dots, \alpha_{N(\varepsilon)}\}$ an $\varepsilon$-cover of the interval $(0,1)$ with respect to the L1 norm. Namely, $\{\alpha_1, \dots, \alpha_{N(\varepsilon)}\}$ are equally spaced numbers in $(0,1)$, with $N(\varepsilon) \le 1/\varepsilon$. We denote
$$
G(\alpha) = \sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\}. \tag{73}
$$
To characterize the corresponding cover of the function class $\mathcal{G} = \{ G(\alpha), \alpha \in (0,1) \}$, we claim that for any $\alpha \in (0,1)$, there exists an $\alpha_j$ in the $\varepsilon$-cover such that
$$
| G(\alpha) - G(\alpha_j) | \le 4 \varepsilon M. \tag{74}
$$
The claim follows from the argument below. Take the $\alpha_j$ closest to $\alpha$ and consider
$$
\begin{aligned}
| G(\alpha) - G(\alpha_j) | &= \Big| \sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} \\
&\qquad - \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} + \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \Big| \\
&\le \underbrace{\Big| \sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \Big|}_{(i)} \\
&\qquad + \underbrace{\Big| \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \Big|}_{(ii)}.
\end{aligned} \tag{75}
$$
We study $(i)$ and $(ii)$ separately. Consider first $(i)$. We observe the following fact: whenever
$$
\sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} > 0, \tag{76}
$$
we have
$$
(i) \le \Big| \alpha W_1(\pi^*) + (1-\alpha) W_0(\pi^*) - \alpha_j W_1(\pi^*) - (1-\alpha_j) W_0(\pi^*) \Big|. \tag{77}
$$
Here $\pi^* \in \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi)$. When instead
$$
\sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \le 0, \tag{78}
$$
we have
$$
(i) \le \Big| \alpha W_1(\pi^{**}) + (1-\alpha) W_0(\pi^{**}) - \alpha_j W_1(\pi^{**}) - (1-\alpha_j) W_0(\pi^{**}) \Big|. \tag{79}
$$
Here $\pi^{**} \in \arg\sup_{\pi \in \Pi} \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi)$. Therefore we obtain
$$
(i) \le \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \alpha_j W_1(\pi) - (1-\alpha_j) W_0(\pi) \Big| \le 2 | \alpha - \alpha_j | M, \tag{80}
$$
where the last inequality follows by Assumption 4.1 and the triangle inequality. Similar reasoning also applies to $(ii)$. Therefore, the covering number of the function class $\mathcal{G}$ for a $4M\varepsilon$ cover is at most $1/\varepsilon + 1$.

A.2 Additional Lemmas

Proof of Lemma 2.1
The proof follows the argument in standard microeconomics textbooks (Mas-Colell et al., 1995). Let
$$
\tilde{\Pi} = \Big\{ \pi_\alpha : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \alpha_0 W_0(\pi) + \alpha_1 W_1(\pi), \ \alpha \in \mathbb{R}^2_+, \ \alpha_0 + \alpha_1 > 0 \Big\}. \tag{81}
$$
We want to show that $\Pi_o = \tilde{\Pi}$. Trivially $\tilde{\Pi} \subseteq \Pi_o$, since otherwise the definition of Pareto optimality would be violated. Consider now some $\pi^* \in \Pi_o$. We show that there exists a vector $\alpha \in \mathbb{R}^2_+$ such that $\pi^*$ maximizes the expression
$$
\sup_{\pi \in \Pi} \alpha_0 W_0(\pi) + \alpha_1 W_1(\pi). \tag{82}
$$
Denote the set
$$
\mathcal{F} = \Big\{ (\tilde{W}_0, \tilde{W}_1) \in \mathbb{R}^2 : \exists \pi \in \Pi : \tilde{W}_0 \le W_0(\pi) \text{ and } \tilde{W}_1 \le W_1(\pi) \Big\}. \tag{83}
$$
Since $(0,0) \in \mathcal{F}$, such a set is non-empty. Notice now that $W_s(\pi)$ is linear in $\pi$ for $s \in \{0,1\}$, and hence concave in $\pi$. Therefore, the set $\mathcal{F}$ is convex, since it is the sub-graph of a concave functional. We denote $\bar{W} = (W_0(\pi^*), W_1(\pi^*))$ and by $\mathcal{G} = \mathbb{R}^2_{++} + \bar{W}$ the set of welfares that strictly dominate $\pi^*$. Then $\mathcal{G}$ is non-empty and convex. Since $\pi^* \in \Pi_o$, we must have $\mathcal{F} \cap \mathcal{G} = \emptyset$. Therefore, by the separating hyperplane theorem, there exists an $\alpha \in \mathbb{R}^2$, with $\alpha \neq 0$, such that $\alpha^\top F \le \alpha^\top (\bar{W} + d)$ for any $F \in \mathcal{F}$, $d \in \mathbb{R}^2_{++}$. Letting $d_0 \to \infty$, it must be that $\alpha_0 \in \mathbb{R}_+$, and similarly for $\alpha_1$. So $\alpha \in \mathbb{R}^2_+$. By letting $d \to 0$, we have that $\alpha^\top F \le \alpha^\top \bar{W}$. This implies that
$$
\alpha_0 W_0(\pi) + \alpha_1 W_1(\pi) \le \alpha_0 W_0(\pi^*) + \alpha_1 W_1(\pi^*) \tag{84}
$$
for any $\pi \in \Pi$ (since it is true for any $F \in \mathcal{F}$). Hence $\pi^*$ maximizes welfare over all feasible allocations once reweighted by $(\alpha_0, \alpha_1)$. Since the maximizer is invariant to multiplication of the objective function by positive constants, the result follows after dividing the objective function by the sum of the coefficients, which is non-zero by the separating hyperplane theorem. This completes the proof.

Proof of Lemma 2.2
First, observe that by rationality, preferences are complete and transitive. Observe then that the preference relation equivalently corresponds to lexicographic preferences, with $\pi \succ \pi'$ if $\pi$ Pareto dominates $\pi'$. If instead neither $\pi$ nor $\pi'$ Pareto dominates the other, then $\pi \succ \pi'$ if UnFairness($\pi$) < UnFairness($\pi'$). Therefore, it must be that $\mathcal{C}(\Pi) \subseteq \Pi_o$, since if not, we can find a policy $\pi \in \mathcal{C}(\Pi)$ whose welfares for both groups are dominated by those of some other policy $\pi' \in \Pi$. Under rational preferences, this would contradict the definition of $\mathcal{C}(\Pi)$. Observe now that by definition, $\mathcal{C}(\Pi)$ contains the set of Pareto optimal allocations that achieve minimal UnFairness. By Lemma 2.1, the set of Pareto optimal policies coincides with the policies satisfying the constraint in Definition 2.2. We use a contradiction argument for this statement. Suppose not; then there exists a $\pi_\alpha$ satisfying such a constraint for some $\alpha$, but such that $\pi_\alpha \notin \Pi_o$. However, this means that
$$
\pi_\alpha \in \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi),
$$
which contradicts Lemma 2.1. Therefore, the expression finds the allocation with lowest unfairness within the Pareto optimal set, which completes the proof.

A.3 Proof of Theorem 4.1
Throughout the proof we refer to $\bar{C} < \infty$ as a universal constant. Recall Lemma A.5, where we chose a cover with $N(\varepsilon) = N$ and $\varepsilon \le 1/N$. First observe that we can write the expression as follows:
$$
\begin{aligned}
&\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \max_{\alpha_j \in \{\alpha_1, \dots, \alpha_N\}} \alpha_j \hat{W}_1(\pi) - (1-\alpha_j) \hat{W}_0(\pi) \Big| \\
&\quad \le \underbrace{\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \alpha \hat{W}_1(\pi) - (1-\alpha) \hat{W}_0(\pi) \Big|}_{(I)} \\
&\qquad + \underbrace{\sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha \hat{W}_1(\pi) + (1-\alpha) \hat{W}_0(\pi) - \max_{\alpha_j \in \{\alpha_1, \dots, \alpha_N\}} \alpha_j \hat{W}_1(\pi) - (1-\alpha_j) \hat{W}_0(\pi) \Big|}_{(II)}.
\end{aligned} \tag{85}
$$
$(I)$ is bounded as in Lemma A.3. $(II)$ is bounded as follows:
$$
(II) \le \varepsilon \sup_{\pi \in \Pi} | \hat{W}_1(\pi) | + \varepsilon \sup_{\pi \in \Pi} | \hat{W}_0(\pi) |. \tag{86}
$$
Under Assumption 4.4 and Assumptions 4.1 and 2.2, for $n \ge H$, the estimated conditional mean and propensity score are uniformly bounded. Therefore we obtain that
$$
\varepsilon \sup_{\pi \in \Pi} | \hat{W}_1(\pi) | + \varepsilon \sup_{\pi \in \Pi} | \hat{W}_0(\pi) | \le \bar{C} \varepsilon \frac{M}{\delta} \le \bar{C} \frac{M}{N\delta},
$$
which concludes the proof.

A.4 Proof of Theorem 4.2

Proof.
Throughout the proof we refer to $\bar{C} < \infty$ as a universal constant. Recall Lemma A.5, where we chose a cover with $N(\varepsilon) = N$ and $\varepsilon \le 1/N$. Let $G(\alpha)$ be as defined in Lemma A.5. Clearly, we have by definition of $\Pi_o$ that
$$
\sup_{\pi \in \Pi_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} = G(\alpha).
$$
Observe now that by Lemma A.5,
$$
\sup_{\alpha \in (0,1)} G(\alpha) - \max_{j \in \{1,\dots,N(\varepsilon)\}} G(\alpha_j) \le 4 M \varepsilon. \tag{87}
$$
Therefore, we only have to bound $\max_{j \in \{1,\dots,N(\varepsilon)\}} G(\alpha_j)$. Observe now that the following holds:
$$
\begin{aligned}
\max_{j \in \{1,\dots,N(\varepsilon)\}} G(\alpha_j) &= \max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \\
&= \underbrace{\max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\}}_{(I)} \\
&\quad + \underbrace{\max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\}}_{(II)}.
\end{aligned} \tag{88}
$$
The term $(I)$, by definition of $\Pi_{o,n}$ (which contains the empirical Pareto optimal policies) and a simple rearrangement, is bounded as follows:
$$
\max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \le 2 \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \alpha \hat{W}_1(\pi) - (1-\alpha) \hat{W}_0(\pi) \Big|, \tag{89}
$$
where the right-hand side is measurable by Assumption 4.1 (C). The above term is bounded by Lemma A.3. The term $(II)$ is instead bounded as follows:
$$
\begin{aligned}
(II) &= \max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} \\
&= \max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} - \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} \\
&\quad - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j W_1(\pi) + (1-\alpha_j) W_0(\pi) \Big\} + \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} \\
&\quad + \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} \\
&\le 2 \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \alpha \hat{W}_1(\pi) - (1-\alpha) \hat{W}_0(\pi) \Big| \\
&\quad + \underbrace{\max_{j \in \{1,\dots,N(\varepsilon)\}} \sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\}}_{(III)}.
\end{aligned} \tag{90}
$$
Observe that by definition of $\Pi_{o,n}$ and $\hat{\Pi}_o$, since $\alpha_j$ is in the cover, we have that
$$
\sup_{\pi \in \Pi_{o,n}} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha_j \hat{W}_1(\pi) + (1-\alpha_j) \hat{W}_0(\pi) \Big\} = 0.
$$
Therefore,
$$
(II) \le 2 \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_1(\pi) + (1-\alpha) W_0(\pi) - \alpha \hat{W}_1(\pi) - (1-\alpha) \hat{W}_0(\pi) \Big|.
$$
Combining these results with Lemma A.3 completes the proof.
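The discretization over the weight $\alpha$ used in Lemma A.5 and in the two proofs above can be checked numerically. The following is a minimal sketch, not the paper's estimator: it assumes a hypothetical finite policy class whose welfare pairs are drawn at random in $[-M, M]$, and all function and variable names are illustrative. It keeps only the policies that maximize the weighted welfare at some grid weight, and compares the best weighted welfare over this restricted set with the best over the full class at many off-grid weights. With $N$ equally spaced grid points, the gap should not exceed $4M/N$, the bound implied by a $4M\varepsilon$ cover with $\varepsilon \le 1/N$.

```python
import numpy as np


def scalarized(alpha, w1, w0):
    """Weighted welfare alpha * w1 + (1 - alpha) * w0.

    Returns an array with one row per weight and one column per policy.
    """
    alpha = np.atleast_1d(alpha)[:, None]
    return alpha * w1[None, :] + (1.0 - alpha) * w0[None, :]


def grid_frontier(w1, w0, n_grid):
    """Indices of policies that maximize the weighted welfare at some grid weight."""
    # Equally spaced midpoints in (0, 1): every alpha is within 1 / (2 * n_grid) of one.
    alphas = (np.arange(n_grid) + 0.5) / n_grid
    best = scalarized(alphas, w1, w0).argmax(axis=1)
    return np.unique(best)


def max_discretization_gap(w1, w0, n_grid, n_check=2001):
    """Largest loss, over off-grid weights, from restricting to grid-selected policies."""
    selected = grid_frontier(w1, w0, n_grid)
    check = np.linspace(0.0, 1.0, n_check + 2)[1:-1]  # interior weights only
    full = scalarized(check, w1, w0).max(axis=1)
    restricted = scalarized(check, w1[selected], w0[selected]).max(axis=1)
    return float((full - restricted).max())


# Hypothetical welfare pairs for 200 policies, bounded by M in absolute value.
rng = np.random.default_rng(0)
M = 1.0
w1 = rng.uniform(-M, M, size=200)
w0 = rng.uniform(-M, M, size=200)

N = 20
gap = max_discretization_gap(w1, w0, N)
print(f"max gap = {gap:.5f}, bound 4M/N = {4 * M / N:.5f}")
```

The midpoint grid is a design choice of this sketch; any equally spaced grid in $(0,1)$ with mesh at most $1/N$ gives the same bound.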
A.5 Proof of Theorem 4.4
Before discussing the formal proof, we introduce an additional auxiliary lemma, which builds on the previous theorem.
A.5.1 Auxiliary Lemmas

Lemma A.6.
Suppose that the conditions in Theorem 4.2 hold. Let $N = \sqrt{n}$. Then, for some $\tilde{N} > 0$, which only depends on $\delta, M$, and $n \ge \tilde{N}$,
$$
P\Big( \exists \mathcal{K} \subseteq \hat{\Pi}_o : \forall \alpha \in (0,1) \ \exists \pi_\alpha \in \Big\{ \mathcal{K} \cap \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} \Big) \ge 1 - 1/\sqrt{n}. \tag{91}
$$
Proof of Lemma A.6. The result is trivial if $\Pi$ is empty. Let therefore $\Pi$ be non-empty. Throughout the rest of the proof we consider a cover with $N = \sqrt{n}$ elements. The argument proceeds as follows. We choose $\tilde{N}$ such that for each $\alpha \in (0,1)$,
$$
\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \ge \alpha W_1(\pi) + (1-\alpha) W_0(\pi) + c \frac{M}{\delta} \sqrt{\frac{v \log(c' \tilde{N})}{\tilde{N}}}, \tag{92}
$$
$$
\forall \pi \in \Pi \setminus \Big\{ \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\}, \ \forall \alpha \in (0,1),
$$
where the constants are as in Theorem 4.2. Notice now that by Theorem 4.2, with probability at least $1 - 1/\sqrt{n}$, for all $\alpha \in (0,1)$,
$$
\sup_{\pi \in \Pi} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} - \sup_{\pi \in \hat{\Pi}_o} \Big\{ \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\} \le c \frac{M}{\delta} \sqrt{\frac{v \log(c' n)}{n}}. \tag{93}
$$
Therefore, for each $n$, there exists some $\hat{\pi}^* \in \hat{\Pi}_o$ such that
$$
\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \le \alpha W_1(\hat{\pi}^*) + (1-\alpha) W_0(\hat{\pi}^*) + c \frac{M}{\delta} \sqrt{\frac{v \log(c' n)}{n}}. \tag{94}
$$
Observe now that trivially $\hat{\pi}^* \in \Pi$. We now claim that for $n \ge \tilde{N}$,
$$
\hat{\pi}^* \in \Big\{ \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \Big\}.
$$
We show this by contradiction. Suppose that $\hat{\pi}^* \notin \{ \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi) \}$. Then it must be that Equation (92) holds. On the other hand, this contradicts Equation (93), since $n \ge \tilde{N}$. Since such a statement is true for any $\alpha \in (0,1)$, the proof is complete.

Lemma A.7.
Under the conditions in Theorem 4.2, and Assumption 4.3, for $n \ge \max\{\tilde{N}, H\}$, with $\tilde{N}$ as defined in Lemma A.6, the following holds:
$$
P\Big( \inf_{\pi \in \hat{\Pi}_o} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\} - \inf_{\pi \in \Pi_o} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\} \le 0 \Big) \ge 1 - 1/\sqrt{n}. \tag{95}
$$
Proof of Lemma A.7.
Observe first that the result is trivial if $\Pi_o \subseteq \hat{\Pi}_o$. Suppose instead that $\Pi_o \not\subseteq \hat{\Pi}_o$. Then by Lemma A.6, with probability $1 - 1/\sqrt{n}$, we can find a set $\mathcal{K} \subseteq \Pi_o$, which is defined as follows: for each $\alpha \in (0,1)$, there exists $\pi_\alpha \in \mathcal{K}$ such that
$$
\pi_\alpha \in \arg\sup_{\pi \in \Pi} \alpha W_1(\pi) + (1-\alpha) W_0(\pi).
$$
In addition, such a set satisfies the property that $\mathcal{K} \subseteq \hat{\Pi}_o$. Under Assumption 4.3, and the definition of $A_n$, we have that
$$
\inf_{\pi \in \Pi_o} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\} = \inf_{\pi \in \mathcal{K}} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\},
$$
since $A_n$ only depends on the image of $\pi$, and since under Assumption 4.3, for any $\pi \notin \mathcal{K}$, we can find a $\pi' \in \mathcal{K}$ such that $\pi$ and $\pi'$ have the same image at each value in the domain. Therefore, the left-hand side in Equation (95) simplifies to
$$
\inf_{\pi \in \hat{\Pi}_o} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\} - \inf_{\pi \in \mathcal{K}} \Big\{ A_n(0,1;\pi) + A_n(1,0;\pi) \Big\}.
$$
Since $\mathcal{K} \subseteq \hat{\Pi}_o$, this difference is non-positive, which completes the proof.

A.5.2 Proof of the Theorem
For notational convenience, we define
$$
\begin{aligned}
B(\Pi_o) &:= \inf_{\pi \in \Pi_o} A(0,1;\pi) + A(1,0;\pi), \\
B_n(\Pi_o) &:= \inf_{\pi \in \Pi_o} A_n(0,1;\pi) + A_n(1,0;\pi), \\
B(\hat{\Pi}_o) &:= \inf_{\pi \in \hat{\Pi}_o} A(0,1;\pi) + A(1,0;\pi), \\
B_n(\hat{\Pi}_o) &:= \inf_{\pi \in \tilde{\Pi}_{o,n}} A_n(0,1;\pi) + A_n(1,0;\pi), \\
\hat{B} &:= A(0,1;\hat{\pi}) + A(1,0;\hat{\pi}).
\end{aligned} \tag{96}
$$
First, we observe that the following equality holds:
$$
\hat{B} - B(\Pi_o) = \underbrace{B_n(\Pi_o) - B(\Pi_o)}_{(A)} + \underbrace{B_n(\hat{\Pi}_o) - B_n(\Pi_o)}_{(B)} + \underbrace{B(\hat{\Pi}_o) - B_n(\hat{\Pi}_o)}_{(C)} + \underbrace{\hat{B} - B(\hat{\Pi}_o)}_{(D)}. \tag{97}
$$
Consider $(A)$ and $(C)$ first. Since $\Pi_o, \hat{\Pi}_o \subseteq \Pi$, we have by the triangle inequality
$$
(A) \le 2 \sup_{\pi \in \Pi} \max_{s,s' \in \{0,1\}} \Big| A(s,s';\pi) - A_n(s,s';\pi) \Big|. \tag{98}
$$
Since also $\hat{\Pi}_o \subseteq \Pi$, the same inequality also holds for $(C)$. By Lemma A.4, with probability at least $1 - \gamma$, we have
$$
(A), (C) \le c \frac{M}{\delta} \sqrt{\frac{\log(1/\gamma)}{n}} \tag{99}
$$
for a constant $c < \infty$.

We now study $(B)$. By Lemma A.7, for $n \ge \tilde{N}$, with probability at least $1 - 1/\sqrt{n}$, we have $(B) \le 0$. Consider now $(D)$. With a simple rearrangement, we can write
$$
(D) \le 2 \sup_{\pi \in \hat{\Pi}_o,\, s,s' \in \{0,1\}} \Big| A(s,s';\pi) - A_n(s,s';\pi) \Big|. \tag{100}
$$
Since $\hat{\Pi}_o \subseteq \Pi$, we have
$$
2 \sup_{\pi \in \hat{\Pi}_o,\, s,s' \in \{0,1\}} \Big| A(s,s';\pi) - A_n(s,s';\pi) \Big| \le 2 \sup_{\pi \in \Pi,\, s,s' \in \{0,1\}} \Big| A(s,s';\pi) - A_n(s,s';\pi) \Big|. \tag{101}
$$
The above term is bounded as in Lemma A.4. Collecting our results, by setting γ =1 / √ nn